CN-122023416-A - Medical image processing method and related equipment based on prompt learning


Abstract

The application provides a medical image processing method and related equipment based on prompt learning. A learnable attribute prompt and a learnable category prompt are designed to provide effective attribute knowledge guidance and category knowledge guidance while the medical image model is being trained. Through continuous learning and optimization, the prompts capture the semantic context of the attribute prompt and the category prompt under the specific task, which strengthens both kinds of guidance. After training is finished, the optimized attribute prompt and category prompt are fixed directly in the relevant modules of the model in the form of semantic vectors. In the inference stage, these modules call the semantic vectors directly to apply attribute knowledge guidance and category knowledge guidance to the medical image to be processed, so that lesion segmentation results and disease category prediction results with higher accuracy can be obtained when only the medical image to be processed is input. There is no need to wait for the medical report to be completed, so both efficiency and accuracy are achieved.

Inventors

WANG YUQING
CAO KAI
ZHANG DINGWEN
YANG XIANGYONG
ZHOU JIE
CAO ZHONGRU
YU QINGHUI
OUYANG HAIXIANG
ZHAO SHUAI

Assignees

上海阅微知几医疗科技有限责任公司
中国人民解放军海军军医大学

Dates

Publication Date: 20260512
Application Date: 20260414

Claims (10)

1. A medical image processing method based on prompt learning, the method comprising: acquiring a medical image processing model and a training data set, wherein the medical image processing model comprises a prompt generation module, an image encoder, a text encoder, an image decoder, a feature characterization enhancement module and a dense prediction module, and the training data set comprises a first medical image of multiple types of digestive tract diseases, a first medical report corresponding to the first medical image, and category reference information for the types of digestive tract diseases; processing the first medical report through the prompt generation module to obtain a learnable attribute prompt corresponding to a lesion in the first medical image, and processing the category reference information through the prompt generation module to obtain a learnable category prompt corresponding to each type of digestive tract disease; inputting the first medical image into the image encoder to obtain a first original visual feature map, inputting the learnable attribute prompt into the text encoder to obtain an original attribute prompt text vector set, inputting the learnable category prompt into the text encoder to obtain an original category semantic center vector corresponding to each type of digestive tract disease, and constructing an original category semantic weight matrix based on all the original category semantic center vectors; inputting the first original visual feature map into the feature characterization enhancement module to obtain a first original visual feature recognition result based on attribute dimensions, inputting the original attribute prompt text vector set into the feature characterization enhancement module, fusing the first original visual feature recognition result with the original attribute prompt text vector set, generating a first channel attention weight of the first original visual feature map according to the fusion result, and enhancing the first original visual feature map based on the first channel attention weight to obtain a first enhanced visual feature map; inputting the first enhanced visual feature map into the image decoder to obtain a first enhanced decoding feature map, inputting the first enhanced decoding feature map into the dense prediction module to obtain a first enhanced visual feature vector of each pixel point, inputting the original category semantic weight matrix into the dense prediction module, matching each first enhanced visual feature vector with each original category semantic center vector, and obtaining a lesion segmentation result and a disease category prediction result of the first medical image according to the matching result; training the medical image processing model based on a first loss function, which evaluates the accuracy of the first original visual feature recognition result, and learning and updating the original attribute prompt text vector set during training; training the medical image processing model based on a second loss function, which evaluates the accuracy of the lesion segmentation result and the disease category prediction result, and learning and updating the original category semantic weight matrix during training; when the weighted sum of the first loss function and the second loss function reaches its minimum, storing the updated target attribute prompt text vector set in the feature characterization enhancement module, and storing the updated target category semantic weight matrix in the dense prediction module, to obtain a trained medical image processing model; and inputting a medical image to be processed into the image encoder of the trained medical image processing model, and calling the target attribute prompt text vector set and the target category semantic weight matrix to obtain a target lesion segmentation result and a target disease category prediction result of the medical image to be processed.
2. The medical image processing method according to claim 1, further comprising, prior to the step of acquiring the medical image processing model and the training data set: acquiring a medical base model and a pre-training data set covering a plurality of types of digestive tract diseases, the medical base model comprising the image encoder and the text encoder, the pre-training data set comprising a plurality of image-text pairs, each of the image-text pairs comprising a second medical image and a corresponding second medical report; inputting the second medical image of each image-text pair into the image encoder to obtain a first visual feature vector, and simultaneously inputting the corresponding second medical report into the text encoder to obtain a first text feature vector; pre-training the medical base model based on a third loss function, which is computed from the similarity between the first visual feature vector and the first text feature vector, and iteratively updating the weights of the image encoder and the weights of the text encoder during pre-training until the pre-training is completed; and constructing the medical image processing model based on the image encoder and the text encoder after the pre-training is completed.
3. The medical image processing method according to claim 1, wherein the step of processing the first medical report through the prompt generation module to obtain a learnable attribute prompt corresponding to a lesion in the first medical image, and processing the category reference information through the prompt generation module to obtain a learnable category prompt corresponding to each type of digestive tract disease, comprises: extracting, through the prompt generation module, an attribute prompt word vector corresponding to the lesion in the first medical image from the first medical report, and concatenating the attribute prompt word vector with a first learnable prompt vector to obtain the learnable attribute prompt corresponding to the lesion in the first medical image; and generating, through the prompt generation module, a category prompt word vector corresponding to each type of digestive tract disease based on the category reference information, and concatenating each category prompt word vector with a second learnable prompt vector to obtain the learnable category prompt corresponding to each type of digestive tract disease.
4. The medical image processing method according to claim 1, wherein the step of inputting the first original visual feature map into the feature characterization enhancement module to obtain a first original visual feature recognition result based on attribute dimensions, inputting the original attribute prompt text vector set into the feature characterization enhancement module, fusing the first original visual feature recognition result with the original attribute prompt text vector set, generating a first channel attention weight of the first original visual feature map according to the fusion result, and enhancing the first original visual feature map based on the first channel attention weight to obtain a first enhanced visual feature map, comprises: inputting the first original visual feature map into a plurality of classifiers of the feature characterization enhancement module, each classifier corresponding to one attribute dimension, to obtain the logits (unnormalized log-probabilities) of the first original visual feature in each attribute dimension, and activating the logits through a softmax function to obtain a confidence probability set of the first original visual feature in each attribute dimension; inputting the original attribute prompt text vector set into the feature characterization enhancement module, weighting the corresponding attribute prompt text vectors in the original attribute prompt text vector set by the confidence probabilities in each attribute dimension, and obtaining a first fusion embedding vector from all the weighted results; inputting the first fusion embedding vector into a convolution module of the feature characterization enhancement module to generate the first channel attention weight of the first original visual feature map; and enhancing the first original visual feature map based on the first channel attention weight to obtain the first enhanced visual feature map.
5. The medical image processing method according to claim 1, wherein the step of inputting the first enhanced decoding feature map into the dense prediction module to obtain a first enhanced visual feature vector of each pixel point comprises: inputting the first enhanced decoding feature map into a feature adapter of the dense prediction module, and obtaining a task-specific feature map through a weighted residual connection; and acquiring the first enhanced visual feature vector of each pixel point from the task-specific feature map.
6. The medical image processing method according to claim 5, wherein the step of inputting the original category semantic weight matrix into the dense prediction module, matching each first enhanced visual feature vector with each original category semantic center vector, and obtaining a lesion segmentation result and a disease category prediction result of the first medical image according to the matching result comprises: inputting the original category semantic weight matrix into the dense prediction module, and calculating the similarity between each first enhanced visual feature vector and each original category semantic center vector to obtain a score map over the category dimension; activating the score map through a softmax function to obtain the probability that each pixel point belongs to each disease category; and generating a lesion segmentation mask from all pixel points corresponding to the disease category with the highest probability, and obtaining the disease category prediction result of the first medical image according to the disease category with the highest probability.
7. The medical image processing method according to claim 1, wherein the step of inputting the medical image to be processed into the image encoder of the trained medical image processing model and calling the target attribute prompt text vector set and the target category semantic weight matrix to obtain a target lesion segmentation result and a target disease category prediction result of the medical image to be processed comprises: inputting the medical image to be processed into the image encoder of the trained medical image processing model to obtain a second original visual feature map; inputting the second original visual feature map into the trained feature characterization enhancement module to obtain a second original visual feature recognition result based on attribute dimensions, calling the target attribute prompt text vector set, fusing the second original visual feature recognition result with the target attribute prompt text vector set, generating a second channel attention weight of the second original visual feature map according to the fusion result, and enhancing the second original visual feature map based on the second channel attention weight to obtain a second enhanced visual feature map; and inputting the second enhanced visual feature map into the image decoder to obtain a second enhanced decoding feature map, inputting the second enhanced decoding feature map into the dense prediction module to obtain a second enhanced visual feature vector of each pixel point, calling the target category semantic weight matrix, matching each second enhanced visual feature vector with each target category semantic center vector in the target category semantic weight matrix, and obtaining the target lesion segmentation result and the target disease category prediction result of the medical image to be processed according to the matching result.
8. A medical image processing apparatus based on prompt learning, the apparatus comprising: an acquisition module, configured to acquire a medical image processing model and a training data set, wherein the medical image processing model comprises a prompt generation module, an image encoder, a text encoder, an image decoder, a feature characterization enhancement module and a dense prediction module, and the training data set comprises a first medical image of a plurality of types of digestive tract diseases, a first medical report corresponding to the first medical image, and category reference information for the types of digestive tract diseases; a first obtaining module, configured to process the first medical report through the prompt generation module to obtain a learnable attribute prompt corresponding to a lesion in the first medical image, and process the category reference information through the prompt generation module to obtain a learnable category prompt corresponding to each type of digestive tract disease; a second obtaining module, configured to input the first medical image into the image encoder to obtain a first original visual feature map, input the learnable attribute prompt into the text encoder to obtain an original attribute prompt text vector set, input the learnable category prompt into the text encoder to obtain an original category semantic center vector corresponding to each type of digestive tract disease, and construct an original category semantic weight matrix based on all the original category semantic center vectors; a third obtaining module, configured to input the first original visual feature map into the feature characterization enhancement module to obtain a first original visual feature recognition result based on attribute dimensions, input the original attribute prompt text vector set into the feature characterization enhancement module, fuse the first original visual feature recognition result with the original attribute prompt text vector set, generate a first channel attention weight of the first original visual feature map according to the fusion result, and enhance the first original visual feature map based on the first channel attention weight to obtain a first enhanced visual feature map; a fourth obtaining module, configured to input the first enhanced visual feature map into the image decoder to obtain a first enhanced decoding feature map, input the first enhanced decoding feature map into the dense prediction module to obtain a first enhanced visual feature vector of each pixel point, input the original category semantic weight matrix into the dense prediction module, match each first enhanced visual feature vector with each original category semantic center vector, and obtain a lesion segmentation result and a disease category prediction result of the first medical image according to the matching result; a training module, configured to train the medical image processing model based on a first loss function, which evaluates the accuracy of the first original visual feature recognition result, learn and update the original attribute prompt text vector set during training, train the medical image processing model based on a second loss function, which evaluates the accuracy of the lesion segmentation result and the disease category prediction result, and learn and update the original category semantic weight matrix during training; a fifth obtaining module, configured to, when the weighted sum of the first loss function and the second loss function reaches its minimum, store the updated target attribute prompt text vector set in the feature characterization enhancement module and store the updated target category semantic weight matrix in the dense prediction module, to obtain a trained medical image processing model; and a sixth obtaining module, configured to input a medical image to be processed into the image encoder of the trained medical image processing model, and call the target attribute prompt text vector set and the target category semantic weight matrix to obtain a target lesion segmentation result and a target disease category prediction result of the medical image to be processed.
9. An electronic device comprising a memory storing an application and a processor for running the application in the memory to perform the steps of the medical image processing method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which is executed by a processor to implement the steps in the medical image processing method of any of claims 1 to 7.
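
Illustrative sketch (not part of the claims): the attribute-guided channel attention of claim 4 and the similarity-based dense prediction of claim 6 can be sketched in plain Python as follows. This is only a minimal sketch under stated assumptions, not the applicant's implementation. The global average pooling before the classifiers, the summation used to fuse the weighted text vectors, the linear projection with a sigmoid standing in for the convolution module, the cosine similarity with a temperature, the mean-probability rule for the image-level category, and all function names, shapes and toy values are hypothetical. The image decoder and the feature adapter of claim 5 are omitted, and the category center vectors are assumed to have the same dimension as the feature channels.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def channel_attention(fmap, classifiers, attr_vecs, proj):
    """Claim 4 sketch: fmap is H x W x C nested lists; returns C channel weights."""
    pixels = [p for row in fmap for p in row]
    channels = len(pixels[0])
    # Assumption: classifiers see a globally average-pooled feature vector.
    pooled = [sum(p[c] for p in pixels) / len(pixels) for c in range(channels)]
    fused = [0.0] * len(proj[0])
    for dim, weight_rows in classifiers.items():
        logits = [dot(w, pooled) for w in weight_rows]  # one logit per attribute value
        probs = softmax(logits)                         # confidence probabilities
        for prob, vec in zip(probs, attr_vecs[dim]):
            # Assumption: fusion is the sum of confidence-weighted text vectors.
            fused = [f + prob * v for f, v in zip(fused, vec)]
    # Assumption: a linear projection plus sigmoid stands in for the convolution module.
    return [1.0 / (1.0 + math.exp(-dot(row, fused))) for row in proj]

def enhance(fmap, weights):
    """Scale every channel of every pixel by its attention weight."""
    return [[[v * w for v, w in zip(p, weights)] for p in row] for row in fmap]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-8)

def dense_predict(fmap, class_centers, temperature=0.1):
    """Claim 6 sketch: returns (image-level category index, binary lesion mask)."""
    prob_map = [[softmax([cosine(p, c) / temperature for c in class_centers])
                 for p in row] for row in fmap]
    n_classes = len(class_centers)
    n_pixels = sum(len(row) for row in prob_map)
    # Assumption: the image-level category is the class with the highest mean probability.
    mean_prob = [sum(p[k] for row in prob_map for p in row) / n_pixels
                 for k in range(n_classes)]
    category = max(range(n_classes), key=lambda k: mean_prob[k])
    # The mask marks pixels whose most probable class is the predicted category.
    mask = [[1 if max(range(n_classes), key=lambda k: p[k]) == category else 0
             for p in row] for row in prob_map]
    return category, mask

# Toy data (hypothetical): a 2 x 2 feature map with 2 channels.
fmap = [[[1.0, 0.1], [0.9, 0.2]],
        [[0.1, 1.0], [1.0, 0.0]]]
classifiers = {"shape": [[1.0, 0.0], [0.0, 1.0]]}  # one attribute dimension, two values
attr_vecs = {"shape": [[1.0, 0.0], [0.0, 1.0]]}    # stored attribute prompt text vectors
proj = [[1.0, 0.0], [0.0, 1.0]]                    # C x D projection
class_centers = [[1.0, 0.0], [0.0, 1.0]]           # stored category semantic center vectors

weights = channel_attention(fmap, classifiers, attr_vecs, proj)
category, mask = dense_predict(enhance(fmap, weights), class_centers)
print(weights, category, mask)
```

With these toy values, three of the four pixels align with the first category center, so the sketch predicts category index 0 and masks those three pixels. Note that the sketch consumes only stored vectors (`attr_vecs`, `class_centers`), which mirrors the claimed inference stage in which no medical report is required.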

Description

Medical image processing method and related equipment based on prompt learning

Technical Field

The application relates to the technical field of medical image processing, in particular to a medical image processing method based on prompt learning and related equipment.

Background

Currently, computer-aided diagnosis methods for cancer based on single-disease medical image processing (e.g., CT, MRI) are widely used and perform well in hospitals and physical examination institutions. Computer-aided diagnosis technology is built mainly around deep learning models and has developed in two directions. The first is classification and detection based on convolutional neural networks (CNNs): a CNN model is trained on a large-scale annotated medical image data set (such as chest CT, white-light endoscopy and fundus images) to automatically classify lesions (such as cancerous/non-cancerous or adenoma/hyperplasia) or to detect targets (such as locating polyps and early cancer areas). The second is segmentation: segmentation networks such as U-Net and DeepLab delineate lesion areas in a medical image at the pixel level, and patient-level classification (such as high or low cancer risk) is performed based on the lesion-level segmentation result and whole-image features.
Although the technology has progressed significantly, a series of defects remain in practical clinical application and adoption. First, existing segmentation models depend on a large number of pixel-level labels, so training is costly and slow; moreover, different experts may label the same lesion inconsistently, and this noise limits the upper bound of model performance. Second, because the model learns from the annotated pixel information in the data set, it often attends only to the texture characteristics of the lesion area, so its performance on disease prediction is not ideal. To address these problems, multi-modal fusion models are currently used for lesion segmentation and disease prediction: a medical image and a medical report are both input to the model, which fuses image features and text features, learns the fused features, and performs the lesion segmentation task and the disease prediction task based on them. This approach does not need a large number of pixel-level labels, and the text in the medical report provides global semantic information, which improves the accuracy of disease prediction. However, in clinical practice the medical report is usually completed later than the medical image, so a multi-modal fusion model has to wait for the report before it can perform lesion segmentation and disease prediction. This greatly delays the practical use of the model and lowers the efficiency of these tasks. Therefore, how to obtain highly accurate lesion segmentation and disease prediction results when only the medical image is available is a technical problem that needs to be solved.
Disclosure of Invention

The embodiments of the application provide a medical image processing method and related equipment based on prompt learning, to alleviate the technical problem that current lesion segmentation and disease prediction struggle to achieve both accuracy and efficiency. To solve this technical problem, the embodiments of the application provide the following technical scheme: a medical image processing method based on prompt learning, which comprises the following steps: acquiring a medical image processing model and a training data set, wherein the medical image processing model comprises a prompt generation module, an image encoder, a text encoder, an image decoder, a feature characterization enhancement module and a dense prediction module, and the training data set comprises a first medical image of multiple types of digestive tract diseases, a first medical report corresponding to the first medical image, and category reference information for the types of digestive tract diseases; processing the first medical report through the prompt generation module to obtain a learnable attribute prompt corresponding to a lesion in the first medical image, and processing the category reference information through the prompt generation module to obtain a learnable category prompt corresponding to each type of digestive tract disease; inputting the first medical image into the image encoder to obtain a first original visual feature map, inputting the learnable attribute prompt in