CN-121982724-A - Image processing method and device, electronic equipment and storage medium
Abstract
The invention discloses an image processing method, an image processing device, electronic equipment and a storage medium. The method comprises: obtaining a first medical image to be processed and a first prompt text, wherein the first prompt text is used for indicating a first image processing task to be executed; inputting the first medical image and the first prompt text into a feature extraction model to obtain a first image feature and a first text feature, wherein the feature extraction model at least comprises a visual encoder, a plurality of text encoders and a cross-language alignment module, different text encoders are used for processing prompt texts of different language types, and the cross-language alignment module is used for mapping the input text feature to a target semantic space; and executing a text generation task based on the first image feature and the first text feature by adopting a text generation model to obtain a first description text. According to this scheme, on the basis of visual understanding of the medical image, description texts in different languages can be generated, and accurate alignment of the texts in different languages with the image is achieved.
Inventors
- ZHANG XINGLIN
- LIU JIAQI
- MA BINGQI
- LIU YONGHUI
Assignees
- 上海影禾医脉智能科技有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20260408
Claims (10)
- 1. An image processing method, the method comprising: acquiring a first medical image to be processed and a first prompt text, wherein the first prompt text is used for indicating a first image processing task to be executed on the first medical image; inputting the first medical image and the first prompt text into a feature extraction model to obtain a first image feature and a first text feature, wherein the feature extraction model at least comprises a visual encoder, a plurality of text encoders and a cross-language alignment module, and different text encoders are used for processing prompt texts of different language types; and executing a text generation task based on the first image feature and the first text feature by adopting a text generation model to obtain a first description text, wherein the first description text comprises image description information, and the image description information is used for describing image content associated with the first image processing task in the first medical image.
- 2. The method of claim 1, wherein the inputting the first medical image and the first prompt text into a feature extraction model to obtain a first image feature and a first text feature comprises: inputting the first medical image into the visual encoder of the feature extraction model to obtain the first image feature; determining a target text encoder from the plurality of text encoders based on the language type of the first prompt text, and inputting the first prompt text into the target text encoder to obtain a second text feature; and mapping the second text feature to a target semantic space through the cross-language alignment module to obtain the first text feature.
- 3. The method of claim 1, wherein the feature extraction model is trained by: acquiring a plurality of sets of training data, wherein the training data comprises a second medical image, a second prompt text and a second description text, the second prompt text is used for indicating a second image processing task to be executed on the second medical image, and the second description text is used for describing image content associated with the second image processing task in the second medical image; inputting the training data into a first machine learning model to obtain a second image feature and third text features corresponding to multiple languages, wherein the first machine learning model comprises a visual encoder, a plurality of text encoders and a cross-language alignment module; and determining a feature extraction loss of the first machine learning model based on the second image features and the third text features corresponding to the plurality of sets of training data, and adjusting parameters of the visual encoder, the text encoders and the cross-language alignment module in the first machine learning model based on the feature extraction loss to obtain the feature extraction model.
- 4. The method according to claim 3, wherein the second image processing task comprises an image-text matching task, and wherein determining the feature extraction loss of the first machine learning model based on the second image features and the third text features corresponding to the plurality of sets of training data comprises: for at least some of the plurality of sets of training data, performing a dot product operation on the second image feature and the third text feature corresponding to the training data to obtain a dot product result corresponding to the training data, and determining an actual image-text matching result of the training data based on the dot product result, wherein the actual image-text matching result is used for indicating the degree of matching between the second image feature and the third text feature; and determining the feature extraction loss of the first machine learning model based on the actual image-text matching results and the expected image-text matching results of the plurality of sets of training data.
- 5. The method of claim 3, wherein the second image processing task comprises a cross-language retrieval task, and determining the feature extraction loss of the first machine learning model based on the second image features and the third text features corresponding to the plurality of sets of training data comprises: taking the third text feature as an anchor point, performing a text-to-image retrieval task based on each set of training data pairs, and determining a first positive sample and a preset number of first hard negative samples based on a plurality of candidate medical images obtained by the retrieval, wherein the similarity between the image feature corresponding to each first hard negative sample and the third text feature is greater than a first preset similarity threshold; determining a first triplet loss based on the third text feature, the first positive sample, and the preset number of first hard negative samples; taking the second image feature as an anchor point, performing an image-to-text retrieval task based on each set of training data pairs, and determining a second positive sample and a preset number of second hard negative samples based on a plurality of candidate description texts obtained by the retrieval, wherein the similarity between the text feature corresponding to each second hard negative sample and the second image feature is greater than a second preset similarity threshold; determining a second triplet loss based on the second image feature, the second positive sample, and the preset number of second hard negative samples; and determining a bi-directional triplet loss based on the first triplet loss and the second triplet loss, and taking the bi-directional triplet loss as the feature extraction loss of the first machine learning model.
- 6. The method according to claim 3, wherein the second image processing task comprises a semantic consistency constraint task, and determining the feature extraction loss of the first machine learning model based on the second image features and the third text features corresponding to the plurality of sets of training data comprises: calculating cosine similarities between the second image feature in each set of training data and the third text features corresponding to the multiple languages, and determining the feature extraction loss of the first machine learning model based on the cosine similarities.
- 7. The method of claim 3, wherein determining the feature extraction loss of the first machine learning model based on the second image features and the third text features corresponding to the plurality of sets of training data comprises: determining a task execution loss corresponding to each second image processing task based on the second image features and the third text features corresponding to the plurality of sets of training data; and performing weighted summation on the task execution losses corresponding to a plurality of second image processing tasks based on preset weight coefficients to obtain the feature extraction loss of the first machine learning model.
- 8. An image processing apparatus, characterized in that the apparatus comprises: a data acquisition module, configured to acquire a first medical image to be processed and a first prompt text, wherein the first prompt text is used for indicating a first image processing task to be executed on the first medical image; a feature extraction module, configured to input the first medical image and the first prompt text into a feature extraction model to obtain a first image feature and a first text feature, wherein the feature extraction model at least comprises a visual encoder, a plurality of text encoders and a cross-language alignment module, and the cross-language alignment module is used for mapping an input text feature to a target semantic space; and a text generation module, configured to execute a text generation task based on the first image feature and the first text feature by adopting a text generation model to obtain a first description text, wherein the first description text comprises image description information, and the image description information is used for describing image content associated with the first image processing task in the first medical image.
- 9. An electronic device, comprising: one or more processors; and a storage means for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the image processing method of any one of claims 1-7.
- 10. A storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform the image processing method of any one of claims 1-7.
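Outside the claims themselves, the dot-product image-text matching of claim 4 can be illustrated with a small, self-contained sketch. This is not the patented implementation: the batch-contrastive formulation, the softmax normalisation of the dot-product results, and all names below are assumptions chosen for illustration, on the reading that paired samples in a batch are the expected matches.

```python
import numpy as np

def itm_loss(image_feats, text_feats):
    """Illustrative image-text matching loss in the spirit of claim 4.

    image_feats, text_feats: (batch, dim) arrays of second image features
    and third text features for paired training data.
    """
    # Dot product between every image and every text in the batch.
    logits = image_feats @ text_feats.T                     # (batch, batch)
    # Softmax over texts turns the dot-product results into an
    # "actual image-text matching result" (degree of matching).
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    # The expected matching result pairs image i with text i, so the
    # loss is cross-entropy against the diagonal.
    idx = np.arange(image_feats.shape[0])
    return -np.log(probs[idx, idx] + 1e-12).mean()
```

With identity features, correctly paired data scores a lower loss than a shuffled pairing, which is the behaviour the claim's expected-versus-actual comparison relies on.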
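The bi-directional triplet loss of claim 5 can likewise be sketched. Here the hard-negative mining (retrieval plus similarity-threshold filtering) is assumed to have already produced the negative feature sets; the margin value and all function names are hypothetical, not taken from the patent.

```python
import numpy as np

def triplet_loss(anchor, positive, negatives, margin=0.2):
    """Hinge triplet loss: max(0, margin - sim(a, p) + sim(a, n)),
    averaged over the preset number of hard negatives."""
    pos_sim = anchor @ positive
    neg_sims = negatives @ anchor            # (num_negatives,)
    return np.maximum(0.0, margin - pos_sim + neg_sims).mean()

def bidirectional_triplet_loss(text_feat, image_feat,
                               hard_neg_images, hard_neg_texts, margin=0.2):
    # Text as anchor (text-to-image retrieval): positive is the paired
    # image; hard negatives are high-similarity non-matching images.
    t2i = triplet_loss(text_feat, image_feat, hard_neg_images, margin)
    # Image as anchor (image-to-text retrieval), symmetrically.
    i2t = triplet_loss(image_feat, text_feat, hard_neg_texts, margin)
    # The bi-directional loss combines both directions.
    return t2i + i2t
```

Orthogonal negatives yield zero loss, while negatives as similar as the positive incur the full margin in each direction.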
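The semantic consistency constraint of claim 6 and the weighted combination of claim 7 reduce to short formulas; the sketch below is one plausible reading (1 minus the mean cosine similarity as the consistency loss, and a plain weighted sum over task losses), with all names hypothetical.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def consistency_loss(image_feat, text_feats_per_lang):
    """Claim 6 (illustrative): pull the image feature towards the text
    features of the same sample in every language."""
    sims = [cosine(image_feat, t) for t in text_feats_per_lang]
    return 1.0 - float(np.mean(sims))

def total_loss(task_losses, weights):
    """Claim 7 (illustrative): weighted summation of the task execution
    losses using preset weight coefficients."""
    return sum(w * l for w, l in zip(weights, task_losses))
```

When the image feature already coincides with every language's text feature, the consistency term vanishes, so only misaligned languages contribute gradient.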
Description
Image processing method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer processing technologies, and in particular, to an image processing method, an image processing device, an electronic device, and a storage medium.
Background
Current vision-language models face a significant language-imbalance problem: existing image-text alignment methods are mainly trained on English data, which causes a significant degradation when the model processes non-English languages. Moreover, different languages express the same visual concept in markedly different ways, and different cultural backgrounds and language habits lead to different points of emphasis when describing the same visual concept, so accurate alignment of image and text semantics cannot be ensured by direct translation between languages. There is therefore an urgent need for an image-text alignment method that can enhance the visual understanding and text generation capabilities of low-resource languages.
Disclosure of Invention
The invention provides an image processing method, an image processing device, electronic equipment and a storage medium, which can generate description texts of different language types on the basis of visual understanding of medical images, and realize accurate alignment of the texts of different languages with the images.
In a first aspect, the present invention provides an image processing method, including: acquiring a first medical image to be processed and a first prompt text, wherein the first prompt text is used for indicating a first image processing task to be executed on the first medical image; inputting the first medical image and the first prompt text into a feature extraction model to obtain a first image feature and a first text feature, wherein the feature extraction model at least comprises a visual encoder, a plurality of text encoders and a cross-language alignment module, and different text encoders are used for processing prompt texts of different language types; and executing a text generation task based on the first image feature and the first text feature by adopting a text generation model to obtain a first description text, wherein the first description text comprises image description information, and the image description information is used for describing image content associated with the first image processing task in the first medical image. 
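The inference pipeline described above (visual encoder, language-routed text encoders, and a cross-language alignment module mapping into a shared target semantic space) can be sketched minimally as follows. The encoders here are stand-in random projections, not the patented models, and every class and parameter name is an assumption for illustration.

```python
import numpy as np

class FeatureExtractor:
    """Illustrative stand-in for the feature extraction model: one visual
    encoder, one text encoder per language type, and an alignment module
    mapping text features into a shared target semantic space."""

    def __init__(self, dim=8, languages=("en", "zh"), seed=0):
        rng = np.random.default_rng(seed)
        self.visual_encoder = rng.standard_normal((16, dim))
        # One text encoder per supported language type.
        self.text_encoders = {lang: rng.standard_normal((16, dim))
                              for lang in languages}
        # Cross-language alignment module (here a single linear map).
        self.align = rng.standard_normal((dim, dim))

    def __call__(self, image_vec, prompt_vec, lang):
        # First image feature from the visual encoder.
        image_feat = image_vec @ self.visual_encoder
        # Select the target text encoder by the prompt's language type.
        raw_text_feat = prompt_vec @ self.text_encoders[lang]
        # Map into the target semantic space to get the first text feature.
        text_feat = raw_text_feat @ self.align
        return image_feat, text_feat
```

The routing step mirrors claim 2: the language type picks the target text encoder, and only the alignment output is compared with the image feature downstream.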
In a second aspect, the present invention also provides an image processing apparatus, including: a data acquisition module, configured to acquire a first medical image to be processed and a first prompt text, wherein the first prompt text is used for indicating a first image processing task to be executed on the first medical image; a feature extraction module, configured to input the first medical image and the first prompt text into a feature extraction model to obtain a first image feature and a first text feature, wherein the feature extraction model at least comprises a visual encoder, a plurality of text encoders and a cross-language alignment module, and the cross-language alignment module is used for mapping an input text feature to a target semantic space; and a text generation module, configured to execute a text generation task based on the first image feature and the first text feature by adopting a text generation model to obtain a first description text, wherein the first description text comprises image description information, and the image description information is used for describing image content associated with the first image processing task in the first medical image. In a third aspect, an embodiment of the present invention further provides an electronic device, including: one or more processors; and a storage means for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the image processing method as provided in any embodiment of the present invention. In a fourth aspect, an embodiment of the invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform the image processing method as provided in any embodiment of the invention. 
The technical scheme of the embodiment of the invention comprises the steps of obtaining a first medical image and a first prompt text to be processed, wherein the first prompt text is used for indicating a first image processing task to be executed on the first medical image, inputting the first medical image and the first prompt text into a feature extraction model to obtain first image features and first text features, wherein the feature extraction model at least comprises a visual encoder, a plurality of text encoders and a cross-language alignment module, differen