CN-121996805-A - Medical image question-answering method and system based on external knowledge enhancement


Abstract

The invention discloses a medical image question-answering method and system based on external knowledge enhancement. The method comprises the following steps: S1, extracting features from an input medical image with a pre-trained medical image encoding model; S2, semantically encoding the input question with a text encoding model optimized for the medical field; S3, dynamically obtaining relevant knowledge vectors from an external medical knowledge base based on a heuristic stepwise retrieval mechanism; S4, generating a unified cross-modal joint vector based on a multi-modal interaction mechanism; and S5, decoding the cross-modal joint vector with a language decoder based on the Transformer architecture to generate a natural language answer for the input medical image and question. By introducing the medical knowledge base, the invention can perform comprehensive reasoning that draws on professional medical knowledge, thereby improving the accuracy and reliability of medical question answering.

Inventors

WANG XIJIN
JIN WENJUN
SONG JUN
TANG XIAOHUI
DING YONG
WANG GUOXIANG
TANG SHAN

Assignees

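Shanghai Tongji Hospital (上海市同济医院)
Shanghai Aizhaohu Medical Technology Co., Ltd. (上海爱照护医疗科技有限公司)
Jiangqiao Hospital, Jiading District, Shanghai (上海市嘉定区江桥医院)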

Dates

Publication Date: 20260508
Application Date: 20251219

Claims (7)

1. A medical image question-answering method based on external knowledge enhancement, characterized by comprising the following steps: Step S1, extracting features from an input medical image using a pre-trained medical image encoding model to obtain a multi-scale visual semantic representation; Step S2, semantically encoding the input question using a text encoding model optimized for the medical field to obtain a high-dimensional semantic representation of the question; Step S3, dynamically acquiring relevant knowledge vectors from an external medical knowledge base based on a heuristic step-by-step retrieval mechanism; Step S4, based on a multi-modal interaction mechanism, fusing the image features, the question features and the retrieved knowledge representation, and modeling the semantic relevance among modalities with an attention mechanism to generate a unified cross-modal joint vector; Step S5, decoding the cross-modal joint vector using a language decoder based on the Transformer architecture to generate a natural language answer for the input medical image and question.
2. The medical image question-answering method based on external knowledge enhancement according to claim 1, wherein step S1 includes the steps of: Step S101, inputting the medical image corresponding to the question to be answered into an image encoding module based on a convolutional neural network structure, wherein the medical image includes an X-ray image, a CT image, an MRI image or other medical image data; Step S102, extracting multi-level visual features of the medical image through the image encoding model, wherein the visual features comprise texture information, lesion regions and spatial structures; Step S103, applying global average pooling or flattening to the extracted features to obtain an image feature vector representation of fixed dimension, which is used as the input of the subsequent fusion module.
3. The medical image question-answering method based on external knowledge enhancement according to claim 1, wherein step S2 includes the steps of: Step S201, inputting the natural language question into a pre-trained text encoding model, wherein the model is a Transformer language model fine-tuned on a medical-domain corpus; Step S202, performing word segmentation, word embedding and encoding on the input question, and extracting a context-dependent representation of each word; Step S203, taking the vector at the special token position as the question feature vector, which serves as the high-dimensional feature representation of the question.
4. The medical image question-answering method based on external knowledge enhancement according to claim 1, wherein step S3 includes the steps of: Step S301, constructing an external knowledge base for the medical field, wherein the sources of knowledge content comprise a structured medical terminology system, disease diagnosis guidelines and imaging report templates; Step S302, encoding each knowledge item in the knowledge base, wherein a text encoding model that is the same as or compatible with the one used for question encoding converts each knowledge text into a knowledge vector of fixed length; Step S303, adopting a heuristic chain retrieval strategy, taking the question vector as the initial query, calculating the similarity between the question vector and all knowledge vectors, and selecting the most relevant knowledge item as the first retrieved item; Step S304, constructing a new query representation by combining the initial question vector with the retrieved knowledge item, continuing similarity matching against the other knowledge vectors in the knowledge base, and recursively executing the retrieval operation; Step S305, stopping retrieval when the similarity of the knowledge obtained in successive retrievals falls below a set threshold or the maximum number of iterations is reached; Step S306, providing all the retrieved knowledge vectors to the subsequent fusion module as relevant supplementary knowledge for the current question.
5. The medical image question-answering method based on external knowledge enhancement according to claim 1, wherein step S4 includes the steps of: Step S401, concatenating the image features and the question features obtained in step S1 and step S2 with all the retrieved knowledge vectors, and inputting the result into a fusion module; Step S402, modeling the multi-modal semantic associations among the image, the question and the knowledge with a cross self-attention mechanism in the fusion module; Step S403, normalizing and transforming the fused representation to form a unified cross-modal joint vector, which is used as the input of the subsequent natural language generation module.
6. The medical image question-answering method based on external knowledge enhancement according to claim 1, wherein step S5 includes the steps of: Step S501, feeding the cross-modal joint vector obtained in step S4 as input into a language decoder model based on the Transformer architecture; Step S502, outputting the natural language answer step by step in an autoregressive generation mode, wherein at each step the probability distribution of the current word is predicted and the word with the highest probability is selected as output, until a complete answer text is generated or a terminator is encountered; Step S503, outputting the final natural language answer as the answer result of the medical image question-answering system, wherein the answer can combine the medical image, the question semantics and the external knowledge for reasoning and comprehensive judgment.
7. A medical image question-answering system based on external knowledge enhancement, characterized by comprising an image processing module, a text processing module, a knowledge retrieval module, a fusion module and an output module, wherein: the image processing module extracts features from the input medical image using a pre-trained medical image encoding model to obtain a multi-scale visual semantic representation; the text processing module semantically encodes the input question using a text encoding model optimized for the medical field to obtain a high-dimensional semantic representation of the question; the knowledge retrieval module dynamically acquires relevant knowledge vectors from an external medical knowledge base based on a heuristic step-by-step retrieval mechanism; the fusion module fuses the image features, the question features and the retrieved knowledge representation based on a multi-modal interaction mechanism, models the semantic relevance among modalities with an attention mechanism, and generates a unified cross-modal joint vector; the output module decodes the cross-modal joint vector using a language decoder based on the Transformer architecture to generate a natural language answer for the input medical image and question.
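
Illustrative sketch (not part of the claims): the heuristic chain retrieval of steps S303 to S305 can be pictured with the short Python example below. The claims do not name a similarity measure or say how the question vector and a retrieved item are combined into a new query, so cosine similarity, element-wise averaging, the function names `cosine` and `chain_retrieve`, and the default `threshold` and `max_iters` values are all assumptions made only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chain_retrieve(question_vec, knowledge_vecs, threshold=0.5, max_iters=3):
    """Return indices of retrieved knowledge items, in retrieval order."""
    query = list(question_vec)  # S303: the question vector is the initial query
    selected = []
    for _ in range(max_iters):  # S305: cap on the number of iterations
        best_idx, best_sim = None, -1.0
        for i, k in enumerate(knowledge_vecs):
            if i in selected:  # only match against items not yet retrieved
                continue
            sim = cosine(query, k)
            if sim > best_sim:
                best_idx, best_sim = i, sim
        # S305: stop when the best remaining match falls below the threshold
        if best_idx is None or best_sim < threshold:
            break
        selected.append(best_idx)
        # S304 (assumed combination rule): average the initial question vector
        # with the item just retrieved to form the next query
        query = [(q + k) / 2 for q, k in zip(question_vec, knowledge_vecs[best_idx])]
    return selected


if __name__ == "__main__":
    # Toy 3-dimensional vectors standing in for encoded knowledge texts
    knowledge = [[1, 0, 0], [0.6, 0.8, 0], [0, 1, 0], [0, 0, 1]]
    question = [1, 0.1, 0]
    print(chain_retrieve(question, knowledge))
```

With these toy vectors, lowering `threshold` lets the chain reach an item (index 2) that is almost unrelated to the question alone but is related to the item retrieved before it, which is the behaviour the chained query is meant to provide.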

Description

Medical image question-answering method and system based on external knowledge enhancement
Technical Field
The invention relates to the field of medical information, and in particular to a medical image question-answering method and system based on external knowledge enhancement.
Background
With the continuous development of medical imaging technology and intelligent diagnosis-assistance systems, medical image data is growing explosively, and doctors' needs for efficient diagnosis and knowledge acquisition can no longer be met simply by reading images manually and searching medical references. Medical image question answering, an emerging mode of multi-modal intelligent interaction, combines complex medical images with natural language questions and outputs answers that conform to medical logic. It assists doctors in decision analysis and knowledge retrieval, and has broad application prospects in scenarios such as medical-image-aided diagnosis, medical education and remote consultation.
Early medical image question-answering methods were based primarily on traditional visual question-answering techniques. For example, some methods employ convolutional neural networks to extract medical image features, encode the natural language questions with convolutional neural networks, and then concatenate the two features and feed them into a classification model to generate answers. Such methods cannot effectively introduce medical knowledge, so they struggle to understand the professional medical semantics behind a question and to judge fine-grained lesions in the image accurately, and the generated answers therefore lack accuracy and interpretability.
In recent years, some studies have attempted to introduce external medical knowledge bases to enhance the reasoning capabilities of question-answering systems. For example, the question vector is matched against the vectors of the items in the knowledge base, the top-k knowledge items are selected, the image features and question features are concatenated and fused with these knowledge vectors, and the result is fed into a language model for answer generation. This static-matching style of knowledge enhancement improves system performance to a certain extent, but it has several key problems: (1) knowledge retrieval depends only on the question itself as the query, so potential associations between pieces of medical knowledge are ignored and important inference chains are easily missed; and (2) a dynamic knowledge screening strategy is lacking, so the introduced top-k knowledge contains redundant or irrelevant content that interferes with the accuracy of the answers the model generates. In addition, current methods cannot model and reason over relationships across knowledge items, which is a particular weakness for questions that require multi-step reasoning or the combination of multiple knowledge points.
Therefore, how to provide a medical image question-answering method and system with higher accuracy and practicability is an important problem to be solved in the current field of medical artificial intelligence.
Disclosure of Invention
The invention aims to provide a medical image question-answering method and system based on external knowledge enhancement, so as to solve the problems described in the Background section.
To achieve the above object, one aspect of the present invention provides a medical image question-answering method based on external knowledge enhancement, comprising the following steps: Step S1, extracting features from an input medical image using a pre-trained medical image encoding model to obtain a multi-scale visual semantic representation; Step S2, semantically encoding the input question using a text encoding model optimized for the medical field to obtain a high-dimensional semantic representation of the question; Step S3, dynamically acquiring relevant knowledge vectors from an external medical knowledge base based on a heuristic step-by-step retrieval mechanism; Step S4, based on a multi-modal interaction mechanism, fusing the image features, the question features and the retrieved knowledge representation, and modeling the semantic relevance among modalities with an attention mechanism to generate a unified cross-modal joint vector; Step S5, decoding the cross-modal joint vector using a language decoder based on the Transformer architecture to generate a natural language answer for the input medical image and question.
Further, step S1 includes the following steps: Step S101, inputting the medical image corresponding to the question to be answered into an image encoding module based on a convolutional neural network structure, wherein the medical image includes an X-ray image, a CT image, an MRI image or other medical image data; Step S102, extracting multi-level visual features of the medical image through the image encoding model, wherein the visual features comprise texture information, lesion regions and spatial structure