CN-122025075-A - Fundus disease diagnosis method and system based on graph retrieval-augmented generation
Abstract
A fundus disease diagnosis method based on graph retrieval-augmented generation comprises the following steps: 1, constructing a multimodal fundus medical dataset and a dynamic medical knowledge graph; 2, constructing a multimodal alignment model comprising a visual encoding unit, a text encoding unit and a knowledge encoding unit that are parallel to one another, together with a multimodal projection unit; 3, constructing a multimodal knowledge-graph retrieval module that retrieves similar cases based on the input visual features and constructs a personalized knowledge subgraph; 4, inputting the training set into the multimodal alignment model and the retrieval module, performing knowledge-enhanced quadruple contrastive learning, validating with the validation set to obtain the final multimodal alignment model, and meanwhile instruction fine-tuning a large language model to obtain a diagnosis report generator; 5, inputting the test set into the final multimodal alignment model to obtain visual tokens and knowledge tokens, inputting these into the diagnosis report generator, and generating a structured diagnosis report.
Inventors
- GAO YANG
- SHI JIANZHENG
- WU LEI
- LI WENYUAN
- TANG XUYUAN
- WANG HAISHUAI
- HAN WEI
Assignees
- Zhejiang University (浙江大学)
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2025-12-22
Claims (10)
- 1. A fundus disease diagnosis method based on graph retrieval-augmented generation, characterized by comprising the following steps: Step 1, constructing a multimodal fundus medical dataset and a dynamic medical knowledge graph, wherein the multimodal fundus medical dataset comprises left-eye images, right-eye images and corresponding clinical diagnosis reports of a plurality of patients, and the dynamic medical knowledge graph comprises a static medical knowledge layer and a dynamic instance data layer; Step 2, constructing a multimodal alignment model, which comprises a visual encoding unit, a text encoding unit and a knowledge encoding unit that are parallel to one another, and a multimodal projection unit connected to the visual encoding unit and the knowledge encoding unit; the visual encoding unit extracts features of the left-eye and right-eye images separately and fuses them into a patient-level visual feature; the text encoding unit extracts a global semantic feature of the clinical diagnosis report and decouples it into hierarchical text features; the knowledge encoding unit encodes the personalized knowledge subgraph obtained by graph retrieval into a knowledge feature; and the multimodal projection unit projects the visual feature and the knowledge feature into the token space of a large language model; Step 3, constructing a multimodal knowledge-graph retrieval module configured to retrieve similar cases in the dynamic medical knowledge graph based on the input visual features and to associate related medical knowledge so as to construct a personalized knowledge subgraph; Step 4, inputting the training set into the multimodal alignment model and the multimodal knowledge-graph retrieval module, performing knowledge-enhanced quadruple contrastive learning to obtain a trained multimodal alignment model, validating it with the validation set to obtain the final multimodal alignment model, and meanwhile instruction fine-tuning a large language model to obtain a diagnosis report generator; Step 5, inputting the test set into the final multimodal alignment model, extracting visual features via the visual encoding unit, extracting knowledge features via the retrieval module and the knowledge encoding unit, obtaining visual tokens and knowledge tokens via the multimodal projection unit, and inputting them into the diagnosis report generator to generate a structured diagnosis report of fundus diseases for each patient in the test set.
- 2. The fundus disease diagnosis method based on graph retrieval-augmented generation according to claim 1, wherein step 1 comprises: Step 1.1, collecting left-eye images, right-eye images and corresponding clinical diagnosis reports of a plurality of patients to form original triplet data; preprocessing the left-eye and right-eye images by uniformly resizing them to a preset size, and normalizing the text of the clinical diagnosis reports by removing redundant characters and standardizing terminology; Step 1.2, constructing the dynamic medical knowledge graph, wherein the static medical knowledge layer is an ontology network built by extracting entities and relations from medical textbooks and guidelines, and the dynamic instance data layer is built by instantiating case data of the training set as image-instance nodes and establishing associations between these image-instance nodes and the concept nodes of the static medical knowledge layer; Step 1.3, building a high-dimensional vector index over all image-instance nodes of the dynamic instance data layer for subsequent instance retrieval, dividing the processed multimodal fundus medical dataset into a training set, a validation set and a test set according to a set proportion, and setting the batch size used during model training.
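The dataset division of step 1.3 can be sketched as follows; the split ratios and random seed are illustrative assumptions, since the claim only specifies "a set proportion".

```python
import numpy as np

def split_dataset(n_samples, ratios=(0.7, 0.1, 0.2), seed=0):
    """Split patient indices into train/validation/test sets (step 1.3).

    `ratios` and `seed` are illustrative choices, not values fixed by the claim.
    """
    assert abs(sum(ratios) - 1.0) < 1e-9
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)          # shuffle patient indices
    n_train = int(round(n_samples * ratios[0]))
    n_val = int(round(n_samples * ratios[1]))
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```

Splitting by patient (rather than by individual image) keeps a patient's left-eye and right-eye images in the same partition, which the paired-image design of the dataset requires.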
- 3. The fundus disease diagnosis method based on graph retrieval-augmented generation according to claim 1, wherein the multimodal alignment model constructed in step 2 has the following structure: the visual encoding unit connects, in sequence, a Vision Transformer backbone, a feature-concatenation layer and a feature-fusion layer; the Vision Transformer backbone processes the left-eye and right-eye images separately and outputs a left-eye feature vector v_L and a right-eye feature vector v_R; the feature-concatenation layer concatenates v_L and v_R; the feature-fusion layer, a multi-layer perceptron, maps the concatenated features to a patient-level image feature v_P; the text encoding unit connects, in sequence, a pre-trained language-model backbone and a feature-decoupling module; the backbone extracts a global semantic feature t of the clinical diagnosis report; the decoupling module comprises three parallel multi-layer perceptron branches, the first mapping t to a left-eye text feature t_L, the second mapping t to a right-eye text feature t_R, and the third mapping t to a patient-level text feature t_P; the knowledge encoding unit adopts a graph Transformer structure comprising several stacked graph self-attention layers and a readout layer, which aggregate the node information of the personalized knowledge subgraph and output a dense knowledge feature vector g; the multimodal projection unit comprises two independent linear projection layers: the first, connected to the output of the visual encoding unit, maps the patient-level image feature v_P to visual tokens, and the second, connected to the output of the knowledge encoding unit, maps the knowledge feature vector g to knowledge tokens.
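A minimal numpy sketch of the visual data flow of claim 3, with illustrative feature dimensions; the Vision Transformer backbone is replaced here by random placeholder vectors, so only the concatenate-fuse-project wiring is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, b1, w2, b2):
    """Two-layer perceptron with ReLU, standing in for the fusion layer."""
    return np.maximum(x @ w1 + b1, 0) @ w2 + b2

d_enc, d_feat, d_llm = 768, 512, 4096   # illustrative dimensions, not fixed by the claim

# Visual path: encode each eye, concatenate, fuse to a patient-level feature.
v_left  = rng.standard_normal(d_enc)    # stand-in for the ViT output on the left-eye image
v_right = rng.standard_normal(d_enc)    # stand-in for the ViT output on the right-eye image
fuse = (rng.standard_normal((2 * d_enc, d_feat)), np.zeros(d_feat),
        rng.standard_normal((d_feat, d_feat)), np.zeros(d_feat))
v_patient = mlp(np.concatenate([v_left, v_right]), *fuse)

# Projection unit: a linear layer maps the image feature into the LLM token space.
W_proj = rng.standard_normal((d_feat, d_llm))
visual_token = v_patient @ W_proj
```

The text-decoupling branches and the knowledge projection follow the same pattern: one MLP or linear map per output feature.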
- 4. The fundus disease diagnosis method based on graph retrieval-augmented generation according to claim 3, wherein constructing the multimodal alignment model further comprises: setting the Vision Transformer patch size and number of Transformer layers according to the resolution of the input images and the specification of the pre-trained model; setting the maximum sequence length of the pre-trained language model according to the length distribution of the clinical diagnosis reports; setting a unified feature dimension d for the above feature vectors; and setting the number of layers and hidden-layer nodes of the multi-layer perceptrons in the feature-decoupling module.
- 5. The fundus disease diagnosis method based on graph retrieval-augmented generation according to claim 1, wherein step 4 comprises: Step 4.1, inputting the training set into the multimodal alignment model in batches of the set batch size, extracting for each batch the patient-level image feature v_P, left-eye feature v_L and right-eye feature v_R with the visual encoding unit, and the patient-level text feature t_P, left-eye text feature t_L and right-eye text feature t_R with the text encoding unit; Step 4.2, using the multimodal knowledge-graph retrieval module with the patient-level image feature v_P as query vector to retrieve the Top-K similar historical case nodes, expanding from these nodes in the knowledge graph to obtain a personalized knowledge subgraph, and inputting the subgraph into the knowledge encoding unit to extract the knowledge feature vector g; Step 4.3, calculating the knowledge-enhanced quadruple contrastive loss L as the weighted sum of the left-eye contrastive loss L_L, the right-eye contrastive loss L_R, the patient-level contrastive loss L_P and the image-knowledge contrastive loss L_K, i.e. L = L_L + L_R + L_P + λ·L_K, wherein L_L, L_R and L_P are obtained by computing the InfoNCE loss between the image and text features at the corresponding level, L_K is obtained by computing the InfoNCE loss between the patient-level image feature v_P and the knowledge feature vector g, and λ is a balancing hyperparameter; Step 4.4, iteratively training the multimodal alignment model with stochastic gradient descent or the AdamW optimizer to minimize the loss L; during training, when the loss has not decreased for M consecutive iterations, the network is deemed converged, training stops, and a trained multimodal alignment model is obtained; Step 4.5, constructing an instruction fine-tuning dataset of samples consisting of visual tokens, knowledge tokens and the corresponding structured diagnostic report, and training a large language model with the parameter-efficient fine-tuning method LoRA so that it can generate text conforming to medical specifications from the input visual and knowledge tokens, yielding the final diagnosis report generator.
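The convergence rule of step 4.4 (stop when the loss has not decreased for M consecutive rounds) can be sketched as a simple patience check; the name `should_stop` and the list-based interface are illustrative.

```python
def should_stop(loss_history, patience):
    """Early-stopping rule of step 4.4.

    Returns True when the minimum loss over the last `patience` (M) rounds
    is no lower than the best loss seen before them, i.e. no improvement.
    """
    if len(loss_history) <= patience:
        return False                       # not enough rounds to judge yet
    best_before = min(loss_history[:-patience])
    return min(loss_history[-patience:]) >= best_before
```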
- 6. The fundus disease diagnosis method based on graph retrieval-augmented generation according to claim 5, wherein the retrieval process of step 4.2 specifically comprises: indexing the image-instance node features of the dynamic instance data layer with the HNSW indexing algorithm; taking the patient-level image feature v_P as the query vector, performing an approximate nearest-neighbour search in the index and returning the K image-instance nodes with the highest cosine similarity; and, taking these K image-instance nodes as starting points, performing a multi-hop traversal of the dynamic medical knowledge graph and collecting all entity nodes and relation edges along the paths to form the personalized knowledge subgraph.
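The retrieval step can be illustrated with an exact brute-force search; this is a stand-in for the HNSW approximate nearest-neighbour index named in the claim, which trades exactness for sub-linear query time on large instance layers.

```python
import numpy as np

def top_k_cosine(index_feats, query, k=5):
    """Exact Top-K cosine retrieval over image-instance node features.

    Brute-force stand-in for the HNSW approximate nearest-neighbour search
    of step 4.2; `k` corresponds to the Top-K similar historical cases.
    """
    a = index_feats / np.linalg.norm(index_feats, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = a @ q                      # cosine similarity to every indexed node
    order = np.argsort(-sims)[:k]     # indices of the k most similar nodes
    return order, sims[order]
```

The returned node indices are then used as starting points for the multi-hop graph traversal that assembles the personalized knowledge subgraph.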
- 7. The fundus disease diagnosis method based on graph retrieval-augmented generation according to claim 5, wherein in the loss calculation of step 4.3 the left-eye contrastive loss L_L is obtained by computing the bidirectional cosine similarity between the left-eye image feature matrix and the left-eye text feature matrix within a batch and evaluating the InfoNCE formula, and the image-knowledge contrastive loss L_K is obtained by computing the bidirectional cosine similarity between the patient-level image feature matrix and the knowledge feature matrix within a batch, scaling it with a temperature coefficient, and computing a symmetric cross-entropy loss.
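A minimal sketch of the symmetric InfoNCE loss described in claim 7; the temperature value 0.07 is an illustrative choice, not one fixed by the claim.

```python
import numpy as np

def symmetric_info_nce(x, y, tau=0.07):
    """Symmetric InfoNCE between two batch feature matrices (claim 7).

    Row i of x and row i of y are a positive pair; all other rows are
    negatives. Bidirectional cosine similarities are scaled by 1/tau and
    the cross-entropy over matching indices is averaged in both directions.
    """
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    logits = (x @ y.T) / tau                      # pairwise cosine similarities
    def ce(l):                                    # cross-entropy with target i -> i
        l = l - l.max(axis=1, keepdims=True)      # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))
    return 0.5 * (ce(logits) + ce(logits.T))
```

Each of the four loss terms (left eye, right eye, patient level, image-knowledge) is this function applied to the corresponding pair of feature matrices.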
- 8. The fundus disease diagnosis method based on graph retrieval-augmented generation according to claim 1, wherein the large language model of the diagnosis report generator adopts a Llama3 or Qwen series model, and the parameter-efficient fine-tuning method LoRA updates only low-rank decompositions of the attention-layer weight matrices of the large language model.
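The low-rank update of claim 8 can be written out directly: a frozen weight W is adapted as W + (α/r)·B·A, where only the small matrices A and B are trained. Dimensions and the scaling factor below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                        # illustrative: hidden size 64, LoRA rank 4

W = rng.standard_normal((d, d))     # frozen attention weight of the LLM
A = rng.standard_normal((r, d)) * 0.01
B = np.zeros((d, r))                # B starts at zero, so the update starts as a no-op
alpha = 8.0                         # LoRA scaling factor (illustrative)

W_adapted = W + (alpha / r) * (B @ A)   # only A and B are trained; W stays frozen
```

The adapter trains 2·d·r parameters instead of d², which is what makes the fine-tuning of step 4.5 parameter-efficient.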
- 9. A fundus disease diagnosis system based on graph retrieval-augmented generation for implementing the method according to any one of claims 1-8, characterized in that the system comprises: a data processing module for constructing the multimodal dataset and the dynamic knowledge graph and performing data division; a model construction module for constructing the multimodal alignment model comprising the visual encoding unit, the text encoding unit, the knowledge encoding unit and the projection unit; a retrieval module for retrieving similar cases based on image features and constructing knowledge subgraphs; a training module for training the multimodal alignment model with the quadruple contrastive learning strategy and instruction fine-tuning the diagnosis report generator; and a diagnosis generation module for receiving test data and generating a structured fundus disease diagnosis report via the trained model and the retrieval module.
- 10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1-8 when the program is executed.
Description
Fundus disease diagnosis method and system based on graph retrieval-augmented generation. Technical Field: The invention relates to the fields of artificial intelligence and medical image analysis, and in particular to an intelligent fundus disease diagnosis method and system based on dynamic knowledge graphs and retrieval-augmented generation. By constructing a dynamically growing multimodal fundus medical knowledge graph, the system deeply fuses fundus images, clinical text reports and structured medical knowledge, thereby improving the accuracy, interpretability and clinical practicability of fundus disease diagnosis. Background: Fundus disease is a major cause of blindness worldwide, and early diagnosis is of paramount importance. Existing diagnostic practice relies mainly on manual image reading by ophthalmologists, which is time-consuming and labour-intensive, subjective and inconsistent, and cannot meet the demands of large-scale screening. Deep-learning-based artificial intelligence has therefore been introduced, but it suffers from three problems: first, the decision process is an opaque "black box", leading to low clinical trust; second, diagnosis relies on single-modality image information alone, ignoring the valuable text information in clinical diagnosis reports; and third, generalization is insufficient for rare diseases and complex cases. Vision-language foundation models (e.g., CLIP, BLIP) significantly improve performance by aligning image and text features, for example with triple contrastive learning objectives at the left-eye, right-eye and patient levels.
However, such models have fundamental drawbacks: first, the alignment is limited to unstructured text reports, and the lack of explicit integration with structured medical knowledge deprives the reasoning of a medical-logic basis; second, the model's knowledge is static, fixed after pre-training, difficult to adapt to a continuously evolving medical knowledge system, and costly to update. Disclosure of Invention: Aiming at the problems of opaque decisions, lack of medical-logic reasoning, difficulty in updating knowledge and difficulty in processing complex multimodal data in the prior art, the invention provides a fundus disease diagnosis method and system based on graph retrieval-augmented generation. The method comprises the following steps. Step 1, constructing a multimodal fundus medical dataset and a dynamic medical knowledge graph. The multimodal fundus medical dataset comprises left-eye images, right-eye images and corresponding clinical diagnosis reports of a plurality of patients; the dynamic medical knowledge graph comprises a static medical knowledge layer and a dynamic instance data layer; the dataset is divided into a training set, a validation set and a test set according to a set proportion. Step 2, constructing a multimodal alignment model. The model comprises a visual encoding unit, a text encoding unit and a knowledge encoding unit that are parallel to one another, and a multimodal projection unit connected to the visual encoding unit and the knowledge encoding unit; the visual encoding unit extracts features of the left-eye and right-eye images separately and fuses them into a patient-level visual feature; the text encoding unit extracts the global semantic feature of the clinical diagnosis report and decouples it into hierarchical text features; the knowledge encoding unit encodes the personalized knowledge subgraph obtained by graph retrieval into a knowledge feature; and the multimodal projection unit projects the visual and knowledge features into the token space of a large language model. Step 3, constructing a multimodal knowledge-graph retrieval module, configured to retrieve similar cases in the dynamic medical knowledge graph based on the input visual features and to associate related medical knowledge so as to construct a personalized knowledge subgraph. Step 4, model training and fine-tuning. The training set is input into the multimodal alignment model and the multimodal knowledge-graph retrieval module; knowledge-enhanced quadruple contrastive learning is performed to obtain a trained multimodal alignment model, which is validated with the validation set to obtain the final multimodal alignment model; meanwhile, a large language model is instruction fine-tuned to obtain a diagnosis report generator.