CN-122025088-A - Three-dimensional brain CT medical visual question-answering method and system based on anatomical memory matrix

CN 122025088 A

Abstract

The invention discloses a three-dimensional brain CT medical visual question-answering method and system based on an anatomical memory matrix, aiming to solve the problem that existing medical visual question-answering models struggle to fully capture the spatial continuity of three-dimensional brain CT images and its correlation with anatomical structures. By explicitly modeling three-dimensional anatomical priors and their association with cross-slice semantics, the invention enhances the model's spatial understanding of brain structures and pathological features, and improves the accuracy and reliability of medical visual question answering in three-dimensional brain CT scenarios.

Inventors

  • JI JUNZHONG
  • SONG WENHONG
  • ZHANG XIAODAN

Assignees

  • Beijing University of Technology (北京工业大学)

Dates

Publication Date
2026-05-12
Application Date
2026-01-27

Claims (8)

  1. A three-dimensional brain CT medical visual question-answering method based on an anatomical memory matrix, characterized by comprising the following steps: S1, extracting visual features from a three-dimensional brain CT image; S2, initializing an anatomical feature memory matrix from anatomical entity information extracted from the medical report; S3, dynamically updating the anatomical feature memory matrix through a pathology memory update module according to the visual features and the medical visual question-answering supervision signals; S4, performing feature fusion through a multi-dimensional feature memory fusion module according to the visual features and the updated memory matrix to obtain three-dimensional fused features; and S5, generating a medical question-answering result from the three-dimensional fused features and the question text.
  2. The three-dimensional brain CT medical visual question-answering method based on an anatomical memory matrix according to claim 1, wherein in S1, the visual features are extracted by encoding the slice sequence of the three-dimensional brain CT image with a Vision Transformer visual encoder to obtain a high-dimensional visual feature representation.
  3. The three-dimensional brain CT medical visual question-answering method based on an anatomical memory matrix according to claim 1, wherein in S2, initializing the anatomical feature memory matrix comprises: extracting anatomical entities from the medical report and organizing them according to predefined anatomical levels; for each CT slice, encoding the extracted anatomical entities as feature vectors and padding them to a fixed number to form a planar feature matrix; and stacking the planar feature matrices of all slices along the slice dimension to form the initial anatomical feature memory matrix.
  4. The three-dimensional brain CT medical visual question-answering method based on an anatomical memory matrix according to claim 1, wherein in S3, dynamically updating the anatomical feature memory matrix comprises: generating a query vector from the visual features through a pooling operation; generating a key vector from the memory matrix through pooling and linear mapping; computing the attention relationship between the query vector and the key vector in a unified semantic space; and updating the parameters of the memory matrix by optimizing a binary cross-entropy loss function under the supervision signals of the medical visual question answering.
  5. The three-dimensional brain CT medical visual question-answering method based on an anatomical memory matrix according to claim 1, wherein in S4, performing feature fusion comprises: computing similarity weights between the visual features and all memory entity features in the memory matrix; weighting and fusing the memory entity features at the same spatial position according to the similarity weights to obtain a weighted feature for each position; and concatenating the weighted features of all positions along the slice dimension to obtain the three-dimensional fused features.
  6. The three-dimensional brain CT medical visual question-answering method based on an anatomical memory matrix according to claim 5, wherein the similarity weights are computed using cosine similarity between the image features and the memory entity features, normalized by a Softmax function.
  7. The three-dimensional brain CT medical visual question-answering method based on an anatomical memory matrix according to claim 1, wherein in S5, generating the medical question-answering result comprises: concatenating the three-dimensional fused features with the original visual features and mapping them through a projection layer; and feeding the mapped visual features together with the embedding vectors of the question text into a large language model, whose text decoder generates the corresponding medical question-answering result.
  8. A three-dimensional brain CT medical visual question-answering system based on an anatomical memory matrix for implementing the method of any one of claims 1-7, comprising: an extraction module for extracting visual features from the three-dimensional brain CT image; an initialization module for initializing the anatomical feature memory matrix from anatomical entity information extracted from the medical report; a pathology memory update module for dynamically updating the anatomical feature memory matrix according to the visual features and the medical visual question-answering supervision signals; a multi-dimensional feature memory fusion module for obtaining three-dimensional fused features by feature fusion of the visual features and the updated memory matrix; and a generation module for generating the medical question-answering result from the three-dimensional fused features and the question text.
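The similarity-weighted fusion described in claims 5 and 6 (cosine similarity, Softmax normalization, weighted fusion per position, concatenation along the slice dimension) can be sketched as follows. All tensor shapes, variable names, and the NumPy formulation are illustrative assumptions, not the patent's actual implementation:

```python
import numpy as np

def fuse_visual_and_memory(visual, memory, eps=1e-8):
    """visual: (S, P, D) per-slice visual features at P spatial positions;
    memory: (S, K, D) K memory-entity features per slice.
    Returns (S*P, D): position-wise fused features concatenated along slices."""
    # cosine similarity between every visual position and every memory entity
    vn = visual / (np.linalg.norm(visual, axis=-1, keepdims=True) + eps)
    mn = memory / (np.linalg.norm(memory, axis=-1, keepdims=True) + eps)
    sim = np.einsum('spd,skd->spk', vn, mn)          # (S, P, K)
    # Softmax normalization over the K entities at each position
    e = np.exp(sim - sim.max(axis=-1, keepdims=True))
    w = e / e.sum(axis=-1, keepdims=True)
    # weighted fusion of memory entities, then concat along the slice axis
    fused = np.einsum('spk,skd->spd', w, memory)     # (S, P, D)
    return fused.reshape(-1, fused.shape[-1])
```

The resulting (S*P, D) matrix corresponds to the three-dimensional fused features that claim 7 concatenates with the original visual features before projection.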

Description

Three-dimensional brain CT medical visual question-answering method and system based on anatomical memory matrix

Technical Field

The invention relates to the intersection of artificial intelligence and medical imaging, and in particular to a three-dimensional brain CT medical visual question-answering method and system based on an anatomical memory matrix.

Background

Brain CT plays an important role in the clinical diagnosis and treatment of neurological diseases. A brain CT image is composed of multiple continuous slices, comprehensively presents brain structural characteristics at the spatial level, and provides a key basis for the rapid screening and evaluation of critical conditions such as cerebral hemorrhage and cerebral infarction. Through systematic analysis of brain CT images, doctors can detect abnormal lesions in time and formulate corresponding treatment plans, which is of great significance for safeguarding patients' lives and improving prognosis. However, with the spread of medical imaging equipment and the continuous growth of clinical examination demand, the volume of brain CT data has grown exponentially, and radiologists must complete a large amount of image reading and report writing in limited time. Long-term, high-intensity manual reading not only markedly increases doctors' workload, but also reduces diagnostic efficiency and raises the risk of misdiagnosis and missed diagnosis. There is therefore an urgent practical need for automated methods that assist doctors in understanding brain CT images and provide intelligent analysis support.
Medical visual question answering (Med-VQA) is a multi-modal task fusing medical image understanding with natural language reasoning, and it offers a new technical path for relieving radiologists' workload and improving clinical diagnosis and treatment efficiency. In recent years, multi-modal large language models have shown strong capabilities in cross-modal semantic alignment and reasoning, driving rapid progress in Med-VQA research and enabling models to answer professional questions about medical image content. However, most existing Med-VQA research centers on two-dimensional medical images and is difficult to adapt fully to actual clinical requirements in a three-dimensional brain CT setting. Current mainstream Med-VQA methods mostly follow a two-dimensional image processing paradigm. For example, models such as Med-Flamingo, BiomedGPT, and LLaVA-Med achieve good results on planar medical image understanding and question-answering tasks, but their visual encoding and cross-modal fusion mechanisms are designed mainly for single or small numbers of two-dimensional images and are difficult to extend directly to three-dimensional brain CT volume data composed of many continuous slices. When two-dimensional models process three-dimensional images, they often ignore the spatial continuity and structural dependencies between slices, and they struggle to accurately model the depth distribution and volumetric characteristics of a lesion and its three-dimensional relationship to surrounding anatomical structures, limiting the model's understanding of complex brain pathology.
To overcome the limitations of the two-dimensional modeling paradigm, some studies introduce a three-dimensional visual encoder to strengthen the representation of volume data; methods such as M3D-LaMed, RadFM, and Med3DVLM achieve preliminary modeling of three-dimensional medical images through three-dimensional convolution or voxel-level feature extraction. However, such methods still fall short in cross-slice long-range dependency modeling and fine-grained anatomical semantic correlation. On the one hand, the high computational and parameter complexity of three-dimensional encoders limits the efficient integration of multi-slice information; on the other hand, most existing methods rely on implicit feature learning and lack explicit modeling of brain anatomy priors and their relation to pathological structures, so the models still struggle to form stable, interpretable spatial semantic understanding in complex three-dimensional brain CT scenarios, which affects the accuracy and clinical reference value of the question-answering results.

Disclosure of Invention

In view of these problems, the invention focuses on the three-dimensional brain CT medical visual question-answering task, conducts in-depth research at the model design level, and proposes a three-dimensional brain CT medical visual question-answering model, 3D-MemVQA, based on an anatomical memory matrix.
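The pathology memory update described in claim 4 (pooled query from the visual features, pooled and linearly mapped key from the memory matrix, attention in a shared space, binary cross-entropy supervision) can be sketched as follows. The shapes, the `w_key` mapping, and the per-slice sigmoid relevance score are illustrative assumptions for a minimal sketch, not the patent's implementation:

```python
import numpy as np

def bce(pred, target, eps=1e-8):
    """Binary cross-entropy, the supervision signal named in claim 4."""
    return -np.mean(target * np.log(pred + eps)
                    + (1 - target) * np.log(1 - pred + eps))

def memory_update_step(visual, memory, w_key, target):
    """visual: (S, P, D) visual features; memory: (S, K, D) memory matrix;
    w_key: (D, D) linear map for the key; target: (S,) binary labels.
    Returns per-slice attention scores and the BCE loss that would drive
    the gradient update of the memory matrix during training."""
    q = visual.mean(axis=1)                        # pooled query, (S, D)
    k = memory.mean(axis=1) @ w_key                # pooled + mapped key, (S, D)
    logits = (q * k).sum(axis=-1) / np.sqrt(q.shape[-1])  # scaled scores
    attn = 1.0 / (1.0 + np.exp(-logits))           # sigmoid relevance, (S,)
    return attn, bce(attn, target)
```

In a real training loop the loss would be back-propagated through this attention path to update the memory matrix parameters, as claim 4 specifies.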