
US-12626798-B1 - Systems and methods for correlating medical images with medical reports

US12626798B1

Abstract

Described herein are methods, systems, and apparatus for correlating medical images with a medical report. An apparatus may obtain a plurality of medical images of a patient and a medical report for the patient, wherein the plurality of medical images may be associated with respective sequence labels and respective view labels, and wherein the medical report may include multiple diagnoses. The apparatus may determine and fuse respective image-type embeddings and image-content embeddings for the plurality of medical images into a combined image feature representation. The apparatus may further encode the features of the medical report into a text feature representation and calculate a similarity score based on the image feature representation and the text feature representation to indicate a correlation between the medical images and the medical report.

Inventors

  • Meiyun WANG
  • Dinggang Shen
  • Yaping Wu
  • Yan Bai
  • Wei Wei
  • Lingzhi HU
  • Tuoyu Cao
  • Jianmin Yuan

Assignees

  • SHANGHAI UNITED IMAGING INTELLIGENCE CO., LTD.
  • HENAN PROVINCIAL PEOPLE'S HOSPITAL

Dates

Publication Date
2026-05-12
Application Date
2025-12-16
Priority Date
2025-12-03

Claims (20)

  1. An apparatus, comprising: one or more processors configured to: obtain a plurality of magnetic resonance imaging (MRI) images of a patient and a medical report for the patient, wherein the plurality of MRI images is associated with respective sequence labels and respective view labels, and wherein the medical report includes multiple diagnoses; determine respective image-type embeddings for the plurality of MRI images based on the respective sequence labels and the respective view labels associated with the plurality of MRI images; encode, using a universal image encoder, respective features of the plurality of MRI images into respective image-content embeddings, wherein the universal image encoder is configured to encode two or more types of MRI images using a same set of parameters; fuse the respective image-type embeddings and the respective image-content embeddings of the plurality of MRI images into a first image feature representation for the plurality of MRI images; encode, using a text encoder, textual features of the medical report into a text feature representation for the medical report; calculate a first similarity score based on the first image feature representation for the plurality of MRI images and the text feature representation for the medical report; fuse the respective image-type embeddings and the respective image-content embeddings for a subset of the plurality of MRI images into a second image feature representation, wherein the subset of MRI images includes at least one fewer MRI image than the plurality of MRI images; calculate a second similarity score based on the second image feature representation for the subset of MRI images and the text feature representation for the medical report; determine, based on a difference between the first similarity score and the second similarity score, a contribution of the at least one MRI image missing from the subset of MRI images to the multiple diagnoses included in the medical report; and provide an indication of the determined contribution.
  2. The apparatus of claim 1, wherein the one or more processors being configured to fuse the respective image-type embeddings and the respective image-content embeddings of the plurality of MRI images into the first image feature representation comprises the one or more processors being configured to sum or average the respective image-type embedding and the respective image-content embedding of each of the plurality of MRI images into a respective combined image embedding for each of the plurality of MRI images.
  3. The apparatus of claim 2, wherein the one or more processors being configured to fuse the respective image-type embeddings and the respective image-content embeddings of the plurality of MRI images into the first image feature representation further comprises the one or more processors being configured to aggregate the respective combined image embeddings determined for each of the plurality of MRI images into the first image feature representation via average pooling.
  4. The apparatus of claim 1, wherein the plurality of MRI images comprises a first MRI image and a second MRI image, the sequence label of the first MRI image indicating that the first MRI image is a T1 image, the sequence label of the second MRI image indicating that the second MRI image is a T2 image.
  5. The apparatus of claim 4, wherein the view label of the first MRI image indicates that the first MRI image includes an axial view, and wherein the view label of the second MRI image indicates that the second MRI image includes a sagittal view.
  6. The apparatus of claim 1, wherein the medical report comprises respective textual descriptions for the plurality of MRI images, and wherein the multiple diagnoses in the medical report are made based on two or more of the plurality of MRI images.
  7. The apparatus of claim 1, wherein the first similarity score is calculated based on a cosine similarity between the first image feature representation and the text feature representation, and wherein the second similarity score is calculated based on a cosine similarity between the second image feature representation and the text feature representation.
  8. The apparatus of claim 1, wherein the indication comprises a visualization of the respective contributions determined for the plurality of MRI images together with the respective sequence labels and view labels of the plurality of MRI images.
  9. The apparatus of claim 1, wherein the universal image encoder is trained using a contrastive learning technique to match the first image feature representation and the text feature representation with respect to an anatomical abnormality of the patient.
  10. The apparatus of claim 1, wherein the text encoder is a bidirectional text encoder implemented using a transformer architecture.
  11. The apparatus of claim 1, wherein the universal image encoder is trained using multiple sets of training MRI images and multiple training medical reports associated with a training batch, each training medical report being associated with a corresponding set of training MRI images, and wherein, during the training of the universal image encoder: respective hash values are calculated for the multiple training medical reports in the training batch and used to eliminate duplicate medical reports in the training batch; and a similarity between each set of training MRI images and a corresponding training medical report is determined in a shared feature space based on an average of a first similarity score and a second similarity score calculated for the set of training MRI images and the corresponding training medical report.
  12. A method for sorting medical images, the method comprising: obtaining a plurality of magnetic resonance imaging (MRI) images of a patient and a medical report for the patient, wherein the plurality of MRI images is associated with respective sequence labels and respective view labels, and wherein the medical report includes multiple diagnoses; determining respective image-type embeddings for the plurality of MRI images using a look-up table and the respective sequence labels and the respective view labels associated with the plurality of MRI images; encoding, using a universal image encoder, respective features of the plurality of MRI images into respective image-content embeddings, wherein the universal image encoder is configured to encode two or more types of MRI images using a same set of parameters; fusing the respective image-type embeddings and the respective image-content embeddings into a first image feature representation for the plurality of MRI images; encoding, using a text encoder, text features of the medical report into a text feature representation; calculating a first similarity score based on the first image feature representation for the plurality of MRI images and the text feature representation for the medical report; fusing the respective image-type embeddings and the respective image-content embeddings for a subset of the plurality of MRI images into a second image feature representation, wherein the subset of MRI images includes at least one fewer MRI image than the plurality of MRI images; calculating a second similarity score based on the second image feature representation for the subset of MRI images and the text feature representation for the medical report; determining a contribution of the at least one MRI image missing from the subset of MRI images to the multiple diagnoses based on a difference between the first similarity score and the second similarity score; and providing an indication of the determined contribution.
  13. The method of claim 12, wherein fusing the respective image-type embeddings and the respective image-content embeddings into the first image feature representation comprises summing or averaging the respective image-type embedding and the respective image-content embedding of each of the plurality of MRI images to form a respective combined image embedding for each of the plurality of MRI images.
  14. The method of claim 13, wherein fusing the respective image-type embeddings and the respective image-content embeddings into the first image feature representation further comprises aggregating the respective combined image embeddings determined for each of the plurality of MRI images into the first image feature representation via average pooling.
  15. The method of claim 12, wherein the plurality of MRI images comprises a first MRI image and a second MRI image, the sequence label of the first MRI image indicating that the first MRI image is a T1 image, the sequence label of the second MRI image indicating that the second MRI image is a T2 image.
  16. The method of claim 15, wherein the view label of the first MRI image indicates that the first MRI image includes an axial view, and wherein the view label of the second MRI image indicates that the second MRI image includes a sagittal view.
  17. The method of claim 12, wherein the first similarity score is calculated based on a cosine similarity between the first image feature representation and the text feature representation, and wherein the second similarity score is calculated based on a cosine similarity between the second image feature representation and the text feature representation.
  18. The method of claim 12, wherein the indication comprises a visualization of the respective contributions determined for the plurality of MRI images together with the respective sequence labels and view labels of the plurality of MRI images.
  19. The method of claim 12, wherein the universal image encoder is trained using a contrastive learning technique to match the first image feature representation and the text feature representation with respect to an anatomical abnormality of the patient, and wherein the text encoder is a bidirectional text encoder implemented based on a transformer architecture.
  20. The method of claim 12, wherein the universal image encoder is trained using multiple sets of training MRI images and multiple training medical reports associated with a training batch, each training medical report being associated with a corresponding set of training MRI images, and wherein, during the training of the universal image encoder: respective hash values are calculated for the multiple training medical reports in the training batch and used to eliminate duplicate medical reports in the training batch; and a similarity between each set of training MRI images and a corresponding training medical report is determined in a shared feature space based on an average of a first similarity score and a second similarity score calculated for the set of training MRI images and the corresponding training medical report.
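The fusion and scoring steps recited in the claims (sum or average the per-image type and content embeddings, average-pool them into an image feature representation, compare it to the report's text feature representation via cosine similarity, and estimate each image's contribution as the drop in similarity when it is left out) can be illustrated with a short sketch. This is an illustrative reconstruction, not the patented implementation: the embedding dimension, the look-up table contents, and the encoder outputs are hypothetical stand-ins for what the trained encoders would produce.

```python
import numpy as np

rng = np.random.default_rng(42)
DIM = 8  # hypothetical embedding dimension

# Hypothetical look-up table keyed by (sequence label, view label);
# in the claims the image-type embeddings come from such a table,
# but the actual entries would be learned parameters.
TYPE_TABLE = {
    ("T1", "axial"): rng.standard_normal(DIM),
    ("T2", "sagittal"): rng.standard_normal(DIM),
}

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def fuse(type_embs, content_embs):
    """Sum each image's type and content embeddings, then aggregate
    the combined per-image embeddings via average pooling."""
    combined = [t + c for t, c in zip(type_embs, content_embs)]
    return np.mean(combined, axis=0)

def image_contribution(type_embs, content_embs, text_emb, leave_out):
    """Estimate one image's contribution to the report's diagnoses as
    the difference between the full-set similarity score and the score
    computed with that image left out."""
    s1 = cosine_similarity(fuse(type_embs, content_embs), text_emb)
    subset_t = [t for i, t in enumerate(type_embs) if i != leave_out]
    subset_c = [c for i, c in enumerate(content_embs) if i != leave_out]
    s2 = cosine_similarity(fuse(subset_t, subset_c), text_emb)
    return s1 - s2
```

Under this reading, a positive difference suggests the left-out image supported the match between the image set and the report, while a near-zero or negative difference suggests it contributed little to the diagnoses.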
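Claims 11 and 20 also describe using hash values to eliminate duplicate medical reports from a training batch. A minimal sketch of that step, assuming SHA-256 as the hash function (the claims do not name one):

```python
import hashlib

def deduplicate_reports(batch_reports):
    """Drop duplicate medical reports from a training batch by hashing
    each report's text; keeps the first occurrence of each report.
    SHA-256 is an assumption -- the claims only recite 'hash values'."""
    seen = set()
    unique = []
    for report in batch_reports:
        digest = hashlib.sha256(report.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(report)
    return unique
```

Removing exact duplicates this way avoids treating identical reports as distinct negatives when the batch's image-report similarities are computed in the shared feature space.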

Description

BACKGROUND

Radiological assessment of medical images such as magnetic resonance imaging (MRI) images typically requires a clinician to synthesize information across several imaging sequences (e.g., T1-weighted, T2-weighted, FLAIR, diffusion-weighted imaging (DWI), apparent diffusion coefficient (ADC), post-contrast T1, etc.) and across different views (e.g., axial/transverse, sagittal, coronal, etc.). As part of the efforts to automate this process, attempts have been made in recent years to use artificial intelligence (AI) technologies to learn the correlation between certain medical images and a corresponding medical report, in the hope of gaining insight into what features in the medical images may have led to a particular clinical diagnosis or medical decision. Some of the technologies that have been attempted rely on registering images from different sequences or views at the pixel level, and then aligning the registered images with texts from a medical report. In routine clinical practice, however, the set and quality of medical image sequences may vary by patient, scanner, and site protocol. For example, a DWI image may exhibit geometric distortion, and thick axial slices may obscure findings that would be more apparent in sagittal views. This heterogeneity, along with frequently missing sequences, makes registration-based approaches unreliable, computationally burdensome, and susceptible to artifacts that can degrade subsequent tasks. Other attempted approaches employ static neural network architectures and often require a fixed number and ordering of image sequences as inputs and a separate processing module for each input sequence, leading to excessive parameters, poor extensibility, and reduced robustness (e.g., when one or more sequences are missing). Correlating complex medical reports with specific image features presents additional challenges.
This is because clinical medical reports typically describe a medical examination as a whole, with interleaving statements related to findings from multiple sequences and/or views. On the other hand, methods that require splitting a medical report into sentence-level or sequence-level captioning of each individual image demand significant manual effort and can fracture the clinical context that often guides real-world decision-making. Accordingly, there is a need to develop techniques that can (i) accept dynamic multi-sequence, multi-view inputs with variable counts, (ii) avoid reliance on pixel-level image registration, and (iii) accurately align image features with specific diagnoses in a comprehensive medical report.

SUMMARY

Described herein are systems, methods, and apparatus for aligning medical images (e.g., MRI images) with medical reports based on dynamic multi-sequence, multi-view inputs. According to embodiments of the present disclosure, an apparatus may obtain a plurality of MRI images of a patient and a medical report for the patient, wherein the plurality of MRI images may be associated with respective sequence labels and respective view labels, and wherein the medical report may include multiple diagnoses. The apparatus may determine respective image-type embeddings for the plurality of MRI images (e.g., using a look-up table) based on the respective sequence labels and the respective view labels associated with the plurality of MRI images. The apparatus may further encode, using a universal image encoder, respective features of the plurality of MRI images into respective image-content embeddings, wherein the universal image encoder may be configured to encode two or more types of MRI images using a same set of parameters. The apparatus may fuse the respective image-type embeddings and the respective image-content embeddings associated with the plurality of MRI images into a first image feature representation for the plurality of MRI images.
The apparatus may further encode, using a text encoder, textual features of the medical report into a text feature representation. The apparatus may then calculate a first similarity score based on the first image feature representation derived for the plurality of MRI images and the text feature representation derived for the medical report, wherein the first similarity score may indicate a match between the first image feature representation and the text feature representation. Additionally, the apparatus may fuse the respective image-type embeddings and the respective image-content embeddings for a subset of the plurality of MRI images into a second image feature representation, wherein the subset of MRI images may include at least one fewer MRI image than the plurality of MRI images. The apparatus may calculate a second similarity score based on the second image feature representation derived for the subset of MRI images and the text feature representation derived for the medical report, wherein the second similarity score may indicate a match between the second image feature representation and the text feature representation.