KR-20260067786-A - Chest X-ray Visual Question Answering System Using Multi-modal Large Language Models
Abstract
The present invention relates to a chest X-ray visual question-answering system. According to an embodiment, a system for providing visual descriptions and question answering based on chest X-ray images comprises: a dataset generation module that generates question-answer (QA) pairs from open chest X-ray datasets containing both class and bounding-box labels; a vision encoder that extracts visual features, such as the location, size, and shape of a disease, from a chest X-ray image; a projector module that converts the extracted visual features into a form that can be combined with a text encoder; and a text encoder that understands a given question and combines it with the visual feature map to generate an answer regarding the disease.
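The abstract's pipeline (vision encoder, projector module, text encoder/LLM) can be sketched as follows. This is an illustrative sketch only: all dimensions, the linear form of the projector, and the token-concatenation scheme are assumptions for exposition, not details taken from the patent.

```python
import numpy as np

# Hypothetical sketch of the claimed pipeline: a vision encoder produces
# patch features, a projector maps them into the LLM's embedding space,
# and the projected visual tokens are prepended to the question's text
# embeddings. All sizes below are assumptions, not from the patent.

rng = np.random.default_rng(0)

NUM_PATCHES, VISION_DIM = 196, 768  # e.g. a CLIP ViT-B/16-like patch grid (assumed)
LLM_DIM = 1024                      # hypothetical LLM hidden size

def vision_encoder(image):
    """Stand-in for a pre-trained vision encoder: returns patch features."""
    return rng.standard_normal((NUM_PATCHES, VISION_DIM))

def projector(features, W):
    """Linear projection of visual features into the text embedding space."""
    return features @ W

def build_llm_input(image, question_embeddings, W):
    visual_tokens = projector(vision_encoder(image), W)
    # The LLM consumes visual tokens followed by the question tokens.
    return np.concatenate([visual_tokens, question_embeddings], axis=0)

W = rng.standard_normal((VISION_DIM, LLM_DIM)) * 0.01
question = rng.standard_normal((12, LLM_DIM))  # 12 question-token embeddings
llm_input = build_llm_input(None, question, W)
print(llm_input.shape)  # (208, 1024): 196 visual tokens + 12 text tokens
```

In practice the projector weights would be trained (e.g. during fine-tuning) so that the visual tokens are interpretable by the frozen or fine-tuned language model; the random matrices here only illustrate the data flow.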
Inventors
- 윤대명
Assignees
- 주식회사 뉴메스
Dates
- Publication Date
- 2026-05-13
- Application Date
- 2024-11-06
Claims (3)
- A system for providing visual explanations and question answering based on chest X-ray images, comprising: a dataset generation module that generates question-answer (QA) pairs from open chest X-ray datasets containing both class and bounding-box labels; a vision encoder that extracts visual features, such as the location, size, and shape of lesions, from chest X-ray images; a projector module that converts the extracted visual features into a form that can be combined with a text encoder; and a text encoder that understands a given question and combines it with the visual feature map to generate an answer regarding the symptoms.
- The system of claim 1, wherein the vision encoder is a pre-trained CLIP vision encoder configured to effectively extract visual information from chest X-ray images.
- The system of claim 1 or claim 2, further comprising an LLM that generates answers to questions, wherein the generated answers include information regarding the type, location, size, and severity of the disease.
Description
Chest X-ray Visual Question Answering System Using Multi-modal Large Language Models

The present invention relates to a chest X-ray visual question-answering system and, more specifically, discloses a technology that uses a multimodal large language model to answer open-ended questions requiring complex visual reasoning.

Existing chest X-ray interpretation systems have limitations in reliably generating detailed and specific answers to diverse medical questions. Although systems capable of simultaneously processing images and text have emerged through recent advances in large language models (LLMs) and large vision-language models (LVLMs), generating accurate and reliable free-form answers specialized for the medical domain remains a challenge. In particular, clinical settings frequently involve not only closed-ended (short-answer) questions but also open-ended questions requiring complex visual reasoning, and handling the latter is a critical factor in providing reliable answers to clinicians.

Unlike general VQA, medical visual question answering (Med-VQA) requires domain-specific knowledge. Owing to the complexity and expertise of medical imaging, Med-VQA systems demand in-depth medical understanding and reasoning capabilities that go beyond simple image recognition. Med-VQA is attracting attention as a key tool for reducing the workload of medical professionals and improving communication with patients. Its most critical task is to understand open-ended questions and generate reliable answers based on vast medical knowledge and image data.

However, research and development of Med-VQA face several limitations. The image-text pair data required for training is significantly scarcer in the medical domain than in the general domain.
While vast amounts of image-text pairs can be collected in the general domain through web-based aggregation, this approach is difficult in the medical domain due to the specificity and sensitivity of the data. Although models such as LLaVA-Med have recently been developed to address this issue, they too lack specificity for the medical domain, as they are trained on broad biomedical data crawled from the web.

Among the prior art, Korean Patent Publication No. 10-2024-0064989 discloses a medical decision-support device and method for generating an answer containing evidence in response to a user question about a medical image. However, most existing Med-VQA datasets consist of short-answer QA pairs, making them unsuitable for tasks that describe complex medical situations or require detailed reasoning. Consequently, while Med-VQA models may be suitable for classification problems, they are limited in generating flexible and detailed responses to open-ended questions.

FIG. 1 is a block diagram of a chest X-ray visual question-answering system using a multimodal large language model according to an embodiment of the present invention. FIG. 2 is a flowchart of a fine-tuning method in the system according to FIG. 1. FIGS. 3, 4, and 5 are example diagrams for explaining a dataset generation method in the system according to FIG. 1. FIG. 6 is a result table of quantitative evaluation in the system according to FIG. 1. FIG. 7 is a result table of qualitative evaluation in the system according to FIG. 1.

Hereinafter, embodiments of the present invention will be described in detail with reference to the attached drawings. The terms used are selected in consideration of their functions in the embodiments, and their meanings may vary depending on the intent of the user or operator, or on customary practice.
Therefore, the meaning of a term used in the embodiments described below follows the definition specifically given in this specification if such a definition exists, and is otherwise interpreted according to the meaning generally recognized by those skilled in the art.

Referring to FIGS. 1 to 7, a chest X-ray visual question-answering system (100) using a multimodal large language model according to an embodiment of the present invention includes a dataset generation module (110), a vision encoder (120), a projector module (130), and a text encoder (140). The dataset generation module (110) selects the NIH Chest X-ray dataset from among open CXR datasets to generate a new medical CXR VQA dataset. The NIH Chest
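The QA-generation role of the dataset generation module (110) described above can be sketched as follows. All label names, question templates, and the left/right region heuristic below are hypothetical illustrations, not details from the patent; a real pipeline would use the NIH label set and a more careful anatomical mapping.

```python
# Hypothetical sketch: turning one (class, bounding-box) annotation into
# open- and closed-ended QA pairs, in the spirit of the dataset generation
# module. Labels, templates, and geometry are illustrative assumptions.

def region_name(x_center_frac):
    # Crude left/right split of the film (image-left vs patient-left is
    # ignored here for simplicity of illustration).
    return "left lung field" if x_center_frac < 0.5 else "right lung field"

def make_qa(label, bbox, img_w=1024):
    """Build QA pairs from a class label and an (x, y, w, h) bounding box."""
    x, y, w, h = bbox
    region = region_name((x + w / 2) / img_w)
    return [
        (f"Is there evidence of {label} in this chest X-ray?", "Yes"),
        (f"Where is the {label} located?", f"In the {region}."),
        ("Describe the finding.",
         f"A {label} measuring roughly {w}x{h} pixels in the {region}."),
    ]

qa_pairs = make_qa("pneumothorax", (400, 500, 300, 250))
for q, a in qa_pairs:
    print(q, "->", a)
```

Templated QA of this kind yields both short-answer pairs (presence, location) and longer descriptive targets, which is consistent with the stated goal of supporting open-ended as well as closed-ended questions.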