
CN-122019707-A - Visual information retrieval question-answering system and method based on multimodal embedding and visual language model

CN 122019707 A

Abstract

The invention discloses a visual information retrieval question-answering system and method based on multimodal embedding and visual language models, relating to the technical field of information retrieval and natural language processing. The system comprises an offline index construction module and an online retrieval question-answering module. A multimodal embedding model maps the user's text question and the image data in the database into the same high-dimensional vector space, enabling efficient and accurate vector similarity search; the retrieved relevant images and the original question are then fed into a visual large language model, which directly generates an accurate answer. The method skips the conventional image-to-text conversion step, fully preserves visual context information, and improves the accuracy and efficiency of retrieval question-answering; it supports multimodal data processing and private deployment, and is applicable to professional fields such as finance, medicine, and business analysis.
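The offline/online split described above can be sketched roughly as follows. This is a toy illustration, not the patented model: `embed_image` and `embed_text` stand in for a real multimodal embedding model (e.g., a CLIP-style encoder) and here simply map a string description to a character-frequency vector so the example runs end to end.

```python
# Toy sketch of the two-stage pipeline from the abstract.
# embed_image / embed_text are hypothetical stand-ins for a real
# multimodal embedding model; both map a string to a small
# character-frequency vector so text and images share one vector space.

def _bag_of_chars(text):
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def embed_image(image_description):   # offline stage: image -> vector
    return _bag_of_chars(image_description)

def embed_text(question):             # online stage: question -> vector
    return _bag_of_chars(question)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

# Offline: traverse the image store and build the index (image id -> vector).
index = {name: embed_image(desc) for name, desc in [
    ("chart1.png", "bar chart of quarterly revenue"),
    ("flow2.png", "flowchart of the approval process"),
]}

# Online: vectorize the question and rank images by similarity.
query = embed_text("which quarter had the highest revenue")
ranked = sorted(index, key=lambda k: cosine(index[k], query), reverse=True)
print(ranked[0])  # → chart1.png
```

In a real deployment, the top-ranked images would then be passed together with the question to a vision-language model; the ranking step itself is what the offline index makes fast.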

Inventors

  • HE LONGJI

Assignees

  • 广东天耘科技有限公司 (Guangdong Tianyun Technology Co., Ltd.)

Dates

Publication Date
2026-05-12
Application Date
2026-01-12

Claims (10)

  1. A visual information retrieval question-answering method based on multimodal embedding and a visual language model, characterized in that the method comprises an offline index construction stage and an online retrieval question-answering stage. The offline index construction stage comprises an image traversal step, an image vectorization step, and an index library construction step; the online retrieval question-answering stage comprises a question input step, a question vectorization step, a vector similarity retrieval step, a multimodal prompt construction step, a model reasoning step, and an answer generation step. In the offline index construction stage, the image traversal step traverses all images in an image document database; the image vectorization step calls a multimodal embedding model to perform feature extraction and vector conversion on each traversed image, generating a high-dimensional image vector that represents the core content of the image; and the index library construction step stores all high-dimensional image vectors in a vector database to build a vector index library. In the online retrieval question-answering stage, the question input step allows a user to input a text question through the system's interactive interface; the question vectorization step calls the same multimodal embedding model as the offline index construction stage to convert the text question into a high-dimensional query vector; the vector similarity retrieval step performs high-speed similarity calculation in the pre-built vector index library based on the high-dimensional query vector, computing the similarity value between the query vector and each image vector, sorting the similarity values from high to low, and selecting the top N most similar images as target images; the multimodal prompt construction step combines the user's original text question with the N retrieved target images to generate a multimodal input prompt; the model reasoning step passes the multimodal input prompt to a visual large language model, which analyzes the intent of the text question while performing deep, comprehensive reasoning over the target images; and the answer generation step has the visual large language model generate a final answer in natural-language form based on the comprehensive reasoning result.
  2. The method of claim 1, wherein the multimodal embedding model has cross-modal feature mapping capability, ensuring that the image vectors accurately reflect the visual information of the images.
  3. The method of claim 1, wherein in the index library construction step, the corresponding image vectors and the vector index library are updated synchronously when image data in the image document database is added, deleted, or modified.
  4. The method of claim 1, wherein the text question comprises a data query, a logical judgment, or a relationship identification concerning content in an image.
  5. The method of claim 1, wherein in the question vectorization step, the query vector lies in the same high-dimensional vector space as the image vectors.
  6. The method of claim 1, wherein in the vector similarity retrieval step, N is a positive integer between 1 and 10, set according to actual requirements.
  7. The method of claim 1, wherein in the model reasoning step, the visual large language model performs deep understanding of the chart data, layout structure, and logical relations in the target images: it identifies the row and column meanings of tables in an image, extracts specific numerical values from histograms, and organizes the logical links in flowcharts.
  8. The method of claim 1, wherein the multimodal embedding model comprises an image encoder responsible for image vectorization and a text encoder responsible for text question vectorization, the multimodal embedding model projecting the outputs of the image encoder and the text encoder into a unified-dimension feature space through contrastive learning.
  9. A visual information retrieval question-answering system based on multimodal embedding and a visual language model, configured to perform the method according to any one of claims 1-8.
  10. The system of claim 9, wherein the visual information retrieval question-answering system is deployed in a private cloud environment or a local machine room, and the image vectorization and model reasoning processes are completed entirely within a physically isolated internal network.

Description

Visual information retrieval question-answering system and method based on multimodal embedding and a visual language model

Technical Field

The invention relates to the technical field of information retrieval and natural language processing, and in particular to a visual information retrieval question-answering system and method based on multimodal embedding and visual language models.

Background

In scenarios such as project management, business analysis, financial services, and medical diagnostics, a large amount of valuable information is encapsulated in visual elements: images, charts, tables, and complex documents (e.g., PDF reports and PPT presentations). This unstructured visual information carries key data, logical relations, and decision bases, and its efficient, accurate retrieval and interpretation are critical to business operations. Conventional systems typically rely on optical character recognition (OCR) or complex image-to-text/Markdown conversion to process such unstructured visual information. However, these methods have several problems. First, information loss is severe: OCR struggles to accurately recognize tables, charts, and handwriting with complex formats, and the conversion process discards rich visual context such as image layout, colors, arrow directions, and element associations, so downstream processing never receives the complete information. For example, the relative heights of the columns in a histogram, or the logical direction of the arrows in a flowchart, cannot be accurately recovered.
Second, the prior art's processing pipeline is complicated and inefficient. Preprocessing, converting, and cleaning massive numbers of images is a computationally intensive, time-consuming process that demands large amounts of computing resources and makes real-time retrieval and response difficult; whenever the image data changes, the entire pipeline must be rerun, which is labor-intensive and inefficient. Third, retrieval and question-answering accuracy is low: information loss and errors in the front-end conversion step deprive the downstream language model of a complete, accurate context, so the final retrieval results deviate from the user's needs and the answers are wrong or suboptimal, falling short of the accuracy demanded by professional fields such as finance, medicine, and consulting. The industry therefore urgently needs a technical scheme that can directly understand and process complex image content, avoiding the information distortion of the conversion process, so as to achieve more efficient and accurate multimodal information retrieval and question-answering.

Disclosure of Invention

In view of the defects of the prior art summarized above, the invention provides a visual information retrieval question-answering system and method based on multimodal embedding and a visual language model, aiming to solve the problems of information loss during image conversion, low processing efficiency, and insufficient retrieval and question-answering accuracy in conventional techniques, and to achieve efficient and accurate retrieval of, and question-answering over, visual information.
The invention provides a visual information retrieval question-answering method based on multimodal embedding and a visual language model, comprising an offline index construction stage and an online retrieval question-answering stage. The offline index construction stage comprises an image traversal step, an image vectorization step, and an index library construction step; the online retrieval question-answering stage comprises a question input step, a question vectorization step, a vector similarity retrieval step, a multimodal prompt construction step, a model reasoning step, and an answer generation step. The image vectorization step calls a multimodal embedding model to perform feature extraction and vector conversion on each traversed image, generating a high-dimensional image vector that represents the core content of the image; the index library construction step stores all high-dimensional image vectors in a vector database to build a vector index library. In the online retrieval question-answering stage, the question input step allows a user to input a text question through the system's interactive interface; the question vectorization step calls the same multimodal embedding model as the offline index construction stage to convert the text question into a high-dimensional query vector; the vector similarity retrieval step performs high-speed similarity calculation over the pre-built vector index library based on the high-dimensional query vector to calculate similarity values of the query
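The patent does not fix the similarity metric used in the vector similarity retrieval step; cosine similarity is the usual choice for contrastively trained embeddings (cf. claim 8) and is assumed in this short worked example.

```python
# Cosine similarity, the assumed metric for the retrieval step:
# cos(u, v) = (u . v) / (|u| * |v|), in [-1, 1]; higher means more similar.

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

# Worked example: vectors at a 45-degree angle score 1/sqrt(2).
print(round(cosine_similarity([1.0, 0.0], [1.0, 1.0]), 4))  # → 0.7071
```

Sorting the images by this value from high to low and keeping the first N yields the target images described above.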