CN-122019584-A - Multi-modal retrieval method, device, equipment and storage medium
Abstract
The invention provides a multi-modal retrieval method, apparatus, device, and storage medium. The method comprises: determining a parsing strategy according to rendering characteristics of a document to be processed; parsing the document to be processed based on the parsing strategy and extracting the image content in the document; generating image description text corresponding to the image content; fusing the image content and the image description text back at the original logical position of the image in the text stream corresponding to the text content, by way of in-situ semantic injection, to form an image-text-fused intermediate representation; dividing the intermediate representation into blocks based on a semantic integrity constraint; constructing a multi-modal knowledge base based on each text block and its image association metadata; performing vector retrieval in the multi-modal knowledge base based on a user query, recalling text vector blocks containing the relevant image-text semantics, and generating a target answer fusing image references. The invention effectively addresses the problem that images in knowledge-base documents cannot be accurately parsed and recalled during multi-modal retrieval.
Inventors
- ZHAO YUNTAO
- HUANG DAN
- LUO ZIHAN
- HUANG CHUAN
- REN SIYU
Assignees
- 吉旗(成都)科技有限公司 (Jiqi (Chengdu) Technology Co., Ltd.)
Dates
- Publication Date: 2026-05-12
- Application Date: 2025-12-11
Claims (10)
- 1. A multi-modal retrieval method, comprising: determining a parsing strategy adapted to a document to be processed according to rendering characteristics of the document to be processed; parsing the document to be processed based on the adapted parsing strategy, extracting text content and image content in the document to be processed, performing semantic understanding on the image content, and generating an image description text corresponding to the image content; fusing the image content and the image description text back at the original logical position of the image content in the text stream corresponding to the text content, by way of in-situ semantic injection, to form an image-text-fused intermediate representation; dividing the image-text-fused intermediate representation into blocks based on a semantic integrity constraint, and constructing a multi-modal knowledge base based on each text block and the image association metadata of each text block; and performing vector retrieval in the multi-modal knowledge base based on relevant image-text semantics in a user query, recalling text vector blocks containing the relevant image-text semantics, and generating a target answer fusing image references based on the text vector blocks containing the relevant image-text semantics.
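The method steps of claim 1 can be sketched as code. This is a minimal, hypothetical illustration: the function names, document fields, and placeholder format are all assumptions for illustration, not the patent's actual implementation.

```python
# Hypothetical sketch of the claimed pipeline (claim 1).
# All names and data shapes here are illustrative assumptions.

def choose_parsing_strategy(doc):
    # Layout (fixed-rendering) documents such as PDF get the rasterized
    # visual parsing strategy; streaming documents such as HTML/Word get
    # the DOM object-extraction strategy.
    return "rasterized" if doc["kind"] == "pdf" else "dom"

def build_intermediate_representation(text_stream, images):
    # In-situ semantic injection: each image's description text is inserted
    # back at the image's original logical position in the text stream.
    out = []
    for segment in text_stream:
        out.append(segment)
        for img in images:
            if img["anchor"] == segment["id"]:
                out.append({"id": f'{segment["id"]}-img',
                            "text": f'[IMAGE: {img["caption"]} | {img["url"]}]'})
    return out

stream = [{"id": "p1", "text": "Revenue grew steadily."}]
images = [{"anchor": "p1", "caption": "quarterly revenue chart",
           "url": "s3://kb/fig1.png"}]
rep = build_intermediate_representation(stream, images)
```

The later claims refine each of these steps; this sketch only shows how they compose.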
- 2. The multi-modal retrieval method according to claim 1, wherein parsing the document to be processed based on the adapted parsing strategy, extracting text content and image content in the document to be processed, performing semantic understanding on the image content, and generating an image description text corresponding to the image content includes: when the adapted parsing strategy is a rasterized visual parsing strategy for a layout document, rendering each page of the layout document into a full-page image using a rendering engine; for any page of the layout document, performing visual element detection on the full-page image, and identifying and locating image regions in the full-page image; inputting the full-page image and the coordinates of each image region into a multi-modal understanding model to obtain a deep semantic description of the image region; and determining the image description text corresponding to the image content based on the deep semantic descriptions of the image regions corresponding to the respective pages.
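A sketch of the rasterized visual parsing strategy of claim 2 follows. The detector and the "understanding model" here are toy stand-ins: in practice these would be, for example, a layout-detection model and a vision-language model, and the page/figure data structures are assumptions for illustration.

```python
# Sketch of the rasterized visual parsing strategy (claim 2).
# Detector and describer are hypothetical stand-ins.

def detect_image_regions(page_image):
    # Stand-in visual element detection: returns bounding boxes
    # (x0, y0, x1, y1) of figure regions on the rendered full-page image.
    return page_image.get("figures", [])

def describe_region(page_image, bbox):
    # Stand-in for the multi-modal understanding model: the full-page image
    # and the region coordinates are passed together, so the model can use
    # surrounding page context when producing the deep semantic description.
    return f"figure at {bbox} on page {page_image['page']}"

def parse_layout_document(pages):
    # Collect per-page deep semantic descriptions of every detected region.
    descriptions = {}
    for page_image in pages:
        for bbox in detect_image_regions(page_image):
            descriptions.setdefault(page_image["page"], []).append(
                describe_region(page_image, bbox))
    return descriptions

pages = [{"page": 1, "figures": [(10, 20, 200, 180)]},
         {"page": 2, "figures": []}]
result = parse_layout_document(pages)
```

In a real system the rendering step would come from a PDF rendering engine; it is omitted here since the claim only fixes the data flow, not the engine.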
- 3. The multi-modal retrieval method according to claim 2, wherein fusing the image content and the image description text back at the original logical position of the image content in the text stream corresponding to the text content by in-situ semantic injection, to form an image-text-fused intermediate representation, includes: packaging, in association, the image placeholders of the pages, the deep semantic descriptions of the image regions corresponding to the pages, and the storage addresses of the image regions corresponding to the pages; and inserting the packaged association information, as a semantic unit, at the position of the image placeholder in each parsed text to form the image-text-fused intermediate representation.
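Claim 3's placeholder substitution can be sketched directly. The `[[IMG:<id>]]` placeholder syntax and the packaged-unit format are assumptions chosen for illustration; the patent does not specify a concrete syntax.

```python
import re

# Sketch of claim 3's in-situ semantic injection: each image placeholder
# left by page parsing is replaced with a packaged semantic unit bundling
# the deep semantic description and the image's storage address.
# Placeholder syntax "[[IMG:<id>]]" is a hypothetical convention.

def inject(parsed_text, images):
    def repl(match):
        img = images[match.group(1)]
        return f'<image desc="{img["description"]}" src="{img["address"]}">'
    return re.sub(r"\[\[IMG:(\w+)\]\]", repl, parsed_text)

text = "Revenue grew steadily. [[IMG:fig1]] See the trend above."
images = {"fig1": {"description": "line chart of quarterly revenue",
                   "address": "s3://kb/fig1.png"}}
fused = inject(text, images)
```

Because the substitution happens in place, the description text inherits the image's original position in the parsed text, which is what makes later chunking context-preserving.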
- 4. The multi-modal retrieval method according to claim 1, wherein parsing the document to be processed based on the adapted parsing strategy, extracting text content and image content in the document to be processed, performing semantic understanding on the image content, and generating an image description text corresponding to the image content includes: when the adapted parsing strategy is an object extraction strategy for a streaming document, parsing the underlying Document Object Model (DOM) structure of the streaming document, and locating embedded image tags in the DOM; determining the anchor position of each image in the text stream according to the node relations of the embedded image tags in the DOM; and extracting the data of each image, performing optical character recognition and semantic understanding, and generating the image description text corresponding to the image content.
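For the streaming-document branch of claim 4, the anchor-location step can be sketched with the standard library's HTML parser. Treating HTML as the streaming format and character offsets as anchor positions are both illustrative assumptions; Word documents would need a different DOM walker.

```python
from html.parser import HTMLParser

# Sketch of claim 4's object-extraction strategy: walk the underlying DOM
# of a streaming document, locate embedded <img> tags, and record each
# image's anchor position (character offset) in the accumulated text stream.

class ImageAnchorParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text = []     # accumulated text segments
        self.anchors = []  # (offset_in_text, image_src)

    def handle_data(self, data):
        self.text.append(data)

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src", "")
            # Anchor = how much text precedes the image in document order.
            self.anchors.append((len("".join(self.text)), src))

p = ImageAnchorParser()
p.feed("<p>Before the chart.</p><img src='a.png'><p>After the chart.</p>")
```

The OCR and semantic-understanding steps on the extracted image bytes are omitted; only the anchoring logic the claim emphasizes is shown.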
- 5. The multi-modal retrieval method according to claim 4, wherein fusing the image content and the image description text back at the original logical position of the image content in the text stream corresponding to the text content by in-situ semantic injection, to form an image-text-fused intermediate representation, includes: associating the image description text with the storage address of the image; and inserting the associated information at the position in the text stream determined by the anchor position, so that the image description text remains contiguous with the text of the image's original context.
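Claim 5's anchor-based insertion can be sketched as a splice into the text stream. The inline `[IMAGE: … | …]` marker format is an assumption for illustration.

```python
# Sketch of claim 5: the association of image description text and storage
# address is spliced in at each anchor offset, so the description stays
# contiguous with the image's original textual context.

def inject_at_anchors(text, anchors):
    # anchors: list of (offset, description, address), sorted by offset.
    out, pos = [], 0
    for offset, desc, addr in anchors:
        out.append(text[pos:offset])
        out.append(f" [IMAGE: {desc} | {addr}] ")
        pos = offset
    out.append(text[pos:])
    return "".join(out)

merged = inject_at_anchors(
    "Before the chart.After the chart.",
    [(17, "bar chart of sales", "s3://kb/a.png")])
```

The offsets here are exactly the anchor positions a DOM walk (as in claim 4) would produce, which is why the two steps compose cleanly.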
- 6. The multi-modal retrieval method according to any one of claims 1-5, wherein constructing a multi-modal knowledge base based on each text block and the image association metadata of each text block includes: generating a corresponding text vector for each text block, wherein the source text of the text vector comprises the original text and the injected image description text; establishing metadata for each text block, wherein the metadata comprises at least the storage addresses of the images referenced in the text block and the page number information of the source document; and constructing the multi-modal knowledge base based on the text vector of each text block and the metadata of each text block.
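Claim 6's knowledge-base construction can be sketched as vector-plus-metadata records. The hashing "embedder" below is a deterministic toy stand-in for a real text-embedding model; the record layout is likewise an illustrative assumption.

```python
import hashlib
import math

# Sketch of claim 6: each image-text-fused block is embedded into a vector,
# and per-block metadata records the storage addresses of referenced images
# and the source-document page number.

def embed(text, dim=8):
    # Toy embedding: hash character trigrams into a fixed-size,
    # L2-normalized vector. Stand-in for a real embedding model.
    v = [0.0] * dim
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16)
        v[h % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def build_knowledge_base(blocks):
    kb = []
    for b in blocks:
        kb.append({"vector": embed(b["text"]),
                   "metadata": {"image_addresses": b.get("images", []),
                                "page": b.get("page")}})
    return kb

kb = build_knowledge_base([
    {"text": "Quarterly revenue trend [IMAGE: line chart of revenue]",
     "images": ["s3://kb/fig1.png"], "page": 3},
])
```

Because the embedded source text already contains the injected image descriptions, a later vector query about a chart can recall the block, and the metadata supplies the image address needed to fuse an image reference into the answer.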
- 7. A multi-modal retrieval apparatus, comprising: a parsing strategy determining module, configured to determine a parsing strategy adapted to a document to be processed according to rendering characteristics of the document to be processed; a document parsing module, configured to parse the document to be processed based on the adapted parsing strategy, extract text content and image content in the document to be processed, perform semantic understanding on the image content, and generate an image description text corresponding to the image content; a semantic injection module, configured to fuse the image content and the image description text back at the original logical position of the image content in the text stream corresponding to the text content by in-situ semantic injection, to form an image-text-fused intermediate representation; a multi-modal knowledge base construction module, configured to divide the image-text-fused intermediate representation into blocks based on a semantic integrity constraint, and construct a multi-modal knowledge base based on each text block and the image association metadata of each text block; and a multi-modal retrieval module, configured to perform vector retrieval in the multi-modal knowledge base based on relevant image-text semantics in a user query, recall text vector blocks containing the relevant image-text semantics, and generate a target answer fusing image references based on the text vector blocks containing the relevant image-text semantics.
- 8. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the multi-modal retrieval method of any one of claims 1 to 6.
- 9. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the multi-modal retrieval method of any one of claims 1 to 6.
- 10. A computer program product comprising a computer program which, when executed by a processor, implements the multi-modal retrieval method of any one of claims 1 to 6.
Description
Multi-modal retrieval method, device, equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence and natural language processing technologies, and in particular to a multi-modal retrieval method, apparatus, device, and storage medium.
Background
Existing large-model Retrieval Augmented Generation (RAG) technology works well on pure text, but has obvious shortcomings when processing images in enterprise-level documents such as Portable Document Format (PDF) documents and Word documents: (1) low document parsing precision, in that simple Optical Character Recognition (OCR) cannot understand complex charts (such as trend charts and architecture diagrams), so high-dimensional semantic information is lost; and (2) context splitting, in that conventional methods usually extract images separately for OCR, which breaks the logical association between an image and its context. There is therefore a need for an effective solution to the above technical problems.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a multi-modal retrieval method, apparatus, device, and storage medium, which effectively address the problem that images in knowledge-base documents cannot be accurately parsed and recalled during multi-modal retrieval.
In a first aspect, the present invention provides a multi-modal retrieval method comprising the steps of: determining a parsing strategy adapted to a document to be processed according to rendering characteristics of the document to be processed; parsing the document to be processed based on the adapted parsing strategy, extracting text content and image content in the document to be processed, performing semantic understanding on the image content, and generating an image description text corresponding to the image content; fusing the image content and the image description text back at the original logical position of the image content in the text stream corresponding to the text content, by way of in-situ semantic injection, to form an image-text-fused intermediate representation; dividing the image-text-fused intermediate representation into blocks based on a semantic integrity constraint, and constructing a multi-modal knowledge base based on each text block and the image association metadata of each text block; and performing vector retrieval in the multi-modal knowledge base based on relevant image-text semantics in a user query, recalling text vector blocks containing the relevant image-text semantics, and generating a target answer fusing image references based on the text vector blocks containing the relevant image-text semantics.
According to the multi-modal retrieval method provided by the invention, parsing the document to be processed based on the adapted parsing strategy, extracting text content and image content in the document to be processed, performing semantic understanding on the image content, and generating an image description text corresponding to the image content includes the following steps: when the adapted parsing strategy is a rasterized visual parsing strategy for a layout document, rendering each page of the layout document into a full-page image using a rendering engine; for any page of the layout document, performing visual element detection on the full-page image, and identifying and locating image regions in the full-page image; inputting the full-page image and the coordinates of each image region into a multi-modal understanding model to obtain a deep semantic description of the image region; and determining the image description text corresponding to the image content based on the deep semantic descriptions of the image regions corresponding to the respective pages. According to the multi-modal retrieval method provided by the invention, fusing the image content and the image description text back at the original logical position of the image content in the text stream corresponding to the text content by in-situ semantic injection, to form an image-text-fused intermediate representation, includes the following steps: packaging, in association, the image placeholders of the pages, the deep semantic descriptions of the image regions corresponding to the pages, and the storage addresses of the image regions corresponding to the pages; and inserting the packaged association information, as a semantic unit, at the position of the image placeholder in each parsed text to form the image-text-fused intermediate representation.
According to the multi-modal retrieval method provided by the invention, parsing the document to be processed based on the adapted parsing strategy, extracting t