
CN-121980002-A - Multimodal knowledge-graph enhanced retrieval method and dialogue system based on feature fusion optimization


Abstract

The invention discloses a multimodal knowledge-graph enhanced retrieval method and dialogue system based on feature fusion optimization. A visual model and a language model are respectively pre-trained and fine-tuned on domain images and text data. A knowledge graph is built from the text data in a knowledge base, and a vector database containing the associated image data is constructed. The user's original query is semantically analysed and optimized by the language model into a structured retrieval intention; based on that intention, the related subgraph, text semantic vector information and associated image data are retrieved. The subgraph is encoded into a knowledge context and fed into a dynamic prompt generator to produce a visual prompt, and enhanced visual features are extracted from the visual prompt and the associated image data through the visual model. The subgraph, the text semantic vector information and the enhanced visual features are then jointly reasoned over by the language model, which generates and outputs the final answer. The method achieves deep fusion and accurate retrieval of multimodal knowledge and markedly improves retrieval accuracy and efficiency.

Inventors

  • YU GUOBIN
  • SHEN LIYAN
  • YANG YUHUI
  • WANG YONGCHAO

Assignees

  • Zhejiang University (浙江大学)

Dates

Publication Date
2026-05-05
Application Date
2025-12-02

Claims (10)

  1. A multimodal knowledge-graph enhanced retrieval method based on feature fusion optimization, characterized by comprising the following steps: pre-training and fine-tuning a visual model and a language model, respectively, using domain images and text data; segmenting the text data in a knowledge base and extracting entities and relations with the fine-tuned language model to construct a knowledge graph, while associating and indexing the image data in the knowledge base with the corresponding entities and storing them in a vector database together with the semantic vectors of the entities and relations; receiving a user's original query, performing semantic analysis and proactive optimization on it with the fine-tuned language model, and extracting target entities and relations to form a structured retrieval intention; based on the retrieval intention, executing a graph query in the knowledge graph and a semantic search in the vector database to obtain a related subgraph, text semantic vector information and associated image data; encoding the subgraph into a knowledge context, feeding it into a dynamic prompt generator to produce a visual prompt, and extracting enhanced visual features with the pre-trained visual model from the visual prompt and the associated image data; and forming a multimodal context from the subgraph, the text semantic vector information and the enhanced visual features, feeding it into the fine-tuned language model for collaborative reasoning, and generating and outputting a final answer (a structural sketch of this pipeline appears after the claims list).
  2. The multimodal knowledge-graph enhanced retrieval method based on feature fusion optimization according to claim 1, wherein pre-training and fine-tuning the visual model and the language model using domain images and text data, respectively, comprises: pre-training a visual model based on a Vision Transformer architecture under a masked-autoencoder framework, randomly masking a high proportion of the image patches in the input image and reconstructing the pixel values of the masked regions through the cooperation of the visual model's encoder and a lightweight decoder, so that the model learns structured image representations from visual context and acquires deep visual-feature understanding (see the masked-autoencoder sketch after the claims); and performing supervised fine-tuning of a general large language model with an instruction dataset built from the domain text data, the fine-tuning objectives including entity recognition and relation understanding, so that the model acquires deep understanding of domain terminology, knowledge structures and semantic logic.
  3. The multimodal knowledge-graph enhanced retrieval method based on feature fusion optimization according to claim 1, wherein segmenting the text data in the knowledge base and extracting entities and relations with the fine-tuned language model to construct the knowledge graph comprises: performing multistage segmentation of the text data in the knowledge base, first splitting it into preliminary segments by paragraph and then cutting over-long paragraphs with a sliding window having overlapping regions to obtain a plurality of text fragments (see the segmentation sketch after the claims); and, based on preset entity-relation extraction prompt templates, extracting structured information from the text fragments with the fine-tuned language model and applying multiple rounds of data optimization, including cleaning, merging and de-duplication, to the extraction results to form a canonical entity and relation dataset from which the knowledge graph is constructed.
  4. The multimodal knowledge-graph enhanced retrieval method based on feature fusion optimization according to claim 1, wherein associating and indexing the image data in the knowledge base with the corresponding entities comprises: in the knowledge graph, adding the storage path or unique identifier of the image file as an attribute of the image-typed entity node, and in the vector database storing the semantic vector of that entity node in association with the image identifier, thereby establishing a direct mapping from entity semantics to images (see the indexing sketch after the claims).
  5. The multimodal knowledge-graph enhanced retrieval method based on feature fusion optimization according to claim 1, wherein receiving the user's original query, performing semantic analysis and proactive optimization on it with the fine-tuned language model, and extracting target entities and relations to form a structured retrieval intention comprises: analysing the user's original query with the fine-tuned language model to identify its key information and potential missing points; using the entity relations in the knowledge graph and common domain interaction scenarios as prompts to supplement the key information and missing points, and generating a plurality of candidate completion directions or more explicit formulations; and feeding the candidates back to the user for confirmation so as to obtain a semantically unambiguous query expression, from which the target entities and relations are finally extracted to form the structured retrieval intention (see the query-optimization sketch after the claims).
  6. The multimodal knowledge-graph enhanced retrieval method based on feature fusion optimization according to claim 1, wherein the dynamic prompt generator is a lightweight multi-layer perceptron whose input is the encoded subgraph knowledge-context vector and whose output is a set of learnable visual prompt vectors; when fed into the visual model, these prompt vectors are concatenated with the image patch embeddings and, by guiding the model's internal attention mechanism, focus it on the image regions indicated by the knowledge context (see the prompt-generator sketch after the claims).
  7. The multimodal knowledge-graph enhanced retrieval method based on feature fusion optimization according to claim 6, wherein the dynamic prompt generator introduces an orthogonality constraint when generating the visual prompt vectors, so that different prompt vectors focus on different semantic regions of the image, ensuring the diversity of the prompts and overall coverage of the key visual information.
  8. A multimodal knowledge-graph enhanced retrieval dialogue system based on feature fusion optimization, implemented by the multimodal knowledge-graph enhanced retrieval method based on feature fusion optimization according to any one of claims 1-7, characterized by comprising a model domain-adaptation module, a knowledge base construction module, a query understanding module, an associated retrieval module, a visual enhancement module and a multimodal reasoning module; the model domain-adaptation module is configured to pre-train and fine-tune the visual model and the language model, respectively, using domain images and text data; the knowledge base construction module is configured to segment the text data in the knowledge base, extract entities and relations with the fine-tuned language model to construct the knowledge graph, and, after associating and indexing the image data in the knowledge base with the corresponding entities, store them in the vector database together with the semantic vectors of the entities and relations; the query understanding module is configured to receive the user's original query, perform semantic analysis and proactive optimization on it with the fine-tuned language model, and extract target entities and relations to form a structured retrieval intention; the associated retrieval module is configured to execute, based on the retrieval intention, a graph query in the knowledge graph and a semantic search in the vector database to obtain the related subgraph, text semantic vector information and associated image data; the visual enhancement module is configured to encode the subgraph into a knowledge context, feed it into the dynamic prompt generator to produce a visual prompt, and extract enhanced visual features with the pre-trained visual model from the visual prompt and the associated image data; and the multimodal reasoning module is configured to form a multimodal context from the subgraph, the text semantic vector information and the enhanced visual features, feed it into the fine-tuned language model for collaborative reasoning, and generate and output a final answer.
  9. An electronic device comprising a memory and one or more processors, the memory being configured to store a computer program, wherein the processor is configured to implement the feature-fusion-optimization-based multimodal knowledge-graph enhanced retrieval method of any one of claims 1-7 when the computer program is executed.
  10. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a computer, implements the feature-fusion-optimization-based multimodal knowledge-graph enhanced retrieval method of any one of claims 1-7.
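
The following Python sketch outlines, at a purely structural level, the retrieval and reasoning pipeline of claim 1. It is a minimal illustration assuming duck-typed components; the object names and method signatures (graph_store.match, vector_store.search, llm.parse_intention, llm.generate, and so on) are hypothetical placeholders, not interfaces defined by the patent.

```python
# Structural sketch of the claim-1 pipeline; all passed-in components are
# hypothetical duck-typed objects with illustrative method names.

def answer_query(raw_query, graph_store, vector_store, llm,
                 vision_encoder, prompt_generator, top_k=5):
    """End-to-end retrieval and multimodal reasoning for one user query."""
    # 1. Query understanding: semantic analysis of the raw query into a
    #    structured retrieval intention (target entities and relations).
    intention = llm.parse_intention(raw_query)

    # 2. Associated retrieval: graph query for a related subgraph plus a
    #    semantic search for text vectors and associated images.
    subgraph = graph_store.match(intention["entities"], intention["relations"])
    text_hits, images = vector_store.search(intention["embedding"], top_k=top_k)

    # 3. Visual enhancement: encode the subgraph as a knowledge context,
    #    map it to visual prompt vectors, and extract enhanced visual
    #    features from the associated images.
    knowledge_context = " ; ".join(f"{h}-[{r}]->{t}" for h, r, t in subgraph)
    visual_prompts = prompt_generator(knowledge_context)
    visual_features = vision_encoder(images, prompts=visual_prompts)

    # 4. Collaborative reasoning: fuse subgraph, text hits and enhanced
    #    visual features into one multimodal context and generate the answer.
    context = {"subgraph": subgraph, "text": text_hits, "vision": visual_features}
    return llm.generate(question=raw_query, context=context)
```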
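
Claim 2 pre-trains the visual model with a masked-autoencoder objective. The sketch below illustrates that idea with a deliberately tiny PyTorch model: a high proportion of patches is randomly masked, only the visible patches are encoded, a lightweight decoder reconstructs pixel values, and the loss is computed on the masked patches only. Patch size, embedding width, depth and the 75% mask ratio are illustrative assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn

PATCH, DIM, MASK_RATIO = 16, 192, 0.75   # illustrative values

def patchify(imgs):
    """(B, 3, H, W) -> (B, N, PATCH*PATCH*3) flattened non-overlapping patches."""
    b, c, h, w = imgs.shape
    p = PATCH
    x = imgs.reshape(b, c, h // p, p, w // p, p)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(b, (h // p) * (w // p), p * p * c)

class TinyMAE(nn.Module):
    def __init__(self, num_patches=196):
        super().__init__()
        self.embed = nn.Linear(PATCH * PATCH * 3, DIM)
        self.pos = nn.Parameter(torch.zeros(1, num_patches, DIM))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, DIM))
        enc = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        dec = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=4)
        self.decoder = nn.TransformerEncoder(dec, num_layers=1)   # lightweight decoder
        self.head = nn.Linear(DIM, PATCH * PATCH * 3)

    def forward(self, imgs):
        patches = patchify(imgs)                                  # (B, N, P*P*3)
        b, n, _ = patches.shape
        tokens = self.embed(patches) + self.pos[:, :n]

        # Randomly mask a high proportion of patches; encode only the visible ones.
        n_keep = int(n * (1 - MASK_RATIO))
        keep_idx = torch.rand(b, n, device=imgs.device).argsort(1)[:, :n_keep]
        keep = keep_idx.unsqueeze(-1).expand(-1, -1, DIM)
        encoded = self.encoder(torch.gather(tokens, 1, keep))

        # Put mask tokens at the masked positions and decode the full sequence.
        full = self.mask_token.repeat(b, n, 1)
        full = torch.scatter(full, 1, keep, encoded)
        recon = self.head(self.decoder(full + self.pos[:, :n]))

        # Reconstruction loss on the masked patches only.
        masked = torch.ones(b, n, device=imgs.device)
        masked.scatter_(1, keep_idx, 0.0)
        per_patch = ((recon - patches) ** 2).mean(-1)
        return (per_patch * masked).sum() / masked.sum()

# e.g. loss = TinyMAE()(torch.randn(2, 3, 224, 224)); loss.backward()
```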
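
Claim 3's multistage segmentation, paragraph splitting followed by an overlapping sliding window for over-long paragraphs, can be sketched in a few lines; the character-based thresholds below are assumptions for illustration only.

```python
def segment(text, max_len=500, overlap=100):
    """Split text into paragraphs, then cut over-long paragraphs with an
    overlapping sliding window; returns chunks of at most max_len characters."""
    chunks = []
    for para in (p.strip() for p in text.split("\n\n") if p.strip()):
        if len(para) <= max_len:
            chunks.append(para)
            continue
        # Overlap keeps entities/relations cut at a window boundary intact
        # in the neighbouring chunk.
        step = max_len - overlap
        for start in range(0, len(para), step):
            chunks.append(para[start:start + max_len])
            if start + max_len >= len(para):
                break
    return chunks
```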
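
Claim 4's association indexing can be illustrated with in-memory stand-ins for the graph and vector databases: the image path and identifier become attributes of the image-typed entity node, and the entity's semantic vector is stored alongside the image identifier so that a semantic search returns images directly. The record layout and the embed_fn callable are assumptions.

```python
import numpy as np

graph_nodes = {}      # entity name -> node attributes (stands in for the graph DB)
vector_index = []     # list of records        (stands in for the vector database)

def index_image_entity(entity, image_path, embed_fn):
    """Attach an image file to its entity node and store the joint record."""
    # 1. Knowledge graph side: the image path / unique identifier is stored
    #    as an attribute of the image-typed entity node.
    image_id = f"img::{entity}"
    graph_nodes[entity] = {"type": "image_entity",
                           "image_path": image_path, "image_id": image_id}
    # 2. Vector database side: the entity's semantic vector is stored together
    #    with the image identifier, giving a direct entity-to-image mapping.
    vector_index.append({"entity": entity, "image_id": image_id,
                         "vector": np.asarray(embed_fn(entity), dtype=np.float32)})

def images_for_query(query_vec, top_k=3):
    """Semantic search returning the image identifiers of the best entities."""
    q = np.asarray(query_vec, dtype=np.float32)
    def cosine(r):
        v = r["vector"]
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))
    best = sorted(vector_index, key=cosine, reverse=True)[:top_k]
    return [(r["entity"], r["image_id"]) for r in best]
```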
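
Claim 5's interactive query optimization is essentially an LLM-driven clarify-then-confirm loop. The sketch below shows one plausible arrangement; the prompt wording and the llm_complete / ask_user callables are hypothetical stand-ins for the fine-tuned language model and the dialogue front end.

```python
CLARIFY_PROMPT = """Analyse the user query below against the knowledge graph schema.
List the key information it contains, any potential missing points, and propose
up to {n} candidate completions or more explicit rewrites of the query.

Schema entities/relations: {schema}
User query: {query}"""

def optimize_query(raw_query, schema, llm_complete, ask_user, n_candidates=3):
    # 1. Semantic analysis: key information, missing points and candidate
    #    completion directions, guided by the graph schema.
    candidates = llm_complete(CLARIFY_PROMPT.format(
        n=n_candidates, schema=schema, query=raw_query))
    # 2. Feed the candidates back to the user for confirmation, yielding a
    #    semantically unambiguous query expression.
    confirmed = ask_user(candidates)
    # 3. Extract target entities and relations as the structured intention.
    return llm_complete(
        f"Extract the target entities and relations as JSON from: {confirmed}")
```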
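
Claims 6 and 7 describe the dynamic prompt generator: a lightweight MLP maps the encoded subgraph context vector to a set of visual prompt vectors, which are concatenated with the patch embeddings before the ViT encoder, and an orthogonality penalty keeps the prompt vectors distinct. The PyTorch sketch below is a minimal rendering of that description; all dimensions and the exact penalty form are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicPromptGenerator(nn.Module):
    """Lightweight MLP: subgraph context vector -> a set of visual prompts."""
    def __init__(self, ctx_dim=768, prompt_dim=768, num_prompts=8):
        super().__init__()
        self.num_prompts, self.prompt_dim = num_prompts, prompt_dim
        self.mlp = nn.Sequential(
            nn.Linear(ctx_dim, 2 * prompt_dim), nn.GELU(),
            nn.Linear(2 * prompt_dim, num_prompts * prompt_dim))

    def forward(self, knowledge_ctx):
        """knowledge_ctx: (B, ctx_dim) encoded subgraph context."""
        prompts = self.mlp(knowledge_ctx)
        return prompts.view(-1, self.num_prompts, self.prompt_dim)

def orthogonality_penalty(prompts):
    """Claim 7: push distinct prompt vectors towards orthogonality so they
    attend to different semantic regions of the image."""
    p = F.normalize(prompts, dim=-1)              # (B, K, D)
    gram = p @ p.transpose(1, 2)                  # (B, K, K) cosine similarities
    eye = torch.eye(p.size(1), device=p.device)
    return ((gram - eye) ** 2).mean()             # penalise off-diagonal similarity

def prepend_prompts(patch_embeddings, prompts):
    """Concatenate prompts in front of the patch embeddings before the ViT
    encoder, so self-attention is steered by the knowledge context.
    patch_embeddings: (B, N, D); prompts: (B, K, D) -> (B, K + N, D)."""
    return torch.cat([prompts, patch_embeddings], dim=1)
```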

Description

Multimodal knowledge-graph enhanced retrieval method and dialogue system based on feature fusion optimization

Technical Field

The invention belongs to the technical field of natural language processing and in particular relates to a multimodal knowledge-graph enhanced retrieval method and dialogue system based on feature fusion optimization.

Background

With the rapid development of information technology, knowledge bases in various fields (such as medicine, finance, education and cultural heritage) have undergone a profound transformation. Early knowledge bases were mainly text-entry stores with a single function, able to satisfy only simple knowledge queries. With the spread of technologies such as high-precision scanning, three-dimensional modelling and image acquisition, however, these knowledge bases have been substantially upgraded: they have evolved from simple text stores into complex collections that fuse text, images, three-dimensional models, audio and other types of data, exhibiting typical multimodal characteristics. This evolution of data form not only provides rich material for deep mining of domain knowledge, but also poses entirely new challenges for retrieval technology in cross-modal association, precise localization and semantic understanding.

Mainstream single-modality retrieval techniques can retrieve efficiently within their own modality, but they fall short when handling a multimodal knowledge base; the technical bottleneck is the modality barrier between data of different modalities. This barrier arises because the representation forms, feature dimensions and semantic-expression logic of different modalities differ fundamentally: text data convey semantic information through character sequences, image data present visual features as pixel matrices, and audio data carry auditory information as waveform signals. These differences prevent a system from directly establishing deep semantic associations across modalities, so that in practical retrieval scenarios the composite retrieval needs of users are difficult to satisfy accurately.

To compensate for this lack of semantic association, knowledge-graph technology has been introduced as a tool for structured knowledge representation and associative querying, which alleviates the problem to some extent. Through systematic modelling of core entities (such as diseases, drugs and genes in the medical field, or institutions, products and clients in the financial field), entity attributes (such as drug indications and product yields) and inter-entity relations (such as the correspondence between diseases and symptoms, or the lending relations between institutions and clients), a knowledge graph can construct a logically clear knowledge network and support precise querying and reasoning based on entity associations.
However, the construction and reasoning of traditional knowledge graphs still revolve around text semantics; their ability to model non-text modalities such as images, audio and three-dimensional models is seriously insufficient, and there is no effective technical means of converting the semantic information in non-text data into structured knowledge that the graph can identify and associate. Moreover, the terminology of a specific field is often highly specialized and ambiguous, which is a major obstacle to the accurate understanding of retrieval needs. In every professional field, many concepts are similar in surface expression yet differ markedly in core connotation and applicable scenario, and only by distinguishing them precisely can valid retrieval results be guaranteed. Taking the cultural heritage field as an example, when a user refers to the term "tripod", the reference may be to different forms such as a square tripod, a round tripod or a tripod with separated legs; these differences lie not only in appearance and structure but also in the technological features and cultural connotations of different eras. A conventional retrieval system that relies only on text labels cannot understand or associate such specific visual morphological features, and its retrieval results inevitably become generalized and biased. Therefore, the prior art faces a core bottleneck in which the modal barrier is difficult to break and the knowledge graph is insufficient in representing multi-modal i