CN-122019801-A - Knowledge enhancement method and device integrating heterogeneous multi-modal retrieval and cross-modal alignment
Abstract
The application provides a knowledge enhancement method and device integrating heterogeneous multi-modal retrieval and cross-modal alignment. The knowledge enhancement method comprises the following steps: constructing a knowledge base containing data of different modalities; obtaining query features; obtaining the M data features in the knowledge base with the highest cosine similarity to the query features to form an initial data feature set; reordering the initial data feature set by semantic similarity score to obtain a final data feature set; generating a context prompt based on the final data feature set; and carrying out knowledge enhancement on the user query with the context prompt. In this scheme, an initial feature set is formed by cosine similarity and then reordered by a pre-trained multi-modal semantic similarity score calculation model, so that noise fragments that are globally similar but not locally matched are filtered out by fine-grained semantic matching, and the semantic relevance of the retrieval result is markedly improved.
Inventors
- HUANG ZHIPING
- QI XINQI
- XU ZHIDONG
- FAN HONGJUN
- HE WENGUANG
Assignees
- 杭州求数科技有限公司
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-04-13
Claims (10)
- 1. A knowledge enhancement method integrating heterogeneous multi-modal retrieval and cross-modal alignment, characterized by comprising the following steps: constructing a knowledge base containing data of different modalities, respectively extracting features from the data of each modality in the knowledge base, and mapping the extracted data features to an embedding space; acquiring query features based on a user query, and acquiring, in the embedding space, the M data features with the highest cosine similarity to the query features to form an initial data feature set; calculating a semantic similarity score between each data feature in the initial data feature set and the query features by using a pre-trained multi-modal semantic similarity score calculation model, and reordering the initial data feature set from high to low by semantic similarity score to obtain a final data feature set, wherein the multi-modal semantic similarity score calculation model adopts encoders corresponding to the different modalities to extract features from the data features of different modalities in the initial data feature set; and acquiring, in the final data feature set, the K data features with the highest semantic similarity to the query features as semantically similar features, fusing all the semantically similar features to obtain a context prompt, and carrying out knowledge enhancement on the user query with the context prompt.
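The two-stage pipeline of claim 1 — coarse recall of the top-M candidates by cosine similarity in the embedding space, then fine-grained reordering by a semantic similarity score to keep the top K — can be sketched as follows. This is a minimal illustration, not the patented implementation: the knowledge-base layout, the `score_fn` placeholder standing in for the pre-trained scoring model, and all identifiers are assumptions.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve_then_rerank(query_vec, kb, score_fn, m=4, k=2):
    # Stage 1: coarse recall -- M data features with highest cosine similarity
    initial = sorted(kb, key=lambda item: cosine(query_vec, item["vec"]),
                     reverse=True)[:m]
    # Stage 2: fine-grained rerank by semantic similarity score, keep top K
    final = sorted(initial, key=lambda item: score_fn(query_vec, item["vec"]),
                   reverse=True)[:k]
    return [item["id"] for item in final]
```

The point of the second stage is that `score_fn` can disagree with cosine similarity, which is exactly how globally-similar-but-locally-unmatched noise fragments get filtered out.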
- 2. The knowledge enhancement method integrating heterogeneous multi-modal retrieval and cross-modal alignment according to claim 1, characterized in that a plurality of pre-trained encoding modules corresponding to different modalities are used to respectively extract features from the data of the corresponding modalities in the knowledge base and map them to the embedding space, wherein each encoding module comprises an encoder and a linear projection layer, the encoder being used to extract feature representations of the data of the corresponding modality, and the linear projection layer being used to map the extracted feature representations to the embedding space.
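Claim 2's per-modality encoding module (a modality-specific encoder followed by a linear projection into the shared embedding space) reduces to the following shape. The class name, the dimensions, and the identity encoder used below are all hypothetical.

```python
class EncodingModule:
    """One per-modality module: a modality-specific encoder followed by a
    linear projection layer mapping into the shared embedding space."""

    def __init__(self, encoder, proj_matrix):
        self.encoder = encoder        # e.g. a text/image/audio backbone
        self.proj = proj_matrix       # rows = embedding dim, cols = encoder dim

    def embed(self, raw_data):
        feat = self.encoder(raw_data)  # modality-specific feature vector
        # Linear projection: W @ feat, one dot product per output dimension
        return [sum(w * f for w, f in zip(row, feat)) for row in self.proj]
```

One module per modality (text, image, audio, video) would be instantiated, all projecting into vectors of the same dimension so cosine similarity is comparable across modalities.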
- 3. The knowledge enhancement method integrating heterogeneous multi-modal retrieval and cross-modal alignment according to claim 2, characterized in that the plurality of encoding modules are jointly trained; during joint training, a cosine similarity matrix is built in the embedding space from the feature representations of any two different modalities, a cross-modal contrastive loss function is built on the cosine similarity matrix by maximizing positive-pair similarity and minimizing negative-pair similarity, the weighted sum of the cross-modal contrastive loss functions corresponding to all cross-modal pairs is used as a total loss function, and the parameters of each encoding module are jointly adjusted with the objective of minimizing the total loss function to complete the joint training.
- 4. The knowledge enhancement method integrating heterogeneous multi-modal retrieval and cross-modal alignment according to claim 1, characterized in that a data index is built for each piece of data in the knowledge base and stored in an index base, data features in the embedding space are obtained via the data index, and cosine similarity is computed between those data features and the query features, wherein the data index comprises the spatial position of the corresponding data feature in the embedding space.
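Claim 4's index base — a per-item data index recording where the item's feature sits in the embedding space, so retrieval touches vectors rather than raw media — can be as simple as a mapping plus a vector store. All names below are invented for illustration:

```python
class FeatureIndex:
    """Minimal index base: maps a data id to the position of its feature
    vector in the embedding store (a sketch, not a production ANN index)."""

    def __init__(self):
        self.pos = {}     # data id -> row index in the embedding store
        self.store = []   # embedding store: one feature vector per row

    def add(self, data_id, vec):
        self.pos[data_id] = len(self.store)
        self.store.append(vec)

    def lookup(self, data_id):
        # Resolve the index entry to the feature's spatial position
        return self.store[self.pos[data_id]]
```

A real deployment would likely back this with an approximate-nearest-neighbor library, but the contract is the same: index in, embedding-space position out.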
- 5. The knowledge enhancement method integrating heterogeneous multi-modal retrieval and cross-modal alignment according to claim 1, characterized in that after the user query is acquired, modality identification is performed on it; if the user query is single-modal, feature extraction is performed on it with the encoder corresponding to that modality to obtain the query features; if the user query is multi-modal, the user query is segmented by modality into a plurality of user sub-queries of different modalities, feature extraction is performed on the user sub-queries with the encoders corresponding to the different modalities to obtain a plurality of user sub-features, and all the user sub-features are then fused to obtain the query features.
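The query-side routing of claim 5 — single-modal queries go straight through the matching encoder, multi-modal queries are split, encoded per modality, and fused — might look like the sketch below. The element-wise mean fusion is only an assumption: the claim requires that sub-features be fused but does not fix the operator.

```python
def encode_query(query_parts, encoders, fuse=None):
    """query_parts: list of (modality, payload) pairs after modality
    identification/segmentation; encoders: modality -> encoder callable."""
    feats = [encoders[mod](payload) for mod, payload in query_parts]
    if len(feats) == 1:       # single-modal query: use the feature directly
        return feats[0]
    if fuse is None:          # multi-modal: default to element-wise mean fusion
        fuse = lambda fs: [sum(col) / len(fs) for col in zip(*fs)]
    return fuse(feats)
```

The `fuse` hook is where alternatives such as concatenation-plus-projection or attention pooling would plug in.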
- 6. The knowledge enhancement method integrating heterogeneous multi-modal retrieval and cross-modal alignment according to claim 1, characterized in that the pre-trained multi-modal semantic similarity score calculation model is constructed on a Transformer architecture and comprises a sequence encoding unit, a cross-modal interaction unit and a semantic relevance scoring unit; the query features and a feature to be matched are input into the model, the sequence encoding unit encodes the query features and the feature to be matched into a query feature sequence and a feature sequence to be matched respectively, the cross-modal interaction unit performs cross-modal interactive fusion of the two sequences based on self-attention and cross-attention to obtain a cross-modal fused feature sequence, and the semantic relevance scoring unit maps the cross-modal fused feature sequence to a semantic similarity score through a multi-layer perceptron, wherein the feature to be matched is any feature in the initial data feature set.
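Claim 6's scorer (cross-attention between the query sequence and the candidate sequence, then an MLP head on the fused sequence) can be reduced to a single-head, unparameterized sketch. The real model would add learned projections, self-attention layers, and a trained MLP; everything here (the `softmax` helper, the mean pooling, the trivial `mlp`) is illustrative only.

```python
import math

def softmax(xs):
    mx = max(xs)
    es = [math.exp(x - mx) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attend(query_seq, cand_seq):
    """Each query token attends over candidate tokens (scaled dot-product),
    producing the cross-modal fused feature sequence."""
    d = len(query_seq[0])
    fused = []
    for q in query_seq:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                     for k in cand_seq])
        fused.append([sum(wi * v[i] for wi, v in zip(w, cand_seq))
                      for i in range(d)])
    return fused

def semantic_score(query_seq, cand_seq, mlp):
    fused = cross_attend(query_seq, cand_seq)
    pooled = [sum(col) / len(fused) for col in zip(*fused)]  # mean-pool
    return mlp(pooled)  # MLP head maps pooled features to a scalar score
```

Because the query attends directly over the candidate's tokens, this stage can do the fine-grained, token-level matching that the coarse cosine stage cannot.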
- 7. The knowledge enhancement method integrating heterogeneous multi-modal retrieval and cross-modal alignment according to claim 1, characterized in that if the knowledge enhancement result is used for a first large language model, the Token representations of the semantically similar features are spliced to obtain the context prompt; if the knowledge enhancement result is used for a second large language model, each semantically similar feature is converted into a text Token representation, and the text Token representations of the semantically similar features are spliced to obtain the context prompt, wherein the first large language model supports multi-modal input and the second large language model supports only text input.
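Claim 7's branch on model capability — splice native Token representations for a multi-modal LLM, or convert each feature to text Tokens first for a text-only LLM — is essentially the following. Here `to_text` stands in for an unspecified feature-to-text converter (e.g. a captioner) and is hypothetical.

```python
def build_context_prompt(similar_feats, supports_multimodal, to_text):
    """Splice the retained features' Token representations into one
    context prompt, matching the target model's input capability."""
    if supports_multimodal:
        # First LLM type: pass native (possibly non-text) tokens through
        token_seqs = [f["tokens"] for f in similar_feats]
    else:
        # Second LLM type: convert each feature to a text Token representation
        token_seqs = [to_text(f) for f in similar_feats]
    # Splice: concatenate all token sequences in rank order
    return [t for seq in token_seqs for t in seq]
```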
- 8. A knowledge enhancement device integrating heterogeneous multi-modal retrieval and cross-modal alignment, characterized by comprising: a construction module for constructing a knowledge base containing data of different modalities, respectively extracting features from the data of each modality in the knowledge base, and mapping the extracted data features to an embedding space; a similarity calculation module for acquiring query features based on a user query, and acquiring, in the embedding space, the M data features with the highest cosine similarity to the query features to form an initial data feature set; a reordering module for calculating a semantic similarity score between each data feature in the initial data feature set and the query features by using a pre-trained multi-modal semantic similarity score calculation model, and reordering the initial data feature set from high to low by semantic similarity score to obtain a final data feature set, wherein the multi-modal semantic similarity score calculation model adopts encoders corresponding to the different modalities to extract features from the data features of different modalities in the initial data feature set; and a knowledge enhancement module for acquiring, in the final data feature set, the K data features with the highest semantic similarity to the query features as semantically similar features, fusing all the semantically similar features to obtain a context prompt, and carrying out knowledge enhancement on the user query with the context prompt.
- 9. An electronic device comprising a memory and a processor, wherein a computer program is stored in the memory and the processor is arranged to run the computer program to perform the knowledge enhancement method integrating heterogeneous multi-modal retrieval and cross-modal alignment as claimed in any one of claims 1 to 7.
- 10. A readable storage medium, wherein a computer program is stored in the readable storage medium which, when executed by a processor, implements the knowledge enhancement method integrating heterogeneous multi-modal retrieval and cross-modal alignment as claimed in any one of claims 1 to 7.
Description
Knowledge enhancement method and device integrating heterogeneous multi-modal retrieval and cross-modal alignment

Technical Field

The application relates to the technical field of artificial intelligence, in particular to a knowledge enhancement method and device integrating heterogeneous multi-modal retrieval and cross-modal alignment.

Background

With the rapid development of artificial intelligence technology, large language models (LLMs) have shown excellent capability in the field of natural language processing, and Retrieval-Augmented Generation (RAG) has become a mainstream means of improving the factual accuracy of large models and expanding their knowledge coverage, widely applied in scenes such as intelligent question answering and content generation. The core logic of traditional RAG is to retrieve text fragments related to the user query from an external text knowledge base and input them as context into a large language model, so that the model generates more reliable answers that better fit the user's needs; this achieves good results in pure-text knowledge interaction scenes.
However, information in the real world exhibits markedly heterogeneous, multi-modal characteristics: multi-modal information carriers such as web-page data, video content with commentary, and multimedia courseware containing audio description have become the main form in which knowledge exists. Traditional RAG, which relies solely on the text modality, therefore has obvious limitations in information utilization: a large amount of potentially useful knowledge in multi-modal carriers is lost, greatly reducing the richness and accuracy of the model's generated content, and its performance falls especially short of practical demands in cross-modal knowledge reasoning and question-answering tasks involving visual and auditory information. In prior-art systems, multi-modal retrieval has been applied in isolated fields such as image search and video retrieval, but it has not been deeply fused with the generation process of RAG, and a unified retrieval framework for heterogeneous multi-modal data mixing text, images, audio and video is lacking. Meanwhile, existing methods lack an effective cross-modal alignment mechanism: the feature representations of different modalities are mutually independent and cannot be accurately matched at the semantic level, so cross-modal knowledge retrieval and information invocation are difficult to realize, a large amount of heterogeneous multi-modal knowledge cannot be effectively utilized by large models, and a technical barrier to knowledge utilization is formed.
Some research attempts to compensate by directly feeding multi-modal information into multi-modal large models such as GPT-4V, but this approach has obvious technical bottlenecks: on the one hand, non-text modal information must be encoded into model-specific Tokens, which places extremely high demands on hardware computing resources and raises deployment costs; on the other hand, the model's context window is of fixed size, which limits the size of the knowledge base that can be invoked and cannot support the retrieval and use of large-scale heterogeneous multi-modal knowledge. In summary, the problem of efficient utilization of heterogeneous multi-modal information by large models remains unsolved in the prior art, and a large-model knowledge enhancement generation method is needed that realizes unified retrieval and cross-modal semantic alignment of heterogeneous multi-modal data and fuses deeply with the RAG generation process, so as to improve large models' understanding and generation capability for the complex multi-modal information of the real world.

Disclosure of Invention

The embodiments of the application provide a knowledge enhancement method and device integrating heterogeneous multi-modal retrieval and cross-modal alignment: an initial feature set is formed by computing cosine similarity, and the initial feature set is then reordered by a pre-trained multi-modal semantic similarity score calculation model, so that noise fragments that are globally similar but locally unmatched are filtered out by fine-grained semantic matching, and the semantic relevance of the retrieval result is markedly improved.
In a first aspect, an embodiment of the present application provides a knowledge enhancement method integrating heterogeneous multi-modal retrieval and cross-modal alignment, where the method includes: constructing a knowledge base containing data of different modalities, respectively extracting features from the data of each modality in the knowledge base, and mapping the extracted data features to an embedding space; acquiring query features based on user