CN-122019749-A - Multi-mode data retrieval method, device, equipment, medium and product


Abstract

Embodiments of the invention provide a multi-modal data retrieval method, device, equipment, medium and product. The method comprises: acquiring a natural language question from a user and performing intent understanding on it to obtain a query intent of the natural language question; performing retrieval path selection on the query intent to determine at least one target modality type matching the natural language question; distributing the natural language question to the vector database corresponding to each target modality type for retrieval, to obtain candidate multi-modal retrieval results of the natural language question; and verifying the candidate multi-modal retrieval results against a pre-constructed multi-modal business knowledge graph to obtain target multi-modal retrieval results of the natural language question. By introducing a cooperative mechanism of semantic understanding, routing decision and knowledge enhancement, the method significantly improves cross-modal semantic understanding capability and information retrieval performance.

Inventors

CONG ZHIXIN
LIU YANXI
SONG DAIQIANG
YUAN YIFAN
LI JIANQIANG
TAN HAO
YANG YUE
LIN HENGXI
WU BINGZHE
JIN XIONGNAN
LI QINGQING
HUANG YAN
WANG LI
ZHENG WEIGUANG
YUAN SONG
ZHOU SHUANG
SHAN XUE
KUANG YAN
MENG FEI
CHEN HONG

Assignees

China Mobile Information Technology Co., Ltd. (中移信息技术有限公司)
China Mobile Communications Group Co., Ltd. (中国移动通信集团有限公司)

Dates

Publication Date: 20260512
Application Date: 20260121

Claims (14)

1. A multi-modal data retrieval method, comprising: acquiring a natural language question of a user, and performing intent understanding on the natural language question to obtain a query intent of the natural language question; performing retrieval path selection on the query intent to determine at least one target modality type matching the natural language question; distributing the natural language question to the vector database corresponding to each target modality type for retrieval, to obtain candidate multi-modal retrieval results of the natural language question; and verifying the candidate multi-modal retrieval results according to a pre-constructed multi-modal business knowledge graph to obtain target multi-modal retrieval results of the natural language question.
2. The method of claim 1, wherein performing intent understanding on the natural language question to obtain the query intent of the natural language question comprises: performing intent recognition on the natural language question using an intent classifier to obtain an intent type of the natural language question; extracting entity information from the natural language question using a named entity recognition module to obtain business elements of the natural language question; and taking the intent type and the business elements of the natural language question as the query intent of the natural language question.
3. The method of claim 1, wherein performing retrieval path selection on the query intent to determine at least one target modality type matching the natural language question comprises: adding the query intent to a prompt and inputting the prompt into a base large model to output at least one target modality type matching the natural language question; or inputting the query intent into a fine-tuned large model to output at least one target modality type matching the natural language question.
4. The method of claim 1, wherein distributing the natural language question to the vector database corresponding to each target modality type for retrieval, to obtain the candidate multi-modal retrieval results of the natural language question, comprises: performing structured analysis on the natural language question to obtain a structured analysis result; aligning the structured analysis result with the pre-constructed multi-modal business knowledge graph and reasoning over it to obtain an enhanced retrieval instruction corresponding to the natural language question; distributing the enhanced retrieval instruction to the vector database corresponding to each target modality type for retrieval, to obtain a single-modality retrieval result for each target modality type; fusing the single-modality retrieval results to obtain a fused retrieval result; and taking each single-modality retrieval result and the fused retrieval result as the candidate multi-modal retrieval results.
5. The method of claim 4, wherein aligning the structured analysis result with the pre-constructed multi-modal business knowledge graph and reasoning over it to obtain the enhanced retrieval instruction corresponding to the natural language question comprises: matching the structured analysis result against entity relationship paths in the multi-modal business knowledge graph; if the matching succeeds, taking the structured analysis result as core semantics of the query intent, and taking associated information of the structured analysis result as auxiliary semantics of the query intent; and taking the core semantics and the auxiliary semantics of the query intent as the enhanced retrieval instruction corresponding to the natural language question.
6. The method of claim 5, wherein verifying the candidate multi-modal retrieval results according to the pre-constructed multi-modal business knowledge graph to obtain the target multi-modal retrieval results of the natural language question comprises: determining semantic consistency scores between the candidate multi-modal retrieval results and the entity relationship paths in the multi-modal business knowledge graph; and sorting the candidate multi-modal retrieval results by semantic consistency score from high to low, and taking a preset number of the top-ranked candidate multi-modal retrieval results as the target multi-modal retrieval results of the natural language question.
7. The method of claim 1, further comprising, prior to acquiring the natural language question of the user: converting original multi-modal data into feature vectors of the corresponding modality types by hierarchical feature extraction; and representing the feature vectors of each modality type according to a paragraph index, and storing the feature vectors, labeled with their modality type, in the corresponding vector database according to the modality type.
8. The method of claim 7, wherein converting the original multi-modal data into feature vectors of the corresponding modality types by hierarchical feature extraction comprises: extracting image modality data from the original multi-modal data, and performing feature extraction on the image modality data using a visual encoder to generate visual feature vectors; extracting video modality data from the original multi-modal data, dividing the video modality data into a plurality of video segments, and performing feature extraction on each video segment using the visual encoder to generate visual feature vectors; and extracting text modality data from the original multi-modal data, and semantically encoding the text modality data using a pre-trained language model or a text encoder to generate text feature vectors containing context information.
9. The method of claim 7, wherein representing the feature vectors of each modality type according to a paragraph index, and storing the feature vectors, labeled with their modality type, in the corresponding vector database according to the modality type, comprises: dividing the original multi-modal data into paragraphs to obtain data paragraphs of the corresponding modality types; and, for each modality type, binding each feature vector to its corresponding data paragraph, and storing the feature vectors, labeled with the modality type, in the corresponding vector database according to the modality type.
10. The method of claim 1, wherein: if the query intent of the natural language question involves an image, the target multi-modal retrieval results comprise an image related to the natural language question and an image index position; if the query intent of the natural language question involves a video, the target multi-modal retrieval results comprise the video name, paragraph start and end timestamps, key frames and corresponding recognized text of a video related to the natural language question; and if the query intent of the natural language question involves text, the target multi-modal retrieval results comprise text related to the natural language question and a text index position.
11. A multi-modal data retrieval apparatus, comprising: an intent determining module, configured to acquire a natural language question of a user, and perform intent understanding on the natural language question to obtain a query intent of the natural language question; a modality determining module, configured to perform retrieval path selection on the query intent to determine at least one target modality type matching the natural language question; an initial determining module, configured to distribute the natural language question to the vector database corresponding to each target modality type for retrieval, to obtain candidate multi-modal retrieval results of the natural language question; and a retrieval determining module, configured to verify the candidate multi-modal retrieval results according to a pre-constructed multi-modal business knowledge graph to obtain target multi-modal retrieval results of the natural language question.
12. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores a computer program executable by the at least one processor, to enable the at least one processor to perform the multi-modal data retrieval method of any one of claims 1-10.
13. A computer-readable storage medium storing computer instructions which, when executed, cause a processor to perform the multi-modal data retrieval method according to any one of claims 1-10.
14. A computer program product, characterized in that the computer program product comprises a computer program which, when executed by a processor, implements the multi-modal data retrieval method according to any one of claims 1-10.
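
For illustration only, the knowledge-graph steps of claims 5 and 6 can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the claimed implementation: the graph contents, the names `KG_PATHS`, `Candidate`, `build_enhanced_instruction`, `consistency_score` and `verify`, and the term-overlap score are all hypothetical, since the claims do not specify how paths are stored or how semantic consistency is computed.

```python
from dataclasses import dataclass

# Hypothetical knowledge graph: each entity relationship path is a
# (head entity, relation, tail entity) triple. The schema is an assumption.
KG_PATHS = [
    ("electronic signature", "requires", "identity verification"),
    ("video double recording", "contains", "risk disclosure"),
]


@dataclass
class Candidate:
    content: str   # text, caption or transcript attached to the result
    modality: str  # "text", "image" or "video"


def build_enhanced_instruction(entities, paths=KG_PATHS):
    """Claim 5 sketch: match parsed entities against graph paths.

    On a match, the parsed entities become the core semantics and the
    other endpoint of each matched path becomes auxiliary semantics.
    Returns None when nothing matches.
    """
    matched = [p for p in paths if p[0] in entities or p[2] in entities]
    if not matched:
        return None
    auxiliary = sorted(
        {term for head, _rel, tail in matched
         for term in (head, tail) if term not in entities}
    )
    return {"core": list(entities), "auxiliary": auxiliary}


def consistency_score(candidate, paths=KG_PATHS):
    """Assumed stand-in score: the best fraction of a path's two endpoint
    entities that appear in the candidate's content."""
    text = candidate.content.lower()
    best = 0.0
    for head, _rel, tail in paths:
        hits = sum(term in text for term in (head, tail))
        best = max(best, hits / 2)
    return best


def verify(candidates, top_k=2, paths=KG_PATHS):
    """Claim 6 sketch: sort by score, high to low, and keep a preset number."""
    ranked = sorted(candidates,
                    key=lambda c: consistency_score(c, paths),
                    reverse=True)
    return ranked[:top_k]
```

In this sketch the preset number of claim 6 corresponds to the `top_k` parameter. A production system would presumably replace the substring overlap with an embedding-based or graph-based consistency measure.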

Description

Multi-mode data retrieval method, device, equipment, medium and product

Technical Field

The present invention relates to the field of data processing technologies, and in particular to a multi-modal data retrieval method, device, equipment, medium and product.

Background

In recent years, with the development of artificial intelligence technology in the multi-modal field, multi-modal information retrieval, reasoning and question-answering systems have become a research focus. Multi-modal information retrieval and reasoning are gradually maturing in fields such as intelligent question answering, video understanding and e-government. Common practice includes semantic expansion and matching based on knowledge graphs, processing multi-modal input with vision/language models, and incorporating retrieval-augmented generation mechanisms to improve generation quality. Some technologies also introduce autoregressive large models, graph neural networks, knowledge adapters and the like for feature fusion and inference enhancement.

However, in business scenarios such as electronic signature and video double recording, the data is unstructured, mixes heterogeneous modalities and has a large semantic span. As a result, existing methods fall short in precision, timeliness and scalability, and in particular struggle to support efficient, controllable, multi-level business information extraction and compliance review.
Disclosure of Invention

Embodiments of the invention provide a multi-modal data retrieval method, device, equipment, medium and product. By introducing a cooperative mechanism of semantic understanding, routing decision and knowledge enhancement, they construct a highly adaptive and intelligent multi-modal content retrieval mode, thereby addressing the bottleneck of the prior art at its root and significantly improving cross-modal semantic understanding capability and information retrieval performance.

In a first aspect, the present embodiment provides a multi-modal data retrieval method, including:

acquiring a natural language question of a user, and performing intent understanding on the natural language question to obtain a query intent of the natural language question;

performing retrieval path selection on the query intent to determine at least one target modality type matching the natural language question;

distributing the natural language question to the vector database corresponding to each target modality type for retrieval, to obtain candidate multi-modal retrieval results of the natural language question; and

verifying the candidate multi-modal retrieval results according to a pre-constructed multi-modal business knowledge graph to obtain target multi-modal retrieval results of the natural language question.
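
As a rough illustration of how the four steps of the first aspect fit together, the following Python sketch wires them into one function. Everything in it is an assumption made for the example: keyword rules stand in for the intent classifier, the named entity recognition module and the large-model router; a bag-of-words cosine similarity stands in for real encoders and vector databases; and a simple business-element filter stands in for knowledge graph verification. All names and sample data are hypothetical.

```python
import math
from collections import Counter

# Hypothetical business elements and modality cue words.
KNOWN_ELEMENTS = ("electronic signature", "risk disclosure")
MODALITY_CUES = {
    "image": ("image", "photo", "picture"),
    "video": ("video", "clip", "recording"),
    "text": ("text", "clause", "document"),
}
# Hypothetical per-modality content, described by short captions.
DOCUMENTS = {
    "text": ["contract clause on electronic signature consent"],
    "image": ["scanned image of signed electronic signature page"],
    "video": ["video recording of customer reading risk disclosure"],
}


def embed(text):
    """Toy bag-of-words embedding (stand-in for a real encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# One in-memory "vector database" per modality type.
STORES = {m: [(d, embed(d)) for d in docs] for m, docs in DOCUMENTS.items()}


def understand_intent(question):
    """Step 1: intent type plus business elements (keyword stand-in)."""
    q = question.lower()
    return {
        "type": "compliance_check" if "compliant" in q else "lookup",
        "elements": [e for e in KNOWN_ELEMENTS if e in q],
        "text": q,
    }


def route(intent):
    """Step 2: choose target modality types; fall back to all of them."""
    targets = [m for m, cues in MODALITY_CUES.items()
               if any(cue in intent["text"] for cue in cues)]
    return targets or list(MODALITY_CUES)


def dispatch(question, modalities, top_k=1):
    """Step 3: search each target modality's store separately."""
    query = embed(question)
    candidates = []
    for modality in modalities:
        scored = sorted(((cosine(query, vec), doc)
                         for doc, vec in STORES[modality]), reverse=True)
        candidates += [(modality, doc, score) for score, doc in scored[:top_k]]
    return candidates


def verify(candidates, elements):
    """Step 4 stand-in: keep candidates that mention a business element."""
    if not elements:
        return candidates
    return [c for c in candidates if any(e in c[1] for e in elements)]


def retrieve(question):
    intent = understand_intent(question)
    candidates = dispatch(question, route(intent))
    return [(m, doc) for m, doc, _score in verify(candidates, intent["elements"])]
```

With the sample data above, `retrieve("Show the video where the customer reads the risk disclosure")` routes only to the video store and returns its single entry. The fallback to all modality types when no cue is found is a design choice of this sketch, not something the description states.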
In a second aspect, the present embodiment provides a multi-modal data retrieval apparatus, the apparatus comprising:

an intent determining module, configured to acquire a natural language question of a user, and perform intent understanding on the natural language question to obtain a query intent of the natural language question;

a modality determining module, configured to perform retrieval path selection on the query intent to determine at least one target modality type matching the natural language question;

an initial determining module, configured to distribute the natural language question to the vector database corresponding to each target modality type for retrieval, to obtain candidate multi-modal retrieval results of the natural language question; and

a retrieval determining module, configured to verify the candidate multi-modal retrieval results according to a pre-constructed multi-modal business knowledge graph to obtain target multi-modal retrieval results of the natural language question.

In a third aspect, the present embodiment provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores a computer program executable by the at least one processor, to enable the at least one processor to perform the multi-modal data retrieval method of any of the embodiments of the invention.

In a fourth aspect, the present embodiment provides a computer-readable storage medium storing computer instructions which, when executed, cause a processor to implement the multi-modal data retrieval method according to any one of the embodiments of the present invention.

In a fifth aspect, embodiments of the present invention also provide a computer program product comprising a computer program which, when executed by a processor, implements the multi-modal data retrieval method according to any of the embodiments of the present invention.
The embodiment of the invention provides a multi-mode data retrieval method, a device, equipment