CN-122019591-A - Cross-modal retrieval method and device

Abstract

The application provides a cross-modal retrieval method and device, and relates to the technical field of cross-modal retrieval and artificial intelligence. The method comprises: obtaining a sample to be queried, a pre-trained multi-modal large language model, and an identifier index library, wherein the identifier index library is constructed using a sample data set of a plurality of preset modalities and the multi-modal large language model, and the sample to be queried has at least one of the plurality of preset modalities; inputting the sample to be queried into the multi-modal large language model and outputting a structured semantic identifier corresponding to the sample to be queried; and performing matching in the identifier index library based on the semantic identifier and determining a query result corresponding to the sample to be queried according to the matching result. By generating the structured semantic identifier with the multi-modal large language model and matching it against the constructed identifier index library, the application achieves efficient and accurate cross-modal retrieval.

Inventors

WANG LEI
ZHOU XI
DONG RUI
YANG YATING
MA BO
Aihetamu River. Aihemeti
LI TIANYUAN
Han Bangju

Assignees

Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences (中国科学院新疆理化技术研究所)

Dates

Publication Date: 2026-05-12
Application Date: 2026-01-19

Claims (10)

1. A cross-modal retrieval method, comprising: acquiring a sample to be queried, a pre-trained multi-modal large language model, and an identifier index library, wherein the identifier index library is constructed using a sample data set of a plurality of preset modalities and the multi-modal large language model, and the sample to be queried has at least one of the plurality of preset modalities; inputting the sample to be queried into the multi-modal large language model, and outputting a structured semantic identifier corresponding to the sample to be queried; and performing matching in the identifier index library based on the semantic identifier, and determining a query result corresponding to the sample to be queried according to the matching result.
2. The method of claim 1, wherein the identifier index library is constructed by: processing a plurality of sample data sets based on a pre-constructed prompt template to obtain a model input sequence, wherein the prompt template is constructed according to a preset input format and a preset output format, and the input format of the prompt template is used for constraining the modalities of the plurality of processed sample data sets; inputting the model input sequence into the multi-modal large language model to obtain a structured semantic identifier corresponding to each sample data set, wherein the output format of the prompt template is used for constraining the output format of the multi-modal large language model, and the semantic identifier has the same format as the output format of the prompt template; determining the association relation between each sample data set and the corresponding semantic identifier based on the semantic identifier corresponding to each sample data set; and constructing the identifier index library based on the association relation.
3. The method of claim 2, wherein the input format includes a splicing order and placeholders corresponding to each of the modalities, and wherein processing the plurality of sample data sets based on the pre-constructed prompt template to obtain the model input sequence comprises: for any sample data set, encoding the sample data set to obtain encoding features of the sample data set; determining, from the plurality of placeholders and based on the encoding features, a target placeholder corresponding to the modality of the sample data set; performing sequence splicing on the sample data set and the target placeholder to obtain a spliced sample data set; and performing, based on the splicing order, secondary splicing on a plurality of spliced sample data sets to obtain the model input sequence.
4. The method of claim 2, wherein the semantic identifier comprises a first semantic identifier and a second semantic identifier, the association relation comprises a first association relation and a second association relation, and constructing the identifier index library based on the association relation comprises: in the case that the semantic identifier is the first semantic identifier, extracting a first identification vector of the first semantic identifier and a sample vector of the sample data set; determining the similarity between the first identification vector and the sample vector, and establishing the first association relation between the first semantic identifier and the sample data set in the case that the similarity is greater than a preset similarity threshold; in the case that the semantic identifier is the second semantic identifier, acquiring a hash value of the second semantic identifier and a hash value of the first semantic identifier; establishing the second association relation between the second semantic identifier and the sample data set in the case that the difference between the hash value of the second semantic identifier and the hash value of the first semantic identifier is within a preset difference threshold range; and constructing the identifier index library based on the first association relation and the second association relation.
5. The method according to claim 1, wherein the semantic identifier is formed by combining at least two semantic units according to a predetermined combination rule, and each semantic unit is a target object semantic unit, a behavior action semantic unit, or a scene environment semantic unit.
6. The method of claim 5, wherein inputting the sample to be queried into the multi-modal large language model and outputting the structured semantic identifier corresponding to the sample to be queried comprises: processing the multi-modal large language model based on a preset identifier generation strategy to generate a plurality of candidate semantic identifiers corresponding to the sample to be queried; matching the plurality of candidate semantic identifiers against a pre-constructed tree constraint structure, wherein the tree constraint structure is constructed using a plurality of semantic units; and screening, based on the matching result, the candidate semantic identifiers that meet a preset matching condition from the plurality of candidate semantic identifiers, and taking them as the structured semantic identifier corresponding to the sample to be queried.
7. The method of claim 6, wherein the tree constraint structure is constructed by: instantiating each semantic unit as a node, and configuring constraint conditions for each node, wherein the constraint conditions comprise the dependency relationship between the nodes and the semantic type of each node; determining the connection relation of the nodes based on the constraint conditions; and constructing the tree constraint structure based on the nodes and the connection relation.
8. The method of claim 6, wherein the tree constraint structure comprises a plurality of connection edges, and wherein screening, based on the matching result, the candidate semantic identifiers that meet the preset matching condition from the plurality of candidate semantic identifiers as the structured semantic identifier corresponding to the sample to be queried comprises: determining the generation probability of each candidate semantic identifier and the semantic units corresponding to each candidate semantic identifier; instantiating the plurality of semantic units corresponding to each candidate semantic identifier as a plurality of candidate nodes; determining the candidate connection edges corresponding to each candidate semantic identifier based on the plurality of candidate nodes and the constraint conditions; and, when connection edges identical to the candidate connection edges exist among the plurality of connection edges, taking the candidate semantic identifier with the highest generation probability as the structured semantic identifier corresponding to the sample to be queried.
9. The method according to claim 1, wherein the method further comprises: acquiring semantic interpretation information of the semantic identifier corresponding to the sample to be queried; constructing a training sample pair by taking the semantic identifier corresponding to the sample to be queried as a key and the semantic interpretation information as a value; performing secondary training on the multi-modal large language model based on the training sample pair; and updating the identifier index library using the multi-modal large language model after the secondary training.
10. A cross-modal retrieval apparatus, comprising: a sample acquisition module, used for acquiring a sample to be queried, a pre-trained multi-modal large language model, and an identifier index library, wherein the identifier index library is constructed using a sample data set of a plurality of preset modalities and the multi-modal large language model, and the sample to be queried has at least one of the plurality of preset modalities; an identification output module, used for inputting the sample to be queried into the multi-modal large language model and outputting a structured semantic identifier corresponding to the sample to be queried; and an identification query module, used for performing matching in the identifier index library based on the semantic identifier, and determining a query result corresponding to the sample to be queried according to the matching result.
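
The candidate screening described in claims 6 to 8 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the names Node, build_tree_edges, candidate_edges, and select_identifier are hypothetical, the allowed type dependencies are invented for the example, and the sketch assumes that consecutive semantic units of a candidate form its connection edges and that every such edge must exist in the constraint structure.

```python
from dataclasses import dataclass

# Semantic-unit types named in claim 5; the string labels are illustrative.
OBJECT, ACTION, SCENE = "object", "action", "scene"


@dataclass(frozen=True)
class Node:
    unit: str      # semantic unit text, e.g. "dog"
    sem_type: str  # OBJECT, ACTION, or SCENE


def build_tree_edges(nodes, allowed_type_pairs):
    """Connect a -> b whenever the dependency (type of a, type of b) is allowed."""
    return {
        (a, b)
        for a in nodes
        for b in nodes
        if a != b and (a.sem_type, b.sem_type) in allowed_type_pairs
    }


def candidate_edges(units):
    """Assumption: consecutive units of a candidate identifier form its edges."""
    return list(zip(units, units[1:]))


def select_identifier(candidates, tree_edges):
    """Return the highest-probability candidate whose edges all exist in the tree.

    candidates is a list of (tuple of Node, generation probability) pairs.
    """
    valid = [
        (units, prob)
        for units, prob in candidates
        if len(units) >= 2  # claim 5: at least two semantic units
        and all(edge in tree_edges for edge in candidate_edges(units))
    ]
    if not valid:
        return None
    return max(valid, key=lambda pair: pair[1])[0]


dog, running, beach = Node("dog", OBJECT), Node("running", ACTION), Node("beach", SCENE)
# Hypothetical dependency rules: an object may lead to an action, an action to a scene.
tree_edges = build_tree_edges([dog, running, beach], {(OBJECT, ACTION), (ACTION, SCENE)})

candidates = [
    ((beach, dog), 0.6),           # rejected: SCENE -> OBJECT is not an allowed dependency
    ((dog, running, beach), 0.3),  # valid
    ((dog, running), 0.1),         # valid, but less probable
]
best = select_identifier(candidates, tree_edges)
print([node.unit for node in best])  # prints ['dog', 'running', 'beach']
```

In this toy run the most probable candidate is discarded because it violates the constraint structure, so the next most probable valid candidate is returned. With arbitrary dependency rules the edge set need not form a tree; the rules here are chosen so that it does.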

Description

Cross-modal retrieval method and device

Technical Field

The application relates to the technical field of cross-modal retrieval and artificial intelligence, in particular to a cross-modal retrieval method and device.

Background

Cross-modal retrieval aims to process data of different modalities, such as images and text, and to achieve semantic association and retrieval matching across those modalities. A cross-modal retrieval system needs multi-modal semantic encoding capability: images and text are mapped to a unified semantic representation space so that cross-modal semantic alignment is achieved and a user can query images with text or query related text content with an image.
At present, cross-modal retrieval adopts generative cross-modal retrieval technology, in which the generation capability of a multi-modal large language model is used to generate the identifiers corresponding to samples directly at the inference stage, thereby completing an end-to-end retrieval flow.
However, when generative cross-modal retrieval technology is applied directly to a cross-modal retrieval task, the design of the identifiers usually depends on manual labeling, cluster-based generation, or atomic identifiers that require expanding the vocabulary, so a good balance between semantic alignment and scalability is difficult to achieve. Such design limitations tend to make the generated results deviate from the predefined semantic space, ultimately reducing cross-modal retrieval performance.

Disclosure of Invention

In view of the above problems, the present application provides a cross-modal retrieval method and apparatus.
The application provides a cross-modal retrieval method, comprising: obtaining a sample to be queried, a pre-trained multi-modal large language model, and an identifier index library, wherein the identifier index library is constructed using a sample data set of a plurality of preset modalities and the multi-modal large language model, and the sample to be queried has at least one of the plurality of preset modalities; inputting the sample to be queried into the multi-modal large language model and outputting a structured semantic identifier corresponding to the sample to be queried; and performing matching in the identifier index library based on the semantic identifier and determining a query result corresponding to the sample to be queried according to the matching result.
According to an embodiment of the application, the identifier index library is constructed by: processing a plurality of sample data sets based on a pre-constructed prompt template to obtain a model input sequence, wherein the prompt template is constructed according to a preset input format and a preset output format, and the input format of the prompt template is used for constraining the modalities of the plurality of processed sample data sets; inputting the model input sequence into the multi-modal large language model to obtain a structured semantic identifier corresponding to each sample data set, wherein the output format of the prompt template is used for constraining the output format of the multi-modal large language model, and the semantic identifier has the same format as the output format of the prompt template; determining the association relation between each sample data set and the corresponding semantic identifier based on the semantic identifier corresponding to each sample data set; and constructing the identifier index library based on the association relation.
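As a rough illustration of the flow summarized above, the following Python sketch builds an identifier index library and queries it. Everything in it is an assumption made for the example: stub_generate_identifier is a hypothetical stand-in for the multi-modal large language model, the "object|action|scene" identifier format is invented, and exact-match lookup is used as the simplest possible matching rule.

```python
from collections import defaultdict


def stub_generate_identifier(sample):
    """Hypothetical stand-in for the multi-modal large language model.

    A real system would prompt the model with the sample; a toy lookup
    table is used here so the example runs offline.
    """
    toy_outputs = {
        "img_001.jpg": "dog|running|beach",
        "a dog runs along the shore": "dog|running|beach",
        "img_002.jpg": "cat|sleeping|sofa",
    }
    return toy_outputs[sample]


def build_identifier_index(samples, generate):
    """Associate each sample with its generated semantic identifier."""
    index = defaultdict(list)
    for sample in samples:
        index[generate(sample)].append(sample)
    return dict(index)


def retrieve(query, index, generate):
    """Generate the query's identifier and look it up (exact match assumed)."""
    return index.get(generate(query), [])


# Index two image samples, then query with a text sample.
index = build_identifier_index(["img_001.jpg", "img_002.jpg"], stub_generate_identifier)
results = retrieve("a dog runs along the shore", index, stub_generate_identifier)
print(results)  # prints ['img_001.jpg']
```

Because the text query and the image sample map to the same structured identifier, the lookup returns the image even though the two samples have different modalities.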
According to an embodiment of the application, the input format comprises a splicing order and placeholders corresponding to each modality, and processing the plurality of sample data sets based on the pre-constructed prompt template to obtain the model input sequence comprises: for any sample data set, encoding the sample data set to obtain encoding features of the sample data set; determining, from the plurality of placeholders and based on the encoding features, a target placeholder corresponding to the modality of the sample data set; performing sequence splicing on the sample data set and the target placeholder to obtain a spliced sample data set; and performing, based on the splicing order, secondary splicing on the plurality of spliced sample data sets to obtain the model input sequence.
According to an embodiment of the application, the semantic identifier comprises a first semantic identifier and a second semantic identifier, and the association relation comprises a first association relation and a second association relation. Constructing the identifier index library based on the association relation comprises: in the case that the semantic identifier is the first semantic identifier, extracting a first identification vector of the first semantic identifier and a sample vector of the sample data set; determining the similarity of the first identification vecto