CN-122019811-A - Cross-modal image-text retrieval method based on inter-modal information interaction
Abstract
The invention provides a cross-modal image-text retrieval method based on inter-modal information interaction. The method extracts features from text data and image data respectively, applies self-attention-based local-feature context enhancement to the extracted features, performs global-feature-guided sparse screening of key local features on the enhanced text and image features, runs cross-modal attention interaction on the screened key features, fuses the key features with the features produced by the cross-modal attention interaction, aggregates and normalizes the fused features, and performs contrastive-learning training and retrieval matching on the normalized features. Key-feature screening avoids redundant interaction, cross-modal attention fuses image and text semantics precisely, and a gating mechanism adaptively preserves original modality information. As a result, retrieval accuracy and robustness are markedly improved, computational cost is reduced, and the modality-gap problem is effectively mitigated.
Inventors
- YOU XIU
- LIU JIADONG
- YAN JING
- LI LIN
- WEI WEI
- LIANG JIYE
Assignees
- Shanxi University (山西大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-03-13
Claims (8)
- 1. A cross-modal image-text retrieval method based on inter-modal information interaction, characterized by comprising the following steps: step 1, acquiring text data and image data; step 2, extracting features from the text data and the image data respectively; step 3, applying local-feature context enhancement based on a self-attention mechanism to the extracted features; step 4, applying global-feature-based sparse screening of key local features to the enhanced text features and image features; step 5, performing cross-modal attention interaction on the screened key features; step 6, fusing the key features with the features produced by the cross-modal attention interaction; step 7, aggregating and normalizing the fused features; and step 8, performing contrastive-learning training and retrieval matching based on the normalized features.
- 2. The method according to claim 1, wherein in step 2 feature extraction is performed on the text data and the image data respectively, specifically: obtaining text data; after word segmentation and stop-word removal, extracting a feature vector for each word with a pre-trained language model to obtain a set of local text features, and applying global average pooling to that set to generate a global text feature; obtaining image data; performing object detection on the image data to obtain region features, yielding a set of local image region features, and applying global average pooling to that set to obtain a global image feature (a code sketch of this step follows the claims).
- 3. The method according to claim 2, wherein in step 3 the extracted features undergo local-feature context enhancement based on a self-attention mechanism, specifically: enhancing the local text features by computing self-attention scores among the local word features and taking the score-weighted sum, yielding a set of context-aware local word features; and enhancing the local image features likewise, by computing self-attention scores among the local region features and taking the weighted sum, yielding a set of context-aware local region features (a code sketch of this step follows the claims).
- 4. The method according to claim 3, wherein in step 4 the enhanced text and image features undergo global-feature-based sparse screening of key local features, specifically: computing the cosine similarity between each context-aware local word feature and the global text feature, and retaining the features whose similarity exceeds a threshold to obtain the set of key local text features; and filtering the key image features likewise, by computing the cosine similarity between each context-aware local region feature and the global image feature and retaining the features whose similarity exceeds a threshold to obtain the set of key local image features (a code sketch of this step follows the claims).
- 5. The method of claim 4, wherein in step 5 cross-modal attention interaction is performed on the screened key features, specifically: generating image-attends-text features by taking the key image region features and the key text word features as input, computing image-to-text attention scores, and weighting the key text features by those scores to obtain text-context-aware image features, which retain the image-modality feature form while merging in text information; and generating text-attends-image features by taking the same inputs, computing text-to-image attention scores, and weighting the key image features by those scores to obtain image-context-aware text features, which retain the text-modality feature form while merging in image information (a code sketch of this step follows the claims).
- 6. The method according to claim 5, wherein in step 6 the key features are fused with the features produced by the cross-modal attention interaction, specifically: gated fusion of image regions, in which a gating value is computed for each key image region and the original region features are fused with the cross-modal image features according to that value to obtain the fused image features; and gated fusion of text words, in which the original word features are fused with the cross-modal text features according to the gating values to obtain the fused text features (a code sketch of this step follows the claims).
- 7. The method according to claim 6, wherein in step 7 the fused features are aggregated and normalized, specifically: applying global average pooling to the fused image features and the fused text features respectively, to obtain an aggregated image feature and an aggregated text feature; and applying L2 normalization to the aggregated image feature and the aggregated text feature respectively (a code sketch of this step follows the claims).
- 8. The method according to claim 7, wherein in step 8 contrastive-learning training and retrieval matching are performed based on the normalized features, specifically: constructing positive and negative sample pairs, computing the similarity of the normalized features of matched image-text pairs, and training the model with a contrastive loss; and, at retrieval time, obtaining the normalized features of the query text/image through the preceding steps, computing their similarity to the images/texts in the database, and returning the Top-k results (a code sketch of this step follows the claims).
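The patent publishes no reference code; the sketches below are one possible PyTorch reading of claims 2 through 8, and every function name, module choice, and hyperparameter (thresholds, temperature, dimensions) is an illustrative assumption rather than something taken from the patent. First, the feature extraction of claim 2: per-word vectors from a pre-trained language model and region features from an object detector are assumed to be already computed, and the global features are their averages.

```python
import torch

def extract_global(word_vecs: torch.Tensor, region_vecs: torch.Tensor):
    """Claim 2 sketch: word_vecs (n_words, d) are per-word vectors from a
    pre-trained language model after segmentation and stop-word removal;
    region_vecs (n_regions, d) are region features from an object detector.
    Global features are obtained by global average pooling."""
    text_global = word_vecs.mean(dim=0)     # (d,) global text feature
    image_global = region_vecs.mean(dim=0)  # (d,) global image feature
    return text_global, image_global

# Toy usage with random stand-ins for the encoder outputs.
words = torch.randn(12, 256)    # 12 words, 256-dim features
regions = torch.randn(36, 256)  # 36 detected regions
t_g, i_g = extract_global(words, regions)
```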
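A minimal sketch of the self-attention context enhancement of claim 3, assuming plain scaled dot-product attention without learned projections; the claim specifies only self-attention scores followed by a weighted sum.

```python
import torch
import torch.nn.functional as F

def self_attention_enhance(feats: torch.Tensor) -> torch.Tensor:
    """Claim 3 sketch: context-enhance local features (n, d) by replacing
    each feature with the attention-weighted sum over all local features."""
    d = feats.size(-1)
    scores = feats @ feats.t() / d ** 0.5  # (n, n) self-attention scores
    weights = F.softmax(scores, dim=-1)
    return weights @ feats                 # (n, d) context-aware features
```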
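The sparse screening of claim 4 keeps only the local features sufficiently aligned with the global feature of the same modality; the threshold value below is an assumption, as the patent does not fix one.

```python
import torch
import torch.nn.functional as F

def screen_key_features(local_feats: torch.Tensor,
                        global_feat: torch.Tensor,
                        threshold: float = 0.3) -> torch.Tensor:
    """Claim 4 sketch: keep local features (n, d) whose cosine similarity
    to the global feature (d,) exceeds the threshold."""
    sims = F.cosine_similarity(local_feats, global_feat.unsqueeze(0), dim=-1)
    return local_feats[sims > threshold]   # (k, d) key local features
```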
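The cross-modal attention of claim 5 is symmetric: each modality's key features query the other modality's key features, so the output keeps the query modality's structure while carrying the other modality's information. The scaled dot-product form is again an assumption.

```python
import torch
import torch.nn.functional as F

def cross_modal_attend(queries: torch.Tensor, contexts: torch.Tensor) -> torch.Tensor:
    """Claim 5 sketch: for each query feature (e.g. a key image region),
    compute attention scores over the other modality's key features
    (e.g. key words) and return their weighted sum, so each output keeps
    the query's position but merges in the other modality's information."""
    d = queries.size(-1)
    scores = queries @ contexts.t() / d ** 0.5  # (n_q, n_c) attention scores
    weights = F.softmax(scores, dim=-1)
    return weights @ contexts                   # (n_q, d) cross-modal features

# img_cross = cross_modal_attend(image_keys, text_keys)  # image attends text
# txt_cross = cross_modal_attend(text_keys, image_keys)  # text attends image
```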
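For the gated fusion of claim 6, the patent states only that a gating value is computed per key region or word and used to fuse the original and cross-modal features; a learned sigmoid gate over their concatenation is a common realization and is assumed here. The gate is what lets the model adaptively retain original single-modality information.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Claim 6 sketch: per-feature gate blending original key features
    with their cross-modal counterparts."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, original: torch.Tensor, cross: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([original, cross], dim=-1)))
        return g * original + (1 - g) * cross  # adaptively keep original info
```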
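Finally, the aggregation and normalization of claim 7 and the contrastive training and Top-k retrieval of claim 8. A symmetric InfoNCE-style loss with in-batch negatives is assumed; the patent specifies only positive/negative pairs and a contrastive loss, and the temperature is illustrative.

```python
import torch
import torch.nn.functional as F

def aggregate_and_normalize(fused: torch.Tensor) -> torch.Tensor:
    """Claim 7 sketch: global average pooling over local features (n, d),
    followed by L2 normalization."""
    return F.normalize(fused.mean(dim=0), dim=-1)

def contrastive_loss(img: torch.Tensor, txt: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Claim 8 sketch (training): img and txt are (B, d) normalized features
    of B matched pairs; diagonal entries are positives, the rest negatives."""
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0), device=img.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

def retrieve_top_k(query: torch.Tensor, database: torch.Tensor, k: int = 5):
    """Claim 8 sketch (inference): similarity of a normalized query (d,)
    against a normalized database (N, d), returning the Top-k indices."""
    return (database @ query).topk(k).indices
```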
Description
Cross-modal image-text retrieval method based on inter-modal information interaction

Technical Field
The invention relates to the technical field of cross-modal retrieval, and in particular to a cross-modal image-text retrieval method based on inter-modal information interaction.

Background
With the rapid growth of internet multimedia data, conventional single-modality retrieval can no longer meet user needs, and cross-modal retrieval, which explores the relationships among data of different modalities and supports using one modality as the query to retrieve semantically similar data of another, has become a research hotspot. Image-text retrieval is an important sub-field of cross-modal retrieval and has developed continuously in recent years. The core challenge of cross-modal retrieval is the modality gap: data of different modalities (such as images, text, and audio) differ fundamentally in their low-level feature representations, so they cannot be compared and matched directly in the original feature space. The core task of cross-modal retrieval is therefore to build a semantic bridge across modalities and to promote semantic matching through inter-modal information interaction, so that accurate retrieval becomes possible. To address this challenge, researchers have proposed a range of common-representation learning methods. These approaches map data of different modalities into a shared low-dimensional space in which semantically similar items lie as close together as possible and dissimilar items as far apart as possible. Current cross-modal retrieval methods fall roughly into two categories: real-valued retrieval and hash retrieval. Hash-based cross-modal retrieval maps high-dimensional data to low-dimensional binary hash codes through a hash function and measures similarity by Hamming distance, essentially sacrificing some semantic information for efficiency. In contrast, real-valued cross-modal retrieval extracts low-dimensional real-valued features of the multi-modal data, which preserves richer semantic information but increases storage cost and computational demand. The present invention belongs to the latter category. However, existing cross-modal image-text retrieval based on real-valued representation learning still has problems: 1. Redundant interference from full cross-modal interaction. When existing methods perform cross-modal interaction, all information produced by the image and text feature-extraction networks interacts fully, so irrelevant information may be interacted by mistake, introducing redundant interference that degrades the overall interaction quality. 2. Loss of single-modality information during cross-modal interaction. Conventional cross-modal interaction methods rely excessively on information from the other modality: the final fused feature depends only on the interaction result of the two modalities, and the fused feature can drift far from the original image or text information, losing the effective information of the single modality (image or text) so that that modality can no longer be represented. In conclusion, it is necessary to design a cross-modal image-text retrieval method based on inter-modal information interaction.
Disclosure of Invention
To overcome the defects of the prior art, the invention aims to provide a cross-modal image-text retrieval method based on inter-modal information interaction. To achieve this object, the invention provides the following solution: a cross-modal image-text retrieval method based on inter-modal information interaction, comprising the following steps: step 1, acquiring text data and image data; step 2, extracting features from the text data and the image data respectively; step 3, applying local-feature context enhancement based on a self-attention mechanism to the extracted features; step 4, applying global-feature-based sparse screening of key local features to the enhanced text features and image features; step 5, performing cross-modal attention interaction on the screened key features; step 6, fusing the key features with the features produced by the cross-modal attention interaction; step 7, aggregating and normalizing the fused features; and step 8, performing contrastive-learning training and retrieval matching based on the normalized features. Preferably, in step 2, feature extraction is performed on the text data and the image data, which specifically includes: obtaining text data; after word segmentation and stop-word removal, extracting feature vectors of