
CN-121502053-B - Unsupervised cross-modal retrieval method based on memory retention hashing


Abstract

The invention provides an unsupervised cross-modal retrieval method based on memory retention hashing, relating to the technical field of unsupervised cross-modal retrieval. A similarity matrix construction module built on a historical memory bank supplies stable training supervision, effectively improving the discrimination capability and retrieval robustness of the hash codes. The method requires no semantic labels and is therefore suitable for retrieving unlabeled multimedia data in real-world scenarios; it has a lightweight model structure, high retrieval efficiency, and strong practical applicability and extensibility.

Inventors

  • LI DAI
  • GE XIAOYU
  • FU HAIYAN
  • GUO YANQING

Assignees

  • Dalian University of Technology (大连理工大学)

Dates

Publication Date
2026-05-12
Application Date
2026-01-13

Claims (9)

  1. An unsupervised cross-modal retrieval method based on memory retention hashing, characterized by comprising the following steps (minimal sketches of the steps follow the claims):
     Step 1, feature extraction and hash learning: a pre-trained encoder extracts the original semantic features $F_I$ of the image modality and the original semantic features $F_T$ of the text modality; residual multi-layer perceptrons project $F_I$ and $F_T$ into a modality-shared information space to obtain the shared semantic features $Z_I$ of the image modality and $Z_T$ of the text modality; a single-layer linear hash layer with an activation function compresses $Z_I$ and $Z_T$ into the corresponding $k$-dimensional relaxed real-valued hash features $H_I$ and $H_T$; the sign function $\mathrm{sign}(\cdot)$ generates the corresponding binary hash codes $B_I$ and $B_T$;
     Step 2, feature fusion and similarity matrix construction: a cross-modal shared feature fusion module adaptively fuses the shared semantic features $Z_I$ and $Z_T$ into a unified cross-modal feature $U$; a history memory bank built on a momentum update mechanism stores the historical information $M$ of the cross-modal features; the cross-modal feature $U$ is combined with the historical information $M$ to construct the cross-modal fusion semantic similarity matrix $S$;
     Step 3, model training and optimization: based on the real-valued hash features and the cross-modal fusion semantic similarity matrix $S$, the total training loss $\mathcal{L}$ is constructed from the fusion semantic similarity preserving loss $\mathcal{L}_f$, the pairwise semantic similarity preserving loss $\mathcal{L}_p$, and the intra- and inter-modal similarity preserving loss $\mathcal{L}_c$; the model parameters are optimized by minimizing the total training loss, and the binary hash codes are output.
  2. The unsupervised cross-modal retrieval method based on memory retention hashing of claim 1, wherein the pre-trained encoder comprises an image encoder and a text encoder, the image encoder being a Vision Transformer and the text encoder being GPT-2.
  3. The unsupervised cross-modal retrieval method based on memory retention hashing of claim 1, wherein the residual multi-layer perceptron comprises $L$ identical MLP layers with residual connections.
  4. The unsupervised cross-modal retrieval method based on memory retention hashing of claim 1, wherein in step 2, the cross-modal shared feature fusion module adaptively fuses the shared semantic features into the unified cross-modal feature $U$ through the following steps:
     Step 21, the shared semantic features $Z_I$ of the image modality and $Z_T$ of the text modality are spliced into the fusion input feature $F_{fuse}$ through a concat operation: $F_{fuse} = \mathrm{concat}(Z_I, Z_T)$, where $\mathrm{concat}(\cdot)$ denotes the feature splicing operation;
     Step 22, an $N$-layer Transformer encoder performs semantic fusion on the fusion input feature $F_{fuse}$ to generate the unified cross-modal feature $U$: $U = \mathrm{Transformer}_{\theta}(F_{fuse})$, where $\mathrm{Transformer}_{\theta}$ denotes a Transformer encoder with parameters $\theta$.
  5. The unsupervised cross-modal retrieval method based on memory retention hashing of claim 1, wherein building the history memory bank based on the momentum update mechanism satisfies: $M_i \leftarrow m \cdot M_i + (1 - m) \cdot U_i$, where $m \in [0, 1)$ denotes the momentum update coefficient; $M_i$ denotes the historical feature at the $i$-th position in the memory bank; $U_i$ denotes the cross-modal feature of the $i$-th image-text pair in the current iteration; a mini-batch update strategy is adopted, and in each iteration only the memory bank positions corresponding to the samples of the current mini-batch are updated.
  6. The unsupervised cross-modal retrieval method based on memory retention hashing of claim 5, wherein the momentum update coefficient employs a scaled momentum function that grows linearly with the number of iterations, satisfying: $m_t = \min(m_{\max},\ m_0 + \alpha \cdot t)$, where $t$ denotes the current training iteration number; $m_0$ and $\alpha$ denote momentum growth-rate adjustment hyper-parameters; $m_{\max}$ denotes the upper limit of the momentum coefficient.
  7. The unsupervised cross-modal retrieval method based on memory retention hashing of claim 1, wherein constructing the cross-modal fusion semantic similarity matrix $S$ from the cross-modal feature $U$ and the historical information $M$ comprises: first computing the cosine similarity $\cos(U, M)$ between the cross-modal feature $U$ and the historical information $M$, and then mapping the similarity values into a target range through a linear transformation to obtain the final cross-modal fusion semantic similarity matrix $S$.
  8. The unsupervised cross-modal retrieval method based on memory retention hashing of claim 1, wherein in step 3, the fusion semantic similarity preserving loss $\mathcal{L}_f$ is calculated as: $\mathcal{L}_f = \left\| \cos(H_F, H_F) - S \right\|_F^2$, where $H_F$ denotes the fusion hash feature; $\| \cdot \|_F$ denotes the Frobenius norm of a matrix; $\cos(\cdot, \cdot)$ denotes the cosine similarity between vectors.
  9. The unsupervised cross-modal retrieval method based on memory retention hashing of claim 1, wherein in step 3, the pairwise semantic similarity preserving loss $\mathcal{L}_p$ is calculated as: $\mathcal{L}_p = \mathrm{tr}\!\left( \left( \cos(H_I, H_T) - \gamma \mathbf{I} \right)^{\top} \left( \cos(H_I, H_T) - \gamma \mathbf{I} \right) \right)$, where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix; $\mathbf{I}$ denotes the identity matrix; $\gamma$ denotes a pairwise instance alignment adjustment hyper-parameter; and the intra- and inter-modal similarity preserving loss $\mathcal{L}_c$ is calculated as: $\mathcal{L}_c = \sum_{u \in \{I, T\}} \sum_{v \in \{I, T\}} \left\| \cos(H_u, H_v) - S \right\|_F^2$, where the subscripts $I$ and $T$ denote the image and text modalities, respectively.
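
The sketches below illustrate the claimed pipeline; they are minimal readings, not the patented implementation. First, a PyTorch sketch of the feature extraction and hash learning of claims 1 to 3, assuming 512-dimensional encoder outputs, a 64-bit code length, two residual MLP layers, and a tanh activation in the hash layer; the class names ResidualMLP and HashHead are hypothetical.

    import torch
    import torch.nn as nn

    class ResidualMLP(nn.Module):
        # Stack of identical MLP layers with residual connections (claim 3);
        # one shared projector is assumed, one per modality is equally plausible.
        def __init__(self, dim: int, num_layers: int = 2):
            super().__init__()
            self.layers = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
                for _ in range(num_layers)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            for layer in self.layers:
                x = x + layer(x)  # residual connection
            return x

    class HashHead(nn.Module):
        # Single linear hash layer with an activation (tanh assumed) yielding
        # k-dimensional relaxed real-valued hash features (claim 1).
        def __init__(self, dim: int, code_len: int):
            super().__init__()
            self.proj = nn.Linear(dim, code_len)

        def forward(self, z: torch.Tensor) -> torch.Tensor:
            return torch.tanh(self.proj(z))

    # Usage: F_img / F_txt stand in for ViT and GPT-2 features (claim 2).
    F_img, F_txt = torch.randn(8, 512), torch.randn(8, 512)
    shared, head = ResidualMLP(dim=512), HashHead(dim=512, code_len=64)
    H_img, H_txt = head(shared(F_img)), head(shared(F_txt))
    B_img, B_txt = torch.sign(H_img), torch.sign(H_txt)  # binary hash codes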
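A sketch of the cross-modal shared feature fusion module of claim 4. Claim 4 splices the shared features with a concat operation before the Transformer encoder; here the two features are kept as a two-token sequence so that self-attention can act across modalities, which is one plausible reading. The layer count, head count, output projection, and the name CrossModalFusion are assumptions.

    import torch
    import torch.nn as nn

    class CrossModalFusion(nn.Module):
        # Adaptive fusion of the shared image/text features into one unified
        # cross-modal feature U via a Transformer encoder (claim 4).
        def __init__(self, dim: int = 512, num_layers: int = 2, num_heads: int = 8):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
            self.out = nn.Linear(2 * dim, dim)  # assumed projection back to dim

        def forward(self, z_img: torch.Tensor, z_txt: torch.Tensor) -> torch.Tensor:
            tokens = torch.stack([z_img, z_txt], dim=1)  # (B, 2, dim) sequence
            fused = self.encoder(tokens)                 # cross-modal attention
            return self.out(fused.flatten(1))            # unified feature U

    U = CrossModalFusion()(torch.randn(8, 512), torch.randn(8, 512))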
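A sketch of the history memory bank of claims 5 to 7: momentum update of the slot of each image-text pair, a momentum coefficient that grows linearly with the iteration count up to an upper limit, and a cosine similarity matrix against the stored history. The constants m0, rate, and m_max are placeholders, and the linear range-mapping transform of claim 7 is omitted because its coefficients are not recoverable from the text.

    import torch
    import torch.nn.functional as F

    class MomentumMemoryBank:
        # One memory slot per training image-text pair; only the slots of the
        # current mini-batch are refreshed by the momentum update (claim 5).
        def __init__(self, num_samples: int, dim: int,
                     m0: float = 0.5, rate: float = 1e-4, m_max: float = 0.999):
            self.bank = torch.zeros(num_samples, dim)  # historical features M
            self.m0, self.rate, self.m_max = m0, rate, m_max

        def momentum(self, t: int) -> float:
            # Momentum grows linearly with iteration t, capped at m_max (claim 6).
            return min(self.m_max, self.m0 + self.rate * t)

        @torch.no_grad()
        def update(self, idx: torch.Tensor, u: torch.Tensor, t: int) -> None:
            m = self.momentum(t)
            self.bank[idx] = m * self.bank[idx] + (1.0 - m) * u

        def similarity(self, u: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
            # Cosine similarity between current fused features and the stored
            # history of the batch samples (claim 7, range mapping omitted).
            u_n = F.normalize(u, dim=1)
            m_n = F.normalize(self.bank[idx], dim=1)
            return u_n @ m_n.T  # (batch, batch) similarity matrix S

    bank = MomentumMemoryBank(num_samples=10000, dim=512)
    idx, u = torch.arange(8), torch.randn(8, 512)
    bank.update(idx, u, t=100)
    S = bank.similarity(u, idx)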
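A sketch of the training losses of claims 8 and 9, assuming equal loss weights and a mean-normalized variant of the squared Frobenius norm; the exact formulas live in the patent's figures and may differ.

    import torch
    import torch.nn.functional as F

    def cosine_matrix(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        return F.normalize(a, dim=1) @ F.normalize(b, dim=1).T

    def fusion_similarity_loss(h_fuse, s):
        # L_f: keep the cosine similarities of the fusion hash features H_F
        # close to the fused semantic similarity matrix S (claim 8).
        return (cosine_matrix(h_fuse, h_fuse) - s).pow(2).mean()

    def pairwise_alignment_loss(h_img, h_txt, gamma: float = 1.0):
        # L_p: align paired image/text codes, written in the trace/identity
        # form suggested by claim 9; gamma is the alignment hyper-parameter.
        c = cosine_matrix(h_img, h_txt)
        d = c - gamma * torch.eye(c.size(0), device=c.device)
        return torch.trace(d.T @ d) / c.size(0)

    def intra_inter_similarity_loss(h_img, h_txt, s):
        # L_c: preserve S within and across both modalities (claim 9).
        pairs = [(h_img, h_img), (h_img, h_txt), (h_txt, h_img), (h_txt, h_txt)]
        return sum((cosine_matrix(hu, hv) - s).pow(2).mean() for hu, hv in pairs)

    # Usage: equal-weight total loss (the patent may weight the terms).
    h_img, h_txt, h_fuse = (torch.randn(8, 64) for _ in range(3))
    s = torch.rand(8, 8) * 2 - 1  # stand-in for the similarity matrix S
    total = (fusion_similarity_loss(h_fuse, s)
             + pairwise_alignment_loss(h_img, h_txt)
             + intra_inter_similarity_loss(h_img, h_txt, s))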

Description

Unsupervised cross-modal retrieval method based on memory retention hashing

Technical Field

The invention relates to the technical field of unsupervised cross-modal retrieval, and in particular to an unsupervised cross-modal retrieval method based on memory retention hashing.

Background

Early cross-modal retrieval methods [1] were mainly based on real-valued space mapping: approaches such as canonical correlation analysis and cross-modal factor analysis compute similarity by mapping data from different modalities into a common real-valued space. Although such methods can achieve semantic alignment across modalities, they suffer from high storage and computation complexity on large-scale data. To address this challenge, cross-modal hashing methods were developed, which significantly improve retrieval efficiency by mapping multi-modal data into a compact binary hash space.

Cross-modal hashing methods fall into two categories: supervised and unsupervised. Supervised methods such as Deep Cross-Modal Hashing (DCMH) [2] use semantic labels to guide hash code learning and generally obtain good retrieval performance, but their reliance on large-scale annotated data carries high labeling costs and limits their practical scope. Unsupervised methods avoid the dependence on labels by mining the cross-modal correlations of the data, and are therefore of greater practical value. Representative methods include Deep Joint-Semantics Reconstructing Hashing (DJSRH) [3], which learns semantic relations by constructing a joint semantic affinity matrix over cross-modal instances, and the Deep Graph-neighbor Coherence Preserving Network (DGCPN) [4], which further introduces high-order neighborhood structure to strengthen semantic modeling.

However, existing unsupervised cross-modal hashing methods still face two major challenges. First, feature interaction between modalities is insufficient, leaving the feature spaces misaligned. Second, supervision lacks stability, because the similarity matrix is usually constructed from fast-changing transient features during training and is therefore exposed to noise and fluctuation.

The advent of momentum contrastive learning (MoCo) [5] and related methods in representation learning provides a new idea for unsupervised feature learning. By maintaining a memory bank of historical features through a momentum update mechanism, such methods achieve stable and consistent feature representations and effectively alleviate feature inconsistency during training. These advances offer a useful reference for constructing stable supervision signals in cross-modal hashing. However, existing methods neither fully combine a momentum memory mechanism with cross-modal hashing nor systematically address the joint optimization of cross-modal feature fusion and stable similarity supervision. Most still adopt a dual-encoder architecture that learns each modality's features independently and lacks deep cross-modal semantic interaction. In addition, similarity matrix construction relies on hand-designed fusion strategies and instantaneous features, which adapt poorly to the dynamic training process and limit further improvement in the discrimination capability and retrieval performance of the hash codes.
Therefore, an unsupervised hashing method that achieves deep cross-modal semantic fusion and provides stable similarity supervision is needed to improve the accuracy and robustness of cross-modal retrieval.

References

[1] Andrew G, Arora R, Bilmes J, et al. Deep canonical correlation analysis[C]//Proceedings of the 30th International Conference on Machine Learning. Atlanta, Georgia, USA: PMLR, 2013: 1247-1255.
[2] Jiang Q Y, Li W J. Deep cross-modal hashing[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 3232-3240.
[3] Su S, Zhong Z, Zhang C. Deep joint-semantics reconstructing hashing for large-scale unsupervised cross-modal retrieval[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 3027-3035.
[4] Yu J, Zhou H, Zhan Y, et al. Deep graph-neighbor coherence preserving network for unsupervised cross-modal hashing[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2021, 35(5): 4626-4634.
[5] He K, Fan H, Wu Y, et al. Momentum contrast for unsupervised visual representation learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 9729-9738.

Disclosure of Invention

Traditional cross-modal retrieval methods generally map data of different modalities into a real-valued common space for similarity calculation; although this can capture semantic associations, it faces the technical problem of high storage and computation costs when processing large-scale data. To address this, the invention provides an unsupervised cross-modal retrieval method based on memory retention hashing.