CN-121980053-A - Noise robust cross-modal retrieval method and system based on neighborhood enhancement and confidence guidance


Abstract

The invention belongs to the technical field of multi-modal information retrieval and provides a noise-robust cross-modal retrieval method and system based on neighborhood enhancement and confidence guidance. The method comprises: obtaining original cross-modal features of image samples and text samples; computing, from the obtained original cross-modal features, an original similarity matrix and a homomodal (same-modality) similarity matrix between samples; performing attention-based weighted aggregation of the neighborhood features given by the homomodal similarity matrix to generate neighborhood-enhanced features; applying a semantic consistency regularization constraint to the matching errors of the original cross-modal features and of the neighborhood-enhanced features; and adopting a Bayesian Gaussian mixture model together with cross-network confidence fusion, combined with soft-label supervision, to complete noise-robust cross-modal retrieval based on neighborhood enhancement and confidence guidance.
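The similarity and neighborhood-enhancement steps named in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not stated in the source: cosine similarity as the similarity function, a temperature-scaled softmax as the attention mechanism, the additive fusion `original + alpha * aggregated`, the fixed value of `alpha` (the claims describe a learnable residual scaling parameter), and the names `cosine_matrix` and `neighborhood_enhance`.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine_matrix(a, b):
    """Cosine similarity between every row of `a` and every row of `b`."""
    a_n = [l2_normalize(v) for v in a]
    b_n = [l2_normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b_n] for u in a_n]

def neighborhood_enhance(feats, k=2, alpha=0.5, temperature=0.1):
    """Fuse each feature with an attention-weighted sum of its k nearest
    same-modality neighbours (assumed form: original + alpha * aggregated).
    Requires at least two samples so that every anchor has a neighbour."""
    sim = cosine_matrix(feats, feats)  # homomodal similarity matrix
    enhanced = []
    for i, row in enumerate(sim):
        # k most similar samples, excluding the anchor itself
        neighbours = sorted((j for j in range(len(feats)) if j != i),
                            key=lambda j: row[j], reverse=True)[:k]
        # softmax over neighbour similarities gives the attention coefficients
        logits = [row[j] / temperature for j in neighbours]
        top = max(logits)
        exps = [math.exp(z - top) for z in logits]
        weights = [e / sum(exps) for e in exps]
        aggregated = [sum(w * feats[j][d] for w, j in zip(weights, neighbours))
                      for d in range(len(feats[i]))]
        enhanced.append([o + alpha * g for o, g in zip(feats[i], aggregated)])
    return enhanced

# Toy features: samples 0-1 share one concept, samples 2-3 another.
image_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
text_feats = [[1.0, 0.1], [0.8, 0.2], [0.1, 1.0], [0.0, 0.8]]

original_sim = cosine_matrix(image_feats, text_feats)  # image-to-text matrix
enhanced_sim = cosine_matrix(neighborhood_enhance(image_feats),
                             neighborhood_enhance(text_feats))
```

Comparing `original_sim` with `enhanced_sim` for the same pair is the kind of before/after quantity that a consistency constraint could penalize; the exact loss used by the invention is not reproduced in this text.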

Inventors

Jin Fengfei
TIAN CHENGLIN
HUANG CHEN
ZHANG HUAXIANG
LIU DONGMEI
LIU LI

Assignees

Shandong Normal University (山东师范大学)

Dates

Publication Date: 2026-05-05
Application Date: 2026-01-26

Claims (10)

1. A noise-robust cross-modal retrieval method based on neighborhood enhancement and confidence guidance, characterized by comprising the following steps: acquiring original cross-modal features of an image sample and a text sample; calculating, from the obtained original cross-modal features, an original similarity matrix and a homomodal similarity matrix between samples; performing attention-based weighted aggregation on the neighborhood features of the obtained homomodal similarity matrix to generate neighborhood-enhanced features; applying a semantic consistency regularization constraint to the obtained original cross-modal feature matching error and neighborhood-enhanced feature matching error; and adopting a Bayesian Gaussian mixture model and cross-network confidence fusion, combined with soft-label supervision, to complete noise-robust cross-modal retrieval based on neighborhood enhancement and confidence guidance.
2. The noise-robust cross-modal retrieval method based on neighborhood enhancement and confidence guidance as claimed in claim 1, wherein the method adopts a dual-branch structure in which an image encoder and a text encoder respectively extract the original feature representations of the image and the text; the extracted original feature representations are projected into a shared embedding space to obtain the similarity between image-text pairs, the similarity function having its own parameters.
3. The noise-robust cross-modal retrieval method based on neighborhood enhancement and confidence guidance as claimed in claim 1, wherein, in generating the neighborhood-enhanced features, a similarity matrix between an anchor and the other instances of the same modality is calculated, and the k most similar neighbor instances are taken from the obtained similarity matrix to construct the intra-modal neighbor set.
4. The noise-robust cross-modal retrieval method based on neighborhood enhancement and confidence guidance as claimed in claim 3, wherein an attention coefficient between the anchor sample and each of its neighborhood samples is calculated; the neighborhood features are weighted and summed using the obtained attention coefficients and fused with the original feature to obtain a structure-aware neighborhood-enhanced feature, the fusion combining the original feature with the fused neighborhood feature through a learnable residual scaling parameter.
5. The noise-robust cross-modal retrieval method based on neighborhood enhancement and confidence guidance as claimed in claim 1, wherein the soft-label supervision comprises supervising clean samples with their real labels and supervising noisy samples and mixed samples with labels generated from the cross-network average confidence, so that the supervision intensity adapts to sample quality.
6. The noise-robust cross-modal retrieval method based on neighborhood enhancement and confidence guidance as claimed in claim 1, wherein the semantic consistency regularization constraint constrains the change in similarity of a sample before and after enhancement, and the loss function adopted is defined in terms of the computed soft label.
7. A noise-robust cross-modal retrieval system based on neighborhood enhancement and confidence guidance, comprising: an acquisition module configured to acquire original cross-modal features of the image sample and the text sample; a computing module configured to compute an original similarity matrix and a homomodal similarity matrix between samples from the obtained original cross-modal features; an aggregation module configured to perform attention-based weighted aggregation on the neighborhood features of the obtained homomodal similarity matrix and generate neighborhood-enhanced features; and a retrieval module configured to apply a semantic consistency regularization constraint to the obtained original cross-modal feature matching error and neighborhood-enhanced feature matching error, and to complete noise-robust cross-modal retrieval based on neighborhood enhancement and confidence guidance by adopting a Bayesian Gaussian mixture model and cross-network confidence fusion combined with soft-label supervision.
8. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the noise-robust cross-modal retrieval method based on neighborhood enhancement and confidence guidance of any one of claims 1-6.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the noise-robust cross-modal retrieval method based on neighborhood enhancement and confidence guidance of any one of claims 1-6.
10. A computer program product comprising software code, wherein a program in the software code performs the steps of the noise-robust cross-modal retrieval method based on neighborhood enhancement and confidence guidance of any one of claims 1-6.
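The soft-label supervision of claim 5 can be sketched as follows, assuming each of two networks already supplies a per-sample probability that the pair is clean (for instance from a mixture model fitted on losses). The thresholds `clean_thr` and `noisy_thr`, the three-way split rule, and the function names are hypothetical choices for illustration; the claims do not give their values or exact form.

```python
def fuse_confidence(p_net_a, p_net_b):
    """Average the per-sample clean probabilities produced by two networks."""
    return [(a + b) / 2.0 for a, b in zip(p_net_a, p_net_b)]

def partition_and_label(confidence, clean_thr=0.8, noisy_thr=0.2):
    """Split samples into clean / mixed / noisy and assign soft labels.

    Clean samples keep the real label (1.0 for the annotated pairing);
    mixed and noisy samples use the fused confidence as their label.
    The thresholds are illustrative assumptions, not values from the source.
    """
    groups, soft_labels = [], []
    for c in confidence:
        if c >= clean_thr:
            groups.append("clean")
            soft_labels.append(1.0)
        else:
            groups.append("noisy" if c <= noisy_thr else "mixed")
            soft_labels.append(c)
    return groups, soft_labels

# Clean probabilities for three image-text pairs from two networks.
fused = fuse_confidence([1.0, 0.5, 0.0], [0.75, 0.25, 0.25])
groups, soft_labels = partition_and_label(fused)
print(groups)       # ['clean', 'mixed', 'noisy']
print(soft_labels)  # [1.0, 0.375, 0.125]
```

Because the label of a non-clean pair scales with the fused confidence, a doubtful pair still contributes a weak training signal instead of being discarded outright.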

Description

Noise robust cross-modal retrieval method and system based on neighborhood enhancement and confidence guidance

Technical Field

The invention belongs to the technical field of multi-modal information retrieval, and particularly relates to a noise-robust cross-modal retrieval method and system based on neighborhood enhancement and confidence guidance.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

Cross-modal retrieval (CMR) is a key bridge connecting data of different modalities. It aims to establish a unified semantic representation space across modalities such as images, text and audio, so that data of one modality can be used as a query to retrieve semantically related content in another modality, and it plays an important role in many application scenarios. With the rapid development of internet technology, cross-modal retrieval has become a key method in fields such as e-commerce and social media. However, the rapid growth of multimodal data exacerbates the core challenges of cross-modal retrieval: how to bridge the heterogeneity gap between modalities and how to quantify cross-modal semantic similarity so as to achieve efficient alignment have not yet been adequately addressed. Traditional cross-modal retrieval maps images and texts into a unified semantic space, mainly by constructing a visual encoder and a text encoder, so that positive pairs lie closer together in the shared space and negative pairs lie farther apart. Such methods generally depend on correctly matched image-text data during training and assume by default that the correspondence of every training pair is correct.
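One widely used objective of the conventional kind described above is a bidirectional hinge-based triplet ranking loss with hardest negatives. The sketch below is offered only as an illustration of "positives closer, negatives farther"; the source does not name a specific loss, and the margin value and the function name `triplet_ranking_loss` are assumptions.

```python
def triplet_ranking_loss(sim, margin=0.2):
    """Bidirectional hinge loss over a square image-text similarity matrix.

    sim[i][j] is the similarity of image i and text j. Pair (i, i) is treated
    as a correct match, which is exactly the assumption that noisy
    correspondence violates. Requires at least two pairs.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # hardest negative text for image i, hardest negative image for text i
        hardest_text = max(sim[i][j] for j in range(n) if j != i)
        hardest_image = max(sim[j][i] for j in range(n) if j != i)
        total += max(0.0, margin + hardest_text - pos)
        total += max(0.0, margin + hardest_image - pos)
    return total / n

# Correctly paired data: the diagonal dominates, so the loss vanishes.
matched = [[0.9, 0.1],
           [0.2, 0.8]]
# Swapped captions: the loss now pushes the truly related pairs apart.
mismatched = [[0.1, 0.9],
              [0.8, 0.2]]
loss_matched = triplet_ranking_loss(matched)
loss_mismatched = triplet_ranking_loss(mismatched)
```

On the swapped example the loss is large even though each image has a semantically matching caption in the batch, which shows how a wrong pairing turns the training signal against the correct alignment.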
Owing to the automated nature of web-crawled data and the inconsistency of manual annotation, noisy correspondence (NC) is ubiquitous in the construction of large-scale cross-modal datasets. On such real-world large-scale web data, conventional methods often fail to identify mismatched pairs and lack the ability to maintain semantic consistency under unreliable annotation. To address noisy correspondence, unreliable samples are typically identified by exploiting the memorization effect of deep neural networks (DNNs) and modeling the overall loss distribution, but these approaches still have significant limitations. First, strategies based on loss modeling generally force samples into two categories, clean and noisy, and cannot flexibly handle intermediate samples that fall between the two, so samples with small semantic deviations that still have learning value are easily misjudged. Second, most methods attend only to the overall loss variation of a sample and do not exploit the local semantic consistency information contained in the unimodal neighborhood structure; when noise disrupts the local feature distribution, the model has difficulty maintaining stable semantic alignment from a structural point of view. Finally, some methods secure training stability by simply rejecting or down-weighting noisy samples, but such a crude approach often discards effective supervisory signals, leaving the model unable to maintain sufficient semantic integrity in a high-noise environment. Existing noise-handling methods therefore struggle to achieve both robustness and effectiveness, and a finer-grained solution is needed.
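The loss-modeling strategy described above can be sketched by fitting a two-component one-dimensional Gaussian mixture to per-sample losses and reading off the posterior of the low-loss component as the probability that a pair is clean. This is a plain maximum-likelihood EM stand-in written for illustration, not the Bayesian Gaussian mixture model of the invention; the initialization, iteration count, variance floor, and the name `fit_two_gaussians` are all assumptions.

```python
import math

def fit_two_gaussians(losses, iters=50):
    """EM for a two-component 1-D Gaussian mixture; returns P(clean | loss)."""
    lo, hi = min(losses), max(losses)
    means = [lo, hi]  # start one component at each end of the loss range
    variances = [((hi - lo) / 2.0) ** 2 + 1e-6] * 2
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each loss
        resp = []
        for x in losses:
            dens = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300  # guard against numerical underflow
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances
        for c in range(2):
            n_c = sum(r[c] for r in resp) or 1e-12
            weights[c] = n_c / len(losses)
            means[c] = sum(r[c] * x for r, x in zip(resp, losses)) / n_c
            variances[c] = sum(r[c] * (x - means[c]) ** 2
                               for r, x in zip(resp, losses)) / n_c + 1e-6
    clean = 0 if means[0] <= means[1] else 1  # low-mean component is "clean"
    return [r[clean] for r in resp]

# Five low-loss (likely clean) and five high-loss (likely mismatched) pairs.
losses = [0.05, 0.10, 0.15, 0.20, 0.12, 0.80, 0.90, 1.00, 0.85, 0.95]
p_clean = fit_two_gaussians(losses)
is_clean = [p > 0.5 for p in p_clean]  # the hard two-way split criticized above
```

Thresholding `p_clean` at 0.5 forces every sample into one of two bins; a pair whose loss sits between the two clusters gets an arbitrary hard label, which is the first limitation raised in the text.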
In summary, noise-robust cross-modal retrieval currently faces at least the following problems:

(1) Existing noise identification strategies depend too heavily on sample-level loss statistics and lack auxiliary judgment based on local semantic structure, so the model struggles to accurately distinguish clean samples, noisy samples, and the ambiguous samples that lie between them.

(2) Existing methods still make only crude use of noisy samples: they lack a dynamic, adaptive mechanism for regulating supervision and have difficulty preserving complete semantic information under high-noise conditions.

Disclosure of Invention

To solve the above problems, the invention provides a noise-robust cross-modal retrieval method and system based on neighborhood enhancement and confidence guidance. It offers a robust approach to cross-modal retrieval, improves the detection effect by exploiting the complementary characteristics of the modalities, significantly reduces the impact of noisy annotation on retrieval performance, and offers high robustness and good application prospects.

According to some embodiments, a first aspect of the invention provides a noise-robust cross-modal retrieval method based on neighborhood enhancement and confidence guidance, which adopts the following technical solution:

A noise-robust cross-modal retrieval method based on neighborhood enhancement and confidence guidance comprises the following steps: Acquiring original cross-modal characte