CN-121980052-A - Noise reconstruction text retrieval method based on multi-modal large language model
Abstract
The invention relates to the technical field of cross-modal retrieval, and in particular to a noise reconstruction text retrieval method based on a multi-modal large language model. The method maps a training data set into a shared representation space, computes similarities to screen out a noisy sample set, reconstructs the semantics of the noisy images with the multi-modal large language model, and replaces the original samples to generate a refined data set. It then screens high-confidence samples with a Gaussian mixture model to construct a clean sample set and a knowledge base, builds a pseudo classifier on the knowledge base to convert the matching task into a classification task, and computes a pseudo classification loss and an entropy regularization term; finally, it constructs a complete optimization objective function for iterative training of the model.
Inventors
- Li Mingyong
- Wang Yukai
- Cui Shaoguo
- Yan Dengwei
- Ma Wantian
Assignees
- Chongqing Normal University (重庆师范大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-23
Claims (9)
- 1. A noise reconstruction text retrieval method based on a multi-modal large language model, characterized by comprising the following steps: S1, acquiring a training data set containing image-text pairs, extracting feature vectors of the training data set with an image encoder and a text encoder, and mapping the feature vectors into a shared representation space; S2, calculating the cosine similarity of the image-text pairs in the shared representation space, judging the image-text pairs whose similarity is smaller than a first preset threshold to be noisy samples, and constructing a noisy sample set; S3, performing semantic reconstruction on the images in the noisy sample set with the multi-modal large language model to generate reconstructed text descriptions, and replacing the original noisy samples with the reconstructed samples to generate a refined data set; S4, modeling the sample loss distribution of the training data set with a Gaussian mixture model, and screening the samples whose posterior probability is greater than a second preset threshold to construct a clean sample set and a knowledge base; S5, constructing a pseudo classifier based on the knowledge base, converting the image-text matching task into a classification task, calculating pseudo prediction results and pseudo labels for the clean sample set, and calculating a pseudo classification loss function and an entropy regularization term; S6, constructing, based on the refined data set, a complete optimization objective function comprising a symmetric cross entropy loss, the pseudo classification loss function and the entropy regularization term, and iteratively training the model by minimizing the complete optimization objective function.
- 2. The noise reconstruction text retrieval method based on a multi-modal large language model according to claim 1, wherein in step S1 the step of extracting feature vectors of the training data set with an image encoder and a text encoder comprises: deploying a vision-language pre-training architecture, instantiating the image encoder and the text encoder respectively by adopting a CLIP model as the backbone model for feature extraction; performing feature mapping transformation, mapping images into image feature vectors with the image encoder, and mapping texts into text feature vectors with the text encoder; and constructing a shared semantic space, applying an alignment constraint to the image feature vectors and the text feature vectors through a contrastive learning mechanism so that they establish a unified cross-modal representation in the same shared representation space.
- 3. The noise reconstruction text retrieval method based on a multi-modal large language model according to claim 1, wherein in step S2 the step of judging the image-text pairs whose similarity is smaller than the first preset threshold to be noisy samples comprises: generating a cosine similarity matrix, performing dot-product operations on the image-text sample pairs in the shared representation space under a normalized-length constraint to obtain a matrix representing the degree of semantic association; scaling and amplifying the cosine similarities with the temperature parameter of the pre-trained model to separate matching signals and non-matching signals into different numerical ranges; and performing noise threshold screening, setting the first preset threshold, traversing the diagonal elements of the cosine similarity matrix, identifying the sample pairs whose similarity scores are lower than the first preset threshold as noisy correspondences, and constructing the noisy sample set.
- 4. The noise reconstruction text retrieval method based on a multi-modal large language model according to claim 1, wherein in step S3 the step of performing semantic reconstruction on the images in the noisy sample set with the multi-modal large language model to generate reconstructed text descriptions comprises: configuring a visual semantic reconstructor, deploying the multi-modal large language model to establish a generation channel from a corrupted visual signal to a high-quality text description; designing structured prompts, constructing diversified prompt templates covering object, action and scene perspectives to guide the multi-modal large language model to mine the deep semantic information in an image; and performing adaptive prompt selection, introducing an adaptive prompt selection module that dynamically adjusts the prompting policy according to the number of noisy texts associated with the image, and generating a reconstructed text description consistent with the image content.
- 5. The noise reconstruction text retrieval method based on a multi-modal large language model according to claim 1, wherein in step S3 the step of replacing the original noisy samples with the reconstructed samples to generate a refined data set comprises: constructing a reconstructed sample batch, pairing the images in the noisy sample set with their reconstructed text descriptions to build a reconstructed noise data set; performing a sample-set replacement operation, locating the original noisy correspondences in the original training data set and physically replacing them with samples from the reconstructed noise data set; and synthesizing refined training data, merging the replaced reconstructed data with the non-noisy data of the original data set to generate the refined data set containing effective supervision signals.
- 6. The noise reconstruction text retrieval method based on a multi-modal large language model according to claim 1, wherein in step S4 the step of modeling the sample loss distribution of the training data set with a Gaussian mixture model and screening the samples whose posterior probability is greater than the second preset threshold comprises: extracting per-sample loss characteristics, collecting the per-sample loss values of the training data set, and identifying the data distribution region that the deep neural network fits preferentially; building a two-component probability model, fitting the sample loss distribution with a two-component Gaussian mixture model, and constructing probability density functions representing the noise component and the clean component; calculating the posterior probability of a clean sample, computing for each sample, based on Bayes' theorem, the posterior probability of belonging to the Gaussian component with the lower mean; and applying strict threshold filtering, setting the second preset threshold, and selecting the samples whose posterior probability is greater than or equal to the second preset threshold to construct the clean sample set.
- 7. The noise reconstruction text retrieval method based on a multi-modal large language model according to claim 1, wherein in step S4 the step of constructing a clean sample set and a knowledge base comprises: initializing a retrieval mechanism, establishing a knowledge base container for storing reference information, and configuring a retrieval strategy for the clean sample set; performing bidirectional search matching, traversing the training sample pairs, and retrieving from the clean sample set the image-text pair with the highest cosine similarity to the current image and the image-text pair with the highest cosine similarity to the current text; and generating an evaluation item set, storing the retrieved high-confidence image-text pairs in the knowledge base as evaluation items to form a knowledge base batch containing image reference information and text reference information.
- 8. The noise reconstruction text retrieval method based on a multi-modal large language model according to claim 1, wherein in step S5 the step of constructing a pseudo classifier based on the knowledge base, converting the image-text matching task into a classification task, and calculating pseudo prediction results and pseudo labels for the clean sample set comprises: defining classification semantic labels, recasting the image-text alignment task as a K-class classification problem, and constructing a semantic class space from the clean text descriptions in the knowledge base; generating soft probability distributions, processing the clean sample set with the pseudo classifier to output an image pseudo-prediction result and a text pseudo-prediction result, each in the form of a probability distribution; and extracting a hard supervision signal, performing an argmax operation on the text pseudo-prediction result to determine hard pseudo labels that push the model toward deterministic predictions.
- 9. The noise reconstruction text retrieval method based on a multi-modal large language model according to claim 1, wherein in step S6 the step of constructing, based on the refined data set, the complete optimization objective function comprising the symmetric cross entropy loss, the pseudo classification loss function and the entropy regularization term comprises: calculating the symmetric cross entropy loss, computing the bidirectional image-to-text and text-to-image matching losses on the refined data set to strengthen feature alignment; calculating the pseudo classification loss, computing a standard cross entropy loss between the pseudo labels and the image pseudo-prediction result among the pseudo prediction results to construct the pseudo classification loss function; calculating the entropy regularization term, taking the expectation over the logarithmic probability distribution of the image pseudo-prediction result to construct the entropy regularization term and prevent classifier collapse; and synthesizing the complete optimization objective, constructing the complete optimization objective function as a linearly weighted combination of the symmetric cross entropy loss, the pseudo classification loss function and the entropy regularization term.
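The similarity screening of claims 2 and 3 can be sketched as a short, self-contained NumPy example. The temperature and threshold values here are illustrative stand-ins for CLIP's learned temperature and the patent's unspecified "first preset threshold":

```python
import numpy as np

def screen_noisy_pairs(img_z, txt_z, temperature=0.07, threshold=0.3):
    """Sketch of claims 2-3: cosine similarity of L2-normalized embeddings
    via dot products, temperature scaling, and screening of the diagonal
    (paired) entries below the first preset threshold as noisy."""
    sim = img_z @ txt_z.T            # cosine similarity matrix
    scaled = sim / temperature       # scaling separates matching / non-matching signals
    paired_sim = np.diag(sim)        # similarity of each (image, text) pair
    noisy_idx = np.where(paired_sim < threshold)[0]
    return scaled, noisy_idx

# Toy normalized embeddings: pair 0 matches, pair 1 does not.
img_z = np.array([[1.0, 0.0], [0.0, 1.0]])
txt_z = np.array([[1.0, 0.0], [1.0, 0.0]])
_, noisy = screen_noisy_pairs(img_z, txt_z)
```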
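The structured prompting and adaptive prompt selection of claim 4 might look as follows. The templates and the selection rule are purely illustrative (the patent leaves the exact policy unspecified), and `query_mllm` is a placeholder for a call to a real multi-modal large language model:

```python
# Hypothetical prompt templates covering object, action and scene perspectives.
PROMPT_TEMPLATES = {
    "object": "List the salient objects in this image and describe them.",
    "action": "Describe the main action taking place in this image.",
    "scene":  "Describe the overall scene and setting of this image.",
}

def select_prompt(num_noisy_texts):
    """Adaptive prompt selection: the more noisy captions are attached to an
    image, the broader the requested perspective (one plausible policy)."""
    if num_noisy_texts <= 1:
        return PROMPT_TEMPLATES["object"]
    if num_noisy_texts == 2:
        return PROMPT_TEMPLATES["action"]
    return PROMPT_TEMPLATES["scene"]

def reconstruct_caption(image, num_noisy_texts, query_mllm):
    """Generate a reconstructed text description for a noisy image by
    querying the MLLM with the adaptively selected prompt."""
    return query_mllm(image, select_prompt(num_noisy_texts))
```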
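The sample replacement of claim 5 reduces to an in-place substitution followed by a union with the untouched clean pairs; a minimal sketch, with the data set represented as (image, text) tuples:

```python
def build_refined_dataset(dataset, noisy_idx, reconstructed_texts):
    """Sketch of claim 5: pair each noisy image with its reconstructed
    description, replace the original noisy correspondence in place, and
    return the union of repaired and untouched pairs as the refined set."""
    refined = list(dataset)                      # copy: original set untouched
    for i, new_text in zip(noisy_idx, reconstructed_texts):
        image, _ = refined[i]
        refined[i] = (image, new_text)           # replace the noisy correspondence
    return refined

data = [("img0", "a dog on grass"), ("img1", "stock photo #123")]
refined = build_refined_dataset(data, [1], ["a red car parked on a street"])
```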
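The loss modeling of claim 6 can be sketched with a from-scratch two-component EM fit on one-dimensional per-sample losses (in practice a library implementation such as `sklearn.mixture.GaussianMixture` would typically be used; the 0.5 default for the second preset threshold is illustrative):

```python
import numpy as np

def fit_two_component_gmm(losses, iters=50):
    """Minimal 1-D EM fit of a two-component Gaussian mixture to the
    per-sample loss values (sketch of claim 6)."""
    losses = np.asarray(losses, dtype=float)
    mu = np.array([losses.min(), losses.max()])
    sigma = np.array([losses.std() + 1e-6] * 2)
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each loss value
        dens = pi * np.exp(-0.5 * ((losses[:, None] - mu) / sigma) ** 2) \
               / (sigma * np.sqrt(2 * np.pi))
        resp = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)
        # M-step: update means, variances and mixing weights
        nk = resp.sum(axis=0)
        mu = (resp * losses[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (losses[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
        pi = nk / len(losses)
    return mu, resp

def clean_sample_mask(losses, threshold=0.5):
    """Posterior probability of the lower-mean ('clean') component; keep the
    samples whose posterior meets the second preset threshold."""
    mu, resp = fit_two_component_gmm(losses)
    posterior = resp[:, int(np.argmin(mu))]
    return posterior >= threshold

rng = np.random.default_rng(0)
losses = np.concatenate([rng.normal(0.10, 0.02, 20),   # clean, low-loss samples
                         rng.normal(2.00, 0.10, 20)])  # noisy, high-loss samples
mask = clean_sample_mask(losses)
```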
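The bidirectional search of claim 7 is a nearest-neighbour lookup over the clean set in both directions; a sketch assuming L2-normalized embeddings so that a dot product gives cosine similarity:

```python
import numpy as np

def build_knowledge_base(img_z, txt_z, clean_img_z, clean_txt_z):
    """Sketch of claim 7: for every training pair, retrieve the index of the
    clean pair whose image embedding is most similar to the current image,
    and the clean pair whose text embedding is most similar to the current
    text (bidirectional search, cosine similarity via dot products)."""
    img_ref = (img_z @ clean_img_z.T).argmax(axis=1)   # image-side reference
    txt_ref = (txt_z @ clean_txt_z.T).argmax(axis=1)   # text-side reference
    return img_ref, txt_ref

# Toy clean set with two orthogonal reference directions.
clean_img = np.eye(2)
clean_txt = np.eye(2)
img_q = np.array([[0.9, 0.1], [0.2, 0.8]])
txt_q = np.array([[0.1, 0.9], [0.7, 0.3]])
img_ref, txt_ref = build_knowledge_base(img_q, txt_q, clean_img, clean_txt)
```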
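Claims 8 and 9 can be sketched together: the K clean knowledge-base texts act as class prototypes, turning matching into K-way classification, and the last two terms of the objective follow from the resulting distributions. The sign convention of the entropy term is one plausible reading of claim 9 (chosen so that, under minimization, collapse onto a single confident class is discouraged), and the temperature is illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def pseudo_classify(img_z, txt_z, class_txt_z, temperature=0.07):
    """Sketch of claim 8: similarities to the K clean text prototypes give
    soft image/text pseudo-predictions; hard pseudo labels come from an
    argmax over the text-side prediction."""
    p_img = softmax(img_z @ class_txt_z.T / temperature)
    p_txt = softmax(txt_z @ class_txt_z.T / temperature)
    return p_img, p_txt, p_txt.argmax(axis=1)

def pseudo_losses(p_img, pseudo_labels):
    """Sketch of claim 9's last two terms: cross entropy of the image
    pseudo-prediction against the hard pseudo labels, plus an entropy
    regularizer built from the expectation of the log-probabilities."""
    eps = 1e-12
    n = len(pseudo_labels)
    pce = -np.log(p_img[np.arange(n), pseudo_labels] + eps).mean()
    ent_reg = (p_img * np.log(p_img + eps)).sum(axis=1).mean()
    return pce, ent_reg

# Toy run: three orthonormal samples classified against themselves.
I = np.eye(3)
p_img, p_txt, labels = pseudo_classify(I, I, I)
pce, ent = pseudo_losses(p_img, labels)
```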
Description
Noise reconstruction text retrieval method based on multi-modal large language model

Technical Field

The invention relates to the technical field of cross-modal retrieval, and in particular to a noise reconstruction text retrieval method based on a multi-modal large language model.

Background

Cross-modal matching aims to establish semantic alignment between heterogeneous modalities such as images and texts, so as to realize collaborative understanding and a unified representation of multi-modal information. In recent years, with breakthroughs in multi-modal large language models (MLLMs) and diffusion-based generation techniques, accurately aligning visual and linguistic features in a shared semantic space has greatly improved the performance of downstream tasks such as image generation, cross-modal retrieval and zero-shot learning. To meet the demands of large-scale model training, data construction has shifted comprehensively from costly manual annotation to large-scale open-source data sets crawled from the web. However, large-scale web data inevitably introduces the noisy-correspondence problem of mismatched image-text semantics. Existing noise-correction or re-weighting methods often fail in high-noise environments, where noisy and clean samples are difficult to distinguish accurately, so model performance fluctuates strongly as the noise ratio increases and robustness and generalization degrade markedly.

Disclosure of Invention

To remedy these shortcomings, the invention provides a noise reconstruction text retrieval method based on a multi-modal large language model, aiming to solve the prior-art problem that noisy and clean samples are difficult to distinguish accurately in high-noise environments.
The invention provides a noise reconstruction text retrieval method based on a multi-modal large language model, which comprises the following steps: S1, acquiring a training data set containing image-text pairs, extracting feature vectors of the training data set with an image encoder and a text encoder, and mapping the feature vectors into a shared representation space; S2, calculating the cosine similarity of the image-text pairs in the shared representation space, judging the image-text pairs whose similarity is smaller than a first preset threshold to be noisy samples, and constructing a noisy sample set; S3, performing semantic reconstruction on the images in the noisy sample set with the multi-modal large language model to generate reconstructed text descriptions, and replacing the original noisy samples with the reconstructed samples to generate a refined data set; S4, modeling the sample loss distribution of the training data set with a Gaussian mixture model, and screening the samples whose posterior probability is greater than a second preset threshold to construct a clean sample set and a knowledge base; S5, constructing a pseudo classifier based on the knowledge base, converting the image-text matching task into a classification task, calculating pseudo prediction results and pseudo labels for the clean sample set, and calculating a pseudo classification loss function and an entropy regularization term; S6, constructing, based on the refined data set, a complete optimization objective function comprising a symmetric cross entropy loss, the pseudo classification loss function and the entropy regularization term, and iteratively training the model by minimizing the complete optimization objective function.
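The S1-S6 pipeline above can be sketched as a single orchestration function. Every callable name and the 0.3 screening threshold are illustrative; the patent fixes neither an API nor the threshold values:

```python
import numpy as np

def noise_robust_training_round(dataset, encode, reconstruct, fit_gmm_mask,
                                build_kb, train_step):
    """One round of the S1-S6 pipeline, with each stage injected as a
    callable. Returns the refined data set used for the training step."""
    pairs_z = [encode(img, txt) for img, txt in dataset]             # S1: shared space
    sims = [float(iz @ tz) for iz, tz in pairs_z]                    # S2: cosine similarity
    noisy = {i for i, s in enumerate(sims) if s < 0.3}               # S2: first threshold
    refined = [(img, reconstruct(img)) if i in noisy else (img, txt) # S3: MLLM rewrite
               for i, (img, txt) in enumerate(dataset)]
    clean_mask = fit_gmm_mask(refined)                               # S4: GMM screening
    kb = build_kb(refined, clean_mask)                               # S4: knowledge base
    train_step(refined, kb)                                          # S5-S6: optimization
    return refined

# Toy run with stub stages: pair 1 is mismatched and gets reconstructed.
data = [("img0", "good caption"), ("img1", "mismatched caption")]
encode = lambda img, txt: (np.array([1.0]),
                           np.array([1.0]) if txt.startswith("good") else np.array([0.0]))
refined = noise_robust_training_round(
    data, encode,
    reconstruct=lambda img: "reconstructed caption",
    fit_gmm_mask=lambda d: [True] * len(d),
    build_kb=lambda d, m: [p for p, keep in zip(d, m) if keep],
    train_step=lambda d, kb: None,
)
```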
Preferably, in step S1, the step of extracting the feature vectors of the training data set with an image encoder and a text encoder includes: deploying a vision-language pre-training architecture, instantiating the image encoder and the text encoder respectively by adopting a CLIP model as the backbone model for feature extraction; performing feature mapping transformation, mapping images into image feature vectors with the image encoder, and mapping texts into text feature vectors with the text encoder; and constructing a shared semantic space, applying an alignment constraint to the image feature vectors and the text feature vectors through a contrastive learning mechanism so that they establish a unified cross-modal representation in the same shared representation space.

Preferably, in step S2, the step of judging the image-text pairs whose similarity is smaller than the first preset threshold to be noisy samples includes: generating a cosine similarity matrix, performing dot-product operations on the image-text sample pairs in the shared representation space under a normalized-length constraint to obtain a matrix representing the degree of semantic association; scaling and amplifying the cosine similarities with the temperature parameter of the pre-trained model to separate matching signals and non-matching signals into different numerical ranges
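The feature mapping of step S1 can be sketched as follows; the projection matrix `W` is a stand-in for the learned projection head of the CLIP backbone, which the patent does not detail:

```python
import numpy as np

def to_shared_space(feats, W):
    """Project encoder outputs into the shared representation space and
    L2-normalize them, so the cosine similarity of step S2 reduces to a
    plain dot product (W stands in for a learned projection head)."""
    z = feats @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

rng = np.random.default_rng(0)
img_feats = rng.normal(size=(4, 8))    # stand-in for image-encoder features
txt_feats = rng.normal(size=(4, 8))    # stand-in for text-encoder features
W = rng.normal(size=(8, 4))            # hypothetical projection head
img_z = to_shared_space(img_feats, W)
txt_z = to_shared_space(txt_feats, W)
```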