CN-121980013-A - Image-text retrieval method based on prompt learning and related equipment
Abstract
The application relates to the technical field of image-text retrieval, and provides an image-text retrieval method based on prompt learning and related equipment. The method comprises the steps of: generating prompt vectors corresponding to all description texts, together with image feature vectors of the description texts; generating, based on the image feature vector corresponding to each description text, a pixel-level visual prompt of the target image corresponding to that description text; constructing a first loss function based on the prompt vectors and obtaining final prompt vectors by using the first loss function; constructing a second loss function based on all pixel-level visual prompts and using the second loss function to obtain final pixel-level visual prompts; when an image to be retrieved is received, generating a visual prompt to be retrieved for that image and retrieving the retrieval text based on the visual prompt to be retrieved and the final prompt vectors; and, when a text to be retrieved is received, retrieving the retrieval image based on the prompt vectors and all final image feature vectors. The method of the application can improve the accuracy of image-text retrieval.
Inventors
- CAO WENZHI
- Sheng Risen
- YANG JUNFENG
- LIU LIMEI
- YU HAIHANG
- LI ZIHAO
Assignees
- Hunan University of Technology and Business (湖南工商大学)
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2026-04-07
Claims (10)
- 1. An image-text retrieval method based on prompt learning, characterized by comprising the following steps: generating prompt vectors corresponding to all description texts in a retrieval database, and generating an image feature vector corresponding to each description text by using a diffusion model, wherein the prompt vectors are used for expressing semantic information of all the description texts, and the retrieval database stores a plurality of images and the description text corresponding to each image; generating, for each description text, a pixel-level visual prompt of the image corresponding to the description text based on the image feature vector corresponding to the description text, wherein the pixel-level visual prompt is used for describing image information of the image; constructing a first loss function based on the prompt vectors and training the prompt vectors with the first loss function to obtain final prompt vectors; constructing a second loss function based on all pixel-level visual prompts and training all pixel-level visual prompts with the second loss function to obtain the final pixel-level visual prompt of each image; when an image to be retrieved is received, generating a visual prompt to be retrieved for the image to be retrieved, and retrieving, from all description texts in the retrieval database, a retrieval text similar to the image to be retrieved based on the visual prompt to be retrieved and the final prompt vectors; and, when a text to be retrieved is received, retrieving, from all images in the retrieval database, a retrieval image similar to the text to be retrieved based on the prompt vectors and all final pixel-level visual prompts.
- 2. The image-text retrieval method according to claim 1, wherein generating the pixel-level visual prompt of the image corresponding to the description text based on the image feature vector corresponding to the description text comprises: obtaining the pixel-level visual prompt of the i-th image by a formula in which the quantities denote, respectively: the feature vector of the i-th image; the decoder; the parameters of the decoder; the number of images; a prompt mask; and the pixel-level visual prompt of the i-th image.
- 3. The image-text retrieval method according to claim 1, wherein constructing the first loss function based on the prompt vectors comprises: acquiring a plurality of retrieval category codes; splicing each retrieval category code with the prompt vector to obtain a category prompt vector corresponding to each retrieval category code; and constructing the first loss function based on all category prompt vectors.
- 4. The image-text retrieval method according to claim 3, wherein, in the first loss function, the quantities denote, respectively: the value of the first loss function; the number of images used for training; the similarity calculation; the original score of the i-th image on its true category; the fused raw score; the temperature used for training stability; the number of retrieval category codes; the vector output by the text encoder when the category prompt vector of the i-th image is input; the category text vector obtained by inputting the text-side prompt template into the text encoder; the semantic priors of the image domain; hyperparameters to be learned; the global temperature; and the uncertainty-adaptive temperature.
- 5. The image-text retrieval method according to claim 2, wherein, in the second loss function, the quantities denote, respectively: the value of the second loss function; four preset weighting hyperparameters; the primary retrieval loss; the attention-consistency loss; the mask sparsity and compactness constraints; the gating loss; and the feature-level prompt regularization. In the corresponding sub-formulas, the quantities denote, respectively: the number of images used for training; the number of retrieval category codes; the original score of the i-th image on its true category; the logit score of the i-th image for the j-th candidate class; the temperature used for training stability; the cross-modal attention map; the soft mask derived from the attention map; the normalization operator; the visual feature prompt vector; the final mask generated from the pixel-level visual prompt; the identity matrix; a weight; the Kullback-Leibler divergence; the weight of the gating target; and the predicted weights output by the gating network, wherein the gating network outputs the two branch weights of a pixel prompt branch and a feature prompt branch, the remaining quantities denoting the text semantic vector, the feature vector of the original image, and a lightweight multi-layer perceptron.
- 6. The image-text retrieval method according to claim 1, wherein retrieving, based on the visual prompt to be retrieved and the final prompt vectors, a retrieval text similar to the image to be retrieved from all description texts in the retrieval database comprises: for each description text, splicing the final prompt vector with the description text to obtain a final description text, and encoding the final description text to obtain a final text semantic vector; superimposing the visual prompt to be retrieved onto the image to be retrieved to obtain a prompt image to be retrieved, and performing feature extraction on the prompt image to be retrieved to obtain the image feature to be retrieved; calculating the similarity between the image feature to be retrieved and each final text semantic vector; and taking the largest similarity as the final text similarity, and taking the description text corresponding to the final text similarity as the retrieval text.
- 7. The image-text retrieval method according to claim 1, wherein retrieving, based on the prompt vectors and all final pixel-level visual prompts, a retrieval image similar to the text to be retrieved from all images in the retrieval database comprises: for each image in the retrieval database, superimposing the final pixel-level visual prompt corresponding to the image onto the image to obtain a prompt image of the image, and performing feature extraction on the prompt image to obtain the final image feature vector of the image; splicing the prompt vector with the text to be retrieved to obtain a spliced text to be retrieved, and encoding the spliced text to be retrieved to obtain a text vector to be retrieved; calculating the similarity between the text vector to be retrieved and each final image feature vector; and taking the largest similarity as the final image similarity, and taking the image corresponding to the final image similarity as the retrieval image.
- 8. An image-text retrieval device based on prompt learning, characterized by comprising: a first generation module, configured to generate prompt vectors corresponding to all description texts in a retrieval database and to generate an image feature vector corresponding to each description text by using a diffusion model; a second generation module, configured to generate, for each description text, a pixel-level visual prompt of the image corresponding to the description text based on the image feature vector corresponding to the description text, wherein the pixel-level visual prompt is used for describing image information of the image; a construction module, configured to construct a first loss function based on the prompt vectors and train the prompt vectors with the first loss function to obtain final prompt vectors, and to construct a second loss function based on all pixel-level visual prompts and train all pixel-level visual prompts with the second loss function to obtain the final pixel-level visual prompts, wherein the first loss function is used for describing the accuracy of the prompt vectors and the second loss function is used for describing the accuracy of all pixel-level visual prompts; a first retrieval module, configured to generate, when an image to be retrieved is received, a visual prompt to be retrieved for the image to be retrieved, and to retrieve, from all description texts in the retrieval database, a retrieval text similar to the image to be retrieved based on the visual prompt to be retrieved and the final prompt vectors; and a second retrieval module, configured to retrieve, when a text to be retrieved is received, a retrieval image similar to the text to be retrieved from all images in the retrieval database based on the prompt vectors and all final pixel-level visual prompts.
- 9. A terminal device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the prompt-learning-based image-text retrieval method according to any one of claims 1 to 7.
- 10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the prompt-learning-based image-text retrieval method according to any one of claims 1 to 7.
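The retrieval steps of claims 6 and 7 reduce to concatenating the learned prompt with the query, extracting features, and ranking candidates by similarity. A minimal sketch follows; the toy vectors, the additive modeling of prompt superimposition, and all function names are assumptions for illustration, not the patent's implementation:

```python
import math
import random

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query, candidates):
    # Claims 6/7: compute the similarity of the query to every
    # candidate and return the index and value of the largest one.
    sims = [cosine(query, c) for c in candidates]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims[best]

random.seed(0)
dim = 8
# Stand-in for the image feature; superimposing the pixel-level
# visual prompt is modeled as simple addition (an assumption).
image_feat = [random.gauss(0, 1) for _ in range(dim)]
visual_prompt = [0.1 * random.gauss(0, 1) for _ in range(dim)]
prompted = [x + p for x, p in zip(image_feat, visual_prompt)]

# Final text semantic vectors; the last one matches the image.
text_feats = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
text_feats.append(list(prompted))

idx, sim = retrieve(prompted, text_feats)
print(idx)  # index of the retrieved description text
```

The same `retrieve` helper serves both directions: image-to-text retrieval (claim 6) queries with the prompted image feature against final text semantic vectors, while text-to-image retrieval (claim 7) queries with the prompt-spliced text vector against final image feature vectors.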
Description
Image-text retrieval method based on prompt learning and related equipment
Technical Field
The application relates to the technical field of image-text retrieval, in particular to an image-text retrieval method based on prompt learning and related equipment.
Background
In recent years, deep learning has developed vigorously, with computer vision and natural language processing advancing most rapidly. Visual-language pre-training models and pre-training techniques connect the two fields of computer vision and natural language processing for joint training, so that the visual modality and the text modality are projected into a unified representation space and aligned, thereby enabling multi-modal downstream tasks spanning vision and text. Currently, the most popular visual-language pre-training model is the image-text contrastive pre-training model (CLIP, Contrastive Language-Image Pre-training), which aims to align the image feature space with the text feature space, thus giving the model cross-modal generalization capability. The CLIP model has two mutually independent encoders: a text encoder and an image encoder. During training, text data and image data are respectively fed to the encoders and encoded into corresponding sets of vector representations, and the vector representations of each encoder are mapped into a shared embedding space through linear projection. A text-image rank matrix is constructed, and the image encoder and the text encoder are jointly trained to maximize the cosine similarity of the N matched image-text feature pairs in the dataset while minimizing the cosine similarity of the N²−N mismatched pairs, thereby achieving the goal of contrastive learning: the model learns to capture the similarity between images and texts, realizing the alignment and understanding of vision and language.
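The CLIP-style objective described above is commonly implemented as a symmetric cross-entropy over the N×N similarity matrix, with the N matched pairs on the diagonal and the N²−N mismatched pairs off it. The following minimal sketch (toy similarity values and a hypothetical `clip_loss` helper, not the patent's code) illustrates the mechanism:

```python
import math

def softmax_row(row):
    # Numerically stable softmax over one row of logits.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def clip_loss(sim, temperature=0.07):
    # Symmetric contrastive loss over an N x N similarity matrix:
    # diagonal entries are the N matched image-text pairs, the
    # off-diagonal N^2 - N entries are the mismatched pairs.
    n = len(sim)
    scaled = [[s / temperature for s in row] for row in sim]
    # image -> text direction: each image must pick its own text
    loss_i2t = -sum(math.log(softmax_row(scaled[i])[i]) for i in range(n)) / n
    # text -> image direction: transpose and repeat
    t = [[scaled[j][i] for j in range(n)] for i in range(n)]
    loss_t2i = -sum(math.log(softmax_row(t[i])[i]) for i in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy cosine-similarity matrix: matched pairs score highest.
aligned = [[0.9, 0.1, 0.0],
           [0.2, 0.8, 0.1],
           [0.0, 0.1, 0.7]]
uniform = [[0.5, 0.5, 0.5]] * 3
print(clip_loss(aligned) < clip_loss(uniform))  # aligned pairs give lower loss
```

With a uniform similarity matrix every softmax row is flat and the loss equals log N, whereas a well-aligned matrix drives the loss toward zero, which is exactly the pressure that pulls matched pairs together and pushes mismatched pairs apart.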
As visual-language pre-training models grow in size, the hardware, data, and cost of fine-tuning them also increase. In addition, the rich diversity of downstream tasks complicates the design of the pre-training and fine-tuning stages; prompt learning has been proposed to address these challenges. The pre-training-prompt paradigm converts downstream tasks into the same form as the pre-training tasks of the pre-trained model and uses a template to provide cues to the pre-trained model. Its core idea is to exploit the knowledge and representation capability already learned by the pre-trained model and to guide the model to learn and infer on downstream tasks through well-designed prompt information. However, this approach is limited by the single, manually designed text prompt template: for complex image-text data, a manually designed prompt template cannot fully capture all information and semantic associations, so the accuracy of image-text retrieval tasks is relatively low. Its main drawback is that, in a multi-modal scenario where modalities such as text and images are associated with each other, the prompt is added only to the text input, or the text-side prompt is optimized only through the image input; such single-modality learning limits the full exploitation of the pre-trained model's knowledge and thus affects retrieval accuracy. Therefore, existing image-text retrieval methods suffer from low retrieval accuracy.
Disclosure of Invention
The application provides an image-text retrieval method based on prompt learning and related equipment, which can solve the problem of low image-text retrieval accuracy.
In a first aspect, the present application provides an image-text retrieval method based on prompt learning, where the image-text retrieval method includes: generating prompt vectors corresponding to all the description texts in a retrieval database, and generating an image feature vector corresponding to each description text by using a diffusion model, wherein the prompt vectors are used for expressing semantic information of all the description texts, and the retrieval database stores a plurality of images and the description text corresponding to each image; generating, for each description text, a pixel-level visual prompt of the image corresponding to the description text based on the image feature vector corresponding to the description text; constructing a first loss function based on the prompt vectors and training the prompt vectors with the first loss function to obtain final prompt vectors; constructing a second loss function based on all pixel-level visual prompts and training all pixel-level visual prompts with the second loss function to obtain the final pixel-level visual prompt of each image; when an image to be retrieved is received, a visual prompt to be retrieved of the image to be retrieved is generated, and retrieval texts similar to the image to be retrieved are retrieved fr
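Claim 5 mentions a gating network that outputs two-way weights for the pixel prompt branch and the feature prompt branch from a lightweight multi-layer perceptron over the text semantic vector and the original image feature vector. A minimal sketch follows; the concatenation of the two inputs, the single hidden tanh layer, the softmax head, and all dimensions are assumptions of this sketch, not the patent's specification:

```python
import math
import random

def gating_weights(text_vec, image_vec, w1, b1, w2, b2):
    # Lightweight MLP gate: concatenate the text semantic vector and
    # the original image feature vector, pass through one hidden tanh
    # layer, then a 2-way softmax producing the weights of the pixel
    # prompt branch and the feature prompt branch.
    x = text_vec + image_vec  # list concatenation
    hidden = [math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    logits = [sum(wi * hi for wi, hi in zip(row, hidden)) + b
              for row, b in zip(w2, b2)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

random.seed(1)
d, h = 4, 6  # toy feature and hidden sizes (assumed)
w1 = [[random.uniform(-0.5, 0.5) for _ in range(2 * d)] for _ in range(h)]
b1 = [0.0] * h
w2 = [[random.uniform(-0.5, 0.5) for _ in range(h)] for _ in range(2)]
b2 = [0.0, 0.0]

text_vec = [random.gauss(0, 1) for _ in range(d)]
image_vec = [random.gauss(0, 1) for _ in range(d)]
w_pixel, w_feature = gating_weights(text_vec, image_vec, w1, b1, w2, b2)
print(round(w_pixel + w_feature, 6))  # the two branch weights sum to 1
```

Because the head is a softmax, the two branch weights always form a convex combination, which lets the model smoothly trade off how much the pixel-level prompt and the feature-level prompt each contribute for a given text-image pair.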