CN-121436092-B - Model training and image-text matching method and device, electronic equipment and storage medium
Abstract
The application discloses a model training and image-text matching method and device, electronic equipment, and a storage medium. The method comprises: obtaining an initial training data set comprising a plurality of first image-text pairs, each consisting of a picture and a first description text corresponding to the picture's content; generating a target training data set from the initial training data set, the target training data set comprising a plurality of groups of second image-text pairs, each group containing a forward image-text pair, a negative image-text pair, and an approximate image-text pair; and performing fine-tuning training on an initial image-text alignment model with the target training data set to obtain a target image-text alignment model. The application addresses the technical problem that related approaches, which enhance a model's understanding of negative semantics only by constructing negative image-text pairs, still leave the model with poor negative-semantic understanding.
Inventors
- HE ZHONGJIANG
- GAO XUEYI
- LI XUELONG
- LIU JIANG
- SUN HAO
Assignees
- 中电信人工智能科技(北京)有限公司 (China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd.)
Dates
- Publication Date: 2026-05-12
- Application Date: 2025-12-31
Claims (11)
- 1. A method of model training, comprising: acquiring an initial training data set, wherein the initial training data set comprises a plurality of first image-text pairs, and each first image-text pair comprises a picture and a first description text corresponding to the content of the picture; generating a target training data set according to the initial training data set, wherein the target training data set comprises a plurality of groups of second image-text pairs, each group of second image-text pairs comprises a forward image-text pair, a negative image-text pair and an approximate image-text pair, the forward image-text pair comprises a forward description text with positive semantics and a picture corresponding to the description content of the forward description text, the negative image-text pair comprises a negative description text with negative semantics and a picture corresponding to the description content of the negative description text, and the approximate image-text pair comprises an approximate description text between the positive semantics and the negative semantics and a picture corresponding to the description content of the approximate description text; and performing fine-tuning training on an initial image-text alignment model by adopting the target training data set to obtain a target image-text alignment model, wherein the target image-text alignment model is used for matching images corresponding to description texts; wherein performing fine-tuning training on the initial image-text alignment model by adopting the target training data set comprises: dividing the target training data set into a plurality of training batches, wherein each training batch comprises a plurality of groups of second image-text pairs; analyzing the data of each training batch with the initial image-text alignment model to obtain a model analysis result; determining a negative semantic vector according to the model analysis result, which comprises: determining a negative image-text pair that is correctly matched in the model analysis result, acquiring the forward description text in the forward image-text pair belonging to the same group as that negative image-text pair, determining a feature vector from the forward description text and the negative description text in the negative image-text pair, wherein the feature vector represents the feature difference between the forward description text and the negative description text, constructing a feature matrix from the feature vectors, wherein the feature matrix records the feature vectors corresponding to the negative image-text pairs correctly matched across the training batches, and performing a linear transformation on the feature matrix to obtain the negative semantic vector; and determining a loss function value according to the negative semantic vector.
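The negative-semantic-vector computation in claim 1 can be sketched as follows. All shapes, the random embeddings, and the mixing weights are invented for illustration; the claim does not specify the dimensionality or the parameters of the linear transformation.

```python
import numpy as np

# Hypothetical text features for 4 correctly matched negative image-text
# pairs (8-dimensional); a real alignment model would produce these.
rng = np.random.default_rng(0)
pos_text = rng.normal(size=(4, 8))   # forward (positive) description features
neg_text = rng.normal(size=(4, 8))   # negative description features

# Feature vector: difference between the forward and negative text features;
# stacking one row per correctly matched negative pair gives the feature matrix.
feature_matrix = pos_text - neg_text

# Linear transformation of the feature matrix into a single negative-semantic
# vector; uniform mixing weights are used here purely for illustration.
mix = np.full(4, 0.25)
negative_semantic_vector = mix @ feature_matrix  # shape (8,)
print(negative_semantic_vector.shape)
```

The negative-semantic vector then feeds the loss-function computation described in claim 6.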
- 2. The model training method of claim 1, wherein generating the target training data set from the initial training data set comprises: randomly selecting a first image-text pair from the initial training data set, and analyzing the first description text in the first image-text pair with a text analysis agent to generate candidate text elements, wherein a candidate text element is a text element that is not mentioned in the first description text but is related to the scene described by the first description text, and the text elements comprise at least one of actions, target objects and attributes; analyzing the candidate text element and the picture in the first image-text pair with a multimodal agent to judge whether the picture in the first image-text pair contains the content described by the candidate text element; determining the candidate text element as a target text element in the case that the picture in the first image-text pair does not contain the content described by the candidate text element; and generating the second image-text pairs in the target training data set according to the target text element and the first description text.
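The element-mining loop of claim 2 can be sketched with stand-in agent functions. The `text_agent_candidates` and `vision_agent_contains` stubs below are hypothetical placeholders for the text-analysis agent and the multimodal agent; their outputs are hard-coded purely for illustration.

```python
def text_agent_candidates(caption):
    # Stand-in for the text-analysis agent: propose scene-related elements
    # (actions, objects, attributes) that the caption does not mention.
    return ["ball", "leash"]          # illustrative output only

def vision_agent_contains(image, element):
    # Stand-in for the multimodal agent: does the picture show this element?
    return element in image["visible_objects"]

def mine_target_elements(image, caption):
    # Keep only candidates the picture does NOT contain (claim 2's filter).
    return [e for e in text_agent_candidates(caption)
            if not vision_agent_contains(image, e)]

image = {"visible_objects": {"dog", "grass", "leash"}}
targets = mine_target_elements(image, "a dog running on grass")
print(targets)  # ['ball'] -- "leash" is visible, so it is filtered out
```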
- 3. The model training method of claim 2, wherein generating the second image-text pairs in the target training data set from the target text element and the first description text comprises: adding a negation word to the target text element with the text analysis agent, and adding the target text element with the negation word to the first description text to obtain the negative description text; determining the negative description text and the picture in the first image-text pair as the negative image-text pair; replacing the original text element of the same type as the target text element in the first description text with the target text element to obtain the forward description text; and generating a picture corresponding to the content described by the forward description text, and determining that picture and the forward description text as the forward image-text pair.
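A minimal sketch of claim 3's two caption constructions. The exact phrasing used to attach the negated element is not specified in the claim, so the `", with no …"` template below is an assumption.

```python
def make_negative_caption(caption, element, negation="no"):
    # Claim 3: add a negation word to the target element and append it
    # to the first description text (template phrasing is illustrative).
    return f"{caption}, with {negation} {element}"

def make_positive_caption(caption, original_element, target_element):
    # Claim 3: replace the same-type original element with the target element.
    return caption.replace(original_element, target_element)

neg = make_negative_caption("a dog running on grass", "ball")
pos = make_positive_caption("a dog running on grass", "grass", "sand")
print(neg)  # a dog running on grass, with no ball
print(pos)  # a dog running on sand
```

The positive caption would then be passed to an image generator to produce its matching picture.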
- 4. The model training method of claim 3, further comprising: determining an approximate text element corresponding to the target text element, wherein the semantic similarity between the approximate text element and the target text element exceeds a preset similarity threshold and the two text elements are of the same type; replacing the original text element of the same type as the approximate text element in the first description text with the approximate text element to obtain the approximate description text; and generating a picture corresponding to the content described by the approximate description text, and determining that picture and the approximate description text as the approximate image-text pair.
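Claim 4's similarity-thresholded selection of an approximate element can be sketched as below. Cosine similarity over toy 2-dimensional embeddings and the 0.8 threshold are illustrative assumptions; the claim only requires that similarity exceed a preset threshold and that the types match.

```python
import numpy as np

def pick_approximate_element(target_vec, candidates, threshold=0.8):
    # Claim 4: choose a same-type element whose semantic similarity to the
    # target element exceeds the preset threshold (cosine, illustratively).
    best, best_sim = None, threshold
    for name, vec in candidates.items():
        sim = float(target_vec @ vec /
                    (np.linalg.norm(target_vec) * np.linalg.norm(vec)))
        if sim > best_sim:
            best, best_sim = name, sim
    return best

target = np.array([1.0, 0.0])
candidates = {"ball": np.array([0.9, 0.1]),   # semantically close
              "leash": np.array([0.0, 1.0])}  # semantically far
print(pick_approximate_element(target, candidates))  # ball
```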
- 5. The model training method of claim 1, wherein constructing the feature matrix from the feature vectors comprises: acquiring the feature vector corresponding to a correctly matched negative image-text pair in the model analysis result of the current training batch; in the case that the negative image-text pair is correctly matched for the first time, newly adding the feature vector to the feature matrix; in the case that the negative image-text pair was already correctly matched in a training batch before the current training batch, acquiring the historical feature vector stored in the feature matrix under the identifier of that negative image-text pair; and performing weighted fusion of the feature vector obtained in the current training batch and the previously stored historical feature vector to obtain a new feature vector, and replacing the previously stored historical feature vector in the feature matrix with the new feature vector, wherein during the weighted fusion the weight of the feature vector obtained in the current training batch is larger than the weight of the previously stored historical feature vector.
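The insert-or-fuse update of claim 5 resembles an exponential moving average keyed by pair identifier. A minimal sketch, with the 0.7/0.3 weight split chosen for illustration (the claim only requires the current batch's weight to be the larger one):

```python
def update_feature_matrix(matrix, pair_id, new_vec, w_new=0.7):
    # Claim 5: first correct match -> insert; later matches -> weighted
    # fusion, with the current batch's vector weighted more heavily.
    assert w_new > 0.5
    if pair_id not in matrix:
        matrix[pair_id] = new_vec
    else:
        old = matrix[pair_id]
        matrix[pair_id] = [w_new * n + (1.0 - w_new) * o
                           for n, o in zip(new_vec, old)]

matrix = {}
update_feature_matrix(matrix, "pair_7", [1.0, 0.0])   # first match: insert
update_feature_matrix(matrix, "pair_7", [0.0, 1.0])   # later match: fuse
print(matrix["pair_7"])  # [0.3, 0.7]
```

Weighting recent batches more heavily lets the stored difference vectors track the model as fine-tuning progresses, rather than being anchored to early, less-trained features.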
- 6. The model training method of claim 1, wherein determining the loss function value based on the negative semantic vector comprises: determining a first loss value according to the matching of image-text pairs represented by the model analysis result and the correspondence of the image-text pairs in the target training data set, wherein the first loss value represents the accuracy of the model's image-text alignment; determining a negative image-text pair that is correctly matched in the model analysis result, and determining the feature vector corresponding to that negative image-text pair, wherein the feature vector represents the feature difference between the forward description text in the forward image-text pair belonging to the same group as the negative image-text pair and the negative description text; determining a second loss value according to the feature vector and the negative semantic vector, wherein the second loss value represents the degree of similarity between the feature vector and the negative semantic vector; and determining the loss function value according to the first loss value and the second loss value.
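The two-part loss of claim 6 can be sketched as below. The choice of 1 minus cosine similarity for the second loss and the additive combination with weight 1.0 are illustrative assumptions; the claim only requires that the second loss measure similarity between the feature vectors and the negative-semantic vector.

```python
import numpy as np

def second_loss(feature_vecs, neg_sem_vec):
    # Claim 6: penalize dissimilarity between each feature vector and the
    # negative-semantic vector (1 - cosine similarity, chosen illustratively).
    sims = [float(f @ neg_sem_vec /
                  (np.linalg.norm(f) * np.linalg.norm(neg_sem_vec) + 1e-8))
            for f in feature_vecs]
    return 1.0 - float(np.mean(sims))

def loss_function(first_loss, feature_vecs, neg_sem_vec, weight=1.0):
    # Total loss: alignment accuracy term plus the similarity term.
    return first_loss + weight * second_loss(feature_vecs, neg_sem_vec)

v = np.array([1.0, 2.0, 3.0])
# When a feature vector coincides with the negative-semantic vector,
# the second loss is ~0 and the total reduces to the first loss.
print(loss_function(0.5, [v], v))
```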
- 7. An image-text matching method, characterized by comprising: acquiring a description text to be matched; and analyzing the description text to be matched with a target image-text alignment model to determine a target image corresponding to the description text to be matched, wherein the target image-text alignment model is obtained by performing fine-tuning training on an initial image-text alignment model with a target training data set, the target training data set comprises a plurality of groups of second image-text pairs, each group of second image-text pairs comprises a forward image-text pair, a negative image-text pair and an approximate image-text pair, the forward image-text pair comprises a forward description text with positive semantics and a picture corresponding to the description content of the forward description text, the negative image-text pair comprises a negative description text with negative semantics and a picture corresponding to the description content of the negative description text, and the approximate image-text pair comprises an approximate description text between the positive semantics and the negative semantics and a picture corresponding to the description content of the approximate description text; wherein performing fine-tuning training on the initial image-text alignment model with the target training data set comprises: dividing the target training data set into a plurality of training batches, wherein each training batch comprises a plurality of groups of second image-text pairs; analyzing the data of each training batch with the initial image-text alignment model to obtain a model analysis result; determining a negative semantic vector according to the model analysis result, which comprises: determining a negative image-text pair that is correctly matched in the model analysis result, acquiring the forward description text in the forward image-text pair belonging to the same group as that negative image-text pair, determining a feature vector from the forward description text and the negative description text in the negative image-text pair, wherein the feature vector represents the feature difference between the forward description text and the negative description text, constructing a feature matrix from the feature vectors, wherein the feature matrix records the feature vectors corresponding to the negative image-text pairs correctly matched across the training batches, and performing a linear transformation on the feature matrix to obtain the negative semantic vector; and determining a loss function value according to the negative semantic vector.
- 8. A model training device, comprising: a training set acquisition module for acquiring an initial training data set, wherein the initial training data set comprises a plurality of first image-text pairs, and each first image-text pair comprises a picture and a first description text corresponding to the content of the picture; a training set optimization module for generating a target training data set according to the initial training data set, wherein the target training data set comprises a plurality of groups of second image-text pairs, each group of second image-text pairs comprises a forward image-text pair, a negative image-text pair and an approximate image-text pair, the forward image-text pair comprises a forward description text with positive semantics and a picture corresponding to the description content of the forward description text, the negative image-text pair comprises a negative description text with negative semantics and a picture corresponding to the description content of the negative description text, and the approximate image-text pair comprises an approximate description text between the positive semantics and the negative semantics and a picture corresponding to the description content of the approximate description text; and a fine-tuning training module for performing fine-tuning training on the initial image-text alignment model with the target training data set to obtain a target image-text alignment model, wherein the target image-text alignment model is used for matching images corresponding to description texts; wherein performing fine-tuning training on the initial image-text alignment model with the target training data set comprises: dividing the target training data set into a plurality of training batches, wherein each training batch comprises a plurality of groups of second image-text pairs; analyzing the data of each training batch with the initial image-text alignment model to obtain a model analysis result; determining a negative semantic vector according to the model analysis result, which comprises: determining a negative image-text pair that is correctly matched in the model analysis result, acquiring the forward description text in the forward image-text pair belonging to the same group as that negative image-text pair, determining a feature vector from the forward description text and the negative description text in the negative image-text pair, wherein the feature vector represents the feature difference between the forward description text and the negative description text, constructing a feature matrix from the feature vectors, wherein the feature matrix records the feature vectors corresponding to the negative image-text pairs correctly matched across the training batches, and performing a linear transformation on the feature matrix to obtain the negative semantic vector; and determining a loss function value according to the negative semantic vector.
- 9. An electronic device comprising a memory and a processor, the processor being configured to execute a program stored in the memory, wherein the program, when executed, performs the model training method of any one of claims 1 to 6 or the image-text matching method of claim 7.
- 10. A non-volatile storage medium, characterized in that the non-volatile storage medium comprises a stored computer program, wherein a device in which the non-volatile storage medium is located performs the model training method of any one of claims 1 to 6 or the image-text matching method of claim 7 by running the computer program.
- 11. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the model training method of any one of claims 1 to 6 or the image-text matching method of claim 7.
Description
Model training and image-text matching method and device, electronic equipment and storage medium
Technical Field
The application relates to the technical field of multimodal large models, and in particular to a model training and image-text matching method and device, electronic equipment and a storage medium.
Background
CLIP (Contrastive Language-Image Pre-training) is a multimodal model that aligns the two modalities of images and text. It has strong image-text alignment capability and serves as a base model for multimodal tasks such as image-text retrieval and visual question answering. While CLIP performs well on public data sets, understanding negative semantics is one of its significant weaknesses. The related technology enhances a model's understanding of negative semantics only by constructing negative image-text pairs, which leaves the model's negative-semantic understanding poor. In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiments of the application provide a model training and image-text matching method and device, electronic equipment and a storage medium, which at least solve the technical problem that the model performs poorly at negative-semantic understanding because the related technology enhances the model's understanding of negative semantics only by constructing negative image-text pairs.
According to one aspect of the embodiments of the application, a model training method is provided, which comprises: obtaining an initial training data set, wherein the initial training data set comprises a plurality of first image-text pairs, and each first image-text pair comprises a picture and a first description text corresponding to the content of the picture; generating a target training data set according to the initial training data set, wherein the target training data set comprises a plurality of groups of second image-text pairs, each group of second image-text pairs comprises a forward image-text pair, a negative image-text pair and an approximate image-text pair, each forward image-text pair comprises a forward description text with positive semantics and a picture corresponding to the description content of the forward description text, each negative image-text pair comprises a negative description text with negative semantics and a picture corresponding to the description content of the negative description text, and each approximate image-text pair comprises an approximate description text between the positive semantics and the negative semantics and a picture corresponding to the description content of the approximate description text; and performing fine-tuning training on the initial image-text alignment model by adopting the target training data set to obtain a target image-text alignment model, wherein the target image-text alignment model is used for matching images corresponding to description texts.
Optionally, generating the target training data set according to the initial training data set comprises: randomly selecting a first image-text pair from the initial training data set; analyzing the first description text in the first image-text pair with a text analysis agent to generate a candidate text element, wherein the candidate text element is a text element that is not mentioned in the first description text but is related to the scene described by the first description text, and the text elements comprise at least one of an action, a target object and an attribute; analyzing the candidate text element and the picture in the first image-text pair with a multimodal agent to judge whether the content described by the candidate text element is contained in the picture in the first image-text pair; determining the candidate text element as the target text element in the case that the content described by the candidate text element is not contained in the picture in the first image-text pair; and generating a second image-text pair in the target training data set according to the target text element and the first description text.
Optionally, generating the second image-text pair in the target training data set according to the target text element and the first description text comprises: adding a negation word to the target text element with the text analysis agent, and adding the target text element with the negation word to the first description text to obtain the negative description text; determining the negative description text and the picture in the first image-text pair as the negative image-text pair; replacing the original text element corresponding to the type of the target text element in the first description text with the target text element to obtain the forward description text; and generating a picture corresponding to the content described by the forward description text, and determining that picture and the forward description text as the forward image-text pair.