CN-115526259-B - Training method and device for a multimodal pre-training model
Abstract
The invention provides a training method and device for a multimodal pre-training model. The method constructs a multimodal pre-training model containing a multimodal image-text information processing network, constructs a weakly aligned image-text dataset comprising a text dataset, an image-tag dataset, and an image-referring-description dataset, and trains the multimodal pre-training model on the weakly aligned dataset. The multimodal image-text information processing network processes multimodal image-text information directly, needs no external model to assist in extracting image features, and offers strong inference capability. Meanwhile, training the multimodal pre-training model on the weakly aligned image-text dataset reduces the dependence on manually annotated image-text aligned data and avoids the high data overhead of training on an aligned large-scale image-text dataset.
Inventors
- LIU YANG
- CHEN CHI
- LI PENG
- SUN MAOSONG
Assignees
- Tsinghua University
Dates
- Publication Date: 2026-05-08
- Application Date: 2022-09-29
Claims (9)
- 1. A method of training a multimodal pre-training model, the method comprising: constructing a multimodal pre-training model containing a multimodal image-text information processing network; constructing a weakly aligned image-text dataset, wherein the weakly aligned image-text dataset comprises a text dataset, an image-tag dataset, and an image-referring-description dataset; and training the multimodal pre-training model using the weakly aligned image-text dataset; wherein a sample in the image-tag dataset consists of an image and a tag word text sequence corresponding to the image, the tag word text sequence being formed by concatenating the tag words of all entities in the image, and a sample in the image-referring-description dataset consists of an image and its corresponding referring description; wherein training the multimodal pre-training model with the weakly aligned image-text dataset includes: performing first preprocessing on each sample in the image-tag dataset to obtain a first dataset; performing second preprocessing on each sample in the image-referring-description dataset to obtain a second dataset; performing third preprocessing on each sample in the text dataset to obtain a third dataset; generating, using the multimodal image-text information processing network, a multimodal characterization vector for each sample in the first dataset, the second dataset, and the third dataset; and jointly training the multimodal pre-training model on a masked tag word prediction task, a referring description matching task, and a masked word prediction task using the multimodal characterization vectors of the first, second, and third datasets; wherein the first preprocessing at least comprises: cutting the image in a sample into N blocks to obtain a corresponding image block sequence, and randomly masking part of the tag words of the tag word text sequence in the sample to obtain a tag word text sequence with mask marks; the second preprocessing at least comprises: cutting the image in a sample into N blocks to obtain a corresponding image block sequence, and determining the word segmentation sequence corresponding to the referring description in the sample; and the third preprocessing at least comprises: determining the text word segmentation sequence corresponding to the text in a sample, and randomly masking part of the word segments in the text word segmentation sequence to obtain a text word segmentation sequence with mask marks (the preprocessing is illustrated in the first sketch following the claims).
- 2. The method of training a multimodal pre-training model according to claim 1, wherein generating the image-tag dataset comprises: acquiring an image dataset; for each image in the image dataset, acquiring all entities present in the image by means of an object detector; generating a tag word text sequence corresponding to the image based on all entities present in the image; and generating the image-tag dataset from each image in the image dataset and its corresponding tag word text sequence (a dataset-construction sketch follows the claims).
- 3. The method of training a multimodal pre-training model according to claim 2, wherein generating the image-referring-description dataset comprises: removing overlapping entities from all the entities and constructing a first entity set from the remaining entities; for any entity in the first entity set, constructing a second entity set from that entity and the entities sharing its tag word; generating a referring description of that entity based on its tag word and the size and position of each entity in the second entity set; taking the referring description of that entity as a referring description corresponding to the image; and generating the image-referring-description dataset from each image in the image dataset and its corresponding referring descriptions (a referring-description sketch follows the claims).
- 4. The method of training a multimodal pre-training model according to claim 1, wherein the multimodal image-text information processing network comprises a text embedding layer, a visual encoder, and a multimodal encoder; wherein generating, using the multimodal image-text information processing network, a multimodal characterization vector for each sample in the first dataset comprises: converting the tag word text sequence with mask marks of each sample in the first dataset into text word vectors using the text embedding layer; determining, with the visual encoder, the image feature vectors corresponding to the image block sequence of each sample in the first dataset; and fusing, with the multimodal encoder, the text word vectors converted from the tag word text sequence with mask marks and the image feature vectors corresponding to the image block sequence of each sample in the first dataset to obtain the multimodal characterization vector of each sample in the first dataset; wherein generating a multimodal characterization vector for each sample in the second dataset comprises: converting the referring-description word segmentation sequence of each sample in the second dataset into text word vectors using the text embedding layer; determining, with the visual encoder, the image feature vectors corresponding to the image block sequence of each sample in the second dataset; and fusing, with the multimodal encoder, the text word vectors converted from the referring-description word segmentation sequence and the image feature vectors corresponding to the image block sequence of each sample in the second dataset to obtain the multimodal characterization vector of each sample in the second dataset; and wherein generating a multimodal characterization vector for each sample in the third dataset comprises: converting the text word segmentation sequence with mask marks of each sample in the third dataset into text word vectors using the text embedding layer, and taking those text word vectors as the multimodal characterization vector of each sample in the third dataset (a forward-pass sketch follows the claims).
- 5. The method of training a multimodal pre-training model according to claim 1, wherein jointly training the multimodal pre-training model on the masked tag word prediction task, the referring description matching task, and the masked word prediction task using the multimodal characterization vectors of the first, second, and third datasets includes: predicting the masked tag words of each sample in the first dataset using the multimodal characterization vector of that sample; predicting, using the multimodal characterization vector of each sample in the second dataset, the position of the referred entity of that sample in the corresponding image; predicting the masked word segments of each sample in the third dataset using the multimodal characterization vector of that sample; calculating the masked tag word prediction loss of the first dataset; calculating the referred entity position prediction loss of the second dataset; calculating the masked word segmentation prediction loss of the third dataset; optimizing the parameters of the multimodal pre-training model with the sum of the masked tag word prediction loss of the first dataset, the referred entity position prediction loss of the second dataset, and the masked word segmentation prediction loss of the third dataset as the training loss; and repeating the above operations until the multimodal pre-training model converges (a training-loss sketch follows the claims).
- 6. The method of claim 5, wherein the masked tag word prediction loss $\mathcal{L}_{tag}$ of the first dataset is calculated as:

  $$\mathcal{L}_{tag} = \mathbb{E}_{(I,w)\sim D_{tag}}\,\mathrm{CE}\big(p(w_m \mid w_{\setminus m}, v_I),\, w_m^{*}\big)$$

  wherein $(I, w)$ denotes a sample formed by an image $I$ and its corresponding tag word text sequence $w$; $D_{tag}$ is the image-tag dataset; $w_m$ denotes the text word vectors corresponding to the masked tag words in the tag word text sequence $w$; $w_{\setminus m}$ denotes the text word vectors corresponding to the unmasked tag words in $w$; $v_I$ denotes the image feature vectors corresponding to the image block sequence of image $I$; and $\mathrm{CE}$ is the cross entropy between the joint candidate-word probability distribution predicted for the masked tag words of $w$ and the ground truth $w_m^{*}$;

  the referred entity position prediction loss $\mathcal{L}_{ref}$ of the second dataset is calculated as:

  $$\mathcal{L}_{ref} = \mathbb{E}_{(I,r)\sim D_{ref}}\big[\mathcal{L}_{dice}(\hat{p}, p) + \mathcal{L}_{bce}(\hat{p}, p)\big]$$

  wherein $(I, r)$ denotes a sample formed by an image $I$ and its corresponding referring description $r$; $D_{ref}$ is the image-referring-description dataset; $N$ is the number of image blocks of image $I$; $\mathcal{L}_{dice}(\hat{p}, p)$ is the soft dice loss between $\hat{p}$ and $p$; $\mathcal{L}_{bce}(\hat{p}, p)$ is the binary cross-entropy loss between $\hat{p}$ and $p$; $\hat{p} = (\hat{p}_1, \ldots, \hat{p}_N)$ denotes the predicted probability functions corresponding to the $N$ image blocks of image $I$, with $\hat{p}_i$ the predicted probability function of the $i$-th image block; and $p = (p_1, \ldots, p_N)$ denotes the ground-truth probability functions corresponding to the $N$ image blocks of image $I$, with $p_i$ the probability function of the $i$-th image block, taking the value 0 or 1: the value 0 indicates that the referred entity is absent from the block, and the value 1 indicates that the referred entity is present;

  the masked word segmentation prediction loss $\mathcal{L}_{text}$ of the third dataset is calculated as:

  $$\mathcal{L}_{text} = \mathbb{E}_{T\sim D_{text}}\,\mathrm{CE}\big(p(t_m \mid t_{\setminus m}),\, t_m^{*}\big)$$

  wherein $D_{text}$ is the text dataset; $t_m$ denotes the text word vectors corresponding to the masked word segments in the word segmentation sequence of text $T$; $t_{\setminus m}$ denotes the text word vectors corresponding to the unmasked word segments; and $\mathrm{CE}$ is the cross entropy between the joint candidate-word probability distribution predicted for the masked word segments of text $T$ and the ground truth $t_m^{*}$.
- 7. A training device for a multimodal pre-training model, the device comprising: a first construction module for constructing a multimodal pre-training model containing a multimodal image-text information processing network; a second construction module for constructing a weakly aligned image-text dataset, wherein the weakly aligned image-text dataset comprises a text dataset, an image-tag dataset, and an image-referring-description dataset; and a training module for training the multimodal pre-training model using the weakly aligned image-text dataset; wherein a sample in the image-tag dataset consists of an image and a tag word text sequence corresponding to the image, the tag word text sequence being formed by concatenating the tag words of all entities in the image, and a sample in the image-referring-description dataset consists of an image and its corresponding referring description; the training module is further configured to perform first preprocessing on each sample in the image-tag dataset to obtain a first dataset, perform second preprocessing on each sample in the image-referring-description dataset to obtain a second dataset, perform third preprocessing on each sample in the text dataset to obtain a third dataset, generate, using the multimodal image-text information processing network, a multimodal characterization vector for each sample in the first, second, and third datasets, and jointly train the multimodal pre-training model on a masked tag word prediction task, a referring description matching task, and a masked word prediction task using those multimodal characterization vectors; wherein the first preprocessing at least comprises: cutting the image in a sample into N blocks to obtain a corresponding image block sequence, and randomly masking part of the tag words of the tag word text sequence in the sample to obtain a tag word text sequence with mask marks; the second preprocessing at least comprises: cutting the image in a sample into N blocks to obtain a corresponding image block sequence, and determining the word segmentation sequence corresponding to the referring description in the sample; and the third preprocessing at least comprises: determining the text word segmentation sequence corresponding to the text in a sample, and randomly masking part of the word segments in the text word segmentation sequence to obtain a text word segmentation sequence with mask marks.
- 8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the training method of the multimodal pre-training model of any of claims 1 to 6 when the program is executed.
- 9. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements a training method of a multimodal pre-training model according to any of claims 1 to 6.
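The three preprocessing steps of claims 1 and 7 can be made concrete with the short Python sketch below. It assumes square images, whitespace-split word sequences, an illustrative 15% masking rate, and a `[MASK]` token; none of these values is fixed by the claims.

```python
import random
import numpy as np

MASK_TOKEN = "[MASK]"  # assumed mask mark; the claims do not name a token

def cut_into_blocks(image, n_per_side):
    """Cut an image into N = n_per_side**2 blocks (first and second
    preprocessing), returning the image block sequence."""
    h, w = image.shape[0], image.shape[1]
    bh, bw = h // n_per_side, w // n_per_side
    return [image[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            for i in range(n_per_side) for j in range(n_per_side)]

def mask_words(words, rate=0.15):
    """Randomly mask part of a word sequence, returning the sequence with
    mask marks and the masked positions (the prediction targets)."""
    masked, targets = [], {}
    for pos, word in enumerate(words):
        if random.random() < rate:
            masked.append(MASK_TOKEN)
            targets[pos] = word
        else:
            masked.append(word)
    return masked, targets

# First preprocessing: image blocks + masked tag word sequence.
blocks = cut_into_blocks(np.zeros((224, 224, 3)), 4)  # N = 16 blocks
tags_masked, targets = mask_words("dog frisbee grass".split())
# Second preprocessing: image blocks + referring-description words (no masking).
# Third preprocessing: masked text word sequence only (no image).
```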
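Claim 2's image-tag dataset generation reduces to running an object detector over each image and splicing the detected tag words into one text sequence. The sketch below assumes a generic `detect(image)` callable returning `(tag_word, bounding_box)` pairs; the detector and its output format are assumptions, not part of the claim.

```python
def build_image_tag_dataset(images, detect):
    """For each image, detect all entities and splice their tag words into
    one tag word text sequence paired with the image (claim 2)."""
    dataset = []
    for image in images:
        entities = detect(image)              # assumed: [(tag_word, box), ...]
        tag_sequence = " ".join(tag for tag, _ in entities)
        dataset.append((image, tag_sequence))
    return dataset
```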
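Claim 3's referring description generation can be sketched as follows. The IoU overlap test, the `(x1, y1, x2, y2)` box format, and the "large/small ... on the left/right" description template are illustrative assumptions; the claim requires only that the description be generated from the entity's tag word and the sizes and positions of the entities in the second entity set.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def referring_descriptions(entities, image_width, iou_thresh=0.5):
    """Drop entities that overlap another entity, then describe each
    remaining entity by its tag word plus size and position relative to
    the same-tag group (claim 3)."""
    first_set = [e for e in entities
                 if all(e is other or iou(e[1], other[1]) < iou_thresh
                        for other in entities)]
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    descriptions = []
    for tag, box in first_set:
        group = [b for t, b in entities if t == tag]   # second entity set
        size_word = "large" if area(box) == max(area(b) for b in group) else "small"
        side = "left" if (box[0] + box[2]) / 2 < image_width / 2 else "right"
        descriptions.append(f"the {size_word} {tag} on the {side}")
    return descriptions

print(referring_descriptions(
    [("dog", (10, 10, 60, 60)), ("dog", (150, 20, 190, 50))], image_width=224))
# ['the large dog on the left', 'the small dog on the right']
```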
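The network of claim 4 can be sketched in PyTorch as below. The embedding dimension, the linear patch-projection visual encoder, and the two-layer transformer multimodal encoder are assumptions; the claim fixes only the three components, the fusion of text word vectors with image feature vectors, and the text-only branch for the third dataset.

```python
import torch
import torch.nn as nn

class MultimodalNet(nn.Module):
    """Text embedding layer + visual encoder + multimodal encoder (claim 4).
    Vocabulary size, dimensions, and depth are illustrative assumptions."""
    def __init__(self, vocab_size=30000, dim=256, block_pixels=56 * 56 * 3):
        super().__init__()
        self.text_embedding = nn.Embedding(vocab_size, dim)  # text embedding layer
        self.visual_encoder = nn.Linear(block_pixels, dim)   # one vector per block
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.multimodal_encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids, image_blocks=None):
        h = self.text_embedding(token_ids)        # (B, T, dim) text word vectors
        if image_blocks is None:
            # Third dataset: the text word vectors themselves serve as the
            # multimodal characterization vectors (claim 4, text-only branch).
            return h
        v = self.visual_encoder(image_blocks)     # (B, N, dim) image features
        fused = torch.cat([h, v], dim=1)          # fuse text and image tokens
        return self.multimodal_encoder(fused)     # multimodal characterization

net = MultimodalNet()
out = net(torch.randint(0, 30000, (2, 12)),
          torch.rand(2, 16, 56 * 56 * 3))         # shape (2, 28, 256)
```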
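Finally, the joint objective of claims 5 and 6: cross entropy on the masked tag words (first dataset), soft dice plus binary cross entropy over the per-block referred-entity probabilities (second dataset), and cross entropy on the masked word segments (third dataset), summed into one training loss. The prediction heads and the dice smoothing constant are assumptions.

```python
import torch
import torch.nn.functional as F

def soft_dice_loss(pred, truth, eps=1.0):
    """Soft dice loss between predicted and ground-truth per-block
    probabilities; eps is an assumed smoothing constant."""
    inter = (pred * truth).sum(dim=-1)
    return 1 - (2 * inter + eps) / (pred.sum(-1) + truth.sum(-1) + eps)

def training_loss(tag_logits, tag_truth, ref_probs, ref_truth,
                  text_logits, text_truth):
    """Sum of the three dataset losses (claims 5 and 6). The *_logits cover
    masked positions only; ref_truth holds the 0/1 per-block labels."""
    l_tag = F.cross_entropy(tag_logits, tag_truth)             # first dataset
    l_ref = (soft_dice_loss(ref_probs, ref_truth).mean()
             + F.binary_cross_entropy(ref_probs, ref_truth))   # second dataset
    l_text = F.cross_entropy(text_logits, text_truth)          # third dataset
    return l_tag + l_ref + l_text

tag_logits = torch.randn(5, 30000, requires_grad=True)
loss = training_loss(tag_logits, torch.randint(0, 30000, (5,)),
                     torch.rand(2, 16), torch.randint(0, 2, (2, 16)).float(),
                     torch.randn(7, 30000), torch.randint(0, 30000, (7,)))
loss.backward()  # one optimization step; repeat until convergence (claim 5)
```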
Description
Training method and device for a multimodal pre-training model

Technical Field

The invention relates to the technical field of unsupervised machine learning, in particular to a training method and device for a multimodal pre-training model.

Background

With the continuous development of transfer learning, a series of image-text multimodal pre-trained models, such as UNITER, VinVL, CLIP, and DALL·E, have been widely developed. These multimodal pre-training models are typically pre-trained on simple tasks using large-scale datasets. After pre-training, the parameters of the multimodal pre-training model are fine-tuned for a specific downstream task (such as visual question answering or image-text retrieval) and used to execute that task, improving its performance. Most prior-art multimodal pre-training models are pre-trained on aligned large-scale image-text datasets to obtain cross-modal understanding capability. However, aligned large-scale image-text data usually must be manually annotated or cleaned, so the data overhead is too high. The few models pre-trained on non-aligned image-text datasets have relatively poor cross-modal capability, and their pre-training requires an external model to assist in extracting image feature vectors, which leads to relatively low inference efficiency. Accordingly, there is a need for a multimodal pre-training approach that achieves strong cross-modal understanding while leveraging non-aligned image datasets and text datasets.

Disclosure of Invention

The invention provides a training method and device for a multimodal pre-training model. Training the multimodal pre-training model on a weakly aligned image-text dataset reduces the dependence on manually annotated image-text aligned data and avoids the high data overhead of training on an aligned large-scale image-text dataset. The model can process multimodal image-text information directly, without an external model to assist in extracting image features, avoiding the low inference efficiency of training with a non-aligned image-text dataset. Meanwhile, jointly training the multimodal pre-training model on a masked tag word prediction task, a referring description matching task, and a masked word prediction task mitigates the relatively poor cross-modal capability that results from training on a non-aligned image-text dataset.
In a first aspect, the present invention provides a training method of a multimodal pre-training model, the method comprising: constructing a multimodal pre-training model containing a multimodal image-text information processing network; constructing a weakly aligned image-text dataset, wherein the weakly aligned image-text dataset comprises a text dataset, an image-tag dataset, and an image-referring-description dataset; and training the multimodal pre-training model using the weakly aligned image-text dataset. A sample in the image-tag dataset consists of an image and a tag word text sequence corresponding to the image, wherein the tag word text sequence is formed by concatenating the tag words of all entities in the image; a sample in the image-referring-description dataset consists of an image and its corresponding referring description. According to the training method of the multimodal pre-training model provided by the invention, the image-tag dataset is generated as follows: acquiring an image dataset; for each image in the image dataset, acquiring all entities present in the image by means of an object detector; generating a tag word text sequence corresponding to the image based on all entities present in the image; and generating the image-tag dataset from each image in the image dataset and its corresponding tag word text sequence. According to the training method of the multimodal pre-training model provided by the invention, the image-referring-description dataset is generated as follows: removing overlapping entities from all the entities and constructing a first entity set from the remaining entities; for any entity in the first entity set, constructing a second entity set from that entity and the entities sharing its tag word; generating a referring description of that entity based on its tag word and the size and position of each entity in the second entity set; taking the referring description of that entity as a referring description corresponding to the image; and generating the image-referring-description dataset from each image in the image dataset and its corresponding referring descriptions. According to the training