CN-115393692-B - Associative text-to-image generation method based on a generative pre-trained language model
Abstract
The invention discloses an associative text-to-image generation method based on a generative pre-trained language model. The method comprises: fine-tuning the generative pre-trained model on a data set so that the pre-trained model acquires text information with good semantic retention, obtaining a fine-tuned pre-trained model; taking the ten sentences corresponding to each image in the original data set as input to the fine-tuned pre-trained model to obtain a generated data set output by the model; applying constraint processing and semantic-retention evaluation and selection to the generated data set to obtain an associative text data set; and, based on the associative text data set, generating images whose text-image cross-modal semantic features are consistent by using a DF-GAN-based adversarial generation network model. The invention comprehensively exploits the associative capability and rich semantic information of the generative pre-trained model, and to a certain extent alleviates the imbalance between text information and image information that adversarial generation networks face on the text-to-image cross-modal generation task.
Inventors
- BAO BINGKUN
- SHENG YEFEI
- TAO MING
- TAN ZHIYI
- SHAO XI
Assignees
- Nanjing University of Posts and Telecommunications (南京邮电大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2022-09-08
Claims (5)
- 1. A method for generating an associative text-to-image based on a generative pre-trained language model, comprising: step S1, fine-tuning the generative pre-trained model on a data set so that the pre-trained model acquires the existing text information with good semantic retention, obtaining a fine-tuned pre-trained model; step S2, taking the ten sentences corresponding to each image in the original data set as the input of the fine-tuned pre-trained model obtained in step S1, obtaining a generated data set output by the model; and step S3, based on the associative text data set obtained in step S2, generating an image whose text-image cross-modal semantic features are consistent by using a DF-GAN-based adversarial generation network model.
Step S1 includes: step S11, acquiring a data set and arranging the ten sentences corresponding to each image in the data set into a sentence string. The data set comprises a plurality of images, each image corresponding to ten sentences, and the ten sentences of each image are arranged into a sentence string according to the following rule: the sentence string is arranged as "$sentence a # sentence b # sentence c # … # sentence 9 # sentence 10$". The sentence string is divided into two parts: the first part is a random initialization, sentences a, b and c being three sentences drawn at random from the ten sentences corresponding to one image; the second part is the sequential concatenation of the remaining sentences. Here "#" and "$" are the separator and the initiator respectively: GPT-2 generates structured sentence strings, the separator facilitates disassembling the generated sentence string, and the initiator prevents the model from generating sentence strings that are too long or too short.
Step S12, inputting the sentence strings of the data set into a pre-trained model for training and fine-tuning to obtain a fine-tuned pre-trained model. The pre-trained model is a GPT-2 model, and its training and fine-tuning proceed as follows. A given input sentence string is expressed as a sentence sequence $[x_1, x_2, \ldots, x_m]$, where $x_m$ is the $m$-th sentence in the string. The loss functions of the GPT-2 model during pre-training and fine-tuning are $L_1(X)$ and $L_2(X)$ respectively. The pre-training loss $L_1(X)$ adopts a maximum-likelihood objective:
$$L_1(X) = \sum_i \log P(x_i \mid x_{i-k}, \ldots, x_{i-1}; \Theta),$$
where $P(\cdot)$ denotes a conditional probability, $\Theta$ the neural-network modelling parameters, $i$ traverses $0, 1, \ldots, k$ with $k < m$, and $k$ is the size of the sliding window. The fine-tuning process adopts supervised learning: a training sample comprises the sentence sequence $[x_1, x_2, \ldots, x_m]$ together with the first sentence $x_1$ serving as a class label, and during fine-tuning the GPT-2 model predicts the class label from the sentence sequence, namely
$$L_2(X) = \log P(x_1 \mid x_1, x_2, \ldots, x_m).$$
The optimization function $L_3$ is the weighted sum of $L_1$ and $L_2$:
$$L_3 = L_2 + \lambda L_1,$$
where $\lambda$ is a hyper-parameter.
In step S2, constraint processing is carried out on the generated data set, comprising processing by a proximity principle, format regularization, and sentence selection; semantic-retention evaluation and selection is then carried out on the generated data set as follows. The generated data set is evaluated with BLEU metrics, comprising $bleu_a$ for samples of the same class with different poses and different backgrounds, $bleu_b$ for samples of clearly different classes, and $bleu_c$ for samples with similar visual characteristics but belonging to different classes:
$$bleu = \frac{\sum_{c \in \mathrm{candidates}} \sum_{\text{n-gram} \in c} Count_{clip}(\text{n-gram})}{\sum_{c' \in \mathrm{candidates}} \sum_{\text{n-gram}' \in c'} Count(\text{n-gram}')},$$
where candidates are the sentences of the generated data set and reference the sentences of the original data set; $Count$ denotes a count and $Count_{clip}$ the clipped count in the numerator; an n-gram is a sequence of consecutive words of a candidate matched against the reference, and n-gram' a sequence of consecutive words counted within the candidates; $c$ and $c'$ are sentences selected from the data set; $\sum_{c \in \mathrm{candidates}}$ and $\sum_{c' \in \mathrm{candidates}}$ run over all candidate sentences, while $\sum_{\text{n-gram} \in c}$ and $\sum_{\text{n-gram}' \in c'}$ run over the n-grams of each candidate; $Count_{clip}(\text{n-gram})$ is the number of matches of an n-gram of a candidate in the reference, clipped by its maximum reference count, and $Count(\text{n-gram}')$ is the number of occurrences of n-gram' in the candidates. The three indices $bleu_a$, $bleu_b$ and $bleu_c$ are calculated for the generated data set and for the original data set respectively; if the ratios of the three indices between the generated data set and the original data set are consistent, the generated data set is regarded as semantically consistent with the original data set, and the semantically consistent generated data set is selected as the associative text data set. (A minimal Python sketch of the sentence-string construction and of this clipped n-gram precision is given after the claims.)
- 2. The method for generating an associative text-to-image based on a generative pre-trained language model according to claim 1, wherein the DF-GAN-based adversarial generation network model comprises a pre-trained text encoder, a generator, and a discriminator. The text encoder encodes all texts in the associative text data set and stores the output sentence vectors in a text-encoding library. In the deep semantic fusion module of each layer, the plurality of input sentences interacts with the feature map of the current level; a cross-modal attention mechanism is calculated to distinguish the weight each sentence carries in the different generator layers, and a convolution layer converts the image features into an image; each deep semantic fusion module comprises an up-sampling layer, a residual block, and a text-image feature-fusion block. The discriminator converts the image into image features with a series of down-sampling layers, then concatenates the image features with the sentence vector, and computes the adversarial loss through one-step generation to ensure visual realism and semantic consistency. The loss functions of the generator and discriminator are as follows:
$$L_D = \mathbb{E}_{x \sim P_r}\big[\max(0,\, 1 - D(x, e))\big] + \tfrac{1}{2}\,\mathbb{E}_{G(z) \sim P_g}\big[\max(0,\, 1 + D(G(z), e))\big] + \tfrac{1}{2}\,\mathbb{E}_{x \sim P_{mis}}\big[\max(0,\, 1 + D(x, e))\big],$$
$$L_G = -\,\mathbb{E}_{G(z) \sim P_g}\big[D(G(z), e)\big],$$
where $L_D$ is the loss function of the discriminator, $L_G$ the loss function of the generator, $z$ a noise vector sampled from a Gaussian distribution, $D$ the discriminator, $G$ the generator, $G(z)$ the image generated by the generator, and $e$ the sentence vector; $P_g$, $P_r$ and $P_{mis}$ denote the synthetic data distribution, the real data distribution and the mismatched data distribution respectively; $\max(0, \cdot)$ is the hinge-loss function; $x$ denotes a real image, $D(x, e)$ the discriminator output on a real image, and $D(G(z), e)$ the discriminator output on a generated image. (A minimal code sketch of these losses follows the claims.)
- 3. The method for generating an associative text-to-image based on a generative pre-trained language model according to claim 2, wherein the processing of the generator comprises: the generator adopts an attention mechanism, with $\alpha_n$ set as the attention weight corresponding to the $n$-th sentence:
$$\alpha_n = \frac{\exp\big(s(W(z), W(x_n))\big)}{\sum_{j} \exp\big(s(W(z), W(x_j))\big)},$$
where $X$ is the input sentence vector, $z$ the input random noise, $s$ the attention score function, $W$ denotes a plurality of linear layers that map the sentence vectors into a latent space, $W(z)$ is the feature map of the image at the current generator layer, and $\alpha_n$ is calculated given $W(z)$ and $X$. (See the attention sketch after the claims.)
- 4. An associative text-to-image generation device based on a generative pre-trained language model, characterized by comprising a processor and a storage medium; the storage medium is used to store instructions; and the processor is operative, according to the instructions, to perform the steps of the method according to any one of claims 1 to 3.
- 5. A storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of any one of claims 1 to 3.
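The following is a minimal, hypothetical Python sketch of the two text-side components of claim 1: the step-S11 sentence-string construction and the clipped n-gram precision behind the $bleu_a$/$bleu_b$/$bleu_c$ indices of step S2. All names (`build_sentence_string`, `modified_ngram_precision`) are illustrative assumptions, not the patented implementation.

```python
# Illustrative sketch of claim 1's text-side components; not the patented code.
import random
from collections import Counter

SEP, INIT = "#", "$"  # separator and initiator, as defined in step S11

def build_sentence_string(sentences: list[str]) -> str:
    """Arrange the ten captions of one image into a sentence string:
    three randomly drawn sentences first, then the rest in their original
    order (assumes the ten captions are distinct strings)."""
    assert len(sentences) == 10
    head = random.sample(sentences, 3)              # random-initialization part
    tail = [s for s in sentences if s not in head]  # sequential concatenation part
    return INIT + SEP.join(head + tail) + INIT

def ngrams(tokens: list[str], n: int) -> Counter:
    """Count all n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_ngram_precision(candidates: list[str],
                             references: list[str],
                             n: int = 1) -> float:
    """Clipped n-gram precision over all candidate sentences, as in BLEU:
    the numerator uses Count_clip (candidate counts clipped by the maximum
    reference count), the denominator the raw candidate counts."""
    ref_counts = Counter()
    for r in references:
        ref_counts |= ngrams(r.split(), n)          # keep the max count per n-gram
    clipped, total = 0, 0
    for c in candidates:
        cand = ngrams(c.split(), n)
        total += sum(cand.values())                 # Count(n-gram')
        clipped += sum(min(v, ref_counts[g]) for g, v in cand.items())  # Count_clip
    return clipped / total if total else 0.0
```

Under this sketch, a generated data set would be kept as the associative text data set when the ratios of its three indices to those of the original data set stay consistent, as claim 1 requires.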
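Claim 2's generator and discriminator objectives follow the standard DF-GAN hinge formulation; the sketch below assumes `D` and `G` are PyTorch callables and the batching is illustrative only.

```python
# Hedged sketch of the claim-2 losses L_D and L_G (DF-GAN hinge form).
import torch

def discriminator_loss(D, x_real, e_match, e_mismatch, x_fake):
    """L_D: hinge loss over real matched pairs, generated pairs, and
    real-but-mismatched pairs (the latter two weighted by 1/2)."""
    real = torch.relu(1.0 - D(x_real, e_match)).mean()
    fake = torch.relu(1.0 + D(x_fake.detach(), e_match)).mean()
    mism = torch.relu(1.0 + D(x_real, e_mismatch)).mean()
    return real + 0.5 * fake + 0.5 * mism

def generator_loss(D, x_fake, e_match):
    """L_G: the generator maximizes the discriminator score on its output."""
    return -D(x_fake, e_match).mean()
```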
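For claim 3, the attention weight $\alpha_n$ can be read as a softmax over a score between the mapped sentence vectors and the current feature map $W(z)$; the dot-product score and the two linear maps below are assumptions for illustration, not taken from the patent.

```python
# Minimal sketch of the claim-3 attention weights alpha_n.
import torch
import torch.nn.functional as F

def sentence_attention(feat_map: torch.Tensor,   # (d,) pooled features of W(z)
                       sent_vecs: torch.Tensor,  # (N, d) the N input sentence vectors X
                       W_img: torch.nn.Linear,   # maps image features to latent space
                       W_txt: torch.nn.Linear    # maps sentence vectors to latent space
                       ) -> torch.Tensor:
    """Return alpha (N,), the attention weight of each sentence given W(z)."""
    q = W_img(feat_map)              # image query in the latent space
    k = W_txt(sent_vecs)             # sentence keys in the same space
    scores = k @ q                   # s(W(z), x_n): dot-product score per sentence
    return F.softmax(scores, dim=0)  # alpha_n, summing to 1 over the N sentences
```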
Description
Associative text-to-image generation method based on a generative pre-trained language model

Technical Field

The invention relates to the technical field of image generation, and in particular to a method for generating an associative text-to-image based on a generative pre-trained language model.

Background Art

With the development of multimedia technology, our experience of the world is gradually shifting from single-modal to multi-modal. In brief, multimodal refers to information of multiple modalities, including text, images, video, audio, etc.; multimodal research, as the name suggests, concerns the fusion of these different types of data. Text-to-image generation is a promising and increasingly important task in multimodal machine learning and deep learning. The task has useful applications in image editing, video editing, stylized generation, and user-personalized customization, and may also assist design work in the future. For example, users may enter a textual request to draw an image they desire, or to help them complete professional tasks such as designing clothing or modifying a layout.

In some existing applications, the training set and the test set come from the CUB2011 bird data set, with 10 descriptions for each CUB image. Recent methods observe that, in the past, only one text was selected as input to generate a matching target image; since one text often describes only part of an image, the generation process becomes complex and the image quality low. It is therefore necessary to retrieve text close to the input text as additional input to enrich the text information. However, expanding the amount of text by searching for similar texts defeats the purpose of filling in what a single text cannot describe. In addition, existing methods limit the search range to the corresponding ten sentences, yet the texts in the data set suffer from mis-description and incomplete description, and external knowledge absent from the original text is needed to describe the image better. Moreover, existing methods apply a self-attention mechanism only at the text level, omitting the interaction of image information and text information in the cross-modal task.

Disclosure of Invention

The invention aims to provide an associative text-to-image generation method based on a generative pre-trained language model, which associates and generates richer text information for text-to-image generation by introducing a generative pre-trained model, and which employs a complementary fine-tuning method and an adversarial generation network built with cross-modal attention over text and image, so that the generated image has better semantic consistency.
The technical scheme adopted by the invention is as follows. In a first aspect, there is provided a method for generating an associative text-to-image based on a generative pre-trained language model, comprising: step S1, fine-tuning the generative pre-trained model on a data set so that the pre-trained model acquires the existing text information with good semantic retention, obtaining a fine-tuned pre-trained model; step S2, taking the ten sentences corresponding to each image in the original data set as the input of the fine-tuned pre-trained model obtained in step S1 to obtain a generated data set output by the model; and step S3, based on the associative text data set obtained in step S2, generating an image whose text-image cross-modal semantic features are consistent by using a DF-GAN-based adversarial generation network model.

In some embodiments, step S1 includes: step S11, acquiring a data set, and arranging the ten sentences corresponding to each image in the data set into sentence strings; and step S12, inputting the sentence strings of the data set into a pre-trained model for training and fine-tuning to obtain a fine-tuned pre-trained model (a sketch of this fine-tuning objective follows below).

In some embodiments, in step S11, the data set comprises a plurality of images, each image corresponding to ten sentences, and the ten sentences corresponding to each image are arranged into a sentence string according to the following rule: the sentence string is arranged as "$sentence a # sentence b # sentence c # … # sentence 9 # sentence 10$". The sentence string is divided into two parts: the first part is a random initialization, sentences a, b and c being three sentences randomly drawn from the ten sentences corresponding to one image; the second part is the sequential concatenation of the remaining sentences, where "#" and "$" are the separator and the initiator respectively. GPT-2 generates a structured sentence string; the separator facilitates disassembling the generated sentence string, and the initiator is used to prevent the model from generating sentence strings that are too long or too short.
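The following is a hedged sketch of the step-S12 fine-tuning objective $L_3 = L_2 + \lambda L_1$, assuming the HuggingFace `transformers` GPT-2. Treating $L_2$ as the language-model loss restricted to the first-sentence tokens is an illustrative approximation of "predicting the first sentence as the class label", not the patent's exact formulation.

```python
# Illustrative sketch of L3 = L2 + lambda * L1 for GPT-2 fine-tuning.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def combined_loss(sentence_string: str, first_sentence: str,
                  lam: float = 0.5) -> torch.Tensor:
    ids = tok(sentence_string, return_tensors="pt").input_ids
    # L1: standard autoregressive LM loss over the whole sentence string.
    l1 = model(ids, labels=ids).loss
    # L2 (approximation): LM loss masked to the first-sentence tokens only;
    # label value -100 tells the loss to ignore a position.
    n_first = tok(first_sentence, return_tensors="pt").input_ids.size(1)
    labels = ids.clone()
    labels[:, n_first:] = -100
    l2 = model(ids, labels=labels).loss
    return l2 + lam * l1  # L3 = L2 + lambda * L1
```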