
CN-116245967-B - Text image generation method and system based on local detail editing

CN 116245967 B

Abstract

The invention relates to a text-to-image generation method and system based on local detail editing. The method comprises the following steps: 1. The input text is divided into a plurality of independent object-attribute descriptions by a syntax parser. 2. An initial image is generated with a generative adversarial network (GAN), and the initial image is mapped into the latent space of the GAN. 3. A feature localization module finds the regions of the initial image that correspond to the object-attribute descriptions and, based on an attention mechanism, generates corresponding attention maps and feature maps. 4. The latent vector is modified according to the feature maps and attention maps and fed back into the GAN, yielding an image that conforms to the fine-grained text description. A shape loss, an attention loss, and a discriminator loss are designed to control the shape, local features, and texture details of the object in three stages, so that controllable editing of image details is achieved. The invention can automatically edit the details of the generated image according to the input text and produce images of higher quality and greater diversity than prior methods.

Inventors

  • PENG YUXIN
  • DENG ZIJUN
  • HE XIANGTENG

Assignees

  • Peking University

Dates

Publication Date
2026-05-12
Application Date
2022-12-23

Claims (9)

  1. A text-to-image generation method based on local detail editing, comprising the steps of: dividing the input text into a plurality of independent object-attribute descriptions through a syntax parser; generating an initial image with a generative adversarial network (GAN), and mapping the initial image into the latent space of the GAN; finding, through a feature localization module, the regions of the initial image that correspond to the object-attribute descriptions, and generating corresponding attention maps and feature maps based on an attention mechanism; and modifying the latent vector according to the feature maps and attention maps and feeding it back into the GAN to obtain an image that conforms to the fine-grained text description; wherein the feature localization module uses a deduplication algorithm to eliminate overlapping regions among the attention maps: the algorithm first screens the attention maps according to their overlap with the other attention maps, then combines the remaining attention maps by taking a pointwise maximum to obtain a global attention map, and finally removes the overlapping regions by a formula (not reproduced in this text) in which one term denotes the region of an attention map whose values are smaller than those of another map and ∩ denotes the overlap region of two attention maps (an illustrative sketch of this deduplication step is given after the claims).
  2. The method of claim 1, wherein the syntax parser comprises a text chunking model and a parse-tree parser; the parser first uses the text chunking model to divide the text into non-overlapping phrases and obtain the noun phrases in the text description; after the noun phrases are obtained, it uses the parse-tree parser to obtain the grammatical structure of the input text and merges each core noun phrase of the sentence with its adjacent verb phrases, prepositional phrases, and noun clauses to form independent object-attribute descriptions; the merging algorithm specifically defines which sentence constituents qualify as an independent attribute description, then traverses the parse tree from bottom to top, and, when a node is reached, places the node into the division result if it meets the criterion, otherwise merges the node with its sibling nodes until the criterion is met (a sketch of this bottom-up merging is given after the claims).
  3. The method of claim 1, wherein mapping the initial image to the latent space of the generative adversarial network comprises first randomly sampling a vector t in the latent space of the generative adversarial network and then modifying the vector using the following mapping loss function L_proj: L_proj = ||F(x) − F(G(t))||², where x represents the initial image, G(·) represents the generator of the generative adversarial network, and F(·) represents a feature extraction model; this loss function makes the image generated by the adversarial network from t as close as possible to the initial image x, thereby mapping x into the latent space (an inversion sketch is given after the claims).
  4. The method of claim 1, wherein screening the attention maps according to their overlap with the other attention maps comprises: for the i-th attention map, its salient region is denoted a_i, and the deduplication algorithm first computes the global attention region A_{i-1} of the previous i−1 maps: A_{i-1} = a_1 ∪ a_2 ∪ ... ∪ a_{i-1}, where ∪ means that each point of the two attention maps takes the larger value; attention map i is then screened according to two conditions (their formulas are not reproduced in this text), where S(·) denotes the area of the salient region of an attention map, a_i > A_{i-1} denotes the region where the values of a_i exceed those of A_{i-1}, and ∩ denotes the overlap region of two attention maps; the first condition requires the attention region to contain a discriminative local feature, with α as the discrimination threshold, and the second condition requires the attention region not to overlap excessively with other regions, with β as the overlap ratio threshold.
  5. The method of claim 1, wherein modifying the latent vector according to the feature maps and attention maps and feeding it back into the generative adversarial network to obtain an image conforming to the fine-grained text description comprises designing a shape loss, an attention loss, and a discriminator loss to control the shape, local features, and texture details of the object, thereby realizing controllable editing of the image details, wherein the shape loss L_s, the attention loss L_a, and the discriminator loss L_d are defined as follows: L_s = ||F(x) − F(t)||²; L_a = Σ_i ||F(x·mask_i) − F(t_i·mask_i)||²; L_d = softplus(−D(x)); where F(·) represents a feature extraction model, x represents the edited image, t represents the detail-editing target image, t_i and mask_i represent the i-th feature map and attention map respectively, D is the discriminator of the generative adversarial network, and the softplus function is defined as softplus(x) = log(1 + eˣ) (see the loss-function sketch after the claims).
  6. The method of claim 5, wherein the controllable editing of image details is divided into three stages (see the three-stage sketch after the claims): in the first stage, the target image t is set to a reference image, stitched together from the feature maps and attention maps of the feature localization module, that contains the features of all attributes in the text description, and the shape loss L_s corrects the shape of the object, the loss function L being expressed as L = L_s; in the second stage, the target image t is unchanged, and the attention loss L_a is added to correct local image features, the loss function L being expressed as L = L_s + λ_a·L_a, where λ_a is a hyper-parameter balancing the shape loss and the attention loss; in the third stage, the target image t is updated to the last modified image of the second stage, and a discriminator loss is introduced to refine the texture details of the image and make it more realistic, the loss function L being expressed as L = L_s + λ_d·L_d, where λ_d is a hyper-parameter balancing the shape loss and the discriminator loss.
  7. A text-to-image generation system based on local detail editing that employs the method of any one of claims 1-6, comprising: a syntax parsing module, responsible for dividing an input text into a plurality of independent object-attribute descriptions; an image generation module, responsible for feeding the object-attribute descriptions into a text encoder and a generative adversarial network, generating an initial image, and mapping the initial image into the latent space of the network; a feature localization module, responsible for finding the regions of the initial image that correspond to the object-attribute descriptions and generating feature maps and attention maps based on an attention mechanism; and a local detail editing module, responsible for modifying the local regions of the initial image according to the feature maps and attention maps so that they conform to the fine-grained text description.
  8. A computer device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any one of claims 1-6.
  9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a computer, implements the method of any one of claims 1-6.
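
The deduplication step of claims 1 and 4 screens the per-attribute attention maps, merges the retained ones by a pointwise maximum, and then suppresses overlapping responses. Below is a minimal NumPy sketch of one plausible reading of that procedure; because the claims' formulas are not reproduced in this text, the saliency cutoff of 0.5, the thresholds `alpha` and `beta`, and the final overlap-suppression rule are illustrative assumptions, not the patented equations.

```python
import numpy as np

def deduplicate_attention(attn_maps, alpha=0.2, beta=0.5):
    """Screen attention maps and merge the survivors by pointwise maximum.

    attn_maps: list of 2-D arrays in [0, 1], one per object-attribute description.
    alpha, beta: illustrative discrimination / overlap thresholds (claim 4 uses
    symbols with the same roles, but its exact inequalities are not given here).
    """
    kept, global_attn = [], np.zeros_like(attn_maps[0])
    for a in attn_maps:
        salient = a > 0.5                                  # assumed saliency cutoff
        prev = global_attn > 0.5                           # union of previously kept maps
        distinct = (a > global_attn) & salient             # pixels where this map dominates
        area = max(salient.sum(), 1)
        if distinct.sum() / area >= alpha and (salient & prev).sum() / area <= beta:
            kept.append(a)
            global_attn = np.maximum(global_attn, a)       # "maximum value combining"
    # suppress overlap: keep each retained map only where it matches the global maximum
    kept = [np.where(a >= global_attn, a, 0.0) for a in kept]
    return kept, global_attn
```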
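
Claim 2's merging algorithm traverses the parse tree bottom-up and either accepts a node as an independent object-attribute description or merges it upward with its siblings. The sketch below assumes a simple nested-tuple parse tree and a user-supplied predicate `is_description`; it illustrates the traversal order only and is not the patent's actual parser, which the description says is built from a text chunking model and a parse-tree parser.

```python
# A node is (label, [children]) for internal nodes or (label, token) for leaves.
def split_descriptions(node, is_description):
    """Bottom-up traversal: emit a node as an independent object-attribute
    description if it satisfies the criterion, otherwise merge it upward
    with its siblings and test the merged parent instead."""
    label, payload = node
    if not isinstance(payload, list):              # leaf: defer the decision to the parent
        return [], [node]
    accepted, pending = [], []
    for child in payload:
        acc, pend = split_descriptions(child, is_description)
        accepted.extend(acc)
        pending.extend(pend)
    merged = (label, pending)                      # siblings merged under this node
    if pending and is_description(merged):
        accepted.append(merged)
        return accepted, []
    return accepted, pending                       # keep merging at the next level up

# Hypothetical criterion: a noun-phrase node with a verb, prepositional, or
# clausal modifier counts as an independent description.
def is_description(node):
    label, children = node
    child_labels = {c[0] for c in children} if isinstance(children, list) else set()
    return label == "NP" and bool(child_labels & {"VP", "PP", "SBAR"})
```

Any nodes still pending when the root is reached can simply be emitted as one final description.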
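
Claim 3 maps the initial image x into the GAN's latent space by optimizing a randomly sampled vector t so that G(t) matches x under a feature extractor F, using L_proj = ||F(x) − F(G(t))||². A minimal PyTorch-style sketch of that inversion loop follows; the generator `G`, feature extractor `F`, latent dimension, learning rate, and step count are assumptions, not values from the patent.

```python
import torch

def invert_to_latent(x, G, F, latent_dim=512, steps=500, lr=0.05):
    """GAN inversion by gradient descent on the mapping loss
    L_proj = ||F(x) - F(G(t))||^2 from claim 3."""
    t = torch.randn(1, latent_dim, requires_grad=True)   # randomly sampled latent vector
    opt = torch.optim.Adam([t], lr=lr)
    feat_x = F(x).detach()                               # target features of the initial image
    for _ in range(steps):
        opt.zero_grad()
        loss = ((feat_x - F(G(t))) ** 2).sum()           # squared L2 mapping loss
        loss.backward()
        opt.step()
    return t.detach()
```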
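
The three losses of claim 5 can be written directly from their definitions: a global shape loss in feature space, an attention-masked local loss summed over attributes, and a softplus discriminator loss. The PyTorch sketch below follows those formulas; how F, D, t_i, and mask_i are produced is outside this snippet and assumed to be given.

```python
import torch.nn.functional as Fn

def shape_loss(F, x, t):
    # L_s = ||F(x) - F(t)||^2
    return ((F(x) - F(t)) ** 2).sum()

def attention_loss(F, x, feature_maps, attention_masks):
    # L_a = sum_i ||F(x * mask_i) - F(t_i * mask_i)||^2
    return sum(((F(x * m) - F(t_i * m)) ** 2).sum()
               for t_i, m in zip(feature_maps, attention_masks))

def discriminator_loss(D, x):
    # L_d = softplus(-D(x)), with softplus(z) = log(1 + e^z)
    return Fn.softplus(-D(x)).sum()
```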
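
Claim 6 applies these losses in three stages: shape only, then shape plus attention, then shape plus discriminator with the target updated to the stage-two result. A schematic optimization loop is sketched below, reusing the hypothetical loss helpers above; the per-stage step counts and the weights λ_a and λ_d are placeholders, not values from the patent.

```python
import torch

def three_stage_edit(t, G, F, D, target, feature_maps, masks,
                     lam_a=1.0, lam_d=0.1, steps=(200, 200, 200), lr=0.05):
    """Three-stage detail editing (claim 6): stage 1 corrects shape, stage 2 adds
    the attention loss for local features, stage 3 swaps the target for the
    stage-2 result and adds the discriminator loss for texture realism."""
    t = t.clone().requires_grad_(True)
    opt = torch.optim.Adam([t], lr=lr)
    for stage, n in enumerate(steps, start=1):
        for _ in range(n):
            opt.zero_grad()
            x = G(t)
            loss = shape_loss(F, x, target)                       # L = L_s
            if stage == 2:
                loss = loss + lam_a * attention_loss(F, x, feature_maps, masks)
            if stage == 3:
                loss = loss + lam_d * discriminator_loss(D, x)    # L = L_s + λ_d·L_d
            loss.backward()
            opt.step()
        if stage == 2:
            target = G(t).detach()        # stage 3 uses the final stage-2 image as target
    return G(t).detach()
```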

Description

Text-to-image generation method and system based on local detail editing

Technical Field

The invention relates to the field of image generation, and in particular to a text-to-image generation method and system based on local detail editing.

Background

Text-to-image generation aims to produce semantically consistent, realistic images from a given natural-language description. In recent years, with the rapid development of generative adversarial network (GAN) technology, existing methods have made great progress toward synthesizing images with realistic content. Because text-to-image generation has wide application prospects, such as visual reading, graphic design, and criminal investigation, the field has become one of the most active research areas in recent years. Text-to-image technology faces two major research challenges: how to ensure semantic consistency between image and text, and how to generate high-definition, realistic images. To ensure image-text semantic consistency, existing methods typically use a text encoder and an image encoder to learn a cross-modal text-image representation; the main idea is to train a fixed-length text encoder and an image encoder simultaneously on text-image pairs, thereby mining the semantic associations within those pairs. In recent years, a series of models has adopted the StackGAN architecture (Zhang Han, et al. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. Proceedings of the IEEE International Conference on Computer Vision, 2017) to generate images, learning cross-modal text-image representations by pre-training an LSTM (Long Short-Term Memory) text encoder and a CNN (Convolutional Neural Network) image encoder to address the semantic-consistency problem. AttnGAN introduces an attention mechanism through the DAMSM (Deep Attentional Multimodal Similarity Model) and discovers, in a semi-supervised manner, the correspondence between local image regions and words in the text (Xu Tao, et al. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018), so that image details remain consistent with the text description. However, the fixed-length encoding adopted by existing methods has difficulty handling highly complex and flexible natural-language descriptions; as a result, these models generally handle a single object attribute well but struggle to understand and encode multiple object attributes in a text description as features. It is therefore difficult for existing methods to satisfy the user's requirements on the details of the generated objects, namely, to maintain consistency between the fine-grained text description and the semantics of local image regions.

Disclosure of the Invention

The invention provides a text-to-image generation method based on local detail editing, which can edit the local details of an image according to an input fine-grained text description, thereby realizing automatic generation of images with controllable detail.
The text-to-image generation method based on local detail editing works on the following principle: the text is first divided into a plurality of independent object-attribute descriptions and an initial image is generated; the local details of the initial image are then modified according to these descriptions, thereby realizing automatic generation of images with controllable detail. To achieve the above purpose, the invention adopts the following technical scheme: a text-to-image generation method based on local detail editing, comprising the steps of: (1) dividing the input text into a plurality of independent object-attribute descriptions through a syntax parser; (2) generating an initial image with a generative adversarial network and mapping the initial image into the latent space of the generative adversarial network; (3) finding, through a feature localization module, the regions of the initial image of step (2) that correspond to the object-attribute descriptions of step (1), and generating corresponding attention maps and feature maps based on an attention mechanism; (4) modifying the latent vector of step (2) according to the feature maps and attention maps, and feeding it back into the generative adversarial network to obtain an image that conforms to the fine-grained text description.

Further, in the above method, the parser of step (1) consists of a text chunking model and a parse-tree parser. Since noun phrases in the text tend to be the core of the object-attribute descriptions, the parser first uses the text chunking model to divide the text into non-overlapping phrases, obtaining the noun phrases in the text description. After the noun phrase is obtained, a parse-tree parser is use