CN-121982151-A - Feature optimization method in text-to-image generation model
Abstract
The application belongs to the field of text-to-image generation and discloses a feature optimization method in a text-to-image generation model, comprising the following steps: S1, an input text prompt is sent to a pre-trained text encoder for processing, and an initial text embedding feature representation is obtained; S2, an embedding distinguishing mechanism is designed based on the geometric relationships between the noun embedding features of the target objects in the initial text embedding features, and the embedding feature representation of the original text prompt is optimized. The application thus provides a feature optimization method comprising an embedding distinguishing mechanism and an attention separation mechanism, so as to better promote the semantic consistency and overall image quality of generated images in multi-object scenes.
Inventors
- CAO XIAOPENG
- ZHANG TIAN
Assignees
- Xi'an University of Posts and Telecommunications (西安邮电大学)
Dates
- Publication Date
- 20260505
- Application Date
- 20260129
Claims (8)
- 1. A method of feature optimization in a text-to-image generation model, the method comprising: S1, sending an input text prompt to a pre-trained text encoder for processing to obtain an initial text embedding feature representation; S2, designing an embedding distinguishing mechanism based on the geometric relationships between the noun embedding features of the target objects in the initial text embedding features, and optimizing the embedding feature representation of the original text prompt; step S2 comprising: S21, parsing the original text prompt with a natural language processing tool, extracting sub-prompts each containing a target object, and inputting each sub-prompt into the text encoder to obtain the corresponding sub-prompt embedding feature representation; S22, calculating the optimization direction variable of the embedding features and the geometric relation variable between the target-object embedding features, based on the original prompt embedding and the sub-prompt embeddings of the target objects; S23, adjusting the original text embedding features along the optimization direction according to the geometric relation variable between the target-object embedding features, to obtain the optimized text embedding feature representation; S3, inputting the optimized text embedding features as conditional signals into a denoising network, and introducing an attention separation mechanism in the reverse diffusion denoising process to guide image generation; step S3 comprising: S31, taking the optimized text embedding features as condition input and guiding image generation through a cross-attention mechanism during denoising; and S32, applying attention guidance in the first half of the denoising process through the attention separation mechanism with a specific loss function, further optimizing image generation.
- 2. The method of feature optimization in a text-to-image generation model of claim 1, wherein the pre-trained text encoder is the CLIP text encoder used in Stable Diffusion: a text prompt p is mapped by the CLIP text encoder τ to a text embedding feature representation c = τ(p); the embedding feature c is used as a conditional signal and injected into the U-Net denoising network through a cross-attention mechanism to guide image generation during the diffusion process.
- 3. The method for feature optimization in a text-to-image generation model according to claim 1, wherein said step S21 comprises: S211, parsing the original prompt text with the natural language processing tool Stanza to obtain sub-prompt texts, each containing a single object noun while keeping the original context; S212, sending each sub-prompt to the text encoder to obtain its text embedding feature representation.
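The sub-prompt construction of claim 3 can be sketched in pure Python. The patent uses Stanza to find object noun phrases; to keep this example self-contained, the noun-phrase spans are supplied directly, and both the function name and the span format are illustrative assumptions rather than the patent's implementation.

```python
# Pure-Python sketch of step S21: for each target object, build a sub-prompt
# that keeps the shared context but removes the other objects' noun phrases.
def sub_prompts(words, object_spans):
    """words: token list; object_spans: {object: (start, end)} half-open
    spans of each object noun phrase. For each object, remove the other
    objects' phrases (and a joining 'and') while keeping shared context."""
    out = {}
    for name in object_spans:
        drop = set()
        for other, (os, oe) in object_spans.items():
            if other == name:
                continue
            drop.update(range(os, oe))
            if os > 0 and words[os - 1] == "and":
                drop.add(os - 1)         # "... and a dog" -> drop the "and"
            elif oe < len(words) and words[oe] == "and":
                drop.add(oe)             # "a cat and ..." -> drop the "and"
        out[name] = " ".join(w for i, w in enumerate(words) if i not in drop)
    return out

words = "a cat and a dog on the grass".split()
result = sub_prompts(words, {"cat": (0, 2), "dog": (3, 5)})
# result["cat"] == "a cat on the grass"; result["dog"] == "a dog on the grass"
```

In the actual method, Stanza's POS tags and dependency parse would supply these spans automatically.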
- 4. The method for feature optimization in a text-to-image generation model according to claim 1, wherein said step S22 comprises: S221, calculating the optimization direction variable of the embedding feature from the original text embedding representation and the sub-prompt embedding representation, with the formula Δe_i = e_i^sub − e_i, wherein e_i represents the text embedding feature of target object o_i in the original prompt context and e_i^sub represents the embedding feature of the same target object o_i in its corresponding sub-prompt context; S222, calculating the Euclidean distance, cosine similarity, and UMAP and t-SNE scores between the embedding features of two object nouns in the original text embedding to obtain the geometric relation variable CS between the object embeddings, with the formula CS_ij = λ1·d̃(e_i, e_j) + λ2·cos(e_i, e_j) + λ3·s(e_i, e_j), wherein o_i and o_j are the two objects of interest, d̃(e_i, e_j) and cos(e_i, e_j) are the normalized Euclidean distance and cosine similarity, s(e_i, e_j) is the local semantic similarity score computed from low-dimensional neighborhood relationships, and λ1, λ2, λ3 are weight parameters set according to experiments.
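A minimal numpy sketch of the geometric-relation variable CS described in claim 4 (step S222), combining an inverted normalized Euclidean distance with cosine similarity. The UMAP/t-SNE-based local neighborhood score is passed in as a precomputed value, and the weights are hypothetical placeholders, not values from the patent.

```python
import numpy as np

# Illustrative CS score between two object embeddings (step S222).
# Higher CS = geometrically closer / more confusable embeddings.
def geometric_relation_score(e_i, e_j, local_score=0.0, w=(0.3, 0.5, 0.2)):
    ni, nj = np.linalg.norm(e_i), np.linalg.norm(e_j)
    dist_norm = np.linalg.norm(e_i - e_j) / (ni + nj + 1e-8)   # in [0, 1]
    cos = float(e_i @ e_j / (ni * nj + 1e-8))
    return w[0] * (1.0 - dist_norm) + w[1] * cos + w[2] * local_score
```

Identical embeddings score w[0] + w[1]; dissimilar pairs score lower, which is what lets the later steps apply a weaker constraint to already well-separated objects.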
- 5. The method for optimizing features in a text-to-image generation model according to claim 1, wherein in step S23, the optimized text embedding representation is obtained by adjusting the original text embedding along the embedding optimization direction according to the geometric relation variable between the target-object embeddings, with the formula e_i' = e_i + γ·CS_ij·Δe_i, wherein e_i is the text embedding feature of target object o_i in the original prompt, CS_ij is the geometric relation variable between the target-object embeddings, γ controls the offset amplitude, and e_i' is the enhanced target-object embedding.
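Steps S221 and S23 taken together amount to a small vector update, sketched below: the optimization direction is the offset from the original-context embedding toward the sub-prompt embedding, scaled by the relation variable CS and an offset-amplitude coefficient gamma. The names follow the claim prose; the exact update rule appears only in the patent figures, so this is an assumed form.

```python
import numpy as np

# Assumed form of the embedding adjustment (steps S221 and S23).
def adjust_embedding(e_orig, e_sub, cs, gamma=0.5):
    direction = e_sub - e_orig              # S221: optimization direction
    return e_orig + gamma * cs * direction  # S23: enhanced object embedding
```

When two object embeddings are highly confusable (large CS), the update pushes each one further toward its isolated sub-prompt semantics; well-separated objects are left almost unchanged.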
- 6. The method of claim 1, wherein the denoising network is based on Stable Diffusion. The core idea of the model is to map the high-dimensional pixel space to a low-dimensional latent space and learn the diffusion process in the latent space, significantly reducing computation cost and improving generation efficiency. Given an image x, it is encoded into a latent representation z_0 = E(x) by a variational autoencoder (VAE); in the latent space, the forward diffusion process gradually adds Gaussian noise to z_0, obtaining a series of latent variables z_1, ..., z_T; the reverse process is then learned by a parameterized denoising network ε_θ(z_t, t, c), wherein c is the conditional text embedding feature, typically obtained by a frozen CLIP text encoder; let the vector corresponding to the conditional prompt be c = τ(p). The objective of the denoising network ε_θ is to minimize the loss function L = E_{z_0, ε, t}[‖ε − ε_θ(z_t, t, c)‖²], wherein ε is the added Gaussian noise and ε_θ(z_t, t, c) is the noise predicted by the denoising network; the objective is to minimize the mean square error between the predicted and real noise. During the sampling phase, the model starts from Gaussian noise z_T ~ N(0, I), gradually removes the noise with the denoising network, and finally obtains a latent variable z_0 close to the original data distribution, which is reconstructed into an image by the VAE decoder D.
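The latent-diffusion training objective in claim 6 reduces to a few lines: add Gaussian noise to a latent z_0 at timestep t, ask the denoiser to predict that noise, and take the mean squared error. Below is a toy numpy illustration; `denoise` stands in for the U-Net ε_θ, and the alpha-bar schedule is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration of the latent-diffusion MSE objective (claim 6).
def diffusion_mse_loss(z0, t, cond, denoise, alpha_bar):
    eps = rng.standard_normal(z0.shape)                        # true noise
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1 - alpha_bar[t]) * eps
    eps_hat = denoise(zt, t, cond)                             # predicted noise
    return float(np.mean((eps - eps_hat) ** 2))

alpha_bar = np.linspace(1.0, 0.01, 10)   # toy noise schedule
# a trivial "denoiser" that always predicts zero noise, just to run the formula
loss = diffusion_mse_loss(np.zeros(8), t=5, cond=None,
                          denoise=lambda z, t, c: np.zeros_like(z),
                          alpha_bar=alpha_bar)
```

A real model replaces the lambda with a conditioned U-Net and minimizes this loss by gradient descent; the proposed method leaves those weights frozen and only intervenes through the condition c and the attention maps.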
- 7. The method for feature optimization in a text-to-image generation model according to claim 1, wherein in step S31, the cross-attention mechanism injects text condition information into the feature space of the diffusion model. Let the text embedding feature matrix output by the CLIP text encoder be C ∈ R^{L×d}, wherein L is the length of the text sequence and d is the text feature dimension; the text embedding features are projected into keys (Key) and values (Value) through the linear transformations K = C·W_K and V = C·W_V. The intermediate U-Net representation of the diffusion model at time step t is denoted φ_t ∈ R^{N×d'}, wherein N is the flattened length of the spatial feature map; it serves as the query (Query) via the linear projection Q = φ_t·W_Q. The cross-attention computation is defined as Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V, wherein the projection matrices W_Q, W_K, W_V are learnable parameters and d_k is the feature dimension of the attention head; the resulting attention matrix A = softmax(Q·K^T / √d_k) depicts the strength of semantic correspondence between each position in the image space and each text token.
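The cross-attention computation in claim 7 can be sketched directly in numpy: text embeddings C (L×d) give keys and values, the flattened U-Net feature map φ (N×d) gives queries, and each row of the attention matrix A is a softmax over text tokens. All dimensions and weight matrices below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of cross-attention: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
def cross_attention(phi, C, Wq, Wk, Wv):
    Q, K, V = phi @ Wq, C @ Wk, C @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))   # stable softmax
    A /= A.sum(axis=-1, keepdims=True)                        # attention matrix, N x L
    return A @ V, A

N, L, d, dk = 6, 4, 8, 5
phi = rng.standard_normal((N, d))      # flattened U-Net features (queries)
C = rng.standard_normal((L, d))        # text token embeddings (keys/values)
Wq, Wk, Wv = (rng.standard_normal((d, dk)) for _ in range(3))
out, A = cross_attention(phi, C, Wq, Wk, Wv)
```

Each row of A sums to 1, so A[:, i] can be read as the spatial attention map of token i; these per-token maps are exactly what the attention separation mechanism of claim 8 operates on.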
- 8. The method according to claim 1, wherein in step S32, the specific loss function is a regularization term on the attention features combined with the geometric relation variable between objects, L_obj = Σ_{i∈S} μ_i·E(A_i), together with a reconstruction term ‖A' − A‖², wherein A and A' respectively represent the original and adjusted attention feature maps, S is the target object set, the second term serves as a reconstruction constraint limiting the adjustment amplitude, the error term E(A_i) evaluates the quality of the attention feature of target object o_i, A_i represents the normalized attention map of target object o_i in the cross-attention layer, μ_i is the confusion score, and β is the adjustment coefficient; the attention-distinguishing strength is adaptively adjusted according to the semantic proximity between objects: the more similar the semantics, the stronger the distinguishing constraint applied; otherwise the constraint is gradually weakened, avoiding excessive intervention. The joint objective loss function of the overall optimization is L_attn = L_obj + β·‖A' − A‖².
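A hedged sketch of the attention-separation loss in claim 8: each target object's normalized attention map gets an error term weighted by its confusion score μ_i, plus a reconstruction term limiting how far the adjusted maps drift from the originals. The error term used here (1 minus the peak attention mass, so poorly localized attention scores higher) is an assumption; the patent defines E(A_i) only in its figures.

```python
import numpy as np

# Assumed form of the joint attention-separation objective (claim 8).
def separation_loss(A_orig, A_adj, confusion_scores, beta=0.1):
    sep = sum(mu * (1.0 - float(A_adj[i].max()))   # E(A_i): 1 - peak mass
              for i, mu in enumerate(confusion_scores))
    recon = float(np.mean((A_adj - A_orig) ** 2))  # reconstruction constraint
    return sep + beta * recon
```

Because μ_i grows with semantic proximity (via CS), confusable objects are penalized more strongly for diffuse attention, while distinct objects are barely constrained, matching the adaptive behavior the claim describes.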
Description
Feature optimization method in text-to-image generation model
Technical Field
The invention belongs to the technical field of text-to-image generation, and particularly relates to a feature optimization method in a text-to-image generation model.
Background
Text-to-image synthesis is an important research task in the field of multi-modal artificial intelligence. Its core aim is to enhance consistency between natural-language understanding and visual content generation, providing key technical support for cross-modal semantic alignment and generative modeling. The task not only promotes unified modeling and expression of linguistic and visual heterogeneous modalities in theory, but also significantly reduces the cost of visual content creation in practice, serving application scenarios such as intelligent design, digital content production, and art-assisted creation. Text-to-image generation research has largely evolved from methods based on generative adversarial networks (GANs) to methods based on diffusion models. Early research mostly adopted GANs, realizing the mapping from text semantics to image space through adversarial training of a generator and a discriminator, and improving generation resolution and detail quality by means of staged generation and cross-modal attention mechanisms. However, such methods have inherent deficiencies in training stability, sample diversity, and text-image semantic alignment, limiting further improvement of generation performance. With the rise of diffusion models, the research focus has gradually shifted to a generation paradigm based on gradual denoising; this approach markedly improves generation quality and semantic consistency through a stable training mechanism, and combines strong text encoders with cross-modal attention to achieve more accurate text-conditional guidance.
On this basis, to improve the controllability of generated results, related research has introduced structural prior conditions such as layout, segmentation, and edges to enhance spatial and structural constraints, and has achieved personalized, style-consistent generation through model adaptation and parameter-efficient fine-tuning. However, such methods typically rely on additional training procedures and face the problems of high computational overhead and limited generalization capability. Therefore, in recent years, training-free diffusion regulation methods have gradually emerged, which directly adjust attention feature distributions or text embedding representations at the inference stage, achieving multi-object semantic consistency and effective control of generation behavior without updating model parameters, and providing a new research direction for efficient and flexible text-to-image generation.
Disclosure of Invention
The invention aims to provide a feature optimization method in a text-to-image generation model, which solves the problem of semantic confusion arising from the geometric relationships of embedding features in existing generation methods.
The technical scheme provided by the application is a feature optimization method in a text-to-image generation model, comprising the following steps: S1, sending an input text prompt to a pre-trained text encoder for processing to obtain an initial text embedding feature representation; S2, designing an embedding distinguishing mechanism based on the geometric relationships between the target-object nouns in the initial text embedding features, and optimizing the embedding representation of the original text prompt; step S2 comprises: S21, parsing the original text prompt with a natural language processing tool, extracting sub-prompts each containing a target object, and inputting each sub-prompt into the text encoder to obtain the corresponding sub-prompt embedding representation; S22, calculating the embedding optimization direction variable and the geometric relation variable between target-object embeddings, based on the original prompt embedding and the sub-prompt embeddings of the target objects; S23, adjusting the original text embedding features along the embedding optimization direction according to the geometric relation variable between target-object embeddings, to obtain the optimized text embedding feature representation; S3, inputting the optimized text embedding features as conditional signals into a denoising network, and introducing an attention separation mechanism in the reverse diffusion denoising process to guide image generation; step S3 comprises: S31, taking the optimized text embedding features as condition input and guiding image generation through a cross-attention mechanism during denoising; S32, applying attention guidance in the first half of the denoising process through the attention separation mechanism with a specific loss function, further optimizing image generation.
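The steps S1-S3 above can be sketched end to end with toy stand-ins for every component (the real method uses CLIP, Stanza, and a latent-diffusion U-Net; none of those appear here). Embeddings are dicts mapping object names to vectors, and `denoise` is a placeholder for the attention-guided reverse process.

```python
import numpy as np

# Toy end-to-end sketch of the pipeline: optimize per-object embeddings
# (S22-S23), then hand them to a conditional denoising stand-in (S3).
def optimize_and_generate(obj_embs, sub_embs, relation, adjust, denoise):
    names = list(obj_embs)
    optimized = {}
    for name in names:                     # S22-S23 for each target object
        others = [o for o in names if o != name]
        cs = max(relation(obj_embs[name], obj_embs[o]) for o in others)
        optimized[name] = adjust(obj_embs[name], sub_embs[name], cs)
    return denoise(optimized)              # S3: conditional denoising

embs = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.9, 0.1])}
subs = {"cat": np.array([1.0, -0.5]), "dog": np.array([0.5, 0.8])}
cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
shift = lambda e, s, cs: e + 0.5 * cs * (s - e)     # assumed update rule
image = optimize_and_generate(embs, subs, cos, shift, lambda e: sum(e.values()))
```

The structure mirrors the claims: the relation score between the two nearly parallel embeddings is high, so both are pulled noticeably toward their sub-prompt embeddings before conditioning.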