
CN-122023580-A - Text image generation method based on multi-modal contrast learning and memory enhancement


Abstract

The invention discloses a text-to-image generation method based on multi-modal contrastive learning and memory enhancement. The method constructs a multi-granularity memory bank, extracting global and local features with a vision transformer and a feature pyramid network and storing them in a structured data model that includes visual features, text semantic representations, a spatial relationship graph, and metadata fields. Uncertainty-aware retrieval adjusts the retrieval strategy according to the semantic uncertainty and structural complexity of the input text. A memory conflict detection and resolution stage analyzes and resolves potential spatial-relationship conflicts. Spatial-relationship-guided conditional generation fuses the multi-granularity memory features, integrates visual features through a granularity-aware gating mechanism, and combines spatial constraints to produce a conditional feature map. The conditional feature map is injected into a diffusion model and the generation process is optimized with multi-modal contrastive learning, ensuring that the generated image is semantically accurate, reasonably laid out, and satisfies the spatial-relationship and attribute-binding requirements of the text description.

Inventors

  • BAI SHUHUA
  • WEI BAOQUAN
  • LI SULING
  • LUO HUI

Assignees

  • Nanchang Institute of Technology
  • East China Jiaotong University

Dates

Publication Date
2026-05-12
Application Date
2026-01-21

Claims (6)

  1. A method for generating images from text based on multi-modal contrastive learning and memory enhancement, comprising: constructing a multi-granularity memory bank, namely processing large-scale image-text data through a dual-branch visual feature extraction architecture to obtain global semantic features and local structural features, and building structured memory entries combined with a spatial relationship graph; uncertainty-aware retrieval, namely analyzing the semantic uncertainty and structural complexity of the text description input by a user, adjusting the retrieval parameters, and selecting from the memory bank a multi-granularity memory set related to the input at both the semantic and structural levels; memory conflict detection and resolution, namely performing a spatial-relationship consistency check on the selected memory set and resolving potential spatial-relationship conflicts with a differentiated fusion algorithm based on conflict grade, so that the final retrieval result conforms to the spatial constraints of the input description; spatial-relationship-guided conditional generation, namely fusing the processed multi-granularity memory features with the structured spatial relationships, integrating visual features of different abstraction levels through a granularity-aware gating mechanism, and generating a conditional feature map under the spatial-relationship constraints to provide multi-level structured guidance for image generation; and multi-modal contrastive optimization, namely generating the image from the conditional feature map within a diffusion model framework, and jointly optimizing semantics, spatial layout, and attribute binding through a multi-modal contrastive learning process.
  2. The method of claim 1, wherein constructing the multi-granularity memory bank further comprises: adopting a dual-branch visual feature extraction architecture, wherein the global branch uses a vision transformer network to extract the overall semantic features of an image and the local branch obtains multi-scale spatial structural features through a feature pyramid network; constructing structured memory entries, each comprising a visual feature field storing the multi-scale visual features, a text feature field, a spatial relationship graph field holding a directed graph formed by entities and their spatial relations obtained through a relation extraction network, and a metadata field recording data source information and feature quality indices; and performing a two-stage retrieval mechanism, with coarse screening based on the global semantic features followed by fine re-ranking on a composite score computed from the local features and spatial relationships, the composite score being S = α·S_visual + β·S_spatial, where S_visual is the local feature matching score, S_spatial is the spatial-relationship consistency score, and α and β are weights (a sketch of this scoring appears after the claims).
  3. The method of claim 1, wherein the uncertainty-aware retrieval further comprises: calculating the boundary distance between the semantic vector of the input text and the text feature distribution in the memory bank, analyzing the frequency distribution of concept combinations in the input text, and determining the semantic uncertainty degree from an uncertainty score combined with a local-density analysis of the word embedding space, S_uncertain = 0.4×(d_boundary/0.85) + 0.3×I_rare + 0.3×(d_similarity/0.7), where d_similarity is the average distance between the semantic vector and its 50 nearest neighbor vectors, d_boundary is the boundary distance, and I_rare is a rare-concept indicator variable taking the value 0 or 1; extracting the grammatical structure tree of the text via dependency parsing, identifying the number of entities in the description and their interrelationships, evaluating whether multi-level spatial constraints exist, and quantifying the structural complexity from the topological features of the entity-relationship graph as S_complex = 0.4×min(n_e/5, 1) + 0.3×min(n_r/4, 1) + 0.3×I_hierarchy, where n_e is the number of entities, n_r the number of relations, and I_hierarchy a multi-level spatial-constraint indicator variable; and constructing an uncertainty-complexity two-dimensional decision space, adjusting the retrieval range according to the uncertainty degree and the retrieval depth according to the structural complexity, to obtain a multi-granularity memory set adapted to the characteristics of the input description (see the scoring sketch after the claims).
  4. The method of claim 1, wherein the memory conflict detection and resolution further comprises: extracting the entities and their spatial constraint relations in the input text with a semantic parser to construct a directed graph structure, extracting the pre-stored spatial relationship graph of each retrieved memory entry, and computing relation semantic similarity with a pre-trained relation embedding model to generate a relation consistency matrix; establishing a conflict-detection rule set based on relation-path analysis to identify spatial-relationship chains that contradict the input description, the conflict degree being conflict_score = 0.5×conflict_ratio + 0.3×conflict_severity + 0.2×affected_ratio, where conflict_ratio is the ratio of conflicting relations to total relations, conflict_severity is the average severity of the conflicting relations, and affected_ratio is the ratio of affected entities to total entities; and dividing the conflict degree into four grades corresponding to different fusion algorithms: for a mild conflict, confidence-weighted average fusion; for a moderate conflict, local feature replacement, F_replaced = M_conflict ⊙ F_input + (1 − M_conflict) ⊙ F_original, where M_conflict is the mask of the conflicting entity region, F_replaced is the fused feature map after local replacement, F_input is the guidance feature generated from the input text description, F_original is the feature map of the original memory entry retrieved from the memory bank, and ⊙ denotes element-wise multiplication; and for a severe conflict, reconstructing the local feature layout according to the spatial constraints of the input description (a masked-fusion sketch follows the claims).
  5. The method of claim 1, wherein the spatial-relationship-guided conditional generation further comprises: dividing the visual features into three abstraction layers, namely a semantic layer, a structural layer, and a detail layer; constructing a gating network that takes the semantic representation and spatial-complexity index of the input text as inputs and outputs the fusion weights of each granularity, the semantic-layer, structural-layer, and detail-layer weights w_semantic, w_structural, and w_detail satisfying w_semantic + w_structural + w_detail = 1 and computed by the softmax w_i = exp(z_i)/Σ_j exp(z_j), where z_i is the unnormalized score output by the network; performing progressive feature fusion from high to low abstraction level, first fusing the semantic-layer features F_semantic with the structural-layer features F_structural to generate intermediate features F_mid = w_semantic×F_semantic + w_structural×F_structural, up-sampled to 32×32 resolution, and then fusing F_mid with the detail-layer features F_detail to generate the final features F_final = F_mid + w_detail×F_detail at 128×128 resolution; processing the semantic features and spatial-complexity index of the input text through a multi-layer perceptron to generate a granularity-selection preference vector; and obtaining the spatial relationship graph, diffusing its constraint information over the whole feature space to construct a spatial constraint field, maintaining spatial-constraint consistency across resolution levels through up-sampling and down-sampling operations, generating a conditional feature map carrying multi-level spatial structure information, and performing a self-check on the conditional feature map before generation (a gated-fusion sketch follows the claims).
  6. The method of claim 1, wherein the multi-modal contrastive optimization further comprises: hierarchically injecting the conditional feature map into feature layers of different resolutions in the U-Net backbone of the diffusion model; evaluating, during the diffusion process, the consistency of the generated intermediate result with the conditional features; mapping the generated image, the input text, and the conditional features into a unified semantic space and constructing a multi-level contrastive learning framework with contrastive objectives at the feature, region, and image levels; evaluating whether the generated image accurately expresses the textual intent by comparing its semantic content with the text description, verifying whether the generated content deviates from the description, and identifying whether key elements of the description are missing from the generated image; analyzing the relative positions, distances, and topological relations among entities in the generated image and evaluating their consistency with the input description via the spatial-relationship score S_spatial = 0.4×S_direction + 0.3×S_distance + 0.3×S_morphology, where S_direction is computed from angular differences, S_distance from normalized distances, and S_morphology from topological graph matching degree, and, when S_spatial < 0.6, computing the spatial-constraint gradient ∇_I S_spatial and adjusting the diffusion process through guided sampling (a spatial-scoring sketch follows the claims); and determining the attribution of each attribute to its corresponding entity in the text description, identifying the actual attribute of each entity in the generated image with a regional attribute classifier, comparing actual with expected attributes, and enforcing an attribute-propagation constraint across multiple entities sharing the same attribute to achieve consistent visual appearance.
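The composite retrieval score in claim 2 and the uncertainty and complexity scores in claim 3 are simple weighted sums. A minimal sketch follows, assuming scalar inputs already normalized to [0, 1]; the function names and the example α/β values are illustrative, not from the patent.

```python
def composite_score(s_visual: float, s_spatial: float,
                    alpha: float = 0.6, beta: float = 0.4) -> float:
    """Second-stage re-ranking score S = alpha*S_visual + beta*S_spatial
    (claim 2). The alpha/beta values are assumed; the patent leaves them
    as unspecified weights."""
    return alpha * s_visual + beta * s_spatial

def uncertainty_score(d_boundary: float, i_rare: int, d_similarity: float) -> float:
    """S_uncertain per claim 3; d_similarity is the mean distance to the
    50 nearest neighbor vectors in the word-embedding space."""
    return 0.4 * (d_boundary / 0.85) + 0.3 * i_rare + 0.3 * (d_similarity / 0.7)

def complexity_score(n_entities: int, n_relations: int, i_hierarchy: int) -> float:
    """S_complex per claim 3, saturating at 5 entities / 4 relations."""
    return (0.4 * min(n_entities / 5, 1.0)
            + 0.3 * min(n_relations / 4, 1.0)
            + 0.3 * i_hierarchy)
```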
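Claim 4's moderate-conflict resolution is a mask-weighted blend of the text-derived guidance features and the retrieved memory features. The sketch below assumes NumPy arrays of matching shape, with the conflict mask broadcastable over the feature channels; the function names are hypothetical.

```python
import numpy as np

def conflict_score(conflict_ratio: float, conflict_severity: float,
                   affected_ratio: float) -> float:
    """Conflict degree per claim 4: 0.5*ratio + 0.3*severity + 0.2*affected."""
    return 0.5 * conflict_ratio + 0.3 * conflict_severity + 0.2 * affected_ratio

def local_feature_replacement(f_input: np.ndarray, f_original: np.ndarray,
                              m_conflict: np.ndarray) -> np.ndarray:
    """F_replaced = M_conflict * F_input + (1 - M_conflict) * F_original:
    blend the guidance features into the conflicting entity region of the
    retrieved memory features, element-wise."""
    return m_conflict * f_input + (1.0 - m_conflict) * f_original
```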
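Claim 5's granularity-aware gating and progressive fusion can be sketched as follows, assuming PyTorch tensors. The input resolutions, the bilinear up-sampling choice, and the final combination F_final = F_mid + w_detail×F_detail are assumptions; the patent specifies only the softmax weighting and the 32×32 and 128×128 target resolutions.

```python
import torch
import torch.nn.functional as F

def fuse_granularities(f_semantic: torch.Tensor,   # (B, C, 16, 16), assumed
                       f_structural: torch.Tensor, # (B, C, 32, 32), assumed
                       f_detail: torch.Tensor,     # (B, C, 128, 128), assumed
                       z: torch.Tensor             # (B, 3) unnormalized gate scores
                       ) -> torch.Tensor:
    # Softmax gating: w_semantic + w_structural + w_detail = 1 per sample.
    w = torch.softmax(z, dim=1)
    w_sem, w_str, w_det = (w[:, i].view(-1, 1, 1, 1) for i in range(3))
    # First fusion stage: semantic + structural at 32x32.
    f_sem_up = F.interpolate(f_semantic, size=(32, 32), mode="bilinear",
                             align_corners=False)
    f_mid = w_sem * f_sem_up + w_str * f_structural
    # Second fusion stage: add gated detail features at 128x128.
    f_mid_up = F.interpolate(f_mid, size=(128, 128), mode="bilinear",
                             align_corners=False)
    return f_mid_up + w_det * f_detail  # conditional features, 128x128
```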
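Claim 6's spatial-consistency check reduces to a weighted sum with a 0.6 guidance threshold. A minimal sketch, assuming the three sub-scores are pre-computed in [0, 1]; gradient-guided sampling itself would require a differentiable scorer and is only indicated here.

```python
def spatial_score(s_direction: float, s_distance: float, s_morphology: float) -> float:
    """S_spatial = 0.4*S_direction + 0.3*S_distance + 0.3*S_morphology."""
    return 0.4 * s_direction + 0.3 * s_distance + 0.3 * s_morphology

def needs_guidance(s_spatial: float, threshold: float = 0.6) -> bool:
    """Per claim 6, guided sampling is triggered when S_spatial drops below 0.6."""
    return s_spatial < threshold
```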

Description

Text image generation method based on multi-modal contrastive learning and memory enhancement

Technical Field

The invention relates to the technical field of communication, and in particular to a text-to-image generation method based on multi-modal contrastive learning and memory enhancement.

Background

In the prior art, text-to-image generation methods face challenges on several fronts. When processing complex text descriptions, particularly those containing detailed spatial structures and attribute bindings, existing methods have difficulty generating high-quality images that are accurately aligned with the input semantics and reasonably laid out. Limited ability to handle concept combinations or long-tail concepts absent from the training data leads to results that perform poorly on novel requirements. Traditional methods are inefficient at retrieving relevant memory from large volumes of image-text data to assist the generation process, and lack the ability to adapt dynamically to inputs of varying complexity and uncertainty. The prior art is also deficient in resolving spatial-relationship conflicts among retrieved memory items, which readily causes layout confusion and attribute mismatch in the generated image. The text-to-image generation method based on multi-modal contrastive learning and memory enhancement overcomes these defects through key technical means such as a multi-granularity memory bank, uncertainty-aware retrieval, memory conflict detection and resolution, spatial-relationship-guided conditional generation, and multi-modal contrastive optimization. The method can understand and satisfy the binding requirements of complex spatial structures and attributes in the text, and improves the flexibility and accuracy of the generation process through an efficient retrieval mechanism. By introducing multi-level feature fusion and a conditional gating mechanism, the invention achieves progressive feature integration from high-level semantics down to low-level details, ensuring the visual quality and semantic consistency of the generated image. The multi-modal contrastive optimization framework further ensures that the generated content is highly consistent and complete with respect to the textual intent, improving the overall performance and applicability of text-to-image generation.

Disclosure of Invention

The invention aims to provide a text-to-image generation method based on multi-modal contrastive learning and memory enhancement, addressing the shortcomings of existing text-to-image technology in handling complex spatial structures and attribute bindings, and in particular the challenges of novel concept combinations and long-tail concepts.
A text-to-image generation method based on multi-modal contrastive learning and memory enhancement adopts the following technical scheme. Constructing a multi-granularity memory bank: performing multi-granularity feature extraction and structured storage on large-scale image-text pair data, building memory entries that contain global semantic features, local structural features, and explicit spatial-relationship representations, and enabling efficient retrieval through a hierarchical indexing mechanism. Uncertainty-aware retrieval: receiving the text description input by a user, analyzing its semantic uncertainty and structural complexity, adjusting the retrieval quantity and depth, and obtaining from the memory bank a multi-granularity memory set related to the input at both the semantic and structural levels. Memory conflict detection and resolution: performing spatial-relationship consistency analysis on the retrieved multi-granularity memory set, identifying potential spatial-relationship conflicts, and resolving them with a differentiated fusion algorithm according to conflict grade, ensuring that the retrieval result is consistent with the spatial constraints of the input description. Spatial-relationship-guided conditional generation: fusing the processed multi-granularity memory features with the structured spatial relationships, integrating visual features of different abstraction levels through a granularity-aware gating mechanism, and generating a conditional feature map under the spatial-relationship constraints to provide structured guidance for image generation. Multi-modal contrastive optimization: guiding the diffusion model to generate a high-quality image from the conditional feature map, and optimizing the generation process through a multi-modal contrastive learning mechanism, including joint optimization of image-text semantic alignment, spatial-layout consistency, and attribute-entity binding accuracy. Further, the constructing the multi-granular