CN-121999070-A - Text-to-image generation method and system based on multi-expert orchestration
Abstract
The invention discloses a text-to-image generation method and system based on multi-expert orchestration. First, an intelligent router based on a graph neural network analyzes the input text prompt and uses relevance scores together with an adaptive dynamic screening mechanism to activate an optimal expert subset from a heterogeneous model pool. Second, the active experts are run in parallel to generate diversified initial sketches under a minimal denoising budget. The sketches are then preferentially reranked by a semantic alignment index to screen out the best candidate images. Finally, complementary structural features of the candidate images are extracted, and a dual-ControlNet architecture guides a target diffusion model to perform collaborative fusion, generating a high-fidelity image. Through this cascaded orchestration strategy, the invention achieves generation fidelity comparable to large-scale SOTA models while markedly reducing GPU memory occupation and inference time.
Inventors
- CHENG GUANJIE
- ZHANG YIFAN
- WANG ZEKUN
- ZHAO XINKUI
- YIN JIANWEI
- DENG SHUIGUANG
Assignees
- Zhejiang University
- Innovation and Management Center of the School of Software, Zhejiang University (Ningbo)
Dates
- Publication Date
- 20260508
- Application Date
- 20251231
Claims (10)
- 1. A text-to-image generation method based on multi-expert orchestration, comprising the steps of: intelligent expert routing, namely modeling the relation between the text prompt and the expert models by constructing a heterogeneous graph, applying a graph attention network to perform relation learning, outputting a relevance score for each expert through a predictor, and dynamically screening the top-ranked expert models according to the relevance scores to constitute an active expert subset; parallel sketch generation, namely controlling each expert model in the active expert subset to run in parallel and generate an initial sketch based on the text prompt; preferential rearrangement, namely calculating the semantic alignment between each initial sketch and the text prompt and rearranging the initial sketches in descending order of semantic alignment score, so as to screen out a preset number of candidate images; parallel refinement, namely performing parallel refinement on the screened candidate images by inputting them into two independent image-to-image pipelines and outputting optimized candidate images; and dual-guided fusion, namely extracting a structure control map from each optimized candidate image, inputting the structure control maps into a target diffusion model equipped with dual ControlNets, and fusing the structure control maps with the original text prompt in parallel to generate a high-fidelity image.
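For illustration only, the cascaded four-stage flow of claim 1 can be sketched as plain-Python orchestration. The `select`, `align`, `refine`, and `fuse` callables, and the expert models themselves, are hypothetical stand-ins for the routing, alignment-scoring, refinement, and dual-ControlNet components; they are not the patented implementation.

```python
# Illustrative sketch of the four-stage cascade of claim 1.
# select / align / refine / fuse and the expert callables are
# hypothetical stand-ins, not the patented components themselves.

def generate(prompt, experts, select, align, refine, fuse, num_candidates=2):
    # Stage 1: intelligent expert routing -> active expert subset
    active = select(prompt, experts)

    # Stage 2: sketch generation (sequential here for clarity;
    # the method runs the active experts in parallel)
    sketches = [expert(prompt) for expert in active]

    # Stage 3: preferential rearrangement by semantic alignment score
    sketches.sort(key=lambda img: align(img, prompt), reverse=True)
    candidates = sketches[:num_candidates]

    # Stage 4: parallel refinement, then dual-guided fusion
    refined = [refine(img) for img in candidates]
    return fuse(refined, prompt)
```

In practice the stage-2 and stage-4 loops would be dispatched concurrently across devices; the sequential loops above only fix the data flow between stages.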
- 2. The method of claim 1, wherein the node set of the heterogeneous graph includes prompt nodes and expert nodes, wherein the prompt node features are initialized with the output of a pre-trained CLIP text encoder to extract the prompt's global semantic information, and the expert node features are initialized as learnable embedding vectors characterizing the capabilities and characteristics of each expert model.
- 3. The method according to claim 2, wherein the graph attention network processes the prompt node features and the expert node features and learns the relation between the semantic intent of the prompt and each expert model by aggregating neighbor node information, obtaining an expert node feature vector for each expert.
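The neighbor aggregation of claim 3 can be illustrated with a minimal single-head dot-product attention over node feature vectors. This is a simplified stand-in: a real graph attention network applies learned linear projections and LeakyReLU attention coefficients, none of which are shown here.

```python
import math

def attend(query, neighbors):
    # Illustrative stand-in for graph attention aggregation (claim 3):
    # score each neighbor feature against the query node feature,
    # softmax the scores, and return the attention-weighted sum.
    logits = [sum(q * n for q, n in zip(query, nb)) for nb in neighbors]
    m = max(logits)                      # shift for numerical stability
    exp = [math.exp(l - m) for l in logits]
    z = sum(exp)
    weights = [e / z for e in exp]
    dim = len(query)
    return [sum(w * nb[d] for w, nb in zip(weights, neighbors))
            for d in range(dim)]
```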
- 4. The method according to claim 3, wherein, to convert the expert node feature vectors into a quantifiable ranking index, they are input into a shared linear prediction layer that computes a relevance score for each expert.
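The shared linear predictor of claim 4 reduces to a dot product plus bias applied with the same parameters to every expert's feature vector; a minimal sketch, with weights and feature values purely illustrative:

```python
def relevance_scores(expert_features, w, b):
    # Shared linear prediction layer (claim 4, illustrative): each
    # expert node feature vector h from the graph attention network
    # is mapped to a scalar relevance score s = w . h + b, with the
    # same weight vector w and bias b shared across all experts.
    return [sum(wi * hi for wi, hi in zip(w, h)) + b
            for h in expert_features]
```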
- 5. The method of claim 4, wherein, when determining the active expert subset from the relevance scores, the ratio between the scores of adjacently ranked experts is computed; if the ratio exceeds a preset decay threshold, the relevance of the lower-ranked expert model is deemed to have dropped off a cliff relative to the higher-ranked one, screening is stopped, and the expert models ranked before the drop constitute the active expert subset.
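The adaptive cutoff of claim 5 can be sketched as a single pass over the descending score list; the decay threshold value used below is illustrative, not one specified by the patent:

```python
def adaptive_screen(ranked_scores, decay_threshold):
    # Claim 5 (illustrative): walk the relevance scores in descending
    # order and compute the ratio of each score to the next one; a
    # ratio above the decay threshold marks a score cliff, so the
    # screening stops and only the experts before the cliff are kept.
    k = 1
    for cur, nxt in zip(ranked_scores, ranked_scores[1:]):
        if nxt <= 0 or cur / nxt > decay_threshold:
            break
        k += 1
    return k  # size of the active expert subset
```

The subset size thus adapts per prompt: a flat score profile keeps many experts, while a sharp drop after the leader keeps only one.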
- 6. The method of claim 1, wherein, when calculating the semantic alignment between the initial sketches and the text prompt, a pre-trained CLIP model is used as the reward function, and for each initial sketch the cosine similarity between its image embedding and the text prompt embedding is computed as the semantic alignment score.
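The ranking step of claim 6 can be sketched as cosine-similarity scoring over embeddings. The sketch assumes the CLIP image and text embeddings are already computed elsewhere (running the CLIP encoders is not shown); the vectors in the usage example are toy values.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_by_alignment(image_embeds, text_embed, top_k):
    # Claim 6 (illustrative): score each sketch by the cosine
    # similarity of its CLIP image embedding to the prompt's text
    # embedding, sort in descending order, keep the top_k indices.
    order = sorted(range(len(image_embeds)),
                   key=lambda i: cosine(image_embeds[i], text_embed),
                   reverse=True)
    return order[:top_k]
```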
- 7. The method of claim 1 or 6, wherein each image-to-image pipeline takes a candidate image as the initial input, strengthens local image details and improves overall image consistency by extending the denoising steps, and outputs an optimized candidate image.
- 8. The method of claim 1, wherein the structure control map is extracted by a Canny edge detector and retains the key geometry and contour information of the image.
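To illustrate the idea of a structure control map, the sketch below marks pixels whose intensity gradient exceeds a threshold. This is a deliberately simplified stand-in: the patent specifies a full Canny detector, which additionally performs Gaussian smoothing, non-maximum suppression, and hysteresis thresholding.

```python
def edge_map(img, threshold):
    # Simplified stand-in for the Canny detector of claim 8: a pixel
    # becomes an edge (1) when its horizontal or vertical intensity
    # gradient exceeds the threshold. The resulting binary map keeps
    # geometry/contour information for structure-guided generation.
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = img[y][x + 1] - img[y][x] if x + 1 < w else 0
            gy = img[y + 1][x] - img[y][x] if y + 1 < h else 0
            if abs(gx) > threshold or abs(gy) > threshold:
                out[y][x] = 1
    return out
```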
- 9. A text-to-image generation system based on multi-expert orchestration, comprising: an intelligent expert routing module for modeling the relation between the text prompt and the expert models by constructing a heterogeneous graph, applying a graph attention network to perform relation learning, outputting a relevance score for each expert through a predictor, and dynamically screening the top-ranked expert models according to the relevance scores to constitute an active expert subset; a parallel sketch generation module for controlling each expert model in the active expert subset to run in parallel and generate an initial sketch based on the text prompt; a preferential rearrangement and selection module for calculating the semantic alignment between each initial sketch and the text prompt, rearranging the initial sketches in descending order of semantic alignment score, and screening out a preset number of candidate images; a parallel refinement module for performing parallel refinement on the screened candidate images by inputting them into two independent image-to-image pipelines and outputting optimized candidate images; and a dual-guided fusion module for extracting a structure control map from each optimized candidate image, inputting the structure control maps into a target diffusion model equipped with dual ControlNets, and fusing the structure control maps with the original text prompt in parallel to generate a high-fidelity image.
- 10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 8.
Description
Text-to-image generation method and system based on multi-expert orchestration

Technical Field

The invention relates to the field of artificial intelligence, and in particular to a text-to-image generation method and system based on multi-expert orchestration.

Background

In recent years, text-to-image (T2I) generation technology has advanced significantly, mainly thanks to the rapid development of diffusion models. Advanced T2I models such as DALL-E, Midjourney, and Stable Diffusion enable a user to generate high-quality, semantically aligned complex images from a natural language description alone, greatly simplifying the traditional image-creation workflow and showing great application potential in creative work, design, content generation, and other fields. However, current state-of-the-art (SOTA) T2I systems generally have significant computing and memory requirements. In particular, these large-scale models typically require tens of gigabytes of graphics processing unit (GPU) memory and substantial inference time, which poses serious challenges for their deployment in resource-constrained environments such as edge devices, mobile platforms, or application scenarios requiring large-scale concurrent processing. To address these resource limitations, the industry has begun to develop lightweight T2I models as more efficient alternatives. While these lightweight models provide faster inference and lower memory consumption, they tend to fall short of their large-scale counterparts in the detail, consistency, and semantic fidelity of the generated images. A single miniature model has inherent limitations in generation quality and overall expressive power. Furthermore, the image output of a single T2I model is inherently unstable.
Even when the same or highly similar text prompts are entered, the same model may generate images that differ significantly in content, style, or detail, owing to randomness in the diffusion process and the model's sensitivity to input variations. This high variability makes it difficult to achieve consistent, predictable results for a particular creative intent. In summary, the T2I field currently faces three core challenges: 1) large-scale models are costly and resource-hungry, which limits wide deployment; 2) lightweight models lack sufficient generation quality and fidelity to meet professional requirements; and 3) the instability of a single model's output undermines the consistency and reliability of creation. Therefore, there is an urgent need in the art for a new technical paradigm that breaks the long-standing trade-off between generation quality and computational efficiency, achieving high-fidelity, high-consistency, resource-efficient image synthesis by intelligently orchestrating multiple lightweight experts rather than by continually scaling up a single model.

Disclosure of Invention

In view of the above technical problems, the invention provides a text-to-image generation method and system based on multi-expert orchestration. The invention aims to synthesize images with high fidelity and high semantic consistency in a resource-efficient manner by cooperatively using multiple lightweight generative models, effectively resolving the contradiction between generation quality and computational cost. It adopts a cascaded four-stage process, following the principle of coarse screening, fine screening, and structural fusion, to coordinate and exploit a heterogeneous pool of lightweight T2I models.
In a first aspect, the present invention provides a text-to-image generation method based on multi-expert orchestration, comprising the steps of: intelligent expert routing, namely modeling the relation between the text prompt and the expert models by constructing a heterogeneous graph, applying a graph attention network to perform relation learning, outputting a relevance score for each expert through a predictor, and dynamically screening the top-ranked expert models according to the relevance scores to constitute an active expert subset; parallel sketch generation, namely controlling each expert model in the active expert subset to run in parallel and generate an initial sketch based on the text prompt; preferential rearrangement, namely calculating the semantic alignment between each initial sketch and the text prompt and rearranging the initial sketches in descending order of semantic alignment score, so as to screen out a preset number of candidate images; parallel refinement, namely performing parallel refinement on the screened candidate images by inputting them into two independent image-to-image pipelines and outputting optimized candidate images; and dual-guided fusion, namely extracting a structure control map from each optimized candidate image, inputting the structure control maps into a target diffusion model equipped with dual ControlNets, and fusing the structure control maps with the original text prompt in parallel to generate a high-fidelity image.