CN-122023192-A - Method and system for restoring distortion of text-generated image based on intelligent agent


Abstract

The application provides an agent-based method and system for restoring distortion in text-generated images. The method comprises constructing and training a distortion region perception and positioning agent, a distortion region annotation and description agent, and a distortion region restoration agent, and then iteratively performing the following operations: detecting distortion regions of the text-generated image using the distortion region perception and positioning agent; if a distortion region is detected, inferring the distortion type and distortion description from the distortion region using the distortion region annotation and description agent; and, based on the distortion region, the distortion type and the distortion description, performing distortion restoration using the distortion region restoration agent and updating the text-generated image. Iteration stops when the distortion region perception and positioning agent no longer detects a distortion region. By using agents to form a "perception-reasoning-action" closed loop, the application can effectively address distortion in a wide range of AI-generated images, with excellent generalization and high efficiency.
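
The iterative loop described in the abstract can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the names `restore_until_clean`, `Diagnosis`, `locate`, `describe` and `restore` are hypothetical stand-ins for the three agents, and the `max_rounds` safety cap is an added assumption (the abstract only states that iteration continues until no distortion region is detected).

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple, TypeVar

Image = TypeVar("Image")  # any image representation
Mask = TypeVar("Mask")    # any distortion-region representation


@dataclass(frozen=True)
class Diagnosis:
    """Output of the (hypothetical) annotation and description agent."""
    distortion_type: str  # e.g. "hand distortion"
    description: str      # free-text description of the distorted content


def restore_until_clean(
    image: Image,
    locate: Callable[[Image], Optional[Mask]],
    describe: Callable[[Image, Mask], Diagnosis],
    restore: Callable[[Image, Mask, Diagnosis], Image],
    max_rounds: int = 5,  # assumption: safety cap, not stated in the source
) -> Tuple[Image, int]:
    """Run the perception-reasoning-action loop.

    Returns the final image and the number of restoration rounds performed.
    """
    for rounds in range(max_rounds):
        mask = locate(image)                     # perception
        if mask is None:                         # no distortion region left
            return image, rounds
        diagnosis = describe(image, mask)        # reasoning
        image = restore(image, mask, diagnosis)  # action, then update image
    return image, max_rounds
```

Passing the agents in as plain callables keeps the loop independent of any particular detector, vision-language model or editing tool, which matches the modular structure the abstract describes.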

Inventors

HU QIANG
SHEN SHAOCHENG
ZHANG XIAOYUN
CHEN ZHIYONG

Assignees

Shanghai Jiao Tong University (上海交通大学)

Dates

Publication Date: 20260512
Application Date: 20260105

Claims (10)

1. An agent-based method for restoring distortion in a text-generated image, characterized by comprising the following steps: constructing and training a saliency map prediction model, and constructing a distortion region perception and positioning agent based on the trained saliency map prediction model; constructing and training a distortion region annotation and description agent, and constructing a distortion region restoration agent; and iteratively performing the following operations: detecting a distortion region of the text-generated image using the distortion region perception and positioning agent; if a distortion region is detected, inferring the distortion type and distortion description of the text-generated image using the distortion region annotation and description agent, based on the text-generated image and its corresponding distortion region, and, based on the distortion region, the distortion type and the distortion description, performing distortion restoration using the distortion region restoration agent and updating the text-generated image; until the distortion region perception and positioning agent no longer detects a distortion region, whereupon the current text-generated image is taken as the final distortion restoration result.
2. The agent-based method for restoring distortion in a text-generated image as set forth in claim 1, wherein constructing and training the saliency map prediction model and constructing the distortion region perception and positioning agent based on the trained saliency map prediction model comprises: constructing a saliency map prediction model comprising a dual-encoder structure, an attention module and an output module, wherein the dual-encoder structure comprises a visual encoder and a text encoder arranged in parallel; the visual encoder receives the input text-generated image, extracts visual features from it and outputs image features; the text encoder receives the input prompt corresponding to the text-generated image, extracts semantic features from it and outputs text features; the attention module performs cross-modal fusion of the input image features and text features through a self-attention mechanism, generating cross-modal features that carry both visual structure information and text semantic information; and the output module outputs a distortion saliency map from the cross-modal features; training the saliency map prediction model to obtain a trained saliency map prediction model; constructing a mask model, and connecting the output of the trained saliency map prediction model to the input of the mask model to form the distortion region perception and positioning agent, wherein the mask model sequentially applies thresholding and morphological dilation to the input saliency map to obtain a distortion region mask of the text-generated image, and the region marked as distorted in the distortion region mask is taken as the distortion region of the text-generated image; and if no region marked as distorted appears in the distortion region mask, the distortion region perception and positioning agent is deemed not to have detected a distortion region.
3. The agent-based method for restoring distortion in a text-generated image as set forth in claim 2, wherein, in the training process of the saliency map prediction model, the loss function is a hybrid loss function that combines a mean square error loss and a KL divergence loss using preset balance parameters, where S is the distortion saliency map predicted by the model and the reference is the manually annotated ground-truth saliency map.
4. The agent-based method for restoring distortion in a text-generated image as set forth in claim 1, wherein constructing and training the distortion region annotation and description agent comprises: selecting a pre-trained multimodal vision-language large model as the base visual model; performing supervised fine-tuning of the base visual model on a manually annotated distortion region data set to obtain a multimodal reasoning model; and performing reinforcement learning training of the multimodal reasoning model using the Group Relative Policy Optimization (GRPO) algorithm with a reinforcement signal, to obtain the distortion region annotation and description agent, wherein the reinforcement signal is a quantified signal constructed from human perceptual preferences for distortion diagnosis.
5. The agent-based method for restoring distortion in a text-generated image as set forth in claim 4, wherein the optimization objective function of the Group Relative Policy Optimization (GRPO) algorithm is a GRPO loss composed of the following terms: an expectation taken over input question and model answer sample pairs; a minimum function min(); the probability ratio of the current policy to the old policy on the same sample; a normalized advantage function; a clip function that restricts its argument to an interval, with a clip hyperparameter; and a KL divergence term between the policy model being trained and a reference policy model, weighted by a KL divergence penalty coefficient.
6. The agent-based method for restoring distortion in a text-generated image as set forth in claim 4, wherein inferring the distortion type and distortion description of the text-generated image using the distortion region annotation and description agent comprises: inputting the text-generated image and its corresponding distortion region into the distortion region annotation and description agent, which performs inference-based diagnosis on the distortion region; receiving, in real time during the inference diagnosis, a structured interactive question text, the interactive question text being used to supply diagnostic guidance to the agent and to trigger it to output a structured answer text that conforms to human perceptual preferences for distortion diagnosis; and programmatically parsing the structured answer text to obtain the distortion type and the distortion description, wherein the distortion type comprises hand distortion, interaction distortion, facial distortion and object redundancy, and the distortion description is descriptive text of the distorted content in the text-generated image.
7. The agent-based method for restoring distortion in a text-generated image as set forth in claim 1, wherein performing distortion restoration using the distortion region restoration agent and updating the text-generated image based on the distortion region, the distortion type and the distortion description comprises performing the following operations using the distortion region restoration agent: determining the editing spatial range of the target repair area based on the distortion region; selecting a suitable image restoration model from a restoration tool library based on user preference; generating a restoration instruction sequence based on the distortion description and the distortion type; and calling the selected image restoration model to perform the distortion restoration operation on the text-generated image to be restored within the editing spatial range according to the restoration instruction sequence, and taking the restored image as the updated text-generated image.
8. An agent-based system for restoring distortion in a text-generated image, comprising: a positioning module for constructing and training a saliency map prediction model and constructing a distortion region perception and positioning agent based on the trained saliency map prediction model; a model construction module for constructing and training a distortion region annotation and description agent and constructing a distortion region restoration agent; and an iteration module for iteratively performing the following operations: detecting a distortion region of the text-generated image using the distortion region perception and positioning agent; if a distortion region is detected, generating the distortion type and distortion description of the text-generated image using the distortion region annotation and description agent, based on the text-generated image and its corresponding distortion region, and, based on the distortion region, the distortion type and the distortion description, performing distortion restoration using the distortion region restoration agent and updating the text-generated image; until the distortion region perception and positioning agent no longer detects a distortion region, whereupon the current text-generated image is taken as the final distortion restoration result.
9. A non-transitory computer-readable storage medium having stored thereon a computer program, characterized in that the program, when executed by a processor, implements the steps of the method according to any of claims 1-7.
10. An electronic device, comprising: at least one memory for storing program instructions; and at least one processor for invoking the program instructions stored in the memory and performing, according to the obtained program instructions, the steps of the method according to any of claims 1-7.
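
The mask model of claim 2 (thresholding followed by morphological dilation of the saliency map) can be illustrated with a small pure-Python sketch. The threshold `tau`, the dilation `radius`, the square structuring element and all function names are assumptions chosen for illustration; the claims do not specify them.

```python
from typing import List, Optional

Saliency = List[List[float]]   # saliency values, assumed to lie in [0, 1]
BinaryMask = List[List[bool]]  # True marks a distorted pixel


def threshold(saliency: Saliency, tau: float = 0.5) -> BinaryMask:
    """Mark every pixel whose saliency is at least tau (assumed cutoff)."""
    return [[value >= tau for value in row] for row in saliency]


def dilate(mask: BinaryMask, radius: int = 1) -> BinaryMask:
    """Morphological dilation with a (2*radius+1)^2 square element (assumed)."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    out = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue
            # Spread each marked pixel to its neighbourhood, clipped at borders.
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    out[ny][nx] = True
    return out


def distortion_mask(
    saliency: Saliency, tau: float = 0.5, radius: int = 1
) -> Optional[BinaryMask]:
    """Return the dilated distortion mask, or None if nothing is marked."""
    mask = dilate(threshold(saliency, tau), radius)
    return mask if any(any(row) for row in mask) else None
```

Returning `None` when no pixel survives thresholding mirrors the claim's condition that an empty mask means no distortion region was detected, which is what ends the iteration.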

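Claims 3 and 5 refer to formulas whose symbols do not survive in the claim text. As a hedged sketch only, the forms below show how a hybrid saliency loss and a GRPO loss are commonly written with the terms the claims list. The symbols ($\lambda_1$, $\lambda_2$, $\hat{S}$, $r(\theta)$, $\hat{A}$, $\epsilon$, $\beta$), the weighted-sum form of the hybrid loss, the direction of each KL divergence and the sign convention of the GRPO loss are all assumptions based on standard formulations, not content recovered from the claims.

```latex
% Hybrid saliency loss (assumed weighted-sum form).
% S: predicted distortion saliency map; \hat{S}: manually annotated ground truth.
\mathcal{L} = \lambda_1\,\mathcal{L}_{\mathrm{MSE}}(S,\hat{S})
            + \lambda_2\,\mathcal{L}_{\mathrm{KL}}(S,\hat{S})

% GRPO loss in the standard clipped-surrogate form, written as a quantity to minimize.
% r(\theta): probability ratio of the current policy to the old policy on the same sample.
% \hat{A}: normalized advantage; \epsilon: clip hyperparameter; \beta: KL penalty coefficient.
r(\theta) = \frac{\pi_{\theta}(o \mid q)}{\pi_{\theta_{\mathrm{old}}}(o \mid q)}

\mathcal{L}_{\mathrm{GRPO}}(\theta) =
  -\,\mathbb{E}_{(q,o)}\Big[\min\big(r(\theta)\,\hat{A},\;
      \mathrm{clip}\big(r(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}\big)\Big]
  + \beta\, D_{\mathrm{KL}}\big(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}}\big)
```
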
Description

Method and system for restoring distortion of text-generated image based on intelligent agent

Technical Field

The application relates to the field of image restoration, and in particular to an agent-based method and system for restoring distortion in text-generated images.

Background

In recent years, a new generation of text-to-image diffusion models, represented by SDXL, FLUX and Imagen, has achieved major breakthroughs in network scale, training data quality and multi-modal alignment capability, and can generate images with high realism and artistic expressiveness in most scenarios. Such models are widely used in digital art creation, visual design, advertisement production, video previsualization, game content generation, and even auxiliary medical image synthesis and industrial simulation, and are becoming an important infrastructure for content production.
Nevertheless, even the most advanced large-scale diffusion models still show significant shortcomings in actual generation, especially semantic and structural distortion of local details. Typical problems include: human anatomical errors, such as abnormal finger counts, fused or broken fingers, unreasonable limb proportions, and wrong joint directions at elbows and knees; damaged facial detail, such as asymmetric eye shapes, broken lip structures, misplaced facial features, and blurred or overflowing facial textures; low-quality generated text, such as unreadable characters, disordered strokes and inconsistent languages; and abnormal physical interaction between people and objects or between people and animals, such as floating hands, fused contact surfaces, and grasping postures that violate physical logic.
Recent studies have mainly addressed these problems from three directions: prompt enhancement, reinforcement-learning-based model optimization, and fine-grained noise-space alignment. Although these methods can improve overall image fidelity, they lack explicit spatial reasoning capability and cannot explain or correct locally failed regions. Post-generation (post-hoc) editing methods such as Imagic, BAGEL and Step1X-Edit can perform local image restoration, but they rely on manually drawn distortion region masks or heuristic text instructions and therefore cannot autonomously identify the regions that need restoration. Regenerating the full image, by contrast, incurs higher repair cost and risks style drift.
A search of the prior art found Chinese patent publication No. CN120807681A, which provides an image generation method, device, equipment and storage medium based on an image generation reasoning model; it uses a visual language model to receive a user's image generation request, performs reasoning and analysis, obtains a retrieval instruction, plans the image generation steps, and finally generates the picture.
Visual language models (VLMs) are regarded as a potential route to automated image distortion restoration because of their semantic reasoning capability. However, even the most advanced VLMs have difficulty locating distortion regions stably. For a given query, their answers are still often contradictory or wrong, to the point that clearly abnormal regions are judged normal. There are two key reasons. First, the training objective of a VLM favors high-level semantic alignment rather than pixel-level verification, so spatial accuracy is poor and fine-grained artifacts are easily missed. Second, the model's strong internal knowledge priors often override the actual visual evidence, causing hallucinated judgments.
Therefore, there is a need for a self-correcting image distortion restoration method and system that can autonomously discover local distortions, locate the affected regions accurately, give diagnoses and descriptions close to human aesthetic preferences, and perform controllable repair operations.

Disclosure of Invention

In view of the defects in the prior art, the application aims to provide an agent-based method and system for restoring distortion in text-generated images.
According to a first aspect of the present application, there is provided an agent-based method for restoring distortion in a text-generated image, comprising: constructing and training a saliency map prediction model, and constructing a distortion region perception and positioning agent based on the trained saliency map prediction model; constructing and training a distortion region annotation and description agent, and constructing a distortion region restoration agent; and iteratively performing the following operations: Detecting a distortion reg