
CN-121982146-A - Fine-grained image editing method based on multi-modal chain-of-thought reasoning

CN 121982146 A

Abstract

A fine-grained image editing method based on multi-modal chain-of-thought reasoning, belonging to the technical field of computer vision and image processing. The invention addresses the problem that existing image editing methods cannot simultaneously satisfy the requirements of controllability and fine-grained editing in complex editing scenes. First, a unified multi-modal generation-understanding model generates textual chain-of-thought reasoning from the editing instruction and the input image to determine the target object the user refers to, and on this basis generates a pixel-level visual-grounding image corresponding to the target. Second, guided by these multi-modal grounding cues, the model generates an editing description and semantic reasoning about the editing result, and performs local-region editing to produce an accurately edited image. During training, grounding is enhanced through a multi-modal chain-of-thought alignment mechanism and auxiliary mask supervision, ensuring semantic consistency and grounding accuracy between the reasoning chain and the actual editing region. The invention achieves editing capability that is strongly interpretable, accurately spatially aligned, and interactive.

Inventors

  • Wan Yecong
  • Wu Hao
  • Shao Mingwen
  • Zhang Hongzhi
  • Zuo Wangmeng

Assignees

  • Harbin Institute of Technology

Dates

Publication Date
2026-05-05
Application Date
2026-01-20

Claims (10)

  1. A fine-grained image editing method based on multi-modal chain-of-thought reasoning, characterized by comprising the following steps: obtaining a multi-entity complex scene image, performing multi-target detection and segmentation, and screening editing targets that require fine spatial reasoning; generating a visual-grounding chain of thought corresponding to the editing targets, determining the editing targets in the multi-entity complex scene image based on the visual-grounding chain of thought, and generating an editing instruction and a textual chain of thought for the editing targets; constructing a multi-modal interleaved reasoning model consisting of an understanding module and a generation module, taking the multi-entity complex scene image paired with its modification instruction as a training sample, taking the visual-grounding chain of thought and the textual chain of thought as intermediate reasoning, taking the modified complex scene image as the final editing target, and adjusting model parameters in combination with a mask supervision mechanism and a multi-modal chain-of-thought alignment mechanism to train the multi-modal interleaved reasoning model; after training is completed, the multi-modal interleaved reasoning model performs fine-grained image editing on an input image to be processed according to an input modification instruction and outputs the edited image.
  2. The fine-grained image editing method based on multi-modal chain-of-thought reasoning according to claim 1, characterized in that the multi-entity complex scene image is subjected to multi-target recognition, detection, and segmentation through a RAM model and a Grounded-SAM model, after which a Qwen2.5-VL-7B module is used to screen for editing targets that require fine spatial reasoning; the visual-grounding chain of thought includes a target bounding box and a translucent mask.
  3. The fine-grained image editing method based on multi-modal chain-of-thought reasoning according to claim 2, characterized in that the textual chain of thought of the editing target has a four-segment structure comprising scene description, grounding reasoning, editing intent, and result interpretation; the editing instruction and the textual chain of thought are generated from the target bounding box using a Qwen2.5-VL-72B module, and the editing instruction is then subjected to instruction rewriting and linguistic-diversity enhancement using a Qwen module.
  4. The fine-grained image editing method based on multi-modal chain-of-thought reasoning according to claim 3, characterized in that the modified complex scene image is obtained as follows: the editing target is modified with a Bagel module according to the editing instruction, the modification comprising either editing processing or removal; for editing processing, the modified editing target is seamlessly fused with the multi-entity complex scene image to obtain the modified complex scene image; for removal, a LaMa model, together with the editing-target mask, performs consistency inpainting of the multi-entity complex scene to obtain the modified complex scene image; a Qwen2.5-VL-7B module then performs semantic and visual consistency verification on the modified complex scene image, and only images meeting a verification threshold are retained.
  5. The fine-grained image editing method based on multi-modal chain-of-thought reasoning according to claim 4, characterized in that, during training of the multi-modal interleaved reasoning model, the understanding module performs textual reasoning analysis on the modification instruction to obtain the textual chain of thought of the editing target, and the generation module derives the visual-grounding chain of thought of the editing target from the textual chain of thought and processes the editing target to obtain the edited image.
  6. The fine-grained image editing method based on multi-modal chain-of-thought reasoning according to claim 5, characterized in that the mask supervision mechanism comprises: applying a text projector to the textual chain of thought output by the understanding module to obtain a text projection result; applying a visual projector separately to the visual-grounding chain of thought and to the edited image output by the generation module to obtain a visual-grounding visual projection result and an edited-image visual projection result; encoding the multi-entity complex scene image with an image encoder to obtain coded image features; using a mask encoder to predict a text prediction mask from the coded image features and the text projection result, a grounding prediction mask from the coded image features and the visual-grounding visual projection result, and an editing prediction mask from the coded image features and the edited-image visual projection result; and computing mask supervision loss functions from the comparison of the text prediction mask, the grounding prediction mask, and the editing prediction mask with the real mask, respectively, and adjusting the model parameters according to the loss functions.
  7. The fine-grained image editing method based on multi-modal chain-of-thought reasoning according to claim 6, characterized in that the text projection result is denoted p_text, the visual-grounding visual projection result is denoted p_loc, and the edited-image visual projection result is denoted p_edit; the prediction masks are then: M_text = E_mask(F, p_text), M_loc = E_mask(F, p_loc), M_edit = E_mask(F, p_edit), where M_text is the text prediction mask, E_mask(·) denotes the mask encoder, F denotes the coded image features, M_loc is the grounding prediction mask, and M_edit is the editing prediction mask; the mask supervision loss function L_mask is: L_mask = BCE(M_text, M*) + BCE(M_loc, M*) + BCE(M_edit, M*), where BCE(·,·) is the binary cross-entropy and M* is the real mask.
  8. The fine-grained image editing method based on multi-modal chain-of-thought reasoning according to claim 7, characterized in that the multi-modal chain-of-thought alignment mechanism includes a textual chain-of-thought alignment mechanism: all minimum-semantic-unit features corresponding to the editing target in the textual chain of thought are averaged to obtain a semantic feature mean f_sem; all noised features of the editing target are averaged to obtain a noise feature mean f_noise; the noise feature mean f_noise is projected through a fully connected layer W, and the cosine similarity between the projected feature and the semantic feature mean f_sem is computed to obtain the textual chain-of-thought alignment loss: L_text = 1 − cos(W(f_noise), f_sem).
  9. The fine-grained image editing method based on multi-modal chain-of-thought reasoning according to claim 8, characterized in that the multi-modal chain-of-thought alignment mechanism further includes a visual-grounding chain-of-thought alignment mechanism: the visual feature units of the editing target corresponding to the visual-grounding chain of thought are denoted v; a noised VAE editing feature is generated and adjusted by interpolation to obtain a noise feature n, such that the noise feature n and the visual feature units v have the same spatial resolution; the noise feature n is projected through a fully connected layer W, the cosine similarity between the projected noise feature and the visual feature units v is computed, and the model parameters are adjusted accordingly so that the noise feature n and the visual feature units v are aligned in spatial position.
  10. The fine-grained image editing method based on multi-modal chain-of-thought reasoning according to claim 9, characterized in that the visual-grounding chain-of-thought alignment loss L_loc is: L_loc = 1 − (1/N) Σ_i cos(W(n_i), v_i), where n_i and v_i are the noise feature and the visual feature unit at spatial position i, and N is the number of spatial positions.
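The mask supervision mechanism of claims 6 and 7 can be illustrated with a minimal sketch. The claims only state that each prediction mask is compared with the real mask via binary cross-entropy and the losses are combined; the array shapes, the `bce` helper, and the unweighted sum below are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    # Binary cross-entropy between a soft predicted mask and a binary ground-truth mask.
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)))

def mask_supervision_loss(m_text, m_loc, m_edit, m_true):
    # The text, grounding, and editing prediction masks are each compared
    # against the same real mask; the three BCE terms are summed.
    return bce(m_text, m_true) + bce(m_loc, m_true) + bce(m_edit, m_true)

# Toy 4x4 scene: near-perfect predictions from all three branches
# give a near-zero supervision loss.
m_true = np.zeros((4, 4))
m_true[1:3, 1:3] = 1.0
good = np.where(m_true > 0, 0.99, 0.01)
loss = mask_supervision_loss(good, good, good, m_true)
```

In a full implementation each mask would come from the mask encoder conditioned on the coded image features and the corresponding projection result; here the predictions are supplied directly to keep the loss computation self-contained.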
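The two alignment losses of claims 8 through 10 reduce to cosine-similarity terms between projected noise features and chain-of-thought features. The sketch below assumes a `1 − cosine` loss form and a plain linear projection matrix for the fully connected layer; the patent text only states that cosine similarity is computed after projection, so the exact loss form, feature shapes, and variable names are assumptions:

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    # Cosine similarity between two 1-D feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def text_chain_alignment_loss(semantic_units, noise_feats, W):
    # Claim 8: average the minimum-semantic-unit features and the noised
    # features, project the noise mean through the FC layer W, compare by cosine.
    f_sem = semantic_units.mean(axis=0)
    f_noise = noise_feats.mean(axis=0)
    return 1.0 - cosine(W @ f_noise, f_sem)

def visual_chain_alignment_loss(visual_units, noise_feats, W):
    # Claims 9-10: the noise features have been interpolated to the same
    # spatial resolution as the visual units (H, W, C); compare position by
    # position and average over all spatial locations.
    C = visual_units.shape[-1]
    vis = visual_units.reshape(-1, C)
    proj = noise_feats.reshape(-1, C) @ W.T
    sims = [cosine(p, v) for p, v in zip(proj, vis)]
    return 1.0 - float(np.mean(sims))

# With an identity projection and identical features, both losses vanish,
# i.e. the noise features are already aligned with the chain-of-thought features.
rng = np.random.default_rng(0)
sem = rng.normal(size=(5, 8))      # 5 semantic units, 8-dim features
vu = rng.normal(size=(4, 4, 8))    # 4x4 spatial grid of visual units
W = np.eye(8)
t_loss = text_chain_alignment_loss(sem, sem, W)
v_loss = visual_chain_alignment_loss(vu, vu, W)
```

During training these losses would be minimized jointly with the mask supervision loss so that the noised generation features track both the textual and the spatial grounding cues.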

Description

Fine-grained image editing method based on multi-modal chain-of-thought reasoning

Technical Field

The invention relates to a fine-grained image editing method based on multi-modal chain-of-thought reasoning, belonging to the technical field of computer vision and image processing.

Background

With the development of image generation and processing technology, instruction-based image editing is becoming an important interaction mode. Existing image editing techniques can be broadly divided into three categories. The first is editing based on generative adversarial network (GAN) inversion: the image to be edited is inverted into the latent space of a pre-trained GAN model, and the editing operation corresponding to the user instruction is realized by adjusting the latent variables. This approach is direct to implement, but the expressive capacity of the latent space is limited, and complex or fine-grained editing instructions are difficult to express fully. The second category is editing based on diffusion-model inversion: the inversion trajectory of the input image in the diffusion model is reconstructed, and a text instruction is introduced during inversion or reverse diffusion to steer the generation process. Diffusion models have stronger generative and expressive capacity, but the inversion process is computationally expensive, and position drift and semantic inconsistency easily occur in scenes demanding local editing or high spatial accuracy. The third class of methods relies on joint training of large-scale image-text datasets and large diffusion models.
Such methods train the model on massive annotated or synthesized editing samples so that it can directly execute the target editing operation on the input image from the user's natural-language instruction, without additional inversion. They have strong instruction-understanding capability and good generality, but their performance typically depends on the coverage of the training data, and grounding accuracy can be limited for editing tasks that lack an explicit spatial indication. In summary, existing instruction-based image editing methods still have limitations of varying degrees in expressive capability, computational cost, spatial grounding precision, and editing consistency, and it is difficult for them to simultaneously satisfy the requirements of controllability and fine-grained editing in complex editing scenes: 1) instruction disambiguation is difficult: when the user instruction contains descriptions of multiple entities or compound references (e.g. "the clothing of the person facing left"), existing text or visual models struggle to perform accurate instruction resolution and spatial localization; 2) reasoning and execution are disjointed: the few works that adopt chain-of-thought can give an interpretable reasoning process at the text level, but that reasoning is often inconsistent with the actual visual grounding and editing result; 3) existing vision-language alignment means are mostly one-shot alignment (text-to-region or vision-to-text), lacking interleaved, multi-level grounding guidance, which makes point-to-point, pixel-level accurate editing hard to realize; 4) dataset coverage is limited: existing editing datasets mostly involve single salient objects or salient regions, and large-scale interleaved grounding-editing samples that stress reference resolution and spatial reasoning are lacking, which limits model generalization in complex situations.

Disclosure of Invention

Aiming at the problem that existing image editing methods cannot simultaneously satisfy the requirements of controllability and fine-grained editing in complex editing scenes, the invention provides a fine-grained image editing method based on multi-modal chain-of-thought reasoning.
The invention relates to a fine-grained image editing method based on multi-modal chain-of-thought reasoning, comprising the following steps: obtaining a multi-entity complex scene image, performing multi-target detection and segmentation, and screening editing targets that require fine spatial reasoning; generating a visual-grounding chain of thought corresponding to the editing targets, determining the editing targets in the multi-entity complex scene image based on the visual-grounding chain of thought, and generating an editing instruction and a textual chain of thought for the editing targets; constructing a multi-modal interleaved reasoning model consisting of an understanding module and a generation module, taking a multi-entity c