CN-121982152-A - Instruction-based image editing method and system based on cognitive reasoning

Abstract

An instruction-based image editing method and system based on cognitive reasoning, relating to the field of instruction-based image editing in computer vision. The invention addresses the poor accuracy with which existing image editing systems understand natural language instructions and the poor local controllability of their editing operations. By introducing a combined disguise system prompt, a dual-phase cognitive architecture comprising a Localization Cognitive Process (LCP) and a Modification Cognitive Process (MCP) is designed, which improves the robustness of the model when the instruction is ambiguous and effectively reduces the accumulation of errors during multi-round editing. Through the staged localization and modification cognitive processes, within an iterative loop of multi-round planning, generation, and dislike, accurate positioning and controllable generation are realized systematically from semantic understanding down to pixel-level operation, and the final output is the editing result that performs best in terms of semantic fidelity, visual harmony, and editing quality. The invention is mainly used to modify images accurately using natural language.
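The pattern that both phases rely on (produce several candidates, score each one, keep the best) can be illustrated with a minimal Python sketch. It is not the patented implementation: `select_best`, `edit`, the `propose_*` callables, and `score` are hypothetical names standing in for the large multimodal model, the segmentation model, and the image restoration model, and the scoring interface is an assumption.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def select_best(candidates: Sequence[T], score: Callable[[T], float]) -> T:
    """Screening step: return the candidate with the highest score."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    return max(candidates, key=score)


def edit(image, instruction,
         propose_loc_prompts, propose_masks,
         propose_mod_prompts, propose_edits,
         score):
    """Hypothetical driver for the LCP and MCP phases.

    Each ``propose_*`` argument is an assumed callable that returns a list
    of candidates. ``score(kind, candidate)`` is an assumed callable that
    stands in for scoring under the dislike system prompt.
    """
    # Phase 1 (LCP): pick a positioning prompt, then a positioning mask.
    loc_prompt = select_best(propose_loc_prompts(image, instruction),
                             lambda c: score("loc_prompt", c))
    mask = select_best(propose_masks(image, loc_prompt),
                       lambda c: score("mask", c))

    # Phase 2 (MCP): pick a modification prompt, then an edited image.
    mod_prompt = select_best(propose_mod_prompts(image, instruction, mask),
                             lambda c: score("mod_prompt", c))
    return select_best(propose_edits(image, mod_prompt, mask),
                       lambda c: score("edit", c))
```

Each later step receives only the winner of the previous step, so a poor candidate is discarded before it can influence the mask or the final image. That is one plausible reading of how screening limits error accumulation.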

Inventors

Ni Minheng
ZHANG YAOWEN
FAN YUTAO
YAN ZIFEI
ZUO WANGMENG

Assignees

Harbin Institute of Technology (哈尔滨工业大学)

Dates

Publication Date: 20260505
Application Date: 20260130

Claims (10)

1. An instruction-based image editing method based on cognitive reasoning, characterized by comprising the following steps: First stage, localization cognitive process: under the constraint of a positioning system prompt, performing editing planning based on the semantic information of the editing instruction and on the input image to generate a positioning prompt set composed of a plurality of positioning prompts, and screening out the optimal positioning prompt in combination with a dislike system prompt; generating, according to the optimal positioning prompt and the input image, a candidate positioning mask set composed of a plurality of candidate binary positioning masks, and screening out the optimal positioning mask in combination with the dislike system prompt; Second stage, modification cognitive process: under the constraints of a modification system prompt and the optimal positioning mask, performing modification planning on the input image based on the semantic information of the editing instruction to generate a modification prompt set composed of a plurality of modification prompts, and screening out the optimal modification prompt in combination with the dislike system prompt; under the constraints of the optimal modification prompt and the optimal positioning mask, generating a candidate edited image set composed of a plurality of edited images, and screening out the optimal edited image in combination with the dislike system prompt.
2. The instruction-based image editing method based on cognitive reasoning according to claim 1, wherein performing editing planning under the constraint of the positioning system prompt, based on the semantic information of the editing instruction and on the input image, to generate a positioning prompt set composed of a plurality of positioning prompts is implemented as follows: under the constraint of the positioning system prompt, the large multimodal model performs cross-modal alignment between the semantic information of the editing instruction and the input image to determine the target area related to the editing instruction and a textual semantic description of the target area, thereby generating a positioning prompt; the above process is repeated to obtain a plurality of positioning prompts, which form the positioning prompt set.
3. The instruction-based image editing method based on cognitive reasoning according to claim 1, wherein screening out the optimal positioning prompt is implemented as follows: under the constraint of the dislike system prompt, the large multimodal model scores the degree of matching between each positioning prompt in the positioning prompt set and the multimodal context formed by the input image and the editing instruction, obtaining a corresponding score; the matching-degree scoring process is repeated a plurality of times to obtain a plurality of scores, and the positioning prompt with the highest score is selected as the optimal positioning prompt.
4. The instruction-based image editing method based on cognitive reasoning according to claim 1, wherein generating the plurality of candidate binary positioning masks according to the optimal positioning prompt and the input image is implemented as follows: the instruction segmentation model performs semantic matching between the input image and the optimal positioning prompt and outputs an initial binary positioning mask; morphological dilation is applied to the initial binary positioning mask to obtain a corresponding candidate binary positioning mask; the above process is repeated a plurality of times to obtain a candidate positioning mask set composed of a plurality of candidate binary positioning masks.
5. The instruction-based image editing method based on cognitive reasoning according to claim 1, wherein screening out the optimal positioning mask is implemented as follows: under the constraint of the dislike system prompt, the large multimodal model, in combination with the optimal positioning prompt and the input image, comprehensively scores each binary positioning mask in the candidate positioning mask set along two dimensions, spatial positioning accuracy and semantic relevance, and the binary positioning mask with the highest score is taken as the optimal positioning mask.
6. The instruction-based image editing method based on cognitive reasoning according to claim 1, wherein performing modification planning on the input image under the constraints of the modification system prompt and the optimal positioning mask, based on the semantic information of the editing instruction, to generate a modification prompt set composed of a plurality of modification prompts is implemented as follows: under the prompt constraint of the modification system prompt and the spatial constraint provided by the optimal positioning mask, the large multimodal model performs cross-modal alignment between the editing instruction and the input image to determine a modification prompt for the target area; the above process is repeated to obtain a modification prompt set composed of a plurality of modification prompts.
7. The instruction-based image editing method based on cognitive reasoning according to claim 1, wherein screening out the optimal modification prompt is implemented as follows: under the constraint of the dislike system prompt, the large multimodal model, in combination with the editing instruction, the input image, and the optimal positioning mask, comprehensively scores each modification prompt in the modification prompt set along two dimensions, semantic consistency and visual context, and the modification prompt with the highest score is selected as the optimal modification prompt.
8. The instruction-based image editing method based on cognitive reasoning according to claim 1, wherein generating a candidate edited image set composed of a plurality of edited images under the constraints of the optimal modification prompt and the optimal positioning mask is implemented as follows: the target area of the input image is determined according to the optimal positioning mask; meanwhile, under the constraint of the optimal modification prompt, the image restoration model performs conditional generation on the target area of the input image to generate an edited image; the above process of generating an edited image is repeated a plurality of times to obtain a candidate edited image set composed of a plurality of edited images, wherein the noise parameters of the image restoration model are random in each generation of an edited image.
9. The instruction-based image editing method based on cognitive reasoning according to claim 1, wherein screening out the optimal edited image is implemented as follows: under the constraint of the dislike system prompt, the large multimodal model, in combination with the editing instruction and the input image, comprehensively scores each edited image in the candidate edited image set in three respects, semantic fidelity, visual quality, and consistency with the unedited areas, and the edited image with the highest score is taken as the optimal edited image.
10. An instruction-based image editing system based on cognitive reasoning, comprising a storage device, a processor, and a computer program stored in the storage device and executable on the processor, wherein the processor, when executing the computer program, implements the instruction-based image editing method based on cognitive reasoning according to any one of claims 1 to 9.
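Two of the claimed operations are easy to illustrate on toy data: the morphological dilation applied to a binary positioning mask (claim 4) and the restriction of generation to the target area (claim 8). The Python sketch below uses plain nested lists. The square structuring element, the `radius` parameter, and the `composite` step are assumptions made for illustration; the claims specify neither the structuring element nor how the restoration model keeps the pixels outside the mask unchanged.

```python
from typing import List

Mask = List[List[int]]   # 1 = inside the target area, 0 = outside
Image = List[List[int]]  # toy single-channel image


def dilate(mask: Mask, radius: int = 1) -> Mask:
    """Binary dilation with a square (2 * radius + 1) structuring element."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                # Switch on every pixel within `radius` of a foreground pixel.
                for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[ny][nx] = 1
    return out


def composite(original: Image, generated: Image, mask: Mask) -> Image:
    """Take generated pixels inside the mask and original pixels elsewhere."""
    return [
        [g if m else o for o, g, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, mask)
    ]


mask = [[0, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0]]
grown = dilate(mask)  # columns 0 to 2 become 1 in every row; column 3 stays 0
original = [[5] * 4 for _ in range(3)]
generated = [[9] * 4 for _ in range(3)]
edited = composite(original, generated, grown)  # every row is [9, 9, 9, 5]
```

Dilating the mask gives the generator a small margin around the segmented object, while the compositing step shows one simple way to guarantee that pixels outside the mask are left unchanged.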

Description

Instruction-based image editing method and system based on cognitive reasoning

Technical Field

The present invention relates to the field of instruction-based image editing in computer vision.

Background

Instruction-based image editing aims to modify images accurately using natural language. This paradigm lets users express abstract or high-level editing intent in natural language, bridging human imagination and visual manipulation. In recent years, the rapid development of large multimodal models has driven advances in instruction-based image editing; however, existing approaches still face several long-standing challenges, especially in scenarios that require high-level semantic reasoning or fine visual consistency. When an editing result does not meet expectations, it is often difficult to tell whether the cause is an error in understanding the instruction or insufficient control over generation.

On the one hand, most models lack a structured reasoning mechanism. They usually map the abstract instructions supplied by the user directly onto a pixel-level generation or transformation of the whole image, and they have difficulty decomposing complex instructions into clear, executable editing steps or editing plans, so their ability to generalize across levels of abstraction and diverse visual contexts is limited. On the other hand, existing methods generally process the image as a whole and lack an explicit region isolation mechanism. As a result, instruction understanding is inaccurate, editing regions are located imprecisely, and editing operations offer insufficient local controllability, which easily causes unwanted modifications to regions unrelated to the instruction. This is especially harmful in images that are highly sensitive to detail, such as those containing text or precise visual layouts. In continuous multi-round editing, these problems accumulate and eventually make the editing results unacceptable.
Disclosure of Invention

The invention aims to solve the problems that existing instruction-based image editing systems understand natural language instructions inaccurately, locate the editing area imprecisely, and offer insufficient local controllability of editing operations.

The instruction-based image editing method based on cognitive reasoning comprises the following steps:

First stage, Localization Cognitive Process (LCP): under the constraint of a positioning system prompt, performing editing planning based on the semantic information of the editing instruction and on the input image to generate a positioning prompt set composed of a plurality of positioning prompts, and screening out the optimal positioning prompt in combination with a dislike system prompt; generating, according to the optimal positioning prompt and the input image, a candidate positioning mask set composed of a plurality of candidate binary positioning masks, and screening out the optimal positioning mask in combination with the dislike system prompt.

Second stage, Modification Cognitive Process (MCP): under the constraints of a modification system prompt and the optimal positioning mask, performing modification planning on the input image based on the semantic information of the editing instruction to generate a modification prompt set composed of a plurality of modification prompts, and screening out the optimal modification prompt in combination with the dislike system prompt; under the constraints of the optimal modification prompt and the optimal positioning mask, generating a candidate edited image set composed of a plurality of edited images, and screening out the optimal edited image in combination with the dislike system prompt.

Preferably, performing editing planning under the constraint of the positioning system prompt, based on the semantic information of the editing instruction and on the input image, to generate a positioning prompt set composed of a plurality of positioning prompts is implemented as follows: under the constraint of the positioning system prompt, the large multimodal model performs cross-modal alignment between the semantic information of the editing instruction and the input image to determine the target area related to the editing instruction and a textual semantic description of the target area, thereby generating a positioning prompt; the above process is repeated to obtain a plurality of positioning prompts, which form the positioning prompt set.

Preferably, screening out the optimal positioning prompt is implemented as follows: under the constraint of the dislike system prompt, the large multimodal model scores the degree of matching between each positioning prompt in the positioning prompt set and the multimodal context formed by the input image and the editing instruction, obtaining a corresponding score; the matching-degree scoring process is repeated a plurality of times to obtain a plurality of scores, and the positioning prompt with the highest score is selected as the optimal positioning prompt.

Preferably, the prompt i