CN-121616490-B - Training method and device for image processing model, electronic equipment and storage medium

CN 121616490 B

Abstract

The embodiments of the present application disclose a training method and device for an image processing model, an electronic device, and a storage medium, belonging to the technical field of computer processing. The method comprises: marking at least a partial area of a label image in a generated stitched image to obtain a target area; obtaining label noise of the target area; adding noise to the target area to obtain a noise image; acquiring noise image features of the noise image and text features extracted from an instruction text; inputting the text features and the noise image features into an image processing model and outputting a predicted image; obtaining image noise corresponding to the target area in the predicted image; calculating a noise difference between the image noise and the label noise; and performing iterative training on the image processing model according to the noise difference to obtain a trained image processing model. The method and device can improve the universality of the trained model, and thereby improve the accuracy of the task processing results output when the model is used for actual task processing.

Inventors

  • WANG YAOWEI
  • BAO BINGKUN
  • TAO MING
  • GAO FENG
  • YIN BING
  • YIN BAOCAI

Assignees

  • Peng Cheng Laboratory (鹏城实验室)

Dates

Publication Date
2026-05-05
Application Date
2026-01-30

Claims (10)

  1. A method of training an image processing model, comprising: for each different image processing task, acquiring a source image corresponding to the image processing task, a label image corresponding to the source image, and an instruction text corresponding to the source image; stitching the label image and the source image to obtain a stitched image, marking at least a partial area of the label image in the stitched image to obtain a target area, and obtaining label noise corresponding to the target area; adding noise to the target area to obtain a noise image corresponding to the stitched image; acquiring noise image features of the noise image and text features extracted from the instruction text; and inputting the text features and the noise image features into an image processing model, outputting a predicted image, acquiring image noise corresponding to the target area in the predicted image, calculating a noise difference between the image noise and the label noise, and performing iterative training on the image processing model according to the noise difference to obtain a trained image processing model, wherein the trained image processing model is used for processing a target source image according to a target instruction text to obtain a target generated image.
  2. The method for training an image processing model according to claim 1, wherein marking at least a partial area of the label image in the stitched image to obtain a target area comprises: if the instruction text indicates that image processing is required on all areas of the source image, marking all areas of the label image in the stitched image to obtain the target area; otherwise, coarsely screening the area expected to be processed in the label image in the stitched image according to the instruction text to obtain an initial area; determining an expected semantic type label to which the area expected to be processed belongs according to the instruction text, and performing image semantic analysis on the source image to obtain a reference semantic type label corresponding to the source image; and finely screening the initial area according to the expected semantic type label and the reference semantic type label to obtain an updated area, and marking the updated area in the stitched image to obtain the target area.
  3. The method for training an image processing model according to claim 1, wherein adding noise to the target area to obtain a noise image corresponding to the stitched image comprises: performing downsampling on the stitched image to obtain a downsampled stitched image; acquiring a preset time step, and determining a noise weight matrix according to the preset time step; and adding noise to the target area in the downsampled stitched image according to the noise weight matrix to obtain the noise image corresponding to the stitched image.
  4. The method for training an image processing model according to claim 3, wherein acquiring a preset time step and determining a noise weight matrix according to the preset time step comprises: when the preset time step is smaller than a preset time step threshold, determining the noise weight matrix according to the preset time step, wherein each noise weight value in the noise weight matrix is the same and greater than a preset critical noise weight value; and when the preset time step is equal to or greater than the preset time step threshold, determining the noise weight matrix according to the preset time step, wherein the noise weight values in the noise weight matrix follow a radial attenuation distribution and none exceeds the critical noise weight value.
  5. The method for training an image processing model according to claim 1, wherein the image processing model comprises a pre-constructed initial low-rank matrix, and performing iterative training on the image processing model according to the noise difference to obtain a trained image processing model comprises: updating the initial low-rank matrix according to the noise difference to obtain an updated low-rank matrix; and taking the updated low-rank matrix as a new initial low-rank matrix in the image processing model, and returning to the step of acquiring a source image corresponding to each image processing task, a label image corresponding to the source image, and an instruction text corresponding to the source image, until the calculated noise difference between the image noise and the label noise is smaller than a preset noise threshold, so as to obtain the trained image processing model.
  6. The method of training an image processing model according to claim 1, further comprising, after obtaining the trained image processing model: acquiring an initial source image and a target instruction text corresponding to the initial source image; initializing an image to be processed according to the initial source image, and marking a region to be processed of the image to be processed according to the initial source image and the target instruction text; acquiring reference image features of the initial source image, target text features extracted from the target instruction text, and features to be processed of the image to be processed; and inputting the target text features, the features to be processed, and the reference image features into the trained image processing model, and determining a target predicted image after predicting the region to be processed of the image to be processed according to an output result.
  7. The method for training an image processing model according to claim 6, wherein inputting the target text features, the features to be processed, and the reference image features into the trained image processing model and determining a target predicted image after predicting the region to be processed of the image to be processed according to an output result comprises: inputting the target text features, the features to be processed, and the reference image features into the trained image processing model, and outputting an initial predicted image obtained after predicting the region to be processed of the image to be processed; if the region to be processed is not the entirety of the image to be processed, determining a foreground predicted image corresponding to the region to be processed from the initial predicted image; acquiring a target weight matrix corresponding to the foreground predicted image, and updating the foreground predicted image according to the target weight matrix to obtain an updated foreground predicted image, wherein the pixel weight values in the target weight matrix follow a radial attenuation distribution; and updating the initial predicted image according to the updated foreground predicted image to obtain the target predicted image.
  8. A training device for an image processing model, comprising: an acquisition module, configured to acquire, for each different image processing task, a source image corresponding to the image processing task, a label image corresponding to the source image, and an instruction text corresponding to the source image; a marking module, configured to stitch the label image and the source image to obtain a stitched image, mark at least a partial area of the label image in the stitched image to obtain a target area, and obtain label noise corresponding to the target area; a noise adding module, configured to add noise to the target area to obtain a noise image corresponding to the stitched image; a feature extraction module, configured to acquire noise image features of the noise image and text features extracted from the instruction text; and a prediction module, configured to input the text features and the noise image features into an image processing model, output a predicted image, acquire image noise corresponding to the target area in the predicted image, calculate a noise difference between the image noise and the label noise, and perform iterative training on the image processing model according to the noise difference to obtain a trained image processing model, the trained image processing model processing a target source image according to a target instruction text to obtain a target generated image.
  9. An electronic device, comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the training method of the image processing model according to any one of claims 1 to 7 when executing the computer program.
  10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the training method of the image processing model according to any one of claims 1 to 7.
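Claims 3 and 4 describe a noise weight matrix that is uniform (and above a critical value) for time steps below a threshold, and radially attenuated (capped at the critical value) otherwise. The sketch below is one way such a matrix could be built; the Gaussian falloff, the `critical_weight` default, and the uniform value `critical_weight + 0.1` are illustrative assumptions, since the claims only constrain the weights relative to an unspecified critical value.

```python
import math

def noise_weight_matrix(h, w, t, t_threshold, critical_weight=0.5):
    """Build an h-by-w noise weight matrix in the spirit of claims 3-4.

    Below the time-step threshold, every weight is the same constant,
    strictly greater than the critical value; at or above the threshold,
    weights decay radially from the region centre and never exceed the
    critical value. The Gaussian falloff is an assumption -- the claims
    only state "radial attenuation distribution".
    """
    if t < t_threshold:
        # Uniform matrix, strictly greater than the critical weight.
        return [[critical_weight + 0.1] * w for _ in range(h)]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    sigma = max(h, w) / 2.0
    mat = []
    for y in range(h):
        row = []
        for x in range(w):
            # Squared distance from the region centre drives the decay.
            r2 = (y - cy) ** 2 + (x - cx) ** 2
            # Radial decay, capped so no weight exceeds the critical value.
            row.append(critical_weight * math.exp(-r2 / (2 * sigma ** 2)))
        mat.append(row)
    return mat
```

Under this reading, the radial cap for large time steps concentrates noise at the centre of the target area and tapers it toward the boundary, which would soften the seam between the noised region and its surroundings.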

Description

Training method and device for image processing model, electronic equipment and storage medium

Technical Field

The present application relates to the field of computer processing technologies, and in particular to a training method and apparatus for an image processing model, an electronic device, and a storage medium.

Background

In recent years, large-scale pre-trained generative models have made remarkable progress in the field of image synthesis, and text-to-image diffusion models in particular have become a research hotspot by virtue of their excellent generation quality and controllability. In the related art, a pre-trained model with processing capacity for a corresponding task is obtained by setting different task branch networks in a diffusion model and performing targeted training. However, the downstream image processing tasks are varied, such as image restoration and semantic editing, whereas the task branch networks that can be arranged in the diffusion model are limited. A pre-trained model obtained by training the diffusion model in this way therefore struggles to cover all downstream task scenes, its universality is poor, and the task processing results it outputs in actual task processing are inaccurate.

Disclosure of Invention

The embodiments of the present application provide a training method, a training device, an electronic device, and a storage medium for an image processing model, which can improve the universality of the trained model and thereby improve the accuracy of the task processing results output when the model is used for actual task processing.
To achieve the above object, one aspect of the embodiments of the present application provides a training method for an image processing model, including: for each different image processing task, acquiring a source image corresponding to the image processing task, a label image corresponding to the source image, and an instruction text corresponding to the source image; stitching the label image and the source image to obtain a stitched image, marking at least a partial area of the label image in the stitched image to obtain a target area, and obtaining label noise corresponding to the target area; adding noise to the target area to obtain a noise image corresponding to the stitched image; acquiring noise image features of the noise image and text features extracted from the instruction text; and inputting the text features and the noise image features into an image processing model, outputting a predicted image, acquiring image noise corresponding to the target area in the predicted image, calculating a noise difference between the image noise and the label noise, and performing iterative training on the image processing model according to the noise difference to obtain a trained image processing model, the trained image processing model processing a target source image according to a target instruction text to obtain a target generated image.
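A single iteration of the method summarized above can be sketched as follows. This is a deliberately simplified toy: it uses flattened 1-D lists in place of images, a plain callable in place of the diffusion model, and omits the text-feature conditioning, so the function and parameter names are illustrative rather than taken from the patent.

```python
import random

def training_step(stitched, target_mask, weights, model, t):
    """One toy training iteration: noise the marked target region,
    have the model predict the noise, and return the noise difference
    (mean squared error over the target region only)."""
    # Sample label noise for the target region; zero elsewhere.
    label_noise = [random.gauss(0.0, 1.0) if m else 0.0 for m in target_mask]
    # Add noise only inside the target region, scaled by the weight matrix.
    noisy = [p + w * n if m else p
             for p, w, n, m in zip(stitched, weights, label_noise, target_mask)]
    # The model predicts the noise at time step t.
    predicted_noise = model(noisy, t)
    # Noise difference between image noise and label noise, target region only.
    diffs = [(pn - ln) ** 2
             for pn, ln, m in zip(predicted_noise, label_noise, target_mask) if m]
    return sum(diffs) / max(len(diffs), 1)
```

In the full method, this step would be repeated, updating the model's low-rank matrix each time, until the noise difference drops below a preset threshold.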
In some embodiments, marking at least a partial area of the label image in the stitched image to obtain a target area includes: if the instruction text indicates that image processing is required on all areas of the source image, marking all areas of the label image in the stitched image to obtain the target area; otherwise, coarsely screening the area expected to be processed in the label image in the stitched image according to the instruction text to obtain an initial area; determining an expected semantic type label to which the area expected to be processed belongs according to the instruction text, and performing image semantic analysis on the source image to obtain a reference semantic type label corresponding to the source image; and finely screening the initial area according to the expected semantic type label and the reference semantic type label to obtain an updated area, and marking the updated area in the stitched image to obtain the target area.

In some embodiments, adding noise to the target area to obtain a noise image corresponding to the stitched image includes: performing downsampling on the stitched image to obtain a downsampled stitched image; acquiring a preset time step, and determining a noise weight matrix according to the preset time step; and adding noise to the target area in the downsampled stitched image according to the noise weight matrix to obtain the noise image corresponding to the stitched image.

In some embodiments, acquiring a preset time step and determining a noise weight matrix according to the preset time step includes: when the preset time step is smaller than a preset time step threshold, determining the noise weight matrix according to the preset time step, wherein each noise weight value in the noise weight matrix is the same and greater than a preset critical noise weight value; when the preset time step is equal to or greater than the preset time step