CN-122023582-A - Image generation method, apparatus, readable storage medium, and program product
Abstract
The present application relates to an image generation method, apparatus, computer device, computer-readable storage medium, and computer program product. The method comprises: in response to an image generation request carrying a plurality of original images and a group-photo description text, wherein the plurality of original images contain a plurality of subjects and the request is for generating a target group photo of the plurality of subjects, generating an initial scene image through a first diffusion model according to the group-photo description text, the initial scene image comprising a plurality of regions to be filled; analyzing the initial scene image to determine subject pose features corresponding to the regions to be filled; performing feature analysis on the plurality of original images to obtain attribute features of each subject; and inputting the subject pose features as pose constraint conditions and the attribute features as identity constraint conditions into a second diffusion model for image generation to obtain the target group photo. This method improves the visual quality of the generated group photo.
Inventors
- LIU TING
- HE YUHAN
- LIANG HUANWEI
- QU XIAOCHAO
- LIU LUOQI
Assignees
- 厦门美图之家科技有限公司
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-01-28
Claims (10)
- 1. An image generation method, the method comprising: in response to an image generation request carrying a plurality of original images and a group-photo description text, wherein the plurality of original images contain a plurality of subjects, the image generation request requests generation of a target group photo of the plurality of subjects, and the group-photo description text describes at least one of the pose, the interaction relationship, and the background scene of each subject in the target group photo; generating an initial scene image through a first diffusion model according to the group-photo description text, wherein the initial scene image comprises a plurality of regions to be filled; analyzing the initial scene image to determine subject pose features corresponding to the regions to be filled; performing feature analysis on the plurality of original images to obtain attribute features of each subject; and inputting the subject pose features as pose constraint conditions and the plurality of attribute features as identity constraint conditions into a second diffusion model for image generation to obtain the target group photo.
- 2. The method of claim 1, wherein the first diffusion model comprises a first text encoder, a first denoising network, and a first image decoder, wherein the group-photo description text guides user input through a preset prompt-word template, the prompt-word template comprising a first field defining the number of people, a second field defining a core pose or interaction relationship, and a third field defining a background environment; and wherein generating the initial scene image through the first diffusion model according to the group-photo description text comprises: inputting the group-photo description text into the text encoder to obtain a text embedding vector; iteratively executing a multi-step denoising process on a random noise tensor through the first denoising network until a preset termination condition is met, to obtain a target noise tensor, wherein each denoising step comprises inputting the current noise tensor and the text embedding vector into the first denoising network to obtain a noise prediction value for removing noise from the current noise tensor; and inputting the target noise tensor into the image decoder to obtain the initial scene image.
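As an illustrative sketch (not part of the claims), the iterative denoising loop of claim 2 can be written as follows. The predictor function, latent shape, step count, and the simple subtraction update are all assumptions for illustration; the patent's actual denoising network and schedule are not specified at this level of detail.

```python
import numpy as np

def denoise_scene(noise_pred_fn, text_embedding, shape=(4, 64, 64),
                  num_steps=50, seed=0):
    """Iteratively remove predicted noise from a random tensor.

    `noise_pred_fn` stands in for the first denoising network: it maps
    (current tensor, text embedding vector, step index) -> predicted noise.
    The fixed step count plays the role of the preset termination condition.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)      # random noise tensor
    for step in range(num_steps):
        eps = noise_pred_fn(x, text_embedding, step)
        x = x - eps / num_steps         # remove a fraction of the predicted noise
    return x                            # target noise tensor, fed to the image decoder

# Toy predictor: pretends the clean latent is all zeros, so the predicted
# noise at each step is the current tensor itself.
target = denoise_scene(lambda x, emb, step: x, text_embedding=None)
```

With this toy predictor each step scales the tensor by 0.98, so the latent converges toward the (all-zero) "clean" target, mirroring how repeated noise removal drives the random tensor toward a decodable scene latent.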
- 3. The method of claim 1, wherein the subject pose features comprise contour features and pose skeleton features of the regions to be filled, and wherein analyzing the initial scene image to determine the subject pose features corresponding to each region to be filled comprises: performing instance segmentation on the initial scene image to obtain instance segmentation masks respectively corresponding to the regions to be filled; performing keypoint detection on the initial scene image to obtain pose skeleton data corresponding to each region to be filled; and for each region to be filled, performing association matching between the instance segmentation mask corresponding to the region and the pose skeleton data, based on their relative positional relationship, to obtain the subject pose feature.
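The circumscribed rectangles used for the association matching can be derived directly from the two outputs of this step. The minimal numpy sketch below (an illustration, not the patent's implementation) extracts an axis-aligned bounding rectangle from a binary instance mask and from a set of skeleton keypoints.

```python
import numpy as np

def mask_to_rect(mask):
    """Axis-aligned circumscribed rectangle (x0, y0, x1, y1) of a binary mask."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def keypoints_to_rect(keypoints):
    """Circumscribed rectangle enclosing all (x, y) skeleton keypoints."""
    pts = np.asarray(keypoints, dtype=float)
    (x0, y0), (x1, y1) = pts.min(axis=0), pts.max(axis=0)
    return float(x0), float(y0), float(x1), float(y1)
```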
- 4. The method of claim 3, wherein, for each region to be filled, performing association matching between the instance segmentation mask corresponding to the region and the pose skeleton data based on their relative positional relationship to obtain the subject pose feature comprises: determining a first circumscribed rectangle of each instance segmentation mask in the initial scene image; determining a second circumscribed rectangle enclosing all keypoints of each set of pose skeleton data in the initial scene image; determining the intersection-over-union (IoU) between each first circumscribed rectangle and each second circumscribed rectangle; and, for each instance segmentation mask, determining the pose skeleton data with the largest IoU as the match for that instance segmentation mask.
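The IoU-based matching rule of claim 4 can be sketched as below; the greedy per-mask argmax (with no tie-breaking or minimum-overlap threshold) is an illustrative simplification.

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) rectangles."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def match_skeletons_to_masks(mask_rects, skeleton_rects):
    """For each mask rectangle, index of the skeleton rectangle with largest IoU."""
    return [max(range(len(skeleton_rects)), key=lambda j: iou(m, skeleton_rects[j]))
            for m in mask_rects]
```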
- 5. The method of claim 1, wherein performing feature analysis on the plurality of original images to obtain the attribute features of each subject comprises: performing subject detection on each original image to obtain a face region image of each subject; and inputting the face region image into a preset identity encoder to obtain the attribute features of the subject, wherein the identity encoder comprises the image encoding network of a personalized image adapter model, the training data of the identity encoder comprises face images of the same identity under different poses, illumination, and backgrounds, and the training objective of the identity encoder comprises reducing the feature distance of the same identity across different images while enlarging the feature distance between different identities.
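The training objective described in claim 5 (pull same-identity features together, push different-identity features apart) is the shape of a contrastive loss. The sketch below is one common formulation under that assumption; the patent does not specify the exact loss, and the margin value is illustrative.

```python
import numpy as np

def identity_contrastive_loss(embeddings, identities, margin=0.5):
    """Mean pairwise loss: squared distance for same-identity pairs (pull
    together), hinged squared margin shortfall for different-identity
    pairs (push apart beyond `margin`)."""
    losses = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            d = np.linalg.norm(np.asarray(embeddings[i]) - np.asarray(embeddings[j]))
            if identities[i] == identities[j]:
                losses.append(d ** 2)                     # same identity: shrink distance
            else:
                losses.append(max(0.0, margin - d) ** 2)  # different: enforce margin
    return float(np.mean(losses))
```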
- 6. The method of claim 3, wherein the second diffusion model comprises a base diffusion model, a first control network, a second control network, and an identity adapter network, wherein the base diffusion model comprises a second denoising network, and the first and second control networks are each connected to the second denoising network as adapters; and wherein inputting the subject pose features as pose constraint conditions and the plurality of attribute features as identity constraint conditions into the second diffusion model for image generation to obtain the target group photo comprises: inputting the group-photo description text into a text encoder coupled to the base diffusion model to obtain text condition features; inputting the pose skeleton data corresponding to each subject into the first control network and the mask data corresponding to each subject pose into the second control network to obtain a spatial control feature map; inputting each attribute feature into the identity adapter network, which injects each attribute feature as a weight condition into a cross-attention layer of the second denoising network; and performing conditional denoising on the text condition features, the spatial control feature map, and the attribute features through the second denoising network to obtain the target group photo.
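The identity-injection step of claim 6 resembles adapter-style cross-attention, where identity features enter the attention layer as extra key/value entries scaled by an adapter weight. The sketch below is a minimal single-head illustration under that assumption; real attention layers use separate learned key/value projections, which are omitted here for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_with_identity(queries, text_feats, identity_feats, id_weight=1.0):
    """Single-head cross-attention in which identity features are appended
    to the text features as extra key/value entries, scaled by an adapter
    weight (keys and values share one matrix here for simplicity)."""
    kv = np.concatenate([text_feats, id_weight * identity_feats], axis=0)
    d = queries.shape[-1]
    attn = softmax(queries @ kv.T / np.sqrt(d), axis=-1)  # rows sum to 1
    return attn @ kv

# Hypothetical shapes: 4 latent query tokens, 3 text tokens, 2 identity tokens.
rng = np.random.default_rng(0)
out = cross_attention_with_identity(rng.standard_normal((4, 8)),
                                    rng.standard_normal((3, 8)),
                                    rng.standard_normal((2, 8)))
```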
- 7. The method of claim 1, wherein, after inputting the subject pose features as pose constraints and the plurality of attribute features as identity constraints into the second diffusion model for image generation, the method further comprises: displaying the target group photo through a preset interactive interface; responding to a selection of at least one subject region in the target group photo and an attribute adjustment operation on the interactive interface; and, based on the attribute parameters corresponding to the adjustment operation, re-executing the step of inputting the subject pose features as pose constraint conditions and the plurality of attribute features as identity constraint conditions into the second diffusion model for image generation, and generating and displaying an updated target group photo, wherein the attribute adjustment operation comprises at least one of replacing identity features, fine-tuning poses, adjusting illumination and hue, and modifying background details.
- 8. An image generation apparatus, the apparatus comprising: a response module, configured to respond to an image generation request carrying a plurality of original images and a group-photo description text, wherein the plurality of original images contain a plurality of subjects, the image generation request requests generation of a target group photo of the plurality of subjects, and the group-photo description text describes at least one of the pose, the interaction relationship, and the background scene of each subject in the target group photo; a first generation module, configured to generate an initial scene image through a first diffusion model according to the group-photo description text, wherein the initial scene image comprises a plurality of regions to be filled; a first analysis module, configured to analyze the initial scene image and determine the subject pose features corresponding to each region to be filled; a second analysis module, configured to perform feature analysis on the plurality of original images to obtain the attribute features of each subject; and a second generation module, configured to input the subject pose features as pose constraint conditions and the plurality of attribute features as identity constraint conditions into a second diffusion model for image generation to obtain the target group photo.
- 9. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.
- 10. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.
Description
Image generation method, apparatus, readable storage medium, and program product
Technical Field
The present application relates to the field of image generation technology, and in particular to an image generation method, an image generation apparatus, a computer device, a computer-readable storage medium, and a computer program product.
Background
Multi-subject group photo generation is a cutting-edge and highly challenging task in the field of AI-generated content (AIGC). In the related art, a conventional image generation process mainly revolves around "matting and pasting": a foreground segmentation technique (i.e., "matting") first separates multiple image subjects from their respective original photos, and the subjects are then "pasted" into designated positions of a shared background image. Alternatively, the user selects a model photo template, and a face-swapping technique replaces the face of a designated model in the template with the face from the user's photo. Both techniques suffer from the problem that the generated group photo looks unnatural.
Disclosure of Invention
Based on this, the present application provides an image generation method, apparatus, computer device, computer-readable storage medium, and computer program product capable of improving the visual quality of the generated group photo.
In one aspect, the present application provides an image generation method, the method including: in response to an image generation request carrying a plurality of original images and a group-photo description text, wherein the plurality of original images contain a plurality of subjects, the image generation request requests generation of a target group photo of the plurality of subjects, and the group-photo description text describes at least one of the pose, the interaction relationship, and the background scene of each subject in the target group photo; generating an initial scene image through a first diffusion model according to the group-photo description text, wherein the initial scene image comprises a plurality of regions to be filled; analyzing the initial scene image to determine subject pose features corresponding to the regions to be filled; performing feature analysis on the plurality of original images to obtain attribute features of each subject; and inputting the subject pose features as pose constraint conditions and the plurality of attribute features as identity constraint conditions into a second diffusion model for image generation to obtain the target group photo.
In some embodiments, the first diffusion model comprises a first text encoder, a first denoising network, and a first image decoder, wherein the group-photo description text guides user input through a preset prompt-word template, the prompt-word template comprising a first field defining the number of people, a second field defining a core pose or interaction relationship, and a third field defining a background environment; and generating the initial scene image through the first diffusion model according to the group-photo description text comprises: inputting the group-photo description text into the text encoder to obtain a text embedding vector; iteratively executing a multi-step denoising process on a random noise tensor through the first denoising network until a preset termination condition is met, to obtain a target noise tensor, wherein each denoising step comprises inputting the current noise tensor and the text embedding vector into the first denoising network to obtain a noise prediction value for removing noise from the current noise tensor; and inputting the target noise tensor into the image decoder to obtain the initial scene image.
In some embodiments, the subject pose features include contour features and pose skeleton features of the regions to be filled, and analyzing the initial scene image to determine the subject pose features corresponding to each region to be filled includes: performing instance segmentation on the initial scene image to obtain instance segmentation masks respectively corresponding to the regions to be filled; performing keypoint detection on the initial scene image to obtain pose skeleton data corresponding to each region to be filled; and for each region to be filled, performing association matching between the instance segmentation mask corresponding to the region and the pose skeleton data, based on their relative positional relationship, to obtain the subject pose feature. In some embodiments, for each region to be filled, based on the relative positional relationship between the instance segmentation mask and the pose skeleton data,