
CN-122023550-A - Method, system and equipment for generating target style image


Abstract

The invention provides a method, a system and equipment for generating a target style image, belonging to the technical field of computer vision. The method comprises: acquiring a source content image to be stylized provided by a user and a character style image specified by the user; and inputting the source content image and the character style image into a deployed image generation model to generate a target style image consistent with the style of the character style image. The image generation model is obtained by training, in an unsupervised manner, a generation network comprising an auxiliary network and a diffusion network on a content image sample set and a style image sample set, where no paired correspondence exists between the content image sample set and the style image sample set. The invention improves the quality and diversity of the generated style characters.

Inventors

  • LU XIONGBO
  • CHEN YAXIONG
  • SU WANJUAN

Assignees

  • 武汉学院 (Wuhan College)

Dates

Publication Date
2026-05-12
Application Date
2025-12-31

Claims (10)

  1. A method of generating a target style image, comprising: acquiring a source content image to be stylized provided by a user and a character style image specified by the user; and inputting the source content image and the character style image into a deployed image generation model to generate a target style image consistent with the style of the character style image; wherein the image generation model is obtained by training, in an unsupervised manner, a generation network comprising an auxiliary network and a diffusion network on a content image sample set and a style image sample set, and no paired correspondence exists between the content image sample set and the style image sample set.
  2. The method for generating a target style image according to claim 1, wherein before the acquiring of the source content image to be stylized provided by the user, the method comprises: providing the content image sample set, which comprises source content image samples and reference content image samples, and the style image sample set, which comprises character style images; inputting a source content image sample and a character style image sample into the auxiliary network to generate a pseudo-style image; inputting a reference content image sample and the pseudo-style image into the diffusion network to generate a reference style image; and calculating a joint total loss according to the reference style image, and optimizing the generation network according to the joint total loss to obtain the image generation model.
  3. The method for generating a target style image according to claim 2, wherein the generation network further includes a feature fusion module and a constraint module, the feature fusion module being integrated in the auxiliary network and the diffusion network and used for fusion based on deformation-driving features generated in the auxiliary network and the diffusion network, and the constraint module being connected with the auxiliary network and used for comparing differences between the source content image sample and the pseudo-style image; and wherein the calculating of the joint total loss according to the reference style image and the optimizing of the generation network according to the joint total loss to obtain the image generation model comprise: calculating the joint total loss from output losses that at least comprise a diffusion loss calculated from the diffusion network's prediction of the noise added to the reference style image, an auxiliary network loss calculated by adversarial discrimination between the pseudo-style image and the character style image sample in the auxiliary network, a feature fusion loss calculated from the deformation generated by the feature fusion module, and a character content consistency constraint loss calculated from the difference output by the constraint module; and iteratively optimizing the parameters of the generation network by minimizing the joint total loss to obtain the trained image generation model.
  4. The method of generating a target style image according to claim 2, wherein the auxiliary network comprises a generator and a discriminator; the generator generates the pseudo-style image from the input source content image sample and character style image sample; and the discriminator and the generator judge the authenticity and style category of the input image through adversarial training.
  5. The method of generating a target style image according to claim 4, wherein the generator adopts an encoder-decoder structure and comprises a content encoder, a style encoder and a first decoder, the content encoder and the first decoder being connected through a first depth feature deformation injection module, and the first depth feature deformation injection module comprising a first encoder feature mapping module, a first decoder feature mapping module, a first cross-attention calculation module and a first deformable convolution module; and wherein the method further comprises: mapping and aligning the intermediate layer features extracted by the content encoder through the first encoder feature mapping module to obtain first auxiliary output features; mapping and aligning the intermediate layer features extracted by the first decoder through the first decoder feature mapping module to obtain second auxiliary output features; calculating, by the first cross-attention calculation module, cross-attention between the first auxiliary output features and the second auxiliary output features to obtain first offset parameters; deforming the intermediate layer features of the content encoder by the first deformable convolution module using the first offset parameters to obtain first deformed content structural features; and fusing and decoding, through the first decoder, the first deformed content structural features and the first style features extracted by the style encoder to obtain the pseudo-style image.
  6. The method of generating a target style image according to claim 4, wherein the diffusion network comprises a content feature extractor, a style feature extractor and a noise predictor; structural features of the input reference content image sample and the pseudo-style image are extracted by the content feature extractor; style features of the input pseudo-style image are extracted by the style feature extractor; and the structural features and the style features are injected and fused by the noise predictor to generate the reference style image.
  7. The method for generating a target style image according to claim 6, wherein the noise predictor adopts a U-Net architecture, the content feature extractor, the style feature extractor and a second decoder being connected through a second depth feature deformation injection module that comprises a second encoder feature mapping module, a second decoder feature mapping module, a second cross-attention calculation module and a second deformable convolution module; and wherein the method further comprises: mapping and aligning the intermediate layer features extracted by the content feature extractor through the second encoder feature mapping module to obtain first diffusion output features; mapping and aligning the intermediate layer features extracted by the second decoder through the second decoder feature mapping module to obtain second diffusion output features; calculating, by the second cross-attention calculation module, cross-attention between the first diffusion output features and the second diffusion output features to obtain second offset parameters; deforming the intermediate layer features of the content feature extractor by the second deformable convolution module using the second offset parameters to obtain second deformed content structural features; and denoising, by the second decoder, based on the second deformed content structural features, the second style features extracted by the style feature extractor and the predicted noise, to generate the reference style image.
  8. The method of generating a target style image according to any one of claims 1 to 7, wherein the deployed image generation model is the diffusion network deployed independently after training, the diffusion network generating the target style image according to the source content image and the character style image; or the deployed image generation model is the auxiliary network deployed independently after training, the auxiliary network generating the target style image according to the source content image and the character style image.
  9. A system for generating an unsupervised style character, comprising: an acquisition module for acquiring a source content image to be stylized provided by a user and a character style image designated by the user; and a generation module for inputting the source content image and the character style image into a deployed image generation model to generate a target style image consistent with the style of the character style image; wherein the image generation model is obtained by training, in an unsupervised manner, a generation network comprising an auxiliary network and a diffusion network on a content image sample set and a style image sample set, and no paired correspondence exists between the content image sample set and the style image sample set.
  10. An electronic device comprising a memory and a processor, wherein the memory is used for storing a program, and the processor, coupled to the memory, is configured to execute the program stored in the memory to implement the steps in the method for generating a target style image according to any one of claims 1 to 8.
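The two-stage training data flow of claim 2 (auxiliary network produces a pseudo-style image, which conditions the diffusion network's reference style image, from which the joint total loss is computed) can be traced with a minimal numpy sketch. The function names and the arithmetic stand-ins for the two networks are illustrative assumptions, not the patent's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two sub-networks (names are assumptions):
# each is reduced to a deterministic array operation so that the data flow
# of claim 2 can be traced end to end without a trained model.
def auxiliary_network(source_content, style):
    """Generate a pseudo-style image from a content/style image pair."""
    return 0.5 * source_content + 0.5 * style

def diffusion_network(reference_content, pseudo_style):
    """Generate a reference style image conditioned on both inputs."""
    return 0.7 * pseudo_style + 0.3 * reference_content

def training_step(source_content, reference_content, style):
    # Step 1: auxiliary network -> pseudo-style image
    pseudo = auxiliary_network(source_content, style)
    # Step 2: diffusion network -> reference style image
    reference_style = diffusion_network(reference_content, pseudo)
    # Step 3: joint total loss computed from the reference style image
    # (here a placeholder mean-squared error against the style sample)
    joint_total_loss = float(np.mean((reference_style - style) ** 2))
    return pseudo, reference_style, joint_total_loss

src = rng.random((32, 32))
ref = rng.random((32, 32))
sty = rng.random((32, 32))
pseudo, ref_style, loss = training_step(src, ref, sty)
```

Note that no pairing between the content and style samples is assumed at any point, matching the unsupervised setting of claim 1.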
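Claim 3 enumerates four output losses combined into a joint total loss. One plausible reading is a weighted sum; the weights below are an assumption, as the patent text quoted here does not specify how the terms are balanced:

```python
def joint_total_loss(diffusion_loss, auxiliary_loss,
                     fusion_loss, consistency_loss,
                     weights=(1.0, 1.0, 1.0, 1.0)):
    """Combine the four loss terms of claim 3 into one scalar.

    Order of terms: diffusion loss, auxiliary network (adversarial) loss,
    feature fusion loss, character content consistency constraint loss.
    The default unit weights are illustrative, not taken from the patent.
    """
    terms = (diffusion_loss, auxiliary_loss, fusion_loss, consistency_loss)
    return sum(w * t for w, t in zip(weights, terms))
```

Training then iteratively minimizes this scalar over the generation network's parameters, as stated in claim 3.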
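Claim 5's depth feature deformation injection module computes cross-attention between mapped encoder and decoder features to obtain offset parameters, then deforms the encoder's intermediate features with them. The numpy sketch below is a simplified single-head version with bilinear resampling standing in for the learned deformable convolution; all dimensions and the offset clipping are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8  # illustrative feature dimensions

def cross_attention_offsets(enc_feat, dec_feat, max_offset=2.0):
    """Derive a (2, H, W) offset field from cross-attention between
    encoder features (queries) and decoder features (keys)."""
    q = enc_feat.reshape(C, -1).T                 # (HW, C)
    k = dec_feat.reshape(C, -1).T                 # (HW, C)
    attn = q @ k.T / np.sqrt(C)                   # (HW, HW) attention logits
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)       # softmax over key positions
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pos = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)  # (HW, 2)
    attended = attn @ pos                         # attention-weighted position
    offsets = np.clip(attended - pos, -max_offset, max_offset)
    return offsets.T.reshape(2, H, W)

def deform(feat, offsets):
    """Bilinearly resample `feat` at positions shifted by `offsets`
    (a stand-in for the learned deformable convolution of claim 5)."""
    out = np.zeros_like(feat)
    for i in range(H):
        for j in range(W):
            y = np.clip(i + offsets[0, i, j], 0, H - 1)
            x = np.clip(j + offsets[1, i, j], 0, W - 1)
            y0, x0 = int(np.floor(y)), int(np.floor(x))
            y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
            wy, wx = y - y0, x - x0
            out[:, i, j] = ((1 - wy) * (1 - wx) * feat[:, y0, x0]
                            + (1 - wy) * wx * feat[:, y0, x1]
                            + wy * (1 - wx) * feat[:, y1, x0]
                            + wy * wx * feat[:, y1, x1])
    return out

enc = rng.random((C, H, W))   # intermediate features of the content encoder
dec = rng.random((C, H, W))   # intermediate features of the first decoder
off = cross_attention_offsets(enc, dec)
deformed = deform(enc, off)   # first deformed content structural features
```

Claim 7 applies the same mechanism inside the diffusion network's U-Net noise predictor, with the content feature extractor and second decoder supplying the two feature streams.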
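Claims 6 and 7 describe a conditional diffusion network whose training signal includes the diffusion loss of claim 3, i.e. the error in predicting the noise added to the reference style image. A minimal DDPM-style sketch follows; the linear noise schedule and the arithmetic stand-in for the U-Net noise predictor are assumptions, not the patent's network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta schedule (an assumption; the patent does not specify one).
T = 100
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def add_noise(x0, t, eps):
    """Standard DDPM forward process:
    x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

def noise_predictor(x_t, structure_feat, style_feat):
    """Hypothetical stand-in for the U-Net noise predictor of claim 6,
    conditioned on injected structural and style features."""
    return x_t - 0.5 * structure_feat - 0.5 * style_feat

def diffusion_loss(x0, structure_feat, style_feat, t):
    """Mean-squared error between true and predicted noise at step t."""
    eps = rng.standard_normal(x0.shape)
    x_t = add_noise(x0, t, eps)
    eps_hat = noise_predictor(x_t, structure_feat, style_feat)
    return float(np.mean((eps_hat - eps) ** 2))

x0 = rng.standard_normal((16, 16))       # reference style image sample
struct = rng.standard_normal((16, 16))   # structural features (claim 6)
style = rng.standard_normal((16, 16))    # style features (claim 6)
loss = diffusion_loss(x0, struct, style, t=50)
```

At inference, per claim 8, the trained diffusion network (or the auxiliary network) can be deployed on its own to generate the target style image.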

Description

Method, system and equipment for generating target style image

Technical Field

The invention relates to the technical field of computer vision, and in particular to a method, a system and equipment for generating a target style image.

Background

Characters serve as a core carrier of information recording and cultural expression, and the demand for stylized characters is growing in fields such as visual art design and digital media creation. Style character generation technology based on generative models is widely applied in advertisement design, game interfaces, film and television special effects, digitization of cultural heritage, and other fields. However, in practical applications, factors such as the difficulty of collecting data in a specific style, the high cost of labeling paired data, and data privacy and compliance restrictions make it difficult to obtain large-scale, high-quality paired content and style data. Traditional supervised learning methods are therefore difficult to apply directly, which has become a key bottleneck restricting the deployment of style character generation technology. In the unsupervised scenario where paired training data is lacking, traditional supervised methods depend heavily on the explicit guidance of paired data to learn features and patterns, and cannot accurately learn the mapping between input and output in the conventional manner, resulting in a mismatch of the training mechanism. To address these issues, generative adversarial networks based on cycle-consistency constraints maintain content consistency by constructing a bidirectional mapping, enabling image translation without paired data. Methods based on deformable convolution explicitly model the spatial correspondence between content and style images, and realize the migration of style details through local deformation.
These methods provide a viable foundation for unsupervised style character generation. However, the prior art still has significant limitations. In the task of generating fine character structures, cycle-consistency-based methods easily produce blurring or structural artifacts due to the uncertainty of the intermediate mapping, which degrades the visual quality and readability of the generated characters. The performance of deformable-convolution-based methods depends heavily on the assumption of a strong spatial correspondence between the content and style images; when handling font styles with exaggerated strokes and a large degree of artistic deformation, their deformation modeling capability is insufficient, so that the style migration is incomplete or distorted.

Disclosure of Invention

In view of the foregoing, it is necessary to provide a method, a system and a device for generating a target style image, so as to solve the technical problems in the prior art of structural artifacts in the generated results and insufficient migration capability for artistic fonts with large stroke deformation.
In order to solve the above technical problem, in a first aspect, the present invention provides a method for generating a target style image, including: acquiring a source content image to be stylized provided by a user and a character style image specified by the user; and inputting the source content image and the character style image into a deployed image generation model to generate a target style image consistent with the style of the character style image; wherein the image generation model is obtained by training, in an unsupervised manner, a generation network comprising an auxiliary network and a diffusion network on a content image sample set and a style image sample set, and no paired correspondence exists between the content image sample set and the style image sample set.

In one possible implementation manner, before the acquiring of the source content image to be stylized provided by the user, the method includes: providing the content image sample set, which comprises source content image samples and reference content image samples, and the style image sample set, which comprises character style images; inputting a source content image sample and a character style image sample into the auxiliary network to generate a pseudo-style image; inputting a reference content image sample and the pseudo-style image into the diffusion network to generate a reference style image; and calculating a joint total loss according to the reference style image, and optimizing the generation network according to the joint total loss to obtain the image generation model.

In one possible implementation manner, the generation network further includes a feature fusion module and a constraint module, where the feature fusion module is integrated in the auxiliary network