
CN-121982136-A - Image generation method, device, equipment, storage medium and product

CN 121982136 A

Abstract

The disclosure relates to an image generation method and apparatus, an electronic device, and a storage medium, in the field of computer technology. The method comprises: extracting content from an input reference image through a content extraction module to obtain an image feature vector and a text feature vector; mapping the image feature vector and the text feature vector into reference image features through a feature adaptation module; injecting the reference image features into an image generation model; and performing image generation processing on an image generation instruction through the image generation model based on the reference image features to obtain a target generated image. Because the target image is generated based on the content features of the reference image, the style of the generated image can be adjusted, meeting the personalized image generation requirements of different application scenarios. Converting the image feature vector and the text feature vector into reference image features adapted to the corresponding image generation model through the feature adaptation module improves compatibility with different types of image generation models.
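As a rough illustration only (the patent discloses no implementation, so every name below is hypothetical), the three stages described in the abstract can be sketched as:

```python
# Hypothetical sketch of the pipeline in the abstract; all names are
# illustrative stand-ins, not from the patent.

def generate_image(reference_image, instruction,
                   content_extractor, feature_adapter, generator):
    # Stage 1: content extraction yields two vectors from one reference image.
    image_vec, text_vec = content_extractor(reference_image)
    # Stage 2: adapt both vectors into reference image features matched
    # to the target image generation model.
    ref_features = feature_adapter(image_vec, text_vec)
    # Stage 3: inject the features, then generate from the instruction.
    generator.inject(ref_features)
    return generator.generate(instruction)
```

The key design point the abstract emphasizes is that only the adapter (stage 2) is model-specific; the content extractor is shared across generation models.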

Inventors

  • LIANG YUNHAO

Assignees

  • Beijing Xiaomi Mobile Software Co., Ltd. (北京小米移动软件有限公司)

Dates

Publication Date
2026-05-05
Application Date
2024-10-29

Claims (16)

  1. An image generation method, comprising: extracting content from an input reference image through a content extraction module to obtain an image feature vector and a text feature vector; mapping the image feature vector and the text feature vector into reference image features adapted to an image generation model through a feature adaptation module; injecting the reference image features into the image generation model; inputting an image generation instruction into the image generation model; and performing image generation processing on the image generation instruction through the image generation model based on the reference image features to obtain a target generated image.
  2. The method according to claim 1, wherein extracting content from the input reference image through the content extraction module to obtain the image feature vector and the text feature vector comprises: performing image analysis on the reference image through a segmentation model to obtain image local marking information, the image local marking information marking at least one local region in the reference image; performing text feature extraction on the reference image through a large language model to obtain a global text description; performing text feature extraction on the reference image through the large language model based on the image local marking information to obtain at least one local text description; performing text encoding on the global text description and the at least one local text description through a text encoder to obtain the text feature vector; determining at least one local image from the reference image based on the image local marking information; and performing image encoding on the reference image and the at least one local image through an image encoder to obtain the image feature vector.
  3. The method according to claim 2, wherein the large language model performs text feature extraction on the reference image according to at least one set of preset target prompts.
  4. The method according to claim 2, wherein performing text feature extraction on the reference image through the large language model based on the image local marking information to obtain the at least one local text description further comprises: sorting the at least one local text description according to the area of the local region corresponding to each local text description.
  5. The method according to claim 2, wherein performing text encoding on the global text description and the at least one local text description through the text encoder to obtain the text feature vector comprises: performing text encoding on the global text description through the text encoder to obtain a global text feature vector; performing text encoding on the at least one local text description through the text encoder to obtain at least one local text feature vector, each local text feature vector corresponding to a respective local text description; and splicing the global text feature vector with the at least one local text feature vector to obtain the text feature vector.
  6. The method according to claim 5, wherein splicing the global text feature vector with the at least one local text feature vector to obtain the text feature vector comprises: splicing at least two global text feature vectors with the at least one local text feature vector to obtain the text feature vector.
  7. The method according to claim 2, wherein performing image encoding on the reference image and the at least one local image through the image encoder to obtain the image feature vector comprises: performing image encoding on the reference image and the at least one local image through the image encoder to obtain a first image feature vector; injecting the text feature vector into a feature fusion model; and performing feature fusion on the first image feature vector and the text feature vector through the feature fusion model to obtain the image feature vector.
  8. The method according to claim 7, wherein injecting the text feature vector into the feature fusion model comprises: performing text feature fusion on the text feature vector through at least two attention layers to obtain a first text feature vector; and after resizing the first text feature vector through at least one fully connected layer, injecting the first text feature vector into the feature fusion model.
  9. The method according to claim 7, wherein performing feature fusion on the first image feature vector and the text feature vector through the feature fusion model to obtain the image feature vector comprises: performing feature fusion on the first image feature vector and the text feature vector through the feature fusion model to obtain a second image feature vector; downsampling the text feature vector to obtain a second text feature vector; and performing feature fusion on the second image feature vector and the second text feature vector through at least two cross-attention layers to obtain the image feature vector.
  10. The method according to claim 1, wherein mapping the image feature vector and the text feature vector into reference image features adapted to the image generation model through the feature adaptation module comprises: performing scale adjustment on the image feature vector through a first feature mapping module according to the feature scale of the image generation model to obtain a first image feature; and injecting the first image feature into the image generation model.
  11. The method according to claim 1, wherein mapping the image feature vector and the text feature vector into reference image features adapted to the image generation model through the feature adaptation module comprises: performing scale adjustment on the image feature vector through a second feature mapping module according to the feature scale of the image generation model to obtain a second image feature; performing scale adjustment on the second image feature through a third feature mapping module according to the feature scale of each intermediate layer in the image generation model to obtain at least one third image feature, each third image feature corresponding to a respective intermediate layer; and injecting each third image feature into the corresponding intermediate layer in the image generation model.
  12. The method according to claim 11, wherein mapping the image feature vector and the text feature vector into reference image features adapted to the image generation model through the feature adaptation module further comprises: splicing the at least one third image feature with the text feature vector to obtain a text feature; and injecting the text feature into the image generation model.
  13. An image generation apparatus, comprising: a content extraction unit, configured to extract content from an input reference image through a content extraction module to obtain an image feature vector and a text feature vector; a feature adaptation unit, configured to map the image feature vector and the text feature vector into reference image features adapted to an image generation model through a feature adaptation module; a reference image feature injection unit, configured to inject the reference image features into the image generation model; an image generation instruction input unit, configured to input an image generation instruction into the image generation model; and an image generation unit, configured to perform image generation processing on the image generation instruction through the image generation model based on the reference image features to obtain a target generated image.
  14. An electronic device, comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to implement the image generation method of any one of claims 1 to 12.
  15. A non-transitory computer-readable storage medium having stored thereon instructions which, when executed by a processor of a mobile terminal, cause the mobile terminal to perform the image generation method of any one of claims 1 to 12.
  16. A computer program product comprising a computer program which, when executed by a processor, implements the image generation method of any one of claims 1 to 12.
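As a toy sketch of the ordering and splicing steps in claims 4 and 5 (the actual text encoder is a learned model; plain Python lists stand in for feature vectors, and the sort direction is an assumption, since the claim does not fix it):

```python
def order_local_descriptions(descriptions, areas):
    """Claim 4 (sketch): order local text descriptions by the area of the
    local region each one describes. Largest-first is an assumption."""
    return [d for _, d in sorted(zip(areas, descriptions),
                                 key=lambda pair: pair[0], reverse=True)]

def splice_text_features(global_vec, local_vecs):
    """Claim 5 (sketch): concatenate the global text feature vector with
    the local text feature vectors to form the final text feature vector."""
    out = list(global_vec)
    for v in local_vecs:
        out.extend(v)
    return out
```

Ordering before splicing gives the concatenated vector a stable layout, which is presumably what lets downstream layers consume it consistently.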

Description

Image generation method, device, equipment, storage medium and product

Technical Field

The present disclosure relates to the field of computer technology, and in particular, to an image generation method, an apparatus, an electronic device, a storage medium, and a computer program product.

Background

With the development of artificial general intelligence (AGI) technology, AI products and functions that generate images from text are increasing. These models typically use a pre-trained language decoder to convert text prompts into latent representations that guide the diffusion process for generating or editing images. Most image generation models use a text-guided diffusion architecture and CLIP (Contrastive Language-Image Pre-training) guidance to enhance the fidelity and relevance of the generated images. However, existing image generation models still struggle to meet users' needs for image personalization, often requiring fine-tuning for each new reference image when generating images in different styles. Moreover, a dedicated reference image feature extraction module must be trained separately for each image generation model, a process that consumes substantial computational resources.

Disclosure of Invention

To overcome the problems in the related art, the present disclosure provides an image generation method, an apparatus, an electronic device, a storage medium, and a computer program product.
According to a first aspect of the embodiments of the present disclosure, an image generation method is provided, comprising: extracting content from an input reference image through a content extraction module to obtain an image feature vector and a text feature vector; mapping the image feature vector and the text feature vector into reference image features adapted to an image generation model through a feature adaptation module; injecting the reference image features into the image generation model; inputting an image generation instruction into the image generation model; and performing image generation processing on the image generation instruction through the image generation model based on the reference image features to obtain a target generated image.

In some exemplary embodiments of the present disclosure, extracting content from the reference image through the content extraction module to obtain the image feature vector and the text feature vector comprises: performing image analysis on the reference image through a segmentation model to obtain image local marking information, the image local marking information marking at least one local region in the reference image; performing text feature extraction on the reference image through a large language model to obtain a global text description; performing text feature extraction on the reference image through the large language model based on the image local marking information to obtain at least one local text description; performing text encoding on the global text description and the at least one local text description through a text encoder to obtain the text feature vector; determining at least one local image from the reference image based on the image local marking information; and performing image encoding on the reference image and the at least one local image through an image encoder to obtain the image feature vector.
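The feature adaptation step above resizes the extracted vectors to the scale a given image generation model expects, one adapted feature per intermediate layer in the multi-scale variant. A minimal stand-in, using truncation and zero-padding in place of the patent's learned feature mapping modules (which are not specified):

```python
def scale_to(vec, target_dim):
    """Toy scale adjustment: resize a feature vector to target_dim.
    The patent's feature mapping modules would be learned projections;
    truncation / zero-padding is used here purely for illustration."""
    if len(vec) >= target_dim:
        return vec[:target_dim]
    return list(vec) + [0.0] * (target_dim - len(vec))

def adapt_to_layers(image_vec, layer_dims):
    """Produce one adapted feature per intermediate layer of the
    generation model, given each layer's expected dimension."""
    return [scale_to(image_vec, d) for d in layer_dims]
```

Keeping this mapping as a separate, swappable module is what allows one content extractor to serve image generation models with different feature scales.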
In some exemplary embodiments of the present disclosure, the large language model performs text feature extraction on the reference image according to at least one set of preset target prompts. In some exemplary embodiments of the present disclosure, performing text feature extraction on the reference image through the large language model based on the image local marking information to obtain the at least one local text description further comprises: sorting the at least one local text description according to the area of the local region corresponding to each local text description. In some exemplary embodiments of the present disclosure, performing text encoding on the global text description and the at least one local text description through the text encoder to obtain the text feature vector comprises: performing text encoding on the global text description through the text encoder to obtain a global text feature vector; performing text encoding on the at least one local text description through the text encoder to obtain at least one local text feature vector, each local text feature vector corresponding to a respective local text description; and splicing the global text feature vector with the at least one local text feature vector to obtain the text feature vector. In some exemplary embodiments of the present disclosure, splicing the global text feature vector with the at least one local text feature vector to obtain the text feature vector