CN-121999081-A - Image generation method, device, equipment and medium based on natural language description
Abstract
The application relates to the technical field of images and provides an image generation method, apparatus, device, and medium based on natural language description. The method includes: acquiring a natural language description and an original image, wherein the natural language description comprises a target object and constraint information of the target object; determining a semantic alignment feature map of the natural language description and the original image; determining multi-scale features of the semantic alignment feature map, the multi-scale features comprising features of the map at different spatial resolutions; determining, based on the multi-scale features and the semantic alignment feature map, position information for generating the target object in the original image; generating, in the original image and based on the position information, the target object conforming to the constraint information, thereby obtaining a composite image; and outputting the position information. The method can mark the position of the target object directly while generating the composite image, without additional manual labeling or complex post-processing algorithms to obtain the target position information.
Inventors
- WANG YUANYUAN
- ZHAO WEI
- TAN XUECHENG
- HAO MANJUN
Assignees
- 北京神州光大科技有限公司
Dates
- Publication Date: 2026-05-08
- Application Date: 2025-12-29
Claims (10)
- 1. An image generation method based on natural language description, the method comprising: acquiring a natural language description and an original image, wherein the natural language description comprises a target object and constraint information of the target object; determining a semantic alignment feature map of the natural language description and the original image; determining multi-scale features of the semantic alignment feature map, the multi-scale features comprising features of the semantic alignment feature map at different spatial resolutions; determining, based on the multi-scale features and the semantic alignment feature map, position information for generating the target object in the original image; and generating, in the original image and based on the position information, the target object conforming to the constraint information, obtaining a composite image, and outputting the position information.
- 2. The image generation method according to claim 1, wherein determining the semantic alignment feature map of the natural language description and the original image comprises: extracting text features from the natural language description and image features from the original image; and performing semantic alignment on the text features and the image features to obtain the semantic alignment feature map.
- 3. The image generation method according to claim 2, wherein the text features comprise a plurality of text feature blocks and the image features comprise a plurality of image feature blocks, and wherein performing semantic alignment on the text features and the image features to obtain the semantic alignment feature map comprises: for each text feature block, determining a first semantic association degree between the text feature block and each image feature block, the first semantic association degree characterizing the degree of matching between the text feature block and each image feature block; for each image feature block, determining a second semantic association degree between the image feature block and each text feature block, the second semantic association degree characterizing the degree of matching between the image feature block and each text feature block; and fusing the first semantic association degrees and the second semantic association degrees to obtain the semantic alignment feature map.
- 4. The image generation method of claim 3, wherein a bidirectionally guided cross-modal attention network is deployed in the electronic device, the network comprising a text-to-image attention layer and an image-to-text attention layer; the text-to-image attention layer is used to determine the first semantic association degree, and the image-to-text attention layer is used to determine the second semantic association degree.
- 5. The image generation method according to any one of claims 1-4, wherein determining the multi-scale features of the semantic alignment feature map comprises: downsampling the semantic alignment feature map to obtain the multi-scale features.
- 6. The image generation method according to any one of claims 1-4, wherein determining, based on the multi-scale features and the semantic alignment feature map, the position information for generating the target object in the original image comprises: performing target query embedding and cross-attention computation on the multi-scale features and the semantic alignment feature map to obtain a spatial semantic map, the spatial semantic map containing the position information for generating the target object in the original image.
- 7. An image generation apparatus based on natural language description, the apparatus comprising: an acquisition module configured to acquire a natural language description and an original image, wherein the natural language description comprises a target object and constraint information of the target object; a determining module configured to determine a semantic alignment feature map of the natural language description and the original image, further configured to determine multi-scale features of the semantic alignment feature map, the multi-scale features comprising features of the semantic alignment feature map at different spatial resolutions, and further configured to determine, based on the multi-scale features and the semantic alignment feature map, position information for generating the target object in the original image; and a generation module configured to generate, in the original image and based on the position information, the target object conforming to the constraint information, obtain a composite image, and output the position information.
- 8. An electronic device comprising a processor and a memory storing a computer program, wherein the processor implements the natural language description based image generation method of any one of claims 1 to 6 when executing the computer program.
- 9. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the natural language description based image generation method of any one of claims 1 to 6.
- 10. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the natural language description based image generation method of any one of claims 1 to 6.
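Claims 3 and 4 describe the semantic alignment step as two directions of cross-modal attention whose outputs are fused. The NumPy sketch below illustrates the general shape of such a computation; it is not the patented implementation, and the fusion-by-concatenation choice, the block shapes, and all function names are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    # Scaled dot-product attention: the score matrix plays the role of the
    # "semantic association degree" between each query block and each context block.
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)  # per-pair matching degree; rows sum to 1
    return weights @ context

def bidirectional_alignment(text_blocks, image_blocks):
    t2i = cross_attention(text_blocks, image_blocks)   # text-to-image attention layer
    i2t = cross_attention(image_blocks, text_blocks)   # image-to-text attention layer
    # Fuse both directions into one semantic alignment feature map
    # (concatenation along the block axis is an assumed fusion choice).
    return np.concatenate([t2i, i2t], axis=0)

rng = np.random.default_rng(0)
text_blocks = rng.standard_normal((5, 64))    # 5 text feature blocks, 64-dim
image_blocks = rng.standard_normal((49, 64))  # 7x7 grid of image feature blocks
aligned = bidirectional_alignment(text_blocks, image_blocks)
print(aligned.shape)  # (54, 64)
```

In a trained network each attention layer would carry learned query/key/value projections; the sketch omits them so the two-directions-plus-fusion structure of the claims stays visible.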
Description
Image generation method, device, equipment and medium based on natural language description

Technical Field

The present application relates to the field of image technologies, and in particular to a method, an apparatus, a device, and a medium for generating an image based on a natural language description.

Background

In target detection, data augmentation is an important means of improving model performance, and current mainstream techniques fall into two categories. The first is traditional pixel-level augmentation, in which new samples are generated through simple pixel-level transformations such as flipping, cropping, and scaling; no new semantic information is introduced, the new samples are limited to the target classes and quantities of the original dataset, and the augmentation effect is therefore limited. The second is synthesis-based augmentation using generative models, such as producing synthetic images containing targets with Generative Adversarial Networks (GANs) or diffusion models; some of these schemes can specify the target class, but they have difficulty accurately outputting bounding-box coordinates, so additional manual labeling or complex post-processing is required, which is costly and inefficient.

Disclosure of Invention

The present application provides an image generation method based on natural language description, which addresses the technical problems in the related art that bounding-box coordinates are difficult to output accurately and that additional manual labeling or complex post-processing is required, at high cost and low efficiency.
In a first aspect, an embodiment of the present application provides an image generation method based on natural language description, the method including: acquiring a natural language description and an original image, wherein the natural language description comprises a target object and constraint information of the target object; determining a semantic alignment feature map of the natural language description and the original image; determining multi-scale features of the semantic alignment feature map, the multi-scale features comprising features of the semantic alignment feature map at different spatial resolutions; determining, based on the multi-scale features and the semantic alignment feature map, position information for generating the target object in the original image; generating, in the original image and based on the position information, the target object conforming to the constraint information, obtaining a composite image; and outputting the position information. In one embodiment, determining the semantic alignment feature map of the natural language description and the original image may specifically include: extracting text features from the natural language description and image features from the original image, and performing semantic alignment on the text features and the image features to obtain the semantic alignment feature map. In one embodiment, the text features may include a plurality of text feature blocks and the image features may include a plurality of image feature blocks, and performing semantic alignment on the text features and the image features to obtain the semantic alignment feature map may specifically include: for each text feature block, determining a first semantic association degree between the text feature block and each image feature block, the first semantic association degree characterizing the degree of matching between the text feature block and each image feature block.
For each image feature block, a second semantic association degree between the image feature block and each text feature block is determined, the second semantic association degree characterizing the degree of matching between the image feature block and each text feature block. The first semantic association degrees and the second semantic association degrees are then fused to obtain the semantic alignment feature map. In one embodiment, a bidirectionally guided cross-modal attention network is deployed in the electronic device, the network including a text-to-image attention layer and an image-to-text attention layer; the text-to-image attention layer is used to determine the first semantic association degree, and the image-to-text attention layer is used to determine the second semantic association degree. In one embodiment, determining the multi-scale features of the semantic alignment feature map may specifically include: downsampling the semantic alignment feature map to obtain the multi-scale features. In one embodiment, determining, based on the multi-scale features and the semantic alignment feature map, the position information for generating the target object in the original image may specifically include: performing target query embedding and cross-attention computation on the multi-scale features and the semantic alignment feature map to obtain a spatial semantic map, the spatial semantic map containing the position information for generating the target object in the original image.
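The downsampling step (claim 5) and the query-based localization step (claim 6) admit a similar NumPy sketch. Average pooling as the downsampling operator and a DETR-style set of learned queries with a sigmoid box head are assumptions of this sketch; the patent specifies only target query embedding and cross-attention computation, and every name below is illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def downsample(fmap):
    # 2x2 average pooling halves the spatial resolution (assumed pooling choice).
    h, w, c = fmap.shape
    return fmap[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def multi_scale_features(aligned_fmap, levels=3):
    # Features of the semantic alignment feature map at different spatial resolutions.
    feats = [aligned_fmap]
    for _ in range(levels - 1):
        feats.append(downsample(feats[-1]))
    return feats

def localize_targets(feats, num_queries=4):
    # Target query embedding + cross attention over all scales (DETR-style sketch).
    c = feats[0].shape[-1]
    rng = np.random.default_rng(1)
    queries = rng.standard_normal((num_queries, c))             # learned target queries
    tokens = np.concatenate([f.reshape(-1, c) for f in feats])  # flatten every scale
    weights = softmax(queries @ tokens.T / np.sqrt(c), axis=-1)
    decoded = weights @ tokens                                  # spatial semantic map rows
    # A learned linear head would regress (x, y, w, h) per query; random here.
    box_head = rng.standard_normal((c, 4))
    return 1 / (1 + np.exp(-decoded @ box_head))                # normalized box coords

fmap = np.random.default_rng(2).standard_normal((16, 16, 32))
scales = multi_scale_features(fmap)
boxes = localize_targets(scales)
print([f.shape for f in scales], boxes.shape)
```

Because the boxes fall out of the same forward pass that drives generation, the position information is available alongside the composite image with no separate annotation step, which is the efficiency claim the application makes against GAN- and diffusion-based augmentation.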