
CN-117237606-B - Interest point image generation method, interest point image generation device, electronic equipment and storage medium

CN117237606B

Abstract

Embodiments of the disclosure provide an interest point image generation method and device, an electronic device, and a storage medium. The method comprises: acquiring image description information of a target interest point, wherein the image description information comprises an image description text; and inputting the image description information into a pre-trained static image generation joint model and executing the model to generate a static image of the target interest point. The static image generation joint model comprises a first large-scale language model and a text-to-image model: the first large-scale language model generates a text vector according to the image description information, and the text-to-image model generates the static image of the target interest point according to the text vector. This technical scheme can quickly generate high-quality interest point images.
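The two-stage pipeline in the abstract (LLM encodes the description into a text vector, and a text-to-image model conditions on that vector) can be sketched as below. This is a minimal illustration with toy stand-in functions; the function names and the toy encodings are assumptions, not the patent's implementation.

```python
# Hypothetical sketch of the static image generation joint model:
# stage 1 encodes the image description text into a text vector,
# stage 2 generates an image conditioned on that vector.

def llm_encode(description_text: str) -> list[float]:
    """Stand-in for the first large-scale language model: map the
    image description text to a fixed-length text vector."""
    # Toy encoding: character codes normalised into [0, 1], truncated to 8 dims.
    return [ord(c) / 255.0 for c in description_text][:8]

def text_to_image(text_vector: list[float]) -> list[list[float]]:
    """Stand-in for the text-to-image (diffusion) model: produce a
    tiny 4x4 'image' conditioned on the text vector."""
    size = 4
    value = sum(text_vector) % 1.0
    return [[value for _ in range(size)] for _ in range(size)]

def generate_static_image(description_text: str) -> list[list[float]]:
    """Run both stages of the joint model end to end."""
    text_vector = llm_encode(description_text)
    return text_to_image(text_vector)

image = generate_static_image("storefront of a coffee shop")
print(len(image), len(image[0]))
```

In the patent, the text vector serves as the image generation condition of the text-to-image model, so the two stages are trained jointly rather than as independent components.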

Inventors

  • GUO NING
  • SUN QI
  • CAI WENJING
  • WANG HAO
  • LI XIN

Assignees

  • 北京高德云信科技有限公司

Dates

Publication Date
2026-05-05
Application Date
2023-09-15

Claims (8)

  1. An interest point image generation method, comprising: acquiring image description information of a target interest point, wherein the image description information comprises an image description text and a low-quality image describing the target interest point; inputting the image description information into a pre-trained static image generation joint model, and executing the pre-trained static image generation joint model to generate a static image of the target interest point, wherein the static image generation joint model comprises a first large-scale language model and a text-to-image model, the first large-scale language model is used for generating a text vector according to the image description information, the text-to-image model is used for generating the static image of the target interest point according to the text vector, and the first large-scale language model is provided with a plurality of prompt templates; wherein executing the pre-trained static image generation joint model to generate the static image of the target interest point comprises: based on the prompt templates, executing the pre-trained static image generation joint model multiple times to generate a plurality of different static images of the target interest point; determining a prompt corresponding to each static image, wherein the prompt is generated by the first large-scale language model according to the image description information and the prompt template; inputting the static image into the first large-scale language model to obtain a static image text output by the first large-scale language model; calculating a first similarity between the static image text of the static image and the prompt corresponding to the static image; inputting the low-quality image into the first large-scale language model to obtain a low-quality image text output by the first large-scale language model; calculating a second similarity between the static image text of the static image and the low-quality image text; determining a quality score of the static image according to the first similarity and the second similarity; and selecting at least one target static image from the plurality of different static images according to the quality scores of the plurality of different static images.
  2. The method of claim 1, wherein the inputting the image description information into a pre-trained static image generation joint model and executing the pre-trained static image generation joint model to generate a static image of the target interest point comprises: inputting the image description text and the low-quality image into the first large-scale language model, and executing the first large-scale language model to obtain a text vector output by the first large-scale language model; and inputting the text vector and the low-quality image into the text-to-image model, and executing the text-to-image model to obtain the static image of the target interest point output by the text-to-image model.
  3. The method according to claim 1 or 2, wherein the method further comprises: acquiring basic information of the target interest point; and, for any static image, generating a dynamic image of the target interest point according to the basic information of the target interest point and the static image by using a pre-trained dynamic image generation joint model, wherein the dynamic image generation joint model comprises a second large-scale language model, an image segmentation model, and a dynamic image editing model, the image segmentation model is used for segmenting the static image of the target interest point into a plurality of masks and generating mask information of each mask, the second large-scale language model is used for generating an image editing instruction according to the mask information of each mask and the basic information of the target interest point, and the dynamic image editing model is used for editing the static image according to the image editing instruction to obtain the dynamic image of the target interest point.
  4. A training method for a static image generation joint model, comprising: acquiring a first training data set, wherein the first training data set comprises a plurality of positive samples and/or a plurality of negative samples, and the positive samples and the negative samples comprise sample images and sample image texts of sample interest points; and training an initial static image generation joint model by using the first training data set to obtain a trained static image generation joint model, wherein the static image generation joint model comprises a first large-scale language model and a text-to-image model; the static image generation joint model has a loss function L = αL1 + (1-α)L2, wherein L1 is the difference between the noise of each time step predicted in the denoising process of the text-to-image model and the Gaussian noise added in the diffusion process, L2 is the difference between the predicted image text output by the first large-scale language model and the sample image text, α is a preset parameter value, and the text vector generated by the first large-scale language model is the image generation condition of the text-to-image model.
  5. The method of claim 4, wherein the first training data set comprises an original positive sample, an original negative sample, an expanded positive sample, and/or an expanded negative sample, and the acquiring the first training data set comprises: acquiring an original positive sample and/or an original negative sample, wherein the original positive sample and the original negative sample comprise an original sample image and an original sample image text of a sample interest point; and performing data enhancement on the original positive sample by using at least one of the following steps to obtain an expanded positive sample and/or an expanded negative sample: in response to the original sample image being missing, taking interest point images of other interest points under the same brand as the sample interest point as the expanded sample image of an expanded positive sample, and taking the original sample image text of the sample interest point as the expanded sample image text of the expanded positive sample; in response to the original positive sample originating from comment data, splitting the comment data to obtain a plurality of expanded positive samples; for two original positive samples, constructing the original sample image of one original positive sample and the original sample image text of the other original positive sample into an expanded negative sample; and, in response to the original positive sample originating from product data, constructing the product image of one product and the product description text of another product into an expanded negative sample.
  6. An interest point image generation apparatus, comprising: an information acquisition module configured to acquire image description information of a target interest point, the image description information comprising an image description text and a low-quality image describing the target interest point; and a static image generation module configured to generate a static image of the target interest point according to the image description information by using a pre-trained static image generation joint model, wherein the static image generation joint model comprises a first large-scale language model and a text-to-image model, the first large-scale language model is used for generating a text vector according to the image description information, the text-to-image model is used for generating the static image of the target interest point according to the text vector, and the first large-scale language model is provided with a plurality of prompt templates; wherein executing the pre-trained static image generation joint model to generate the static image of the target interest point comprises: based on the prompt templates, executing the pre-trained static image generation joint model multiple times to generate a plurality of different static images of the target interest point; determining a prompt corresponding to each static image, wherein the prompt is generated by the first large-scale language model according to the image description information and the prompt template; inputting the static image into the first large-scale language model to obtain a static image text output by the first large-scale language model; calculating a first similarity between the static image text of the static image and the prompt corresponding to the static image; inputting the low-quality image into the first large-scale language model to obtain a low-quality image text output by the first large-scale language model; calculating a second similarity between the static image text of the static image and the low-quality image text; determining a quality score of the static image according to the first similarity and the second similarity; and selecting at least one target static image from the plurality of different static images according to the quality scores of the plurality of different static images.
  7. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer instructions, and the one or more computer instructions are executed by the processor to implement the method of any one of claims 1 to 5.
  8. A computer-readable storage medium having computer instructions stored thereon, wherein the computer instructions, when executed by a processor, implement the method of any one of claims 1 to 5.
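The image-selection step in claims 1 and 6 can be illustrated as follows: each generated image is captioned by the LLM, the caption is compared with the image's prompt (first similarity) and with the caption of the low-quality reference image (second similarity), and the two similarities are combined into a quality score used to rank candidates. The Jaccard token similarity and the equal weighting below are assumptions for illustration; the claims do not fix a specific similarity measure or combination formula.

```python
# Illustrative sketch of the quality-scoring and selection step.

def token_similarity(a: str, b: str) -> float:
    """Jaccard similarity over whitespace tokens (a toy stand-in
    for a learned text-similarity measure)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def quality_score(image_caption: str, prompt: str,
                  low_quality_caption: str, weight: float = 0.5) -> float:
    """Combine the first similarity (caption vs. prompt) and the second
    similarity (caption vs. low-quality image caption) into one score."""
    first = token_similarity(image_caption, prompt)
    second = token_similarity(image_caption, low_quality_caption)
    return weight * first + (1 - weight) * second

# Captions the LLM might produce for several generated candidates
# (hypothetical data).
candidates = {
    "img_a": "a bright coffee shop storefront with outdoor seats",
    "img_b": "a blurry street at night",
}
prompt = "a bright coffee shop storefront"
low_quality_caption = "coffee shop storefront photo"

best = max(candidates,
           key=lambda k: quality_score(candidates[k], prompt,
                                       low_quality_caption))
print(best)  # img_a
```

The second similarity ties the generated image back to the real (if low-quality) reference photo of the interest point, so the selected image is both faithful to the prompt and grounded in the actual place.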

Description

Interest point image generation method, interest point image generation device, electronic equipment and storage medium

Technical Field

The disclosure relates to the technical field of image processing, and in particular to a method and a device for generating an interest point image, an electronic device, and a storage medium.

Background

With the development of science and technology, quality of life keeps improving and travel has become convenient, and electronic maps are widely used. To help users browse and query useful information, existing electronic maps provide point of interest (Point of Interest, POI) data. The richness and attractiveness of POI data correlate strongly with user experience: high-quality content presented in a lively form encourages users to stay in the map client, browse deeper content, and convert. POI images are an important channel for conveying information to users, but in existing electronic maps the fill rate of POI images is low: popular POIs usually have many images, while images for less popular, long-tail POIs are severely lacking. In addition, some POI images are shot by users and fall short of display standards in shooting angle, resolution, and so on, so their quality is low. How to generate high-quality POI images is a technical problem to be solved.

Disclosure of Invention

To solve the problems in the related art, embodiments of the present disclosure provide a method, an apparatus, an electronic device, and a storage medium for generating an interest point image.
In a first aspect, an embodiment of the present disclosure provides an interest point image generation method. Specifically, the method includes: acquiring image description information of a target interest point, wherein the image description information comprises an image description text; and inputting the image description information into a pre-trained static image generation joint model, and executing the pre-trained static image generation joint model to generate a static image of the target interest point, wherein the static image generation joint model comprises a first large-scale language model and a text-to-image model, the first large-scale language model is used for generating a text vector according to the image description information, and the text-to-image model is used for generating the static image of the target interest point according to the text vector.

In a second aspect, an embodiment of the present disclosure provides a training method for a static image generation joint model, including: acquiring a first training data set, wherein the first training data set comprises a plurality of positive samples and/or a plurality of negative samples, and the positive samples and the negative samples comprise sample images and sample image texts of sample interest points; and training an initial static image generation joint model by using the first training data set to obtain a trained static image generation joint model, wherein the static image generation joint model comprises a first large-scale language model and a text-to-image model. The static image generation joint model has a loss function L = αL1 + (1-α)L2, wherein L1 is the difference between the noise of each time step predicted in the denoising process of the text-to-image model and the Gaussian noise added in the diffusion process, L2 is the difference between the predicted image text output by the first large-scale language model and the sample image text, α is a preset parameter value, and the text vector generated by the first large-scale language model is the image generation condition of the text-to-image model.

In a third aspect, an embodiment of the present disclosure provides a training method for a dynamic image generation joint model, including: acquiring sample images and sample basic information of sample interest points; generating, by using a preset image segmentation model, mask information of each mask corresponding to the sample image according to the sample image; acquiring operation information of each mask corresponding to the sample image; and fine-tuning the second large-scale language model according to a second training data set to obtain a trained second large-scale language model, wherein the second training data set comprises sample basic information of a plurality of sample interest points, sample images, and the mask information and operation information of the corresponding masks. The trained dynamic image generation joint model comprises a trained second large-scale language model, a
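The joint loss L = αL1 + (1-α)L2 from the second aspect can be sketched numerically as below. The MSE for the denoising term and the token-mismatch fraction for the caption term are illustrative assumptions; the patent only specifies that L1 measures the noise-prediction error and L2 measures the caption error.

```python
# A minimal sketch of the joint training loss L = alpha*L1 + (1-alpha)*L2:
# L1 penalises the diffusion model's predicted noise against the Gaussian
# noise added in the forward process; L2 penalises the LLM's predicted
# image text against the sample image text.

def mse(pred: list[float], target: list[float]) -> float:
    """Mean squared error, a common choice for the denoising loss L1."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def caption_error(pred_tokens: list[str], target_tokens: list[str]) -> float:
    """Toy stand-in for the text loss L2: fraction of mismatched tokens."""
    wrong = sum(p != t for p, t in zip(pred_tokens, target_tokens))
    return wrong / max(len(target_tokens), 1)

def joint_loss(pred_noise: list[float], true_noise: list[float],
               pred_caption: str, true_caption: str,
               alpha: float = 0.7) -> float:
    """Weighted combination of the two terms, with alpha a preset value."""
    l1 = mse(pred_noise, true_noise)
    l2 = caption_error(pred_caption.split(), true_caption.split())
    return alpha * l1 + (1 - alpha) * l2

loss = joint_loss(
    pred_noise=[0.1, -0.2, 0.05],
    true_noise=[0.0, -0.25, 0.0],
    pred_caption="coffee shop storefront",
    true_caption="coffee shop entrance",
    alpha=0.7,
)
print(round(loss, 4))
```

Because both terms share the LLM's text vector as the image generation condition, minimising L trains the language model and the text-to-image model jointly rather than in isolation.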