CN-119520934-B - Image generation method and device and display equipment

CN119520934B

Abstract

Embodiments of the present application provide an image generation method, an image generation apparatus, and a display device. The method comprises: acquiring multi-modal information input by a user; processing each piece of modality information in the multi-modal information with an encoding network in an image generation model to obtain a feature vector corresponding to each modality; fusing the feature vectors corresponding to the respective modalities with a fusion network in the image generation model to obtain a fusion vector; and processing the fusion vector through a multi-stage network in the image generation model to obtain a target image corresponding to the multi-modal information. The image generation method can thus generate a corresponding image from multi-modal information.

Inventors

  • FU AIGUO
  • YANG SHANSONG

Assignees

  • Hisense Visual Technology Co., Ltd. (海信视像科技股份有限公司)

Dates

Publication Date
2026-05-12
Application Date
2024-11-05

Claims (8)

  1. An image generation method, the method comprising: acquiring multi-modal information input by a user; processing each piece of modality information in the multi-modal information with an encoding network in an image generation model to obtain a feature vector corresponding to each piece of modality information; fusing the feature vectors corresponding to the respective modality information with a fusion network in the image generation model to obtain a fusion vector; determining at least one piece of auxiliary feature information corresponding to the multi-modal information, wherein the auxiliary feature information is feature information extracted from the fusion vector and comprises at least a skeleton feature, a depth-of-field feature, and an edge structure feature; processing the fusion vector and first auxiliary feature information with a first feature sub-network in a multi-stage network in the image generation model to determine a first output vector, wherein the multi-stage network comprises the first feature sub-network, an intermediate feature sub-network, and a second feature sub-network connected in sequence, and the at least one piece of auxiliary feature information comprises the first auxiliary feature information, second auxiliary feature information, and third auxiliary feature information; processing the first output vector and the second auxiliary feature information with the intermediate feature sub-network to determine a second output vector; processing the second output vector and the third auxiliary feature information with the second feature sub-network to obtain a target vector; and decoding the target vector with an image decoding sub-network in the multi-stage network to obtain a target image corresponding to the multi-modal information.
  2. The method according to claim 1, wherein fusing the feature vectors corresponding to the respective modality information with the fusion network in the image generation model to obtain the fusion vector comprises: determining a preset length; and, based on the preset length, fusing the feature vectors corresponding to the respective modality information through the fusion network to obtain a fusion vector of the preset length.
  3. The method according to claim 2, wherein fusing, based on the preset length, the feature vectors corresponding to the respective modality information through the fusion network to obtain the fusion vector of the preset length comprises: performing feature selection on the feature vectors corresponding to the respective modality information through the fusion network to obtain selected feature vectors; mapping the selected feature vectors to a preset feature space to obtain mapped feature vectors; and, based on the preset length, fusing the mapped feature vectors to obtain the fusion vector of the preset length, wherein the fusion comprises at least one of weighted fusion, feature-level fusion, and decision-level fusion.
  4. The method of claim 1, wherein the multi-modal information comprises at least two of audio information, text information, image information, and video information, and wherein processing each piece of modality information with the encoding network in the image generation model to obtain the feature vector corresponding to each piece of modality information comprises at least two of the following: inputting the audio information into an audio encoding sub-network in the encoding network to generate an audio feature vector corresponding to the audio information; inputting the text information into a text encoding sub-network in the encoding network to generate a text feature vector corresponding to the text information; inputting the image information into an image encoding sub-network in the encoding network to generate an image feature vector corresponding to the image information; and inputting the video information into a video encoding sub-network in the encoding network to generate a video feature vector corresponding to the video information.
  5. The method of claim 4, wherein inputting the audio information into the audio encoding sub-network in the encoding network to generate the audio feature vector corresponding to the audio information comprises: preprocessing the audio information to obtain preprocessed audio information; performing feature extraction on the preprocessed audio information to obtain an initial feature vector; and encoding the initial feature vector to obtain the audio feature vector.
  6. The method according to any one of claims 1-5, further comprising: acquiring multi-modal training information; processing the multi-modal training information with an image generation model to be trained to generate a predicted image, wherein the image generation model to be trained comprises an encoding network to be trained, a fusion network to be trained, and a multi-stage network to be trained; acquiring a sample image; and iteratively training the image generation model to be trained, using the predicted image as the initial training output of the model and the sample image as supervision, to obtain the image generation model.
  7. An image generation apparatus based on multi-modal information, comprising: an acquisition module configured to acquire multi-modal information input by a user; and a determination module configured to: process each piece of modality information with an encoding network in an image generation model to obtain a feature vector corresponding to each piece of modality information; fuse the feature vectors corresponding to the respective modality information with a fusion network in the image generation model to obtain a fusion vector; determine at least one piece of auxiliary feature information corresponding to the multi-modal information, wherein the auxiliary feature information is feature information extracted from the fusion vector and comprises at least a skeleton feature, a depth-of-field feature, and an edge structure feature; process the fusion vector and first auxiliary feature information with a first feature sub-network in a multi-stage network in the image generation model to determine a first output vector, wherein the multi-stage network comprises the first feature sub-network, an intermediate feature sub-network, and a second feature sub-network connected in sequence, and the at least one piece of auxiliary feature information comprises the first auxiliary feature information, second auxiliary feature information, and third auxiliary feature information; process the first output vector and the second auxiliary feature information with the intermediate feature sub-network to determine a second output vector; process the second output vector and the third auxiliary feature information with the second feature sub-network to obtain a target vector; and decode the target vector with an image decoding sub-network in the multi-stage network to obtain a target image corresponding to the multi-modal information.
  8. A display device, comprising: a display configured to display an application interface of an image generation application, wherein the application interface comprises a first input area, a second input area, and an image display area; and a controller coupled with the display and configured to, in response to receiving image information entered by a user in the first input area and text information entered in the second input area: process each piece of modality information in the multi-modal information with an encoding network in an image generation model to obtain a feature vector corresponding to each piece of modality information, wherein the multi-modal information comprises the image information and the text information; fuse the feature vectors corresponding to the respective modality information with a fusion network in the image generation model to obtain a fusion vector; determine at least one piece of auxiliary feature information corresponding to the multi-modal information, wherein the auxiliary feature information is feature information extracted from the fusion vector and comprises at least a skeleton feature, a depth-of-field feature, and an edge structure feature; process the fusion vector and first auxiliary feature information with a first feature sub-network in a multi-stage network in the image generation model to determine a first output vector, wherein the multi-stage network comprises the first feature sub-network, an intermediate feature sub-network, and a second feature sub-network connected in sequence, and the at least one piece of auxiliary feature information comprises the first auxiliary feature information, second auxiliary feature information, and third auxiliary feature information; process the first output vector and the second auxiliary feature information with the intermediate feature sub-network to determine a second output vector; process the second output vector and the third auxiliary feature information with the second feature sub-network to obtain a target vector; and decode the target vector with an image decoding sub-network in the multi-stage network to obtain a target image corresponding to the multi-modal information.
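The fusion step recited in claim 3 (feature selection, mapping into a shared feature space, then fusion to a preset length) can be sketched as follows. The patent specifies no concrete operators, so the top-k magnitude selection, random linear projections, and norm-based softmax weighting below are all illustrative assumptions, not the claimed implementation.

```python
# Hedged sketch of claim 3's fusion pipeline: select -> map -> weighted-fuse.
# All dimensions, projections, and weighting rules are assumptions.
import numpy as np

rng = np.random.default_rng(0)
FUSED_LEN = 64  # the "preset length" of the fusion vector (assumed value)

def select_features(vec, k):
    """Feature selection: keep only the k highest-magnitude components (assumed rule)."""
    idx = np.argsort(np.abs(vec))[-k:]
    out = np.zeros_like(vec)
    out[idx] = vec[idx]
    return out

def map_to_shared_space(vec, proj):
    """Map a modality vector into the preset feature space via a linear projection."""
    return proj @ vec

def weighted_fusion(mapped):
    """Weighted fusion: softmax weights over modality norms, then a weighted sum."""
    norms = np.array([np.linalg.norm(m) for m in mapped])
    w = np.exp(norms - norms.max())
    w /= w.sum()
    return sum(wi * m for wi, m in zip(w, mapped))

# Two modalities with different native dimensions (e.g. text=128, image=256).
text_vec = rng.standard_normal(128)
image_vec = rng.standard_normal(256)
proj_text = rng.standard_normal((FUSED_LEN, 128)) * 0.1
proj_image = rng.standard_normal((FUSED_LEN, 256)) * 0.1

mapped = [
    map_to_shared_space(select_features(text_vec, 32), proj_text),
    map_to_shared_space(select_features(image_vec, 64), proj_image),
]
fusion_vector = weighted_fusion(mapped)
assert fusion_vector.shape == (FUSED_LEN,)
```

Whatever the real operators are, the key property the claim requires is that the output length is fixed at the preset length regardless of how many modalities are present or what their native dimensions are.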

Description

Image generation method and device and display equipment

Technical Field

The present application relates to the field of display devices, and in particular to an image generation method and apparatus, and a display device.

Background

With the rapid development of image generation technology, artificial intelligence and machine learning can automatically generate images. The related art implements image generation from single-modality information, for example generating an image from text information, or generating a corresponding image from image information. However, as users' personalized needs grow, images generated from single-modality information can no longer meet those needs.

Disclosure of Invention

Embodiments of the present application provide an image generation method, an image generation apparatus, and a display device, which can generate corresponding images from multi-modal information and meet the personalized needs of users. An embodiment provides an image generation method comprising the following steps. First, multi-modal information input by a user is acquired, the multi-modal information comprising at least two of audio information, text information, image information, and video information. Then, each piece of modality information is processed by an encoding network in an image generation model to obtain the feature vector corresponding to that modality. Next, the feature vectors corresponding to the respective modalities are fused by a fusion network in the image generation model to obtain a fusion vector. Finally, the fusion vector is processed through a multi-stage network in the image generation model to obtain a target image corresponding to the multi-modal information.
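The front half of this pipeline (one encoder per modality, then fusion) can be sketched minimally. A real system would use trained encoders such as a text transformer or an image CNN; here each "encoder" is a stand-in linear projection with a tanh nonlinearity, and the input dimensions, the shared embedding width, and mean fusion are all assumptions made purely for illustration.

```python
# Minimal sketch of the encode-then-fuse front end described above.
# Encoders, dimensions, and the fusion rule are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(7)
EMB = 64  # shared embedding width (assumed)

class LinearEncoder:
    """Stand-in for a per-modality encoding sub-network."""
    def __init__(self, in_dim):
        self.w = rng.standard_normal((EMB, in_dim)) * 0.1
    def __call__(self, x):
        return np.tanh(self.w @ x)  # tanh keeps outputs bounded in (-1, 1)

encoders = {
    "text": LinearEncoder(300),    # e.g. a 300-dim text representation (assumed)
    "image": LinearEncoder(1024),  # e.g. 1024-dim flattened image features (assumed)
}

def encode_all(inputs):
    """Run each modality through its own encoding sub-network."""
    return {m: encoders[m](x) for m, x in inputs.items()}

def fuse(features):
    """Simple mean fusion into one fixed-length fusion vector (assumed rule)."""
    return np.mean(list(features.values()), axis=0)

inputs = {"text": rng.standard_normal(300), "image": rng.standard_normal(1024)}
features = encode_all(inputs)
fusion_vector = fuse(features)
assert fusion_vector.shape == (EMB,)
```

The point of the structure is that every modality, whatever its native size, ends up as a vector of the same width, so the fusion network can combine any subset of modalities the user happens to supply.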
According to the image generation method provided by the embodiments of the present application, after the multi-modal information input by the user is obtained, it is fed into the image generation model and processed in sequence by the encoding network, the fusion network, and the multi-stage network to obtain a target image conforming to the multi-modal information. The method can therefore process multi-modal information to generate images that meet users' personalized requirements and improve the user experience.

In some embodiments, processing the fusion vector through the multi-stage network in the image generation model to obtain the target image corresponding to the multi-modal information comprises: determining at least one piece of auxiliary feature information corresponding to the multi-modal information; processing the fusion vector and the at least one piece of auxiliary feature information with at least one feature sub-network in the multi-stage network to obtain a target vector; and decoding the target vector with an image decoding sub-network in the multi-stage network to obtain the target image. Under this scheme, auxiliary feature information is introduced when the multi-stage network processes the fusion vector; the auxiliary feature information allows the multi-stage network to extract more realistic and intuitive features, improving the quality of the subsequently generated target image so that the image generation model produces an image that meets the user's requirements.
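The three-stage refinement described above, where each feature sub-network receives the previous stage's output plus one auxiliary feature vector (skeleton, depth of field, edge structure), can be sketched as a chain. The concatenate-and-project stage bodies and the common feature width are assumptions; the patent does not define the sub-networks' internals.

```python
# Hedged sketch of the multi-stage network: three sub-networks in sequence,
# each conditioned on one auxiliary feature vector. Stage internals are assumed.
import numpy as np

rng = np.random.default_rng(1)
D = 64  # working feature width (assumed)

def make_stage():
    w = rng.standard_normal((D, 2 * D)) * 0.1
    def stage(x, aux):
        # Condition on the auxiliary feature by concatenation, then project.
        return np.tanh(w @ np.concatenate([x, aux]))
    return stage

first_stage, mid_stage, second_stage = make_stage(), make_stage(), make_stage()

def multi_stage(fusion_vec, skeleton, depth, edges):
    out1 = first_stage(fusion_vec, skeleton)  # first feature sub-network
    out2 = mid_stage(out1, depth)             # intermediate feature sub-network
    target = second_stage(out2, edges)        # second feature sub-network
    return target  # the target vector, which would then go to the image decoder

fusion_vec = rng.standard_normal(D)
skeleton, depth, edges = (rng.standard_normal(D) for _ in range(3))
target_vector = multi_stage(fusion_vec, skeleton, depth, edges)
assert target_vector.shape == (D,)
```

Injecting a different structural cue at each stage (rather than all at once) mirrors conditioning schemes used in practice, where pose, depth, and edge maps each steer a different level of detail in the generated image.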
In some embodiments, processing the fusion vector and the at least one piece of auxiliary feature information with at least one feature sub-network in the multi-stage network to obtain the target vector comprises: processing the fusion vector and first auxiliary feature information with a first feature sub-network among the at least one feature sub-network to determine a first output vector, wherein the at least one piece of auxiliary feature information comprises first auxiliary feature information, second auxiliary feature information, and third auxiliary feature information; processing the first output vector and the second auxiliary feature information with an intermediate feature sub-network to determine a second output vector; and processing the second output vector and the third auxiliary feature information with a second feature sub-network to determine the target vector. Under this scheme, within the multi-stage network, the vector output by the previous feature sub-network and the auxiliary feature information corresponding to the current feature sub-network are input into the current feature sub-network, and the resulting output vector is in turn fed into the next feature sub-network. By di