CN-121982140-A - Model training method, scene image generation method, device, equipment and medium

CN121982140A

Abstract

The application provides a model training method, a scene image generation method, a device, equipment and a medium, relating to the technical field of image processing. The method comprises: obtaining a character table of a preset story world, wherein the character table comprises a plurality of character images in the preset story world and the appearance attributes of the first character in each character image; determining, according to a plurality of existing story scene images of the preset story world and the character table, an appearance description text of the second character in each existing story scene image and an event description text associated with that second character; and training an initial story scene diffusion model according to the appearance description texts of the second characters, the event description texts associated with the second characters, and the plurality of existing story scene images, to obtain a target story scene diffusion model. A new story scene image generated with this model conforms more closely to its new story scene description text, and the same character remains consistent across new story scene images.

Inventors

  • Zhang Jinlu
  • Tang Jiji
  • Zhang Rongsheng
  • Lv Tangjie
  • Sun Xiaoshuai
  • Zhao Zeng
  • Fan Changjie

Assignees

  • NetEase (Hangzhou) Network Co., Ltd. (网易(杭州)网络有限公司)
  • Xiamen University (厦门大学)

Dates

Publication Date
2026-05-05
Application Date
2024-10-31

Claims (16)

  1. A model training method, comprising: acquiring a character table of a preset story world, wherein the character table comprises a plurality of character images in the preset story world and appearance attributes of a first character in each character image; determining, according to a plurality of existing story scene images of the preset story world and the character table, an appearance description text of a second character in each existing story scene image and an existing event description text associated with the second character; and training an initial story scene diffusion model according to the appearance description text of the second character in each existing story scene image, the existing event description text associated with the second character, and the plurality of existing story scene images, to obtain a target story scene diffusion model, wherein the target story scene diffusion model is used for generating, according to a new story scene description text, a new story scene image corresponding to the new story scene description text.
  2. The method of claim 1, wherein acquiring the character table of the preset story world comprises: determining a character description text corresponding to each character image according to the plurality of character images; analyzing the character description text corresponding to each character image to obtain a plurality of appearance attributes of the first character in each character image; and constructing the character table according to the plurality of character images and the appearance attributes of the first character in each character image.
  3. The method of claim 2, wherein constructing the character table according to the plurality of character images and the plurality of appearance attributes of the first character in each character image comprises: taking the first character in each character image as a character node and each appearance attribute of the first character in that character image as an attribute node, constructing a plurality of character attribute networks corresponding to the first character in each character image; constructing a character network of the first character in each character image according to the plurality of character attribute networks corresponding to that first character; and constructing the character table from the character networks of the first characters in the character images, using the plurality of character images as indexes.
  4. The method of claim 1, wherein determining, according to the plurality of existing story scene images of the preset story world and the character table, the appearance description text of the second character in each existing story scene image and the existing event description text associated with the second character comprises: determining description information of each existing story scene image according to the plurality of existing story scene images; analyzing the description information of each existing story scene image to obtain description information of the second character in that image and the events associated with the second character; and determining the appearance description text of the second character and the existing event description text associated with the second character in each existing story scene image according to the description information of the second character in that image, the events associated with the second character, and the character table.
  5. The method of claim 4, wherein determining the appearance description text of the second character and the existing event description text associated with the second character in each existing story scene image according to the description information of the second character in that image, the events associated with the second character, and the character table comprises: searching the plurality of character images in the character table for a target character image whose first character matches the description information of the second character in the existing story scene image, and retrieving the name of the first character in that target character image; determining the appearance description text of the second character in each existing story scene image according to the name of the first character in the target character image and the character table; and determining the existing event description text associated with the second character according to the name of the first character in the target character image, the events associated with the second character, and the relations among the first characters in the target character images.
  6. The method of claim 5, wherein determining the appearance description text of the second character in each existing story scene image according to the name of the first character in the target character image and the character table comprises: determining the appearance attributes of the first character in the target character image according to the name of the first character in the target character image and the character table; and determining the appearance description text of the second character in each existing story scene image according to the name of the first character in the target character image and the appearance attributes of the first character in the target character image.
  7. The method of claim 1, wherein training the initial story scene diffusion model according to the appearance description text of the second character in each existing story scene image, the existing event description text associated with the second character, and the plurality of existing story scene images, to obtain the target story scene diffusion model comprises: training the initial story scene diffusion model according to the appearance description text of the second character in each existing story scene image, the existing event description text associated with the second character, a preset style description text of each existing story scene image, and the plurality of existing story scene images, to obtain the target story scene diffusion model.
  8. The method of claim 1, wherein training the initial story scene diffusion model according to the appearance description text of the second character in each existing story scene image, the existing event description text associated with the second character, and the plurality of existing story scene images, to obtain the target story scene diffusion model comprises: calculating, for each existing story scene image, the probability that the second character appears at each pixel; correcting the probability according to the appearance description text of the second character to obtain a corrected probability; modifying the cross-attention weights in the initial story scene diffusion model according to the corrected probability to obtain modified cross-attention weights; and training the initial story scene diffusion model according to the appearance description text of the second character in each existing story scene image, the existing event description text associated with the second character, the modified cross-attention weights, and the plurality of existing story scene images, to obtain the target story scene diffusion model.
  9. The method of claim 8, wherein training the initial story scene diffusion model according to the appearance description text of the second character in each existing story scene image, the existing event description text associated with the second character, the modified cross-attention weights, and the plurality of existing story scene images, to obtain the target story scene diffusion model comprises: generating a resulting story scene image according to the appearance description text of the second character in each existing story scene image, the existing event description text associated with the second character, and the modified cross-attention weights; and updating the model parameters of the initial story scene diffusion model using a loss function computed from the resulting story scene image and the plurality of existing story scene images, until a training termination condition is met, to obtain the target story scene diffusion model.
  10. The method of claim 8, wherein correcting the probability according to the appearance description text of the second character to obtain the corrected probability comprises: processing the appearance description text of the second character with a character knowledge encoder to obtain a mean shift amount and a variance correction amount; and correcting the probability according to the mean shift amount and the variance correction amount to obtain the corrected probability.
  11. A scene image generation method, comprising: acquiring a new story scene description text; and generating, using a target story scene diffusion model, a new story scene image corresponding to the new story scene description text according to the new story scene description text, wherein the target story scene diffusion model is a model trained by the method of any one of claims 1-10.
  12. The method of claim 11, wherein the new story scene description text comprises an appearance description text of a third character, an event description text associated with the third character, and a style description text.
  13. A model training apparatus, comprising: an acquisition module, configured to acquire a character table of a preset story world, wherein the character table comprises a plurality of character images in the preset story world and appearance attributes of a first character in each character image; a determining module, configured to determine, according to a plurality of existing story scene images of the preset story world and the character table, an appearance description text of a second character in each existing story scene image and an existing event description text associated with the second character; and a training module, configured to train an initial story scene diffusion model according to the appearance description text of the second character in each existing story scene image, the existing event description text associated with the second character, and the plurality of existing story scene images, to obtain a target story scene diffusion model, wherein the target story scene diffusion model is used for generating, according to a new story scene description text, a new story scene image corresponding to the new story scene description text.
  14. A scene image generation apparatus, comprising: an acquisition module, configured to acquire a new story scene description text; and a generating module, configured to generate, using a target story scene diffusion model, a new story scene image corresponding to the new story scene description text according to the new story scene description text, wherein the target story scene diffusion model is a model trained by the method of any one of claims 1-10.
  15. An electronic device, comprising a memory and a processor, the memory storing a computer program executable by the processor, wherein the processor, when executing the computer program, implements the model training method of any one of claims 1-10 or the scene image generation method of any one of claims 11-12.
  16. A computer-readable storage medium, having stored thereon a computer program which, when read and executed, implements the model training method of any one of claims 1-10 or the scene image generation method of any one of claims 11-12.
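The character table of claims 2-3 can be pictured as a small graph structure: each character image contributes a character attribute network (a character node linked to attribute nodes), and the table indexes these networks by character image. The sketch below is a minimal illustration of that data structure only, not the patented implementation; all names (`CharacterTable`, `add_character`, and the example attributes) are hypothetical.

```python
# Hypothetical sketch of the character table of claims 2-3: each entry
# is a character attribute network with the character as the central
# node and each appearance attribute as an attribute node.

class CharacterTable:
    def __init__(self):
        # character image id -> {"name": ..., "attributes": {attr: value}}
        self.entries = {}

    def add_character(self, image_id, name, attributes):
        """Record one character attribute network, indexed by its image."""
        self.entries[image_id] = {"name": name, "attributes": dict(attributes)}

    def lookup_by_name(self, name):
        """Return the appearance attributes of the first character whose
        name matches (used when matching a second character in a scene)."""
        for entry in self.entries.values():
            if entry["name"] == name:
                return entry["attributes"]
        return None

    def appearance_text(self, name):
        """Flatten the attribute nodes into an appearance description text."""
        attrs = self.lookup_by_name(name)
        if attrs is None:
            return None
        parts = [f"{k}: {v}" for k, v in sorted(attrs.items())]
        return f"{name} ({', '.join(parts)})"

table = CharacterTable()
table.add_character("img_001", "Alice", {"hair": "red", "outfit": "green cloak"})
print(table.appearance_text("Alice"))
# → Alice (hair: red, outfit: green cloak)
```

In this reading, the appearance description text of claim 6 falls out of the table by flattening the matched character's attribute nodes into a phrase.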
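Claims 8 and 10 describe correcting a per-pixel character probability with a mean shift and a variance correction from a character knowledge encoder, then using the corrected probability to modify the diffusion model's cross-attention weights. The patent does not give the formulas, so the following is one plausible numeric reading under stated assumptions: the probability map is re-standardized, shifted, and rescaled, and each pixel's attention weight for the character's text tokens is scaled by its corrected probability. Everything here (function names, the clamping, the renormalization) is an assumption for illustration.

```python
import math

def correct_probability(prob_map, mean_shift, var_correction):
    """One plausible reading of claim 10: re-standardize the per-pixel
    character probabilities, then shift the mean and rescale the spread
    by the amounts supplied by the character knowledge encoder (here,
    mean_shift and var_correction are just given numbers)."""
    n = len(prob_map)
    mean = sum(prob_map) / n
    var = sum((p - mean) ** 2 for p in prob_map) / n
    std = math.sqrt(var) if var > 0 else 1.0
    corrected = []
    for p in prob_map:
        z = (p - mean) / std
        q = (mean + mean_shift) + z * (std * var_correction)
        corrected.append(min(max(q, 0.0), 1.0))  # keep values in [0, 1]
    return corrected

def modulate_cross_attention(weights, corrected_prob):
    """Claim 8, sketched: scale each pixel's cross-attention weight for
    the character's text tokens by that pixel's corrected probability,
    then renormalize so the weights still sum to 1."""
    scaled = [w * p for w, p in zip(weights, corrected_prob)]
    total = sum(scaled) or 1.0
    return [s / total for s in scaled]

prob = [0.1, 0.2, 0.7, 0.9]  # per-pixel probability of the second character
corrected = correct_probability(prob, mean_shift=0.05, var_correction=1.1)
attn = modulate_cross_attention([0.25] * 4, corrected)
print(attn)
```

The design intent this sketch mirrors is that pixels likely to belong to the character attend more strongly to that character's description tokens, which is how the method keeps the rendered character consistent with its appearance text.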

Description

Model training method, scene image generation method, device, equipment and medium

Technical Field

The application relates to the technical field of image processing, and in particular to a model training method, a scene image generation method, a device, equipment and a medium.

Background

A story visualization task is the following: given a segment of story text containing multiple sentences, together with pictures of the characters appearing in the story, a model generates images frame by frame, one for each sentence describing a story scene, to depict the occurrence of the entire story. In the related art, a model is trained on a plurality of character images and generalized description texts of the characters. A model trained in this way, however, tends to generate story scene images that do not conform to the sentences describing the story scenes, and the same character may appear inconsistent across the plurality of story scene images at the application stage.

Disclosure of Invention

The present application provides a model training method, a scene image generation method, a device, equipment and a medium, which address the above-mentioned drawbacks of the related art.
In order to achieve the above purpose, the technical scheme adopted by the embodiments of the application is as follows.

In a first aspect, an embodiment of the present application provides a model training method, comprising: acquiring a character table of a preset story world, wherein the character table comprises a plurality of character images in the preset story world and appearance attributes of a first character in each character image; determining, according to a plurality of existing story scene images of the preset story world and the character table, an appearance description text of a second character in each existing story scene image and an existing event description text associated with the second character; and training an initial story scene diffusion model according to the appearance description text of the second character in each existing story scene image, the existing event description text associated with the second character, and the plurality of existing story scene images, to obtain a target story scene diffusion model, wherein the target story scene diffusion model is used for generating, according to a new story scene description text, a new story scene image corresponding to the new story scene description text.

In a second aspect, an embodiment of the present application further provides a scene image generation method, comprising: acquiring a new story scene description text; and generating, using a target story scene diffusion model, a new story scene image corresponding to the new story scene description text according to the new story scene description text, wherein the target story scene diffusion model is a model trained by the method of any one of the first aspect.
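At the application stage of the second aspect, the new story scene description text combines three parts (per claim 12): the third character's appearance text, the event text associated with that character, and a style text. The sketch below only shows assembling such a prompt; `build_scene_prompt` and the separator are hypothetical, and the trained target story scene diffusion model that would consume the prompt is deliberately not shown.

```python
def build_scene_prompt(appearance_text, event_text, style_text):
    """Assemble the new story scene description text of claim 12 from
    its three parts: appearance, associated event, and style. Missing
    parts are simply skipped."""
    parts = (appearance_text, event_text, style_text)
    return "; ".join(part for part in parts if part)

prompt = build_scene_prompt(
    "Alice (red hair, green cloak)",   # third character's appearance text
    "Alice opens the castle gate",     # event associated with the character
    "watercolor storybook style",      # style description text
)
print(prompt)
# A trained target story scene diffusion model (not sketched here) would
# then map this prompt to the new story scene image.
```

Feeding the full appearance text, rather than just the character's name, is what lets the trained model keep the character's look consistent from scene to scene.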
In a third aspect, an embodiment of the present application further provides a model training apparatus, comprising: an acquisition module, configured to acquire a character table of a preset story world, wherein the character table comprises a plurality of character images in the preset story world and appearance attributes of a first character in each character image; a determining module, configured to determine, according to a plurality of existing story scene images of the preset story world and the character table, an appearance description text of a second character in each existing story scene image and an existing event description text associated with the second character; and a training module, configured to train an initial story scene diffusion model according to the appearance description text of the second character in each existing story scene image, the existing event description text associated with the second character, and the plurality of existing story scene images, to obtain a target story scene diffusion model, wherein the target story scene diffusion model is used for generating, according to a new story scene description text, a new story scene image corresponding to the new story scene description text.

In a fourth aspect, an embodiment of the present application further provides a scene image generation apparatus, comprising: an acquisition module, configured to acquire a new story scene description text; and a generating module, configured to generate, using a target story scene diffusion model, a new story scene image corresponding to the new story scene description text according to the new story scene description text, wherein the target story scene diffusion model is a model trained by the method of any one of the first aspect.

In a fifth aspect, an embodiment of the present application further provides an electronic device, including a memory and a processor, where the me