
CN-121985186-A - Video generation method, device, equipment and storage medium

CN121985186A

Abstract

The application discloses a video generation method, apparatus, device and storage medium: a reference video and an editing prompt are acquired; the reference video and the editing prompt are parsed to generate a structured multi-track scene video script; and a target video conforming to the editing prompt is generated according to the script. The multi-track scene video script comprises a reference feature anchor module, an event track module and a visual track module. The reference feature anchor module records unique identifiers and description information for the entities in the video, establishing a canonical identity reference for each subject. The event track module consists of event units which, by referencing entity identifiers, associate each audio event with its executing entity and describe the audio narrative logic. The visual track module consists of shot units which, by referencing event identifiers, associate each storyboard frame with the audio event it presents and describe the visual presentation. The application improves the quality and controllability of the generated video in terms of subject consistency, shot-language accuracy and audio-visual synchronization, and is widely applicable in the field of video technology.
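The three-module script described in the abstract can be pictured as a small cross-referenced data structure. The sketch below is illustrative only; every class and field name is an assumption, not the patent's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class EntityAnchor:          # reference feature anchor module entry
    entity_id: str           # unique identifier: the canonical identity reference
    description: str         # appearance / attribute description of the entity

@dataclass
class EventUnit:             # event track module entry (audio narrative logic)
    event_id: str
    entity_id: str           # references EntityAnchor.entity_id (executing entity)
    audio_event: str         # e.g. a line of dialogue or a sound effect

@dataclass
class ShotUnit:              # visual track module entry (visual presentation)
    shot_id: str
    event_id: str            # references EventUnit.event_id (audio event presented)
    frame_description: str   # description of the storyboard frame

@dataclass
class MultiTrackScript:
    anchors: list[EntityAnchor] = field(default_factory=list)
    events: list[EventUnit] = field(default_factory=list)
    shots: list[ShotUnit] = field(default_factory=list)
```

Because shots point at events and events point at entities, a downstream generator can always resolve which subject a given frame depicts, which is the mechanism behind the claimed subject consistency.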

Inventors

  • Wang Biao
  • Li Ruihuang
  • Tao Jiale
  • Lu Qinglin
  • Shao Qi
  • Wu Shangda
  • Peng Houwen
  • Wang Lei
  • Wang Linqing

Assignees

  • Tencent Technology (Shenzhen) Company Limited

Dates

Publication Date
2026-05-05
Application Date
2026-01-26

Claims (14)

  1. A video generation method, the method comprising: acquiring a reference video and an editing prompt; parsing the reference video and the editing prompt to generate a structured multi-track scene video script; and performing video generation according to the multi-track scene video script to obtain a target video conforming to the editing prompt; wherein the multi-track scene video script comprises a reference feature anchor module, an event track module and a visual track module; the reference feature anchor module records an entity identifier and description information for at least one entity; the event track module records at least one event unit describing an audio event in the target video, together with an event identifier for each event unit, each event unit associating the executing entity of its audio event by referencing an entity identifier; and the visual track module records at least one shot unit describing a storyboard frame in the target video, each shot unit associating the audio event presented by its storyboard frame by referencing an event identifier.
  2. The method of claim 1, wherein parsing the reference video and the editing prompt to generate a structured multi-track scene video script comprises: parsing the reference video to generate a structured initial scene video script; analyzing the editing prompt using natural language processing to determine the editing intention corresponding to the editing prompt; and editing the initial scene video script according to the editing intention to obtain the multi-track scene video script.
  3. The method of claim 2, wherein parsing the reference video to generate a structured initial scene video script comprises: performing entity detection on the reference video to identify at least one entity contained in the reference video and the description information corresponding to the entity; allocating an entity identifier to the entity, and constructing an initial reference feature anchor module from the entity identifier and the description information; performing audio event detection on the reference video to identify at least one audio event contained in the reference video and the executing entity of the audio event; constructing an event unit from the audio event and the executing entity, and constructing an initial event track module from the event unit; performing storyboard detection on the reference video to identify at least one storyboard frame contained in the reference video; constructing a shot unit from the storyboard frame, and constructing an initial visual track module from the shot unit; and combining the initial reference feature anchor module, the initial event track module and the initial visual track module to obtain the initial scene video script.
  4. The method of claim 3, wherein performing entity detection on the reference video to identify at least one entity contained in the reference video and the description information corresponding to the entity comprises: sampling frames from the reference video to obtain a sequence of sampled video frames; performing entity detection on each sampled video frame through an entity detection model to obtain a plurality of entity detection boxes; extracting feature data of the image content in each entity detection box, and clustering the image contents according to the feature data to obtain at least one cluster category; and determining the image contents belonging to the same cluster category as one entity, and determining the description information corresponding to the entity according to the image content corresponding to the entity.
  5. The video generation method of claim 4, wherein clustering the image contents according to the feature data comprises: selecting any feature data as a cluster center and establishing a cluster; calculating the similarity between each unclustered feature data and the cluster center, the unclustered feature data being feature data not yet added to any cluster; when the similarity between the unclustered feature data and a cluster center is greater than or equal to a preset threshold, adding the image content corresponding to that feature data to the cluster of that cluster center and updating the position of the cluster center; and when the similarity between the unclustered feature data and every cluster center is smaller than the preset threshold, establishing a new cluster with that feature data as its cluster center.
  6. The method of claim 5, wherein determining the description information corresponding to the entity according to the image content corresponding to the entity comprises: inputting the feature data of the image content corresponding to the entity into an image understanding model, and generating the description information corresponding to the entity through the image understanding model; the image understanding model being trained by the following steps: acquiring a training data set comprising a plurality of sample images and labels corresponding to the sample images, the labels describing the image content of the sample images; extracting sample feature data of a sample image, inputting the sample feature data into the image understanding model to be trained, and generating predicted description information corresponding to the sample image through the model; determining a training loss value according to the difference between the predicted description information and the label; and updating parameters of the image understanding model according to the loss value to obtain the trained image understanding model.
  7. The method of claim 3, wherein performing audio event detection on the reference video to identify at least one audio event contained in the reference video and the executing entity of the audio event, and constructing an event unit from the audio event and the executing entity, comprises: performing audio segmentation on the reference video to obtain at least one audio segment and the audio start and end times corresponding to the audio segment; performing content recognition on the audio segment to determine the audio event and executing entity corresponding to the audio segment; and extracting event information of the audio event, constructing the event unit from the audio start and end times, the event information and the executing entity, and allocating the event identifier to the event unit.
  8. The method of claim 7, wherein performing storyboard detection on the reference video to identify at least one storyboard frame contained in the reference video, and constructing a shot unit from the storyboard frame, comprises: performing storyboard detection on the reference video to obtain at least one storyboard frame and the video start and end times corresponding to the storyboard frame; determining the target audio event presented by the storyboard frame according to the video start and end times and the audio start and end times; and extracting frame information of the storyboard frame, constructing the shot unit from the video start and end times, the frame information and the target audio event, and allocating a shot identifier to the shot unit.
  9. The video generation method of claim 2, wherein editing the initial scene video script according to the editing intention comprises at least one of: updating the description information of a target entity according to a modification instruction for the target entity in the editing intention; updating the shot unit corresponding to a target shot according to a modification instruction for the target shot in the editing intention; or adding a new event unit to the event track module, or a new shot unit to the visual track module, according to a modification instruction for newly added content in the editing intention.
  10. The method of claim 9, wherein updating the description information of the target entity according to the modification instruction for the target entity in the editing intention comprises: extracting the entity identifier, a target attribute field and a target attribute value corresponding to the target entity from the modification instruction for the target entity in the editing intention; querying the reference feature anchor module for the description information of the target entity according to the entity identifier corresponding to the target entity; and setting the value of the target attribute field in the description information of the target entity to the target attribute value.
  11. A video generation apparatus, the apparatus comprising: an acquisition unit for acquiring a reference video and an editing prompt; a parsing unit for parsing the reference video and the editing prompt to generate a structured multi-track scene video script; and an execution unit for performing video generation according to the multi-track scene video script to obtain a target video conforming to the editing prompt; wherein the multi-track scene video script comprises a reference feature anchor module, an event track module and a visual track module; the reference feature anchor module records an entity identifier and description information for at least one entity; the event track module records at least one event unit describing an audio event in the target video, together with an event identifier for each event unit, each event unit associating the executing entity of its audio event by referencing an entity identifier; and the visual track module records at least one shot unit describing a storyboard frame in the target video, each shot unit associating the audio event presented by its storyboard frame by referencing an event identifier.
  12. An electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the video generation method of any one of claims 1 to 10 when executing the computer program.
  13. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the video generation method of any one of claims 1 to 10.
  14. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the video generation method of any one of claims 1 to 10.
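The threshold-based incremental clustering of claim 5 can be sketched as follows. The claim fixes neither the similarity measure nor the center-update rule, so cosine similarity, first-match assignment, and a running-mean center update are assumptions for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def threshold_cluster(features, threshold=0.8):
    """Incremental clustering in the style of claim 5: assign each feature
    vector to the first cluster whose center is similar enough; otherwise
    start a new cluster with that vector as its center."""
    clusters = []  # each cluster: {"center": [...], "members": [indices]}
    for i, f in enumerate(features):
        for c in clusters:
            if cosine_similarity(f, c["center"]) >= threshold:
                c["members"].append(i)
                # update the cluster center as the running mean of its
                # members (one possible update rule; the claim leaves it open)
                n = len(c["members"])
                c["center"] = [(cx * (n - 1) + fx) / n
                               for cx, fx in zip(c["center"], f)]
                break
        else:
            clusters.append({"center": list(f), "members": [i]})
    return clusters
```

Image contents landing in the same cluster are then treated as one entity (claim 4), which is what allows the same subject to be tracked across sampled frames.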

Description

Video generation method, device, equipment and storage medium

Technical Field

The present application relates to the field of video technologies, and in particular to a method, an apparatus, a device, and a storage medium for video generation.

Background

As video content creation enters an intelligent age, automatically generating video content from textual descriptions with a generative artificial intelligence model has become a leading direction of industry attention. In the related art, a user generally provides a reference video and an editing instruction (such as an editing prompt); the reference video is then analyzed and understood through a video understanding task, and after video description information is obtained, the target video is generated in combination with the content of the editing instruction. In this process, the quality and form of the video description information that drives the model largely determine the accuracy and controllability of the target video. In practice, however, the video description information obtained through a video understanding task aims to summarize the understanding a human forms after watching the reference video; its content description is monolithic and entangled, so when it is applied to generating the target video, precision and disambiguation are difficult to achieve, the quality of the generated target video is poor, and the actual authoring intention cannot be satisfied.

Disclosure of Invention

The embodiments of the present application provide a video generation method, apparatus, device and storage medium, which can improve the quality and controllability of the generated video in terms of subject consistency, shot-language accuracy and audio-visual synchronization.
An aspect of an embodiment of the present application provides a video generation method, including: acquiring a reference video and an editing prompt; parsing the reference video and the editing prompt to generate a structured multi-track scene video script; and performing video generation according to the multi-track scene video script to obtain a target video conforming to the editing prompt. The multi-track scene video script comprises a reference feature anchor module, an event track module and a visual track module. The reference feature anchor module records an entity identifier and description information for at least one entity. The event track module records at least one event unit describing an audio event in the target video, together with an event identifier for each event unit; each event unit associates the executing entity of its audio event by referencing an entity identifier. The visual track module records at least one shot unit describing a storyboard frame in the target video; each shot unit associates the audio event presented by its storyboard frame by referencing an event identifier.
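Claim 8 links each storyboard frame to a target audio event by comparing video start/end times with audio start/end times. A minimal sketch, assuming the target event is the one with the largest temporal overlap (the claim does not specify the selection rule):

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def match_shot_to_event(shot_span, events):
    """Pick the audio event with the largest temporal overlap with a shot.

    shot_span: (start, end) of the storyboard frame on the video timeline.
    events: dict mapping event_id -> (audio_start, audio_end).
    Returns the best-matching event_id, or None if nothing overlaps.
    """
    best_id, best_ov = None, 0.0
    for event_id, (e_start, e_end) in events.items():
        ov = overlap(shot_span[0], shot_span[1], e_start, e_end)
        if ov > best_ov:
            best_id, best_ov = event_id, ov
    return best_id
```

The shot unit then stores the chosen event identifier, which is the cross-reference the visual track uses for audio-visual synchronization.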
In another aspect, an embodiment of the present application provides a video generation apparatus, including: an acquisition unit for acquiring a reference video and an editing prompt; a parsing unit for parsing the reference video and the editing prompt to generate a structured multi-track scene video script; and an execution unit for performing video generation according to the multi-track scene video script to obtain a target video conforming to the editing prompt. The multi-track scene video script comprises a reference feature anchor module, an event track module and a visual track module. The reference feature anchor module records an entity identifier and description information for at least one entity. The event track module records at least one event unit describing an audio event in the target video, together with an event identifier for each event unit; each event unit associates the executing entity of its audio event by referencing an entity identifier. The visual track module records at least one shot unit describing a storyboard frame in the target video; each shot unit associates the audio event presented by its storyboard frame by referencing an event identifier. Optionally, in some embodiments, the parsing unit is specifically configured to: parse the reference video to generate a structured initial scene video script; analyze the editing prompt using natural language processing to determine the editing intention corresponding to the editing prompt; and edit the initial scene video script according to the editing intention to obtain the multi-track scene video script. Optionally, in some embodiments, the parsing unit is specifically configured to: perform entity detection on the