
CN-122002100-A - Video generation method, device, electronic equipment and storage medium

CN122002100A

Abstract

The disclosure provides a video generation method, a video generation device, an electronic device, and a storage medium, relating to the technical field of artificial intelligence, in particular to computer vision, video content understanding, automatic video editing, and intelligent media synthesis. The method comprises: extracting a plurality of candidate video segments containing each target object from an original video based on a source video identifier and reference information of at least two target objects; performing quality screening respectively on the candidate video segments of each target object; combining the screened segments into a plurality of segment sets according to a preset matching rule; for each segment set, performing vertical cropping on each video segment according to face position information of each video segment in the set; synthesizing the cropped video segments according to a preset multi-split-screen layout to obtain a synthesized frame sequence; and combining the synthesized frame sequence with target audio data to generate a vertical split-screen video.

Inventors

  • Ouyang Chao
  • Li Huachao
  • Pang Lei
  • Bai Yunlong
  • Wang Ying

Assignees

  • Beijing Baidu Netcom Science and Technology Co., Ltd. (北京百度网讯科技有限公司)

Dates

Publication Date
2026-05-08
Application Date
2026-01-26

Claims (20)

  1. A video generation method, comprising: extracting a plurality of candidate video segments containing each target object from an original video based on a source video identifier and reference information of at least two target objects, wherein the source video identifier and the reference information are acquired according to a user input instruction; performing quality screening respectively on the candidate video segments of each target object, and combining the screened segments into a plurality of segment sets according to a preset matching rule, wherein each segment set contains one video segment of each target object; for each segment set, performing vertical cropping on each video segment based on face position information of each video segment in the segment set, and synthesizing the cropped video segments according to a preset multi-split-screen layout to obtain a synthesized frame sequence; and combining the synthesized frame sequence with target audio data to generate a vertical split-screen video.
  2. The method of claim 1, wherein extracting a plurality of candidate video segments containing each target object from the original video based on the source video identifier and reference information of at least two target objects comprises: acquiring the original video according to the source video identifier; extracting feature information of each target object according to the reference information of the at least two target objects; and performing frame-by-frame analysis on the original video based on the feature information to extract a plurality of candidate video segments containing each target object.
  3. The method of claim 2, wherein performing frame-by-frame analysis on the original video based on the feature information to extract a plurality of candidate video segments containing each target object comprises: performing face detection on video frames in the original video, and screening out, as candidate close-up frames, video frames whose picture contains only a single face and in which the face size is larger than a preset threshold; identifying the face features extracted from the candidate close-up frames based on the feature information so as to determine the attribution relation between each candidate close-up frame and the target objects; and aggregating, in time order, candidate close-up frames that share the same attribution relation and are adjacent in time to form candidate video segments corresponding to the target object.
  4. The method according to claim 3, further comprising: recording the horizontal center position, within the picture, of the face region in each candidate close-up frame; wherein performing vertical cropping on each video segment according to face position information of each video segment in each segment set comprises: cropping each frame of each video segment in the segment set to a preset vertical aspect ratio, using the face horizontal center position recorded for that frame as the reference point.
  5. The method of claim 1, wherein performing quality screening respectively on the candidate video segments of each target object comprises: performing content quality optimization on the candidate video segments of each target object to obtain an optimized segment set corresponding to each target object.
  6. The method of claim 5, wherein the content quality optimization comprises at least one of: removing segments with duplicate content by applying a visual hash algorithm; for each candidate video segment, calculating the standard deviation of the sequence formed by the face horizontal center positions of its video frames, and discarding candidate video segments whose standard deviation exceeds a preset stability threshold; and performing alignment optimization on the start and end boundaries of the candidate video segments so that those boundaries align with scene transition points of the original video.
  7. The method of claim 5, wherein performing quality screening further comprises: performing content attribute analysis on the segments in the optimized segment set, and generating at least one content attribute tag for each segment, wherein the content attribute tag comprises at least one of an expression state, an emotional state, or an action type of the target object in the segment.
  8. The method of claim 7, wherein, when the number of target objects is two, combining the screened segments into a plurality of segment sets according to a preset matching rule comprises: selecting, based on the content attribute tags, segments whose content attribute tags conform to a preset emotion correspondence from the optimized segment sets of the two target objects, and combining them to form a segment set.
  9. The method of claim 8, wherein the preset emotion correspondence is either of the following: the emotional states indicated by the content attribute tags of the two target objects' segments are different; or the emotional states indicated by the content attribute tags of the two target objects' segments are the same.
  10. The method of claim 5, wherein combining the screened segments into a plurality of segment sets according to a preset matching rule comprises: selecting one segment from the optimized segment set of each target object according to a duration matching rule, based on the duration information of each segment in the optimized segment set of each target object, and combining the selected segments to form the segment sets.
  11. The method of claim 10, wherein, when the number of target objects is two, the duration matching rule comprises: determining, of the two optimized segment sets, the one with the smaller number of segments as the reference set; and for each segment in the reference set, selecting the segment with the smallest duration difference from the other optimized segment set for pairing.
  12. The method of claim 1, wherein the preset multi-split-screen layout is a layout presenting multiple pictures simultaneously, comprising a top-bottom layout, a side-by-side layout, a grid layout, or a picture-in-picture layout.
  13. The method of claim 1, wherein combining the synthesized frame sequence with target audio data to generate a vertical split-screen video comprises: encoding the synthesized frame sequence into a silent vertical split-screen video segment; sequentially splicing a plurality of silent vertical split-screen video segments to form a silent video segment sequence; and performing stream mixing on the target audio data and the silent video segment sequence to generate the vertical split-screen video.
  14. The method of claim 13, further comprising: in response to the duration of the generated vertical split-screen video exceeding a preset target duration threshold, splitting the vertical split-screen video into at least one video whose duration meets the target duration threshold.
  15. The method of claim 1, wherein, after generating the vertical split-screen video, the method further comprises: extracting, from the synthesized frames used to generate the vertical split-screen video, at least one frame meeting a preset multi-face state requirement as a candidate cover frame; and ranking the candidate cover frames by composite score, and selecting at least one frame as the cover of the vertical split-screen video according to the ranking result.
  16. The method of claim 15, wherein the preset multi-face state requirement comprises: the synthesized frame must contain as many faces as there are target objects, and the biometric index of each face must satisfy a preset condition; the biometric index comprises at least an eye aspect ratio and a mouth aspect ratio, and the preset condition is that the eye aspect ratio is larger than a first threshold and the mouth aspect ratio is smaller than a second threshold.
  17. A video generation apparatus, comprising: a segment extraction module configured to extract a plurality of candidate video segments containing each target object from an original video based on a source video identifier and reference information of at least two target objects, wherein the source video identifier and the reference information are acquired according to a user input instruction; a segment processing and pairing module configured to perform quality screening respectively on the candidate video segments of each target object and combine the screened segments into a plurality of segment sets according to a preset matching rule, wherein each segment set contains one video segment of each target object; a rendering and synthesis module configured to, for each segment set, perform vertical cropping on each video segment based on face position information of each video segment in the segment set, and synthesize the cropped video segments according to a preset multi-split-screen layout to obtain a synthesized frame sequence; and a video generation module configured to combine the synthesized frame sequence with target audio data to generate a vertical split-screen video.
  18. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-16.
  19. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-16.
  20. A computer program product comprising a computer program stored on a storage medium, which, when executed by a processor, implements the method of any one of claims 1-16.
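The face-centered vertical cropping of claims 3-4 can be sketched as a small geometry helper. This is an illustrative implementation, not the patent's: the function name, parameters, and the clamping of the crop window to the frame bounds are our assumptions; the patent only specifies cropping to a preset vertical aspect ratio around the recorded face horizontal center.

```python
def vertical_crop_window(frame_width, frame_height, face_center_x, aspect=9 / 16):
    """Compute a full-height portrait crop window centered on the recorded
    face horizontal center, clamped so the window stays inside the frame.

    Hypothetical helper illustrating claims 3-4; returns (x0, y0, x1, y1).
    """
    # Width of a full-height crop at the target vertical aspect ratio.
    crop_width = min(frame_width, int(round(frame_height * aspect)))
    left = int(round(face_center_x - crop_width / 2))
    # Clamp so the window never leaves the frame (assumption: the patent
    # does not say how off-center faces near an edge are handled).
    left = max(0, min(left, frame_width - crop_width))
    return (left, 0, left + crop_width, frame_height)
```

For a 1920x1080 frame the 9:16 window is 608 pixels wide; a face recorded near the right edge simply pins the window against that edge.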
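The stability criterion in claim 6 reduces to a one-line statistic: a segment is kept only if the standard deviation of its per-frame face horizontal centers is within a preset threshold. A minimal sketch, with an illustrative pixel threshold the patent does not specify:

```python
from statistics import pstdev

def is_stable(face_centers_x, stability_threshold=40.0):
    """Claim 6 sketch: keep a candidate segment only if the population
    standard deviation of its face horizontal centers (one value per
    frame) does not exceed the preset stability threshold.

    The threshold value (in pixels) is illustrative, not from the patent.
    """
    return pstdev(face_centers_x) <= stability_threshold
```

A face that drifts only a few pixels passes; one that jumps between screen halves (standard deviation far above the threshold) is discarded.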
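The duration matching rule of claims 10-11 can be sketched as a greedy nearest-duration pairing. One point is our assumption: the claim does not say whether a segment from the larger set may be paired more than once, so this sketch removes each matched segment to keep the pairs distinct.

```python
def pair_by_duration(set_a, set_b):
    """Claims 10-11 sketch: pair segments from two optimized sets by
    closest duration. Segments are (segment_id, duration_seconds) tuples.

    The set with fewer segments is the reference set; each of its
    segments is paired with the remaining segment of smallest duration
    difference from the other set (removal on match is an assumption).
    """
    ref, other = (set_a, set_b) if len(set_a) <= len(set_b) else (set_b, set_a)
    other = list(other)  # copy so the caller's list is untouched
    pairs = []
    for seg in ref:
        best = min(other, key=lambda cand: abs(cand[1] - seg[1]))
        other.remove(best)
        pairs.append((seg, best))
    return pairs
```

Each resulting pair is one "segment set" of claim 1: one segment per target object, with closely matched durations so neither split-screen panel runs long.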
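The cover-frame condition of claims 15-16 combines a face count with per-face eye and mouth aspect ratios. The patent names the metrics but not how they are computed; the sketch below uses the common six-landmark EAR formulation (Soukupová & Čech) and illustrative thresholds, both our assumptions.

```python
from math import dist

def eye_aspect_ratio(p1, p2, p3, p4, p5, p6):
    """EAR from six eye landmarks, ordered around the eye: p1/p4 are the
    horizontal corners, p2/p6 and p3/p5 vertical pairs. Assumed scheme;
    the patent does not specify a landmark layout."""
    return (dist(p2, p6) + dist(p3, p5)) / (2.0 * dist(p1, p4))

def is_cover_candidate(ears, mars, ear_min=0.2, mar_max=0.5, n_targets=2):
    """Claim 16 sketch: a synthesized frame qualifies as a cover candidate
    if it has exactly one face per target object, every eye aspect ratio
    exceeds the first threshold (eyes open) and every mouth aspect ratio
    is below the second (mouth closed). Threshold values are illustrative."""
    return (len(ears) == n_targets == len(mars)
            and all(e > ear_min for e in ears)
            and all(m < mar_max for m in mars))
```

In practice the per-face landmarks would come from a facial landmark detector run on each synthesized frame; qualifying frames are then ranked by composite score per claim 15.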

Description

Video generation method, device, electronic equipment and storage medium

Technical Field

The disclosure relates to the technical field of artificial intelligence, in particular to technologies such as computer vision, video content understanding, automatic video editing, and intelligent media synthesis, and can be applied to scenarios such as secondary creation of film and television content and automatic generation of vertical short videos. It particularly relates to a video generation method, a device, an electronic device, and a storage medium.

Background

Currently, secondary creation of film and television content, particularly the production of vertical split-screen videos, is highly dependent on manual frame-by-frame shot review ("pulling film") and hand editing. The creator must repeatedly watch a large amount of footage to screen out clear close-up shots of specific characters, frame by frame, and manually complete the time alignment, composition cropping, and split-screen synthesis of two or more persons. This process suffers from a high creation threshold, long production cycles, low production efficiency, and unstable quality.

Disclosure of Invention

The disclosure provides a video generation method, a video generation device, an electronic device, and a storage medium.
According to a first aspect of the present disclosure, a video generation method is provided, comprising: extracting a plurality of candidate video segments containing each target object from an original video based on a source video identifier and reference information of at least two target objects, the source video identifier and the reference information being acquired according to a user input instruction; performing quality screening respectively on the candidate video segments of each target object, and combining the screened segments into a plurality of segment sets according to a preset matching rule, each segment set containing one video segment of each target object; for each segment set, performing vertical cropping on each video segment based on face position information of each video segment in the set, and synthesizing the cropped video segments according to a preset multi-split-screen layout to obtain a synthesized frame sequence; and combining the synthesized frame sequence with target audio data to generate a vertical split-screen video.
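The stages of the first-aspect method can be sketched as a short orchestration function. Everything here is a hypothetical stand-in: the function and parameter names are ours, and each callable represents one stage the disclosure describes rather than any concrete implementation.

```python
def generate_split_screen_video(source_id, references, audio,
                                fetch, extract, screen, pair,
                                crop_and_compose, mux):
    """End-to-end sketch of the disclosed pipeline. Each callable is a
    hypothetical stand-in for one stage: fetching the source video,
    extracting per-target candidate segments, quality screening, pairing
    into segment sets, vertical cropping plus multi-split-screen
    composition, and muxing with the target audio."""
    original = fetch(source_id)                             # obtain source video
    candidates = {ref: extract(original, ref) for ref in references}
    screened = {ref: screen(clips) for ref, clips in candidates.items()}
    segment_sets = pair(screened)                           # one segment per target per set
    frames = [crop_and_compose(seg_set) for seg_set in segment_sets]
    return mux(frames, audio)                               # combine with target audio
```

Wiring the stages through injected callables mirrors the module split of the second aspect (extraction, processing/pairing, rendering/synthesis, generation), so each stage can be replaced or tested independently.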
According to a second aspect of the present disclosure, a video generation device is provided, comprising: a segment extraction module configured to extract a plurality of candidate video segments containing each target object from an original video based on a source video identifier and reference information of at least two target objects, wherein the source video identifier and the reference information are acquired according to a user input instruction; a segment processing and pairing module configured to perform quality screening respectively on the candidate video segments of each target object and combine the screened segments into a plurality of segment sets according to a preset matching rule, each segment set containing one video segment of each target object; a rendering and synthesis module configured to, for each segment set, perform vertical cropping on each video segment based on face position information of each video segment in the set, and synthesize the cropped video segments according to a preset multi-split-screen layout to obtain a synthesized frame sequence; and a video generation module configured to combine the synthesized frame sequence with target audio data to generate a vertical split-screen video. According to a third aspect of the present disclosure, there is provided an electronic device comprising at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the embodiments of the present disclosure. According to a fourth aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method according to any one of the embodiments of the present disclosure.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method according to any of the embodiments of the present disclosure. According to the disclosed scheme, automatic identification of specific characters in an original video and extraction of their segments are achieved by acquiring feature information of the target objects; the candidate segments are intelligently quality-screened and paired according to multi-dimensional rules to form segment sets with coordinated content; the multi-split screen pictu