CN-122027862-A - Video generation method, related device, equipment and storage medium
Abstract
The application discloses a video generation method, a related apparatus, a device, and a storage medium. The method comprises: obtaining a target audio material; obtaining a target text material corresponding to the target audio material; performing feature extraction on the target audio material to obtain audio description information related to the target audio material; obtaining K candidate videos through a video generation model based on the target text material and the audio description information; and editing the K candidate videos according to the target audio material to obtain a target synthesized video, wherein the target synthesized video comprises video segments derived from at least one of the candidate videos, and the background sound of the target synthesized video is generated based on the target audio material. The method can improve video editing efficiency and thereby enhance the user experience.
Inventors
- CUI QI
- LIU HONGDA
- WU JINFA
- ZHANG XIAOYI
- ZHANG ZHIQIANG
- WANG SHAOMING
- GUO RUNZENG
- HOU JINKUN
Assignees
- 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Dates
- Publication Date: 2026-05-12
- Application Date: 2024-11-11
Claims (16)
- 1. A method of video generation, comprising: acquiring a target audio material, wherein the target audio material comprises at least one of music and voice; acquiring a target text material corresponding to the target audio material; performing feature extraction on the target audio material to obtain audio description information related to the target audio material, wherein the audio description information comprises at least one of tone information, beat information, rhythm information, intensity information, emotion information, spectrum information, and pitch information; obtaining K candidate videos through a video generation model based on the target text material and the audio description information, wherein each candidate video comprises at least one video segment, and K is an integer greater than 1; and editing the K candidate videos according to the target audio material to obtain a target synthesized video, wherein the target synthesized video comprises video segments derived from at least one candidate video, the at least one candidate video belongs to the K candidate videos, and the background sound of the target synthesized video is generated based on the target audio material.
- 2. The method of claim 1, wherein the acquiring the target audio material comprises: in response to an upload operation for an audio file, acquiring the target audio material from the audio file; or, in response to an input operation for an audio link, acquiring the target audio material from the audio link.
- 3. The method according to claim 1 or 2, wherein the acquiring the target text material corresponding to the target audio material comprises: in response to an input operation for text content, taking the text content as the target text material corresponding to the target audio material; or, in response to a speech recognition operation on the target audio material, acquiring the target text material corresponding to the target audio material.
- 4. A method according to any one of claims 1 to 3, wherein the feature extraction of the target audio material to obtain audio description information related to the target audio material comprises: preprocessing the target audio material to obtain T audio frames, wherein T is an integer greater than 1; performing a Fourier transform on each of the T audio frames to obtain a spectrum of each audio frame; taking the squared magnitude of the spectrum of each audio frame to obtain a power spectrum; filtering the power spectrum through a Mel filter bank to obtain a filtering result corresponding to each Mel filter, wherein the Mel filter bank comprises at least one Mel filter; and performing logarithmic compression and a discrete cosine transform on the filtering result corresponding to each Mel filter to obtain tone information corresponding to the target audio material, wherein the tone information is expressed as Mel-frequency cepstral coefficients.
- 5. A method according to any one of claims 1 to 3, wherein the feature extraction of the target audio material to obtain audio description information related to the target audio material comprises: converting the target audio material into a time-domain signal; performing a Fourier transform on the time-domain signal to obtain a frequency-domain signal; generating a spectrogram from the frequency-domain signal; analyzing the spectrogram through a peak detection algorithm to obtain peak positions in the spectrogram; and determining beat information corresponding to the target audio material according to the peak positions in the spectrogram.
- 6. A method according to any one of claims 1 to 3, wherein the feature extraction of the target audio material to obtain audio description information related to the target audio material comprises: converting the target audio material into a time-domain signal; performing a Fourier transform on the time-domain signal to obtain a frequency-domain signal; generating a spectrogram from the frequency-domain signal; and acquiring intensity information corresponding to the target audio material from the spectrogram, wherein the intensity information comprises the intensities of different frequency components in the frequency-domain signal.
- 7. A method according to any one of claims 1 to 3, wherein the feature extraction of the target audio material to obtain audio description information related to the target audio material comprises: acquiring an emotion probability distribution through an emotion recognition model based on the target audio material, wherein the emotion probability distribution comprises at least one probability value; and determining emotion information corresponding to the target audio material according to the emotion probability distribution.
- 8. The method of any one of claims 1 to 7, wherein the obtaining K candidate videos through a video generation model based on the target text material and the audio description information comprises: processing the target text material through a text encoding model to obtain a first text feature vector; performing characterization processing on the audio description information to obtain an audio feature vector; generating a target feature vector according to the first text feature vector and the audio feature vector; and acquiring the K candidate videos through the video generation model based on the target feature vector.
- 9. The method of any one of claims 1 to 7, wherein the obtaining K candidate videos through a video generation model based on the target text material and the audio description information comprises: generating an audio description text from the audio description information; splicing the target text material and the audio description text to obtain a comprehensive description text; processing the comprehensive description text through a text encoding model to obtain a second text feature vector; and acquiring the K candidate videos through the video generation model based on the second text feature vector.
- 10. The method according to any one of claims 1 to 9, wherein the editing the K candidate videos according to the target audio material to obtain a target synthesized video comprises: selecting N video clips from the K candidate videos according to the target audio material, wherein N is an integer greater than or equal to 1; and splicing the N video clips to obtain the target synthesized video.
- 11. The method according to any one of claims 1 to 9, wherein the editing the K candidate videos according to the target audio material to obtain a target synthesized video comprises: selecting N video clips from the K candidate videos according to the target audio material, wherein N is an integer greater than or equal to 1; splicing the N video clips to obtain a video to be synthesized; and processing the video to be synthesized according to visual effect parameters to obtain the target synthesized video, wherein the visual effect parameters comprise at least one of video duration, video image quality, video aspect ratio, video filter, video style, transition effect, and dubbing style.
- 12. The method according to claim 10 or 11, wherein the selecting N video clips from the K candidate videos according to the target audio material comprises: performing key frame identification on each of the K candidate videos to obtain a video clip set, wherein the video clip set comprises each video clip included in each of the K candidate videos, and the video clips are determined based on key frames; and screening, from the video clip set according to the audio description information, the N video clips that match the time dimension of the target audio material.
- 13. A video generating apparatus, comprising an acquisition module, an extraction module, and an editing module, wherein: the acquisition module is configured to acquire a target audio material, the target audio material comprising at least one of music and voice; the acquisition module is further configured to acquire a target text material corresponding to the target audio material; the extraction module is configured to perform feature extraction on the target audio material to obtain audio description information related to the target audio material, wherein the audio description information comprises at least one of tone information, beat information, rhythm information, intensity information, emotion information, spectrum information, and pitch information; the acquisition module is further configured to acquire K candidate videos through a video generation model based on the target text material and the audio description information, wherein each candidate video comprises at least one video segment, and K is an integer greater than 1; and the editing module is configured to edit the K candidate videos according to the target audio material to obtain a target synthesized video, wherein the target synthesized video comprises video clips from at least one candidate video, the at least one candidate video belongs to the K candidate videos, and the background sound of the target synthesized video is generated based on the target audio material.
- 14. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 12 when executing the computer program.
- 15. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 12.
- 16. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any one of claims 1 to 12.
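The MFCC pipeline recited in claim 4 (framing, Fourier transform, power spectrum, Mel filter bank, logarithmic compression, discrete cosine transform) can be sketched in plain NumPy. This is an illustrative reconstruction, not the patent's implementation: the sample rate, frame length, hop size, filter count, and coefficient count are assumed defaults, and the triangular filter construction follows the common textbook recipe.

```python
import numpy as np

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_mels=26, n_ceps=13):
    """Claim-4 pipeline sketch: frames -> FFT -> power spectrum
    -> Mel filter bank -> log -> DCT -> MFCCs (one row per audio frame)."""
    signal = np.asarray(signal, dtype=float)
    # 1. Preprocess: slice the audio into T overlapping, windowed frames.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)              # reduce spectral leakage
    # 2. Fourier transform per frame; 3. squared magnitude -> power spectrum.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 4. Triangular Mel filter bank applied to the power spectrum.
    mel_pts = np.linspace(0.0, 2595.0 * np.log10(1.0 + (sr / 2) / 700.0), n_mels + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[m - 1, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    filtered = power @ fbank.T
    # 5. Log compression, then a type-II DCT along the filter axis.
    log_e = np.log(filtered + 1e-10)
    n = np.arange(n_mels)
    dct_basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return log_e @ dct_basis.T
```

In practice, audio libraries such as librosa provide tuned implementations of this transform; the sketch only makes each step named in the claim visible.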
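Claim 5's beat extraction (time-domain signal, Fourier transform, spectrogram, peak detection, beat positions) can be approximated with a per-frame spectral-energy envelope and a simple local-maximum peak picker. The adaptive threshold and the minimum spacing between beats are illustrative assumptions; the patent does not fix a particular peak detection algorithm.

```python
import numpy as np

def beat_positions(signal, sr=16000, frame_len=1024, hop=512, min_gap=0.25):
    """Claim-5 sketch: spectrogram -> energy envelope -> peaks -> beat times (s)."""
    signal = np.asarray(signal, dtype=float)
    n = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n)])
    # Spectrogram via short-time Fourier transform of windowed frames.
    spec = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))
    envelope = spec.sum(axis=1)                      # per-frame spectral energy
    thresh = envelope.mean() + envelope.std()        # simple adaptive threshold
    peaks = [i for i in range(1, n - 1)
             if envelope[i] > thresh
             and envelope[i] >= envelope[i - 1] and envelope[i] > envelope[i + 1]]
    # Keep at most one beat per min_gap window.
    beats, last = [], -np.inf
    for i in peaks:
        t = i * hop / sr
        if t - last >= min_gap:
            beats.append(t)
            last = t
    return beats
```

A real system would typically use an onset-strength function and tempo tracking rather than raw energy peaks, but the data flow matches the claimed steps.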
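The fusion step of claim 8 (characterize the audio description information as a vector, then combine it with the first text feature vector into a target feature vector) can be sketched as a characterization-plus-concatenation. The field names (`tempo_bpm`, `intensity`, `valence`, `arousal`) and the concatenation operator are illustrative assumptions; the patent leaves the exact characterization and fusion open.

```python
import numpy as np

def fuse_features(text_vec, audio_desc):
    """Claim-8 sketch: audio description dict -> audio feature vector,
    then concatenate with the text feature vector."""
    # Characterize the audio description info as a fixed-order numeric vector.
    keys = ("tempo_bpm", "intensity", "valence", "arousal")  # assumed fields
    audio_vec = np.array([float(audio_desc.get(k, 0.0)) for k in keys])
    # Normalize the audio part so scale differences don't dominate,
    # then concatenate to form the target feature vector that would
    # condition the video generation model.
    audio_vec = audio_vec / (np.linalg.norm(audio_vec) + 1e-8)
    return np.concatenate([np.asarray(text_vec, dtype=float), audio_vec])
```

Claim 9's alternative path would instead render the audio description as text, splice it with the target text material, and encode the combined string with a single text encoding model.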
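The editing step of claims 10 and 12 (select N clips from the candidate pool so they fit the audio's time dimension, then splice them) can be sketched with a greedy longest-first selection. The greedy policy and the clip-dictionary fields (`id`, `duration`) are illustrative assumptions, not the patent's selection rule.

```python
def select_clips(candidate_clips, audio_duration):
    """Claim-10/12 sketch: choose clips whose total length fits the audio.
    Returns the chosen clips (in selection order) and the covered duration."""
    chosen, remaining = [], audio_duration
    # Consider longer clips first so fewer cuts are needed.
    for clip in sorted(candidate_clips, key=lambda c: c["duration"], reverse=True):
        if clip["duration"] <= remaining:        # clip still fits the audio
            chosen.append(clip)
            remaining -= clip["duration"]
        if remaining == 0:
            break
    # The chosen clips would then be spliced in order, with the target audio
    # material mixed in as the background track (claim 1's final step).
    return chosen, audio_duration - remaining
```

Claim 12 additionally derives the candidate clips themselves by key-frame identification before this screening step.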
Description
Video generation method, related device, equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular to a video generation method, a related apparatus, a device, and a storage medium.

Background

With the continuing progress of technology and the popularity of the internet, more and more people transmit information and share moments of their lives through video. When music is combined with video, the music not only provides an emotional and rhythmic foundation for the video, but its emotion and mood can also be made concrete and visual through specific images and actions.

Currently, video processing software is widely used in various scenarios as common software on terminals. When a user makes a video, the user first needs to manually shoot or select video materials, then clip the video materials according to the music rhythm, and finally combine the music and the clipped video materials into a video through the video processing software.

The inventors have found that the current scheme has at least the following problems: in the related art, when a user clips video materials with video processing software, a great deal of effort and time is required, along with a certain level of expertise, so the authoring threshold is high and the authoring experience is poor. An effective method is therefore needed to solve such problems.

Disclosure of Invention

The embodiments of the present application provide a video generation method, a related apparatus, a device, and a storage medium, which can improve video editing efficiency and thereby enhance the user experience.
In view of this, the present application provides, in one aspect, a method of video generation, comprising: acquiring a target audio material, wherein the target audio material comprises at least one of music and voice; acquiring a target text material corresponding to the target audio material; performing feature extraction on the target audio material to obtain audio description information related to the target audio material, wherein the audio description information comprises at least one of tone information, beat information, rhythm information, intensity information, emotion information, spectrum information, and pitch information; obtaining K candidate videos through a video generation model based on the target text material and the audio description information, wherein each candidate video comprises at least one video segment, and K is an integer greater than 1; and editing the K candidate videos according to the target audio material to obtain a target synthesized video, wherein the target synthesized video comprises video segments derived from at least one candidate video, the at least one candidate video belongs to the K candidate videos, and the background sound of the target synthesized video is generated based on the target audio material.
Another aspect of the present application provides a video generating apparatus, comprising an acquisition module, an extraction module, and a clipping module, wherein: the acquisition module is configured to acquire a target audio material, the target audio material comprising at least one of music and voice; the acquisition module is further configured to acquire a target text material corresponding to the target audio material; the extraction module is configured to perform feature extraction on the target audio material to obtain audio description information related to the target audio material, wherein the audio description information comprises at least one of tone information, beat information, rhythm information, intensity information, emotion information, spectrum information, and pitch information; the acquisition module is further configured to acquire K candidate videos through the video generation model based on the target text material and the audio description information, wherein each candidate video comprises at least one video segment, and K is an integer greater than 1; and the clipping module is configured to clip the K candidate videos according to the target audio material to obtain a target synthesized video, wherein the target synthesized video comprises video segments derived from at least one candidate video, the at least one candidate video belongs to the K candidate videos, and the background sound of the target synthesized video is generated based on the target audio material.

In one possible design, in another implementation of another aspect of the embodiments of the present application, the acquisition module is specifically configured to acquire the target audio material from an audio file in response to an upload operation for the audio file; or, in response to an input operation for an audio link, to acquire the target audio material from the audio link.
In one possible design, in another implemen