CN-116524898-B - Audio-video generation method and device, electronic device and storage medium
Abstract
The invention provides an audio-video generation method, an audio-video generation device, an electronic device, and a storage medium, belonging to the field of computer technology. The method comprises: preprocessing a text to be inferred to obtain a text identification sequence corresponding to the text to be inferred; and inputting the text identification sequence into an audio-video generation model to generate the audio-video corresponding to the text to be inferred, wherein the audio-video generation model comprises an autoregressive audio-video sequence generation model, an audio-video vector quantization autoencoder, a video frame interpolation model, an audio conversion model, and an audio-video processing tool. Because the audio-video is produced by a single generation model, the video signal and the audio signal can be generated simultaneously, so that an audio-video matching the semantics of the text to be inferred, with good generalization, can be synthesized from the two signals. The method attends effectively to audio-modality information, provides useful data for artificial intelligence research, and effectively meets users' practical needs.
Inventors
- LIU JING
- WANG WEINING
- LIU JIAWEI
Assignees
- 中国科学院自动化研究所 (Institute of Automation, Chinese Academy of Sciences)
Dates
- Publication Date
- 20260505
- Application Date
- 20230323
Claims (13)
- 1. An audio-video generation method, comprising: preprocessing a text to be inferred to obtain a text identification sequence corresponding to the text to be inferred; and inputting the text identification sequence into an audio-video generation model to generate the audio-video corresponding to the text to be inferred; wherein the audio-video generation model comprises an autoregressive audio-video sequence generation model, an audio-video vector quantization autoencoder, a video frame interpolation model, an audio conversion model, and an audio-video processing tool; the autoregressive audio-video sequence generation model is an autoregressive tri-modal joint Transformer neural network decoder model, and is used for recognizing the text identification sequence, corresponding to the text to be inferred, that is input by a user, and generating a visual identification sequence and an audio identification sequence through text semantic understanding and cross-modal association, wherein the cross-modal association is the correlation among multi-modal information, and the multi-modal information comprises text information, video image information, and audio information; inputting the text identification sequence into the audio-video generation model to generate the audio-video corresponding to the text to be inferred comprises the following steps: step 21, inputting the text identification sequence into the autoregressive audio-video sequence generation model to generate a video image frame discrete identification sequence and an audio spectrum discrete identification sequence corresponding to the text identification sequence; step 22, inputting the video image frame discrete identification sequence and the audio spectrum discrete identification sequence into the decoder of the audio-video vector quantization autoencoder to generate video image frames and an audio Mel spectrum; step 23, inputting the generated video image frames into the video frame interpolation model to synthesize a silent video; step 24, inputting the audio Mel spectrum into the audio conversion model to synthesize an audio signal, wherein the duration of the silent video matches the duration of the audio signal; and step 25, inputting the silent video and the audio signal into the audio-video processing tool to generate the audio-video.
- 2. The audio-video generation method according to claim 1, wherein the audio-video vector quantization autoencoder is an SVG-VQGAN model; and/or the audio conversion model is a HiFiGAN decoder; and/or the audio-video processing tool is the ffmpeg multimedia processing tool; and/or the video frame interpolation model is constructed based on a frame interpolation neural network model.
- 3. The audio-video generation method according to claim 1, wherein the audio conversion model is trained by: step 101, preprocessing the audio-video sample corresponding to each text sample to obtain an audio signal sample and a video image frame sample corresponding to each audio-video sample, and obtaining an audio Mel spectrum sample corresponding to each audio signal sample; step 102, taking any one of the audio Mel spectrum samples as the input of the audio conversion model to be trained, taking the audio signal sample corresponding to that audio Mel spectrum sample as the output label of the audio conversion model to be trained, and pre-training the audio conversion model to be trained; and iteratively executing step 102 until the pre-training of the audio conversion model to be trained is completed, thereby obtaining the trained audio conversion model.
- 4. The audio-video generation method according to claim 3, wherein said step 101 specifically comprises: sparsely sampling the audio-video sample at a preset sampling frame rate, and randomly selecting a plurality of consecutive video frames to form a video clip as the video image frame sample; sampling the audio-video sample at a preset audio sampling rate to obtain the audio signal sample; obtaining the Mel spectrum corresponding to the audio signal sample; and normalizing the Mel spectrum, intercepting the Mel spectrum according to the timestamp information of the randomly selected consecutive video frames to obtain a Mel spectrum clip temporally aligned with those video frames, and constructing the audio Mel spectrum sample therefrom.
- 5. The audio-video generation method according to claim 3, wherein the encoder of the audio-video vector quantization autoencoder comprises a visual encoder and an audio encoder, and the decoder of the audio-video vector quantization autoencoder comprises a visual decoder and an audio decoder; the audio-video vector quantization autoencoder is trained by: step 201, obtaining the audio Mel spectrum sample and the video image frame sample corresponding to any text sample; step 202, inputting the audio Mel spectrum sample into the audio encoder to obtain an audio quantization code, and inputting the audio quantization code into the audio decoder to obtain an audio Mel spectrum reconstruction sample; step 203, inputting the video image frame sample into the visual encoder to obtain a visual quantization code, and inputting the visual quantization code into the visual decoder to obtain a video image frame reconstruction sample; step 204, pre-training the audio-video vector quantization autoencoder using the losses between the audio Mel spectrum reconstruction sample and the audio Mel spectrum sample, and between the video image frame reconstruction sample and the video image frame sample; and iteratively executing steps 201 to 204 until the pre-training of the audio-video vector quantization autoencoder is completed, thereby obtaining the trained audio-video vector quantization autoencoder; wherein the losses include a reconstruction loss, a quantization coding loss, a perceptual loss, and an adversarial loss.
- 6. The audio-video generation method according to claim 5, further comprising, before obtaining the audio quantization code and the visual quantization code: obtaining the visual features extracted by the visual encoder and the audio features extracted by the audio encoder; associating the visual features and the audio features through a cross-modal attention module to obtain video image frame global features and audio spectrum frame global features; and training the visual encoder and the audio encoder using a hybrid contrastive learning loss between the video image frame global features and the visual features, and between the audio spectrum frame global features and the audio features.
- 7. The audio-video generation method according to any one of claims 3 to 6, wherein the autoregressive audio-video sequence generation model is trained by: step 301, obtaining the audio Mel spectrum sample and the video image frame sample corresponding to any text sample, and obtaining the text identification sequence sample corresponding to that text sample; step 302, inputting the audio Mel spectrum sample and the video image frame sample into the audio-video vector quantization autoencoder to obtain a video image frame discrete identification sequence sample and an audio spectrum discrete identification sequence sample; step 303, constructing a tri-modal joint training sample from the text identification sequence sample, the video image frame discrete identification sequence sample, and the audio spectrum discrete identification sequence sample; step 304, performing autoregressive training on the autoregressive audio-video sequence generation model using the tri-modal joint training sample; and iteratively executing steps 301 to 304 until the pre-training of the autoregressive audio-video sequence generation model is completed, thereby obtaining the trained autoregressive audio-video sequence generation model.
- 8. The audio-video generation method according to claim 7, wherein said step 303 comprises: splicing the video image frame discrete identification sequence sample and the audio spectrum discrete identification sequence sample frame by frame in time order to obtain a spliced bimodal identification sequence; splicing the text identification sequence sample with the bimodal identification sequence to obtain a spliced tri-modal identification sequence; and obtaining the tri-modal joint training sample based on the spliced tri-modal identification sequence and a preset sequence length.
- 9. The audio-video generation method according to claim 1, wherein preprocessing the text to be inferred to obtain the text identification sequence corresponding to the text to be inferred specifically comprises: encoding the text to be inferred based on a byte pair encoding method to obtain the text identification sequence corresponding to the text to be inferred.
- 10. An audio-video generation device, comprising: a text processing module for preprocessing a text to be inferred to obtain a text identification sequence corresponding to the text to be inferred; and a video generation module for inputting the text identification sequence into an audio-video generation model to generate the audio-video corresponding to the text to be inferred; wherein the audio-video generation model comprises an autoregressive audio-video sequence generation model, an audio-video vector quantization autoencoder, a video frame interpolation model, an audio conversion model, and an audio-video processing tool; the autoregressive audio-video sequence generation model is an autoregressive tri-modal joint Transformer neural network decoder model, and is used for recognizing the text identification sequence, corresponding to the text to be inferred, that is input by a user, and generating a visual identification sequence and an audio identification sequence through text semantic understanding and cross-modal association, wherein the cross-modal association is the correlation among multi-modal information, and the multi-modal information comprises text information, video image information, and audio information; inputting the text identification sequence into the audio-video generation model to generate the audio-video corresponding to the text to be inferred comprises the following steps: step 21, inputting the text identification sequence into the autoregressive audio-video sequence generation model to generate a video image frame discrete identification sequence and an audio spectrum discrete identification sequence corresponding to the text identification sequence; step 22, inputting the video image frame discrete identification sequence and the audio spectrum discrete identification sequence into the decoder of the audio-video vector quantization autoencoder to generate video image frames and an audio Mel spectrum; step 23, inputting the generated video image frames into the video frame interpolation model to synthesize a silent video; step 24, inputting the audio Mel spectrum into the audio conversion model to synthesize an audio signal, wherein the duration of the silent video matches the duration of the audio signal; and step 25, inputting the silent video and the audio signal into the audio-video processing tool to generate the audio-video.
- 11. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the audio-video generation method according to any one of claims 1 to 9.
- 12. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the audio-video generation method according to any one of claims 1 to 9.
- 13. A computer program product comprising a computer program which, when executed by a processor, implements the audio-video generation method according to any one of claims 1 to 9.
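The frame-by-frame splicing of claim 8 can be sketched as follows. This is a minimal illustration assuming each modality contributes one group of discrete token IDs per timestep; the token values, group sizes, and truncation behavior are assumptions, since the patent does not fix them.

```python
def build_trimodal_sequence(text_ids, video_ids_per_frame, audio_ids_per_frame,
                            max_len=None):
    """Interleave video and audio token groups frame by frame in time order,
    then prepend the text token sequence (claim 8's splicing scheme).

    video_ids_per_frame / audio_ids_per_frame: lists of per-timestep token
    groups, assumed to be time-aligned (one group of discrete IDs per frame).
    """
    assert len(video_ids_per_frame) == len(audio_ids_per_frame)
    bimodal = []
    for v_ids, a_ids in zip(video_ids_per_frame, audio_ids_per_frame):
        bimodal.extend(v_ids)   # visual tokens for this timestep
        bimodal.extend(a_ids)   # audio tokens for the same timestep
    trimodal = list(text_ids) + bimodal
    if max_len is not None:     # cut to the preset sequence length
        trimodal = trimodal[:max_len]
    return trimodal
```

For example, text tokens `[1, 2]` with two video/audio frames yield `[1, 2, 10, 11, 20, 12, 13, 21]`: the text prefix, then alternating per-frame visual and audio groups.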
Description
Audio-video generation method and device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of computer technology, and in particular to an audio-video generation method and device, an electronic device, and a storage medium.

Background

Text-to-audio-video generation is a research topic that spans multiple fields and involves multi-modal information, and is of great significance to artificial intelligence research. However, existing video generation methods usually focus only on generating video images, that is, on generation from the text modality to the video modality alone. How to achieve text-to-audio-video generation has therefore become a problem to be solved in the industry.

Disclosure of Invention

The invention provides an audio-video generation method and device, an electronic device, and a storage medium, which address the need in the prior art for text-to-audio-video generation. The invention provides an audio-video generation method, which comprises: preprocessing a text to be inferred to obtain a text identification sequence corresponding to the text to be inferred, and inputting the text identification sequence into an audio-video generation model to generate the audio-video corresponding to the text to be inferred, wherein the audio-video generation model comprises an autoregressive audio-video sequence generation model, an audio-video vector quantization autoencoder, a video frame interpolation model, an audio conversion model, and an audio-video processing tool.
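The text preprocessing described above encodes the input with a byte pair encoding method (see claim 9). A minimal greedy BPE encoder can be sketched as follows; the merge list and example word are purely illustrative and not part of the patent.

```python
def bpe_encode(word, merges):
    """Greedily apply learned byte-pair merges to a single word.

    `merges` is an ordered list of symbol pairs, assumed to have been
    learned from a corpus in priority order (highest priority first).
    """
    symbols = list(word)
    for pair in merges:                      # apply merges in learned order
        merged = ''.join(pair)
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged)           # fuse the adjacent pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols
```

With merges `[("l", "o"), ("lo", "w")]`, the word `"lower"` is encoded as `["low", "e", "r"]`; in practice each resulting subword would then be mapped to an integer ID to form the text identification sequence.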
According to the audio-video generation method, inputting the text identification sequence into the audio-video generation model to generate the audio-video corresponding to the text to be inferred comprises: inputting the text identification sequence into the autoregressive audio-video sequence generation model to generate a video image frame discrete identification sequence and an audio spectrum discrete identification sequence corresponding to the text identification sequence; inputting the video image frame discrete identification sequence and the audio spectrum discrete identification sequence into the decoder of the audio-video vector quantization autoencoder to generate video image frames and an audio Mel spectrum; inputting the generated video image frames into the video frame interpolation model to synthesize a silent video; inputting the audio Mel spectrum into the audio conversion model to synthesize an audio signal, the duration of the silent video matching the duration of the audio signal; and inputting the silent video and the audio signal into the audio-video processing tool to generate the audio-video. According to the audio-video generation method, the autoregressive audio-video sequence generation model is an autoregressive tri-modal joint Transformer neural network decoder model, and/or the audio-video vector quantization autoencoder is an SVG-VQGAN model, and/or the audio conversion model is a HiFiGAN decoder, and/or the audio-video processing tool is the ffmpeg multimedia processing tool, and/or the video frame interpolation model is constructed based on a frame interpolation neural network model.
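The final step, merging the silent video with the audio signal via the ffmpeg multimedia processing tool, can be sketched as a command-line invocation from Python. The file names and codec choices below are plausible assumptions, not values specified by the patent; `-shortest` trims the output to the shorter input stream, which keeps the durations matched.

```python
import subprocess

def build_mux_command(silent_video, audio, output):
    """Assemble an ffmpeg command that muxes a silent video file and an
    audio file into a single audio-video file (the patent's final step)."""
    return [
        "ffmpeg", "-y",          # overwrite output without prompting
        "-i", silent_video,      # video-only input
        "-i", audio,             # audio input
        "-c:v", "copy",          # pass the video stream through unchanged
        "-c:a", "aac",           # encode the audio track to AAC
        "-shortest",             # stop at the shorter of the two streams
        output,
    ]

def mux_audio_video(silent_video, audio, output):
    """Run the mux command; requires the ffmpeg binary on PATH."""
    subprocess.run(build_mux_command(silent_video, audio, output), check=True)
```

Splitting command construction from execution makes the invocation easy to inspect or log before ffmpeg is actually run.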
According to the audio-video generation method, the audio conversion model is trained by: step 101, preprocessing the audio-video sample corresponding to each text sample to obtain an audio signal sample and a video image frame sample corresponding to each audio-video sample, and obtaining an audio Mel spectrum sample corresponding to each audio signal sample; step 102, taking any audio Mel spectrum sample as the input of the audio conversion model to be trained, taking the audio signal sample corresponding to that audio Mel spectrum sample as the output label of the audio conversion model to be trained, and pre-training the audio conversion model to be trained; and iteratively executing step 102 until the pre-training of the audio conversion model to be trained is completed, thereby obtaining the trained audio conversion model. According to the audio-video generation method, step 101 specifically comprises: sparsely sampling the audio-video sample at a preset sampling frame rate, and randomly selecting a plurality of consecutive video frames to form a video clip as the video image frame sample; sampling the audio-video sample at a preset audio sampling rate to obtain the audio signal sample; obtaining the Mel spectrum corresponding to the audio signal sample; and normalizing the Mel spectrum, intercepting the Mel spectrum according to the timestamp information of the randomly selected consecutive video frames to obtain a Mel spectrum clip temporally aligned with those video frames, and constructing the audio Mel spectrum sample. According to the audio and video generation method p
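The timestamp-based interception of the Mel spectrum described above can be sketched as a small index computation. The frame rate, sample rate, and hop length below are common illustrative defaults; the patent specifies none of these values.

```python
def aligned_mel_slice(mel_frames, start_frame, num_frames,
                      fps=8, sample_rate=22050, hop_length=256):
    """Cut the Mel-spectrum frame range covering the video clip
    [start_frame, start_frame + num_frames) in time.

    mel_frames: sequence of Mel-spectrum frames for the whole audio track.
    The video clip's start/end timestamps are converted to Mel-frame
    indices via the Mel frame rate (sample_rate / hop_length).
    """
    t0 = start_frame / fps                    # clip start time, seconds
    t1 = (start_frame + num_frames) / fps     # clip end time, seconds
    mel_per_sec = sample_rate / hop_length    # Mel frames per second
    m0 = int(round(t0 * mel_per_sec))
    m1 = int(round(t1 * mel_per_sec))
    return mel_frames[m0:m1]
```

With these defaults, one second of video (8 frames at 8 fps) maps to about 86 Mel frames, so consecutive video clips map to contiguous, temporally aligned Mel clips.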