
CN-122024757-A - Audio generation method and device, electronic equipment and storage medium

CN122024757A

Abstract

An embodiment of the invention provides an audio generation method and apparatus, an electronic device, and a storage medium, relating to the technical field of TTS dubbing. The method comprises: obtaining original audio to be dubbed and an original text representing the voice content of the original audio; converting the original text into a text to be utilized; performing audio synthesis on the text to be utilized to obtain current audio to be screened; performing feature fusion on acoustic features of the current audio to be screened and text features of the text to be utilized to obtain a current fusion feature to be utilized; and, when an audio availability condition is met, taking the current audio to be screened as the dubbed target audio. The audio availability condition comprises that the subjective quality score of the current audio to be screened is greater than a score threshold and that the fusion feature similarity between the current fusion feature to be utilized and the original fusion feature is greater than a similarity threshold. This improves the efficiency of determining the target audio and reduces the interference of subjective human factors in the process of determining the target audio.
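The abstract describes a synthesize-score-compare loop: a candidate dubbing is accepted only if its predicted subjective score and its fused-feature similarity to the original both clear thresholds. The sketch below illustrates that control flow in Python; all function names (`synthesize`, `score_mos`, `fuse`) and the threshold values are hypothetical stand-ins, not APIs or parameters taken from the patent.

```python
def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def select_target_audio(text_to_use, original_fusion, synthesize, score_mos,
                        fuse, mos_threshold=4.0, sim_threshold=0.8,
                        max_attempts=10):
    """Re-synthesize until the audio availability condition is met.

    synthesize(text) -> audio, score_mos(audio) -> float, and
    fuse(audio, text) -> feature vector are caller-supplied stand-ins
    for the TTS model, the subjective opinion evaluation model, and
    the feature-fusion step described in the abstract.
    """
    for _ in range(max_attempts):
        audio = synthesize(text_to_use)          # TTS synthesis
        fusion = fuse(audio, text_to_use)        # acoustic + text features
        if (score_mos(audio) > mos_threshold and
                cosine_similarity(fusion, original_fusion) > sim_threshold):
            return audio                         # accepted as target audio
    return None                                  # no candidate passed
```

Cosine similarity is used here purely for illustration; the claims leave the exact similarity measure open (claim 6 delegates the decision to a binary classification network).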

Inventors

  • LI NA
  • LI HAI
  • CHEN HAITAO
  • WEN BOLONG
  • YAN YING

Assignees

  • 成都爱奇艺智能创新科技有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-02-03

Claims (12)

  1. An audio generation method, the method comprising: acquiring original audio to be dubbed and an original text representing the voice content of the original audio; converting the original text into a text to be utilized, wherein the language types of the original text and the text to be utilized differ; performing audio synthesis on the text to be utilized to obtain current audio to be screened; performing feature fusion on acoustic features of the current audio to be screened and text features of the text to be utilized to obtain a current fusion feature to be utilized; and, when an audio availability condition is met, taking the current audio to be screened as the dubbed target audio, wherein the audio availability condition comprises that a subjective quality score of the current audio to be screened is greater than a score threshold and that a fusion feature similarity between the current fusion feature to be utilized and an original fusion feature is greater than a similarity threshold, the subjective quality score of an audio being obtained by processing the audio with a pre-trained subjective opinion evaluation model, and the original fusion feature being obtained by feature fusion of acoustic features of the original audio and text features of the original text.
  2. The method of claim 1, wherein the subjective quality score of an audio is its mean opinion score (MOS), the MOS describing the subjective quality of the audio along at least one preset dimension, the preset dimensions comprising at least one of: the naturalness of the audio's speech, the clarity of the audio's speech, the similarity between the emotion expressed by the voice content of the audio and the emotion expressed by its tone, and the similarity of the emotion expressed by the prosody of the utterances of the voice content of the audio.
  3. The method of claim 2, wherein the subjective opinion evaluation model is trained by: acquiring a first sample audio and a first sample label indicating the MOS of the first sample audio; inputting the first sample audio into a subjective opinion evaluation model of an initial structure to obtain a predicted MOS of the first sample audio output by the model; and adjusting model parameters of the subjective opinion evaluation model based on the difference between the predicted MOS and the first sample label until the model converges, obtaining a trained subjective opinion evaluation model.
  4. The method of any one of claims 1-3, wherein before the feature fusion is performed on the acoustic features of the current audio to be screened and the text features of the text to be utilized, the method further comprises: acquiring an original video to which the original audio is added; and obtaining video features of the original video, the video features representing at least one of the opening and closing times of an object in the original video and the identity information of the object; and wherein the feature fusion comprises: performing feature fusion on the acoustic features of the current audio to be screened, the text features of the text to be utilized, and the video features of the original video to obtain the current fusion feature to be utilized.
  5. The method of any one of claims 1-3, wherein the acoustic features of an audio comprise at least one of emotion features, fundamental frequency features, and duration features; and wherein before the feature fusion is performed on the acoustic features of the current audio to be screened and the text features of the text to be utilized, the method further comprises: inputting the current audio to be screened into a pre-trained acoustic feature extraction network to obtain at least one of the emotion features and the fundamental frequency features of the current audio to be screened.
  6. The method of claim 5, wherein the acoustic feature extraction network belongs to an audio selection model that further comprises a binary classification network; and wherein, before taking the current audio to be screened as the dubbed target audio when the audio availability condition is met, the method further comprises: concatenating the current fusion feature to be utilized with the original fusion feature and inputting the result into the binary classification network to obtain a classification result indicating whether the fusion feature similarity is greater than the similarity threshold.
  7. The method of claim 6, wherein the audio selection model is trained by: acquiring a second sample audio, an original sample audio, and a second sample label indicating whether a first sample fusion feature is similar to a second sample fusion feature, the first sample fusion feature being obtained by feature fusion of acoustic features of the second sample audio and text features of a text representing the voice content of the second sample audio; inputting the second sample audio and the original sample audio into an audio selection model of an initial structure to obtain a predicted classification result output by the model; and adjusting model parameters of the audio selection model based on the difference between the predicted classification result and the second sample label until the model converges, obtaining a trained audio selection model.
  8. The method of any one of claims 1-3, wherein the acoustic features of an audio comprise emotion features, fundamental frequency features, and duration features; wherein the feature fusion comprises: performing feature fusion on the emotion features of the current audio to be screened and the emotion features of the text to be utilized to obtain a current fused emotion feature to be utilized; and wherein, before taking the current audio to be screened as the dubbed target audio when the audio availability condition is met, the method further comprises: calculating the similarity between the current fused emotion feature to be utilized and an original fused emotion feature as a current emotion similarity, the original fused emotion feature being obtained by feature fusion of the emotion features of the original audio and the emotion features of the original text; calculating the similarity between the semantic features of the text to be utilized and the semantic features of the original text as a current semantic integrity; calculating the similarity between the timbre features of the current audio to be screened and the timbre features of the original audio as a current timbre similarity, the timbre features of an audio comprising its fundamental frequency features and duration features; and, when a preset similarity condition is met, determining that the similarity between the current fusion feature to be utilized and the original fusion feature is greater than the similarity threshold, and otherwise determining that it is not greater than the similarity threshold; wherein the preset similarity condition comprises at least one of: the current emotion similarity is greater than an emotion similarity threshold, the current semantic integrity is greater than a semantic similarity threshold, the current timbre similarity is greater than a timbre similarity threshold, and the weighted sum of the current emotion similarity, the current semantic integrity, and the current timbre similarity is greater than a total similarity threshold.
  9. The method of any one of claims 1-3, wherein performing audio synthesis on the text to be utilized to obtain the current audio to be screened comprises: performing audio synthesis on the text to be utilized with a text-to-speech (TTS) model to obtain the current audio to be screened; and wherein the method further comprises: when the audio availability condition is not met, returning to the step of performing audio synthesis on the text to be utilized with the TTS model to obtain a new current audio to be screened.
  10. An audio generation apparatus, the apparatus comprising: a first acquisition module for acquiring original audio to be dubbed and an original text representing the voice content of the original audio; a conversion module for converting the original text into a text to be utilized, wherein the language types of the original text and the text to be utilized differ; a synthesis module for performing audio synthesis on the text to be utilized to obtain current audio to be screened; a fusion module for performing feature fusion on the acoustic features of the current audio to be screened and the text features of the text to be utilized to obtain a current fusion feature to be utilized; and a determination module for taking the current audio to be screened as the dubbed target audio when an audio availability condition is met, wherein the audio availability condition comprises that the subjective quality score of the current audio to be screened is greater than a score threshold and that the fusion feature similarity between the current fusion feature to be utilized and an original fusion feature is greater than a similarity threshold, the subjective quality score of an audio being obtained by processing the audio with a pre-trained subjective opinion evaluation model, and the original fusion feature being obtained by feature fusion of acoustic features of the original audio and text features of the original text.
  11. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus; the memory is configured to store a computer program; and the processor is configured to implement the method of any one of claims 1-9 when executing the program stored in the memory.
  12. A computer-readable storage medium having stored therein a computer program which, when executed by a processor, implements the method of any one of claims 1-9.
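Claim 8 spells out the "preset similarity condition" as a disjunction: the fused-feature similarity counts as above threshold if any single aspect (emotion, semantic integrity, timbre) clears its own threshold, or if their weighted sum clears a total threshold. A minimal sketch of that decision rule, with illustrative weights and thresholds not specified in the patent:

```python
def meets_similarity_condition(emotion_sim, semantic_integrity, timbre_sim,
                               emo_th=0.8, sem_th=0.8, tim_th=0.8,
                               weights=(0.4, 0.3, 0.3), total_th=0.8):
    """Claim 8's preset similarity condition as an any-of check.

    Each input is a similarity score in [0, 1]; the thresholds and the
    weight split are placeholder values for illustration only.
    """
    weighted = (weights[0] * emotion_sim +
                weights[1] * semantic_integrity +
                weights[2] * timbre_sim)
    return (emotion_sim > emo_th or          # emotion aspect passes
            semantic_integrity > sem_th or   # semantic aspect passes
            timbre_sim > tim_th or           # timbre aspect passes
            weighted > total_th)             # or the weighted sum passes
```

For example, a candidate whose emotion similarity is 0.9 is accepted even if its other scores are low, whereas uniformly mediocre scores of 0.5 fail every branch.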

Description

Audio generation method and device, electronic equipment and storage medium

Technical Field

The invention relates to the technical field of TTS (Text-to-Speech) dubbing, and in particular to an audio generation method and apparatus, an electronic device, and a storage medium.

Background

In the field of TTS dubbing, a dubbed video corresponding to an original video can be generated based on a TTS model. For example, if the original video is a television drama containing Chinese dialogue, English audio corresponding to each line of dialogue can be generated with a TTS model, and the English audio can be added to the drama's video at the time point of each line in the original video, yielding the dubbed video. In a practical scenario, each text (such as one line of dialogue in a television drama) may be input to a TTS model, which outputs the audio corresponding to that text.

In the prior art, every audio generated by the TTS model must be played back, and a technician judges manually, after listening, whether its quality meets the requirements. If it does not, the audio corresponding to the text can be regenerated by the TTS model until audio of acceptable quality is obtained. However, this approach requires playing every audio the TTS model generates, making the judgment process time-consuming, and a manual quality judgment is subject to the individual's subjective impressions.

Disclosure of Invention

The embodiment of the invention aims to provide an audio generation method and apparatus, an electronic device, and a storage medium, so as to improve the efficiency of determining the target audio and reduce the interference of subjective human factors in the process of determining the target audio.
The specific technical solution is as follows. In a first aspect of the present invention, there is provided an audio generation method, the method comprising: acquiring original audio to be dubbed and an original text representing the voice content of the original audio; converting the original text into a text to be utilized, wherein the language types of the original text and the text to be utilized differ; performing audio synthesis on the text to be utilized to obtain current audio to be screened; performing feature fusion on acoustic features of the current audio to be screened and text features of the text to be utilized to obtain a current fusion feature to be utilized; and, when an audio availability condition is met, taking the current audio to be screened as the dubbed target audio, wherein the audio availability condition comprises that the subjective quality score of the current audio to be screened is greater than a score threshold and that the fusion feature similarity between the current fusion feature to be utilized and an original fusion feature is greater than a similarity threshold, the subjective quality score of an audio being obtained by processing the audio with a pre-trained subjective opinion evaluation model, and the original fusion feature being obtained by feature fusion of acoustic features of the original audio and text features of the original text.
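The subjective opinion evaluation model referenced above is trained, per claim 3, by predicting a MOS label from a sample audio and adjusting parameters on the prediction error until convergence. The sketch below illustrates that loop with a deliberately tiny stand-in: a one-weight linear model fitted by stochastic gradient descent on squared error, with scalar "features" in place of real audio. The model form, features, and hyperparameters are all assumptions for illustration, not the patent's architecture.

```python
def train_mos_model(samples, labels, lr=0.01, epochs=500):
    """Fit predicted_mos = w * feature + b by SGD on squared error.

    samples: scalar audio features (stand-ins for real acoustic input);
    labels: the MOS annotations (the "first sample labels" of claim 3).
    """
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            pred = w * x + b          # predicted MOS for this sample
            err = pred - y            # difference from the sample label
            w -= lr * err * x         # adjust model parameters on the error
            b -= lr * err
    return w, b
```

In practice the "model of an initial structure" would be a neural network over acoustic features, but the train-predict-compare-adjust cycle is the same.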
In a second aspect of the present invention, there is also provided an audio generation apparatus, the apparatus comprising: a first acquisition module for acquiring original audio to be dubbed and an original text representing the voice content of the original audio; a conversion module for converting the original text into a text to be utilized, wherein the language types of the original text and the text to be utilized differ; a synthesis module for performing audio synthesis on the text to be utilized to obtain current audio to be screened; a fusion module for performing feature fusion on the acoustic features of the current audio to be screened and the text features of the text to be utilized to obtain a current fusion feature to be utilized; and a determination module for taking the current audio to be screened as the dubbed target audio when an audio availability condition is met, wherein the audio availability condition comprises that the subjective quality score of the current audio to be screened is greater than a score threshold and that the fusion feature similarity between the current fusion feature to be utilized and an original fusion feature is greater than a similarity threshold, the subjective quality score of an audio being obtained by processing the audio with a pre-trained subjective opinion evaluation model, and the original fusion feature being obtained by feature fusion of acoustic features of the original audio and text features of the original text.