CN-122024691-A - Audio processing method, device, equipment and storage medium

CN122024691A

Abstract

The present disclosure provides an audio processing method, apparatus, device, and storage medium, relating to the technical field of digital humans. In some embodiments of the present disclosure, the user's original audio is acquired and preprocessed to obtain preprocessed audio; the preprocessed audio is input into a real-time voice conversion model to perform real-time voice conversion, obtaining target timbre audio; speech synthesis is performed on the target timbre audio to obtain synthesized audio; the synthesized audio is audio-track encoded to obtain encoded audio; and the encoded audio is delivered to a user terminal for playback by the user terminal.

Inventors

  • SU QICHANG
  • DU JIZHONG
  • YAN YUZHI

Assignees

  • 北京智谱华章科技股份有限公司 (Beijing Zhipu Huazhang Technology Co., Ltd.)

Dates

Publication Date
2026-05-12
Application Date
2026-02-06

Claims (10)

  1. An audio processing method, comprising: acquiring the user's original audio; preprocessing the user's original audio to obtain preprocessed audio; inputting the preprocessed audio into a real-time voice conversion model to perform voice conversion, obtaining target timbre audio; performing speech synthesis on the target timbre audio to obtain synthesized audio; and performing audio-track encoding on the synthesized audio to obtain encoded audio, and transmitting the encoded audio to a user terminal so that the user terminal can play the encoded audio.
  2. The method of claim 1, wherein after inputting the preprocessed audio into the real-time voice conversion model to obtain the target timbre audio, the method further comprises: extracting audio features from the target timbre audio; generating lip-animation key frames from the audio features; compositing the lip-animation key frames with the original video image frames to obtain new video frames; and performing video-track encoding on the new video frames to obtain encoded video.
  3. The method of claim 2, wherein after performing video-track encoding on the new video frames to obtain the encoded video, the method further comprises: aligning the encoded audio with the encoded video to obtain audio-video synchronization frames; and transmitting the audio-video synchronization frames to the user terminal so that the user terminal can play them.
  4. The method of claim 2, wherein generating lip-animation key frames from the audio features comprises: inputting the audio features, the target timbre audio, and a reference portrait into a lip-sync model to obtain the lip-animation key frames.
  5. The method of claim 4, wherein the lip-sync model is any one of Wav2Lip, MuseTalk, and LatentSync.
  6. The method of claim 1, wherein preprocessing the user's original audio to obtain preprocessed audio comprises: performing voice activity detection on the user's original audio and removing silent segments to obtain first audio; denoising the first audio to obtain second audio; and resampling the second audio to obtain the preprocessed audio.
  7. The method of claim 1, wherein performing speech synthesis on the target timbre audio to obtain synthesized audio comprises: performing splicing, transition smoothing, energy matching, and time alignment on the target timbre audio to obtain the synthesized audio.
  8. An audio processing apparatus, comprising: an acquisition module for acquiring the user's original audio; a preprocessing module for preprocessing the user's original audio to obtain preprocessed audio; a conversion module for inputting the preprocessed audio into a real-time voice conversion model to perform voice conversion, obtaining target timbre audio; a synthesis module for performing speech synthesis on the target timbre audio to obtain synthesized audio; and an encoding module for performing audio-track encoding on the synthesized audio to obtain encoded audio and transmitting the encoded audio to a user terminal so that the user terminal can play the encoded audio.
  9. An electronic device, comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement the steps of the method of any one of claims 1-7.
  10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of any one of claims 1-7.
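The pipeline of claim 1 can be illustrated with a minimal sketch. Every function below is a hypothetical placeholder standing in for a real component (the patent's voice activity detection and denoising, its real-time voice conversion model, its speech synthesis, and an actual audio codec); only the chaining of stages follows the claim.

```python
# Illustrative sketch of the claim-1 pipeline; all stage bodies are toy
# stand-ins, not the claimed implementations.
import struct

def preprocess(samples):
    # Stand-in for VAD/denoise/resample: clamp samples to [-1.0, 1.0].
    return [max(-1.0, min(1.0, s)) for s in samples]

def convert_timbre(samples):
    # Stand-in for the real-time voice conversion model: scale amplitude.
    return [0.8 * s for s in samples]

def synthesize(chunks):
    # Stand-in for splicing/smoothing: plain concatenation of chunks.
    out = []
    for c in chunks:
        out.extend(c)
    return out

def encode_track(samples):
    # Stand-in for audio-track encoding: pack 16-bit little-endian PCM.
    return b"".join(struct.pack("<h", int(s * 32767)) for s in samples)

def process(user_audio):
    pre = preprocess(user_audio)          # preprocessing step
    target = convert_timbre(pre)          # voice conversion step
    synth = synthesize([target])          # speech synthesis step
    return encode_track(synth)            # audio-track encoding step
```

In a real system the encoded bytes would then be transmitted to the user terminal for playback; transport is outside this sketch.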

Description

Audio processing method, device, equipment and storage medium

Technical Field

The present disclosure relates to the field of digital human technologies, and in particular to an audio processing method, apparatus, device, and storage medium.

Background

With the rapid development of artificial intelligence, computer graphics, and real-time communication technologies, interactive digital humans are widely used in virtual customer service, online education, teleconferencing, live entertainment streaming, and other scenarios. In particular, driven by the metaverse and the trend toward immersive interaction, users place ever higher demands on the "personification" and "real-time performance" of digital humans. Current mainstream real-time digital human systems mostly follow the pipeline of "text input, large-model answer generation, text-to-speech (TTS), digital human driving": after a user enters text, an AI model generates a reply text, which is converted to speech by TTS and then drives the digital human's lips and expressions. Because live digital human streaming is currently driven by text, its real-time performance is poor and the live-streaming effect suffers.

Disclosure of Invention

The present disclosure provides an audio processing method, device, equipment, and storage medium, intended at least to address the problems of poor real-time performance and poor live-streaming effect of digital humans.
The technical solution of the present disclosure is as follows. An embodiment of the disclosure provides an audio processing method comprising the following steps: acquiring the user's original audio; preprocessing the user's original audio to obtain preprocessed audio; inputting the preprocessed audio into a real-time voice conversion model to perform voice conversion, obtaining target timbre audio; performing speech synthesis on the target timbre audio to obtain synthesized audio; and performing audio-track encoding on the synthesized audio to obtain encoded audio, and transmitting the encoded audio to a user terminal so that the user terminal can play the encoded audio.

Optionally, after the preprocessed audio is input into the real-time voice conversion model to obtain the target timbre audio, the method further includes: extracting audio features from the target timbre audio; generating lip-animation key frames from the audio features; compositing the lip-animation key frames with the original video image frames to obtain new video frames; and performing video-track encoding on the new video frames to obtain encoded video.

Optionally, after the video-track encoding is performed on the new video frames to obtain the encoded video, the method further includes: synchronizing the encoded audio with the encoded video to obtain audio-video synchronization frames; and transmitting the audio-video synchronization frames to the user terminal so that the user terminal can play them.

Optionally, generating lip-animation key frames from the audio features includes: inputting the audio features, the target timbre audio, and a reference portrait into a lip-sync model to obtain the lip-animation key frames.
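The audio-video synchronization step described above can be sketched as timestamp-based pairing. The frame representation (presentation timestamp plus payload) and the tolerance value are assumptions made for illustration; real codecs carry timestamps in their container format.

```python
# Minimal timestamp-alignment sketch for the A/V synchronization step.
# Frames are (pts_seconds, payload) tuples; representation is hypothetical.

def align_av(audio_frames, video_frames, tolerance=0.02):
    """Pair each video frame with the audio frame whose presentation
    timestamp is closest, keeping only pairs within the tolerance."""
    pairs = []
    for v_pts, v_data in video_frames:
        best = min(audio_frames, key=lambda a: abs(a[0] - v_pts), default=None)
        if best is not None and abs(best[0] - v_pts) <= tolerance:
            pairs.append(((v_pts, v_data), best))
    return pairs
```

Video frames with no audio frame within the tolerance are dropped here; a production system would instead buffer or interpolate to avoid visible stalls.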
Optionally, the lip-sync model is any one of Wav2Lip, MuseTalk, and LatentSync.

Optionally, preprocessing the user's original audio to obtain preprocessed audio includes: performing voice activity detection on the user's original audio and removing silent segments to obtain first audio; denoising the first audio to obtain second audio; and resampling the second audio to obtain the preprocessed audio.

Optionally, performing speech synthesis on the target timbre audio to obtain synthesized audio includes: performing splicing, transition smoothing, energy matching, and time alignment on the target timbre audio to obtain the synthesized audio.

An embodiment of the present disclosure also provides an audio processing apparatus, comprising: an acquisition module for acquiring the user's original audio; a preprocessing module for preprocessing the user's original audio to obtain preprocessed audio; a conversion module for inputting the preprocessed audio into a real-time voice conversion model to perform voice conversion, obtaining target timbre audio; a synthesis module for performing speech synthesis on the target timbre audio to obtain synthesized audio; and an encoding module for performing audio-track encoding on the synthesized audio to obtain encoded audio and transmitting the encoded audio to a user terminal so that the user terminal can play the encoded audio. An embodiment of the disclosure also provides an electronic device
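The three-stage preprocessing chain described above (voice activity detection, denoising, resampling) can be sketched with toy stand-ins. The energy threshold, frame size, and moving-average denoiser below are illustrative assumptions; a real system would use a trained VAD and a proper resampling filter.

```python
# Illustrative sketch of the claimed preprocessing chain; all parameters
# (frame size, threshold, filter width) are assumed for demonstration.

def remove_silence(samples, frame=160, threshold=0.01):
    """Energy-based VAD: keep only frames whose mean absolute
    amplitude exceeds the threshold (yields the 'first audio')."""
    out = []
    for i in range(0, len(samples), frame):
        chunk = samples[i:i + frame]
        if sum(abs(s) for s in chunk) / len(chunk) > threshold:
            out.extend(chunk)
    return out

def denoise(samples, width=3):
    """Crude moving-average smoothing as a stand-in for a real
    denoiser (yields the 'second audio')."""
    half = width // 2
    out = []
    for i in range(len(samples)):
        window = samples[max(0, i - half):i + half + 1]
        out.append(sum(window) / len(window))
    return out

def resample(samples, src_rate, dst_rate):
    """Linear-interpolation resampling from src_rate to dst_rate
    (yields the preprocessed audio)."""
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for j in range(n_out):
        pos = j * src_rate / dst_rate
        i = int(pos)
        frac = pos - i
        nxt = samples[min(i + 1, len(samples) - 1)]
        out.append(samples[i] * (1 - frac) + nxt * frac)
    return out
```

Chaining `resample(denoise(remove_silence(x)), src, dst)` mirrors the claimed first-audio / second-audio / preprocessed-audio order.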