
CN-122024747-A - Voice conversion method, device, equipment and storage medium based on filtering noise

CN122024747A

Abstract

The application provides a voice conversion method, device, equipment and storage medium based on filtering noise, relating to the technical field of speech processing. The method comprises: performing feature extraction on reference audio of a target speaker to obtain a reference deep feature frame sequence, a reference harmonic amplitude frame sequence and a noise envelope coefficient frame sequence, and constructing a reference feature library; performing feature extraction on source audio to obtain a source deep feature frame sequence and a source fundamental frequency track; determining matched target feature frames in the reference feature library according to the source deep feature frame sequence; generating a harmonic signal corresponding to the source audio according to the reference deep features and reference harmonic amplitudes of the target feature frames and the source fundamental frequency track; performing noise filtering according to the noise envelope coefficients in the target feature frames to generate a noise signal matched with the noise characteristics of the reference audio; and mixing the harmonic signal with the noise signal to output the target converted voice of the target speaker. The application can improve the naturalness of voice conversion.

Inventors

  • SHENG LEYUAN
  • TANG GANG

Assignees

  • Hangzhou Xiaoying Innovation Technology Co., Ltd. (杭州小影创新科技股份有限公司)

Dates

Publication Date
2026-05-12
Application Date
2026-03-06

Claims (10)

  1. A method of voice conversion based on filtering noise, the method comprising: extracting features of reference audio of a target speaker to obtain a reference deep feature frame sequence, a reference harmonic amplitude frame sequence and a noise envelope coefficient frame sequence of the reference audio; extracting features of source audio to obtain a source deep feature frame sequence and a source fundamental frequency track of the source audio; constructing a reference feature library of the reference audio according to the reference deep feature frame sequence, the reference harmonic amplitude frame sequence and the noise envelope coefficient frame sequence; determining a target feature frame matched with the source audio in the reference feature library according to the source deep feature frame sequence; generating a harmonic signal corresponding to the source audio according to the reference deep feature and the reference harmonic amplitude of the target feature frame and the source fundamental frequency track; performing noise filtering according to the noise envelope coefficient in the target feature frame to generate a noise signal matched with the noise characteristic of the reference audio; and mixing the harmonic signal with the noise signal and outputting the target converted voice of the target speaker.
  2. The method of claim 1, wherein extracting features of the reference audio of the target speaker to obtain the reference deep feature frame sequence, the reference harmonic amplitude frame sequence and the noise envelope coefficient frame sequence of the reference audio comprises: extracting the reference deep feature frame sequence, a reference fundamental frequency track and spectrum information of the reference audio; extracting the reference harmonic amplitude frame sequence from the spectrum information according to the reference fundamental frequency track; and performing noise envelope extraction on the reference audio according to the reference fundamental frequency track and the spectrum information to obtain the noise envelope coefficient frame sequence.
  3. The method according to claim 2, wherein performing noise envelope extraction on the reference audio according to the reference fundamental frequency track and the spectrum information to obtain the noise envelope coefficient frame sequence comprises: obtaining a magnitude spectrum of the reference audio according to the spectrum information; constructing a harmonic mask matrix of the reference audio according to the reference fundamental frequency track; extracting a noise spectrum of the reference audio according to the magnitude spectrum and the harmonic mask matrix; and performing frequency band aggregation processing on the noise spectrum to obtain the noise envelope coefficient frame sequence.
  4. The method of claim 1, wherein the target feature frame comprises a plurality of feature frames, the method further comprising: respectively performing feature fusion on the reference deep features, the harmonic amplitudes and the noise envelope coefficients in the feature frames to obtain a fused reference deep feature, a fused harmonic amplitude and a fused noise envelope coefficient; wherein generating the harmonic signal corresponding to the source audio according to the reference deep feature and the reference harmonic amplitude in the target feature frame and the source fundamental frequency track comprises: generating the harmonic signal according to the fused reference deep feature, the fused harmonic amplitude and the source fundamental frequency track; and wherein generating the noise signal matched with the noise characteristic of the reference audio according to the noise envelope coefficient in the target feature frame comprises: performing noise filtering according to the fused noise envelope coefficient to generate the noise signal.
  5. The method of claim 4, wherein generating the harmonic signal according to the fused reference deep feature, the fused harmonic amplitude and the source fundamental frequency track comprises: performing pitch range adjustment on the source fundamental frequency track to obtain an adjusted fundamental frequency track matched with the pitch range of the target speaker; and generating the harmonic signal according to the fused reference deep feature, the fused harmonic amplitude and the adjusted fundamental frequency track.
  6. The method of claim 4, wherein performing noise filtering according to the noise envelope coefficient in the target feature frame to generate the noise signal matched with the noise characteristic of the reference audio comprises: generating a white noise signal equal in length to the reference audio; performing linear interpolation up-sampling on the fused noise envelope coefficient to obtain time-varying filter frequency response coefficients; performing time-varying filtering on the white noise signal according to the time-varying filter frequency response coefficients to obtain a filtered noise signal; and performing amplitude correction on the filtered noise signal to obtain the noise signal.
  7. The method of claim 1, wherein mixing the harmonic signal with the noise signal and outputting the target converted voice of the target speaker comprises: estimating the signal energy of the harmonic signal and of the noise signal to obtain a harmonic signal energy and a noise signal energy; adjusting the amplitude of the noise signal according to the harmonic signal energy, the noise signal energy and a preset target energy ratio to obtain an amplitude-adjusted noise signal; and superposing the amplitude-adjusted noise signal and the harmonic signal to obtain the target converted voice.
  8. A voice conversion apparatus based on filtering noise, the apparatus comprising: an extraction module configured to perform feature extraction on reference audio of a target speaker to obtain a reference deep feature frame sequence, a reference harmonic amplitude frame sequence and a noise envelope coefficient frame sequence of the reference audio; a construction module configured to construct a reference feature library of the reference audio according to the reference deep feature frame sequence, the reference harmonic amplitude frame sequence and the noise envelope coefficient frame sequence; a matching module configured to determine a target feature frame matched with the source audio in the reference feature library according to the source deep feature frame sequence; a generation module configured to generate a harmonic signal corresponding to the source audio according to the reference deep features and the reference harmonic amplitudes in the target feature frame and the source fundamental frequency track; and a mixing module configured to mix the harmonic signal with the noise signal and output the target converted voice of the target speaker.
  9. An electronic device comprising a processor, a memory and a bus, the memory storing program instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is in operation, and the processor executing the program instructions to perform the steps of the filtering-noise-based voice conversion method of any one of claims 1 to 7.
  10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the filtering-noise-based voice conversion method according to any one of claims 1 to 7.
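The noise-shaping path of claim 6 (white noise passed through a time-varying filter whose frequency response is interpolated from per-frame envelope coefficients) can be sketched as follows. This is a minimal numpy illustration, not the patent's implementation: the STFT-based filtering, frame hop, FFT size and band count are all assumptions chosen for clarity.

```python
import numpy as np

def shaped_noise(envelope, sr=16000, hop=160, n_fft=512, rng=None):
    """Generate noise whose spectral shape follows a per-frame band envelope.

    envelope: (T, B) array of noise envelope coefficients, one row per
    frame and one column per frequency band. White noise is analyzed
    frame by frame; each frame's spectrum is multiplied by the envelope
    (interpolated from B bands up to full bin resolution), then the
    frames are reassembled by windowed overlap-add.
    """
    rng = np.random.default_rng(rng)
    n_frames, n_bands = envelope.shape
    n_bins = n_fft // 2 + 1
    win = np.hanning(n_fft)
    n_samples = n_frames * hop + n_fft
    noise = rng.standard_normal(n_samples)      # white noise source
    out = np.zeros(n_samples)
    band_centers = np.linspace(0, n_bins - 1, n_bands)
    for t in range(n_frames):
        frame = noise[t * hop : t * hop + n_fft] * win
        spec = np.fft.rfft(frame)
        # Linear-interpolate the B band coefficients to per-bin gains,
        # i.e. the "time-varying filter frequency response" of the claim.
        gain = np.interp(np.arange(n_bins), band_centers, envelope[t])
        out[t * hop : t * hop + n_fft] += np.fft.irfft(spec * gain) * win
    return out[: n_frames * hop]
```

A final amplitude-correction step (claim 6's last sub-step) would rescale the result to a target RMS before mixing with the harmonic signal.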

Description

Voice conversion method, device, equipment and storage medium based on filtering noise

Technical Field

The present application relates to the field of speech processing technologies, and in particular to a method, an apparatus, a device and a storage medium for voice conversion based on filtering noise.

Background

Voice conversion technology converts the voice of a source speaker into voice bearing the timbre characteristics of a target speaker while keeping the semantic content unchanged, and has broad application prospects in fields such as entertainment and education. Existing methods mainly comprise end-to-end methods based on neural networks and parameterized methods based on signal processing. Among them, methods based on k-nearest-neighbor matching and differentiable digital signal processing have attracted attention because of their strong interpretability and small training data requirements. Such schemes adopt harmonic additive synthesis to realize voice generation and can faithfully restore strongly periodic voice components such as vowels. However, the existing methods have obvious defects. First, only the addition of sine waves is used to model the harmonic components, ignoring non-harmonic components of speech such as breath sounds and fricatives. Second, the synthesis of clear consonants is poor, and distortion or loss of pronunciation easily occurs. In addition, the synthesized voice lacks the sense of breath and texture that real voice should have, and the timbre similarity to the target speaker is limited, which seriously affects the naturalness of voice conversion and restricts the application of the technology.

Disclosure of Invention

The embodiments of the present application provide a voice conversion method, apparatus, device and storage medium based on filtering noise, which can improve the naturalness of voice conversion.
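The harmonic additive synthesis mentioned in the background (summing sinusoids at integer multiples of the fundamental, with per-frame harmonic amplitudes) can be sketched as below. This is an illustrative numpy sketch, not the patent's code; the sample rate, hop size and interpolation scheme are assumptions.

```python
import numpy as np

def harmonic_synthesis(f0_frames, amp_frames, sr=16000, hop=160):
    """Additive harmonic synthesis.

    f0_frames:  (T,)   fundamental frequency per frame, in Hz.
    amp_frames: (T, K) amplitude of each of K harmonics per frame.
    Returns a waveform of T * hop samples.
    """
    n_frames, n_harm = amp_frames.shape
    n_samples = n_frames * hop
    # Upsample frame-rate controls to sample rate by linear interpolation.
    t_frame = np.arange(n_frames) * hop
    t_samp = np.arange(n_samples)
    f0 = np.interp(t_samp, t_frame, f0_frames)
    # Running phase of the fundamental; harmonic k uses k times this phase.
    phase = 2 * np.pi * np.cumsum(f0 / sr)
    out = np.zeros(n_samples)
    for k in range(1, n_harm + 1):
        amp_k = np.interp(t_samp, t_frame, amp_frames[:, k - 1])
        # Zero out harmonics above Nyquist to avoid aliasing.
        amp_k = np.where(k * f0 < sr / 2, amp_k, 0.0)
        out += amp_k * np.sin(k * phase)
    return out
```

As the background notes, this models only the periodic (voiced) component; the patent's contribution is to add a filtered-noise component on top of it.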
In a first aspect, an embodiment of the present application provides a method for voice conversion based on filtering noise, the method comprising: extracting features of reference audio of a target speaker to obtain a reference deep feature frame sequence, a reference harmonic amplitude frame sequence and a noise envelope coefficient frame sequence of the reference audio; extracting features of source audio to obtain a source deep feature frame sequence and a source fundamental frequency track of the source audio; constructing a reference feature library of the reference audio according to the reference deep feature frame sequence, the reference harmonic amplitude frame sequence and the noise envelope coefficient frame sequence; determining a target feature frame matched with the source audio in the reference feature library according to the source deep feature frame sequence; generating a harmonic signal corresponding to the source audio according to the reference deep feature and the reference harmonic amplitude of the target feature frame and the source fundamental frequency track; performing noise filtering according to the noise envelope coefficient in the target feature frame to generate a noise signal matched with the noise characteristic of the reference audio; and mixing the harmonic signal with the noise signal and outputting the target converted voice of the target speaker.
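The matching step above (finding target feature frames in the reference library by comparing deep feature frames) is, per the background, a k-nearest-neighbor search. A minimal numpy sketch, assuming cosine similarity as the distance measure (the patent does not fix a metric here):

```python
import numpy as np

def match_frames(source_feats, ref_feats, k=4):
    """For each source deep-feature frame, return the indices of the k
    most similar reference frames (by cosine similarity).

    source_feats: (Ts, D) source deep feature frame sequence.
    ref_feats:    (Tr, D) reference deep feature frame sequence.
    Returns an (Ts, k) index array into the reference feature library.
    """
    s = source_feats / np.linalg.norm(source_feats, axis=1, keepdims=True)
    r = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    sim = s @ r.T                            # (Ts, Tr) cosine similarities
    return np.argsort(-sim, axis=1)[:, :k]   # top-k reference frames
```

The returned indices select not only the matched deep features but also the harmonic amplitudes and noise envelope coefficients stored alongside them in the reference feature library; claim 4's feature fusion then averages over the k matches.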
Optionally, performing feature extraction on the reference audio of the target speaker to obtain the reference deep feature frame sequence, the reference harmonic amplitude frame sequence and the noise envelope coefficient frame sequence of the reference audio includes: extracting the reference deep feature frame sequence, a reference fundamental frequency track and spectrum information of the reference audio; extracting the reference harmonic amplitude frame sequence from the spectrum information according to the reference fundamental frequency track; and performing noise envelope extraction on the reference audio according to the reference fundamental frequency track and the spectrum information to obtain the noise envelope coefficient frame sequence. Optionally, performing noise envelope extraction on the reference audio according to the reference fundamental frequency track and the spectrum information to obtain the noise envelope coefficient frame sequence includes: obtaining a magnitude spectrum of the reference audio according to the spectrum information; constructing a harmonic mask matrix of the reference audio according to the reference fundamental frequency track; extracting a noise spectrum of the reference audio according to the magnitude spectrum and the harmonic mask matrix; and performing frequency band aggregation processing on the noise spectrum to obtain the noise envelope coefficient frame sequence. Optionally, the target feature frame comprises a plurality of feature frames, and the method further comprises: respectively performing feature fusion on the reference deep features, the harmonic amplitudes and the noise envelope coefficients in the feature frames to obtain a fused reference deep feature, a fused harmonic amplitude and a fused noise envelope coefficient.
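The noise envelope extraction above (mask out bins near harmonics of the fundamental, keep the residual as the noise spectrum, then aggregate it into frequency bands) can be sketched as follows. This is an assumed numpy illustration: the mask width in Hz, band count and FFT size are hypothetical parameters, not values from the patent.

```python
import numpy as np

def noise_envelope(mag_spec, f0_frames, sr=16000, n_fft=512,
                   n_bands=16, width_hz=40.0):
    """Estimate a per-frame noise envelope from a magnitude spectrogram.

    mag_spec:  (T, n_fft//2 + 1) magnitude spectrum per frame.
    f0_frames: (T,) fundamental frequency per frame in Hz (0 = unvoiced).
    Bins within width_hz of any harmonic of f0 are treated as harmonic
    and masked out; the remaining (noise) energy is aggregated into
    n_bands envelope coefficients per frame.
    """
    n_frames, n_bins = mag_spec.shape
    freqs = np.arange(n_bins) * sr / n_fft
    env = np.zeros((n_frames, n_bands))
    edges = np.linspace(0, n_bins, n_bands + 1).astype(int)
    for t in range(n_frames):
        f0 = f0_frames[t]
        mask = np.ones(n_bins, dtype=bool)          # True = noise bin
        if f0 > 0:
            # Distance of each bin frequency to the nearest harmonic of f0.
            dist = np.abs(freqs - f0 * np.round(freqs / f0))
            mask = dist > width_hz
        noise = np.where(mask, mag_spec[t], 0.0)    # residual noise spectrum
        for b in range(n_bands):                    # frequency band aggregation
            env[t, b] = noise[edges[b]:edges[b + 1]].mean()
    return env
```

In unvoiced frames (f0 = 0) nothing is masked, so the whole spectrum counts as noise, which matches the intuition that fricatives and breath sounds are dominated by the noise component.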