US-20260128048-A1 - VOICE ENHANCEMENT METHOD AND APPARATUS, ELECTRONIC DEVICE, AND STORAGE MEDIUM

Abstract

This application discloses a voice enhancement method and apparatus, an electronic device, and a storage medium. The method includes: extracting first excitation information and first envelope information from a first frequency domain signal corresponding to a first voice signal, where the first voice signal includes a noise signal and a first clean voice signal; performing optimal excitation codebook matching on the first excitation information to obtain second excitation information; performing optimal envelope codebook matching on the first envelope information to obtain second envelope information; synthesizing target spectral information based on the first excitation information, the second excitation information, and the second envelope information; and performing voice enhancement processing on the first voice signal based on the target spectral information.

Inventors

Hongbo Yang
Qi Hao

Assignees

VIVO MOBILE COMMUNICATION CO., LTD.

Dates

Publication Date: 20260507
Application Date: 20260102
Priority Date: 20230706

Claims (20)

1. A voice enhancement method, wherein the method comprises: extracting first excitation information and first envelope information from a first frequency domain signal corresponding to a first voice signal, wherein the first voice signal comprises a noise signal and a first clean voice signal; performing optimal excitation codebook matching on the first excitation information to obtain second excitation information; performing optimal envelope codebook matching on the first envelope information to obtain second envelope information; synthesizing target spectral information based on the first excitation information, the second excitation information, and the second envelope information; and performing voice enhancement processing on the first voice signal based on the target spectral information.
2. The method according to claim 1, wherein the performing optimal excitation codebook matching on the first excitation information to obtain second excitation information comprises: determining, from a preset excitation codebook, the second excitation information that matches the first excitation information; or the performing optimal envelope codebook matching on the first envelope information to obtain second envelope information comprises: determining, from a preset envelope codebook, the second envelope information that matches the first envelope information, wherein the preset excitation codebook is NC representative clean voice excitation information subsets obtained by training an excitation information set of clean voice signals; and the preset envelope codebook is NA representative clean voice envelope information subsets obtained by training an envelope information set of clean voice signals.
3. The method according to claim 2, wherein the method further comprises: performing linear prediction analysis on the first frequency domain signal to obtain the first excitation information; and the determining, from a preset excitation codebook, the second excitation information that matches the first excitation information comprises: determining a pitch period of the first frequency domain signal based on a peak value of the first excitation information; and determining, from the preset excitation codebook, the second excitation information corresponding to the pitch period, wherein the second excitation information is clean voice excitation information.
4. The method according to claim 3, wherein the performing linear prediction analysis on the first frequency domain signal to obtain the first excitation information comprises: performing linear prediction processing on the first frequency domain signal, to obtain a residual signal corresponding to the first frequency domain signal; and performing frequency domain transformation processing on the residual signal, to obtain a residual signal in cepstral domain, wherein the residual signal in cepstral domain is the first excitation information.
5. The method according to claim 2, wherein the method further comprises: performing linear prediction analysis on the first frequency domain signal to obtain the first envelope information; and the determining, from a preset envelope codebook, the second envelope information that matches the first envelope information comprises: determining, as the second envelope information, clean voice envelope information that is in the preset envelope codebook and that has a smallest vector distance to the first envelope information.
6. The method according to claim 5, wherein the performing linear prediction analysis on the first frequency domain signal to obtain the first envelope information comprises: performing linear prediction processing on the first frequency domain signal, to obtain a linear predictive coding coefficient corresponding to the first frequency domain signal; and performing transformation processing on the linear predictive coding coefficient, to obtain a line spectral pair corresponding to the linear predictive coding coefficient, wherein the line spectral pair corresponding to the linear predictive coding coefficient is the first envelope information.
7. The method according to claim 2, wherein the method further comprises: obtaining a clean voice signal set, wherein the clean voice signal set comprises at least one clean voice signal; performing linear prediction processing on each clean voice signal in the clean voice signal set, to obtain a feature information set corresponding to the clean voice signal set; performing peak detection on the feature information set, to obtain a pitch period corresponding to the feature information set; and clustering feature information in the feature information set based on the pitch period, to obtain the preset excitation codebook.
8. The method according to claim 2, wherein the method further comprises: obtaining a clean voice signal set, wherein the clean voice signal set comprises at least one clean voice signal; performing linear prediction processing on each clean voice signal in the clean voice signal set, to obtain a feature information set corresponding to the clean voice signal set; and clustering the feature information set by using a preset algorithm, to obtain the preset envelope codebook.
9. The method according to claim 1, wherein the performing optimal excitation codebook matching on the first excitation information to obtain second excitation information comprises: inputting the first excitation information to a first network model, to obtain the second excitation information, wherein the first network model is an excitation matching model.
10. The method according to claim 1, wherein the performing optimal envelope codebook matching on the first envelope information to obtain second envelope information comprises: inputting the first envelope information to a second network model, to obtain the second excitation information, wherein the second network model is an envelope matching model.
11. The method according to claim 1, wherein the performing optimal excitation codebook matching on the first excitation information to obtain second excitation information and performing optimal envelope codebook matching on the first envelope information to obtain second envelope information comprises: inputting the first excitation information and the first envelope information to a third network model, to obtain the second excitation information and the second envelope information, wherein the third network model is a cross-matching model.
12. An electronic device, comprising a processor, a memory, and a program or an instruction stored in the memory and executable on the processor, wherein the program or the instruction, when executed by the processor, causes the electronic device to perform: extracting first excitation information and first envelope information from a first frequency domain signal corresponding to a first voice signal, wherein the first voice signal comprises a noise signal and a first clean voice signal; performing optimal excitation codebook matching on the first excitation information to obtain second excitation information; performing optimal envelope codebook matching on the first envelope information to obtain second envelope information; synthesizing target spectral information based on the first excitation information, the second excitation information, and the second envelope information; and performing voice enhancement processing on the first voice signal based on the target spectral information.
13. The electronic device according to claim 12, wherein when performing optimal excitation codebook matching on the first excitation information to obtain second excitation information, the program or the instruction, when executed by the processor, causes the electronic device to perform: determining, from a preset excitation codebook, the second excitation information that matches the first excitation information; or when performing optimal envelope codebook matching on the first envelope information to obtain second envelope information, the program or the instruction, when executed by the processor, causes the electronic device to perform: determining, from a preset envelope codebook, the second envelope information that matches the first envelope information, wherein the preset excitation codebook is NC representative clean voice excitation information subsets obtained by training an excitation information set of clean voice signals; and the preset envelope codebook is NA representative clean voice envelope information subsets obtained by training an envelope information set of clean voice signals.
14. The electronic device according to claim 13, wherein the program or the instruction, when executed by the processor, causes the electronic device to further perform: performing linear prediction analysis on the first frequency domain signal to obtain the first excitation information; and when determining, from a preset excitation codebook, the second excitation information that matches the first excitation information, the program or the instruction, when executed by the processor, causes the electronic device to perform: determining a pitch period of the first frequency domain signal based on a peak value of the first excitation information; and determining, from the preset excitation codebook, the second excitation information corresponding to the pitch period, wherein the second excitation information is clean voice excitation information.
15. The electronic device according to claim 14, wherein when performing linear prediction analysis on the first frequency domain signal to obtain the first excitation information, the program or the instruction, when executed by the processor, causes the electronic device to perform: performing linear prediction processing on the first frequency domain signal, to obtain a residual signal corresponding to the first frequency domain signal; and performing frequency domain transformation processing on the residual signal, to obtain a residual signal in cepstral domain, wherein the residual signal in cepstral domain is the first excitation information.
16. The electronic device according to claim 13, wherein the program or the instruction, when executed by the processor, causes the electronic device to further perform: performing linear prediction analysis on the first frequency domain signal to obtain the first envelope information; and when determining, from a preset envelope codebook, the second envelope information that matches the first envelope information, the program or the instruction, when executed by the processor, causes the electronic device to perform: determining, as the second envelope information, clean voice envelope information that is in the preset envelope codebook and that has a smallest vector distance to the first envelope information.
17. The electronic device according to claim 16, wherein when performing linear prediction analysis on the first frequency domain signal to obtain the first envelope information, the program or the instruction, when executed by the processor, causes the electronic device to perform: performing linear prediction processing on the first frequency domain signal, to obtain a linear predictive coding coefficient corresponding to the first frequency domain signal; and performing transformation processing on the linear predictive coding coefficient, to obtain a line spectral pair corresponding to the linear predictive coding coefficient, wherein the line spectral pair corresponding to the linear predictive coding coefficient is the first envelope information.
18. The electronic device according to claim 13, wherein the program or the instruction, when executed by the processor, causes the electronic device to further perform: obtaining a clean voice signal set, wherein the clean voice signal set comprises at least one clean voice signal; performing linear prediction processing on each clean voice signal in the clean voice signal set, to obtain a feature information set corresponding to the clean voice signal set; performing peak detection on the feature information set, to obtain a pitch period corresponding to the feature information set; and clustering feature information in the feature information set based on the pitch period, to obtain the preset excitation codebook.
19. The electronic device according to claim 13, wherein the program or the instruction, when executed by the processor, causes the electronic device to further perform: obtaining a clean voice signal set, wherein the clean voice signal set comprises at least one clean voice signal; performing linear prediction processing on each clean voice signal in the clean voice signal set, to obtain a feature information set corresponding to the clean voice signal set; and clustering the feature information set by using a preset algorithm, to obtain the preset envelope codebook.
20. A non-transitory readable storage medium, wherein the non-transitory readable storage medium stores a program or an instruction, wherein the program or the instruction, when executed by a processor of an electronic device, causes the electronic device to perform: extracting first excitation information and first envelope information from a first frequency domain signal corresponding to a first voice signal, wherein the first voice signal comprises a noise signal and a first clean voice signal; performing optimal excitation codebook matching on the first excitation information to obtain second excitation information; performing optimal envelope codebook matching on the first envelope information to obtain second envelope information; synthesizing target spectral information based on the first excitation information, the second excitation information, and the second envelope information; and performing voice enhancement processing on the first voice signal based on the target spectral information.
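
For illustration only, the excitation and envelope steps recited in claims 3 to 6 might be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of this application: it runs linear prediction on a time-domain frame using the autocorrelation method, uses a prediction order of 10 and a pitch search range of 60 Hz to 400 Hz, and measures vector distance with the Euclidean norm. All function names are hypothetical. The line spectral pair conversion of claim 6 is omitted, so `nearest_entry` simply accepts whatever envelope vectors the caller supplies.

```python
import numpy as np


def lpc_coefficients(frame, order=10):
    """Autocorrelation-method LPC (assumed variant). Returns a with a[0] == 1,
    so that the prediction error is e[n] = sum_k a[k] * x[n - k]."""
    frame = np.asarray(frame, dtype=float)
    full = np.correlate(frame, frame, mode="full")
    r = full[len(frame) - 1 : len(frame) + order]  # autocorrelation lags 0..order
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    R = r[idx] + 1e-9 * r[0] * np.eye(order)  # lightly regularised Toeplitz matrix
    return np.concatenate(([1.0], np.linalg.solve(R, -r[1 : order + 1])))


def residual_cepstrum(frame, order=10):
    """Inverse-filter the frame with A(z), then take the real cepstrum of the residual."""
    a = lpc_coefficients(frame, order)
    residual = np.convolve(frame, a)[: len(frame)]
    log_magnitude = np.log(np.abs(np.fft.rfft(residual)) + 1e-8)
    return np.fft.irfft(log_magnitude, n=len(frame))


def pitch_period(cepstrum, fs, f_min=60.0, f_max=400.0):
    """Pitch period in samples, taken as the cepstral peak inside the search range."""
    lo, hi = int(fs / f_max), int(fs / f_min)
    return lo + int(np.argmax(cepstrum[lo : hi + 1]))


def nearest_entry(vector, codebook):
    """Index of the codebook row with the smallest Euclidean distance to vector."""
    distances = np.linalg.norm(np.asarray(codebook, dtype=float) - np.asarray(vector, dtype=float), axis=1)
    return int(np.argmin(distances))
```

In this sketch, `pitch_period(residual_cepstrum(frame), fs)` yields a lag that could index a pitch-keyed excitation codebook, and `nearest_entry(envelope, envelope_codebook)` selects the closest clean-voice envelope entry.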

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a Bypass continuation application of PCT International Application No. PCT/CN2024/103189 filed on Jul. 2, 2024, which claims priority to Chinese Patent Application No. 202310833021.0, filed in China on Jul. 6, 2023, which are incorporated herein by reference in their entireties.

TECHNICAL FIELD

This application relates to the field of audio noise reduction technologies, and specifically, to a voice enhancement method and apparatus, an electronic device, and a storage medium.

BACKGROUND

Currently, when an electronic device performs noise reduction processing on voice signals of the electronic device through Wiener filtering and statistical models, the electronic device can perform noise reduction processing on the voice signals based on a prior signal-to-noise ratio, to obtain noise-free voice signals. Usually, an accurate prior signal-to-noise ratio needs to be obtained based on a power spectrum of a clean voice signal. However, in practical situations, the electronic device can obtain only noisy voice signals. Therefore, in related technologies, the electronic device may estimate a prior signal-to-noise ratio by using a decision-directed method, that is, the electronic device may estimate the prior signal-to-noise ratio based on a posterior signal-to-noise ratio.

However, in a process of obtaining the posterior signal-to-noise ratio, in a case that power of a clean voice signal is close to power of a noise signal, the electronic device cannot obtain an accurate posterior signal-to-noise ratio. This leads to a significant error in a prior signal-to-noise ratio estimated by the electronic device based on the posterior signal-to-noise ratio. Consequently, accuracy of the prior signal-to-noise ratio determined by the electronic device is low, leading to a poor noise reduction effect for a voice signal.
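
For context, the decision-directed estimate referred to in the background is commonly written per frequency bin as xi(n) = alpha * |S(n-1)|^2 / lambda_d + (1 - alpha) * max(gamma(n) - 1, 0), where gamma(n) = |Y(n)|^2 / lambda_d is the posterior signal-to-noise ratio, and the Wiener gain is G = xi / (1 + xi). The following is a minimal sketch of that textbook form, assuming the noise power per bin is already known. The smoothing factor 0.98, the floor `xi_min`, and the function name are illustrative assumptions and are not taken from this application.

```python
import numpy as np


def decision_directed_wiener_gain(noisy_power, noise_power, alpha=0.98, xi_min=1e-3):
    """Per-bin Wiener gains from a decision-directed prior SNR estimate.

    noisy_power: array of shape (frames, bins) holding |Y|^2 per frame and bin.
    noise_power: array of shape (bins,) holding the assumed-known noise power.
    """
    noisy_power = np.asarray(noisy_power, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    gains = np.empty_like(noisy_power)
    previous_clean_power = np.zeros_like(noise_power)  # |S(n-1)|^2 estimate
    for n, frame_power in enumerate(noisy_power):
        gamma = frame_power / noise_power  # posterior SNR
        xi = alpha * previous_clean_power / noise_power \
            + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0)  # prior SNR estimate
        xi = np.maximum(xi, xi_min)  # floor to avoid a zero gain
        gain = xi / (1.0 + xi)  # Wiener gain
        gains[n] = gain
        previous_clean_power = gain ** 2 * frame_power  # feeds the next frame
    return gains
```

Because the prior estimate is built entirely from the posterior ratio and the previous frame's output, any error in the posterior ratio carries straight into the gain, which is the weakness the background describes.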
SUMMARY

Embodiments of this application aim to provide a voice enhancement method and apparatus, an electronic device, and a storage medium.

According to a first aspect, an embodiment of this application provides a voice enhancement method. The method includes: extracting first excitation information and first envelope information from a first frequency domain signal corresponding to a first voice signal, where the first voice signal includes a noise signal and a first clean voice signal; performing optimal excitation codebook matching on the first excitation information to obtain second excitation information; performing optimal envelope codebook matching on the first envelope information to obtain second envelope information; synthesizing target spectral information based on the first excitation information, the second excitation information, and the second envelope information; and performing voice enhancement processing on the first voice signal based on the target spectral information.

According to a second aspect, an embodiment of this application provides a voice enhancement apparatus. The voice enhancement apparatus includes: an extraction module, a matching module, a synthesis module, and a processing module. The extraction module is configured to extract first excitation information and first envelope information from a first frequency domain signal corresponding to a first voice signal, where the first voice signal includes a noise signal and a first clean voice signal. The matching module is configured to: perform optimal excitation codebook matching on the first excitation information extracted by the extraction module to obtain second excitation information, and perform optimal envelope codebook matching on the first envelope information extracted by the extraction module to obtain second envelope information.
The synthesis module is configured to synthesize target spectral information based on the first excitation information, the second excitation information, and the second envelope information. The processing module is configured to perform voice enhancement processing on the first voice signal based on the target spectral information.

According to a third aspect, an embodiment of this application provides an electronic device. The electronic device includes a processor and a memory, the memory stores a program or an instruction executable on the processor, and the program or the instruction is executed by the processor to implement the steps of the method according to the first aspect.

According to a fourth aspect, an embodiment of this application provides a readable storage medium. The readable storage medium stores a program or an instruction, and the program or the instruction is executed by a processor to implement the steps of the method according to the first aspect.

According to a fifth aspect, an embodiment of this application provides a chip. The chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configure