
EP-4283618-B1 - SPEECH ENHANCEMENT METHOD AND APPARATUS, AND DEVICE AND STORAGE MEDIUM

EP 4283618 B1

Inventors

  • XIAO, Wei
  • SHI, Yupeng
  • WANG, Meng
  • SHANG, Shidong
  • WU, Zurong

Dates

Publication Date
2026-05-13
Application Date
2022-01-27

Claims (12)

  1. A speech enhancement method, executed by a computer device, comprising: determining a glottal parameter corresponding to a target speech frame according to a frequency domain representation of the target speech frame; determining a gain corresponding to the target speech frame according to a gain corresponding to a historical speech frame of the target speech frame; determining an excitation signal corresponding to the target speech frame according to the frequency domain representation of the target speech frame; and synthesizing the determined glottal parameter, the determined gain, and the determined excitation signal, to obtain an enhanced speech signal; wherein the determining a glottal parameter corresponding to a target speech frame according to a frequency domain representation of the target speech frame comprises: obtaining a first neural network by the first neural network being trained according to a frequency domain representation of a sample speech frame, a glottal parameter corresponding to the sample speech frame, and a glottal parameter corresponding to a historical speech frame of the sample speech frame; inputting the frequency domain representation of the target speech frame and the glottal parameter corresponding to the historical speech frame of the target speech frame into the first neural network; and performing, by the first neural network, prediction according to the frequency domain representation of the target speech frame and the glottal parameter corresponding to the historical speech frame of the target speech frame, and outputting the glottal parameter corresponding to the target speech frame; wherein the first neural network is obtained by a training process of the speech enhancement method comprising: inputting the frequency domain representation of the sample speech frame and the glottal parameter corresponding to the historical speech frame of the sample speech frame into the first neural network; outputting a predicted 
glottal parameter from the first neural network; and when the predicted glottal parameter is inconsistent with a glottal parameter corresponding to an original speech signal in the sample speech frame, adjusting a parameter of the first neural network until the predicted glottal parameter is consistent with the glottal parameter corresponding to the original speech signal.
  2. The method according to claim 1, wherein the synthesizing the determined glottal parameter, the determined gain, and the determined excitation signal, to obtain an enhanced speech signal corresponding to the target speech frame, comprises: constructing a glottal filter according to the glottal parameter corresponding to the target speech frame; filtering the excitation signal corresponding to the target speech frame by using the glottal filter, to obtain a first speech signal; and amplifying the first speech signal according to the gain corresponding to the target speech frame, to obtain the enhanced speech signal corresponding to the target speech frame.
  3. The method according to claim 2, wherein the target speech frame comprises a plurality of sample points; the glottal filter is a K-order filter, K being a positive integer; the excitation signal comprises excitation signal values respectively corresponding to the plurality of sample points in the target speech frame; and the filtering the excitation signal corresponding to the target speech frame by using the glottal filter, to obtain a first speech signal, comprises: for one sample point in the target speech frame, performing convolution on excitation signal values corresponding to K sample points before the sample point in the target speech frame and the K-order filter, to obtain a target signal value of the sample point in the target speech frame; and combining target signal values corresponding to the sample points in the target speech frame chronologically, to obtain the first speech signal.
  4. The method according to claim 2, wherein the glottal filter is a K-order filter, and the glottal parameter comprises a K-order line spectral frequency parameter or a K-order linear prediction coefficient, K being a positive integer.
  5. The method according to claim 1, wherein the determining a gain corresponding to the target speech frame according to a gain corresponding to a historical speech frame of the target speech frame comprises: inputting the gain corresponding to the historical speech frame of the target speech frame to a second neural network, the second neural network being obtained by training according to a gain corresponding to a sample speech frame and a gain corresponding to a historical speech frame of the sample speech frame; and outputting, by the second neural network, the target gain according to the gain corresponding to the historical speech frame of the target speech frame.
  6. The method according to claim 1, wherein the determining an excitation signal corresponding to the target speech frame according to the frequency domain representation of the target speech frame comprises: inputting the frequency domain representation of the target speech frame to a third neural network, the third neural network being obtained by training according to a frequency domain representation of a sample speech frame and a frequency domain representation of an excitation signal corresponding to the sample speech frame; and outputting, by the third neural network according to the frequency domain representation of the target speech frame, a frequency domain representation of the excitation signal corresponding to the target speech frame.
  7. The method according to claim 1, wherein before the determining a glottal parameter corresponding to a target speech frame according to a frequency domain representation of the target speech frame, the method further comprises: obtaining a time domain signal of the target speech frame; performing a time-frequency transform on the time domain signal of the target speech frame, to obtain the frequency domain representation of the target speech frame.
  8. The method according to claim 7, wherein the obtaining a time domain signal of the target speech frame comprises: obtaining a second speech signal, the second speech signal being an acquired speech signal or a speech signal obtained by decoding an encoded speech; and framing the second speech signal, to obtain the time domain signal of the target speech frame.
  9. The method according to claim 1, wherein after the synthesizing the determined glottal parameter, the determined gain, and the determined excitation signal, to obtain an enhanced speech signal corresponding to the target speech frame, the method further comprises: playing or encoding and transmitting the enhanced speech signal corresponding to the target speech frame.
  10. A speech enhancement apparatus, comprising: a glottal parameter prediction module, configured to determine a glottal parameter corresponding to a target speech frame according to a frequency domain representation of the target speech frame; a gain prediction module, configured to determine a gain corresponding to the target speech frame according to a gain corresponding to a historical speech frame of the target speech frame; an excitation signal prediction module, configured to determine an excitation signal corresponding to the target speech frame according to the frequency domain representation of the target speech frame; and a synthesis module, configured to synthesize the determined glottal parameter, the determined gain, and the determined excitation signal, to obtain an enhanced speech signal corresponding to the target speech frame; wherein the determining a glottal parameter corresponding to a target speech frame according to a frequency domain representation of the target speech frame comprises: obtaining a first neural network by the first neural network being trained according to a frequency domain representation of a sample speech frame, a glottal parameter corresponding to the sample speech frame, and a glottal parameter corresponding to a historical speech frame of the sample speech frame; inputting the frequency domain representation of the target speech frame and the glottal parameter corresponding to the historical speech frame of the target speech frame into the first neural network; and performing, by the first neural network, prediction according to the frequency domain representation of the target speech frame and the glottal parameter corresponding to the historical speech frame of the target speech frame, and outputting the glottal parameter corresponding to the target speech frame; wherein the speech enhancement apparatus is configured to obtain the first neural network by a training process comprising: inputting the frequency domain 
representation of the sample speech frame and the glottal parameter corresponding to the historical speech frame of the sample speech frame into the first neural network; outputting a predicted glottal parameter from the first neural network; and when the predicted glottal parameter is inconsistent with a glottal parameter corresponding to an original speech signal in the sample speech frame, adjusting a parameter of the first neural network until the predicted glottal parameter is consistent with the glottal parameter corresponding to the original speech signal.
  11. An electronic device, comprising: a processor; and a memory, storing computer-readable instructions, the computer-readable instructions, when executed by the processor, implementing the method according to any one of claims 1 to 9.
  12. A computer-readable storage medium, storing computer-readable instructions, the computer-readable instructions, when executed by a processor, implementing the method according to any one of claims 1 to 9.
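As a rough illustration of the overall flow in claim 1, the sketch below wires three placeholder predictors (standing in for the first, second, and third neural networks, whose architectures the claims do not fix) into the synthesis step of claim 2. All function names and signatures here are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def enhance_frame(freq_repr, hist_glottal, hist_gain,
                  predict_glottal, predict_gain, predict_excitation):
    """Sketch of claim 1: predict glottal parameters, gain, and
    excitation for the target frame, then synthesize them (claim 2).
    The three predictor callables stand in for the first, second,
    and third neural networks; their signatures are assumptions."""
    glottal = predict_glottal(freq_repr, hist_glottal)   # first network
    gain = predict_gain(hist_gain)                       # second network
    excitation = predict_excitation(freq_repr)           # third network
    # Claim 2 synthesis: filter the excitation with the glottal
    # filter, then amplify by the predicted gain.
    filtered = np.convolve(excitation, glottal)[:len(excitation)]
    return gain * filtered
```

With trivially chosen stand-in predictors (identity filter, gain of 2), the enhanced frame is simply the excitation scaled by the gain, which makes the data flow between the three modules and the synthesis step easy to trace.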
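Claim 3's filtering step — each output sample is the convolution of the K excitation values preceding that sample point with the K-order filter — can be sketched directly. Zero-padding the first K "historical" values and the helper names are assumptions for illustration; the patent does not specify how the earliest samples are handled.

```python
import numpy as np

def glottal_filter_synthesis(excitation, h):
    """Claim 3 sketch: for each sample point, convolve the K
    excitation values before it with the K-order glottal filter,
    then combine the target values chronologically."""
    K = len(h)
    # Assumed zero history for the first K sample points.
    padded = np.concatenate([np.zeros(K), excitation])
    out = np.empty(len(excitation))
    for n in range(len(excitation)):
        # The K excitation values strictly before sample point n.
        window = padded[n:n + K]
        out[n] = np.dot(window, h[::-1])  # convolution with the taps
    return out

def apply_gain(signal, gain):
    # Claim 2: amplify the filtered signal by the predicted gain.
    return gain * signal
```

With a one-tap filter, each output sample is just the previous excitation value scaled by the tap, which makes the "K sample points before the sample point" indexing easy to verify by hand.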
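In claim 6, the third neural network outputs a frequency-domain representation of the excitation signal, while the filtering of claims 2 and 3 operates on time-domain excitation values. One plausible bridge is an inverse real FFT; this choice of inverse transform is an assumption, since the patent does not name the transform here.

```python
import numpy as np

def excitation_to_time_domain(freq_excitation, frame_len):
    """Recover a time-domain excitation signal from the
    frequency-domain representation output by the third network
    (claim 6). The inverse real FFT is an assumed choice."""
    return np.fft.irfft(freq_excitation, n=frame_len)
```

A forward/inverse round trip on a short frame confirms that the time-domain excitation values needed for synthesis are recovered exactly.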
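Claims 7 and 8 describe framing a speech signal and applying a time-frequency transform to each frame (FIG. 6 shows this done in a windowed overlapping manner with a short-time Fourier transform). A minimal sketch, assuming a Hann window and illustrative frame/hop sizes that the claims do not fix:

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Claim 8 sketch: split a time-domain speech signal into
    overlapping frames. frame_len and hop are assumed values."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

def frame_to_frequency_domain(frame):
    """Claim 7 sketch: windowed FFT of one frame as the
    time-frequency transform. The Hann window is an assumption."""
    window = np.hanning(len(frame))
    return np.fft.rfft(frame * window)
```

With a 50% hop, consecutive frames overlap by half their length, matching the windowed overlapping scheme of FIG. 6.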

Description

FIELD OF THE TECHNOLOGY

The present disclosure relates to the field of speech processing technologies, and specifically, to a speech enhancement method and apparatus, a device, and a storage medium.

BACKGROUND OF THE DISCLOSURE

Due to its convenience and timeliness, voice communication is increasingly widely applied; for example, speech signals are transmitted between conference participants of cloud conferencing. However, in voice communication, noise may be mixed into speech signals, and this noise leads to poor communication quality and greatly degrades the auditory experience of the user. Therefore, how to enhance speech to remove noise is a technical problem that urgently needs to be resolved in the related art. CN 111554323 A discloses a method for voice processing in a voice communication system for compensating lost packets.

SUMMARY

Embodiments of the present disclosure provide a speech enhancement method and apparatus, a device, and a storage medium, to implement speech enhancement and improve the quality of a speech signal. Other features and advantages of the present disclosure become obvious through the following detailed descriptions, or may be partially learned through the practice of the present disclosure. According to an aspect of the embodiments of the present disclosure, a speech enhancement method is provided according to claim 1. According to another aspect of the embodiments of the present disclosure, a speech enhancement apparatus is provided according to claim 10. According to another aspect of the embodiments of the present disclosure, an electronic device is provided, including: a processor; and a memory, storing computer-readable instructions, the computer-readable instructions, when executed by the processor, implementing the speech enhancement method described above.
According to another aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, storing computer-readable instructions, the computer-readable instructions, when executed by a processor, implementing the speech enhancement method described above. It is to be understood that the foregoing general descriptions and the following detailed descriptions are merely for illustration and explanation purposes and are not intended to limit the present disclosure.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings herein, which are incorporated into the specification and constitute a part of this specification, show embodiments that conform to the present disclosure, and are used for describing a principle of the present disclosure together with this specification. Apparently, the accompanying drawings in the following description show merely some embodiments of the present disclosure, and a person of ordinary skill in the art may still derive other drawings from the accompanying drawings without creative efforts. In the accompanying drawings:

FIG. 1 is a schematic diagram of a voice communication link in a Voice over Internet Protocol (VoIP) system according to one embodiment.
FIG. 2 is a schematic diagram of a digital model of generation of a speech signal.
FIG. 3a is a schematic diagram of a frequency response of an original speech signal.
FIG. 3b is a schematic diagram of a frequency response of a glottal filter obtained by decomposing the original speech signal.
FIG. 3c is a schematic diagram of a frequency response of an excitation signal obtained by decomposing the original speech signal.
FIG. 4 is a flowchart of a speech enhancement method according to an embodiment of the present disclosure.
FIG. 5 is a flowchart of step 440 of the embodiment corresponding to FIG. 4 in an embodiment.
FIG. 6 is a schematic diagram of performing a short-time Fourier transform on a speech frame in a windowed overlapping manner according to an embodiment of the present disclosure.
FIG. 7 is a flowchart of speech enhancement according to a specific embodiment of the present disclosure.
FIG. 8 is a schematic diagram of a first neural network according to an embodiment of the present disclosure.
FIG. 9 is a schematic diagram of an input and an output of a first neural network according to another embodiment of the present disclosure.
FIG. 10 is a schematic diagram of a second neural network according to an embodiment of the present disclosure.
FIG. 11 is a schematic diagram of a third neural network according to an embodiment of the present disclosure.
FIG. 12 is a block diagram of a speech enhancement apparatus according to an embodiment of the present disclosure.
FIG. 13 is a schematic structural diagram of a computer system adapted to implement an electronic device according to an embodiment of the present disclosure.

DESCRIPTION OF EMBODIMENTS

Now, exemplary implementations are described comprehensively with reference to the accompanying drawings. However, the exemplary implementations can be implemented in various forms and are not to be understood as being limited to the examples describ