EP-4315326-B1 - CONTEXT-BASED SPEECH ENHANCEMENT
Inventors
- BYUN, Kyungguen
- ZHANG, Shuhua
- KIM, Lae-Hoon
- VISSER, Erik
- MOON, Sunkuk
- MONTAZERI, Vahid
Dates
- Publication Date
- 20260513
- Application Date
- 20220204
Claims (12)
- A device (102) to perform speech enhancement, the device comprising: one or more processors (190) configured to: obtain input spectral data based on an input signal, the input signal representing sound that includes speech; and process, using a multi-encoder transformer, the input spectral data and context data to generate output spectral data that represents a speech enhanced version of the input signal, the processing comprising: providing the input spectral data to a first encoder of the multi-encoder transformer to generate first encoded data, the first encoder including a first attention network; obtaining the context data based on one or more data sources; providing the context data to at least a second encoder of the multi-encoder transformer to generate second encoded data, the second encoder including a second attention network; and providing the first encoded data and the second encoded data to a decoder attention network of a decoder of the multi-encoder transformer to generate output spectral data that corresponds to a speech enhanced version of the input spectral data.
- The device of claim 1, wherein the one or more data sources include at least one of the input signal or image data.
- The device of claim 2, further comprising a camera configured to generate the image data.
- The device of claim 1, wherein the decoder attention network comprises: a first multi-head attention network configured to process the first encoded data; a second multi-head attention network configured to process the second encoded data; and a combiner configured to combine outputs of the first multi-head attention network and the second multi-head attention network.
- The device of claim 1, wherein the decoder further comprises: a masked multi-head attention network coupled to an input of the decoder attention network; and a decoder feed forward network coupled to an output of the decoder attention network.
- The device of claim 1, the first encoder including a Mel filter bank configured to filter the input spectral data.
- The device of claim 1, further comprising an automatic speech recognition engine configured to generate text based on the input signal, wherein the context data includes the text.
- The device of claim 7, wherein the second encoder includes a grapheme-to-phoneme converter configured to process the text.
- A method of speech enhancement, the method comprising: obtaining (1602) input spectral data based on an input signal, the input signal representing sound that includes speech; and processing (1604), using a multi-encoder transformer, the input spectral data and context data to generate output spectral data that represents a speech enhanced version of the input signal, the processing comprising: providing the input spectral data to a first encoder of the multi-encoder transformer to generate first encoded data, the first encoder including a first attention network; obtaining the context data based on one or more data sources; providing the context data to at least a second encoder of the multi-encoder transformer to generate second encoded data, the second encoder including a second attention network; and providing the first encoded data and the second encoded data to a decoder attention network of a decoder of the multi-encoder transformer to generate output spectral data that corresponds to a speech enhanced version of the input spectral data.
- The method of claim 9, further comprising obtaining the context data from one or more data sources, the one or more data sources including at least one of the input signal or image data.
- A non-transitory computer-readable medium (1854) storing instructions that, when executed by one or more processors (190), cause the one or more processors (190) to: obtain input spectral data based on an input signal, the input signal representing sound that includes speech; and process, using a multi-encoder transformer, the input spectral data and context data to generate output spectral data that represents a speech enhanced version of the input signal, the processing comprising: providing the input spectral data to a first encoder of the multi-encoder transformer to generate first encoded data, the first encoder including a first attention network; obtaining the context data based on one or more data sources; providing the context data to at least a second encoder of the multi-encoder transformer to generate second encoded data, the second encoder including a second attention network; and providing the first encoded data and the second encoded data to a decoder attention network of a decoder of the multi-encoder transformer to generate output spectral data that corresponds to a speech enhanced version of the input spectral data.
- An apparatus comprising: means for obtaining input spectral data based on an input signal, the input signal representing sound that includes speech; and means for processing, using a multi-encoder transformer, the input spectral data and context data to generate output spectral data that represents a speech enhanced version of the input signal, the processing comprising: providing the input spectral data to a first encoder of the multi-encoder transformer to generate first encoded data, the first encoder including a first attention network; obtaining the context data based on one or more data sources; providing the context data to at least a second encoder of the multi-encoder transformer to generate second encoded data, the second encoder including a second attention network; and providing the first encoded data and the second encoded data to a decoder attention network of a decoder of the multi-encoder transformer to generate output spectral data that corresponds to a speech enhanced version of the input spectral data.
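The dual cross-attention and combiner recited in the claims can be illustrated with a small numerical sketch. Everything below is hypothetical and simplified for clarity: it uses a single attention head (the claims recite multi-head attention networks), an averaging combiner (the claims leave the combiner unspecified), and made-up toy dimensions. It only shows how a decoder state can attend to two separate encoder memories — one for spectral data, one for context data — and merge the results.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention (single head, for clarity).
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    return softmax(scores) @ v

def decoder_attention(decoder_state, enc1, enc2):
    # Cross-attend to each encoder's memory separately, then combine
    # the two attention outputs (averaging is one possible combiner).
    a1 = attention(decoder_state, enc1, enc1)
    a2 = attention(decoder_state, enc2, enc2)
    return 0.5 * (a1 + a2)

# Toy shapes: 4 decoder frames, 10 spectral frames, 6 context tokens, width 8.
rng = np.random.default_rng(0)
d = 8
dec = rng.standard_normal((4, d))
enc1 = rng.standard_normal((10, d))  # first encoder: input spectral data
enc2 = rng.standard_normal((6, d))   # second encoder: context data
out = decoder_attention(dec, enc1, enc2)
print(out.shape)  # (4, 8)
```

Note that the two memories may have different lengths (10 spectral frames versus 6 context tokens); cross-attention handles this naturally, which is why the context branch needs no alignment with the spectral branch.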
Description
II. Field

The present disclosure is generally related to speech enhancement.

III. Description of Related Art

Advances in technology have resulted in smaller and more powerful computing devices. For example, there currently exist a variety of portable personal computing devices, including wireless telephones such as mobile and smart phones, tablets, and laptop computers that are small, lightweight, and easily carried by users. These devices can communicate voice and data packets over wireless networks. Further, many such devices incorporate additional functionality such as a digital still camera, a digital video camera, a digital recorder, and an audio file player. Also, such devices can process executable instructions, including software applications, such as a web browser application, that can be used to access the Internet. As such, these devices can include significant computing capabilities.

Such computing devices often incorporate functionality to receive an audio signal from one or more microphones. For example, the audio signal may represent user speech captured by the microphones, external sounds captured by the microphones, or a combination thereof. Such devices may include applications that perform noise suppression and speech enhancement. For example, a device can analyze a noisy speech signal in a frequency domain, use a deep neural network to reduce the noise, and then reconstruct the speech. However, under some conditions, such techniques can fail to suppress noise. For example, abrupt and non-stationary noise, such as clapping, can be difficult to remove from the noisy speech signal. Improving a device's speech enhancement capability improves the performance of various speech-related applications that may be performed at the device, such as communications and speech-related recognition systems, including automatic speech recognition (ASR), speaker recognition, emotion recognition, and event detection.
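The frequency-domain pipeline described above (analyze, suppress, reconstruct) can be sketched for a single analysis frame. This is a generic illustration of spectral masking, not the claimed multi-encoder transformer: the fixed Wiener-like gain below merely stands in for the mask a trained network would predict, and the signal parameters (16 kHz rate, 512-sample frame, 440 Hz tone, noise level) are made up for the example.

```python
import numpy as np

# One analysis frame of a noisy signal: a speech-like tone plus broadband noise.
rng = np.random.default_rng(1)
fs, n = 16000, 512
t = np.arange(n) / fs
noisy = np.sin(2 * np.pi * 440 * t) + 0.3 * rng.standard_normal(n)

spec = np.fft.rfft(noisy * np.hanning(n))   # analysis: to the frequency domain
mag = np.abs(spec)

# A trained network would predict a per-bin mask; here a crude Wiener-like
# gain from an assumed flat noise-power estimate stands in for it.
noise_power = np.full_like(mag, np.median(mag) ** 2)
gain = np.maximum(mag**2 - noise_power, 0.0) / np.maximum(mag**2, 1e-12)

enhanced = np.fft.irfft(spec * gain, n=n)   # synthesis: back to the time domain
# enhanced.shape == (512,)
```

The stationary-noise assumption baked into such a fixed estimate is precisely what fails on abrupt noise like clapping, which motivates conditioning the enhancement on additional context as in the claims.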
CN 111 863 009 A describes training systems and methods for a context information prediction model. WO 2020/042707 A1 describes convolutional recurrent neural network-based single-channel real-time noise reduction methods. CN 110 867 192 A describes a speech enhancement method based on a gated recurrent encoding-decoding network.

IV. Summary

The present invention is defined by the appended claims. Further restricted embodiments are provided by the description.

V. Brief Description of the Drawings

FIG. 1 is a block diagram of a particular illustrative aspect of a system operable to perform context-based speech enhancement, in accordance with some examples of the present disclosure.
FIG. 2 is a diagram of a particular implementation of a speech enhancer of the system of FIG. 1, in accordance with some examples of the present disclosure.
FIG. 3 is a diagram of another particular implementation of the speech enhancer of FIG. 1, in accordance with some examples of the present disclosure.
FIG. 4 is a diagram of another particular implementation of the speech enhancer of FIG. 1, in accordance with some examples of the present disclosure.
FIG. 5 is a diagram of an illustrative aspect of an encoder of the speech enhancer of FIG. 1, in accordance with some examples of the present disclosure.
FIG. 6 is a diagram of an illustrative aspect of operations of components of the system of FIG. 1, in accordance with some examples of the present disclosure.
FIG. 7 illustrates an example of an integrated circuit operable to generate enhanced speech, in accordance with some examples of the present disclosure.
FIG. 8 is a diagram of a mobile device operable to generate enhanced speech, in accordance with some examples of the present disclosure.
FIG. 9 is a diagram of a headset operable to generate enhanced speech, in accordance with some examples of the present disclosure.
FIG. 10 is a diagram of a wearable electronic device operable to generate enhanced speech, in accordance with some examples of the present disclosure.
FIG. 11 is a diagram of a voice-controlled speaker system operable to generate enhanced speech, in accordance with some examples of the present disclosure.
FIG. 12 is a diagram of a camera operable to generate enhanced speech, in accordance with some examples of the present disclosure.
FIG. 13 is a diagram of a headset, such as a virtual reality or augmented reality headset, operable to generate enhanced speech, in accordance with some examples of the present disclosure.
FIG. 14 is a diagram of a first example of a vehicle operable to generate enhanced speech, in accordance with some examples of the present disclosure.
FIG. 15 is a diagram of a second example of a vehicle operable to generate enhanced speech, in accordance with some examples of the present disclosure.
FIG. 16 is a diagram of a particular implementation of a method of speech enhancement that may be performed by the device of FIG. 1, in accordance with some examples of the present disclosure.
FIG. 17 is a diagram of another particular implemen