US-12620406-B2 - System and method for speech enhancement in multichannel audio processing systems

US 12620406 B2

Abstract

A method, computer program product, and computing system for enhancement of audio signals received from a plurality of microphones. A multichannel audio signal is received from a plurality of microphones and is processed with a short-time discrete cosine transform (STDCT) to generate a real-valued spectral representation of the multichannel signal encoding both magnitude and phase information. Magnitude- and phase-dependent weights are generated, and an enhanced single-channel signal is produced based upon, at least in part, the spectral representation of the multichannel signal and the magnitude- and phase-dependent weights.

Inventors

  • Stanislav Kruchinin
  • Dushyant Sharma
  • Rong Gong

Assignees

  • MICROSOFT TECHNOLOGY LICENSING, LLC

Dates

Publication Date
2026-05-05
Application Date
2023-09-13

Claims (18)

  1. A method comprising: receiving a multichannel audio signal from a multichannel audio frontend that includes a plurality of microphones, the multichannel audio signal including spatial information associated with the multichannel audio frontend; processing the multichannel audio signal using a short-time discrete cosine transform (STDCT) to generate a real-valued spectral representation of the multichannel audio signal, the spatial information associated with the multichannel audio frontend being encoded within the real-valued spectral representation of the multichannel audio signal; generating magnitude- and phase-dependent weights associated with the real-valued spectral representation of the multichannel audio signal and the spatial information encoded therein; generating a single-channel representation of the multichannel signal based upon, at least in part, the spectral representation of the multichannel audio signal and the magnitude- and phase-dependent weights; generating, by a neural network, direction of arrival (DOA) information based on the magnitude- and phase-dependent weights; and transmitting the single-channel representation of the multichannel signal, along with metadata that includes the DOA information, to a cloud device, wherein the cloud device includes an automatic speech recognition (ASR) model configured to perform speech recognition on the single-channel representation of the multichannel signal, and wherein the cloud device is configured to use the DOA information for speaker localization or diarization to improve the speech recognition performance of the ASR model.
  2. The method of claim 1, wherein the STDCT comprises a modified discrete cosine transform (MDCT).
  3. The method of claim 2, wherein the MDCT comprises one of a floating point MDCT and an integer MDCT.
  4. The method of claim 2, further comprising generating direction of arrival information for the multichannel signal based on, at least in part, the magnitude- and phase-dependent weights.
  5. The method of claim 1, further comprising performing an inverse DCT on the single-channel representation to obtain an audio signal representation of the multichannel audio signal.
  6. The method of claim 1, further comprising: encoding the single-channel representation signal prior to transmission of the single-channel representation signal over a transmission channel.
  7. The method of claim 1, wherein the single-channel representation of the multichannel signal is further based upon a direction of arrival (DOA) of the signal.
  8. The method of claim 1, further comprising: transmitting the single-channel representation of the multichannel signal to an automatic speech recognition (ASR) backend configured for single-channel speech recognition.
  9. A system comprising: one or more processors; and a memory storing programming instructions for execution by the one or more processors, the programming instructions, upon execution by the one or more processors, causing the system to perform the following operations: receiving a multichannel audio signal from a multichannel audio frontend that includes a plurality of microphones, the multichannel audio signal including spatial information associated with the multichannel audio frontend; processing the multichannel audio signal using a short-time discrete cosine transform (STDCT) to generate a real-valued spectral representation of the multichannel audio signal, the spatial information associated with the multichannel audio frontend being encoded within the real-valued spectral representation of the multichannel audio signal; generating magnitude- and phase-dependent weights associated with the real-valued spectral representation of the multichannel audio signal and spatial information encoded therein; generating a single-channel representation of the multichannel signal based upon, at least in part, the spectral representation of the multichannel audio signal and the magnitude- and phase-dependent weights; generating, by a neural network, direction of arrival (DOA) information based on the magnitude- and phase-dependent weights; and transmitting the single-channel representation of the multichannel signal, along with metadata that includes the DOA information, to a cloud device, wherein the cloud device includes an automatic speech recognition (ASR) model configured to perform speech recognition on the single-channel representation of the multichannel signal, and wherein the cloud device is configured to use the DOA information for speaker localization or diarization to improve the speech recognition performance of the ASR model.
  10. The system of claim 9, wherein the STDCT comprises a modified discrete cosine transform (MDCT).
  11. The system of claim 10, wherein the MDCT comprises one of a floating point MDCT and an integer MDCT.
  12. The system of claim 9, wherein the programming instructions further cause the system to perform the following operation: generating direction of arrival information for the multichannel signal based on, at least in part, the magnitude- and phase-dependent weights.
  13. The system of claim 9, wherein the programming instructions further cause the system to perform the following operation: performing an inverse STDCT on the single-channel representation to obtain an audio signal representation of the multichannel audio signal.
  14. The system of claim 9, wherein the programming instructions further cause the system to perform the following operation: encoding the single-channel representation signal prior to transmission of the single-channel representation signal over a transmission channel.
  15. A computer program product residing on a non-transitory computer readable medium having programming instructions stored thereon which, when executed by one or more processors of a system, cause the system to perform the following operations comprising: receiving a multichannel audio signal from a multichannel audio frontend that includes a plurality of microphones, the multichannel audio signal including spatial information associated with the multichannel audio frontend; processing the multichannel audio signal with a modified discrete cosine transform (MDCT) to generate a spectral representation of the multichannel audio signal, the spatial information associated with the multichannel audio frontend being encoded within the spectral representation of the multichannel audio signal; generating magnitude- and phase-dependent weights associated with the spectral representation of the multichannel audio signal and spatial information encoded therein; generating a single-channel representation of the multichannel signal based upon, at least in part, the spectral representation of the multichannel audio signal and the magnitude- and phase-dependent weights; generating, by a neural network, direction of arrival (DOA) information based on the magnitude- and phase-dependent weights; and transmitting the single-channel representation of the multichannel signal, along with metadata that includes the DOA information, to a cloud device, wherein the cloud device includes an automatic speech recognition (ASR) model configured to perform speech recognition on the single-channel representation of the multichannel signal, and wherein the cloud device is configured to use the DOA information for speaker localization or diarization to improve the speech recognition performance of the ASR model.
  16. The computer program product of claim 15, wherein the MDCT comprises one of a floating point MDCT and an integer MDCT.
  17. The computer program product of claim 15, further comprising generating direction of arrival information for the multichannel signal based on, at least in part, the magnitude- and phase-dependent weights.
  18. The computer program product of claim 15, wherein the programming instructions further cause the system to perform the following operation: performing an inverse MDCT on the single-channel representation to obtain an audio signal representation of the multichannel audio signal.
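The processing steps recited in the claims — per-channel MDCT, weighting, and combination into a single-channel spectrum — can be sketched as follows. This is an illustrative toy, not the patented implementation: the MDCT is written in its direct form, and the per-bin energy weighting is a hypothetical stand-in for the learned magnitude- and phase-dependent weights produced by the neural network.

```python
import math

def mdct(frame):
    """Direct-form MDCT: a 2N-sample frame yields N real coefficients.
    Magnitude and phase are both encoded in the signed, real-valued
    coefficients, so no complex spectrum is required."""
    two_n = len(frame)
    n_bins = two_n // 2
    return [
        sum(frame[n] * math.cos(math.pi / n_bins
                                * (n + 0.5 + n_bins / 2) * (k + 0.5))
            for n in range(two_n))
        for k in range(n_bins)
    ]

def enhance_frame(channels):
    """Combine per-channel MDCT spectra into one enhanced spectrum.
    The per-bin energy weighting below is only a placeholder for the
    learned magnitude- and phase-dependent weights of the claims."""
    spectra = [mdct(ch) for ch in channels]
    enhanced = []
    for k in range(len(spectra[0])):
        coeffs = [s[k] for s in spectra]
        mags = [abs(c) for c in coeffs]
        total = sum(mags) + 1e-12  # avoid division by zero in silent bins
        enhanced.append(sum((m / total) * c for m, c in zip(mags, coeffs)))
    return enhanced
```

With identical input channels the weighting is transparent, so the enhanced spectrum equals each channel's own spectrum; real multichannel input would instead emphasize the bins where channels agree.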

Description

BACKGROUND

Speech processing systems (e.g., automatic speech recognition (ASR) systems, biometric voice systems, etc.) suffer degraded recognition accuracy when given far-field audio, i.e., when the speakers are distant from the microphone. The degradation is due to corruption of the far-field speech signal by reverberation and background noise. Compared with a single microphone, a microphone array device comprising multiple microphones can capture multichannel audio as the input to a speech processing backend system (e.g., an ASR backend system), alleviating this degradation. However, since a speech processing backend is usually designed to receive a single-channel audio input, a speech processing frontend component that receives the multichannel audio and emits single-channel audio may be used to bridge the gap between the multichannel audio input and the speech processing backend. Multichannel signals may be processed by an ASR system to transcribe or otherwise process conversational speech. Spatial information contained in the multichannel audio can enhance the system's ability to accurately capture the nuances of conversational speech, enabling more robust transcription and analysis. Speech processing methods that consider this information are particularly beneficial in scenarios where distinguishing between multiple speakers or capturing environmental cues is desirable for the transcription process.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1-4 are diagrammatic views of various audio codecs in accordance with the implementation of a multichannel audio signal enhancement process; FIG. 5 is a flow chart of one implementation of the audio signal enhancement process; FIG. 6 is a diagrammatic view of an implementation of the audio signal enhancement process; FIG. 7 is a further diagrammatic view of an implementation of the audio signal enhancement process of FIG. 6; and FIG. 8 is a diagrammatic view of a computer system and the audio signal enhancement process coupled to a distributed computing network. Like reference symbols in the various drawings indicate like elements.

DETAILED DESCRIPTION

As will be discussed in greater detail below, implementations of the present disclosure address the problems of effectively utilizing a multichannel (e.g., microphone array) edge device to capture a distant speech signal and of using spatial information efficiently to enhance the signal for an end-to-end ASR system. One aspect of solving the distant-ASR problem lies in the employment of microphone arrays and the exploitation of spatial information to improve signal quality. Microphone arrays include multiple microphones spaced apart at known distances and relative positions; the array is therefore capable of capturing multichannel audio signals that contain spatial information. Spatial information in a microphone array refers to the location and orientation of the individual microphones relative to each other within the array, as well as to the sound sources in the surrounding environment. This spatial arrangement of microphones enables the array to capture and analyze audio signals arriving from different directions, providing valuable cues for various audio processing applications. Microphone arrays with well-defined spatial arrangements are particularly beneficial in challenging audio environments where noise, reverberation, and multiple sound sources are present. However, since ASR systems are configured to process single-channel signals, the spatial information contained in a multichannel signal needs to remain accessible when the multichannel signal is transformed into a single-channel signal for processing by the ASR system.
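The spatial cue described above can be made concrete with a minimal far-field geometry sketch for a two-microphone pair: the time difference of arrival (TDOA) between the microphones determines the direction of arrival. This is a textbook geometric illustration of the cue, assuming plane-wave (far-field) propagation; it is not the neural-network weighting of the disclosure.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly room temperature

def doa_from_tdoa(tdoa_s, mic_spacing_m):
    """Far-field direction of arrival (radians from broadside) for a
    two-microphone pair, given the time difference of arrival between
    the microphones: sin(theta) = c * tdoa / d."""
    sin_theta = SPEED_OF_SOUND * tdoa_s / mic_spacing_m
    sin_theta = max(-1.0, min(1.0, sin_theta))  # clamp numeric noise
    return math.asin(sin_theta)
```

For example, a 5 cm pair observing a delay of about 72.9 microseconds corresponds to a source roughly 30 degrees off broadside, while zero delay means the source lies on the broadside axis.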
A short-time discrete cosine transform (STDCT), and in particular the modified discrete cosine transform (MDCT), is used to generate a time-frequency representation (i.e., each individual channel's spectrum) of the multichannel audio signal in a speech enhancement frontend for ASR. This representation encodes phase information within a real-valued representation of the spectrum, which results in better speech enhancement because the neural network computes magnitude- and phase-dependent weights similar to the weights computed in beamforming processes. The representation also facilitates reconstruction of the time-domain waveform of the enhanced signal and extraction of spatial information, e.g., direction of arrival (DOA), from the computed weights.

The details of one or more implementations are set forth in the accompanying drawings and the description below. Other features and advantages will become apparent from the description, the drawings, and the claims.

The Multichannel Signal Enhancement Process

Referring to FIGS. 1-7, implementations of the present disclosure are directed to a multichannel signal enhancement method and system using the spatial information encoded in a phase-dependent signal to