US-12620404-B2 - Deep source separation architecture
Abstract
A speech separation server comprises a deep-learning encoder with nonlinear activation. The encoder is programmed to take a mixture audio waveform in the time domain, learn generalized patterns from the mixture audio waveform, and generate an encoded representation that effectively characterizes the mixture audio waveform for speech separation.
Inventors
- Berkan KADIOGLU
- Michael Getty HORGAN
- Jordi Pons PUIG
- Xiaoyu Liu
Assignees
- DOLBY LABORATORIES LICENSING CORPORATION
- DOLBY INTERNATIONAL AB
Dates
- Publication Date: 2026-05-05
- Application Date: 2020-10-20
- Priority Date: 2019-10-21
Claims (14)
- 1 . A computer-implemented method of separating audio signals from different speech sources, comprising:
receiving, by a processor, a mixture audio signal comprising audio signals from a plurality of speech sources in a time domain;
transforming, by the processor, the mixture audio signal into an encoded representation by an encoder convolutional neural network (CNN) with multiple convolutional layers and nonlinear activation, wherein the encoder CNN comprises a first stack of convolutional layers, each convolutional layer comprising a respective one-dimensional (1-D) convolution followed by a gated linear unit (GLU) activation, and wherein transforming the mixture audio signal into the encoded representation comprises:
transforming, by the processor, the mixture audio signal in the time domain into an intermediate encoded representation using at least a first convolutional layer, the intermediate encoded representation being represented by N dimensions; and
transforming, by the processor, the intermediate encoded representation to the encoded representation using the first stack of convolutional layers including at least three subsequent convolutional layers configured to hierarchically transform the intermediate encoded representation into a non-linear latent space, each of the at least three subsequent convolutional layers implementing a respective 1-D convolutional operation with N learnable kernels and outputting an output representation with N dimensions,
wherein an initial convolutional layer of the at least three subsequent convolutional layers extracts a first pattern from the intermediate encoded representation,
wherein a final convolutional layer of the at least three subsequent convolutional layers extracts a second pattern from the intermediate encoded representation, the second pattern having a lower resolution than the first pattern, and
wherein the output representation of the final layer of said at least three subsequent convolutional layers is the encoded representation;
separating the encoded representation into a plurality of individual representations corresponding to the plurality of speech sources; and
transforming the plurality of individual representations into a plurality of audio signals corresponding to the plurality of speech sources by a decoder CNN including multiple convolutional layers and nonlinear activation.
- 2 . The computer-implemented method of claim 1 , wherein each of the plurality of audio signals comprises a waveform in a time domain.
- 3 . The computer-implemented method of claim 1 , wherein the first convolutional layer has linear activation or no activation.
- 4 . The computer-implemented method of claim 1 , wherein the nonlinear activation is selected from the group consisting of a parametric rectified linear unit (PReLU), a gated linear unit (GLU), a GLU with normalization, a leaky ReLU, a Sigmoid function, and a TanH function.
- 5 . The computer-implemented method of claim 1 , wherein the encoder CNN comprises one or more residual and skip connections.
- 6 . The computer-implemented method of claim 1 , wherein a structure of the decoder CNN corresponds to a structure of the encoder CNN.
- 7 . The computer-implemented method of claim 1 , further comprising: receiving a plurality of sample audio signals corresponding to the plurality of audio sources; and building the encoder CNN based on the plurality of sample audio signals using an objective function comprising scale-invariant signal-to-noise ratio (SI-SNR) with permutation-invariant training.
- 8 . The computer-implemented method of claim 1 , wherein the separating is performed by a separator CNN comprising stacked dilated convolutional blocks.
- 9 . A non-transitory, computer-readable storage medium storing computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform the method of claim 1 .
- 10 . The non-transitory, computer-readable storage medium of claim 9 , wherein the method further comprises: receiving the plurality of individual representations from a separator; applying the decoder CNN to each individual representation of the plurality of individual representations to generate an audio signal that spans a range of time, the decoder CNN having multiple convolutional layers and nonlinear activation; and transmitting the plurality of audio signals.
- 11 . The non-transitory, computer-readable storage medium of claim 9 , wherein the nonlinear activation comprises a PReLU, a GLU, a GLU with normalization, a leaky ReLU, a Sigmoid function, or a TanH function.
- 12 . The non-transitory, computer-readable storage medium of claim 9 , wherein the encoder CNN comprises one or more residual and skip connections.
- 13 . A system for separating audio signals from different sources, comprising: one or more processors; and a memory storing computer-executable instructions that, when executed by the one or more processors, cause the one or more processors to perform the method of claim 1 .
- 14 . The system of claim 13 , wherein the decoder CNN comprises one or more convolutional layers with residual and skip connections.
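The encoder and separator stages recited in the claims can be pictured concretely. Below is a minimal NumPy sketch, not the patented implementation: a first convolutional layer (with linear activation, per claim 3) maps the time-domain mixture to an N-dimensional intermediate representation; at least three subsequent 1-D convolutions, each gated by a GLU (claim 1), hierarchically refine it; and a separator of stacked dilated convolutional blocks with residual and skip connections (claims 5 and 8) operates on the encoded representation. All shapes, kernel sizes, and the ReLU stand-in activation are illustrative assumptions.

```python
import numpy as np

def conv1d(x, kernels, stride=1, dilation=1):
    """1-D convolution over x: (c_in, time) with kernels: (c_out, c_in, k)."""
    c_out, c_in, k = kernels.shape
    span = dilation * (k - 1) + 1
    t_out = (x.shape[1] - span) // stride + 1
    out = np.empty((c_out, t_out))
    for t in range(t_out):
        window = x[:, t * stride : t * stride + span : dilation]  # (c_in, k)
        out[:, t] = np.tensordot(kernels, window, axes=([1, 2], [0, 1]))
    return out

def glu(x):
    """Gated linear unit: split the channel axis in half, gate with a sigmoid."""
    a, b = np.split(x, 2, axis=0)
    return a * (1.0 / (1.0 + np.exp(-b)))

def encode(mixture, first_kernels, stack_kernels):
    """Linear first layer, then a stack of GLU-gated 1-D conv layers that
    hierarchically transform the intermediate representation (claim 1)."""
    h = conv1d(mixture[None, :], first_kernels, stride=8)  # (N, T')
    for kernels in stack_kernels:  # each (2N, N, 1): GLU halves 2N -> N
        h = glu(conv1d(h, kernels))
    return h

def separate(encoded, block_kernels):
    """Stacked dilated conv blocks: dilation doubles per block so the receptive
    field grows exponentially, with residual and skip connections (claim 8)."""
    skip_sum = np.zeros_like(encoded)
    h = encoded
    for d, kernels in enumerate(block_kernels):
        pad = 2 ** d * (kernels.shape[2] - 1) // 2
        hp = np.pad(h, ((0, 0), (pad, pad)))
        out = np.maximum(conv1d(hp, kernels, dilation=2 ** d), 0)  # ReLU stand-in
        skip_sum += out  # skip connection, summed across blocks
        h = h + out      # residual connection into the next block
    return skip_sum

rng = np.random.default_rng(0)
N = 4
first = rng.standard_normal((N, 1, 16))  # 16-sample window, hop 8
stack = [rng.standard_normal((2 * N, N, 1)) for _ in range(3)]
blocks = [rng.standard_normal((N, N, 3)) for _ in range(3)]
encoded = encode(rng.standard_normal(64), first, stack)  # shape (4, 7)
separated = separate(encoded, blocks)                    # shape (4, 7)
```

A real decoder would mirror the encoder's structure (claim 6), using transposed convolutions to map each separated representation back to a time-domain waveform.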
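Claim 7's objective, SI-SNR with permutation-invariant training (PIT), can likewise be sketched. The idea: SI-SNR projects the estimate onto the reference so the measure is invariant to rescaling, and PIT evaluates the loss under every assignment of estimated to reference sources, keeping the best. A minimal NumPy sketch; the patent's exact formulation may differ in detail:

```python
import numpy as np
from itertools import permutations

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB (higher is better)."""
    est = est - est.mean()
    ref = ref - ref.mean()
    target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    noise = est - target
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

def pit_si_snr_loss(estimates, references):
    """Negative mean SI-SNR under the best source permutation (PIT)."""
    n = len(references)
    best = max(
        sum(si_snr(estimates[p[i]], references[i]) for i in range(n)) / n
        for p in permutations(range(n))
    )
    return -best

# PIT makes the loss indifferent to output ordering: swapping the estimated
# sources yields the same loss as the identity assignment.
t = np.linspace(0.0, 1.0, 800)
s1 = np.sin(2.0 * np.pi * 5.0 * t)
s2 = np.sign(np.sin(2.0 * np.pi * 3.0 * t))
assert abs(pit_si_snr_loss([s2, s1], [s1, s2])
           - pit_si_snr_loss([s1, s2], [s1, s2])) < 1e-9
```

Brute-force enumeration of permutations is factorial in the number of sources, which is acceptable for the two- or three-speaker mixtures typical of speech separation training.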
Description
CROSS-REFERENCE TO RELATED APPLICATIONS

This patent application claims priority to Spanish Patent Application No. P201930932, filed on Oct. 21, 2019, U.S. Provisional Patent Application No. 62/957,870, filed on Jan. 7, 2020, and U.S. Provisional Patent Application No. 63/087,788, filed on Oct. 5, 2020, each of which is incorporated by reference in its entirety.

TECHNICAL FIELD

The present Application relates to speech recognition and deep machine learning. More specifically, example embodiment(s) described below relate to improving neural network architecture for better separation of speech sources.

BACKGROUND

The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.

Identifying individual speech sources from mixture speech has been challenging. Learning from a large amount of data has led to some progress in such identification. It can be helpful to further utilize deep machine learning to improve the separation of speech sources.

BRIEF DESCRIPTION OF THE DRAWINGS

The example embodiment(s) of the present invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which:

FIG. 1 illustrates example components of a speech separation server computer in accordance with the disclosed embodiments.
FIG. 2 illustrates the Conv-TasNet neural network.
FIG. 3 illustrates an example neural network for speech separation in accordance with the disclosed embodiments.
FIG. 4 illustrates an example neural network for speech separation in accordance with the disclosed embodiments.
FIG. 5 illustrates an example convolutional layer having a modified gated linear unit in accordance with the disclosed embodiments.
FIG. 6 illustrates an example neural network for speech separation in accordance with the disclosed embodiments.
FIG. 7 illustrates an example convolutional layer having skip and residual connections in accordance with the disclosed embodiments.
FIG. 8 illustrates an example process performed with a speech separation server computer in accordance with some embodiments described herein.
FIG. 9 illustrates an example process performed with a speech separation server computer in accordance with some embodiments described herein.
FIG. 10 is a block diagram that illustrates a computer system upon which an embodiment of the invention may be implemented.

DESCRIPTION OF THE EXAMPLE EMBODIMENTS

In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the example embodiment(s). It will be apparent, however, that the example embodiment(s) may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the example embodiment(s).

Embodiments are described in sections below according to the following outline:
1. GENERAL OVERVIEW
2. EXAMPLE COMPUTER COMPONENTS
3. FUNCTIONAL DESCRIPTIONS
 3.1. CONV-TASNET
 3.2. DEEP, NONLINEAR ENCODER AND DECODER
  3.2.1.1. ARCHITECTURE
  3.2.1.2. TRAINING
4. EXAMPLE PROCESSES
5. EXPERIMENTAL RESULTS
6. HARDWARE IMPLEMENTATION
7. EXTENSIONS AND ALTERNATIVES

1. GENERAL OVERVIEW

A speech separation server computer ("server") and related methods are disclosed. In some embodiments, the server comprises a deep-learning encoder with nonlinear activation programmed to take a mixture audio waveform in the time domain, learn generalized patterns from the mixture audio waveform, and generate an encoded representation that effectively characterizes the mixture audio waveform for speech separation.
The mixture audio waveform comprises utterances from multiple vocal sound sources over a period of time. The server also comprises a deep-learning decoder with nonlinear activation programmed to take an encoded representation of individual waveforms corresponding to separate speech sources, and generate the individual waveforms.

In some embodiments, the encoder is a convolutional network comprising multiple convolutional layers. At least one of the convolutional layers includes a one-dimensional (1-D) filter of a relatively small size. At least one of the convolutional layers includes a nonlinear activation function, such as a parametric rectified linear unit (PReLU) or a gated linear unit (GLU).

In some embodiments, the server is programmed to receive a mixture audio waveform spanning a time period in the time domain. For example, the mixture audio waveform could be the mixture of two utterances within ten minutes from two different speakers. The server is programmed to further generate, from the mixture audio waveform, waveform segments spanning