KR-20260064352-A - Method And Apparatus for Speech Enhancement
Abstract
A method and apparatus for voice enhancement are disclosed. According to one aspect of the present disclosure, a computer-implemented method performed by one or more computers comprises: receiving an input audio waveform including a raw voice signal; generating one or more feature vectors representing the input audio waveform using an encoder neural network; generating a plurality of quantized vectors for the feature vector using a plurality of vector quantizers, wherein one of the plurality of quantized vectors is a first quantized vector containing semantic information of the voice signal, and another of the plurality of quantized vectors is a second quantized vector containing acoustic information of the voice signal; and generating an output audio waveform by processing some of the plurality of quantized vectors using a decoder neural network.
Inventors
- 정원진
Assignees
- 에스케이텔레콤 주식회사 (SK Telecom Co., Ltd.)
Dates
- Publication Date
  - 2026-05-07
- Application Date
  - 2024-10-31
Claims (20)
- A computer-implemented method performed by one or more computers, the method comprising: receiving an input audio waveform containing a raw voice signal; generating one or more feature vectors representing the input audio waveform using an encoder neural network; generating a plurality of quantized vectors for the feature vector using a plurality of vector quantizers, wherein one of the plurality of quantized vectors is a first quantized vector containing semantic information of the voice signal and another of the plurality of quantized vectors is a second quantized vector containing acoustic information of the voice signal; and generating an output audio waveform by processing some of the plurality of quantized vectors using a decoder neural network.
- The method of claim 1, wherein the input audio waveform includes a noise signal, and the output audio waveform is a version of the raw voice signal in which the noise signal is removed or reduced.
- The method of claim 1, wherein the plurality of vector quantizers are arranged sequentially, and generating the plurality of quantized vectors comprises: generating the first quantized vector using the first vector quantizer in the sequence of the plurality of vector quantizers; and generating the second quantized vector using the second vector quantizer in the sequence of the plurality of vector quantizers.
- The method of claim 1, wherein generating the output audio waveform comprises generating the output audio waveform by processing the first quantized vector and the second quantized vector using the decoder neural network.
- A learning method performed by one or more computers for jointly training an encoder neural network, a decoder neural network, and a plurality of vector quantizers, the method comprising: receiving training data including a clean voice signal and a noise signal; processing an input audio waveform including the clean voice signal and the noise signal using the encoder neural network to generate a feature vector of the input audio waveform; processing the feature vector using the plurality of vector quantizers to generate a plurality of quantized vectors, wherein one of the plurality of quantized vectors is a first quantized vector representing semantic information of the voice signal and another of the plurality of quantized vectors is a second quantized vector representing acoustic information of the voice signal; and processing the first quantized vector and the second quantized vector using the decoder neural network to generate an output audio waveform.
- The learning method of claim 5, wherein the plurality of vector quantizers are arranged sequentially, and processing the feature vector using the plurality of vector quantizers comprises: quantizing the feature vector using the first vector quantizer in the sequence of vector quantizers to generate the first quantized vector; and quantizing the difference between the feature vector and the first quantized vector using the second vector quantizer in the sequence of vector quantizers to generate the second quantized vector.
- The learning method of claim 5, further comprising: determining a reconstruction loss based on the difference between the output audio waveform and the audio waveform of the clean voice signal; determining the gradient of a loss function including the reconstruction loss; and updating one or more of the parameter set of the encoder neural network, the parameter set of the decoder neural network, and the codebooks of the plurality of vector quantizers based on the gradient of the loss function.
- The learning method of claim 7, wherein the encoder neural network, the decoder neural network, and the plurality of vector quantizers are jointly trained together with a discriminator neural network, the learning method further comprising: receiving, by the discriminator neural network, an audio waveform; and processing the audio waveform using the discriminator neural network to generate a discriminator score indicating the likelihood that the audio waveform is the audio waveform of the clean voice signal or the output audio waveform, wherein the loss function includes an adversarial loss determined based on the discriminator score, and wherein the updating further comprises updating the parameter set of the discriminator neural network based on the gradient of the loss function.
- The learning method of claim 7, further comprising: processing the input audio waveform using a first teacher model to generate a first latent vector representing the semantic information of the clean voice signal; processing the input audio waveform using a second teacher model to generate a second latent vector representing the acoustic information of the clean voice signal; and aligning the first quantized vector to the first latent vector and aligning the second quantized vector to the second latent vector.
- The learning method of claim 9, wherein the first teacher model includes a HuBERT model, and the first latent vector includes semantic information generated by the HuBERT model.
- The learning method of claim 9, wherein the second teacher model includes an ECAPA-TDNN model, and the second latent vector includes acoustic information generated by the ECAPA-TDNN model.
- The learning method of claim 9, wherein aligning the quantized vectors to the respective latent vectors comprises: determining a first distillation loss based on the difference between the first quantized vector and the first latent vector; and determining a second distillation loss based on the difference between the second quantized vector and the second latent vector, and wherein the loss function includes the first distillation loss and the second distillation loss.
- The learning method of claim 12, wherein the second distillation loss is a Kullback-Leibler divergence loss.
- The learning method of claim 9, wherein the encoder neural network and the plurality of vector quantizers are jointly trained with a multi-head attention neural network, and aligning the second quantized vector to the second latent vector comprises: processing the second latent vector using the multi-head attention neural network; and aligning the second quantized vector to the second latent vector processed by the multi-head attention neural network.
- The learning method of claim 7, further comprising: processing the audio waveform of the noise signal using a second encoder to generate a third latent vector representing the noise signal; and aligning the last residual vector generated by the plurality of vector quantizers to the third latent vector.
- The learning method of claim 15, wherein the second encoder is the encoder that processes the input audio waveform, and the third latent vector is generated by that encoder.
- The learning method of claim 15, wherein aligning the last residual vector to the third latent vector comprises determining a noise loss based on the difference between the last residual vector and the third latent vector, and wherein the loss function includes the noise loss.
- The learning method of claim 17, wherein the noise loss is a cosine similarity loss.
- An apparatus comprising: at least one memory storing instructions; and at least one processor, wherein the at least one processor, by executing the instructions: receives an input audio waveform containing a raw voice signal; generates one or more feature vectors representing the input audio waveform using an encoder neural network; generates a plurality of quantized vectors for the feature vector using a plurality of vector quantizers, wherein one of the plurality of quantized vectors includes semantic information of the voice signal and another of the plurality of quantized vectors includes acoustic information of the voice signal; and generates an output audio waveform by processing some of the plurality of quantized vectors using a decoder neural network.
- A computer program stored on a computer-readable recording medium for executing each process included in the method according to any one of claims 1 through 18.
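The claimed pipeline (claims 1, 3, and 6) can be illustrated with a minimal numpy sketch of residual vector quantization: the encoder's feature vectors pass through sequential quantizers, each stage quantizing the residual left by the previous one, and the decoder reconstructs from only the first (semantic) and second (acoustic) stages. All class and function names here are hypothetical, and the `encoder`/`decoder` callables stand in for the neural networks of the disclosure; this is a sketch of the quantization scheme, not the patented implementation.

```python
import numpy as np

class VectorQuantizer:
    """One stage of a residual VQ: nearest-neighbour codebook lookup."""
    def __init__(self, codebook: np.ndarray):
        self.codebook = codebook  # shape (K, D): K codewords of dimension D

    def quantize(self, x: np.ndarray) -> np.ndarray:
        # For each input vector (rows of x, shape (T, D)), pick the
        # closest codeword by squared Euclidean distance.
        dists = ((x[:, None, :] - self.codebook[None, :, :]) ** 2).sum(-1)
        return self.codebook[dists.argmin(axis=1)]

def enhance(waveform, encoder, quantizers, decoder):
    """Encoder -> sequential (residual) quantizers -> decoder.

    The first quantizer output plays the role of the semantic token and
    the second the acoustic token; later residual stages (which would
    absorb the noise component) are dropped before decoding.
    """
    feats = encoder(waveform)          # (T, D) feature vectors
    residual = feats
    quantized = []
    for vq in quantizers:
        q = vq.quantize(residual)      # quantize the current residual
        quantized.append(q)
        residual = residual - q        # pass what remains to the next stage
    # Decode from the first two stages only (semantic + acoustic).
    return decoder(quantized[0] + quantized[1])
```

Because each stage quantizes what the previous stages failed to capture, dropping the final residual stages is what discards the (irregular, hard-to-model) noise component before decoding.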
Description
Method and Apparatus for Speech Enhancement

The present disclosure relates to a method and apparatus for speech enhancement. More specifically, it relates to a method and apparatus capable of enhancing speech using a neural speech codec.

The following description merely provides background information related to the present embodiment and does not constitute prior art. Speech enhancement is a technology that removes noise from a speech signal. One conventional approach to improving speech enhancement performance involves converting the speech signal into a latent vector in the frequency domain using an encoder neural network and then removing the noise signal contained in the latent vector during the reconstruction process using a decoder neural network. However, while phase information is required to restore a signal converted into a spectrogram to the original signal, speech enhancement performance is degraded by the difference between the noisy phase and the phase of the clean signal. To address this problem, a method was proposed that transforms speech signals into latent vectors in the time domain. This approach demonstrates better speech enhancement performance by using self-attention to remove noise from the latent vectors and reconstruct speech without phase changes. However, since the noise contained in raw speech signals is random, there are limitations in predicting this noise using a pre-trained model. As described above, conventional speech enhancement methods have focused on identifying the noise in the raw speech signal and removing it; but because noise takes irregular values, it is difficult to identify, which has limited improvements in speech enhancement performance.

FIG. 1 is a block diagram schematically illustrating a voice enhancement system according to one embodiment of the present disclosure.
FIG. 2 is an exemplary diagram illustrating a learning system for jointly training an encoder neural network, a residual vector quantizer, and a decoder neural network included in a speech enhancement system. FIG. 3 is a flowchart illustrating the process of performing voice enhancement according to one embodiment of the present disclosure. FIG. 4 is a flowchart illustrating the process of jointly training an encoder neural network, a decoder neural network, and a residual vector quantizer included in a speech enhancement system according to one embodiment of the present disclosure. FIG. 5 is a block diagram schematically illustrating a voice enhancement system according to another embodiment of the present disclosure. FIG. 6 is a schematic block diagram illustrating an exemplary computing device that can be used to implement a method or device according to the present disclosure.

Some embodiments of the present disclosure are described in detail below with reference to exemplary drawings. It should be noted that in assigning reference numerals to the components of each drawing, the same components are given the same reference numeral whenever possible, even if they are shown in different drawings. Furthermore, in describing the present disclosure, if it is determined that a detailed description of related known components or functions could obscure the essence of the present disclosure, such detailed description is omitted. In describing the components of the embodiments according to the present disclosure, symbols such as first, second, i), ii), a), b), etc., may be used. These symbols are intended only to distinguish the components from other components, and the essence, order, or sequence of the components is not limited by the symbols.
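The joint training procedure of FIG. 4, as recited in claims 7, 13, and 18, combines a reconstruction loss with distillation and noise-alignment terms. The sketch below writes those three terms in numpy. The function names are hypothetical, L1 reconstruction and softmax normalisation before the KL divergence are assumptions (the claims specify only "difference" and "Kullback-Leibler divergence loss"), and the `1 - cos` form of the cosine similarity loss is likewise an assumption.

```python
import numpy as np

def reconstruction_loss(output_wav: np.ndarray, clean_wav: np.ndarray) -> float:
    # Claim 7: loss based on the difference between the decoded waveform
    # and the clean target; an L1 distance is assumed here.
    return float(np.abs(output_wav - clean_wav).mean())

def kl_distillation_loss(student: np.ndarray, teacher: np.ndarray,
                         eps: float = 1e-8) -> float:
    # Claim 13: the second (acoustic) distillation loss is a KL divergence.
    # The vectors are turned into distributions with a softmax first --
    # an assumption, since the patent does not spell out the normalisation.
    def softmax(v):
        e = np.exp(v - v.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    p, q = softmax(teacher), softmax(student)
    return float((p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1).mean())

def cosine_noise_loss(last_residual: np.ndarray, noise_latent: np.ndarray,
                      eps: float = 1e-8) -> float:
    # Claim 18: align the final RVQ residual with the noise latent vector
    # by maximising cosine similarity, i.e. minimising 1 - cos.
    num = (last_residual * noise_latent).sum(axis=-1)
    den = (np.linalg.norm(last_residual, axis=-1)
           * np.linalg.norm(noise_latent, axis=-1) + eps)
    return float((1.0 - num / den).mean())
```

In training, the gradient of the weighted sum of these terms (plus the adversarial loss of claim 8) would drive the updates to the encoder, decoder, and codebooks described in claim 7.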
When a part in the specification is described as 'comprising' or 'having' a component, this means that, unless explicitly stated otherwise, it does not exclude other components but may include additional components. The detailed description set forth below, together with the accompanying drawings, is intended to describe exemplary embodiments of the present disclosure and is not intended to represent the only embodiment in which the present disclosure can be practiced.

In the present disclosure, a raw speech signal refers to a speech signal containing a noise signal. In the present disclosure, a clean speech signal refers to a speech signal in which the noise signal is removed or reduced from the raw speech signal. In the present disclosure, the raw speech signal may be expressed as a combination of the clean speech signal and the noise signal.

A voice enhancement system (100) can generate a clean voice signal by processing a raw voice signal. More specifically, the voice enhancement system (100) can generate a first token representing semantic information of the voice within the raw voice signal and a second token representing acoustic information of the voice within the raw voice signal by processing a continuous raw voice signal. Additionally, the voice enhancement system (100) can generate a clean voice