US-20260128052-A1 - APPARATUS AND METHOD FOR REMOVING AMBIENT NOISE IN SPEECH WAVEFORM BY USING BAND-PASS FILTER AND DEEP LEARNING
Abstract
The present invention relates to an apparatus and a method for removing ambient noise from a speech waveform by using a band-pass filter and deep learning. The apparatus and method remove ambient noise from a speech waveform combined with that noise and extract only a clean speech waveform, so that human speech can be easily understood.
Inventors
- Young Hoon Park
Assignees
- HEARDL LTD.
Dates
- Publication Date: 2026-05-07
- Application Date: 2022-11-02
- Priority Date: 2022-10-21
Claims (16)
- 1. An apparatus for removing ambient noise from a speech waveform, the apparatus comprising: an ambient noise removal unit configured to receive a first speech waveform as an input, remove noise through filtering and deep learning, and then output a fourth speech waveform; and a deep learning training unit configured to calculate deep learning weights that are used in deep learning through deep learning training and to provide the deep learning weights to the ambient noise removal unit.
- 2. The apparatus of claim 1, wherein the ambient noise removal unit comprises: a filter unit configured to output a plurality of second waveforms by receiving the one first speech waveform as an input; a deep learning unit configured to output a plurality of third waveforms by receiving the plurality of second waveforms as an input; and a summing unit configured to output the one fourth speech waveform by summing up the plurality of third waveforms.
- 3. The apparatus of claim 2, wherein: the filter unit comprises a plurality of delayed filters configured to output the plurality of second waveforms by receiving the one first speech waveform as an input, one delayed filter has a structure in which one band-pass filter and one delay unit are connected in series, and each of the delay units included in the plurality of delayed filters compensates for a difference between pieces of latency of the band-pass filters included in the plurality of delayed filters by delaying a signal by different latency having a predetermined value so that all of the pieces of latency of the plurality of delayed filters are identical with each other.
- 4. The apparatus of claim 2, wherein the deep learning unit comprises: an encoder unit configured to output a plurality of seventh waveforms and a plurality of sixth waveforms by receiving the plurality of second waveforms as an input; a unidirectional LSTM unit configured to output a plurality of eighth waveforms by receiving the plurality of sixth waveforms as an input; and a decoder unit configured to output the plurality of third waveforms by receiving the plurality of seventh waveforms and the plurality of eighth waveforms as an input, wherein the encoder unit has a structure in which a plurality of CNN encoders is connected in series, and the decoder unit has a structure in which a plurality of detail decoders, each outputting one waveform that constitutes the third waveform by receiving the seventh waveform and the eighth waveform as an input, is connected in parallel.
- 5. The apparatus of claim 4, wherein the decoder unit further comprises one detail decoder configured to output one fifth waveform by receiving the seventh waveform and the eighth waveform as an input.
- 6. The apparatus of claim 4, wherein each of the plurality of detail decoders comprises: a first number change deep learning device configured to receive the eighth waveform as an input; and a plurality of decoder stages connected to the first number change deep learning device in series and configured to receive the seventh waveform as an additional input.
- 7. The apparatus of claim 6, wherein the deep learning training unit comprises: a second summing unit configured to receive a clean ground truth speech waveform and an ambient noise waveform as an input and to generate the first speech waveform by summing up the clean ground truth speech waveform and the ambient noise waveform; a second filter unit configured to output a plurality of thirteenth waveforms by receiving the clean ground truth speech waveform as an input; and a deep learning training engine configured to calculate the deep learning weights by receiving the plurality of thirteenth waveforms and the plurality of third waveforms generated by the ambient noise removal unit as an input and to provide the deep learning weights to the ambient noise removal unit.
- 8. The apparatus of claim 7, wherein the deep learning training unit further comprises a pitch sine wave generator configured to output a plurality of fifteenth waveforms by receiving the clean ground truth speech waveform as an input and to provide the plurality of fifteenth waveforms to the deep learning training engine.
- 9. The apparatus of claim 7, wherein the deep learning training engine comprises: a plurality of relative error calculation units configured to calculate average relative error values of the plurality of third waveforms for the plurality of thirteenth waveforms; a relative error summing unit configured to calculate an average relative error sum value by summing up the average relative error values output by the plurality of relative error calculation units; and a deep learning weight calculation unit configured to calculate the deep learning weights so that the average relative error sum value is reduced.
- 10. The apparatus of claim 9, wherein the deep learning training engine further comprises one relative error calculation unit configured to calculate average relative error values of the plurality of fifth waveforms for the plurality of fifteenth waveforms.
- 11. A method of removing ambient noise from a speech waveform by using the apparatus according to claim 8, the method comprising: generating a plurality of deep learning output waveforms by using a plurality of narrow band waveforms, which is generated by passing an input speech waveform through a plurality of band-pass filters, as an input for deep learning, and then generating an output speech waveform having ambient noise greatly reduced by summing up the plurality of output waveforms, wherein the deep learning additionally outputs one waveform in addition to the deep learning output waveforms, the deep learning is trained so that the added waveform outputs pitch information of a clean speech waveform from which ambient noise has been removed in a speech waveform combined with the ambient noise, and pitch information of one speech waveform learnt by the deep learning is used to generate the plurality of deep learning output waveforms.
- 12. The method of claim 11, wherein the pitch sine wave generator generates a twenty-first speech waveform obtained by delaying the clean ground truth speech waveform by latency of first speech waveform of the second waveform, extracts all of the pitch start times of the twenty-first speech waveform during a voiced speech time interval of the twenty-first speech waveform, and generates one fifteenth waveform having a sine wave, having a period identical with a pitch period of the twenty-first speech waveform, and having a maximum value at the pitch start time of the twenty-first speech waveform.
- 13. The method of claim 12, wherein: the deep learning training engine adds an average relative error value of the fifth waveform for the fifteenth waveform to the average relative error sum value in a deep learning training process and then determines a deep learning weight value so that the added average relative error sum value is reduced, and the deep learning unit uses the pitch information of the clean ground truth speech waveform when learning the pitch information of the clean ground truth speech waveform and outputting the plurality of third waveforms.
- 14. A method of removing ambient noise from a speech waveform by using the apparatus according to claim 9, the method comprising: generating a plurality of deep learning output waveforms by using a plurality of narrow band waveforms, which is generated by passing an input speech waveform through a plurality of band-pass filters, as an input for deep learning, and then generating an output speech waveform having ambient noise greatly reduced by summing up the plurality of output waveforms, wherein the deep learning additionally outputs one waveform in addition to the deep learning output waveforms, the deep learning is trained so that the added waveform outputs pitch information of a clean speech waveform from which ambient noise has been removed in a speech waveform combined with the ambient noise, and pitch information of one speech waveform learnt by the deep learning is used to generate the plurality of deep learning output waveforms.
- 15. The method of claim 14, wherein the pitch sine wave generator generates a twenty-first speech waveform obtained by delaying the clean ground truth speech waveform by latency of first speech waveform of the second waveform, extracts all of the pitch start times of the twenty-first speech waveform during a voiced speech time interval of the twenty-first speech waveform, and generates one fifteenth waveform having a sine wave, having a period identical with a pitch period of the twenty-first speech waveform, and having a maximum value at the pitch start time of the twenty-first speech waveform.
- 16. The method of claim 15, wherein: the deep learning training engine adds an average relative error value of the fifth waveform for the fifteenth waveform to the average relative error sum value in a deep learning training process and then determines a deep learning weight value so that the added average relative error sum value is reduced, and the deep learning unit uses the pitch information of the clean ground truth speech waveform when learning the pitch information of the clean ground truth speech waveform and outputting the plurality of third waveforms.
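The filter structure recited in claims 2 and 3, a bank of band-pass filters whose per-branch delay units equalize overall latency, can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the claimed implementation: the band edges, tap counts, and 48 kHz sampling rate are assumptions, and linear-phase FIR filters are used so that each branch's group delay is exactly (numtaps - 1) / 2 samples, which lets the compensating delay be computed in closed form.

```python
import numpy as np

def bandpass_fir(lo, hi, fs, numtaps):
    """Linear-phase windowed-sinc band-pass filter.

    Being symmetric, its group delay is (numtaps - 1) / 2 samples."""
    n = np.arange(numtaps) - (numtaps - 1) / 2

    def lowpass(fc):
        h = np.sinc(2 * fc / fs * n)   # sinc low-pass kernel at cutoff fc
        return h / h.sum()             # normalize DC gain to 1

    return (lowpass(hi) - lowpass(lo)) * np.hamming(numtaps)

def delayed_filter_bank(x, bands, fs=48_000):
    """Split x into narrow-band waveforms with equalized overall latency.

    bands: list of (lo_hz, hi_hz, numtaps). Each branch is a band-pass
    filter followed by a delay that tops up the branch's group delay to
    that of the longest filter, so every branch shares the same latency."""
    max_delay = max((nt - 1) // 2 for _, _, nt in bands)
    outs = []
    for lo, hi, nt in bands:
        y = np.convolve(x, bandpass_fir(lo, hi, fs, nt))[: len(x)]
        pad = max_delay - (nt - 1) // 2          # compensating delay
        outs.append(np.concatenate([np.zeros(pad), y])[: len(x)])
    return outs
```

Summing the branch outputs, as the summing unit of claim 2 does, then reconstructs a full-band waveform with a single, uniform latency.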
Description
TECHNICAL FIELD
The present disclosure relates to an apparatus and method for removing ambient noise from a speech waveform, and more particularly, to an apparatus and method that use a band-pass filter and deep learning to remove ambient noise from a speech waveform combined with that noise and to extract only a clean speech waveform, so that a person's voice can be heard clearly.
BACKGROUND ART
Research on speech de-noising, which removes ambient noise from a speech waveform and extracts only a clean speech waveform, has been conducted for a relatively long time. Among the most widely used speech de-noising algorithms is the Wiener filter, now common in smartphones. In general, a smartphone has two microphones, embedded on its upper and lower sides. The lower microphone, disposed close to the user's mouth, receives a voice-plus-noise waveform, while the upper microphone, disposed far from the user's mouth, receives mostly a noise waveform. Applying the Wiener filter to the two waveforms yields a relatively clean speech waveform in which the influence of noise is reduced.
Recently, deep learning has been actively applied to speech de-noising research. Such approaches are commonly divided into time-frequency mask methods and methods applied directly to the speech waveform. A time-frequency mask method converts a speech waveform, that is, a one-dimensional matrix (vector) over [time], into a frequency spectrogram, that is, a two-dimensional matrix over [time, frequency]; zeroes out or attenuates the [time, frequency] components of the spectrogram that are determined to be related to noise; and then converts the result back into a new speech waveform. The process of converting a speech waveform into a frequency spectrogram is as follows.
First, a speech waveform is split into successive time intervals called frames, and a short-time Fourier transform (STFT) is performed on the waveform of one frame time interval. Accordingly, the speech waveform corresponding to one frame is converted into a set of complex frequency spectra. For example, when one frame time is 25 ms and the frame step time is 10 ms in a speech waveform with a sample rate of 48,000 samples per second, one frame contains 1,200 speech samples, and the start times of two temporally neighboring frames differ by the step time (10 ms). Accordingly, two temporally neighboring frames overlap by 15 ms (720 samples). The STFT output for one frame time interval consists of 1,200 complex numbers, each indicating one frequency component. Only the 601 complex numbers of the first half are used in the subsequent calculation process, because the second half of the 1,200 complex numbers is the complex conjugate of the first half. The first of these 601 complex numbers is the DC component (0 Hz), the second is the 40 Hz component (the sample rate of 48,000 Hz divided by the 1,200 samples of one frame), the third the 80 Hz component, the fourth the 120 Hz component, and so on up to the 601st, which is the 24,000 Hz component. Accordingly, a speech waveform is converted into a frequency spectrogram containing 601 complex numbers for every 10 ms step time. That is, a frequency spectrogram is a two-dimensional matrix of [t, f]. In the above example, one index step in the t dimension corresponds to 10 ms, and one index step in the f dimension corresponds to 40 Hz.
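The frame and spectrum arithmetic in the example above can be checked directly with a few lines of NumPy (a minimal sketch; the random frame is only a stand-in for real speech samples):

```python
import numpy as np

SAMPLE_RATE = 48_000                          # samples per second
FRAME_MS, STEP_MS = 25, 10

frame_len = SAMPLE_RATE * FRAME_MS // 1000    # 1,200 samples per frame
step_len = SAMPLE_RATE * STEP_MS // 1000      # 480 samples per step
overlap = frame_len - step_len                # 720 samples = 15 ms overlap
n_bins = frame_len // 2 + 1                   # 601 non-redundant frequency bins
bin_hz = SAMPLE_RATE / frame_len              # 40 Hz per frequency bin

# STFT of one frame: rfft keeps only the first half of the spectrum,
# since the second half is the complex conjugate of the first.
frame = np.random.randn(frame_len)
spectrum = np.fft.rfft(frame)
assert spectrum.shape == (n_bins,)            # 601 complex numbers
```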
The time-frequency mask method generates a new frequency spectrogram by zeroing out or reducing the magnitude of the components determined to be related to noise, among the two-dimensional matrix components of the spectrogram, and then generates and outputs a new speech waveform by performing an inverse STFT operation on the newly generated spectrogram. When listening to an output speech waveform obtained by applying the time-frequency mask method to a speech waveform combined with noise, unnatural-sounding portions of speech are occasionally found. To obtain more natural speech output, a method that applies deep learning directly to the speech waveform, without converting it into a frequency spectrogram, is used. Such a method can operate in real time on a notebook computer because its deep learning computational load is relatively small, and it can reduce time-non-stationary noise in addition to time-stationary noise. However, such a conventional method receives as an input one speech waveform comprising a predetermined number of temporally neighboring speech samples and outputs only another speech waveform comprising the same number of samples. Accordingly, the method has a problem in that ambient noise cannot be uniformly removed in all audible frequency bands from a speech