EP-4107723-B1 - METHOD AND SYSTEM TO IMPROVE VOICE SEPARATION BY ELIMINATING OVERLAP
Inventors
- BI, Xiangru
- LIU, Zhilei
- ZHANG, Guoxia
Dates
- Publication Date
- 20260506
- Application Date
- 20200221
Claims (11)
- A method for improving voice separation performance by eliminating overlaps, comprising the steps of: picking up, by at least two microphones (mic1, mic2), respectively, at least two mixtures, x1(t), x2(t), including mixed first sound and second sound; recording and storing, in a sound recording module, said at least two mixtures, x1(t), x2(t), from said at least two microphones (mic1, mic2); analyzing, in an algorithm module, said at least two mixtures, x1(t), x2(t), for recovering the first sound and the second sound, respectively, wherein analyzing said at least two mixtures, x1(t), x2(t), comprises performing a Degenerate Unmixing Estimation Technique, DUET, voice separation algorithm, wherein the DUET voice separation algorithm comprises the steps of: constructing time-frequency representations x̂1(τ, ω) and x̂2(τ, ω) from the at least two mixtures, x1(t), x2(t); calculating relative attenuation-delay pairs ( |x̂2(τ, ω)/x̂1(τ, ω)| − |x̂1(τ, ω)/x̂2(τ, ω)|, −(1/ω)∠(x̂2(τ, ω)/x̂1(τ, ω)) ); constructing a 2D smoothed weighted histogram H(α, δ) of direction-of-arrivals, DOAs, and distances from said at least two mixtures, x1(t), x2(t), from said at least two microphones (mic1, mic2), wherein the histogram is built as H(α, δ) := ∬_{(τ,ω)∈I(α,δ)} |x̂1(τ, ω) x̂2(τ, ω)|^p ω^q dτ dω, where the X-axis is −(1/ω)∠(x̂2(τ, ω)/x̂1(τ, ω)), which means the relative delay, the Y-axis is |x̂2(τ, ω)/x̂1(τ, ω)| − |x̂1(τ, ω)/x̂2(τ, ω)|, which indicates a symmetric attenuation, and the Z-axis is H(α, δ), which represents a weight; locating peaks and peak centers (Pc_1, Pc_2) in the histogram H(α, δ); eliminating overlapping points from the time-frequency points, wherein the overlapping points comprise the time-frequency points that include both the first sound and the second sound, and wherein the overlapping points are found among the time-frequency points, and each of the overlapping points is determined when a differential value between a first distance (d1) and a second distance (d2) is less than a threshold, wherein the first distance (d1) is the distance from one of the time-frequency points (Pt_r) to be determined to a first peak center (Pc_1), and the second distance (d2) is the distance from the same time-frequency point (Pt_r) to a second peak center (Pc_2); and separating the time-frequency points, having the overlapping points eliminated, in relation to the first sound and the second sound, respectively.
- The method of claim 1, wherein the threshold is set to a quarter of the distance (d0) between the first peak center (Pc_1) and the second peak center (Pc_2).
- The method of claim 1, wherein the overlapping points are determined by traversing all the time-frequency points in relation to the first sound and the second sound, respectively.
- The method of claim 1, wherein recovering the first sound and the second sound comprises converting the time-frequency points, with the overlapping points eliminated, back to the time domain.
- The method of claim 1, wherein the method can be implemented in any situation in which more than one person is talking at the same time.
- A system for improving voice separation performance by eliminating overlaps, comprising: at least two microphones (mic1, mic2) adapted to pick up at least two mixtures, x1(t), x2(t), including mixed first sound and second sound, respectively; a sound recording module adapted to record and store said at least two mixtures, x1(t), x2(t), from said at least two microphones (mic1, mic2); an algorithm module adapted to analyze said at least two mixtures, x1(t), x2(t), for recovering the first sound and the second sound, respectively, wherein analyzing said at least two mixtures, x1(t), x2(t), comprises performing a Degenerate Unmixing Estimation Technique, DUET, voice separation algorithm, wherein the DUET voice separation algorithm comprises the steps of: constructing time-frequency representations x̂1(τ, ω) and x̂2(τ, ω) from the at least two mixtures, x1(t), x2(t); calculating relative attenuation-delay pairs ( |x̂2(τ, ω)/x̂1(τ, ω)| − |x̂1(τ, ω)/x̂2(τ, ω)|, −(1/ω)∠(x̂2(τ, ω)/x̂1(τ, ω)) ); constructing a 2D smoothed weighted histogram H(α, δ) of direction-of-arrivals, DOAs, and distances from said at least two mixtures, x1(t), x2(t), from said at least two microphones (mic1, mic2), wherein the histogram is built as H(α, δ) := ∬_{(τ,ω)∈I(α,δ)} |x̂1(τ, ω) x̂2(τ, ω)|^p ω^q dτ dω, where the X-axis is −(1/ω)∠(x̂2(τ, ω)/x̂1(τ, ω)), which means the relative delay, the Y-axis is |x̂2(τ, ω)/x̂1(τ, ω)| − |x̂1(τ, ω)/x̂2(τ, ω)|, which indicates a symmetric attenuation, and the Z-axis is H(α, δ), which represents a weight; locating peaks and peak centers (Pc_1, Pc_2) in the histogram H(α, δ); eliminating overlapping points from the time-frequency points, wherein the overlapping points comprise the time-frequency points that include both the first sound and the second sound, and wherein the overlapping points are found among the time-frequency points, and each of the overlapping points is determined when a differential value between a first distance (d1) and a second distance (d2) is less than a threshold, wherein the first distance (d1) is the distance from one of the time-frequency points (Pt_r) to be determined to a first peak center (Pc_1), and the second distance (d2) is the distance from the same time-frequency point (Pt_r) to a second peak center (Pc_2); and separating the time-frequency points, having the overlapping points eliminated, in relation to the first sound and the second sound, respectively.
- The system of claim 6, wherein the threshold is set to a quarter of the distance (d0) between the first peak center (Pc_1) and the second peak center (Pc_2).
- The system of claim 6, wherein the overlapping points are found by traversing all the time-frequency points in relation to the first sound and the second sound, respectively.
- The system of claim 6, wherein the first sound and the second sound are recovered by converting the time-frequency points, with the overlapping points eliminated, back to the time domain.
- The system of claim 6, wherein the system can be used in any situation in which more than one person is talking at the same time.
- A non-transitory computer-readable storage medium including instructions that, when executed by a processor, configure the processor to perform the steps of the method according to any one of claims 1-5.
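Claims 1, 2 and 6 recite three concrete computations: the DUET symmetric-attenuation/relative-delay features, the weighted histogram H(α, δ), and the |d1 − d2| < d0/4 overlap test. The following NumPy sketch illustrates those steps under common DUET conventions; all function names, the Hann window, and the default parameters (nfft, hop, bins, p, q) are illustrative assumptions and are not specified by the patent.

```python
import numpy as np

def stft(x, nfft=1024, hop=512):
    """Short-time Fourier transform with a Hann window: (frames, freq bins)."""
    win = np.hanning(nfft)
    n = 1 + (len(x) - nfft) // hop
    return np.array([np.fft.rfft(win * x[i * hop:i * hop + nfft])
                     for i in range(n)])

def duet_histogram(x1, x2, nfft=1024, hop=512, p=1, q=0, bins=35):
    """Attenuation-delay pairs and the 2D weighted histogram H(alpha, delta).

    Y-axis feature: alpha = |x2^/x1^| - |x1^/x2^| (symmetric attenuation).
    X-axis feature: delta = -(1/w) * angle(x2^/x1^) (relative delay).
    Weight per time-frequency point: |x1^ x2^|^p * w^q.
    """
    X1, X2 = stft(x1, nfft, hop), stft(x2, nfft, hop)
    X1, X2 = X1[:, 1:], X2[:, 1:]          # drop DC: delay is undefined at w = 0
    omega = 2 * np.pi * np.arange(1, X1.shape[1] + 1) / nfft
    eps = 1e-12                            # guard against division by zero
    R = (X2 + eps) / (X1 + eps)            # ratio x2^/x1^ at each (tau, omega)
    a = np.abs(R)
    alpha = a - 1.0 / a                    # symmetric attenuation
    delta = -np.angle(R) / omega           # relative delay
    weight = np.abs(X1 * X2) ** p * omega ** q
    H, a_edges, d_edges = np.histogram2d(
        alpha.ravel(), delta.ravel(), bins=bins, weights=weight.ravel())
    return H, a_edges, d_edges

def is_overlapping(pt, pc1, pc2):
    """Overlap test of claims 1-2: a time-frequency point, given as an
    (alpha, delta) coordinate, is overlapping when |d1 - d2| < d0/4."""
    pt, pc1, pc2 = (np.asarray(v, dtype=float) for v in (pt, pc1, pc2))
    d1 = np.linalg.norm(pt - pc1)          # distance to first peak center Pc_1
    d2 = np.linalg.norm(pt - pc2)          # distance to second peak center Pc_2
    d0 = np.linalg.norm(pc1 - pc2)         # distance between the peak centers
    return abs(d1 - d2) < d0 / 4
```

In a complete implementation, the peak centers Pc_1 and Pc_2 would be located as local maxima of the smoothed histogram, is_overlapping would be evaluated while traversing all time-frequency points (claims 3 and 8), and the remaining points would be masked per source and converted back to the time domain (claims 4 and 9).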
Description
TECHNICAL FIELD

The present invention relates generally to voice separation. More particularly, the present invention relates to a method for improving voice separation by eliminating overlaps. The present invention also relates to a system for improving voice separation by eliminating overlaps.

BACKGROUND

Document US 2012/046940 A1 discloses a method for processing multichannel acoustic signals, whereby input signals of a plurality of channels including the voices of a plurality of speaking persons are processed. The method comprises: calculating a first feature quantity of the multichannel input signals for each channel; calculating the similarity of the first feature quantity between the channels; selecting channels having high similarity; separating signals using the input signals of the selected channels; inputting the input signals of the channels having low similarity and the signals after the signal separation; and detecting a voice section of each speaking person or each channel.

Nowadays, voice separation is widely used by general users in many settings, one of which is, for example, in a car with speech recognition. When more than one person is speaking, or when there is noise in the car, the car's host system cannot recognize the driver's speech. Voice separation is therefore needed to improve speech recognition in this case. There are two main well-known types of voice separation methods. One is to create a microphone array to achieve voice enhancement. The other is to use voice separation algorithms, such as frequency-domain independent component analysis (FDICA), the degenerate unmixing estimation technique (DUET), or other extended algorithms. Because the FDICA algorithm for separating speech is more complex, the DUET algorithm is usually chosen for implementing the voice separation. However, in the traditional DUET algorithm, some overlapping time-frequency points may be assigned to either of the separated voices.
In this case, one of the separated voices may contain another person's voice, so the separated voice is not pure enough. There may therefore be a need to partition these overlapping time-frequency points into a single cluster so that they do not appear in the separated voices, improving the quality of the separated voice.

SUMMARY OF THE INVENTION

The present invention overcomes some of these drawbacks by providing a method and system to improve voice separation performance by eliminating overlaps. On the one hand, the present invention provides a method for improving voice separation performance by eliminating overlap. The method comprises the steps of: picking up, by at least two microphones, respectively, at least two mixtures including mixed first sound and second sound; recording and storing, in a sound recording module, the at least two mixtures from the at least two microphones; and analyzing, in an algorithm module, the two mixtures to separate the time-frequency points.
In particular, the algorithm module is configured to apply the Degenerate Unmixing Estimation Technique (DUET) voice separation algorithm, which includes the steps of: constructing time-frequency representations x̂1(τ, ω) and x̂2(τ, ω) from the at least two mixtures; calculating relative attenuation-delay pairs ( |x̂2(τ, ω)/x̂1(τ, ω)| − |x̂1(τ, ω)/x̂2(τ, ω)|, −(1/ω)∠(x̂2(τ, ω)/x̂1(τ, ω)) ); constructing a 2D smoothed weighted histogram of the direction-of-arrivals and distances from said at least two mixtures from said at least two microphones, wherein the histogram is built as H(α, δ) := ∬_{(τ,ω)∈I(α,δ)} |x̂1(τ, ω) x̂2(τ, ω)|^p ω^q dτ dω, where the X-axis is −(1/ω)∠(x̂2(τ, ω)/x̂1(τ, ω)), which means the relative delay, the Y-axis is |x̂2(τ, ω)/x̂1(τ, ω)| − |x̂1(τ, ω)/x̂2(τ, ω)|, which indicates the symmetric attenuation, and the Z-axis is H(α, δ), which represents the weight; locating peaks and peak centers in the histogram; eliminating overlapping points from the time-frequency points, wherein the overlapping points comprise the time-frequency points that include both the first sound and the second sound, and wherein the overlapping points are found among the time-frequency points, and each of the overlapping points is determined when the differential value between a first distance (d1) and a second distance (d2) is less than a threshold, wherein the first distance is the distance from the time-frequency point to be determined to a first peak center, and the second distance is the distance from the same time-frequency point to a second peak center; and separating the time-frequency points, having the overlapping points eliminated, in relation to the first sound and the second sound, respectively. In particular, in the method provided herein, eliminating the overlapping points may comprise determining the overlapping points according to the rule |d1 − d2| < d0/4, d0 being the distance between the first peak center and the second peak center. On the other hand