US-20260128050-A1 - COHERENCE CALCULATION FOR STEREO DISCONTINUOUS TRANSMISSION (DTX)


Abstract

Enabling generation of comfort noise in an encoder using an estimated coherence parameter in a network using a discontinuous transmission, DTX, includes receiving time domain audio input comprising audio input signals; and processing the input signals on a frame-by-frame basis by: encoding active content of each input signal at a first bit rate until an inactive period is detected in the input signals; switching the encoding from the active encoding to inactive encoding to encode background noise at a second bit rate during the inactive period; estimating coherence parameters during the inactive period based on a low-pass filtering or averaging of cross-spectra including reinitializing a low pass filter state of the cross-spectra based on a coherence parameter from a previous inactive period; encoding the coherence parameters estimated; and initiating transmitting of the encoded active content, background noise, and coherence parameters towards a decoder.

Inventors

Tomas Jansson Toftgård
Fredrik Jansson

Assignees

TELEFONAKTIEBOLAGET LM ERICSSON (PUBL)

Dates

Publication Date: 20260507
Application Date: 20230920

Claims (20)

1. A method in an encoder to enable generation of comfort noise using an estimated coherence parameter in a network using a discontinuous transmission, DTX, the method comprising: receiving a time domain audio input comprising audio input signals; processing the audio input signals on a frame-by-frame basis by: encoding active content of each audio input signal at a first bit rate until an inactive period is detected in the audio input signals; switching the encoding from the active encoding content to inactive encoding to encode background noise at a second bit rate during the inactive period; estimating coherence parameters during the inactive period based on a low-pass filtering of cross-spectra or averaging of the cross-spectra, wherein estimating the coherence parameters comprises reinitializing a low pass filter state of the cross-spectra based on a coherence parameter from a previous inactive period; and encoding the coherence parameters estimated.
2. The method of claim 1, further comprising: initiating transmitting of the active content encoded, the background noise encoded, and the coherence parameters encoded towards a decoder.
3. The method of claim 1, wherein estimating the coherence parameters comprises: in a first encoding frame after active coding, reinitializing a state of a first cross spectra low-pass filter X spec_smooth based on coherence parameters from a previous period of inactive encoding.
4. The method of claim 3, wherein reinitializing the state of the first cross spectra low-pass filter X spec_smooth based on coherence parameters from a previous period of inactive encoding comprises reinitializing the state of the first cross spectra low-pass filter X spec_smooth based on a last two frames from the previous period of inactive coding.
5. The method of claim 3, wherein reinitializing the state of the first cross spectra low-pass filter X spec_smooth based on coherence parameters from a previous period of inactive encoding comprises reinitializing the state of the first cross spectra low-pass filter X spec_smooth based on a second last frame from the previous period of inactive coding.
6. The method of claim 3, further comprising: starting an update of the low-pass filter X spec_smooth during a DTX hangover period.
7. The method of claim 1, wherein processing the audio input signals on a frame-by-frame basis comprises processing the audio input signals on a frame-by-frame basis to produce a mono mixdown signal and encoding the active content of each audio input signal comprises encoding the active content of the mono mixdown signal.
8. The method of claim 7, wherein processing the audio input signals on a frame-by-frame basis to produce the mono mixdown signal comprises processing the audio input signals on a frame-by-frame basis to produce the mono mixdown signal and one or more stereo parameters and encoding the active content of the mono mixdown signal comprises encoding the active content of the mono mixdown signal and the one or more stereo parameters.
9. The method of claim 3, wherein X spec_smooth is determined in accordance with

$$\mathrm{Xspec}_{\mathrm{smooth}}[k,m] = C_{\mathrm{band}}(b, m-2) \cdot \left|\mathrm{SPD\_L}_{\mathrm{smooth}}[k,m]\right|^{2} \cdot \left|\mathrm{SPD\_R}_{\mathrm{smooth}}[k,m]\right|^{2} \cdot \mathrm{rand}(k) \quad \forall k \in k_b,\; b = 0, 1, \ldots, N_{\mathrm{band}}-1$$

$$\mathrm{SPD\_L}_{\mathrm{smooth}}[k,m] = (1-\alpha) \cdot \mathrm{SPD\_L}_{\mathrm{smooth}}[k,m-1] + \alpha \cdot \mathrm{SPD\_L}(k,m)$$

$$\mathrm{SPD\_R}_{\mathrm{smooth}}[k,m] = (1-\alpha) \cdot \mathrm{SPD\_R}_{\mathrm{smooth}}[k,m-1] + \alpha \cdot \mathrm{SPD\_R}(k,m)$$

$$C(k,m) = \frac{\left|\mathrm{Xspec}_{\mathrm{smooth}}[k,m]\right|^{2}}{\left|\mathrm{SPD\_L}_{\mathrm{smooth}}[k,m]\right|^{2} \cdot \left|\mathrm{SPD\_R}_{\mathrm{smooth}}[k,m]\right|^{2}}$$

$$C_{\mathrm{band}}(b,m) = \frac{\sum_{k=\mathrm{bandlimits}(b)}^{\mathrm{bandlimits}(b+1)-1} C(k,m)}{\mathrm{bandlimits}(b+1) - \mathrm{bandlimits}(b)}, \quad b = 0, 1, \ldots, N_{\mathrm{band}}-1$$

where · indicates multiplication, α is a low pass coefficient, k_b is the set of frequency coefficients for band b, bandlimits(b) is a vector containing the limits between the frequency bands, and rand(k) is a complex number with an absolute value = 1 and a random phase.
10. The method of claim 9, further comprising weighting the C_band(b, m) with a weighting function.
11. The method of claim 10, wherein weighting the C_band(b, m) with the weighting function is weighted in accordance with

$$C_{\mathrm{band}}(b,m) = \frac{\sum_{k=\mathrm{bandlimits}(b)}^{\mathrm{bandlimits}(b+1)-1} C(k,m) \cdot \left|LR(m,k)\right|^{2}}{\sum_{k=\mathrm{bandlimits}(b)}^{\mathrm{bandlimits}(b+1)-1} \left|LR(m,k)\right|^{2}}, \quad b = 0, 1, \ldots, N_{\mathrm{band}}-1$$

where |LR(m, k)|^2 is a discrete Fourier transform, DFT, energy spectrum for a mono signal being a downmix of the audio input signals.
12. The method of claim 3, wherein X spec_smooth is determined in accordance with

$$\mathrm{Xspec}_{\mathrm{smooth}}[k,m] = C_{\mathrm{band}}(b, m-2) \cdot \left|\mathrm{SPD\_L}_{\mathrm{smooth}}[k,m]\right|^{2} \cdot \left|\mathrm{SPD\_R}_{\mathrm{smooth}}[k,m]\right|^{2} \cdot \frac{\mathrm{Xspec}_{\mathrm{smooth}}[k,m]}{\left|\mathrm{Xspec}_{\mathrm{smooth}}[k,m]\right|} \quad \forall k \in k_b,\; b = 0, 1, \ldots, N_{\mathrm{band}}-1$$

$$\mathrm{SPD\_L}_{\mathrm{smooth}}[k,m] = (1-\alpha) \cdot \mathrm{SPD\_L}_{\mathrm{smooth}}[k,m-1] + \alpha \cdot \mathrm{SPD\_L}(k,m)$$

$$\mathrm{SPD\_R}_{\mathrm{smooth}}[k,m] = (1-\alpha) \cdot \mathrm{SPD\_R}_{\mathrm{smooth}}[k,m-1] + \alpha \cdot \mathrm{SPD\_R}(k,m)$$

$$C(k,m) = \frac{\left|\mathrm{Xspec}_{\mathrm{smooth}}[k,m]\right|^{2}}{\left|\mathrm{SPD\_L}_{\mathrm{smooth}}[k,m]\right|^{2} \cdot \left|\mathrm{SPD\_R}_{\mathrm{smooth}}[k,m]\right|^{2}}$$

$$C_{\mathrm{band}}(b,m) = \frac{\sum_{k=\mathrm{bandlimits}(b)}^{\mathrm{bandlimits}(b+1)-1} C(k,m)}{\mathrm{bandlimits}(b+1) - \mathrm{bandlimits}(b)}, \quad b = 0, 1, \ldots, N_{\mathrm{band}}-1$$

where · indicates multiplication, α is a low pass coefficient, k_b is the set of frequency coefficients for band b, and bandlimits(b) is a vector containing the limits between the frequency bands.
13. The method of claim 12, further comprising weighting the C_band(b, m) with a weighting function.
14. The method of claim 13, wherein weighting the C_band(b, m) with the weighting function is weighted in accordance with

$$C_{\mathrm{band}}(b,m) = \frac{\sum_{k=\mathrm{bandlimits}(b)}^{\mathrm{bandlimits}(b+1)-1} C(k,m) \cdot \left|LR(m,k)\right|^{2}}{\sum_{k=\mathrm{bandlimits}(b)}^{\mathrm{bandlimits}(b+1)-1} \left|LR(m,k)\right|^{2}}, \quad b = 0, 1, \ldots, N_{\mathrm{band}}-1$$

where |LR(m, k)|^2 is a discrete Fourier transform, DFT, energy spectrum for a mono signal being a downmix of the audio input signals.
15. The method of claim 1, further comprising: not updating the C_band(b, m−2) in a first frame of an inactive period having a plurality of frames but in a second frame of the inactive period having the plurality of frames.
16. The method of claim 1, further comprising: executing a dedicated cross-correlation estimate that is only updated during the inactive periods and/or during DTX hangover frames for the cross spectra and using the dedicated cross-correlation estimate for the coherence estimation in the inactive period.
17. The method of claim 1, further comprising: resetting the cross-spectrum low-pass filter state at one of prior to any updates in a DTX hangover period and prior to any updates in the inactive period.
18. The method of claim 1, further comprising: reinitializing a low-pass filter state at the start of a hangover period or at the start of the inactive period.
19.-20. (canceled)
21. An encoder adapted to enable generation of comfort noise using an estimated coherence parameter in a network using a discontinuous transmission, DTX, the encoder comprising: processing circuitry; and memory coupled with the processing circuitry, wherein the memory includes instructions that when executed by the processing circuitry causes the encoder to perform operations comprising: receive a time domain audio input comprising audio input signals; process the audio input signals on a frame-by-frame basis by: encode active content of each audio input signal at a first bit rate until an inactive period is detected in the audio input signals; switch the encoding from the active encoding content to inactive encoding to encode background noise at a second bit rate during the inactive period; estimate coherence parameters during the inactive period based on a low-pass filtering of cross-spectra or averaging of the cross-spectra, wherein estimating the coherence parameters comprises initiating a low pass filter state of the cross-spectra based on a coherence parameter from a previous inactive period; and encode the coherence parameters estimated.
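
The smoothing, per-bin coherence, and band-averaging relations recited in claims 9 to 11 can be illustrated with a minimal Python sketch. The function names, the plain-list data layout, and the zero-denominator guards are assumptions made for illustration and do not appear in the claims; the reinitialization of the cross-spectrum filter state itself is not shown.

```python
import cmath
import math
import random


def smooth(prev, new, alpha):
    """One low-pass update: (1 - alpha) * prev + alpha * new."""
    return (1.0 - alpha) * prev + alpha * new


def bin_coherence(xspec_s, spd_l_s, spd_r_s):
    """Per-bin C(k, m) = |Xspec|^2 / (|SPD_L|^2 * |SPD_R|^2) on smoothed inputs."""
    out = []
    for x, l, r in zip(xspec_s, spd_l_s, spd_r_s):
        denom = abs(l) ** 2 * abs(r) ** 2
        # Returning 0.0 for an empty bin is an assumption of this sketch.
        out.append(abs(x) ** 2 / denom if denom > 0.0 else 0.0)
    return out


def band_coherence(c, bandlimits, weights=None):
    """Average C(k, m) over each band [bandlimits[b], bandlimits[b + 1]).

    With weights=None every bin counts equally, so the divisor equals
    bandlimits[b + 1] - bandlimits[b]. Passing the mono downmix energy
    spectrum |LR(m, k)|^2 as weights gives the weighted variant.
    """
    if weights is None:
        weights = [1.0] * len(c)
    bands = []
    for lo, hi in zip(bandlimits[:-1], bandlimits[1:]):
        num = sum(c[k] * weights[k] for k in range(lo, hi))
        den = sum(weights[k] for k in range(lo, hi))
        # Guard against an all-zero weight band (assumption of this sketch).
        bands.append(num / den if den > 0.0 else 0.0)
    return bands


def rand_unit_phase(rng):
    """Complex number with absolute value 1 and a random phase, like rand(k)."""
    return cmath.rect(1.0, rng.uniform(-math.pi, math.pi))
```

For example, `band_coherence([0.2, 0.4, 0.6, 1.0], [0, 2, 4])` averages the first two and the last two bins separately, giving one coherence value per band.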

Description

TECHNICAL FIELD

The present disclosure relates generally to communications, and more particularly to communication methods and related devices and nodes supporting encoding and decoding.

BACKGROUND

In communications networks, there may be a challenge to obtain good performance and capacity for a given communications protocol, its parameters and the physical environment in which the communications network is deployed.

For example, although the capacity in telecommunication networks is continuously increasing, it is still of interest to limit the required resource usage per user. In mobile telecommunication networks less required resource usage per call means that the mobile telecommunication network can service a larger number of users in parallel. Lowering the resource usage also yields lower power consumption in both devices at the user-side (such as in terminal devices) and devices at the network-side (such as in network nodes). This translates to energy and cost saving for the network operator, while enabling prolonged battery life and increased talk-time to be experienced in the terminal devices.

One mechanism for reducing the required resource usage for speech communication applications in mobile telecommunication networks is to exploit natural pauses in the speech. In more detail, in most conversations only one party is active at a time, and thus the speech pauses in one communication direction will typically occupy more than half of the signal. One way to utilize this property in order to decrease the required resource usage is to employ a Discontinuous Transmission (DTX) system, where the active signal encoding is discontinued during speech pauses.

The encoding process is done on segments of the audio signal(s) referred to as frames where input audio samples during a time interval, typically 10-20 ms, are buffered and used by an encoder to extract the parameters to be transmitted to a decoder.
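
As an illustration of the frame buffering described above, the following minimal Python sketch splits a sample sequence into consecutive frames. The function name, the 16 kHz sample rate, and the 20 ms frame length are assumptions chosen for the example; the text only states that frames typically span 10-20 ms.

```python
def split_into_frames(samples, sample_rate_hz=16000, frame_ms=20):
    """Split samples into full frames; return (frames, leftover samples).

    The leftover samples stay buffered until enough input arrives to
    complete the next frame.
    """
    frame_len = sample_rate_hz * frame_ms // 1000  # samples per frame
    n_full = len(samples) // frame_len
    frames = [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_full)]
    leftover = samples[n_full * frame_len:]
    return frames, leftover
```

Under these assumed values each frame holds 320 samples, so 700 input samples yield two full frames and 60 samples left in the buffer.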
During speech pauses it is common to transmit so-called SID (silence insertion descriptor) frames, a very low bit rate encoding of the background noise, to allow a Comfort Noise Generator (CNG) system at the receiving end to fill the above-mentioned pauses with a background noise having similar characteristics to the original noise. The CNG makes the sound more natural compared to having silence in the speech pauses since the background noise is maintained and not switched on and off together with the speech. Complete silence in the speech pauses is commonly perceived as annoying and often leads to the misconception that the call has been disconnected.

A DTX system might further rely on a Voice Activity Detector (VAD), which indicates to the transmitting device whether to use active signal encoding or low rate background noise encoding. In this respect the transmitting device might be configured to discriminate between other source types by using a (Generic) Sound Activity Detector (GSAD or SAD), which not only discriminates speech from background noise but also might be configured to detect music or other signal types, which are deemed relevant.

A block diagram of a DTX system 100 is illustrated in FIG. 1. In FIG. 1, input audio is received by the VAD 102, the speech/audio coder 104, and the CNG coder 106. The VAD 102 indicates whether to transmit the “high” bitrate from speech/audio coder 104 or transmit the “low” bitrate from CNG coder 106.

Communication services may be further enhanced by supporting stereo or multichannel audio transmission. In these cases, the DTX/CNG system might also consider the spatial characteristics of the signal in order to provide a pleasant-sounding comfort noise.

A common mechanism to generate comfort noise is to transmit information about the energy and spectral shape of the background noise in the speech pauses. This can be done using a significantly lower number of bits than the regular coding of speech segments.
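
The VAD-controlled switching of FIG. 1 can be illustrated with a minimal Python sketch. The energy-threshold detector, the threshold value, and the packet layout are hypothetical stand-ins chosen for the example; only the frame energy is carried in the comfort noise packet here, and the spectral shape is omitted.

```python
def frame_energy(frame):
    """Mean squared sample value of one frame."""
    return sum(s * s for s in frame) / len(frame)


def toy_vad(frame, threshold=1e-3):
    """Hypothetical energy-threshold VAD; real detectors are more elaborate."""
    return frame_energy(frame) > threshold


def dtx_encode(frames, vad=toy_vad):
    """Route each frame to the active coder or the comfort noise coder."""
    packets = []
    for frame in frames:
        if vad(frame):
            # "High" bit rate path: stand-in for the speech/audio coder.
            packets.append(("active", list(frame)))
        else:
            # "Low" bit rate path: stand-in for the CNG coder, which here
            # describes the background noise by its energy alone.
            packets.append(("cn", frame_energy(frame)))
    return packets
```

A loud frame therefore produces an "active" packet carrying the full frame, while a quiet frame produces a "cn" packet carrying a single number, which mirrors the difference in bit rate between the two paths.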
Normally this information is sent less frequently than in the active segments, as illustrated in FIG. 2, where the active segments are illustrated as active encoding and the information about the energy and spectral shape of the background noise in the speech pauses is illustrated as CN encoding.

A common feature in DTX systems is to add a so-called “hangover period” to the VAD decision as illustrated in FIG. 3. During this period active encoding will still be used even though the VAD decision is that there should not be active encoding. This is to avoid short segments of CNG in the middle of longer active segments, e.g., in breathing pauses in a speech utterance. Parameters used for CNG generation can be estimated during this period.

At the receiving side, the comfort noise is generated by creating a pseudo random signal and then shaping the spectrum of the signal with a filter based on information received from the transmitting device. The signal generation and spectral shaping can be performed in the time or the frequency domain. For stereo operation, additional p