EP-4738350-A2 - MDCT-BASED COMPLEX PREDICTION STEREO CODING

Abstract

The invention provides methods and devices for stereo encoding and decoding using complex prediction in the frequency domain. In one embodiment, a decoding method, for obtaining an output stereo signal from an input stereo signal encoded by complex prediction coding and comprising first frequency-domain representations of two input channels, comprises the upmixing steps of: (i) computing a second frequency-domain representation of a first input channel; and (ii) computing an output channel on the basis of the first and second frequency-domain representations of the first input channel, the first frequency-domain representation of the second input channel and a complex prediction coefficient. The method comprises performing frequency-domain modifications selectively before or after upmixing.

Inventors

PURNHAGEN, HEIKO
VILLEMOES, LARS
CARLSSON, PONTUS

Assignees

Dolby International AB

Dates

Publication Date: 20260506
Application Date: 20110406

Claims (13)

A decoder system for providing a stereo signal by complex prediction stereo coding, the decoder system comprising: an upmix stage (406, 407, 408, 409; 1433) adapted to generate the stereo signal based on first frequency-domain representations of a downmix signal (M) and a residual signal (D), each of the first frequency-domain representations comprising first spectral components representing spectral content of the corresponding signal expressed in a first subspace of a multidimensional space, the decoder system being characterised by the upmix stage comprising: a module (408) for computing a second frequency-domain representation of the downmix signal based on the first frequency-domain representation thereof, the second frequency-domain representation comprising second spectral components representing spectral content of the signal expressed in a second subspace of the multidimensional space that includes a portion of the multidimensional space not included in the first subspace, wherein the module is adapted to combine at least two temporally-adjacent first spectral components; a weighted summer (406, 407) for computing a side signal (S) by combining the first frequency-domain representation of the residual signal, the first frequency-domain representation of the downmix signal weighted by a real-valued part of a complex prediction coefficient encoded in a bit stream signal, and the second frequency-domain representation of the downmix signal weighted by an imaginary-valued part of the complex prediction coefficient; and a sum-and-difference stage (409) for computing the stereo signal on the basis of the first frequency-domain representation of the downmix signal and the side signal; a first frequency-domain modifier stage (403; 1431) arranged upstream of the upmix stage and operable in an active mode, in which it processes a frequency-domain representation of at least one signal, and a passive mode, in which it acts as a pass-through; and a second frequency-domain modifier stage (410; 1435) arranged downstream of the upmix stage and operable in an active mode, in which it processes a frequency-domain representation of at least one signal, and a passive mode, in which it acts as a pass-through.
The decoder system of claim 1, wherein at least one of said frequency-domain modifier stages is a temporal noise shaping, TNS, stage.
The decoder system of claim 2, further adapted to receive, for each time frame, a data field associated with that frame and to operate, responsive to the value of the data field, the first frequency-domain modifier stage in its active mode or its pass-through mode and the second frequency-domain modifier stage in its active mode or its pass-through mode.
The decoder system of any one of the preceding claims, further comprising: a dequantization stage (401) arranged upstream of the upmix stage, for providing said first frequency-domain representations of the downmix signal (M) and residual signal (D) based on the bit stream signal.
The decoder system of any one of the preceding claims, wherein: the first spectral components have real values expressed in the first subspace; the second spectral components have imaginary values expressed in the second subspace; optionally, the first spectral components are represented by one of the following: a discrete cosine transform, DCT, or a modified discrete cosine transform, MDCT, and optionally, the second spectral components are represented by one of the following: a discrete sine transform, DST, or a modified discrete sine transform, MDST.
The decoder system of claim 5, wherein: the downmix signal is partitioned into successive time frames, each associated with a value of the complex prediction coefficient; and the module for computing a second frequency-domain representation of the downmix signal is adapted to deactivate itself, responsive to the absolute value of the imaginary part of the complex prediction coefficient being smaller than a predetermined tolerance for a time frame, so that it generates no output for that time frame.
The decoder system of any one of the preceding claims, said stereo signal being represented in the time domain and the decoder system further comprising: a switching assembly (203) arranged between said dequantization stage and said upmix stage, operable to function as either: a] a pass-through stage, or b] a sum-and-difference stage, thereby enabling switching between directly and jointly coded stereo input signals; an inverse transform stage (209) adapted to compute a time-domain representation of the stereo signal; and a selector arrangement (208) arranged upstream of the inverse transform stage, adapted to selectively connect this to either: a] a point downstream of the upmix stage, whereby the stereo signal obtained by complex prediction is supplied to the inverse transform stage; or b] a point downstream of the switching assembly (203) and upstream of the upmix stage, whereby a stereo signal obtained by direct stereo coding is supplied to the inverse transform stage.
An encoder system for encoding a frequency-domain representation of a stereo signal into a bit stream signal, the encoder system comprising: an estimator (1112) for determining a complex prediction coefficient (α); a downmix stage adapted to generate a residual signal based on the first frequency-domain representations of the downmix signal (M) and the side signal (S), each of the first frequency-domain representations comprising first spectral components representing spectral content of the corresponding signal expressed in a first subspace of a multidimensional space, the encoder system being characterised by the downmix stage comprising: a sum-and-difference stage (1104) for converting the frequency-domain representation of the stereo signal into first frequency-domain representations of a downmix (M) and a side (S) signal; a module (1105) for computing a second frequency-domain representation of the downmix signal based on the first frequency-domain representation thereof, the second frequency-domain representation comprising second spectral components representing spectral content of the signal expressed in a second subspace of the multidimensional space that includes a portion of the multidimensional space not included in the first subspace, wherein the module is adapted to combine at least two temporally-adjacent first spectral components; and a weighted summer (1106, 1107) for computing a residual signal (D) by combining the first frequency-domain representation of the side signal, the first frequency-domain representation of the downmix signal weighted by a real-valued part of the complex prediction coefficient encoded in a bit stream signal, and the second frequency-domain representation of the downmix signal weighted by an imaginary-valued part of the complex prediction coefficient; a multiplexer (1111) for encoding the downmix and residual signals and the complex prediction coefficient into the bit stream signal; a first frequency-domain modifier stage (1102) arranged upstream of the downmix stage and operable in an active mode, in which it processes a frequency-domain representation of at least one signal, and a passive mode, in which it acts as a pass-through; and a second frequency-domain modifier stage (1109) arranged downstream of the downmix stage and operable in an active mode, in which it processes a frequency-domain representation of at least one signal, and a passive mode, in which it acts as a pass-through.
A decoding method for upmixing an input stereo signal by complex prediction stereo coding into an output stereo signal, wherein: said input stereo signal comprises first frequency-domain representations of a downmix signal (M) and a residual signal (D) and a complex prediction coefficient (α); and each of said first frequency-domain representations comprises first spectral components representing spectral content of the corresponding signal expressed in a first subspace of a multidimensional space, the method being performed by an upmix stage and characterised by including the steps of: computing a second frequency-domain representation of the downmix signal based on the first frequency-domain representation thereof, the second frequency-domain representation comprising second spectral components representing spectral content of the signal expressed in a second subspace of the multidimensional space that includes a portion of the multidimensional space not included in the first subspace, wherein computing a second frequency-domain representation of the downmix signal comprises combining at least two temporally-adjacent first spectral components; computing a side signal by combining the first frequency-domain representation of the residual signal, the first frequency-domain representation of the downmix signal weighted by a real-valued part of the complex prediction coefficient, and the second frequency domain representation of the downmix signal weighted by an imaginary-valued part of the complex prediction coefficient; and further comprising either the step, to be performed prior to the step of upmixing, of applying temporal noise shaping, TNS, to said first frequency-domain representation of the downmix signal and/or said first frequency-domain representation of the residual signal; or the step, to be performed after the step of upmixing, of applying TNS to at least one channel of said stereo signal.
An encoding method for encoding a frequency-domain representation of a stereo signal into a bit stream signal, the method comprising the steps of: determining a complex prediction coefficient (α); converting the frequency-domain representation of the stereo signal into first frequency-domain representations of a downmix (M) signal and a side (S) signal by performing a sum-and-difference processing of said frequency-domain representation of the stereo signal, the first frequency-domain representations comprising first spectral components representing spectral content of the corresponding signal expressed in a first subspace of a multidimensional space; computing a second frequency-domain representation of the downmix signal based on the first frequency-domain representation thereof, the second frequency-domain representation comprising second spectral components representing spectral content of the signal expressed in a second subspace of the multidimensional space that includes a portion of the multidimensional space not included in the first subspace, wherein computing the second frequency-domain representation of the downmix signal comprises combining at least two temporally-adjacent first spectral components; computing a residual signal by combining the first frequency-domain representation of the side signal, the first frequency-domain representation of the downmix signal weighted by a real-valued part of the complex prediction coefficient, and the second frequency-domain representation of the downmix signal weighted by an imaginary-valued part of the complex prediction coefficient; and encoding the downmix and residual signals and the complex prediction coefficient into the bit stream signal; and either applying temporal noise shaping, TNS, to at least one signal of said frequency-domain representation of the stereo signal; or applying TNS to said first frequency-domain representation of the downmix signal and/or said first frequency-domain representation of the residual signal.
The method of claim 10, further comprising including in the bitstream an indication of whether TNS was applied to at least one signal of said frequency-domain representation of the stereo signal or to said first frequency-domain representation of the downmix signal and/or said first frequency-domain representation of the residual signal.
A computer-program product comprising a computer-readable medium storing instructions which when executed by a general-purpose computer perform the method set forth in any one of claims 9-11.
A computer-readable medium comprising a bit stream signal generated by the encoder system of claim 8 or by performing the encoding method of claim 10 or 11.
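
By way of a non-limiting illustration, the weighted summation and sum-and-difference steps recited in the decoder claims, together with the corresponding residual computation recited in the encoder claims, may be sketched as follows. The sketch rests on several assumptions that are not taken from the claims: the function names `encode_residual` and `upmix` are hypothetical; the claims only state that the signals are "combined", so the additive sign convention used here (S = D + Re(α)·M + Im(α)·M_imag) is an assumption; the second frequency-domain representation of the downmix (`m_imag`) is supplied as an input, and its computation from temporally-adjacent spectral components is not shown; and the stereo signal is taken as L = M + S, R = M - S.

```python
# Illustrative sketch only; sign convention and names are assumptions.

def encode_residual(m, m_imag, s, alpha):
    """Encoder side: residual D per spectral bin.

    m      -- first (real-valued) representation of the downmix M
    m_imag -- second representation of the downmix (assumed given)
    s      -- first representation of the side signal S
    alpha  -- complex prediction coefficient (Python complex)
    """
    return [sk - alpha.real * mk - alpha.imag * mik
            for mk, mik, sk in zip(m, m_imag, s)]


def upmix(m, m_imag, d, alpha):
    """Decoder side: rebuild S by weighted summation, then L and R."""
    side = [dk + alpha.real * mk + alpha.imag * mik
            for mk, mik, dk in zip(m, m_imag, d)]
    left = [mk + sk for mk, sk in zip(m, side)]    # sum
    right = [mk - sk for mk, sk in zip(m, side)]   # difference
    return left, right
```

Under these assumptions the decoder exactly inverts the encoder whenever both use the same coefficient α and the same second representation of the downmix, whatever the value of α; the choice of α only affects the energy of the residual.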

Description

Cross-Reference To Related Application

This application is a European divisional application of European patent application EP 24170668.8 (reference: D10010BEP05), for which EPO Form 1001 was filed 16 April 2024.

Technical field

The invention disclosed herein generally relates to stereo audio coding and more precisely to techniques for stereo coding using complex prediction in the frequency domain.

Background of the invention

Joint coding of the left (L) and right (R) channels of a stereo signal enables more efficient coding compared to independent coding of L and R. A common approach for joint stereo coding is mid/side (M/S) coding. Here, a mid (M) signal is formed by adding the L and R signals, e.g., the M signal may have the form

$$M = \frac{L + R}{2}$$

Also, a side (S) signal is formed by subtracting the two channels L and R, e.g., the S signal may have the form

$$S = \frac{L - R}{2}$$

In the case of M/S coding, the M and S signals are coded instead of the L and R signals.

In the MPEG (Moving Picture Experts Group) AAC (Advanced Audio Coding) standard (see standard document ISO/IEC 13818-7), L/R stereo coding and M/S stereo coding can be chosen in a time-variant and frequency-variant manner. Thus, the stereo encoder can apply L/R coding for some frequency bands of the stereo signal, whereas M/S coding is used for encoding other frequency bands of the stereo signal (frequency-variant). Moreover, the encoder can switch over time between L/R and M/S coding (time-variant). In MPEG AAC, the stereo encoding is carried out in the frequency domain, more particularly the MDCT (modified discrete cosine transform) domain. This allows choosing adaptively either L/R or M/S coding in a frequency and also time variable manner.

Parametric stereo coding is a technique for efficiently coding a stereo audio signal as a monaural signal plus a small amount of side information for stereo parameters. It is part of the MPEG-4 Audio standard (see standard document ISO/IEC 14496-3). The monaural signal can be encoded using any audio coder. The stereo parameters can be embedded in the auxiliary part of the mono bit stream, thus achieving full forward and backward compatibility. In the decoder, it is the monaural signal that is first decoded, after which the stereo signal is reconstructed with the aid of the stereo parameters. A decorrelated version of the decoded mono signal, which has zero cross correlation with the mono signal, is generated by means of a decorrelator, e.g., an appropriate all-pass filter which may include one or more delay lines. Essentially, the decorrelated signal has the same spectral and temporal energy distribution as the mono signal. The monaural signal together with the decorrelated signal are input to the upmix process which is controlled by the stereo parameters and which reconstructs the stereo signal. For further information, see the paper "Low Complexity Parametric Stereo Coding in MPEG-4", H. Purnhagen, Proc. of the 7th Int. Conference on Digital Audio Effects (DAFx'04), Naples, Italy, October 5-8, 2004, pages 163-168.

MPEG Surround (MPS; see ISO/IEC 23003-1 and the paper "MPEG Surround - The ISO/MPEG Standard for Efficient and Compatible Multi-Channel Audio Coding", J. Herre et al., Audio Engineering Convention Paper 7084, 122nd Convention, May 5-8, 2007) allows combining the principles of parametric stereo coding with residual coding, substituting the decorrelated signal with a transmitted residual and hence improving the perceptual quality. Residual coding may be achieved by downmixing a multi-channel signal and, optionally, by extracting spatial cues. During the process of downmixing, residual signals representing the error signal are computed and then encoded and transmitted. They may take the place of the decorrelated signals in the decoder.
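
For illustration, the mid/side relations introduced earlier in this background, M = (L + R)/2 and S = (L - R)/2, together with their inverse L = M + S, R = M - S, may be sketched as follows. The function names `ms_encode` and `ms_decode` are hypothetical, and the sketch operates on plain lists of spectral coefficients for a single frame; it makes no attempt to model the per-band or per-frame switching between L/R and M/S coding described above.

```python
# Illustrative sketch only; function names are assumptions.

def ms_encode(left, right):
    """Form mid and side signals from left and right coefficients."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid, side):
    """Recover left and right by sum and difference of mid and side."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right
```

When the two channels are strongly correlated, the side signal carries little energy, which is the reason M/S coding can be more efficient than coding L and R independently.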
In a hybrid approach, the residual signals may replace the decorrelated signals in certain frequency bands, preferably in relatively low bands.

According to the current MPEG Unified Speech and Audio Coding (USAC) system, of which two examples are shown in figure 1, the decoder comprises a complex-valued quadrature mirror filter (QMF) bank located downstream of the core decoder. The QMF representation obtained as the output of the filter bank is complex - thus oversampled by a factor of two - and can be arranged as a downmix signal (or, equivalently, mid signal) M and a residual signal D, to which an upmix matrix with complex entries is applied. The L and R signals (in the QMF domain) are obtained as:

$$\begin{bmatrix} L \\ R \end{bmatrix} = g \begin{bmatrix} 1 - \alpha & 1 \\ 1 + \alpha & -1 \end{bmatrix} \begin{bmatrix} M \\ D \end{bmatrix}$$

where g is a real-valued gain factor and α is a complex-valued prediction coefficient. Preferably, α is chosen such that the energy of the residual signal D is minimized. The gain factor may be determined by normalization, that is, to ensure that the power of the sum signal is equal to the sum of the powers of the left and right signals. The real and imaginary parts of each of the L and R signals are mutually redundant - in principle, each of them can be computed on the basis of the other - but