EP-4297027-B1 - AUDIO ENCODER, AUDIO DECODER, METHOD FOR ENCODING AN AUDIO SIGNAL AND METHOD FOR DECODING AN ENCODED AUDIO SIGNAL

Inventors

EDLER, BERND
Helmrich, Christian
NEUENDORF, MAX
SCHUBERT, Benjamin

Dates

Publication Date: 20260506
Application Date: 20160307

Claims (12)

1. A decoder (200) for decoding an encoded audio signal (120), wherein the decoder (200) is configured to decode the encoded audio signal (120) in a transform domain or filter-bank domain (204), wherein the decoder (200) is configured to parse the encoded audio signal (120) to obtain encoded spectral coefficients (206_t0_f1:206_t0_f6; 206_t-1_f1:206_t-1_f6) of the audio signal (120) for a current time frame (208_t0) and at least one previous time frame (208_t-1), and wherein the decoder (200) is configured to selectively apply predictive decoding to a plurality of individual encoded spectral coefficients (206_t0_f2) or groups of encoded spectral coefficients (206_t0_f4,206_t0_f5), wherein the decoder (200) is configured to obtain a spacing value, wherein the decoder (200) is configured to select the plurality of individual encoded spectral coefficients (206_t0_f2) or groups of encoded spectral coefficients (206_t0_f4,206_t0_f5) to which predictive decoding is applied based on the spacing value; characterized in that the audio signal (102) represented by the encoded audio signal (120) comprises at least two harmonic signal components, and the decoder (200) is configured to apply predictive decoding only to those plurality of individual encoded spectral coefficients or groups of encoded spectral coefficients which represent the at least two harmonic signal components or spectral environments around the at least two harmonic signal components of the audio signal.
2. The decoder (200) according to claim 1, wherein the plurality of individual encoded spectral coefficients (206_t0_f2) or groups of encoded spectral coefficients (206_t0_f4,206_t0_f5) are separated by at least one encoded spectral coefficient (206_t0_f3).
3. The decoder (200) according to claim 2, wherein the predictive decoding is not applied to the at least one spectral coefficient (206_t0_f3) by which the individual spectral coefficients (206_t0_f2) or the group of spectral coefficients (206_t0_f4,206_t0_f5) are separated.
4. The decoder (200) according to one of the claims 1 to 3, wherein the decoder (200) is configured to entropy decode the encoded spectral coefficients, to obtain quantized prediction errors for the spectral coefficients (206_t0_f2,206_t0_f4,206_t0_f5) to which predictive decoding is to be applied and quantized spectral coefficients for spectral coefficients (206_t0_f3) to which predictive decoding is not to be applied; and wherein the decoder (200) is configured to apply the quantized prediction errors to a plurality of predicted individual spectral coefficients (210_t0_f2) or groups of predicted spectral coefficients (210_t0_f4,210_t0_f5), to obtain, for the current time frame (208_t0), decoded spectral coefficients associated with the encoded spectral coefficients (206_t0_f2,206_t0_f4,206_t0_f5) to which predictive decoding is applied.
5. The decoder (200) according to claim 4, wherein the decoder (200) is configured to determine the plurality of predicted individual spectral coefficients (210_t0_f2) or groups of predicted spectral coefficients (210_t0_f4,210_t0_f5) for the current time frame (208_t0) based on a corresponding plurality of the individual encoded spectral coefficients (206_t-1_f2) or groups of encoded spectral coefficients (206_t-1_f4,206_t-1_f5) of the previous time frame (208_t-1).
6. The decoder (200) according to claim 5, wherein the decoder (200) is configured to derive prediction coefficients from the spacing value, and wherein the decoder (200) is configured to calculate the plurality of predicted individual spectral coefficients (210_t0_f2) or groups of predicted spectral coefficients (210_t0_f4,210_t0_f5) for the current time frame (208_t0) using a corresponding plurality of previously decoded individual spectral coefficients or groups of previously decoded spectral coefficients of at least two previous time frames and using the derived prediction coefficients.
7. The decoder (200) according to one of the claims 1 to 6, wherein the decoder (200) is configured to decode the encoded audio signal (120) in order to obtain quantized prediction errors instead of a plurality of individual quantized spectral coefficients or groups of quantized spectral coefficients for the plurality of individual encoded spectral coefficients (206_t0_f2) or groups of encoded spectral coefficients (206_t0_f4,206_t0_f5) to which predictive decoding is applied.
8. The decoder (200) according to claim 7, wherein the decoder is configured to decode the encoded audio signal (120) in order to obtain quantized spectral coefficients for encoded spectral coefficients (206_t0_f3) to which predictive decoding is not applied, such that there is an alternation of encoded spectral coefficients (206_t0_f2) or groups of encoded spectral coefficients (206_t0_f4,206_t0_f5) for which quantized prediction errors are obtained and encoded spectral coefficients (206_t0_f3) or groups of encoded spectral coefficients for which quantized spectral coefficients are obtained.
9. The decoder (200) according to one of the claims 1 to 8, wherein the decoder (200) is configured to select individual spectral coefficients (206_t0_f2) or groups of spectral coefficients (206_t0_f4,206_t0_f5) spectrally arranged according to a harmonic grid defined by the spacing value for a predictive decoding.
10. The decoder (200) according to one of the claims 1 to 9, wherein the spectral coefficients are spectral bins.
11. Method (400) for decoding an encoded audio signal in a transform domain or filter-bank domain, the method comprising: parsing (402) the encoded audio signal to obtain encoded spectral coefficients of the audio signal for a current time frame and at least one previous time frame; obtaining a spacing value; and selectively applying (404) predictive decoding to a plurality of individual encoded spectral coefficients or groups of encoded spectral coefficients, wherein the plurality of individual encoded spectral coefficients or groups of encoded spectral coefficients to which predictive decoding is applied are selected based on the spacing value; characterized in that the audio signal (102) represented by the encoded audio signal (120) comprises at least two harmonic signal components; and the method further comprises applying predictive decoding only to those plurality of individual encoded spectral coefficients or groups of encoded spectral coefficients which represent the at least two harmonic signal components or spectral environments around the at least two harmonic signal components of the audio signal.
12. Computer program for performing a method according to claim 11.
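The decoder behaviour recited in claims 1 to 9 can be illustrated by the following minimal Python sketch. It is not the patented implementation, and every name in it (`select_harmonic_bins`, `harmonic_predictor`, `decode_frame`, `half_width`) is hypothetical. Assumptions made for illustration only: harmonic h lies at h * spacing bins; bin k covers the frequency range [k, k + 1) in bin units; the "spectral environment" of a harmonic is the bin containing it plus `half_width` neighbours on each side; and the prediction coefficients are derived from the spacing value as 2cos(pi * f) and -1, which is exact for a stationary sinusoid of frequency f bins under an MDCT-like transform whose hop size equals the number of bins. Entropy decoding and dequantization are omitted, so the inputs are already-dequantized values.

```python
import math
from typing import Dict, List, Sequence, Tuple


def select_harmonic_bins(spacing: float, num_bins: int,
                         half_width: int = 1) -> Dict[int, float]:
    """Map each bin chosen for prediction to the frequency (in bins) of its harmonic.

    Assumption: harmonic h sits at h * spacing bins and falls into bin
    floor(h * spacing); bins within +/- half_width of it are also predicted.
    """
    if spacing <= 0:
        raise ValueError("spacing must be positive")
    selected: Dict[int, float] = {}
    h = 1
    while h * spacing < num_bins:
        freq = h * spacing
        center = math.floor(freq)
        for k in range(max(0, center - half_width),
                       min(num_bins, center + half_width + 1)):
            # setdefault keeps the lower harmonic if two environments overlap
            selected.setdefault(k, freq)
        h += 1
    return selected


def harmonic_predictor(freq_in_bins: float) -> Tuple[float, float]:
    """Order-2 coefficients (c1, c2) for a stationary sinusoid at freq_in_bins.

    With hop size == number of bins, the sinusoid advances its phase by
    theta = pi * freq_in_bins per frame, so x[t] = 2cos(theta) x[t-1] - x[t-2].
    """
    theta = math.pi * freq_in_bins
    return 2.0 * math.cos(theta), -1.0


def decode_frame(decoded_values: Sequence[float], prev1: Sequence[float],
                 prev2: Sequence[float], spacing: float,
                 half_width: int = 1) -> List[float]:
    """Reconstruct one frame of spectral coefficients.

    decoded_values[k] is a prediction error for selected bins and a spectral
    coefficient for all other bins; prev1 and prev2 are the decoded spectra
    of the last two frames.
    """
    selected = select_harmonic_bins(spacing, len(decoded_values), half_width)
    out: List[float] = []
    for k, value in enumerate(decoded_values):
        if k in selected:
            c1, c2 = harmonic_predictor(selected[k])
            out.append(c1 * prev1[k] + c2 * prev2[k] + value)  # prediction + error
        else:
            out.append(value)  # transmitted directly, no prediction
    return out
```

For example, with `spacing=3.0` and `half_width=0`, only bins 3 and 6 of an 8-bin frame are predicted from the two previous frames; all other bins are taken directly from the decoded values, giving the alternation described in claim 8. The restriction to harmonic bins also matters for prediction gain: for a bin containing only white noise of variance σ², the fixed predictor in this sketch leaves an error of variance σ²(1 + c1² + c2²) = σ²(2 + 4cos²θ) ≥ 2σ², i.e. at least 3 dB more energy than coding that bin directly.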

Description

Embodiments relate to audio coding, in particular to a method and apparatus for decoding an encoded audio signal using predictive decoding. Preferred embodiments relate to methods and apparatuses for pitch-adaptive spectral prediction. Further preferred embodiments relate to perceptual coding of tonal audio signals by means of transform coding with spectral-domain inter-frame prediction tools.
To improve the quality of coded tonal signals, especially at low bit rates, modern audio transform coders employ very long transforms and/or long-term prediction or pre-/postfiltering. A long transform, however, implies a long algorithmic delay, which is undesirable for low-delay communication scenarios. Hence, predictors with very low delay based on the instantaneous fundamental pitch have recently gained popularity. The IETF (Internet Engineering Task Force) Opus codec utilizes pitch-adaptive pre- and postfiltering in its frequency-domain CELT (Constrained-Energy Lapped Transform) coding path [J. M. Valin, K. Vos, and T. Terriberry, "Definition of the Opus audio codec," 2012, IETF RFC 6716. http://tools.ietf.org/html/rfc6716], and the 3GPP (3rd Generation Partnership Project) EVS (Enhanced Voice Services) codec provides a long-term harmonic post-filter for perceptual improvement of transform-decoded signals [3GPP TS 26.443, "Codec for Enhanced Voice Services (EVS)," Release 12, Dec. 2014.]. Both of these approaches operate in the time domain on the fully decoded signal waveform, making it difficult and/or computationally expensive to apply them frequency-selectively (both schemes offer only a simple low-pass filter for some frequency selectivity). A welcome alternative to time-domain long-term prediction (LTP) or pre-/postfiltering (PPF) is thus provided by frequency-domain prediction (FDP) as supported in MPEG-2 AAC [ISO/IEC 13818-7, "Information technology - Part 7: Advanced Audio Coding (AAC)," 2006.].
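To make the frequency-selectivity argument concrete, the sketch below evaluates a generic single-tap long-term (comb) filter y[n] = x[n] + g * x[n - T]. This is a hypothetical simplification, not the Opus/CELT pitch pre-/postfilter or the EVS harmonic post-filter, whose actual structures differ; `comb_gain`, `comb_filter`, the lag T = 8 and the gain g = 0.5 are illustrative assumptions.

```python
import cmath
import math
from typing import List, Sequence


def comb_gain(omega: float, lag: int, gain: float) -> float:
    """Magnitude response |1 + gain * e^{-j*omega*lag}| of a one-tap comb filter."""
    return abs(1.0 + gain * cmath.exp(-1j * omega * lag))


def comb_filter(x: Sequence[float], lag: int, gain: float) -> List[float]:
    """Apply y[n] = x[n] + gain * x[n - lag], assuming zero signal history."""
    return [x[n] + (gain * x[n - lag] if n >= lag else 0.0) for n in range(len(x))]


if __name__ == "__main__":
    lag, gain = 8, 0.5  # hypothetical pitch period of 8 samples
    for harmonic in (1, 2, 3):
        on = comb_gain(2 * math.pi * harmonic / lag, lag, gain)
        between = comb_gain(2 * math.pi * (harmonic + 0.5) / lag, lag, gain)
        print(f"harmonic {harmonic}: {on:.2f} on, {between:.2f} between")
```

Because the response is periodic in ω with period 2π/T, the boost of 1 + g at the pitch harmonics (and the attenuation to 1 - g between them) applies all the way up to the Nyquist frequency; confining it to a sub-band requires an additional filter, such as the simple low-pass filter mentioned above.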
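The backward adaptivity of the MPEG-2 AAC frequency-domain prediction, whose drawbacks are discussed in the following paragraph, can be sketched per spectral bin as below. Assumptions: a normalized LMS (NLMS) update serves as a simplified stand-in, whereas ISO/IEC 13818-7 specifies its own second-order backward-adaptive lattice LMS predictor; quantization is ignored, so the "reconstructed" value equals the input; `BackwardAdaptivePredictor` and `run_bin` are hypothetical names.

```python
import math
from typing import List


class BackwardAdaptivePredictor:
    """Order-2 predictor for one spectral bin, adapted only from reconstructed values.

    Simplified NLMS stand-in (hypothetical), not the AAC lattice predictor.
    Encoder and decoder must run identical, bit-exact copies for every bin.
    """

    def __init__(self, step: float = 0.5, eps: float = 1e-9) -> None:
        self.w = [0.0, 0.0]     # prediction coefficients (never transmitted)
        self.hist = [0.0, 0.0]  # reconstructed values of frames t-1 and t-2
        self.step = step
        self.eps = eps

    def predict(self) -> float:
        return self.w[0] * self.hist[0] + self.w[1] * self.hist[1]

    def update(self, reconstructed: float) -> None:
        """Adapt after every frame, even if prediction was disabled for this bin."""
        err = reconstructed - self.predict()
        norm = self.eps + self.hist[0] ** 2 + self.hist[1] ** 2
        self.w[0] += self.step * err * self.hist[0] / norm
        self.w[1] += self.step * err * self.hist[1] / norm
        self.hist = [reconstructed, self.hist[0]]


def run_bin(values: List[float], step: float = 0.5) -> List[float]:
    """Prediction errors for one bin's coefficient sequence (quantization ignored)."""
    predictor = BackwardAdaptivePredictor(step)
    errors = []
    for x in values:
        errors.append(x - predictor.predict())
        predictor.update(x)
    return errors


if __name__ == "__main__":
    # A stationary tonal bin: cos(theta * t) is exactly predictable with order 2.
    tonal = [math.cos(0.3 * math.pi * t) for t in range(2000)]
    errs = run_bin(tonal)
    print(f"first error {errs[0]:.3f}, max late error {max(map(abs, errs[-50:])):.2e}")
```

Because the coefficients are computed only from already reconstructed values, nothing about them is transmitted, but encoder and decoder must execute `update` identically on every frame for every bin, even where prediction is switched off. This per-bin, always-on adaptation is the source of the complexity and bit-exactness burden described in the following paragraph.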
Although this method facilitates frequency selectivity, it has two drawbacks compared with the other tools. First, the FDP method has a high computational complexity. In detail, linear predictive coding of at least order two (i.e., from the transform bins of the respective channel in the last two frames) is applied to hundreds of spectral bins for each frame and channel in the worst case, in which prediction is active in all scale factor bands [ISO/IEC 13818-7, "Information technology - Part 7: Advanced Audio Coding (AAC)," 2006.]. Second, the FDP method achieves only a limited overall prediction gain. More precisely, the efficiency of the prediction is limited because noisy components between the predictable harmonic, tonal spectral parts are subjected to the prediction as well, introducing errors, since these noisy parts are typically not predictable. The high complexity is due to the backward adaptivity of the predictors: the prediction coefficients for each bin have to be calculated from previously transmitted bins. Numerical inaccuracies between encoder and decoder can therefore lead to reconstruction errors due to diverging prediction coefficients. To overcome this problem, bit-exact identical adaptation has to be guaranteed. Furthermore, even if groups of predictors are disabled in certain frames, the adaptation always has to be performed in order to keep the prediction coefficients up to date.
US 2007/0016415 A1 relates to techniques and tools for the prediction of spectral coefficients in encoding and decoding. For certain types and patterns of content, coefficient prediction exploits the correlation between adjacent spectral coefficients, making subsequent entropy encoding more efficient. For example, an audio encoder predictively codes quantized spectral coefficients in the quantized domain and entropy encodes the results of the predictive coding. Alternatively, for a particular quantized spectral coefficient, an audio decoder entropy decodes a difference value, computes a predictor in the quantized domain, and combines the predictor and the difference value.
It is therefore the object of the present invention to provide a concept for decoding an encoded audio signal that avoids at least one (e.g., both) of the aforementioned issues and leads to a more efficient and computationally cheaper implementation. This object is achieved by the subject matter of the independent claims. Advantageous implementations are addressed by the dependent claims.
Embodiments provide an encoder for encoding an audio signal. The encoder is configured to encode the audio signal in a transform domain or filter-bank domain, wherein the encoder is configured to determine spectral coefficients of the audio signal for a current frame and at least one previous frame, wherein the encoder is configured to selectively apply predictive encoding to a plurality of individual spectral coefficients or groups of spectral coefficients, wherein the encoder is configured to determine a spacing value, wherein the encoder is configured