CN-122029602-A - Encoding and decoding an audio signal

Abstract

The present technology relates to encoding and decoding an audio signal and may be applied, for example, to an encoder, a decoder, a method for encoding or decoding, and a non-transitory storage unit controlling encoding or decoding. For example, the present technology relates to mapping and/or resampling of innovative codebooks. An apparatus for generating a decoded audio signal divided into a plurality of frames or subframes comprises: a codec signal reader (722) configured to read, from a codec signal (124), at least codec information on prediction coefficients (105') and codec information (119) on at least one pulse; and a signal processor (703) configured to generate the decoded audio signal (702) at least from a decoded version (704) of the prediction coefficients and a decoded pulse combination (710) or a processed version thereof. The decoded audio signal (702) is generated with a first sample, meaning a first plurality of sample positions having a first number of sample positions in one frame or subframe. The apparatus is configured to derive (716) the decoded pulse combination (710) from the codec information (119) on the at least one pulse and at least one second sample codebook (118), which comprises a set of pulse combinations defined at a second sample, meaning a second plurality of sample positions having a second number of sample positions in the frame or subframe. The first plurality of sample positions is different from the second plurality of sample positions.

Inventors

Domenico Tiziani
JIM FOX
Sania Tayal
MARTIN MILLER
Sebastian Borten
Kakpel Sagnovsky

Assignees

Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. (弗劳恩霍夫应用研究促进协会)

Dates

Publication Date: 20260512
Application Date: 20240731
Priority Date: 20230801

Claims (20)

1. An apparatus for generating a decoded audio signal divided into a plurality of frames or subframes, comprising: a codec signal reader (722) configured to read, from a codec signal (124), at least codec information on prediction coefficients (105') and codec information (119) on at least one pulse; a signal processor (703) configured to generate the decoded audio signal (702) from at least a decoded version (704) of the prediction coefficients and a decoded pulse combination (710) or a processed version thereof, wherein the decoded audio signal (702) is generated with a first sample, meaning a first plurality of sample positions having a first number of sample positions in one frame or subframe; wherein the apparatus is configured to derive (716) the decoded pulse combination (710) from the codec information (119) on the at least one pulse and at least one second sample codebook (118), wherein the at least one second sample codebook (118) comprises a set of pulse combinations defined at a second sample, meaning a second plurality of sample positions having a second number of sample positions in the frame or subframe, wherein the first sample differs from the second sample at least in that the first plurality of sample positions differs from the second plurality of sample positions.
2. The apparatus of claim 1, configured to use the decoded pulse combination (710) or a processed version (710') thereof to excite a synthesis filter (830) derived from the prediction coefficients (704).
3. The apparatus of any of the preceding claims, wherein the codec signal reader (722) is configured to read codec information (117) regarding a long-term prediction, the long-term prediction being related to a prediction delay (157) and/or at least one long-term prediction gain (156), and wherein the apparatus is configured to generate the decoded audio signal (702) based on a long-term prediction (834) using the prediction delay (157) and/or the at least one long-term prediction gain (156).
4. The apparatus of any of the preceding claims, wherein the second number of sample positions is different from the first number of sample positions.
5. The apparatus of any of the preceding claims, wherein the second number of sample positions is less than the first number of sample positions.
6. The apparatus of any of the preceding claims, configured such that the first and second pluralities of sample positions within the frame or subframe are defined by first and second pluralities of tracks, respectively, the tracks of each plurality being regularly interleaved with one another, wherein the second plurality of sample positions is defined by at least one track fewer than the first plurality of sample positions.
7. The apparatus of claim 6, configured to process the decoded pulse combination (710) by inserting at least one empty track with zero-valued samples, the zero-valued samples being regularly interleaved in the second plurality of tracks, thereby obtaining a resampled decoded pulse combination at the first sample, the resampled decoded pulse combination being defined at the first plurality of sample positions.
8. The apparatus of claim 6 or 7, wherein the first plurality of tracks of the first plurality of sample positions and the second plurality of tracks of the second plurality of sample positions have respective identical sample positions, except for the at least one empty track.
9. The apparatus of any of claims 6-8, configured to define the at least one empty track as having a number of sample positions (e.g. 16) that is a power of 2.
10. The apparatus of any of claims 6-9, wherein the second number of sample positions is a power of 2.
11. The apparatus of any of claims ‌, wherein the second plurality of sample locations is mapped to the first plurality of sample locations by adding at least one empty track to a track of the second plurality of tracks.
12. The apparatus of any of the preceding claims, configured to resample (818) the decoded pulse combination (710) or a processed version thereof from the second sample to the first sample to obtain a resampled version (710') of the decoded pulse combination (710) or of the processed version thereof.
13. The apparatus of any of the preceding claims, further comprising a resampler (818) configured to resample the decoded pulse combination (710) from the second sample to the first sample for further processing in the decoding.
14. The apparatus of any of the preceding claims, further comprising a resampler (818) configured to perform upsampling on the decoded pulse combination or a processed version thereof.
15. The apparatus of any of the preceding claims, configured to operate in at least a first mode of operation and a second mode of operation such that: in the second mode of operation, the codec pulse position information is determined at the second sample; and in the first mode of operation, the codec pulse position information is determined at the first sample, and the decoded pulse combination is an entry in at least one first sample codebook containing a set of pulse combinations defined at the first sample.
16. The apparatus of claim 15, configured to select between the first and second modes of operation.
17. The apparatus of any of the preceding claims, wherein the second plurality of sample positions is a proper subset of the first plurality of sample positions.
18. The apparatus of any of the preceding claims, configured to select between at least a first mode of operation and a second mode of operation in dependence on a target packet size or an instantaneous bit rate of a current frame to be encoded.
19. The apparatus of any of the preceding claims, wherein the at least one second sample codebook (118) is or comprises an innovation codebook.
20. The apparatus of any of the preceding claims, wherein the at least one second sample codebook (118) is or comprises an algebraic codebook.
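
The empty-track insertion of claims 6 to 8 can be illustrated with a minimal Python sketch. It is not taken from the application: the function name `insert_empty_track`, the choice of 4 coded tracks of 16 positions (64 samples) mapped onto 5 tracks (80 samples), and the placement of the empty track as the last track of each interleaved group are all assumptions made for illustration.

```python
def insert_empty_track(code_vector, num_tracks):
    """Upsample a pulse combination by adding one zero-valued track.

    Hypothetical helper: the input is defined on `num_tracks` regularly
    interleaved tracks; the output has `num_tracks + 1` tracks, where the
    added (last) track holds only zeros. The position of the empty track
    within each group is an assumption of this sketch.
    """
    if len(code_vector) % num_tracks != 0:
        raise ValueError("length must be a multiple of the track count")
    resampled = []
    for start in range(0, len(code_vector), num_tracks):
        # Copy one sample from each coded track, then one empty-track sample.
        resampled.extend(code_vector[start:start + num_tracks])
        resampled.append(0)
    return resampled


# Pulse combination at the second sample: 4 tracks x 16 positions = 64 samples.
low = [0] * 64
low[5] = 1     # track 1, slot 1
low[63] = -1   # track 3, slot 15

# Pulse combination at the first sample: 5 tracks x 16 positions = 80 samples.
high = insert_empty_track(low, 4)
pulses = [(n, v) for n, v in enumerate(high) if v != 0]
print(len(high), pulses)  # 80 [(6, 1), (78, -1)]
```

In this sketch every fifth sample of the output is never occupied by a pulse, so the coded positions form a proper subset of the output positions, while the number of pulses is unchanged.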

Description

Encoding and decoding an audio signal
Technical field
The present technology relates to encoding and decoding an audio signal and may be applied, for example, to an encoder, a decoder, a method for encoding or decoding, and a non-transitory storage unit controlling encoding or decoding. For example, the present technology relates to mapping and/or resampling of innovative codebooks.
Background art
Audio codecs, such as speech codecs, are known that rely on a codebook, such as an innovative codebook, to quantize prediction residuals, such as those from linear prediction (LP) and long-term prediction (LTP). In particular, for encoding prediction residual signals (e.g. excitation signals), the position, amplitude and sign of pulses may be encoded and subsequently decoded.
Although such codecs are widely used, some problems are encountered. For example, in some cases it would be preferable to further reduce the number of bits in the bitstream. Furthermore, it is often difficult to adapt to a target bit rate. In encoding, it is often desirable to maintain the sample rate of the input audio signal, which makes it difficult to change the bit rate, as will be discussed in more detail below.
In speech codecs using CELP (code-excited linear prediction), the prediction residuals from linear prediction (LP) and long-term prediction (LTP) are quantized using an innovative codebook. Unlike the LP coding, where the spectral envelope is coded once per time frame, the parameters of the LTP and of the residual are quantized for multiple parts of the frame, called subframes. In the specific case of ACELP (algebraic CELP, i.e. CELP using an algebraic innovative codebook), the innovative codebook is defined by algebraic codes that encode the temporal position and the sign of the pulses within a given subframe. The parameters of these pulses are optimized during the encoding process by a least squares algorithm.
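As a minimal illustration of what an algebraic code represents, the following Python sketch builds a sparse excitation vector from pulse positions and signs. The helper name `build_code_vector`, the unit pulse amplitude and the example pulse positions are assumptions for illustration only; the gain scaling and the least squares search used by a real encoder are not shown.

```python
def build_code_vector(subframe_len, pulses):
    """Build a sparse excitation (code vector) from (position, sign) pairs.

    Hypothetical helper for illustration: unit-amplitude pulses are assumed,
    and pulses that share a position simply add up.
    """
    code_vector = [0] * subframe_len
    for position, sign in pulses:
        if not 0 <= position < subframe_len:
            raise ValueError(f"position {position} is outside the subframe")
        if sign not in (-1, 1):
            raise ValueError("sign must be +1 or -1")
        code_vector[position] += sign
    return code_vector


# Two pulses in a 64-sample subframe: +1 at sample 4 and -1 at sample 37.
code_vector = build_code_vector(64, [(4, 1), (37, -1)])
nonzero = [(n, v) for n, v in enumerate(code_vector) if v != 0]
print(nonzero)  # [(4, 1), (37, -1)]
```

Only the positions and signs need to be transmitted; all other samples of the code vector are zero by construction.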
While the number of theoretically possible positions for a given number of pulses within a subframe is determined only by the length of the subframe and the sampling rate, the algebraic encoding process selects a pulse configuration from a subset whose cardinality is constrained by the available bit budget. In existing ACELP implementations (as in 3GPP EVS), different sampling rates are applied at different bit rates so that additional available bits can be used for increased time resolution. The additional resolution, and consequently the larger number of possible pulse positions, comes at the cost of a reduction in the number of encodable pulses.
The present technique provides, among other things, an algebraic coding scheme that, by systematically excluding pulse positions, allows pulses to be positioned at lower bit rates without reducing the total number of pulses.
For efficient residual coding, it is convenient and usual to make the number of possible pulse positions a power of 2. If the number of samples per frame is a multiple of a power of 2, this can be achieved by dividing the frame into an appropriate number of subframes. This approach has two drawbacks. First, it cannot be applied if the number of samples is not a multiple of a power of 2. Second, the bit consumption of the LTP parameters and of the residual coding increases with the number of subframes.
Innovative CELP codebooks are typically highly constrained. For example, in ACELP, each subframe is divided into tracks of interleaved positions. As mentioned above, for convenience, low complexity and efficient coding, the number of positions per track is typically the same for all tracks and a power of 2. For example, for a 64-sample subframe, 2 tracks of 32 samples each or 4 tracks of 16 samples each may be used. The codebook is then designed to allocate the pulse budget evenly or nearly evenly among the tracks, thereby achieving an equal or nearly equal number of pulses per track.
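The track layout described above can be sketched in Python as follows. The helper names `track_positions` and `position_bits` are hypothetical, tracks are numbered from 0 here, and only the bits needed to index a position are counted; how signs and multiple pulses per track are packed is not covered by this sketch.

```python
def track_positions(subframe_len, num_tracks):
    """Return the interleaved sample positions of each track (0-based tracks)."""
    if subframe_len % num_tracks != 0:
        raise ValueError("tracks would not have equal size")
    return [list(range(t, subframe_len, num_tracks)) for t in range(num_tracks)]


def position_bits(positions_per_track):
    """Bits needed to index one position when the track size is a power of 2."""
    if positions_per_track < 1 or positions_per_track & (positions_per_track - 1):
        raise ValueError("track size is not a power of 2")
    return positions_per_track.bit_length() - 1


for num_tracks in (2, 4):
    tracks = track_positions(64, num_tracks)
    size = len(tracks[0])
    print(f"{num_tracks} tracks of {size} positions, "
          f"{position_bits(size)} bits per position, "
          f"track 0 starts {tracks[0][:4]}")
```

For a 64-sample subframe this gives 5 bits per pulse position with 2 tracks of 32 positions and 4 bits per pulse position with 4 tracks of 16 positions, which is why equal track sizes that are powers of 2 use the position bits without waste.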
Thus, for low bit rates, when the number of pulses is limited, the number of tracks must be reduced, which may be impossible or complicated, as it does not result in tracks of equal size or with a power-of-2 number of positions. Another, more practical solution is to reduce the sampling rate of the CELP speech codec, which automatically reduces the number of possible positions. This is typically done in wideband or ultra-wideband speech codecs operating at bit rates of about 16 kbps or below, where the baseband CELP encoder operates at only 12.8 kHz. A disadvantage of reducing the internal sampling rate of the baseband codec is that the coded audio bandwidth of the baseband codec is further limited, and that resampling memories and buffers are required when switching to or from a higher bit rate.
For a 64-sample subframe, the potential positions of the individual pulses in a 2-pulse algebraic codebook with two 32-position tracks are, for example:
Track | Pulse | Positions
1 | 0 | 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62
2 | 1 | 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49,