EP-4740207-A1 - METHODS, APPARATUS AND SYSTEMS FOR SCENE BASED AUDIO MONO DECODING
Abstract
Audio signal encoding and decoding methods are disclosed herein. Some disclosed methods for encoding an audio signal involve obtaining an audio signal that represents an input audio scene with a primary channel and side channels, analyzing the power of the primary channel and analyzing the powers of the side channels. Some such methods involve detecting a mono mode for encoding the audio signal based on analyzing the power of the primary channel and the powers of the side channels and computing one or more downmix channels and spatial metadata from the audio signal for a detected mono mode. Some such methods involve encoding the one or more downmix channels and spatial metadata in a bitstream for the detected mono mode and indicating the mono mode in the bitstream.
Inventors
- MILLS, Adam John
- TYAGI, Rishabh
- TORRES, Juan Felix
Assignees
- Dolby Laboratories Licensing Corporation
Dates
- Publication Date
- 2026-05-13
- Application Date
- 2024-07-03
Claims (20)
- What is claimed is: 1. A method for encoding an audio signal, comprising: obtaining the audio signal, wherein the audio signal represents an input audio scene with a primary channel and side channels; analyzing a power of the primary channel of the audio signal; analyzing powers of the side channels of the audio signal; detecting a mono mode for encoding the audio signal based on analyzing the power of the primary channel and the powers of the side channels; computing one or more downmix channels and spatial metadata from the audio signal for a detected mono mode; encoding the one or more downmix channels and spatial metadata in a bitstream for the detected mono mode; and indicating the mono mode in the bitstream.
- 2. The method of claim 1, further comprising outputting the bitstream, storing the bitstream, transmitting the bitstream, or combinations thereof.
- 3. The method of claim 1 or claim 2, wherein detecting the mono mode is based on a determination that the input audio signal has non-silent audio in the primary channel and that the side channels are silent.
- 4. The method of claim 1 or claim 2, wherein detecting the mono mode comprises: computing a primary channel power of the primary channel of the audio signal; and computing a sum power as a summation of the powers of the side channels of the audio signal.
- 5. The method of claim 4, wherein detecting the mono mode further comprises evaluating a ratio of the primary channel power to the sum power.
- 6. The method of claim 5, wherein detecting the mono mode further comprises determining when the ratio exceeds a threshold.
- 7. The method of any one of claims 1–6, further comprising implicitly signaling the mono mode by setting one or more parameters of the spatial metadata to a value of zero.
- 8. The method of claim 7, wherein the one or more parameters of the spatial metadata include one or more Spatial Reconstruction (SPAR) metadata parameters, one or more Directional Audio Coding (DirAC) metadata parameters, or both.
- 9. The method of any one of claims 1–8, further comprising explicitly signaling the mono mode by setting a mono flag.
- 10. The method of any one of claims 1–9, wherein the bitstream is an Immersive Voice and Audio Services (IVAS) encoded bitstream.
- 11. A method for audio signal decoding, comprising: obtaining an encoded bitstream; decoding the encoded bitstream and obtaining downmix channels, spatial metadata, and a mono mode indicator; setting one or more parameters of spatial metadata to zero upon detecting a mono mode; upmixing the downmix channels using the spatial metadata; and rendering upmixed channels to a desired audio format.
- 12. The method of claim 11, wherein the rendering produces rendered audio data, further comprising outputting the rendered audio data, storing the rendered audio data, transmitting the rendered audio data, or combinations thereof.
- 13. The method of claim 11, wherein the rendering produces rendered audio data, further comprising providing the rendered audio data to one or more loudspeakers for playback.
- 14. The method of any one of claims 11–13, wherein the mono mode indicator is based on values of one or more spatial metadata parameters in the encoded bitstream that are set to indicate the mono mode.
- 15. The method of claim 14, wherein the one or more spatial metadata parameters include one or more Spatial Reconstruction (SPAR) metadata parameters, one or more Directional Audio Coding (DirAC) metadata parameters, or both.
- 16. The method of any one of claims 11–15, wherein setting the one or more parameters of spatial metadata to zero upon detecting the mono mode involves setting one or more energy ratio values to zero.
- 17. The method of claim 16, wherein a received energy ratio metadata value is non-zero due to an artifact of a quantization process.
- 18. The method of claim 17, further comprising setting one or more diffuseness values to one.
- 19. The method of any one of claims 11–18, wherein the encoded bitstream is an Immersive Voice and Audio Services (IVAS) encoded bitstream.
- 20. A method of encoding an audio signal representing an input audio scene with a primary channel and side channels, the method comprising: obtaining a frame of the audio signal from a bitstream; identifying a plurality of frequency bands associated with the frame; determining power associated with the primary channel and the side channels for each of the plurality of frequency bands associated with the frame; classifying each of the plurality of frequency bands as one of Silence, Mono, or Ambisonics based on a determined power for a corresponding frequency band; marking a mode of the frame as one of Silence, Mono, or Ambisonics based on a plurality of classified frequency bands for the frame; and encoding a marked mode of the frame in the bitstream.
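As an illustration of the decoding flow of claims 11 and 16–18, the following minimal Python sketch forces the spatial metadata to a mono-consistent state before upmixing. The dictionary layout, the field names (`energy_ratios`, `diffuseness`), and the placeholder `upmix()` are assumptions for illustration only and are not taken from the specification.

```python
def upmix(downmix, metadata):
    # Placeholder for the codec's upmixing stage; a real decoder would
    # reconstruct the full channel set from the downmix channels and
    # the spatial metadata.
    return downmix


def apply_mono_mode(metadata: dict) -> dict:
    # Force spatial metadata to a mono-consistent state (claims 16-18):
    # energy ratios that are non-zero only as an artifact of quantization
    # (claim 17) are clamped to zero, and diffuseness values are set to one.
    metadata = dict(metadata)  # avoid mutating the caller's copy
    metadata["energy_ratios"] = [0.0] * len(metadata["energy_ratios"])
    metadata["diffuseness"] = [1.0] * len(metadata["diffuseness"])
    return metadata


def decode_frame(downmix, metadata: dict, mono_mode: bool):
    # Decoding flow of claim 11: adjust the metadata when the mono mode
    # indicator is set, then upmix with the (possibly modified) metadata.
    if mono_mode:
        metadata = apply_mono_mode(metadata)
    return upmix(downmix, metadata)
```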
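Claim 20 classifies each frequency band of a frame as Silence, Mono, or Ambisonics from the per-band powers and then marks the frame accordingly. Below is a minimal sketch of one plausible reading; the threshold values and the band-to-frame aggregation rule are assumptions, since the claim does not fix them.

```python
SILENCE_THRESHOLD = 1e-9  # assumed absolute power floor, not from the claims
MONO_RATIO = 1e4          # assumed primary-to-side power ratio threshold


def classify_band(primary_power: float, side_powers: list[float]) -> str:
    # Classify one frequency band as Silence, Mono, or Ambisonics
    # from the powers determined for that band (claim 20).
    side_sum = sum(side_powers)
    if primary_power + side_sum < SILENCE_THRESHOLD:
        return "Silence"
    if side_sum == 0.0 or primary_power / side_sum > MONO_RATIO:
        return "Mono"
    return "Ambisonics"


def mark_frame_mode(band_classes: list[str]) -> str:
    # Collapse the per-band classifications into a single frame mode.
    # Claim 20 leaves the aggregation rule open; treating any
    # Ambisonics band as decisive is one conservative choice.
    if "Ambisonics" in band_classes:
        return "Ambisonics"
    if "Mono" in band_classes:
        return "Mono"
    return "Silence"
```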
Description
METHODS, APPARATUS AND SYSTEMS FOR SCENE BASED AUDIO MONO DECODING

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims the benefit of priority from U.S. Provisional Patent Application Serial No. 63/570,117, filed on March 26, 2024, U.S. Provisional Patent Application Serial No. 63/593,278, filed on October 26, 2023, and U.S. Provisional Patent Application Serial No. 63/511,786, filed on July 3, 2023, each of which is incorporated by reference herein in its entirety.

TECHNICAL FIELD

[0002] This disclosure relates generally to audio processing.

BACKGROUND

[0003] Unless otherwise indicated herein, the materials described in this section are not prior art to the claims in this application, and are not admitted as prior art by inclusion in this section.

[0004] Spatial Reconstruction (SPAR) and Directional Audio Coding (DirAC) are separate spatial audio coding technologies that each seek to represent an input spatial audio scene in a compact way to enable transmission with a good trade-off between audio quality and bitrate. One such input format for a spatial audio scene is an Ambisonics representation (e.g., first-order Ambisonics (FOA) or higher-order Ambisonics (HOA)).

[0005] SPAR seeks to maximize perceived audio quality while minimizing bitrate by reducing the energy of the transmitted audio data while still allowing the second-order statistics of the Ambisonics audio scene (e.g., the covariance) to be reconstructed at the decoder side using transmitted metadata. SPAR seeks to faithfully reconstruct the input Ambisonics scene at the output of the decoder.

[0006] DirAC is a technology that represents spatial audio scenes as a collection of directions of arrival (DOA) in time-frequency tiles. From this representation, a similar-sounding scene can be reproduced in a different output format (e.g., binaural). Notably, in the context of Ambisonics, the DirAC representation allows a decoder to produce higher-order output from low-order input (blind upmix). DirAC seeks to preserve the direction and diffuseness of the dominant sounds in the input scene.

[0007] DirAC and SPAR have different strengths and properties. It is therefore desirable to combine the complementary aspects of DirAC and SPAR (e.g., higher audio quality, reduced bitrate, input/output format flexibility and/or reduced computational complexity) into a coder/decoder ("codec"), such as an Ambisonics codec.

SUMMARY

[0008] Techniques are described for processing audio signals. Examples disclosed herein provide systems, devices, and methods to encode and/or decode a bitstream in which frames are marked with a mode indicator that is leveraged in rendering the audio.

[0009] Audio signal encoding and decoding methods are disclosed herein. Some disclosed methods for encoding an audio signal involve obtaining an audio signal that represents an input audio scene with a primary channel and side channels, analyzing the power of the primary channel and analyzing the powers of the side channels. Some such methods involve detecting a mono mode for encoding the audio signal based on analyzing the power of the primary channel and the powers of the side channels and computing one or more downmix channels and spatial metadata from the audio signal for a detected mono mode. Some such methods involve encoding the one or more downmix channels and spatial metadata in a bitstream for the detected mono mode and indicating the mono mode in the bitstream.
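Paragraph [0005] above refers to the second-order statistics (e.g., the covariance) of the Ambisonics scene that SPAR metadata allows the decoder to reconstruct. For background only, here is a minimal sketch of computing that covariance for a block of FOA samples; the function name and the channels-as-rows layout are illustrative assumptions.

```python
import numpy as np


def foa_covariance(frame: np.ndarray) -> np.ndarray:
    # `frame` holds the W, X, Y, Z channels of a first-order Ambisonics
    # block as rows (shape 4 x N). The returned 4 x 4 matrix is the
    # second-order statistic that SPAR metadata lets the decoder
    # reconstruct.
    frame = frame - frame.mean(axis=1, keepdims=True)
    return frame @ frame.T / frame.shape[1]
```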
[0010] Some example embodiments describe methods for encoding an audio signal. In some instances, the audio signal may represent an input audio scene with a primary channel and side channels. In some example embodiments, the methods may involve obtaining the audio signal. According to some example embodiments, the methods may involve analyzing a power of the primary channel of the audio signal and analyzing powers of the side channels of the audio signal. In some example embodiments, the methods may involve detecting a mono mode for encoding the audio signal based on analyzing the power of the primary channel and the powers of the side channels. According to some example embodiments, the methods may involve computing one or more downmix channels and spatial metadata from the audio signal for a detected mono mode and encoding the one or more downmix channels and spatial metadata in a bitstream for the detected mono mode. In some example embodiments, the methods may involve indicating the mono mode in the bitstream.

[0011] In some example embodiments, the methods may involve outputting the bitstream, storing the bitstream, transmitting the bitstream, or combinations thereof.

[0012] According to some example embodiments, detecting the mono mode may be based on a determination that the input audio signal has non-silent audio in the primary channel and that the side channels are silent.

[0013] In some example embodiments, detecting the mono mode may involve computing a primary channel power of the primary channel of the audio signal and computing a sum power as a summation of the powers of the side channels of the audio signal.
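A minimal sketch of the detection described in paragraphs [0012] and [0013] (and in claims 4–6): compute the primary channel power and the summed side-channel power, then compare their ratio against a threshold. The function name and the threshold value are illustrative assumptions; the disclosure does not fix a value.

```python
import numpy as np

RATIO_THRESHOLD = 1e4  # illustrative; the disclosure does not fix a value


def detect_mono_mode(primary: np.ndarray, sides: list[np.ndarray]) -> bool:
    # Per paragraphs [0012]-[0013] and claims 4-6: compute the power of
    # the primary channel, sum the powers of the side channels, and
    # detect mono mode when the ratio exceeds a threshold.
    primary_power = float(np.mean(primary ** 2))
    sum_power = sum(float(np.mean(s ** 2)) for s in sides)
    if sum_power == 0.0:
        # Side channels are silent; mono if the primary is non-silent.
        return primary_power > 0.0
    return primary_power / sum_power > RATIO_THRESHOLD
```

For an FOA input, the omnidirectional W channel would naturally serve as the primary channel, with X, Y, and Z as the side channels.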