EP-4738346-A1 - IMMERSIVE VOICE AND AUDIO SERVICES (IVAS) WITH ADAPTIVE DOWNMIX STRATEGIES
Abstract
Disclosed is an audio signal encoding/decoding method that uses an encoding downmix strategy applied at an encoder that is different from the decoding re-mix/upmix strategy applied at a decoder. Based on the type of downmix coding scheme, the method comprises: computing input downmixing gains to be applied to the input audio signal to construct a primary downmix channel; determining downmix scaling gains to scale the primary downmix channel; generating prediction gains based on the input audio signal, the input downmixing gains and the downmix scaling gains; determining one or more residual channels from the side channels by using the primary downmix channel and the prediction gains to generate side channel predictions and subtracting the side channel predictions from the side channels; determining decorrelation gains based on energy in the residual channels; encoding the primary downmix channel, the residual channel(s), the prediction gains and the decorrelation gains into a bitstream; and sending the bitstream to a decoder.
Inventors
- MUNDT, HARALD
- MCGRATH, DAVID S.
- TYAGI, RISHABH
Assignees
- Dolby International AB
- Dolby Laboratories Licensing Corporation
Dates
- Publication Date
- 2026-05-06
- Application Date
- 2021-12-02
Claims (3)
- An audio signal coding method that uses an encoding downmix strategy applied at an encoder that is different than a decoding re-mix or upmix strategy applied at a decoder, the method comprising: obtaining (step 301), with at least one processor, an input audio signal, the input audio signal representing an input audio scene and comprising a primary input audio channel and side channels; determining (step 302), with the at least one processor, a type of downmix coding scheme based on the input audio signal; based on the type of downmix coding scheme: computing (step 303), with the at least one processor, one or more input downmixing gains to be applied to the input audio signal to construct a primary downmix channel, wherein the input downmixing gains are determined to minimize an overall prediction error on the side channels; determining (step 304), with the at least one processor, one or more downmix scaling gains to scale the primary downmix channel, wherein the downmix scaling gains are determined by minimizing an energy difference between a reconstructed representation of the input audio scene from the primary downmix channel and the input audio signal; generating (step 305), with the at least one processor, prediction gains based on the input audio signal, the input downmixing gains and the downmix scaling gains; determining (step 306), with the at least one processor, one or more residual channels from the side channels in the input audio signal by using the primary downmix channel and the prediction gains to generate side channel predictions and then subtracting the side channel predictions from the side channels; determining (step 307), with the at least one processor, decorrelation gains based on energy in the residual channels; computing, with the at least one processor, an input covariance based on the input audio signal; determining, with the at least one processor, the overall prediction error using the input covariance; encoding (step 308), with the 
at least one processor, the primary downmix channel, the one or more residual channels and side information into a bitstream, the side information comprising the prediction gains and the decorrelation gains corresponding to the one or more residual channels; and sending (step 309), with the at least one processor, the bitstream to a decoder.
- A system comprising: one or more processors; and a non-transitory computer-readable medium storing instructions that, upon execution by the one or more processors, cause the one or more processors to perform operations according to claim 1.
- A non-transitory computer-readable medium storing instructions that, upon execution by one or more processors, cause the one or more processors to perform operations according to claim 1.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims the benefit of priority to U.S. Provisional Patent Application No. 63/228,732, filed August 3, 2021, U.S. Provisional Patent Application No. 63/171,404, filed April 6, 2021, and U.S. Provisional Patent Application No. 63/120,365, filed December 2, 2020, all of which are incorporated herein by reference. This application is a European divisional application of Euro-PCT patent application EP 21836685.4 (reference: D20130EP01), filed 2 December 2021.
TECHNICAL FIELD
This disclosure relates generally to audio bitstream encoding and decoding.
BACKGROUND
Voice and audio encoder/decoder ("codec") standard development has recently focused on developing a codec for immersive voice and audio services (IVAS). IVAS is expected to support a range of audio service capabilities, including but not limited to mono-to-stereo upmixing and fully immersive audio encoding, decoding and rendering. IVAS is intended to be supported by a wide range of devices, endpoints and network nodes, including but not limited to: mobile and smart phones, electronic tablets, personal computers, conference phones, conference rooms, virtual reality (VR) and augmented reality (AR) devices, home theatre devices and other suitable devices. The IVAS codec efficiently codes an N-channel multi-channel input, including Ambisonics input, by downmixing the input into N_dmx channels (where N_dmx <= N) and generating side information (spatial metadata). These N_dmx channels are then coded by one or more instances of core codecs. The core codec bits, along with the coded side information, are then transmitted to the IVAS decoder. The IVAS decoder decodes the N_dmx downmix channels using one or more instances of core codecs and then reconstructs the multi-channel input from the N_dmx channels using the transmitted side information and one or more instances of decorrelators.
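The downmix-plus-side-information pipeline described above can be illustrated with a minimal pure-Python sketch. The function names, the equal-weight primary downmix and the least-squares side information below are illustrative assumptions for exposition only, not the IVAS specification.

```python
# Hypothetical sketch: downmix N channels to N_dmx channels plus side
# information; the decoder reconstructs an approximation of the input.

def encode(channels, n_dmx):
    """Downmix len(channels) channels to n_dmx channels (1 <= n_dmx <= N)."""
    assert 1 <= n_dmx <= len(channels)
    n_samples = len(channels[0])
    # Primary downmix channel: equal-weight sum of the inputs (a crude
    # stand-in for the dominant eigen signal W').
    primary = [sum(ch[t] for ch in channels) / len(channels)
               for t in range(n_samples)]
    dmx = [primary] + [list(ch) for ch in channels[1:n_dmx]]

    # Side information: least-squares gain predicting each input channel
    # from the primary downmix channel.
    def lsq_gain(target, ref):
        num = sum(a * b for a, b in zip(target, ref))
        den = sum(b * b for b in ref) or 1.0
        return num / den

    side_info = [lsq_gain(ch, primary) for ch in channels]
    return dmx, side_info

def decode(dmx, side_info):
    """Reconstruct an approximation of the input from downmix + side info."""
    primary = dmx[0]
    return [[g * s for s in primary] for g in side_info]

x = [[1.0, 2.0, 3.0], [0.5, 1.0, 1.5]]   # 2 channels, 3 samples
dmx, side = encode(x, 1)                  # code only the primary channel
y = decode(dmx, side)                     # reconstructed 2-channel output
```

Here the two input channels are proportional, so the single downmix channel plus per-channel gains reconstruct them exactly; real inputs additionally require the residual channels and decorrelators described below.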
At various bitrates, a different number of downmix channels N_dmx may be coded; e.g., at 32 kbps, only one downmix channel may be coded. One of the N_dmx downmix channels is a representation of a dominant eigen signal (W') of the N-channel input (hereinafter also referred to as the "primary downmix channel"), and the rest of the downmix channels may be derived as a function of W' and the multi-channel input. There are two downmixing schemes available in IVAS: a passive downmix scheme and an active downmix scheme. In the passive downmix scheme, the dominant eigen signal (W') is a delayed version of the center channel or the primary input channel (the W channel in the case of Ambisonics input). In the active downmix scheme, the eigen signal (W') is obtained by scaling and adding one or more channels of the N-channel input. For example, for a first-order Ambisonics (FoA) input, W' = s0*W + s1*Y + s2*X + s3*Z, where s0-s3 are input downmixing gains. Thus, the passive downmixing scheme can be viewed as a special case of the active downmixing scheme in which s0 = 1, s1 = 0, s2 = 0 and s3 = 0.
SUMMARY
Implementations are disclosed for IVAS coding with adaptive downmix strategies, wherein an adaptive downmix is either a passive downmix, an active downmix, or a combination of passive and active downmix.
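The passive and active downmix schemes described in the Background can be sketched as follows for a short FoA frame. The gain values used in the active example are arbitrary placeholders, not values an actual IVAS encoder would choose.

```python
# Sketch of W' = s0*W + s1*Y + s2*X + s3*Z for a first-order Ambisonics
# (FoA) input; the passive scheme is the special case s = (1, 0, 0, 0).

def primary_downmix(w, y, x, z, gains):
    """Compute the primary downmix channel W' sample by sample."""
    s0, s1, s2, s3 = gains
    return [s0 * a + s1 * b + s2 * c + s3 * d
            for a, b, c, d in zip(w, y, x, z)]

PASSIVE_GAINS = (1.0, 0.0, 0.0, 0.0)  # passive: W' is just the W channel

# Two-sample FoA frame (illustrative values).
w = [0.2, 0.4]; y = [0.1, 0.0]; x = [0.0, 0.3]; z = [0.05, 0.05]

w_passive = primary_downmix(w, y, x, z, PASSIVE_GAINS)
# Active scheme: gains chosen by the encoder (arbitrary placeholders here).
w_active = primary_downmix(w, y, x, z, (0.9, 0.2, 0.2, 0.1))
```

As the sketch shows, the passive result reproduces the W channel, while the active result blends in the directional channels according to the input downmixing gains.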
In an embodiment, an audio signal encoding method that uses an encoding downmix strategy applied at an encoder that is different from a decoding re-mix/upmix strategy applied at a decoder comprises: obtaining, with at least one processor, an input audio signal, the input audio signal representing an input audio scene and comprising a primary input audio channel and side channels; determining, with the at least one processor, a type of downmix coding scheme based on the input audio signal; based on the type of downmix coding scheme: computing, with the at least one processor, one or more input downmixing gains to be applied to the input audio signal to construct a primary downmix channel, wherein the input downmixing gains are determined to minimize an overall prediction error on the side channels; determining, with the at least one processor, one or more downmix scaling gains to scale the primary downmix channel, wherein the downmix scaling gains are determined by minimizing an energy difference between a reconstructed representation of the input audio scene from the primary downmix channel and the input audio signal; generating, with the at least one processor, prediction gains based on the input audio signal, the input downmixing gains and the downmix scaling gains; determining, with the at least one processor, one or more residual channels from the side channels in the input audio signal by using the primary downmix channel and the prediction gains to generate side channel predictions and then subtracting the side channel predictions from the side channels; determining, with the at least one processor, decorrelation gains based on energy in the residual channels; encoding, with the at least one processor, the primary downmix channel, the one or more residual channels and side information into a bitstream, the side information comprising the prediction gains and the decorrelation gains corresponding to the one or more residual channels; and sending, with the at least one processor, the bitstream to a decoder.
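The encoder steps of this embodiment (primary downmix construction, side-channel prediction, residual formation and decorrelation gains) can be sketched as follows. This is a simplified, hypothetical single-band illustration: the prediction gains here are plain least-squares fits against W', not the claimed optimization of the input downmixing gains via the input covariance.

```python
# Hypothetical one-band sketch of the embodiment's encoder steps.

def encode_frame(primary_in, side_channels, dmx_gains, dmx_scale):
    """Return (W', residuals, prediction gains, decorrelation gains)."""
    n = len(primary_in)
    channels = [primary_in] + side_channels
    # Primary downmix channel W': gain-weighted sum of inputs, then scaled.
    w_prime = [dmx_scale * sum(g * ch[t] for g, ch in zip(dmx_gains, channels))
               for t in range(n)]
    energy_w = sum(s * s for s in w_prime) or 1.0
    # Prediction gain per side channel: least-squares fit against W'.
    pred_gains = [sum(a * b for a, b in zip(ch, w_prime)) / energy_w
                  for ch in side_channels]
    # Residual: side channel minus its prediction from W'.
    residuals = [[ch[t] - g * w_prime[t] for t in range(n)]
                 for ch, g in zip(side_channels, pred_gains)]
    # Decorrelation gain: residual-to-downmix energy ratio (square root).
    decorr_gains = [(sum(r * r for r in res) / energy_w) ** 0.5
                    for res in residuals]
    return w_prime, residuals, pred_gains, decorr_gains

# One side channel exactly predictable from W': residual and decorrelation
# gain both collapse to zero.
w_prime, residuals, pred_gains, decorr_gains = encode_frame(
    [1.0, 0.0], [[0.5, 0.0]], (1.0, 0.0), 1.0)
```

In the example, the side channel is half the primary channel, so the prediction gain is 0.5 and the residual carries no energy; a real encoder would then spend residual bits only where prediction fails and signal decorrelation gains for the rest.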