EP-4736160-A1 - SPATIAL CODING OF OBJECT-BASED AUDIO

Abstract

A spatial coding method and audio system configured to reduce the complexity of an audio scene via audio-object clustering. In at least some examples, the spatial coding method is implemented using only a limited set of basic matrix operations, which tends to significantly reduce the associated computational complexity. For example, the spatial coding method employs a cost-matrix construction approach, under which the object inter-product matrix is constructed first and then a plurality of cost matrices is derived therefrom by decimation and addition, with no advanced computational operations, such as multiplications or divisions, being needed. At least some embodiments can beneficially be used for reduction, simplification, or compression of complex audio content, with minimal impact on the audio quality, such that the audio content can be distributed through transmission systems that do not possess sufficient bandwidth to timely deliver all of the original audio-object data to the end users.

Inventors

YANG, Ziyu
SHUANG, Zhiwei
LU, Lie

Assignees

Dolby Laboratories Licensing Corporation

Dates

Publication Date: 2026-05-06
Application Date: 2024-06-18

Claims (20)

What is claimed is:
1. A spatial coding method for object-based audio, the method comprising: selecting N cluster seeds from L audio objects of an audio scene based on perceptually weighted energies of the L audio objects, where N < L; and obtaining N clusters corresponding to the N cluster seeds by applying a respective one of L gain vectors to each of the L audio objects, each of the L gain vectors being determined via minimization of a respective one of L cost functions configured to substantially preserve one or more selected metrics of the audio scene for rendering in an audio system upon replacement of the L audio objects by the N clusters.
2. The spatial coding method of claim 1, wherein each of the L gain vectors has N respective components.
3. The spatial coding method of claim 1 or 2, wherein the selecting is further based on pairwise distances of the L audio objects.
4. The spatial coding method of any preceding claim, wherein the one or more selected metrics include one or more of a metric of position correctness, a metric of object-to-cluster distance, and a metric of amplitude preservation.
5. The spatial coding method of any preceding claim, wherein the selecting includes applying a threshold-in-quiet filter to frequency spectra of the L audio objects.
6. The spatial coding method of any preceding claim, wherein the selecting includes iteratively identifying the N cluster seeds one by one.
7. The spatial coding method of any preceding claim, further comprising selecting a value of N based on one or more parameters of the audio system.
8. The spatial coding method of any preceding claim, further comprising computing a first L×L matrix whose matrix elements represent inter products of position vectors of different pairs of the audio objects.
9. The spatial coding method of claim 8, further comprising computing a second L×L matrix whose matrix elements represent squared norms of the position vectors of individual ones of the L audio objects, the second L×L matrix being a rank-1 matrix.
10. The spatial coding method of claim 9, wherein the selecting includes computing a distance matrix using a linear combination of the first L×L matrix and the second L×L matrix, the distance matrix having non-diagonal matrix elements that represent squared norms of position vector differences for different object pairs selected from the L audio objects and further having all zero diagonal matrix elements.
11. The spatial coding method of claim 10, wherein the selecting further includes adding to the linear combination a transposed version of the second L×L matrix.
12. The spatial coding method of claim 8, further comprising: constructing a first N×N matrix using a first subset of the matrix elements of the first L×L matrix; for each of the L audio objects, constructing a respective second N×N matrix using a respective second subset of the matrix elements of the first L×L matrix; and for each of the L audio objects, constructing a respective third N×N matrix using a respective third subset of the matrix elements of the first L×L matrix.
13. The spatial coding method of claim 12, further comprising: for each of the L audio objects, computing a respective cost matrix using a linear combination of the first N×N matrix, the respective second N×N matrix, and the respective third N×N matrix.
14. The spatial coding method of claim 13, wherein the computing said respective cost matrix further comprises subtracting from the linear combination a transposed version of the respective second N×N matrix.
15. The spatial coding method of claim 13 or 14, wherein the obtaining comprises computing the respective one of the L cost functions using the respective cost matrix.
16. The spatial coding method of any preceding claim, wherein the audio system comprises an audio rendering component configured to generate sound corresponding to the N clusters.
17. The spatial coding method of any preceding claim, wherein the audio system comprises: a spatial coding component configured to perform the selecting and further configured to perform the obtaining; and an audio encoder configured to generate a bitstream having encoded therein the N clusters; and wherein the spatial coding method further comprises transmitting the bitstream over a communication channel.
18. The spatial coding method of claim 17, wherein the audio system further comprises: an audio decoder configured to decode the bitstream received over the communication channel to recover the N clusters; and an audio rendering component configured to generate sound corresponding to the recovered N clusters.
19. A non-transitory computer-readable medium storing instructions that, when executed by an electronic processor, cause the electronic processor to perform operations comprising the method of any one of claims 1 to 18.
20. An audio system for object-based audio, the audio system comprising: at least one processor; and at least one memory including program code; and wherein the at least one memory and the program code are configured to, with the at least one processor, cause the audio system at least to: select N cluster seeds from L audio objects of an audio scene based on perceptually weighted energies of the audio objects, where N < L; and obtain N clusters corresponding to the N cluster seeds by applying a respective one of L gain vectors to each of the L audio objects, each of the L gain vectors being determined via minimization of a respective one of L cost functions configured to substantially preserve one or more selected metrics of the audio scene for rendering in the audio system upon replacement of the L audio objects by the N clusters.
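
Illustrative note on claims 8 to 11. The following is a minimal sketch, not the applicant's implementation, of how a squared-distance matrix can be assembled from the two L×L matrices recited in those claims. It assumes that "inter product" denotes the ordinary dot product of position vectors and that the linear combination is D = S + S^T - 2G, where G is the first matrix and S is the rank-1 matrix of squared norms. The function name, variable names, and sample positions are hypothetical.

```python
import numpy as np

def distance_matrix_from_products(positions):
    """Build squared pairwise distances from dot products only.

    positions: (L, 3) array, one position vector per audio object.
    Returns (gram, norms, dist), all of shape (L, L).
    """
    # First L x L matrix: dot products of all pairs of position vectors.
    gram = positions @ positions.T
    # Second L x L matrix: row i holds ||p_i||^2 in every column (rank 1).
    sq_norms = np.diag(gram)
    norms = np.repeat(sq_norms[:, None], sq_norms.size, axis=1)
    # ||p_i - p_j||^2 = ||p_i||^2 + ||p_j||^2 - 2 p_i . p_j
    dist = norms + norms.T - 2.0 * gram
    return gram, norms, dist

# Hypothetical scene with L = 4 objects.
positions = np.array([[0.0, 0.0, 0.0],
                      [1.0, 0.0, 0.0],
                      [0.0, 2.0, 0.0],
                      [1.0, 1.0, 1.0]])
gram, norms, dist = distance_matrix_from_products(positions)
print(dist)
```

Under these assumptions the diagonal of the result is zero and each off-diagonal element equals the squared norm of the corresponding position-vector difference, as claim 10 requires.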

Description

SPATIAL CODING OF OBJECT-BASED AUDIO

1. Cross-Reference to Related Applications

[0001] This application claims the benefit of priority from PCT International Application PCT/CN2023/104047 filed on 29 June 2023, and U.S. Provisional Application No. 63/558,254 filed on 27 February 2024, each of which is incorporated herein by reference in its entirety.

2. Field of the Disclosure

[0002] Various example embodiments relate generally to audio signal processing and, more specifically but not exclusively, to spatial coding of object-based audio for rendering with bandwidth-constrained playback systems at runtime.

3. Background

[0003] Some spatial audio formats include both audio beds and audio objects. Herein, the term “audio beds” refers to audio channels that are meant to be reproduced as originating from predefined, fixed locations. The term “audio objects” refers to individual audio elements that may exist for a defined duration in time and that also carry per-object spatial information, such as position and size. During audio transmission, audio beds and audio objects can be sent separately and then used by a spatial audio rendering system to recreate the audio scene in accordance with the artistic intent. In some examples, the audio rendering system can have a variable number of speakers or headphones. In recent years, various spatial audio formats have become progressively more popular with users both for music creation and for interactive entertainment content, such as gaming and eXtended Reality (XR) content.

BRIEF SUMMARY OF SOME SPECIFIC EMBODIMENTS

[0004] Disclosed herein are various embodiments of a spatial coding method and audio system configured to reduce the complexity of an audio scene via audio-object clustering. In at least some examples, the spatial coding method is implemented using only a limited set of basic matrix operations, which tends to significantly reduce the associated computational complexity. For example, the spatial coding method employs a streamlined cost-matrix construction approach, under which the object inter-product matrix is constructed first and then a plurality of cost matrices is derived therefrom by decimation and addition, with no advanced computational operations, such as multiplications or divisions, being needed. At least some embodiments described herein can beneficially be used for reduction, simplification, or compression of complex audio content, with minimal impact on the audio quality, such that the audio content can be distributed through transmission systems that do not possess sufficient bandwidth to timely deliver all of the original audio-object data to the end users.

[0005] According to an example embodiment, a spatial coding method for object-based audio comprises: selecting N cluster seeds from L audio objects of an audio scene based on perceptually weighted energies of the L audio objects, where N < L; and obtaining N clusters corresponding to the N cluster seeds by applying a respective one of L gain vectors to each of the L audio objects, each of the L gain vectors being determined via minimization of a respective one of L cost functions configured to substantially preserve one or more selected metrics of the audio scene for rendering in an audio system upon replacement of the L audio objects by the N clusters.
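
Illustrative note on paragraph [0004]. The cost-matrix construction "by decimation and addition" can be sketched as follows. This is one plausible reading, not a statement of the disclosed method: it assumes that the cost matrix for object o has elements (p_i - p_o) . (p_j - p_o) over the N seed positions, which expands to G[i, j] - G[i, o] - G[o, j] + G[o, o], where G is the object inter-product matrix. Every term is then an element picked out of G, so only indexing, addition, and subtraction are needed once G exists. The function name, the seed indices, and the sample positions are hypothetical.

```python
import numpy as np

def cost_matrices(gram, seeds):
    """Derive one N x N matrix per object from the L x L product matrix.

    gram:  (L, L) matrix of dot products between object positions.
    seeds: indices of the N objects chosen as cluster seeds.
    Returns an (L, N, N) stack; entry o belongs to object o.
    """
    seeds = np.asarray(seeds)
    n = seeds.size
    # First N x N matrix (shared): seed-to-seed products, picked out of gram.
    first = gram[np.ix_(seeds, seeds)]
    stack = []
    for o in range(gram.shape[0]):
        # Second N x N matrix: element (i, j) is gram[seed_i, o].
        second = np.repeat(gram[seeds, o][:, None], n, axis=1)
        # Third N x N matrix: every element is gram[o, o].
        third = np.full((n, n), gram[o, o])
        # Linear combination, including the transposed second matrix.
        stack.append(first - second - second.T + third)
    return np.stack(stack)

# Hypothetical scene: L = 4 objects, N = 2 seeds (objects 1 and 2).
positions = np.array([[0.0, 0.0, 0.0],
                      [1.0, 0.0, 0.0],
                      [0.0, 2.0, 0.0],
                      [1.0, 1.0, 1.0]])
gram = positions @ positions.T
mats = cost_matrices(gram, seeds=[1, 2])
print(mats[0])
```

The multiplications are confined to the single construction of the product matrix; the per-object loop reuses its elements, which is consistent with the complexity reduction described in paragraph [0004].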
[0006] According to another example embodiment, an audio system for object-based audio comprises: at least one processor; and at least one memory including program code; and wherein the at least one memory and the program code are configured to, with the at least one processor, cause the audio system at least to: select N cluster seeds from L audio objects of an audio scene based on perceptually weighted energies of the audio objects, where N < L; and obtain N clusters corresponding to the N cluster seeds by applying a respective one of L gain vectors to each of the L audio objects, each of the L gain vectors being determined via minimization of a respective one of L cost functions configured to substantially preserve one or more selected metrics of the audio scene for rendering in the audio system upon replacement of the L audio objects by the N clusters.

[0007] According to yet another example embodiment, provided is a non-transitory computer-readable medium storing instructions that, when executed by an electronic processor, cause the electronic processor to perform operations comprising the above spatial coding method.

BRIEF DESCRIPTION OF THE DRAWINGS

[0008] Other aspects, features, and benefits of various disclosed embodiments will become more fully apparent, by way of example, from the following detailed description and the accompanying drawings, in which:

[0009] FIG. 1 is a block diagram of an audio system in which various embodiments can be practiced.

[0010] FIG. 2 is a block diagram illustrating a spatial coding method that can be implemented in the audio system of FIG. 1 according to some examples.

[0011] FIG. 3 is a flowchart ill