EP-4430605-B1 - AUDIO DECODER, AUDIO ENCODER, METHOD FOR DECODING, METHOD FOR ENCODING AND BITSTREAM, USING A SCENE CONFIGURATION PACKET COMPRISING A CELL INFORMATION, WHICH DEFINES AN ASSOCIATION BETWEEN ONE OR MORE CELLS AND RESPECTIVE ONE OR MORE DATA STRUCTURES
Inventors
- DISCH, Sascha
- SCHWÄR, Simon
- HASSAN, Kahleel Porter
Dates
- Publication Date
- 2026-05-13
- Application Date
- 2022-11-09
Claims (15)
- An audio decoder (1400), for providing a decoded audio representation on the basis of an encoded audio representation, wherein the audio decoder is configured to spatially render one or more audio signals; wherein the audio decoder is configured to receive a scene configuration packet providing a renderer configuration information, wherein the scene configuration packet comprises a cell information defining one or more cells, wherein the cell information defines an association between the one or more cells and respective one or more data structures associated with the one or more cells and defining a rendering scenario; wherein the audio decoder is configured to evaluate the cell information in order to determine which data structures should be used for the spatial rendering; wherein the audio decoder is configured to identify one or more current cells; and wherein the audio decoder is configured to perform the spatial rendering using one or more data structures (1430) associated with the one or more identified current cells.
- Audio decoder (1400) according to claim 1, wherein the cell information comprises a temporal definition of a given cell and wherein the audio decoder is configured to evaluate the temporal definition of the given cell, in order to determine whether the one or more data structures associated with the given cell should be considered in the spatial rendering; and/or wherein the cell information comprises a spatial definition of a given cell and wherein the audio decoder is configured to evaluate the spatial definition of the given cell, in order to determine whether the one or more data structures associated with the given cell should be considered in the spatial rendering.
- Audio decoder (1400) according to one of claims 1 to 2, wherein the cell information comprises a flag indicating whether the cell information comprises a temporal definition of the cell or a spatial definition of the cell and wherein the audio decoder is configured to evaluate the flag indicating whether the cell information comprises a temporal definition of the cell or a spatial definition of the cell; and/or wherein the cell information comprises a reference to a geometric structure in order to define the cell and wherein the audio decoder is configured to evaluate the reference to the geometric structure, in order to obtain the geometric definition of the cell.
- Audio decoder (1400) according to one of claims 1 to 3, wherein the audio decoder is configured to identify one or more current cells; and wherein the audio decoder is configured to perform the spatial rendering using one or more scene objects (1430) and/or scene characteristics associated with the one or more identified current cells.
- Audio decoder (1400) according to one of claims 1 to 4, wherein the audio decoder is configured to select scene objects and/or scene characteristics to be considered in the spatial rendering in dependence on the cell information.
- Audio decoder (1400) according to one of claims 1 to 5, wherein the audio decoder is configured to determine, in which one or more spatial cells a current position lies; and wherein the audio decoder is configured to perform the spatial rendering using one or more scene objects and/or scene characteristics associated with the one or more identified current cells.
- Audio decoder (1400) according to one of claims 1 to 6, wherein the audio decoder is configured to determine one or more payloads associated with one or more current cells on the basis of an enumeration of payload identifiers included in a cell definition of a cell; and wherein the audio decoder is configured to perform the spatial rendering using the determined one or more payloads.
- Audio decoder (1400) according to one of claims 1 to 7, wherein the audio decoder is configured to perform the spatial rendering using information from one or more scene update packets which are associated with one or more current cells.
- Audio decoder (1400) according to one of claims 1 to 8, wherein the audio decoder is configured to update a rendering scene using information from one or more scene update packets associated with a given cell in response to a finding that the given cell becomes active.
- Audio decoder (1400) according to one of claims 1 to 9, wherein the audio decoder is configured to request (1401, 1508) one or more data structures using respective data structure identifiers, wherein the audio decoder is configured to derive the data structure identifiers of data structures to be requested using the cell information.
- An apparatus (1500) for providing an encoded audio representation, wherein the apparatus is configured to provide an information for a spatial rendering of one or more audio signals; wherein the apparatus is configured to provide a plurality of packets (1404, 1522) of different packet types, wherein the apparatus is configured to provide a scene configuration packet providing a renderer configuration information, wherein the scene configuration packet comprises a cell information defining one or more cells, wherein the cell information defines an association between the one or more cells and respective one or more data structures associated with the one or more cells and defining a rendering scenario; and wherein the apparatus is configured to provide one or more scene payload packets, which comprise one or more data structures referenced in the cell information.
- A method (1600) for providing a decoded audio representation on the basis of an encoded audio representation, wherein the method comprises spatially rendering (1610) one or more audio signals; wherein the method comprises receiving (1620) a scene configuration packet providing a renderer configuration information, wherein the scene configuration packet comprises a cell information defining one or more cells, wherein the cell information defines an association between the one or more cells and respective one or more data structures associated with the one or more cells and defining a rendering scenario; wherein the method comprises evaluating (1630) the cell information in order to determine which data structures should be used for the spatial rendering; and wherein the method comprises identifying one or more current cells; and wherein the method comprises performing the spatial rendering using one or more data structures (1430) associated with the one or more identified current cells.
- A method (1700) for providing an encoded audio representation, wherein the method comprises providing (1710) an information for a spatial rendering of one or more audio signals; wherein the method comprises providing (1720) a plurality of packets (1404, 1522) of different packet types, wherein the method comprises providing (1730) a scene configuration packet providing a renderer configuration information, wherein the scene configuration packet comprises a cell information defining one or more cells, wherein the cell information defines an association between the one or more cells and respective one or more data structures associated with the one or more cells and defining a rendering scenario; and wherein the method comprises providing one or more scene payload packets, which comprise one or more data structures referenced in the cell information.
- A computer program for performing the method according to claim 12 or claim 13 when the computer program runs on a computer.
- A bitstream (1502) representing an audio content, the bitstream comprising a plurality of packets (1404, 1522) of different packet types, the packets comprising a scene configuration packet providing a renderer configuration information, wherein the scene configuration packet comprises a cell information defining one or more cells, wherein the cell information defines an association between the one or more cells and respective one or more data structures associated with the one or more cells and defining a rendering scenario; and wherein the packets further comprise one or more scene payload packets, which comprise one or more data structures referenced in the cell information.
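As an illustration of the association that the cell information defines in the claims above, the following Python sketch models spatial cells that each reference payload data structures by identifier, and resolves which data structures the renderer should use for a given listener position. All names and the axis-aligned-box cell geometry are hypothetical illustration choices, not the normative bitstream syntax.

```python
from dataclasses import dataclass, field

@dataclass
class Cell:
    # Hypothetical spatial cell: an axis-aligned box plus the identifiers
    # of the data structures (payloads) associated with this cell.
    cell_id: int
    min_corner: tuple
    max_corner: tuple
    payload_ids: list = field(default_factory=list)

    def contains(self, pos):
        # A position may lie inside several overlapping cells.
        return all(lo <= p <= hi
                   for lo, p, hi in zip(self.min_corner, pos, self.max_corner))

def payloads_for_position(cells, pos):
    """Evaluate the cell information: identify the current cells and
    collect the data structures to be used for the spatial rendering."""
    current = [c for c in cells if c.contains(pos)]
    ids = []
    for c in current:
        for pid in c.payload_ids:
            if pid not in ids:   # keep order, avoid duplicates
                ids.append(pid)
    return ids

cells = [
    Cell(0, (0, 0, 0), (10, 3, 10), [100, 101]),   # e.g. one room
    Cell(1, (8, 0, 0), (20, 3, 10), [101, 102]),   # adjoining, overlapping room
]
print(payloads_for_position(cells, (9, 1, 5)))     # listener in the overlap
```

A listener in the overlap region activates both cells, so the renderer would request the union of their payloads; a temporal cell definition could be handled analogously by testing a scene-time interval instead of a box.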
Description
Technical Field
Embodiments comprise audio decoders, audio encoders, methods for decoding, methods for encoding and bitstreams, using a scene configuration packet comprising a cell information which defines an association between one or more cells and respective one or more data structures. Embodiments according to the invention are related to dynamic VR (virtual reality) / AR (augmented reality) bitstreams, for example using three packet types, using scene update packets with an update condition, using a time stamp and/or using cell information.
Background of the Invention
In order to provide an immersive experience for VR and/or AR applications, it is not sufficient to provide a spatial viewing experience; a spatial hearing experience is also needed. To fulfill this need, six degrees of freedom (6DoF) audio techniques have been developed. In this regard, it is challenging to develop bitstreams and corresponding encoders and decoders that enable a high-definition, immersive hearing experience while remaining usable with feasible bandwidths. Regarding conventional approaches, reference is made to "MPEG-I Immersive Audio Encoder Input Format", 134th MPEG Meeting, 2021-04-26 to 2021-04-30, online (Motion Picture Experts Group or ISO/IEC JTC1/SC29/WG11), no. n20446, 4 May 2021 (2021-05-04), XP030294726, which relates to a specification of a 6DoF audio encoder input format (EIF) for describing an audio scene. It is therefore desirable to provide a concept which achieves a better compromise between the achievable hearing impression of a rendered audio scene, the efficiency of the transmission of the data used for rendering the audio scene, and the efficiency of decoding and/or rendering that data. This is achieved by the subject matter of the independent claims of the present application.
The present invention provides a decoder and a method for providing a decoded audio representation on the basis of an encoded audio representation, according to claims 1 and 12; an apparatus and a method for providing an encoded audio representation, according to claims 11 and 13; a computer program according to claim 14; and a bitstream representing an audio content, according to claim 15. Further embodiments according to the invention are defined by the subject matter of the dependent claims of the present application. Any embodiments and examples of the description not falling within the scope of the claims do not form part of the invention and are provided for illustrative purposes only.
Summary of the Invention
In the following, examples according to a first aspect are discussed. Examples according to the first aspect may be based on using three packet types. Examples according to the first aspect may, for example, comprise scene update packets and/or scene payload packets. Examples according to the first aspect may comprise MPEG-H compatible packets or may provide or comprise MPEG-H compatible decoders, encoders and/or bitstreams. Examples comprise an audio decoder for providing a decoded, and optionally rendered, audio representation on the basis of an encoded audio representation. The audio decoder is configured to spatially render one or more audio signals and to receive a plurality of packets of different packet types, e.g. having packet types which are conformant to a MPEG-H MHAS packet definition, the packets comprising one or more scene configuration packets, e.g. SceneConfigPacket, e.g. mpegiSceneConfig[] (sometimes also designated as "mpeghiSceneConfig[]"), providing a renderer configuration information defining a usage of scene objects and/or a usage of scene characteristics, e.g. defining when or under which condition different scene objects and/or scene characteristics should be used in a rendering process, e.g. using a definition of cells.
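The three packet types discussed in this aspect (scene configuration, scene update, scene payload) can be pictured as a simple dispatcher maintained by the decoder. The sketch below is a non-normative illustration: the type tags, field names and dictionary-shaped packet bodies are invented for readability; the real MHAS packet-type values and syntax are defined by the MPEG-H / MPEG-I specifications.

```python
# Hypothetical tags for the three packet kinds discussed above; the real
# MHAS packet-type codes are assigned by the MPEG-H / MPEG-I specifications.
SCENE_CONFIG, SCENE_UPDATE, SCENE_PAYLOAD = "config", "update", "payload"

class SceneState:
    """Illustrative decoder-side scene state fed by incoming packets."""

    def __init__(self):
        self.config = None    # renderer configuration, incl. cell information
        self.payloads = {}    # payload identifier -> data structure
        self.metadata = {}    # scene metadata, changed during playback

    def feed(self, packet_type, body):
        # Dispatch on the packet type, mirroring the decoder behaviour
        # described in the text.
        if packet_type == SCENE_CONFIG:
            self.config = body                 # defines usage of scene objects
        elif packet_type == SCENE_UPDATE:
            self.metadata.update(body)         # change of scene metadata
        elif packet_type == SCENE_PAYLOAD:
            self.payloads[body["id"]] = body   # data structure referenced by cells
        else:
            raise ValueError(f"unknown packet type: {packet_type}")

state = SceneState()
state.feed(SCENE_CONFIG, {"cells": {0: [100]}})
state.feed(SCENE_PAYLOAD, {"id": 100, "geometry": "..."})
state.feed(SCENE_UPDATE, {"source_gain": 0.5})
```

Separating configuration, updates and payloads in this way lets a decoder fetch only the payload packets that the active cells of the configuration actually reference, which is the bandwidth advantage the text describes.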
The concept of cells is, for example, especially important to practically implement subscene support. Subscenes are, for example, parts of a scene that are relevant at a certain point in scene time or in a certain vicinity/proximity to predefined scene locations. In some cases, the terms cell and subscene might be used synonymously. Optionally, the scene configuration packet may, for example, define which scene payload packets are required at a given point in space and time. As another optional feature, the scene configuration packet may, for example, define where scene payload packets can be retrieved from. Furthermore, the packets comprise one or more scene update packets, e.g. mpegiSceneUpdate[] (sometimes also designated as "mpeghiSceneUpdate[]"), defining an update, e.g. change, of scene metadata for the rendering (e.g. a change of one or more metadata values; e.g. a change of a parameter of a scene object or a change of a scene characteristic; e.g. a change of scene metadata that occurs during playback). Optionally, the one or more scene update packets may, for example, define one or more conditions for a scene update. Moreover, the packets comprise one or more scene payload packets