EP-4416941-B1 - SPATIAL RENDERING OF AUDIO ELEMENTS HAVING AN EXTENT
Inventors
- MORADI ASHOUR, Chamran
- FALK, TOMMY
- DE BRUIJN, WERNER
Dates
- Publication Date: 2026-05-06
- Application Date: 2022-10-11
Claims (13)
- A method (800) for rendering an audio element, wherein the audio element has an extent (200) and is represented using a set of virtual loudspeakers (202, 203, 204) comprising a middle virtual loudspeaker (203), the method being characterised by comprising: based on a position of a listener, selecting (s802) a position for the middle virtual loudspeaker, wherein selecting the position for the middle virtual loudspeaker based on the position of the listener comprises: selecting a position point on a first straight line 1) between a first point of the audio element or of an extent that was determined based on the extent (200) of the audio element and a second point of the audio element or of the determined extent or 2) between a first virtual speaker and a second virtual speaker, such that: the angle between i) a second straight line running from the position of the listener to the first point or the first virtual speaker and ii) a third straight line running from the position of the listener to the selected position point on the first straight line is equal to the angle between i) a fourth straight line running from the position of the listener to the second point or to the second virtual speaker and ii) the third straight line.
- The method of claim 1, wherein selecting the position point comprises calculating a coordinate, M, of the position point by calculating: M = (v * Re + w * Le) / (v + w), where v is the length of the second straight line, w is the length of the third straight line, Re is a coordinate of the first point or first virtual speaker, and Le is a coordinate of the second point or second virtual speaker.
- The method of claim 1 or 2, further comprising positioning the middle virtual loudspeaker at the selected position point.
- The method of any one of claims 1-3, wherein the method further comprises calculating an attenuation factor for the middle virtual loudspeaker based on the position of the listener, wherein calculating the attenuation factor for the middle virtual loudspeaker based on the position of the listener comprises: determining a first angle based on the position of the listener and i) a position of a first edge point of the audio element or of the determined extent or ii) a position of a first virtual loudspeaker; determining a second angle based on the position of the listener and i) a position of a second edge point of the audio element or of the determined extent or ii) a position of a second virtual loudspeaker; and calculating ε = sin(λ)/sin(β) or ε = sin(β)/sin(λ), where λ is the first angle, β is the second angle, and ε is the attenuation factor.
- The method of claim 4, further comprising modifying a signal, X, for the middle virtual loudspeaker to produce a modified middle virtual loudspeaker signal, X', such that X' = ε * X, and using the modified middle virtual loudspeaker signal to render the audio element.
- The method of any one of claims 1-5, further comprising: based on the position of the middle virtual loudspeaker, generating a middle virtual loudspeaker signal for the middle virtual loudspeaker; and using the middle virtual loudspeaker signal to render the audio element.
- A computer program (1443) comprising instructions (1444) which, when executed by processing circuitry (1402) of an audio renderer (1400), cause the audio renderer to perform the method of any one of claims 1-6.
- A carrier containing the computer program of claim 7, wherein the carrier is one of an electronic signal, an optical signal, a radio signal, and a computer readable storage medium (1442).
- An audio rendering apparatus (1400) for rendering an audio element, wherein the audio element has an extent (200) and is represented using a set of virtual loudspeakers (202, 203, 204) comprising a middle virtual loudspeaker (203), the audio rendering apparatus being characterised by being configured to: based on a position of a listener, select (s802) a position for the middle virtual loudspeaker, wherein the audio rendering apparatus is configured to select the position for the middle virtual loudspeaker based on the position of the listener by: selecting a position point on a first straight line 1) between a first point of the audio element or of an extent that was determined based on the extent (200) of the audio element and a second point of the audio element or of the determined extent or 2) between a first virtual speaker and a second virtual speaker, such that: the angle between i) a second straight line running from the position of the listener to the first point or the first virtual speaker and ii) a third straight line running from the position of the listener to the selected position point on the first straight line is equal to the angle between i) a fourth straight line running from the position of the listener to the second point or to the second virtual speaker and ii) the third straight line.
- The audio rendering apparatus of claim 9, wherein selecting the position point comprises calculating a coordinate, M, of the position point by calculating: M = (v * Re + w * Le) / (v + w), where v is the length of the second straight line, w is the length of the third straight line, Re is a coordinate of the first point or first virtual speaker, and Le is a coordinate of the second point or second virtual speaker.
- The audio rendering apparatus of claim 9 or 10, wherein the audio rendering apparatus is further configured to position the middle virtual loudspeaker at the selected position point.
- The audio rendering apparatus of any one of claims 9-11, wherein the audio rendering apparatus is configured to calculate an attenuation factor for the middle virtual loudspeaker based on the position of the listener by: determining a first angle based on the position of the listener and i) a position of a first edge point of the audio element or of the determined extent or ii) a position of a first virtual loudspeaker; determining a second angle based on the position of the listener and i) a position of a second edge point of the audio element or of the determined extent or ii) a position of a second virtual loudspeaker; and calculating ε = sin(λ)/sin(β) or ε = sin(β)/sin(λ), where λ is the first angle, β is the second angle, and ε is the attenuation factor.
- The audio rendering apparatus of any one of claims 9-12, further being configured to: based on the position of the middle virtual loudspeaker, generate a middle virtual loudspeaker signal for the middle virtual loudspeaker; and use the middle virtual loudspeaker signal to render the audio element.
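The geometric selection recited in claims 1-3 can be illustrated with a short sketch. The following Python fragment is an illustrative assumption, not part of the claimed subject matter; the names listener, right_edge and left_edge are hypothetical. It places the middle virtual loudspeaker on the segment between two edge points (or two virtual speakers) so that the two angles seen from the listener are equal; by the angle-bisector theorem this reduces to a distance-weighted average of the two edge coordinates, i.e. an expression of the same general form as the weighted average in claim 2.

```python
import numpy as np

def select_middle_speaker_position(listener, right_edge, left_edge):
    """Return the point M on the segment right_edge--left_edge such that the
    angle (listener->right_edge, listener->M) equals the angle
    (listener->left_edge, listener->M), cf. claim 1."""
    d_right = np.linalg.norm(right_edge - listener)  # distance listener -> one edge
    d_left = np.linalg.norm(left_edge - listener)    # distance listener -> other edge
    # Angle-bisector theorem: the equal-angle point divides the segment in the
    # ratio of the listener-to-edge distances, so each edge coordinate is
    # weighted by the distance to the opposite edge.
    return (d_left * right_edge + d_right * left_edge) / (d_left + d_right)

def _angle(u, v):
    """Angle between two vectors (used only to verify the equal-angle property)."""
    u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)
    return np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))

if __name__ == "__main__":
    listener = np.array([0.0, 0.0])
    right_edge = np.array([2.0, 4.0])    # hypothetical right edge of the extent
    left_edge = np.array([-3.0, 4.0])    # hypothetical left edge of the extent
    m = select_middle_speaker_position(listener, right_edge, left_edge)
    print("selected position point:", m)
    print("angle to right edge:", _angle(right_edge - listener, m - listener))
    print("angle to left edge: ", _angle(left_edge - listener, m - listener))
```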
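Similarly, a minimal sketch of the gain computation in claims 4-5, assuming the two angles λ and β have already been determined from the listener position and the edge points (or virtual loudspeakers). Claim 4 allows either ratio sin(λ)/sin(β) or sin(β)/sin(λ); which ratio is used, and any safeguard against a vanishing sin(β), are design choices not fixed by this sketch.

```python
import numpy as np

def attenuation_factor(lambda_angle, beta_angle):
    """epsilon = sin(lambda) / sin(beta), one of the two ratios named in claim 4."""
    return np.sin(lambda_angle) / np.sin(beta_angle)

def render_middle_speaker_signal(signal, lambda_angle, beta_angle):
    """Produce the modified middle virtual loudspeaker signal X' = epsilon * X (claim 5)."""
    epsilon = attenuation_factor(lambda_angle, beta_angle)
    return epsilon * signal

if __name__ == "__main__":
    x = np.random.randn(48000)                      # placeholder signal: one second at 48 kHz
    lam, beta = np.radians(20.0), np.radians(35.0)  # hypothetical first and second angles
    x_mod = render_middle_speaker_signal(x, lam, beta)
    print("epsilon =", attenuation_factor(lam, beta))
```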
Description
TECHNICAL FIELD

Disclosed are embodiments related to rendering of audio elements.

BACKGROUND

Spatial audio rendering is a process used for presenting audio within an extended reality (XR) scene (e.g., a virtual reality (VR), augmented reality (AR), or mixed reality (MR) scene) in order to give a listener the impression that sound is coming from physical sources within the scene at a certain position and having a certain size and shape (i.e., extent). The presentation can be made through headphone speakers or other speakers. If the presentation is made via headphone speakers, the processing used is called binaural rendering and uses spatial cues of human spatial hearing that make it possible to determine from which direction sounds are coming. The cues involve inter-aural time delay (ITD), inter-aural level difference (ILD), and/or spectral difference.

The most common form of spatial audio rendering is based on the concept of point sources, where each sound source is defined to emanate sound from one specific point. Because each sound source is defined to emanate sound from one specific point, the sound source does not have any size or shape. In order to render a sound source having an extent (size and shape), different methods have been developed.

One such known method is to create multiple copies of a mono audio element at positions around the audio element. This arrangement creates the perception of a spatially homogeneous object with a certain size. This concept is used, for example, in the "object spread" and "object divergence" features of the MPEG-H 3D Audio standard (see references [1] and [2]), and in the "object divergence" feature of the EBU Audio Definition Model (ADM) standard (see reference [4]). This idea of using a mono audio source has been developed further as described in reference [7], where the area-volumetric geometry of a sound object is projected onto a sphere around the listener and the sound is rendered to the listener using a pair of head-related (HR) filters that is evaluated as the integral of all HR filters covering the geometric projection of the object on the sphere. For a spherical volumetric source this integral has an analytical solution. For an arbitrary area-volumetric source geometry, however, the integral is evaluated by sampling the projected source surface on the sphere using what is called Monte Carlo ray sampling.

Another rendering method renders a spatially diffuse component in addition to a mono audio signal, which creates the perception of a somewhat diffuse object that, in contrast to the original mono audio element, has no distinct pin-point location. This concept is used, for example, in the "object diffuseness" feature of the MPEG-H 3D Audio standard (see reference [3]) and the "object diffuseness" feature of the EBU ADM (see reference [5]). Combinations of the above two methods are also known. For example, the "object extent" feature of the EBU ADM combines the creation of multiple copies of a mono audio element with the addition of diffuse components (see reference [6]).

In many cases the actual shape of an audio element can be described well enough with a basic shape (e.g., a sphere or a box). But sometimes the actual shape is more complicated and needs to be described in a more detailed form (e.g., a mesh structure or a parametric description format).

These methods, however, do not allow the rendering of audio elements that have a distinct spatially-heterogeneous character, i.e., an audio element that has a certain amount of spatial source variation within its spatial extent. Often these sources are made up of a sum of a multitude of sources (e.g., the sound of a forest or the sound of a cheering crowd). The majority of these known solutions are only able to create objects with either a spatially-homogeneous character (i.e., with no spatial variation within the element) or a spatially diffuse character, which is too limited for rendering some of the examples given above in a convincing way.

In the case of heterogeneous audio elements, as described in reference [8], the audio element comprises at least two audio channels (i.e., audio signals) to describe a spatial variation over its extent. Techniques exist for rendering these heterogeneous audio elements where the audio element is represented by a multi-channel audio recording and the rendering uses several virtual loudspeakers to represent the audio element and the spatial variation within it. By placing the virtual loudspeakers at positions that correspond to the extent of the audio element, an illusion of audio emanating from the audio element can be conveyed. The number of virtual loudspeakers required to achieve a plausible spatial rendering of a spatially-heterogeneous audio element depends on the audio element's extent. For a spatially-heterogeneous audio element that is small or at some distance from the listener, a two-speaker setup might be enough. As illustrated in FIG. 1, however, for an audio element