EP-4294056-B1 - VIRTUAL SPEAKER SET DETERMINATION METHOD AND DEVICE


Inventors

GAO, YUAN
LIU, Shuai
WANG, BIN
WANG, ZHE
QU, Tianshu
XU, Jiahao

Dates

Publication Date: 20260506
Application Date: 20220302

Claims (13)

A method for determining a virtual speaker set, performed by an apparatus for determining a virtual speaker set, the method comprising: determining (701) a target virtual speaker from F preset virtual speakers based on a to-be-processed audio signal, wherein each of the F preset virtual speakers corresponds to respective S virtual speakers, F is a positive integer, and S is a positive integer greater than 1; and obtaining (702), from a preset virtual speaker distribution table, respective position information of the S virtual speakers corresponding to the target virtual speaker, wherein the virtual speaker distribution table comprises position information of K virtual speakers, the position information comprises an elevation angle index and an azimuth angle index, K is a positive integer greater than 1, F≤K, and F×S≥K, wherein the determining (701) a target virtual speaker from F preset virtual speakers based on a to-be-processed audio signal comprises: obtaining a higher order ambisonics, HOA, coefficient of the audio signal; obtaining F groups of HOA coefficients corresponding to the F preset virtual speakers, wherein the F preset virtual speakers are in one-to-one correspondence with the F groups of HOA coefficients; and determining, as the target virtual speaker, a virtual speaker of the F preset virtual speakers corresponding to a group of HOA coefficients that has a greatest correlation with the HOA coefficient of the audio signal and that is in the F groups of HOA coefficients, wherein each group of the F groups of HOA coefficients includes (N+1)² coefficients, wherein the HOA coefficient of the audio signal includes (N+1)² coefficients, and wherein N represents an order of the audio signal, wherein the S virtual speakers corresponding to the determined target virtual speaker represent the determined virtual speaker set.
The method according to claim 1, wherein the S virtual speakers corresponding to the target virtual speaker meet the following conditions: the S virtual speakers comprise the target virtual speaker and (S-1) virtual speakers located around the target virtual speaker, wherein any one of (S-1) correlations between the (S-1) virtual speakers and the target virtual speaker is greater than each of (K-S) correlations between (K-S) virtual speakers, other than the S virtual speakers, of the K virtual speakers and the target virtual speaker.
The method according to any one of claims 1 or 2, wherein the K virtual speakers meet the following conditions: the K virtual speakers are distributed on a preset sphere, and the preset sphere comprises L latitude regions, wherein L>1; and an m-th latitude region of the L latitude regions comprises T_m latitude circles, an azimuth angle difference between adjacent virtual speakers that are in the K virtual speakers and that are distributed on an m_i-th latitude circle is α_m, 1≤m≤L, T_m is a positive integer, and 1≤m_i≤T_m, wherein when T_m>1, an elevation angle difference between any two adjacent latitude circles in the m-th latitude region is α_m.
The method according to claim 3, wherein an n-th latitude region of the L latitude regions comprises T_n latitude circles, an azimuth angle difference between adjacent virtual speakers that are in the K virtual speakers and that are distributed on an n_i-th latitude circle is α_n, 1≤n≤L, T_n is a positive integer, and 1≤n_i≤T_n, wherein when T_n>1, an elevation angle difference between any two adjacent latitude circles in the n-th latitude region is α_n, wherein α_n = α_m or α_n ≠ α_m, and n ≠ m.
The method according to claim 3, wherein a c-th latitude region of the L latitude regions comprises T_c latitude circles, one of the T_c latitude circles is an equatorial latitude circle, an azimuth angle difference between adjacent virtual speakers that are in the K virtual speakers and that are distributed on a c_i-th latitude circle is α_c, 1≤c≤L, T_c is a positive integer, and 1≤c_i≤T_c, wherein when T_c>1, an elevation angle difference between any two adjacent latitude circles in the c-th latitude region is α_c, wherein α_c < α_m, and c ≠ m.
The method according to any one of claims 3 to 5, wherein the F virtual speakers meet the following conditions: an azimuth angle difference α_mi between adjacent virtual speakers that are distributed on the m_i-th latitude circle and that are in the F virtual speakers is greater than α_m.
An apparatus for determining a virtual speaker set, comprising: a determining module (801), configured to determine a target virtual speaker from F preset virtual speakers based on a to-be-processed audio signal, wherein each of the F preset virtual speakers corresponds to respective S virtual speakers, F is a positive integer, and S is a positive integer greater than 1; and an obtaining module (802), configured to obtain, from a preset virtual speaker distribution table, respective position information of the S virtual speakers corresponding to the target virtual speaker, wherein the virtual speaker distribution table comprises position information of K virtual speakers, the position information comprises an elevation angle index and an azimuth angle index, K is a positive integer greater than 1, F≤K, and F×S≥K, wherein the determining module (801) is specifically configured to: obtain a higher order ambisonics, HOA, coefficient of the audio signal; obtain F groups of HOA coefficients corresponding to the F preset virtual speakers, wherein the F preset virtual speakers are in one-to-one correspondence with the F groups of HOA coefficients; and determine, as the target virtual speaker, a virtual speaker of the F preset virtual speakers corresponding to a group of HOA coefficients that has a greatest correlation with the HOA coefficient of the audio signal and that is in the F groups of HOA coefficients, wherein each group of the F groups of HOA coefficients includes (N+1)² coefficients, wherein the HOA coefficient of the audio signal includes (N+1)² coefficients, and wherein N represents an order of the audio signal, wherein the S virtual speakers corresponding to the determined target virtual speaker represent the determined virtual speaker set.
The apparatus according to claim 7, wherein the S virtual speakers corresponding to the target virtual speaker meet the following conditions: the S virtual speakers comprise the target virtual speaker and (S-1) virtual speakers located around the target virtual speaker, wherein any one of (S-1) correlations between the (S-1) virtual speakers and the target virtual speaker is greater than each of (K-S) correlations between (K-S) virtual speakers, other than the S virtual speakers, of the K virtual speakers and the target virtual speaker.
The apparatus according to any one of claims 7 or 8, wherein the K virtual speakers meet the following conditions: the K virtual speakers are distributed on a preset sphere, and the preset sphere comprises L latitude regions, wherein L>1; and an m-th latitude region of the L latitude regions comprises T_m latitude circles, an azimuth angle difference between adjacent virtual speakers that are in the K virtual speakers and that are distributed on an m_i-th latitude circle is α_m, 1≤m≤L, T_m is a positive integer, and 1≤m_i≤T_m, wherein when T_m>1, an elevation angle difference between any two adjacent latitude circles in the m-th latitude region is α_m.
The apparatus according to claim 9, wherein an n-th latitude region of the L latitude regions comprises T_n latitude circles, an azimuth angle difference between adjacent virtual speakers that are in the K virtual speakers and that are distributed on an n_i-th latitude circle is α_n, 1≤n≤L, T_n is a positive integer, and 1≤n_i≤T_n, wherein when T_n>1, an elevation angle difference between any two adjacent latitude circles in the n-th latitude region is α_n, wherein α_n = α_m or α_n ≠ α_m, and n ≠ m.
The apparatus according to claim 9, wherein a c-th latitude region of the L latitude regions comprises T_c latitude circles, one of the T_c latitude circles is an equatorial latitude circle, an azimuth angle difference between adjacent virtual speakers that are in the K virtual speakers and that are distributed on a c_i-th latitude circle is α_c, 1≤c≤L, T_c is a positive integer, and 1≤c_i≤T_c, wherein when T_c>1, an elevation angle difference between any two adjacent latitude circles in the c-th latitude region is α_c, wherein α_c < α_m, and c ≠ m.
The apparatus according to any one of claims 9 to 11, wherein the F virtual speakers meet the following conditions: an azimuth angle difference α_mi between adjacent virtual speakers that are distributed on the m_i-th latitude circle and that are in the F virtual speakers is greater than α_m.
An audio processing device, comprising: one or more processors; and a memory, configured to store one or more programs, wherein when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1 to 6.

Description

TECHNICAL FIELD

This application relates to the field of audio technologies, and in particular, to a method and an apparatus for determining a virtual speaker set.

BACKGROUND

A three-dimensional audio technology is an audio technology in which sound events and three-dimensional sound field information in the real world are obtained, processed, transmitted, rendered, and played back by means of computers, signal processing, and the like. The three-dimensional audio technology gives sound a strong sense of space, encirclement, and immersion, and gives listeners a "virtual face-to-face" acoustic experience.

Currently, a mainstream three-dimensional audio technology is the higher order ambisonics (HOA) technology. Because HOA recording and encoding are independent of the speaker layout used at the playback stage, and because data in the HOA format can be rotated, the HOA technology offers greater flexibility in three-dimensional audio playback, and has therefore gained more attention and wider research.

The HOA technology can convert an HOA signal into a virtual speaker signal, and then obtain, through mapping, a binaural signal for playback. In the foregoing process, an even distribution of virtual speakers may achieve the best sampling effect. For example, the virtual speakers are distributed on the vertices of a regular tetrahedron. However, in a three-dimensional space, there are only five types of regular polyhedrons: the regular tetrahedron, the regular hexahedron, the regular octahedron, the regular dodecahedron, and the regular icosahedron. Consequently, the quantity of virtual speakers that can be disposed in this way is limited, and the approach is inapplicable to the distribution of a larger quantity of virtual speakers.

The document Jakob Vennerod: "Binaural Reproduction of Higher Order Ambisonics", XP 055454025 shows a rendering of three-dimensional audio data on two-dimensional speaker systems.
SUMMARY

The present invention is defined by the independent claims. Further advantageous developments are shown by the dependent claims.

This application provides a method and an apparatus for determining a virtual speaker set, so as to improve an audio signal playback effect.

According to a first aspect, this application provides a method for determining a virtual speaker set according to claim 1.

In this application, the virtual speaker distribution table is preset, so that a high average value of signal-to-noise ratios (SNRs) of HOA reconstructed signals can be obtained by deploying virtual speakers according to the distribution table, and the S virtual speakers having the highest correlations with an HOA coefficient of the to-be-processed audio signal are selected based on such distribution, thereby achieving an optimal sampling effect and improving an audio signal playback effect.

Encoding analysis is performed on the to-be-processed audio signal. For example, the sound field distribution of the to-be-processed audio signal is analyzed, including characteristics such as the quantity of sound sources, directivity, and dispersion of the audio signal, to obtain the HOA coefficient of the audio signal. The HOA coefficient of the audio signal is then used as one of the conditions for determining how to select the target virtual speaker.

A virtual speaker matching the to-be-processed audio signal may be selected based on the HOA coefficient of the to-be-processed audio signal and the HOA coefficients of the candidate virtual speakers (namely, the foregoing F virtual speakers). In this application, this virtual speaker is referred to as the target virtual speaker. An inner product may be computed separately between the HOA coefficients of each of the F virtual speakers and the HOA coefficient of the audio signal, and the virtual speaker with the maximum absolute value of the inner product is selected as the target virtual speaker.
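The inner-product selection described above can be illustrated with a minimal sketch. It is not the claimed implementation: the function names `num_hoa_coefficients` and `select_target_speaker`, the use of plain Python lists, and the toy coefficient values are all assumptions made for illustration. Only the (N+1)² coefficient count and the "maximum absolute inner product" rule come from the text.

```python
from typing import List, Sequence


def num_hoa_coefficients(order: int) -> int:
    """Return the number of HOA coefficients for an order-N signal: (N+1)^2."""
    return (order + 1) ** 2


def select_target_speaker(
    signal_hoa: Sequence[float],
    speaker_hoa: Sequence[Sequence[float]],
) -> int:
    """Return the index of the candidate whose HOA coefficients have the
    largest absolute inner product with the signal's HOA coefficients.

    `speaker_hoa` holds F groups of coefficients, one per candidate speaker
    (hypothetical layout; the source does not prescribe a data structure).
    """
    if not speaker_hoa:
        raise ValueError("at least one candidate virtual speaker is required")

    best_index = -1
    best_score = -1.0
    for index, coeffs in enumerate(speaker_hoa):
        if len(coeffs) != len(signal_hoa):
            raise ValueError("coefficient groups must have equal length")
        # Absolute value of the inner product serves as the correlation score.
        score = abs(sum(a * b for a, b in zip(coeffs, signal_hoa)))
        if score > best_score:
            best_index, best_score = index, score
    return best_index


# Toy example: order N = 1, so each group has (1+1)^2 = 4 coefficients.
order = 1
candidates: List[List[float]] = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
]
signal = [0.1, -0.9, 0.2, 0.0]
target = select_target_speaker(signal, candidates)
```

In this toy data the second candidate is selected even though its inner product with the signal is negative, because the rule compares absolute values.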
It should be noted that the target virtual speaker may alternatively be determined by using another method, and this is not specifically limited in this application. In a possible implementation, the S virtual speakers corresponding to the target virtual speaker meet the following conditions: the S virtual speakers include the target virtual speaker and (S-1) virtual speakers located around the target virtual speaker, where any one of (S-1) correlations between the (S-1) virtual speakers and the target virtual speaker is greater than each of (K-S) correlations between (K-S) virtual speakers, other than the S virtual speakers, of the K virtual speakers and the target virtual speaker. When the target virtual speaker is determined, the target virtual speaker is a central virtual speaker having a highest correlation with the HOA coefficient of the to-be-processed audio signal. S virtual speakers corresponding to each central virtual speaker are S virtual speakers having highest correlations with HOA coefficients of the central virtual speaker. Therefore, the S virtual speakers corresponding t