
JP-7855672-B2 - Quantization of spatial audio parameters


Inventors

  • Vasilache, Adriana

Assignees

  • Nokia Technologies Oy

Dates

Publication Date
2026-05-08
Application Date
2024-12-23

Claims (11)

  1. An apparatus for spatial audio encoding, comprising: means for quantizing and indexing a spatial audio direction parameter of a first time subframe of a frequency subband of an audio frame, and for providing a quantized spatial audio direction index for the first time subframe; means for determining a quantized spatial audio direction difference index for the first time subframe by calculating the difference between the quantized spatial audio direction index for the first time subframe and a quantized average spatial audio direction index of a previous frequency subband of the audio frame; means for determining an average spatial audio direction parameter for the first time subframe by taking a weighted average of the average spatial audio direction parameter for the corresponding frequency subband of a previous audio frame and the spatial audio direction parameter of the first time subframe; means for quantizing and indexing the average spatial audio direction parameter for the first time subframe, and for providing a quantized average spatial audio direction index to be used in a second time subframe of the frequency subband; means for quantizing and indexing the spatial audio direction parameter of the second time subframe, and for providing a quantized spatial audio direction index for the second time subframe; means for determining a quantized spatial audio direction difference index for the second time subframe by calculating the difference between the quantized spatial audio direction index for the second time subframe and the quantized average spatial audio direction index used in the second time subframe; means for determining an average spatial audio direction parameter for the second time subframe by averaging the spatial audio direction parameter of the first time subframe and the spatial audio direction parameter of the second time subframe; and means for quantizing and indexing the average spatial audio direction parameter for the second time subframe, and for providing a quantized average spatial audio direction index to be used in a third time subframe of the frequency subband.
  2. The apparatus according to claim 1, wherein the means for taking the weighted average of the average spatial audio direction parameter for the corresponding frequency subband of the previous audio frame and the spatial audio direction parameter of the first time subframe comprises: means for weighting the average spatial audio direction parameter for the corresponding frequency subband of the previous audio frame with a first weight; means for weighting the spatial audio direction parameter of the first time subframe with a second weight; and means for averaging the first-weighted average spatial audio direction parameter for the corresponding frequency subband of the previous audio frame and the second-weighted spatial audio direction parameter of the first time subframe to obtain the average spatial audio direction parameter for the first time subframe.
  3. The apparatus according to claim 1 or 2, further comprising means for encoding, using Golomb-Rice coding, the quantized spatial audio direction indices for the first and second time subframes of the frequency subband of the audio frame and the quantized average spatial audio direction index for the first frequency subband of the audio frame.
  4. The apparatus according to any one of claims 1 to 3, wherein the spatial audio direction parameter is an azimuth value in spherical coordinates.
  5. The apparatus according to any one of claims 1 to 4, wherein the averaging is performed in the Cartesian domain by: converting the spatial audio direction parameters from the spherical domain to Cartesian-domain parameters; averaging the Cartesian-domain parameters; and converting the averaged Cartesian-domain parameters back to the spherical domain.
  6. A method for spatial audio encoding, comprising: quantizing and indexing a spatial audio direction parameter of a first time subframe of a frequency subband of an audio frame, and providing a quantized spatial audio direction index for the first time subframe; determining a quantized spatial audio direction difference index for the first time subframe by calculating the difference between the quantized spatial audio direction index for the first time subframe and a quantized average spatial audio direction index of a previous frequency subband of the audio frame; determining an average spatial audio direction parameter for the first time subframe by taking a weighted average of the average spatial audio direction parameter for the corresponding frequency subband of a previous audio frame and the spatial audio direction parameter of the first time subframe; quantizing and indexing the average spatial audio direction parameter for the first time subframe to provide a quantized average spatial audio direction index to be used in a second time subframe of the frequency subband; quantizing and indexing the spatial audio direction parameter of the second time subframe, and providing a quantized spatial audio direction index for the second time subframe; determining a quantized spatial audio direction difference index for the second time subframe by calculating the difference between the quantized spatial audio direction index for the second time subframe and the quantized average spatial audio direction index used in the second time subframe; determining an average spatial audio direction parameter for the second time subframe by averaging the spatial audio direction parameter of the first time subframe and the spatial audio direction parameter of the second time subframe; and quantizing and indexing the average spatial audio direction parameter for the second time subframe to provide a quantized average spatial audio direction index to be used in a third time subframe of the frequency subband.
  7. The method according to claim 6, wherein taking the weighted average of the average spatial audio direction parameter for the corresponding frequency subband of the previous audio frame and the spatial audio direction parameter of the first time subframe comprises: weighting the average spatial audio direction parameter for the corresponding frequency subband of the previous audio frame with a first weight; weighting the spatial audio direction parameter of the first time subframe with a second weight; and averaging the first-weighted average spatial audio direction parameter for the corresponding frequency subband of the previous audio frame and the second-weighted spatial audio direction parameter of the first time subframe to obtain the average spatial audio direction parameter for the first time subframe.
  8. The method according to claim 6 or 7, further comprising encoding, using Golomb-Rice coding, the quantized spatial audio direction indices for the first and second time subframes of the frequency subband of the audio frame and the quantized average spatial audio direction index for the first frequency subband of the audio frame.
  9. The method according to any one of claims 6 to 8, wherein the spatial audio direction parameter is an azimuth value in spherical coordinates.
  10. The method according to any one of claims 6 to 9, wherein the averaging is performed in the Cartesian domain by: converting the spatial audio direction parameters from the spherical domain to Cartesian-domain parameters; averaging the Cartesian-domain parameters; and converting the averaged Cartesian-domain parameters back to the spherical domain.
  11. A computer program comprising instructions which, when executed, cause at least the following to be performed: quantizing and indexing a spatial audio direction parameter of a first time subframe of a frequency subband of an audio frame, and providing a quantized spatial audio direction index for the first time subframe; determining a quantized spatial audio direction difference index for the first time subframe by calculating the difference between the quantized spatial audio direction index for the first time subframe and a quantized average spatial audio direction index of a previous frequency subband of the audio frame; determining an average spatial audio direction parameter for the first time subframe by taking a weighted average of the average spatial audio direction parameter for the corresponding frequency subband of a previous audio frame and the spatial audio direction parameter of the first time subframe; quantizing and indexing the average spatial audio direction parameter for the first time subframe to provide a quantized average spatial audio direction index to be used in a second time subframe of the frequency subband; quantizing and indexing the spatial audio direction parameter of the second time subframe, and providing a quantized spatial audio direction index for the second time subframe; determining a quantized spatial audio direction difference index for the second time subframe by calculating the difference between the quantized spatial audio direction index for the second time subframe and the quantized average spatial audio direction index used in the second time subframe; determining an average spatial audio direction parameter for the second time subframe by averaging the spatial audio direction parameter of the first time subframe and the spatial audio direction parameter of the second time subframe; quantizing and indexing the average spatial audio direction parameter for the second time subframe to provide a quantized average spatial audio direction index to be used in a third time subframe of the frequency subband; and setting the quantized average spatial audio direction index for the second time subframe as the quantized average spatial audio direction index for the third time subframe of the frequency subband.
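Claims 1, 6, and 11 describe, per frequency subband, a loop in which each subframe's quantized direction index is coded differentially against a quantized running average, and that average is updated as the subframes are processed. The sketch below illustrates the two-subframe case in Python. The uniform azimuth quantizer, its step `step_deg`, and the weights `w_prev`/`w_cur` are illustrative assumptions, not details taken from the claims.

```python
# Hedged sketch of the two-subframe differential indexing of claims 1/6.
# Quantizer design and weight values are assumptions for illustration only.

def quantize_index(azimuth_deg, step_deg=10.0):
    """Quantize an azimuth (degrees) to an integer index on a uniform grid."""
    return round(azimuth_deg / step_deg)

def dequantize(index, step_deg=10.0):
    """Map an index back to an azimuth in degrees."""
    return index * step_deg

def encode_subband(az_sub1, az_sub2, avg_prev_frame, avg_idx_prev_subband,
                   step_deg=10.0, w_prev=0.5, w_cur=0.5):
    # Subframe 1: quantize the direction and code the difference against
    # the quantized average index of the previous (lower) frequency subband.
    idx1 = quantize_index(az_sub1, step_deg)
    diff1 = idx1 - avg_idx_prev_subband

    # Running average for subframe 1: weighted average of the previous
    # frame's average (same subband) and the current direction.
    avg1 = w_prev * avg_prev_frame + w_cur * az_sub1
    avg_idx1 = quantize_index(avg1, step_deg)   # used by subframe 2

    # Subframe 2: code the difference against that quantized average.
    idx2 = quantize_index(az_sub2, step_deg)
    diff2 = idx2 - avg_idx1

    # Average of the two subframe directions, quantized for use in the
    # third subframe (claim 11 carries this index forward).
    avg2 = 0.5 * (az_sub1 + az_sub2)
    avg_idx2 = quantize_index(avg2, step_deg)

    return diff1, diff2, avg_idx1, avg_idx2
```

Because only the small difference indices (plus one average index per subband) need to be transmitted, directions that evolve smoothly across subframes and subbands produce values clustered around zero, which suits the entropy coding of claims 3 and 8.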

Description

This application relates to apparatus and methods for sound-field-related parameter encoding, including, but not limited to, time-frequency-domain direction-related parameter encoding for audio encoders.

Parametric spatial audio processing is a field of audio signal processing in which the spatial aspect of the sound is described by a set of parameters. For example, in parametric spatial audio capture from microphone arrays, a typical and effective choice is to estimate from the microphone-array signals a set of parameters such as the direction of the sound in frequency bands and the ratio between the directional and non-directional parts of the captured sound in frequency bands. These parameters are known to describe well the perceptual spatial properties of the sound at the position of the microphone array. They can therefore be used in the synthesis of spatial sound for binaural headphones, for loudspeakers, or in other formats such as Ambisonics. The directions and the direct-to-total energy ratios in frequency bands are thus a particularly effective parameterization for spatial audio capture.

A parameter set consisting of direction parameters in frequency bands and time subframes, together with energy-ratio parameters in frequency bands (indicating the directionality of the sound), can also be used as spatial metadata for an audio codec (which may further include other parameters such as surround coherence, spread coherence, number of directions, distance, and so on). For example, these parameters can be estimated from audio signals captured by a microphone array, and a stereo or mono signal to be transmitted together with the spatial metadata can be generated from the microphone-array signals. The stereo signal could be encoded, for example, with an AAC encoder, and the mono signal with an EVS encoder.
A decoder can decode the audio signals into PCM signals and process the sound in frequency bands (using the spatial metadata) to obtain a spatial output, for example a binaural output.

The solution described above is particularly well suited for encoding spatial sound captured by microphone arrays (for example, in mobile phones, VR cameras, or standalone microphone arrays). However, it may be desirable for such an encoder to also accept input formats other than microphone-array captured signals, such as loudspeaker signals, audio-object signals, or Ambisonic signals.

The analysis of first-order Ambisonics (FOA) inputs for spatial metadata extraction has been thoroughly documented in the scientific literature related to Directional Audio Coding (DirAC) and harmonic planewave expansion (Harpex). This is because there exist microphone arrays that directly provide an FOA signal (more precisely, its variant, the B-format signal), and analyzing such inputs has therefore been a focus of research in the field. Furthermore, the analysis of higher-order Ambisonics (HOA) inputs for multi-direction spatial metadata extraction has been documented in the scientific literature related to higher-order Directional Audio Coding (HO-DirAC). Further inputs for the encoder include multi-channel loudspeaker input, such as 5.1- or 7.1-channel surround input, as well as audio-object input.

With respect to the components of the spatial metadata, there is considerable interest in compressing and encoding the spatial audio parameters (such as the spatial audio direction parameters) so as to minimize the overall number of bits required to represent them.
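Claims 3 and 8 name Golomb-Rice coding as the entropy code for the quantized direction indices. As an illustration only (the zigzag signed-to-unsigned mapping and the choice of the Rice parameter `k` below are assumptions, not details from the patent), a Golomb-Rice codeword for a non-negative integer n with divisor 2**k is the unary code of the quotient n >> k followed by the k low-order remainder bits:

```python
# Illustrative Golomb-Rice encoder; small difference indices near zero
# get short codewords, which is why it suits differential direction coding.

def zigzag(v):
    """Map a signed difference index to a non-negative integer:
    0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ..."""
    return (v << 1) if v >= 0 else ((-v << 1) - 1)

def golomb_rice(n, k):
    """Golomb-Rice codeword (as a bit string) for non-negative n with
    Rice parameter k, i.e. divisor 2**k."""
    q = n >> k
    bits = "1" * q + "0"                 # unary quotient, 0-terminated
    if k:
        bits += format(n & ((1 << k) - 1), f"0{k}b")  # k remainder bits
    return bits
```

For example, a zero difference index costs a single bit with k = 0, while larger magnitudes grow linearly in the unary part, so the code is efficient exactly when the differential indexing concentrates values near zero.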
According to a first aspect, there is provided an apparatus for spatial audio encoding, comprising means for quantizing and indexing a spatial audio direction parameter to form a quantized spatial audio direction index, wherein the spatial audio direction parameter is associated with a time subframe of a frequency subband of an audio frame, and means for determining a quantized spatial audio direction difference index by calculating the difference between the quantized spatial audio direction index and a quantized average spatial audio direction index. The quantized average spatial audio direction index may be determined by means for averaging at least two spatial audio direction parameters and providing an average spatial audio direction parameter, wherein the at least two spatial audio direction parameters are associated with consecutive time subframes of a preceding frequency subband, the preceding frequency subband being a frequency subband lower than the frequency subband, and means for quantizing and indexing the average spatial audio direction parameter. The apparatus may further comprise means for determining an initial average spatial audio direction parameter for a frequency subband, the determination being performed by weighting the average spatial audio direction parameter with a first weight, weighting the average spatial audio direction parameter associated with at least two spatial audio direction parameters fro
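Claims 5 and 10 perform the averaging of direction parameters in the Cartesian domain so that angles wrap correctly: averaging azimuths of 350° and 10° should give 0°, not the arithmetic mean of 180°. A minimal sketch of that conversion chain, assuming unit-vector conversion and equal weights (the function name and the degree-based interface are mine, not from the patent):

```python
import math

def average_azimuth(azimuths_deg):
    """Average azimuth angles by converting each to a unit vector in the
    Cartesian domain, summing, and converting back to the spherical
    domain (the spherical -> Cartesian -> average -> spherical chain of
    claims 5/10). Result is normalized to [0, 360) degrees."""
    x = sum(math.cos(math.radians(a)) for a in azimuths_deg)
    y = sum(math.sin(math.radians(a)) for a in azimuths_deg)
    return math.degrees(math.atan2(y, x)) % 360.0
```

For example, `average_azimuth([350, 10])` yields an angle numerically indistinguishable from 0°, whereas a naive arithmetic mean of the two values would give 180°, the opposite direction.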