JP-7855716-B2 - High-frequency reconstruction using neural network systems

JP 7855716 B2

Inventors

  • Ekstrand, Per
  • Kjörling, Kristofer
  • Klejsa, Janusz

Assignees

  • Dolby International AB

Dates

Publication Date
2026-05-08
Application Date
2023-04-14
Priority Date
2022-04-14

Claims (15)

  1. A method for reconstructing an audio signal, the method comprising: receiving a bitstream containing an encoded lowband audio signal representation and a set of high-frequency reconstruction (HFR) parameters; decoding the encoded lowband audio signal representation to provide a lowband audio signal in a filter bank domain; reconstructing a highband audio signal in the filter bank domain using a neural network system trained to predict samples of the highband audio signal in the filter bank domain, given samples of the lowband signal in the filter bank domain and the HFR parameters, wherein the neural network system is configured to autoregressively generate a current sample (x_m) for a current time slot (m) of the highband signal in the filter bank domain, each current sample containing a plurality of values corresponding to channels of the filter bank, the system comprising: a processing layer trained to generate conditioning information for the current sample based on quantized samples of the lowband signal in the filter bank domain and the HFR parameters, and an output layer subdivided into a plurality of sequentially executed sublayers, each sublayer trained to generate a subset of the values of the current sample, given the conditioning information from the processing layer and any values generated by previously executed sublayers; and synthesizing a time-domain output audio signal from the lowband signal in the filter bank domain and the reconstructed highband signal in the filter bank domain.
  2. The method according to claim 1, wherein the neural network system is trained to predict highband samples in the filter bank domain having reduced signal dynamics, and the method further comprises increasing the dynamics of the reconstructed highband signal in the filter bank domain.
  3. The method according to claim 2, further comprising using envelope data in the HFR parameters to envelope-adjust the reconstructed highband signal in the filter bank domain.
  4. The method according to claim 1, wherein the lowband signal in the filter bank domain was compressed during encoding, and the method further comprises decompressing the lowband signal in the filter bank domain before the synthesizing.
  5. The method according to claim 1, further comprising reconstructing an improved lowband audio signal in the filter bank domain using a neural network system trained to predict samples of the lowband signal in the filter bank domain, given decoded samples of the lowband signal in the filter bank domain, wherein the synthesizing is based on the reconstructed lowband signal in the filter bank domain and the reconstructed highband signal in the filter bank domain.
  6. The method according to claim 1, further comprising reconstructing the lowband audio signal representation using a neural network system trained to predict lowband filter bank domain samples, given quantized filter bank domain coefficients.
  7. The method according to claim 6, wherein the neural network system used to reconstruct the lowband audio signal representation operates in a first filter bank domain, and the neural network system used to reconstruct the highband audio signal operates in a second filter bank domain.
  8. The method according to claim 7, wherein the first filter bank domain is an MDCT domain and the second filter bank domain is a QMF domain.
  9. A decoder system comprising: a demultiplexer that separates a bitstream into an encoded lowband audio signal representation and a set of high-frequency reconstruction (HFR) parameters; a decoder that decodes the encoded lowband audio signal representation to provide a lowband audio signal in a filter bank domain; a generative model for reconstructing a highband signal in the filter bank domain, using a neural network system trained to predict samples of the highband audio signal in the filter bank domain, given samples of the lowband signal in the filter bank domain and the HFR parameters, wherein the neural network system is configured to autoregressively generate a current sample (x_m) for a current time slot (m) of the highband signal in the filter bank domain, each current sample containing a plurality of values corresponding to channels of the filter bank, the generative model comprising: a processing layer trained to generate conditioning information for the current sample based on quantized samples of the lowband signal in the filter bank domain and the HFR parameters, and an output layer subdivided into a plurality of sequentially executed sublayers, each sublayer trained to generate a subset of the values of the current sample, given the conditioning information from the processing layer and any values generated by previously executed sublayers; and a synthesis filter bank for synthesizing a time-domain audio signal from the lowband signal in the filter bank domain and the reconstructed highband signal in the filter bank domain.
  10. The decoder system according to claim 9, wherein the neural network system is also trained to predict samples of the lowband signal in the filter bank domain, given decoded samples of the lowband signal in the filter bank domain.
  11. The decoder system according to claim 10, wherein the neural network system comprises two submodels: a first submodel trained to predict samples of the lowband signal in the filter bank domain, given decoded samples of the lowband signal in the filter bank domain, and a second submodel trained to predict samples of the highband signal in the filter bank domain, given the predicted lowband signal in the filter bank domain and the HFR parameters.
  12. The decoder system according to claim 11, wherein the first submodel operates in a first filter bank domain and the second submodel operates in a second filter bank domain.
  13. The decoder system according to claim 12, wherein the first filter bank domain is an MDCT domain and the second filter bank domain is a QMF domain.
  14. A neural network system that autoregressively generates a current sample (x_m) for a current time slot (m) of a filter bank representation of an audio signal, each current sample containing a plurality of values corresponding to channels of the filter bank, the system comprising first and second submodels, each submodel including: a processing layer trained to generate conditioning information for the current sample, and an output layer subdivided into a plurality of sequentially executed sublayers, each sublayer trained to generate a subset of the values of the current sample, given the conditioning information from the processing layer and any values generated by previously executed sublayers; wherein the first submodel is given previously generated samples of the filter bank representation and is trained to generate the values of the current sample corresponding to a lowband frequency range, conditioned on quantized samples of the filter bank representation; and the second submodel is given previously generated samples of the filter bank representation and is trained to generate the values of the current sample corresponding to a highband frequency range, conditioned on the quantized samples of the filter bank representation and on a set of high-frequency reconstruction parameters.
  15. A computer program for causing a computer processor to perform the method described in any one of claims 1 to 8.
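The generation scheme recited in claims 1, 9, and 14 can be sketched as follows. This is a toy illustration, not the patented implementation: the dimensions, the random stand-in weights, and the single linear-plus-tanh layers are all hypothetical placeholders for trained network layers. It shows only the claimed structure, in which a processing layer produces conditioning information and sequentially executed output sublayers each generate a subset of the current sample's channel values, seeing the values produced by earlier sublayers.

```python
import numpy as np

rng = np.random.default_rng(0)

N_CH = 32       # high-band filter-bank channels (hypothetical)
N_SUB = 4       # sequentially executed output sublayers
CH_PER_SUB = N_CH // N_SUB
N_HFR = 8       # size of the HFR parameter vector (hypothetical)
COND_DIM = 16   # size of the conditioning vector (hypothetical)

# Stand-ins for trained weights: one processing layer, one weight
# matrix per output sublayer (input grows as earlier subsets arrive).
W_proc = rng.standard_normal((COND_DIM, N_CH + N_HFR + N_CH)) * 0.1
W_sub = [rng.standard_normal((CH_PER_SUB, COND_DIM + i * CH_PER_SUB)) * 0.1
         for i in range(N_SUB)]

def generate_slot(low_sample, hfr_params, prev_high):
    """Generate one time slot x_m of the high-band filter-bank signal.

    The processing layer maps the quantized low-band sample, the HFR
    parameters, and the previously generated high-band sample to
    conditioning information; each sublayer then produces a subset of
    the channel values, given the values of earlier sublayers."""
    cond = np.tanh(W_proc @ np.concatenate([low_sample, hfr_params, prev_high]))
    values = np.empty(0)
    for W in W_sub:
        subset = np.tanh(W @ np.concatenate([cond, values]))
        values = np.concatenate([values, subset])
    return values  # shape (N_CH,): one value per high-band channel

# Autoregressive loop over time slots m = 0 .. M-1.
M = 10
low = rng.standard_normal((M, N_CH))   # decoded low-band samples (dummy)
hfr = rng.standard_normal((M, N_HFR))  # HFR parameters (dummy)
high = np.zeros((M, N_CH))
prev = np.zeros(N_CH)
for m in range(M):
    prev = generate_slot(low[m], hfr[m], prev)
    high[m] = prev
```

A trained system would replace the random matrices with learned layers and feed the result to a synthesis filter bank, as in claim 1.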

Description

Cross-reference to Related Applications

This application claims priority to U.S. Provisional Application No. 63/331,056 (reference number: D21075USP1), filed on 14 April 2022, and European Application No. EP22168469.9 (reference number: D21075EP), filed on 14 April 2022, each of which is incorporated herein by reference in its entirety.

Technical Field of the Invention

The present invention relates to high-frequency reconstruction using a generative deep neural network operating in the filter bank domain.

At very low bitrates, existing audio encoders cannot encode the full signal bandwidth and are therefore forced to encode only the lower frequency range. For example, for 32 kbps stereo encoded with MP3 (ISO/MPEG-II Layer 3), the codec bandwidth can be as low as 4-6 kHz. While this may be sufficient for some use cases, it is generally desirable for the audio output to contain higher frequencies as well.

One technique, called "blind bandwidth extension," generates the higher frequency bands based solely on information in the lower frequency bands. While such processing can successfully provide bandwidth extension for certain isolated signal classes with well-determined signal statistics, such as speech and piano music, it fails for more complex signal types (e.g., general audio involving mixed music and speech, or other signal classes).

A more sophisticated technique called "high-frequency reconstruction" (HFR) uses side information describing characteristics of the higher-frequency (HFR) band, such as its spectral envelope, tone-to-noise ratio, or other highband characteristics, to reconstruct the HFR band. Such side-information-guided high-frequency reconstruction is known to work well for most signal classes. Examples include A-SPX in AC-4 (developed by Dolby Laboratories) and SBR in HE-AAC (ISO/MPEG standard).
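Side-information-guided reconstruction of the kind described above can be illustrated with a minimal sketch. Nothing below is taken from the patent or from a specific codec: the function name, the per-band energy envelope format, and all numbers are hypothetical. The sketch only shows the general idea that a transmitted spectral envelope steers the energy of a rough high-band approximation.

```python
import numpy as np

def apply_envelope(high_approx, target_env, eps=1e-12):
    """Scale each sub-band of a rough high-band approximation so that
    its average energy matches the transmitted envelope.

    high_approx: (time_slots, bands) sub-band samples
    target_env:  (bands,) target energy per band from the side info
    """
    measured = np.mean(high_approx ** 2, axis=0)          # energy per band
    gains = np.sqrt(target_env / np.maximum(measured, eps))
    return high_approx * gains                            # broadcast per band

rng = np.random.default_rng(1)
approx = rng.standard_normal((64, 16))   # rough high-band approximation (dummy)
env = np.full(16, 0.25)                  # transmitted envelope (toy values)
shaped = apply_envelope(approx, env)
```

Real HFR systems apply such gains per time-frequency envelope region and combine them with noise and tonality adjustments, but the basic energy-matching step is the same.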
In such an HFR system, the encoded bitstream includes a waveform-encoded low-frequency band and HFR side information that parameterizes the HFR band. Only a portion of the available bitrate is allocated to the HFR side information. The frequency at which the HFR range begins is called the "crossover" frequency. On the decoder side, the low-frequency band is decoded by the core decoder, and the HFR side information is used by the HFR module to reconstruct the HFR band.

In the HFR module, a unit called a "transposer" first generates an initial approximation of the highband. This approximation is then modified in various ways to resemble the original highband, in a process guided by the side information in the bitstream. One transposition method is the "copy-up" method (used, for example, in AC-4 and HE-AAC), where frequency chunks (contiguous sets of subband samples) from the decoded lowband are copied up into the HFR frequency range. This is a robust method with very low computational complexity, but it often suffers from single-sideband (SSB) distortion at low crossover frequencies. This is typically the case when encoding at low bitrates, because the available bits only allow a limited frequency range to be encoded by the waveform core encoder. Another transposition method is the harmonic transposer, used, for example, in the MPEG USAC standard, where a phase vocoder is used to generate second, third, and even fourth harmonics from the lowband. This type of transposition avoids SSB distortion, but the resulting high-frequency content is sometimes perceived as metallic and synthetic.

The present invention will be described in more detail with reference to the accompanying drawings, which illustrate currently preferred embodiments of the invention.

Fig. 1 is a process diagram of a decoder system according to a first embodiment of the present invention.
Fig. 2 is a schematic topology of the generative model of Fig. 1, according to one implementation.

Fig. 3 illustrates the training of the model of Fig. 2.

Fig. 4 is a process diagram of a decoder system according to a second embodiment of the present invention.

Fig. 5 is a schematic topology of the generative model of Fig. 4, according to one implementation.

Fig. 6 is a process diagram of a decoder system according to a third embodiment of the present invention.

The systems and methods disclosed in this application may be implemented as software, firmware, hardware, or a combination thereof. In a hardware implementation, the division of tasks does not necessarily correspond to the division into physical units; conversely, a single physical component may have multiple functions, and a single task may be carried out by several physical components in cooperation. The computer hardware may, for example, be a server computer, a client computer, a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a mobile phone, a smartphone, a web appliance, a network router, a switch or bridge, or any machine capable of executing instructions (sequentially or otherwise) that specify act