JP-7855728-B2 - Loss-conditional training and use of neural networks for audio processing
Inventors
- Biswas, Arijit
Assignees
- Dolby International AB
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2023-06-07
- Priority Date
- 2022-06-08
Claims (20)
- A computer-implemented method of loss-conditional training of a neural network for outputting an improved audio signal, the method comprising: randomly sampling a coefficient vector from a distribution of coefficients, wherein elements of the coefficient vector represent weighting coefficients corresponding to loss terms of a loss function; adjusting the neural network based on the coefficient vector; and training the adjusted neural network based on an audio training signal, wherein the training includes calculating the loss function for the audio training signal after processing by the adjusted neural network, using the weighting coefficients indicated by the coefficient vector.
- The method according to claim 1, wherein the loss function is a multi-objective loss function.
- The method according to claim 1, wherein the distribution of the coefficients is a uniform distribution within a predetermined range.
- The method according to claim 1, wherein the adjustment of the neural network includes feature-wise linear modulation (FiLM).
- The method according to claim 1, wherein randomly sampling the coefficient vector, adjusting the neural network, and training the adjusted neural network constitute at least part of an epoch, and the method further comprises performing two or more epochs for each set of audio content types.
- The method according to claim 1, wherein the training of the adjusted neural network is performed in a perceptually weighted domain.
- The method according to claim 1, wherein the neural network implements a deep learning-based generator, the generator comprising an encoder stage and a decoder stage, each comprising a plurality of layers having one or more filters in each layer, and the last layer of the encoder stage is mapped to a latent feature space.
- The method according to claim 7, wherein adjusting the neural network includes adjusting one or more layers of the encoder stage of the generator adjacent to the latent feature space.
- The method according to claim 7, wherein the generator is trained in a generative adversarial network (GAN) setting including the generator and a discriminator.
- The method according to claim 9, wherein training the adjusted neural network includes: inputting the audio training signal to the adjusted generator; generating, by the adjusted generator, a processed audio training signal based on the audio training signal; inputting the processed audio training signal and a corresponding original audio signal, from which the audio training signal was derived, one at a time to the discriminator; determining, by the discriminator, whether the audio signal input to it is the processed audio training signal or the original audio signal; and sequentially and iteratively tuning the parameters of the generator until the discriminator can no longer distinguish the processed audio training signal from the original audio signal.
- The method according to claim 10, wherein a random noise vector z is applied to the latent feature space to modify the audio.
- A computer-implemented method for processing an audio signal using a loss-conditionally trained neural network, the method comprising: adjusting the neural network based on adjustment information including a coefficient vector, wherein elements of the coefficient vector represent weighting coefficients corresponding to loss terms of a loss function; inputting the audio signal to the adjusted neural network for processing; processing the audio signal based on the adjustment information using the adjusted neural network; and obtaining an improved audio signal as an output from the adjusted neural network.
- The method according to claim 12, wherein the loss function is a multi-objective loss function.
- The method according to claim 12, wherein the adjustment information is based on the content type and/or bitrate of the audio signal.
- The method according to claim 12, wherein adjusting the neural network includes feature-wise linear modulation (FiLM).
- The method according to claim 12, wherein the neural network implements a deep learning-based generator, the generator comprising an encoder stage and a decoder stage, each comprising a plurality of layers having one or more filters in each layer, and the last layer of the encoder stage is mapped to a latent feature space.
- The method according to claim 16, wherein adjusting the neural network includes adjusting one or more layers of the encoder stage of the generator adjacent to the latent feature space.
- The method according to claim 16, wherein a random noise vector z is applied to the latent feature space to modify the audio.
- The method according to claim 12, further comprising receiving an audio bitstream containing the audio signal and the adjustment information.
- The method according to claim 19, further comprising core-decoding the audio bitstream to obtain the audio signal.
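To make the claimed training procedure concrete, the following is a minimal PyTorch-style sketch of one loss-conditional training step in the spirit of claims 1 to 4: a coefficient vector is sampled from a uniform distribution, the network is adjusted via feature-wise linear modulation (FiLM), and a weighted multi-objective loss is computed with the sampled coefficients. The toy network, the two loss terms (a time-domain and a spectral L1 term), and all hyper-parameters are illustrative assumptions, not the patented implementation.

```python
# Minimal sketch of one loss-conditional training step (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FiLMConditionedEnhancer(nn.Module):
    """Toy enhancement network whose features are modulated (FiLM) by the
    sampled coefficient vector of per-loss-term weights."""
    def __init__(self, channels: int = 16, num_loss_terms: int = 2):
        super().__init__()
        self.conv_in = nn.Conv1d(1, channels, kernel_size=9, padding=4)
        self.conv_out = nn.Conv1d(channels, 1, kernel_size=9, padding=4)
        # FiLM generator: maps the coefficient vector to per-channel scale/shift.
        self.film = nn.Linear(num_loss_terms, 2 * channels)

    def forward(self, x: torch.Tensor, coeffs: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.conv_in(x))
        gamma, beta = self.film(coeffs).chunk(2, dim=-1)   # (batch, channels) each
        h = gamma.unsqueeze(-1) * h + beta.unsqueeze(-1)   # feature-wise linear modulation
        return self.conv_out(h)

model = FiLMConditionedEnhancer()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(coded_audio: torch.Tensor, reference_audio: torch.Tensor) -> float:
    batch = coded_audio.size(0)
    # 1) Randomly sample the coefficient vector (here: uniform on [0, 1)).
    coeffs = torch.rand(2)
    # 2) Adjust (condition) the network on the sampled coefficients via FiLM.
    enhanced = model(coded_audio, coeffs.expand(batch, -1))
    # 3) Multi-objective loss: each term is weighted by its sampled coefficient.
    time_loss = F.l1_loss(enhanced, reference_audio)
    spec_loss = F.l1_loss(torch.fft.rfft(enhanced, dim=-1).abs(),
                          torch.fft.rfft(reference_audio, dim=-1).abs())
    loss = coeffs[0] * time_loss + coeffs[1] * spec_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example call with random stand-in tensors of shape (batch, channels, samples):
loss_value = training_step(torch.randn(4, 1, 1024), torch.randn(4, 1, 1024))
```

Because the network is exposed to many different loss weightings during training, a single trained model can later be steered toward a particular content type, bitrate, or codec by supplying the corresponding coefficient vector as adjustment information at inference time, as in claims 12 to 20.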
Description
Cross-Reference to Related Applications
This application claims priority to U.S. Provisional Application No. 63/350,099, filed on June 8, 2022, and European Patent Application No. 22177849.1, filed on June 8, 2022, all of which are incorporated herein by reference.
Technical Field
The present disclosure generally relates to methods for loss-conditional training of neural networks. In particular, a coefficient vector is randomly sampled from a distribution of coefficients, and the neural network is adjusted based on said coefficient vector. The disclosure further relates to computer-implemented methods for processing audio signals using loss-conditionally trained neural networks. The disclosure also relates to respective apparatus and respective computer program products. While some embodiments are described herein with particular reference to these disclosures, it will be appreciated that the present disclosure is not limited to such fields of use and is applicable in broader contexts. Nothing discussed throughout this disclosure regarding background technologies should be construed as an acknowledgment that such technologies are widely known or form part of the common general knowledge in the art.
Audio quality as perceived by humans is a core performance metric in many audio devices. An audio codec is a computer program designed to encode and decode digital audio streams; more precisely, it compresses digital audio data into a compressed format and decompresses it from that format using codec algorithms. Audio codecs are intended to reduce storage space and bandwidth while maintaining high fidelity of the transmitted signal. However, lossy compression methods introduce coding artifacts that can impair audio quality. Deep learning approaches are becoming increasingly attractive in various application areas, including audio enhancement. Most deep learning approaches to date have focused on speech denoising. Intuitively, one might assume that coding-artifact reduction and noise reduction in general are closely related. However, removing coding artifacts or noise that is highly correlated with the desired sound often proves more complex than removing other, less correlated types of noise (as in denoising applications). The characteristics of coding artifacts depend on the codec, the coding tools used, and the selected bitrate. Furthermore, modeling audio signals containing tonal content such as speech and music is even more complex due to the periodic components naturally present in these types of signals. Moreover, deep convolutional models used to reduce coding artifacts and coding noise are extremely complex in terms of model parameters and/or memory usage, and therefore introduce a high computational load. Furthermore, when different signal categories such as speech, music, mixtures of speech and music, and applause, as well as various bitrates and codecs, need to be covered, separate models are typically trained, each providing the best possible performance for its respective task.
Example embodiments of the present disclosure are described below, by way of example only, with reference to the accompanying drawings. An example of a loss-conditional training method for neural networks is presented, demonstrating loss-conditional training in a generative adversarial network (GAN) setting that includes a generator and a discriminator. A schematic example of a simple generator architecture is shown below.
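The referenced drawing is not reproduced here. As a rough, non-limiting sketch of such an encoder/decoder generator, the following PyTorch-style code assumes arbitrary layer counts, kernel sizes, and channel widths; it conditions the encoder layer adjacent to the latent feature space via FiLM and applies a random noise vector z in that space, in the spirit of claims 7 to 11.

```python
# Illustrative encoder/decoder generator sketch (not the patented architecture).
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, channels=(16, 32, 64), num_loss_terms=2):
        super().__init__()
        c1, c2, c3 = channels
        self.encoder = nn.ModuleList([
            nn.Conv1d(1,  c1, kernel_size=15, stride=2, padding=7),
            nn.Conv1d(c1, c2, kernel_size=15, stride=2, padding=7),
            nn.Conv1d(c2, c3, kernel_size=15, stride=2, padding=7),  # maps to latent feature space
        ])
        self.decoder = nn.ModuleList([
            nn.ConvTranspose1d(c3, c2, kernel_size=16, stride=2, padding=7),
            nn.ConvTranspose1d(c2, c1, kernel_size=16, stride=2, padding=7),
            nn.ConvTranspose1d(c1, 1,  kernel_size=16, stride=2, padding=7),
        ])
        # FiLM conditioning applied only to the encoder layer adjacent to the latent space.
        self.film = nn.Linear(num_loss_terms, 2 * c3)

    def forward(self, x, coeffs):
        skips = []
        for i, enc in enumerate(self.encoder):
            x = torch.relu(enc(x))
            if i == len(self.encoder) - 1:          # layer adjacent to the latent feature space
                gamma, beta = self.film(coeffs).chunk(2, dim=-1)
                x = gamma.unsqueeze(-1) * x + beta.unsqueeze(-1)
            skips.append(x)
        x = x + torch.randn_like(x)                 # random noise vector z applied in latent space
        for i, dec in enumerate(self.decoder):
            x = dec(x if i == 0 else x + skips[-(i + 1)])            # encoder skip connections
            x = torch.relu(x) if i < len(self.decoder) - 1 else torch.tanh(x)
        return x

generator = Generator()
enhanced = generator(torch.randn(2, 1, 1024), torch.rand(2, 2))      # output: (2, 1, 1024)
```

In the GAN setting of claims 9 to 11, such a conditioned generator would be trained against a discriminator that tries to distinguish the processed audio training signal from the corresponding original audio signal.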
An example of a computer-implemented method for processing audio signals using a loss-conditionally trained neural network is also shown, as are further examples of such computer-implemented methods and an example of an apparatus comprising one or more processors.
In deep learning-based methods for improving (coded) audio, the performance of a neural network (model) generally depends on several characteristics rather than a single one. A common approach to training such a neural network is to balance these characteristics by minimizing a loss function that is a weighted sum of terms, each measuring one of the characteristics. Depending on the weighting coefficients, training with this loss function yields a model best suited to a particular content type, bitrate, or codec. However, when different signal categories such as speech, music, mixtures of speech and music, and applause, as well as different bitrates and codecs, are to be covered, several separate neural networks with different weighting coefficients in the loss function are typically trained to achieve the best possible performance for each category. This requires maintaining multiple neural networks during both training and inference, resulting in high computational costs in both phases. The methods and apparatus described herein propose a loss-conditional training and inference strategy that allows for the training and inference of a