US-12626709-B2 - Audio super resolution
Abstract
Methods, systems, and apparatus, including computer programs encoded on computer storage media, relate to a method for audio super resolution. The system receives an audio signal. When the sampling rate of the audio signal is below a sampling rate threshold, or the frequency range of the audio signal is below a frequency range threshold, the audio signal is input to an audio super resolution model comprising a machine learning model. The audio super resolution model processes the audio signal to generate a synthetic audio signal with a wider frequency range than that of the input audio signal.
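The gating decision described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the threshold values and the 99%-energy bandwidth estimate are assumptions chosen for the example.

```python
import numpy as np

# Hypothetical thresholds chosen for illustration; the patent does not
# disclose specific values in the abstract.
SAMPLING_RATE_THRESHOLD_HZ = 32_000
FREQUENCY_RANGE_THRESHOLD_HZ = 8_000

def effective_bandwidth_hz(signal: np.ndarray, sampling_rate: int,
                           energy_fraction: float = 0.99) -> float:
    """Estimate the frequency below which `energy_fraction` of the spectral
    energy lies, as a rough proxy for the signal's frequency range."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    cumulative = np.cumsum(spectrum)
    if cumulative[-1] == 0:
        return 0.0
    bin_index = int(np.searchsorted(cumulative, energy_fraction * cumulative[-1]))
    # rfft bin k corresponds to frequency k * fs / N
    return bin_index * sampling_rate / len(signal)

def needs_super_resolution(signal: np.ndarray, sampling_rate: int) -> bool:
    """Route the signal to the super resolution model when its sampling rate
    or estimated frequency range falls below the thresholds."""
    if sampling_rate < SAMPLING_RATE_THRESHOLD_HZ:
        return True
    return effective_bandwidth_hz(signal, sampling_rate) < FREQUENCY_RANGE_THRESHOLD_HZ
```

Under these assumed thresholds, a 1 kHz tone sampled at 48 kHz would be flagged for super resolution (its estimated bandwidth is far below 8 kHz), while broadband noise at 48 kHz would pass through unchanged.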
Inventors
- Yuhui Chen
- Zhaofeng Jia
- Qiyong Liu
- Zhengwei Wei
Assignees
- Zoom Video Communications, Inc.
Dates
- Publication Date: 2026-05-12
- Application Date: 2021-10-31
Claims (20)
- 1. A method comprising: receiving an audio signal from a first client device during a virtual conference, a plurality of client devices connected to the virtual conference; determining an energy of a portion of the audio signal; after determining that a ratio of the energy of the portion of the audio signal to a total energy of the audio signal exceeds a threshold, performing audio super-resolution, performing audio super-resolution comprising: inputting the audio signal to an audio super resolution model comprising a neural network, one or more encoder blocks, and one or more decoder blocks, the one or more encoder blocks configured to down-sample the audio signal and provide the down-sampled audio signal to the neural network, and the one or more decoder blocks configured to up-sample an output of the neural network; generating, by the audio super resolution model, a synthetic audio signal based on the audio signal, wherein the synthetic audio signal comprises a wider frequency range than the frequency range of the audio signal; and transmitting the synthetic audio signal to each other client device of the plurality of client devices instead of the audio signal.
- 2. The method of claim 1, wherein the synthetic audio signal includes a low frequency portion and a high frequency portion, the audio signal includes a low frequency portion, and the low frequency portion of the synthetic audio signal is the same as the low frequency portion of the audio signal.
- 3. The method of claim 1, wherein the synthetic audio signal includes a low frequency portion, a high frequency portion, and a frequency gap comprising a frequency range between the low frequency portion and the high frequency portion without audio content.
- 4. The method of claim 1, further comprising: determining, by the audio super resolution model, that first content in the audio signal comprises noise and that second content in the audio signal comprises non-noise; and generating, by the audio super resolution model, a corresponding high frequency audio signal portion for the second content and not the first content.
- 5. The method of claim 1, wherein the portion of the audio signal comprises a low frequency portion of the audio signal below a frequency range threshold, and further comprising: determining that a frequency range is below a frequency range threshold based on the ratio.
- 6. The method of claim 1, wherein the audio super resolution model comprises a convolutional neural network (CNN) including at least one encoder layer and at least one decoder layer.
- 7. The method of claim 1, wherein the audio super resolution model is trained using a generative adversarial network (GAN), the GAN including a discriminator network that evaluates a generated audio signal of the audio super resolution model to determine whether the generated audio signal comprises real-world data or generated data.
- 8. A non-transitory computer readable medium comprising processor-executable program instructions configured to cause one or more processors to: receive an audio signal from a first client device during a virtual conference, a plurality of client devices connected to the virtual conference; determine an energy of a portion of the audio signal; after determining that a ratio of the energy of the portion of the audio signal to a total energy of the audio signal exceeds a threshold, perform audio super-resolution, performing audio super-resolution comprising: input the audio signal to an audio super resolution model comprising a neural network, one or more encoder blocks, and one or more decoder blocks, the one or more encoder blocks configured to down-sample the audio signal and provide the down-sampled audio signal to the neural network, and the one or more decoder blocks configured to up-sample an output of the neural network; generate, by the audio super resolution model, a synthetic audio signal based on the audio signal, wherein the synthetic audio signal comprises a wider frequency range than the frequency range of the audio signal; and transmit the synthetic audio signal to each other client device of the plurality of client devices instead of the audio signal.
- 9. The non-transitory computer readable medium of claim 8, wherein the synthetic audio signal includes a low frequency portion and a high frequency portion, the audio signal includes a low frequency portion, and the low frequency portion of the synthetic audio signal is the same as the low frequency portion of the audio signal.
- 10. The non-transitory computer readable medium of claim 8, wherein the synthetic audio signal includes a low frequency portion, a high frequency portion, and a frequency gap comprising a frequency range between the low frequency portion and the high frequency portion without audio content.
- 11. The non-transitory computer readable medium of claim 8, further comprising processor-executable program instructions configured to cause the one or more processors to: determine, by the audio super resolution model, that first content in the audio signal comprises noise and that second content in the audio signal comprises non-noise; and generate, by the audio super resolution model, a corresponding high frequency audio signal portion for the second content and not the first content.
- 12. The non-transitory computer readable medium of claim 8, wherein the portion of the audio signal comprises a low frequency portion of the audio signal below a frequency range threshold, and further comprising processor-executable program instructions configured to cause the one or more processors to: determine that a frequency range is below a frequency range threshold based on the ratio.
- 13. The non-transitory computer readable medium of claim 8, wherein the audio super resolution model comprises a CNN including at least one encoder layer and at least one decoder layer.
- 14. The non-transitory computer readable medium of claim 8, wherein the audio super resolution model is trained using a GAN, the GAN including a discriminator network that evaluates a generated audio signal of the audio super resolution model to determine whether the generated audio signal comprises real-world data or generated data.
- 15. A system comprising: a non-transitory computer-readable medium; and one or more processors communicatively coupled to the non-transitory computer-readable medium, the one or more processors configured to execute processor-executable instructions stored in the non-transitory computer-readable medium to: receive an audio signal from a first client device during a virtual conference, a plurality of client devices connected to the virtual conference; determine an energy of a portion of the audio signal; after determining that a ratio of the energy of the portion of the audio signal to a total energy of the audio signal exceeds a threshold, perform audio super-resolution, performing audio super-resolution comprising: input the audio signal to an audio super resolution model comprising a neural network, one or more encoder blocks, and one or more decoder blocks, the one or more encoder blocks configured to down-sample the audio signal and provide the down-sampled audio signal to the neural network, and the one or more decoder blocks configured to up-sample an output of the neural network; generate, by the audio super resolution model, a synthetic audio signal based on the audio signal, wherein the synthetic audio signal comprises a wider frequency range than the frequency range of the audio signal; and transmit the synthetic audio signal to each other client device of the plurality of client devices instead of the audio signal.
- 16. The system of claim 15, wherein the synthetic audio signal includes a low frequency portion and a high frequency portion, the audio signal includes a low frequency portion, and the low frequency portion of the synthetic audio signal is the same as the low frequency portion of the audio signal.
- 17. The system of claim 15, wherein the synthetic audio signal includes a low frequency portion, a high frequency portion, and a frequency gap comprising a frequency range between the low frequency portion and the high frequency portion without audio content.
- 18. The system of claim 15, wherein the one or more processors are configured to execute further processor-executable instructions configured to cause the one or more processors to: determine, by the audio super resolution model, that first content in the audio signal comprises noise and that second content in the audio signal comprises non-noise; and generate, by the audio super resolution model, a corresponding high frequency audio signal portion for the second content and not the first content.
- 19. The system of claim 15, wherein the portion of the audio signal comprises a low frequency portion of the audio signal below a frequency range threshold and wherein the one or more processors are configured to execute further processor-executable instructions configured to cause the one or more processors to: determine that a frequency range is below a frequency range threshold based on the ratio.
- 20. The system of claim 15, wherein the audio super resolution model comprises a CNN including at least one encoder layer and at least one decoder layer.
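The pipeline recited in claim 1 (energy-ratio gating, encoder down-sampling, a core neural network, decoder up-sampling) can be sketched with stand-in components. This is a shape-flow illustration only: the pooling/repetition blocks, the identity core, and the 4 kHz cutoff and 0.95 ratio threshold are assumptions for the example, not values or layers disclosed in the patent.

```python
import numpy as np

def low_band_energy_ratio(signal: np.ndarray, sampling_rate: int,
                          cutoff_hz: float = 4_000) -> float:
    """Ratio of spectral energy below `cutoff_hz` to total energy
    (the determination step of claim 1, with an assumed cutoff)."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sampling_rate)
    total = spectrum.sum()
    return 0.0 if total == 0 else spectrum[freqs < cutoff_hz].sum() / total

def encoder_block(x: np.ndarray) -> np.ndarray:
    """Down-sample by 2 (stand-in for a strided convolution block)."""
    return x.reshape(-1, 2).mean(axis=1)

def decoder_block(x: np.ndarray) -> np.ndarray:
    """Up-sample by 2 (stand-in for a transposed convolution block)."""
    return np.repeat(x, 2)

def core_network(x: np.ndarray) -> np.ndarray:
    """Identity stub for the neural network that would synthesize the
    missing high-band content from the down-sampled representation."""
    return x

def super_resolve(signal: np.ndarray, num_blocks: int = 3) -> np.ndarray:
    """Encoder blocks down-sample, the core transforms, decoder blocks
    up-sample; skip connections mirror common U-Net-style designs."""
    skips, x = [], signal
    for _ in range(num_blocks):
        skips.append(x)
        x = encoder_block(x)
    x = core_network(x)
    for _ in range(num_blocks):
        x = decoder_block(x) + skips.pop()
    return x

def process(signal: np.ndarray, sampling_rate: int,
            ratio_threshold: float = 0.95) -> np.ndarray:
    """Apply super resolution only when the low-band energy ratio exceeds
    the threshold; otherwise pass the signal through unchanged."""
    if low_band_energy_ratio(signal, sampling_rate) > ratio_threshold:
        return super_resolve(signal)
    return signal
```

In a real system the stubs would be trained layers, and per claims 7 and 14 the model could be trained adversarially, with a discriminator judging whether generated audio is real-world or synthetic.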
Description
FIELD

This application relates generally to audio processing, and more particularly, to systems and methods for improving audio quality through frequency bandwidth extension.

SUMMARY

The appended claims may serve as a summary of this application.

BRIEF DESCRIPTION OF THE DRAWINGS

- FIG. 1A is a diagram illustrating an exemplary environment in which some embodiments may operate.
- FIG. 1B is a diagram illustrating an exemplary computer system with software and/or hardware modules that may execute some of the functionality described herein.
- FIG. 1C is a diagram illustrating an exemplary audio super resolution training platform.
- FIG. 2 is a diagram illustrating an exemplary environment including computer systems with audio super resolution functionality.
- FIG. 3 is a diagram illustrating an exemplary method for a selector to determine whether to use an audio super resolution model.
- FIG. 4 is an image illustrating exemplary audio signals of the same speech with a low sampling rate and a high sampling rate.
- FIG. 5 is an image illustrating an exemplary audio signal with a low frequency range and a high sampling rate.
- FIG. 6 is a diagram illustrating an exemplary audio super resolution model according to one embodiment of the present disclosure.
- FIG. 7 is a diagram illustrating a more detailed view of encoder and decoder blocks of an exemplary audio super resolution model according to one embodiment of the present disclosure.
- FIG. 8 is an image illustrating an exemplary input audio signal and a generated synthetic audio signal of the audio super resolution module.
- FIG. 9 is a diagram illustrating an exemplary GAN according to one embodiment of the present disclosure.
- FIG. 10 is a diagram illustrating an exemplary discriminator according to one embodiment of the present disclosure.
- FIG. 11 is an image illustrating exemplary audio signals used for training the audio super resolution model for noisy speech.
- FIG. 12 is an image illustrating exemplary audio signals used for training the audio super resolution model.
- FIG. 13 illustrates an exemplary method that may be performed in some embodiments.
- FIG. 14 illustrates an exemplary method that may be performed in some embodiments.
- FIG. 15 illustrates an exemplary method that may be performed in some embodiments.
- FIG. 16 illustrates an exemplary method that may be performed in some embodiments.
- FIG. 17 is a diagram illustrating an exemplary computer that may perform processing in some embodiments.

DETAILED DESCRIPTION OF THE DRAWINGS

In this specification, reference is made in detail to specific embodiments of the invention. Some of the embodiments or their aspects are illustrated in the drawings. For clarity in explanation, the invention has been described with reference to specific embodiments; however, it should be understood that the invention is not limited to the described embodiments. On the contrary, the invention covers alternatives, modifications, and equivalents as may be included within its scope as defined by any patent claims. The following embodiments of the invention are set forth without any loss of generality to, and without imposing limitations on, the claimed invention. In the following description, specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In addition, well known features may not have been described in detail to avoid unnecessarily obscuring the invention. It should also be understood that steps of the exemplary methods set forth in this patent can be performed in different orders than the order presented in this specification. Furthermore, some steps of the exemplary methods may be performed in parallel rather than sequentially.
Also, the steps of the exemplary methods may be performed in a network environment in which some steps are performed by different computers in the networked environment.

Some embodiments are implemented by a computer system. A computer system may include a processor, a memory, and a non-transitory computer-readable medium. The memory and non-transitory medium may store instructions for performing the methods and steps described herein.

In general, one innovative aspect of the subject matter described in this specification can be embodied in systems, computer-readable media, and methods that include operations for audio super resolution. One system may receive an audio signal, such as during a video conference or other application. The system may evaluate the sampling rate or frequency range of the audio signal to determine whether to apply an audio super resolution model, such as when the audio signal lacks content in a high frequency range. Based on this determination, the audio signal may be input to the audio super resolution model for processing. The audio super resolution model may comprise a machine learning model, such as a neural network and optionally one or m