US-12620408-B2 - Generating audio using neural networks
Abstract
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating an output sequence of audio data that comprises a respective audio sample at each of a plurality of time steps. One of the methods includes, for each of the time steps: providing a current sequence of audio data as input to a convolutional subnetwork, wherein the current sequence comprises the respective audio sample at each time step that precedes the time step in the output sequence, and wherein the convolutional subnetwork is configured to process the current sequence of audio data to generate an alternative representation for the time step; and providing the alternative representation for the time step as input to an output layer, wherein the output layer is configured to: process the alternative representation to generate an output that defines a score distribution over a plurality of possible audio samples for the time step.
Inventors
- Aaron Gerard Antonius van den Oord
- Sander Etienne Lea Dieleman
- Nal Emmerich Kalchbrenner
- Karen Simonyan
- Oriol Vinyals
Assignees
- GDM HOLDING LLC
Dates
- Publication Date: 2026-05-05
- Application Date: 2023-11-27
Claims (20)
- 1 . A method performed by one or more computers, the method comprising: receiving data characterizing a sequence of text; processing a network input comprising the data characterizing the sequence of text using a neural network to generate a neural network output that defines audio data that is a verbalization of the sequence of text characterized by the network input, wherein: the neural network comprises a sequence of one or more convolutional neural network layers; and each convolutional neural network layer in the sequence of one or more convolutional neural network layers is configured to: receive a layer input; apply one or more convolution operations to the layer input; and generate a layer output based at least in part on a result of applying the one or more convolution operations to the layer input; wherein the neural network output directly comprises amplitude values of an audio waveform that is the verbalization of the sequence of text characterized by the network input, wherein the amplitude values of the audio waveform are directly generated as an output of an output neural network layer of the neural network, and wherein the neural network has been trained by applying a supervised learning technique that depends on (i) ground truth output audio waveforms for each of a set of training examples for the neural network and (ii) corresponding output audio waveforms generated by the neural network.
- 2 . The method of claim 1 , wherein the data characterizing the sequence of text comprises a sequence of phonemes corresponding to the sequence of text.
- 3 . The method of claim 1 , wherein for each of one or more convolutional neural network layers in the sequence of one or more convolutional neural network layers, applying one or more convolution operations to the layer input comprises: applying one or more dilated convolution operations to the layer input.
- 4 . The method of claim 3 , wherein the sequence of one or more convolutional neural network layers comprises a plurality of convolutional neural network layers that each implement respective dilated convolution operations associated with a respective different dilation rate.
- 5 . The method of claim 4 , wherein the sequence of convolutional neural network layers comprises a plurality of convolutional neural network layers that each have a dilation rate that is a constant multiple of a dilation rate of a preceding convolutional neural network layer.
- 6 . The method of claim 1 , wherein for each of one or more convolutional neural network layers in the sequence of one or more convolutional neural network layers, the convolutional neural network layer comprises a gated activation unit that is configured to: process the layer input by a main convolutional operation to generate a main convolutional output; process the layer input by a gate convolutional operation to generate a gating convolutional output; and generate a gated activation unit output by element-wise multiplying the main convolutional output and the gating convolutional output.
- 7 . The method of claim 1 , wherein the sequence of one or more convolutional neural network layers comprises one or more residual connections, wherein each residual connection is configured to route the layer input to the convolutional neural network layer to a summer that sums the layer input with an intermediate output generated by the convolutional neural network layer, wherein the layer output is based at least in part on the sum of the layer input with the intermediate output.
- 8 . The method of claim 1 , wherein the sequence of one or more convolutional neural network layers comprises a plurality of convolutional neural network layers.
- 9 . The method of claim 1 , wherein the neural network is conditioned on speaker identity data; and wherein the audio data that is the verbalization of the sequence of text is expressed in a voice associated with the speaker identity data.
- 10 . The method of claim 1 , further comprising: evaluating an objective function that measures an error in the audio waveform; and backpropagating gradients of the objective function through the neural network.
- 11 . The method of claim 1 , wherein: the network input comprises data characterizing a conditioning input; and wherein processing the network input comprising the data characterizing the sequence of text using the neural network to generate the neural network output that defines audio data that is the verbalization of the sequence of text characterized by the network input comprises: processing the network input comprising the data characterizing the sequence of text using the neural network and the data characterizing the conditioning input to generate the neural network output that defines audio data that is the verbalization of the sequence of text characterized by the network input, as conditioned on the conditioning input.
- 12 . The method of claim 11 , wherein the conditioning input comprises image data.
- 13 . The method of claim 11 , wherein the conditioning input comprises video data.
- 14 . The method of claim 11 , wherein the conditioning input comprises data characterizing a particular speaker for the verbalization of the sequence of text.
- 15 . The method of claim 11 , wherein the conditioning input comprises data characterizing a particular language for the verbalization of the sequence of text.
- 16 . The method of claim 11 , wherein the conditioning input comprises data characterizing particular music for the verbalization of the sequence of text.
- 17 . A system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform operations comprising: receiving data characterizing a sequence of text; processing a network input comprising the data characterizing the sequence of text using a neural network to generate a neural network output that defines audio data that is a verbalization of the sequence of text characterized by the network input, wherein: the neural network comprises a sequence of one or more convolutional neural network layers; and each convolutional neural network layer in the sequence of one or more convolutional neural network layers is configured to: receive a layer input; apply one or more convolution operations to the layer input; and generate a layer output based at least in part on a result of applying the one or more convolution operations to the layer input; wherein the neural network output directly comprises amplitude values of an audio waveform that is the verbalization of the sequence of text characterized by the network input, and wherein the amplitude values of the audio waveform are directly generated as an output of an output neural network layer of the neural network, and wherein the neural network has been trained by applying a supervised learning technique that depends on (i) ground truth output audio waveforms for each of a set of training examples for the neural network and (ii) corresponding output audio waveforms generated by the neural network.
- 18 . The system of claim 17 , wherein the data characterizing the sequence of text comprises a sequence of phonemes corresponding to the sequence of text.
- 19 . One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations comprising: receiving data characterizing a sequence of text; processing a network input comprising the data characterizing the sequence of text using a neural network to generate a neural network output that defines audio data that is a verbalization of the sequence of text characterized by the network input, wherein: the neural network comprises a sequence of one or more convolutional neural network layers; and each convolutional neural network layer in the sequence of one or more convolutional neural network layers is configured to: receive a layer input; apply one or more convolution operations to the layer input; and generate a layer output based at least in part on a result of applying the one or more convolution operations to the layer input; wherein the neural network output directly comprises amplitude values of an audio waveform that is the verbalization of the sequence of text characterized by the network input, wherein the amplitude values of the audio waveform are directly generated as an output of an output neural network layer of the neural network, and wherein the neural network has been trained by applying a supervised learning technique that depends on (i) ground truth output audio waveforms for each of a set of training examples for the neural network and (ii) corresponding output audio waveforms generated by the neural network.
- 20 . The non-transitory computer storage media of claim 19 , wherein the data characterizing the sequence of text comprises a sequence of phonemes corresponding to the sequence of text.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is a continuation of U.S. application Ser. No. 17/838,985, filed on Jun. 13, 2022, which is a continuation of U.S. application Ser. No. 17/020,348, filed on Sep. 14, 2020 (now U.S. Pat. No. 11,386,914), which is a continuation of U.S. application Ser. No. 16/390,549, filed on Apr. 22, 2019 (now U.S. Pat. No. 10,803,884), which is a continuation of U.S. application Ser. No. 16/030,742, filed on Jul. 9, 2018 (now U.S. Pat. No. 10,304,477), which is a continuation of PCT Application No. PCT/US2017/050320, filed on Sep. 6, 2017, which claims priority to U.S. Provisional Application No. 62/384,115, filed on Sep. 6, 2016. The disclosures of the prior applications are considered part of and are incorporated by reference in the disclosure of this application.
BACKGROUND
This specification relates to processing and generating audio using neural networks. Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.
SUMMARY
This specification describes how a system implemented as computer programs on one or more computers in one or more locations can generate a sequence of audio data that includes a respective audio sample at each of multiple time steps. For example, the sequence of audio data can represent speech in a particular natural language or a piece of music.
In one innovative aspect, a neural network system implemented by one or more computers is configured to generate an output sequence of audio data that comprises a respective audio sample at each of a plurality of time steps. The neural network system may comprise a convolutional subnetwork comprising one or more audio-processing convolutional neural network layers; and an output layer.
The convolutional subnetwork may be configured to, for each of the plurality of time steps: receive a current sequence of audio data that comprises the respective audio sample at each time step that precedes the (current) time step in the output sequence. The convolutional subnetwork may further be configured to process the current sequence of audio data to generate an alternative representation for the (current) time step. This alternative representation may thus comprise a numeric representation, i.e., an ordered collection of numeric values, in which the current sequence of audio data has been encoded by the convolutional subnetwork, for example encoding features of the current sequence.
The output layer may be configured to, for each of the plurality of time steps: receive the alternative representation for the time step, and process the alternative representation for the time step to generate an output that defines a score distribution over a plurality of possible audio samples for the time step. Some of the many advantages of such a system are described later. The system can use the score distribution to select a sample for the current time step by sampling from the distribution. The output may, but need not necessarily, comprise one score for each possible audio sample value, for example 256 scores for 256 possible values.
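The per-time-step generation procedure described above can be illustrated with a short sketch. The following is a minimal, assumption-laden example rather than the implementation described by this document: `convolutional_subnetwork` and `output_layer` stand in for the trained components, NumPy is used for the arithmetic, and 256 possible sample values are assumed, matching the example above.

```python
import numpy as np

NUM_SAMPLE_VALUES = 256  # illustrative: one score per possible audio sample value


def generate(convolutional_subnetwork, output_layer, num_time_steps, rng=None):
    """Sketch of autoregressive generation, one audio sample per time step.

    `convolutional_subnetwork` is assumed to map the samples generated so far
    to an alternative representation for the current time step; `output_layer`
    is assumed to map that representation to scores over the possible values.
    """
    rng = rng or np.random.default_rng()
    samples = []  # the output sequence of audio data
    for _ in range(num_time_steps):
        # Current sequence: the sample at every time step preceding this one.
        current_sequence = np.array(samples, dtype=np.int64)

        # Alternative representation for the current time step.
        representation = convolutional_subnetwork(current_sequence)

        # Output defining a score distribution over the possible samples.
        scores = np.asarray(output_layer(representation))  # shape: (256,)
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()  # softmax turns the scores into a distribution

        # Select the sample for this time step by sampling from the distribution.
        samples.append(int(rng.choice(NUM_SAMPLE_VALUES, p=probs)))
    return samples
```

In this sketch each generated sample is appended to the sequence that conditions the next time step, which is the autoregressive loop the abstract and summary describe.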
Because the number of model outputs grows with the number of possible sample values, it can be useful to compress or compand the audio sample values, which may be amplitude values, to reduce the number of model outputs.
In some implementations the convolutional neural network layers are causal convolutional neural network layers, as described in more detail later. In particular, the audio-processing convolutional neural network layers may include one or more dilated causal convolutional neural network layers. Again as described in more detail later, a dilated convolutional neural network layer applies a convolution to non-adjacent values in a sequence, i.e., as defined by the outputs from a previous layer. This can increase the receptive field of the convolutional subnetwork by orders of magnitude whilst preserving the input (time) resolution and maintaining computational efficiency.
In some implementations the convolutional neural network layers include multiple stacked blocks of dilated convolutional neural network layers. Each block may comprise multiple dilated convolutional neural network layers with increasing dilation. For example, the dilation may be increased by a factor n for each successive layer, up to a limit within each block. This can further increase the receptive field size.
In some implementations one or more of the convolutional neural network layers may have gated activation units. For example, a rectified linear or other unit following a convolution implemented by a layer may be replaced by a gated activation unit. In a gated activation unit, the layer input may be processed by a main convolution and by a gate convolution, and the gated activation unit output may be generated by element-wise multiplying the main convolutional output and the gating convolutional output.
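To make the preceding paragraphs concrete, here is a small self-contained sketch of (i) companding amplitude values to 256 discrete outputs, using mu-law companding as one common choice, and (ii) a dilated causal convolution with a gated activation unit and a residual connection, with the dilation doubling per layer within a block. The NumPy implementation, the kernel size of 2, the tanh/sigmoid non-linearities, and the doubling dilation schedule are illustrative assumptions, not details recited by this document.

```python
import numpy as np


def mu_law_compand(x, mu=255):
    """Compand amplitudes in [-1, 1] to integers in [0, mu] (one common choice)."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.rint((y + 1) / 2 * mu).astype(np.int64)


def dilated_causal_conv(x, w, dilation):
    """Kernel-size-2 causal convolution applied to non-adjacent values.

    x: (time, in_channels) layer input; w: (2, in_channels, out_channels).
    The output at time t depends only on inputs at times t and t - dilation.
    """
    padded = np.concatenate([np.zeros((dilation, x.shape[1])), x], axis=0)  # left-pad: causal
    return padded[:-dilation] @ w[0] + x @ w[1]


def gated_residual_layer(x, w_main, w_gate, dilation):
    """Gated activation unit plus a residual connection (channel counts assumed equal)."""
    main = np.tanh(dilated_causal_conv(x, w_main, dilation))                # main convolution
    gate = 1.0 / (1.0 + np.exp(-dilated_causal_conv(x, w_gate, dilation)))  # gate convolution
    return x + main * gate  # element-wise gating, summed with the layer input


# Dilation doubling up to a limit within each block, with three stacked blocks,
# e.g. 1, 2, 4, ..., 512, 1, 2, 4, ..., 512, 1, 2, 4, ..., 512.
dilations = [2 ** i for i in range(10)] * 3
```

With a doubling schedule like this, the receptive field of the stack grows exponentially with the number of layers while computation grows only linearly, which is the efficiency property the summary refers to.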