US-12620388-B2 - Robustness aware norm decay for quantization aware training and generalization
Abstract
A method includes obtaining a plurality of training samples, determining a minimum integer fixed-bit width representing a maximum quantization of an automatic speech recognition (ASR) model, and training the ASR model on the plurality of training samples using a quantity of random noise. The ASR model includes a plurality of weights that each include a respective float value. The quantity of random noise is based on the minimum integer fixed-bit width. After training the ASR model, the method also includes selecting a target integer fixed-bit width greater than or equal to the minimum integer fixed-bit width, and for each respective weight of the plurality of weights, quantizing the respective weight from the respective float value to a respective integer associated with a value of the selected target integer fixed-bit width. The method also includes providing the quantized trained ASR model to a user device.
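The pipeline summarized in the abstract (train with noise keyed to a minimum bit width, then quantize each float weight at a chosen target width) can be sketched as follows. This is an illustrative NumPy sketch under an assumed symmetric quantization scheme and made-up function names; the patent does not fix the exact quantizer.

```python
import numpy as np

def quantize_weights(weights, bit_width):
    """Symmetric quantization of float weights to a signed integer grid
    of the given fixed-bit width (assumed scheme, for illustration)."""
    levels = 2 ** (bit_width - 1) - 1          # e.g. 7 levels per side for 4-bit
    scale = float(np.max(np.abs(weights))) / levels
    q = np.clip(np.round(weights / scale), -levels, levels).astype(np.int32)
    return q, scale

# Train against noise for a minimum width (say 4 bits), then quantize at
# any target width >= that minimum, e.g. 8 bits:
rng = np.random.default_rng(0)
w = rng.standard_normal(16).astype(np.float32)
q, scale = quantize_weights(w, bit_width=8)
w_hat = q * scale                              # dequantized approximation
```

Because the quantizer is symmetric, the dequantized weights `w_hat` differ from the originals by at most half a quantization step.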
Inventors
- David Qiu
- David Rim
- Shaojin Ding
- Yanzhang He
Assignees
- GOOGLE LLC
Dates
- Publication Date: 2026-05-05
- Application Date: 2024-04-10
Claims (20)
- 1 . A computer-implemented method that, when executed on data processing hardware, causes the data processing hardware to perform operations comprising: obtaining a plurality of training samples, each respective training sample of the plurality of training samples comprising: a respective speech utterance; and a respective textual utterance representing a transcription of the respective speech utterance; determining a minimum integer fixed-bit width representing a maximum quantization of an automatic speech recognition (ASR) model, the ASR model comprising a plurality of weights, each respective weight of the plurality of weights comprising a respective float value; training the ASR model on the plurality of training samples using a quantity of random noise, the quantity of random noise based on the minimum integer fixed-bit width; after training the ASR model, selecting a target integer fixed-bit width greater than or equal to the minimum integer fixed-bit width; for each respective weight of the plurality of weights, quantizing the respective weight from the respective float value to a respective integer associated with a value of the selected target integer fixed-bit width; and providing the quantized trained ASR model to a user device.
- 2 . The method of claim 1 , wherein the maximum quantization level comprises 4-bit quantization.
- 3 . The method of claim 1 , wherein the maximum quantization level comprises 2-bit quantization.
- 4 . The method of claim 1 , wherein the random noise is drawn from a uniform distribution of noise.
- 5 . The method of claim 1 , wherein training the ASR model using the quantity of random noise comprises, for each respective channel of each respective tensor of the ASR model: determining a respective maximum value for the respective channel of the respective tensor; and adding, to the respective channel of the respective tensor, a uniform distribution of noise based on the respective maximum value and the minimum integer fixed-bit width.
- 6 . The method of claim 5 , wherein the uniform distribution of noise represents the entire range of noise the ASR model receives due to quantization up to the minimum integer fixed-bit width.
- 7 . The method of claim 5 , wherein adding the uniform distribution of noise comprises scaling the uniform distribution of noise based on the respective maximum value.
- 8 . The method of claim 7 , wherein scaling the uniform distribution of noise is further based on a sensitivity of the respective channel to scaling.
- 9 . The method of claim 1 , wherein training the ASR model using the quantity of random noise comprises adding, during forward propagation of the ASR model, the quantity of random noise.
- 10 . The method of claim 1 , wherein: the ASR model further comprises a plurality of activations, each activation of the plurality of activations comprising a respective float value; and for each respective activation of the plurality of activations, the operations further comprise quantizing the respective activation from the respective float value to the respective integer associated with the value of the selected target fixed-bit width.
- 11 . A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising: obtaining a plurality of training samples, each respective training sample of the plurality of training samples comprising: a respective speech utterance; and a respective textual utterance representing a transcription of the respective speech utterance; determining a minimum integer fixed-bit width representing a maximum quantization of an automatic speech recognition (ASR) model, the ASR model comprising a plurality of weights, each respective weight of the plurality of weights comprising a respective float value; training the ASR model on the plurality of training samples using a quantity of random noise, the quantity of random noise based on the minimum integer fixed-bit width; after training the ASR model, selecting a target integer fixed-bit width greater than or equal to the minimum integer fixed-bit width; for each respective weight of the plurality of weights, quantizing the respective weight from the respective float value to a respective integer associated with a value of the selected target integer fixed-bit width; and providing the quantized trained ASR model to a user device.
- 12 . The system of claim 11 , wherein the maximum quantization level comprises 4-bit quantization.
- 13 . The system of claim 11 , wherein the maximum quantization level comprises 2-bit quantization.
- 14 . The system of claim 11 , wherein the random noise is drawn from a uniform distribution of noise.
- 15 . The system of claim 11 , wherein training the ASR model using the quantity of random noise comprises, for each respective channel of each respective tensor of the ASR model: determining a respective maximum value for the respective channel of the respective tensor; and adding, to the respective channel of the respective tensor, a uniform distribution of noise based on the respective maximum value and the minimum integer fixed-bit width.
- 16 . The system of claim 15 , wherein the uniform distribution of noise represents the entire range of noise the ASR model receives due to quantization up to the minimum integer fixed-bit width.
- 17 . The system of claim 15 , wherein adding the uniform distribution of noise comprises scaling the uniform distribution of noise based on the respective maximum value.
- 18 . The system of claim 17 , wherein scaling the uniform distribution of noise is further based on a sensitivity of the respective channel to scaling.
- 19 . The system of claim 11 , wherein training the ASR model using the quantity of random noise comprises adding, during forward propagation of the ASR model, the quantity of random noise.
- 20 . The system of claim 11 , wherein: the ASR model further comprises a plurality of activations, each activation of the plurality of activations comprising a respective float value; and for each respective activation of the plurality of activations, the operations further comprise quantizing the respective activation from the respective float value to the respective integer associated with the value of the selected target fixed-bit width.
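Claims 5 through 9 describe per-channel uniform noise whose range tracks each channel's maximum value and the minimum fixed-bit width, added during forward propagation. A minimal NumPy sketch of that noise injection, assuming axis 1 indexes channels and a symmetric quantization grid (both assumptions not fixed by the claims):

```python
import numpy as np

def add_quantization_noise(tensor, min_bit_width, rng):
    """Add uniform noise to each channel of a tensor, spanning the full
    rounding-error range implied by the minimum fixed-bit width."""
    levels = 2 ** (min_bit_width - 1) - 1
    # Respective maximum value for each channel (claim 5).
    ch_max = np.max(np.abs(tensor), axis=0, keepdims=True)
    step = ch_max / levels                      # per-channel quantization step
    # Uniform noise over [-step/2, step/2]: the entire rounding-error range
    # at min_bit_width (claim 6), scaled by the channel maximum (claim 7).
    noise = rng.uniform(-0.5, 0.5, size=tensor.shape) * step
    return tensor + noise

# Applied during forward propagation of the model (claim 9):
rng = np.random.default_rng(0)
x = np.ones((4, 3))
noisy = add_quantization_noise(x, min_bit_width=4, rng=rng)
```

In a training loop this perturbation would be applied only on the forward pass; how gradients treat the noise (e.g. a straight-through approach) is an implementation choice the claims leave open.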
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This U.S. patent application claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Application 63/495,310, filed on Apr. 11, 2023. The disclosure of this prior application is considered part of the disclosure of this application and is hereby incorporated by reference in its entirety.
TECHNICAL FIELD
This disclosure relates to robustness aware norm decay for quantization aware training and generalization.
BACKGROUND
Modern automated speech recognition (ASR) systems focus on providing not only high quality (e.g., a low word error rate (WER)), but also low latency (e.g., a short delay between the user speaking and a transcription appearing). Moreover, ASR systems today are expected to decode utterances in a streaming fashion at real-time or even faster than real-time. To illustrate, when an ASR system is deployed on a mobile phone that experiences direct user interactivity, an application on the mobile phone using the ASR system may require the speech recognition to be streaming such that words appear on the screen as soon as they are spoken. The user of such a phone is also likely to have a low tolerance for latency, so the speech recognition strives to run on the mobile device in a manner that minimizes any impact from latency and inaccuracy that may detrimentally affect the user's experience. However, mobile phones often have limited resources, which constrain the size of the ASR model.
SUMMARY
One aspect of the disclosure provides a computer-implemented method that when executed on data processing hardware causes the data processing hardware to perform operations that include obtaining a plurality of training samples, determining a minimum integer fixed-bit width representing a maximum quantization of an automatic speech recognition (ASR) model, and training the ASR model on the plurality of training samples using a quantity of random noise. Each respective training sample of the plurality of training samples includes a respective speech utterance and a respective textual utterance representing a transcription of the respective speech utterance. The ASR model includes a plurality of weights, wherein each respective weight of the plurality of weights includes a respective float value. The quantity of random noise used for training the ASR model is based on the minimum integer fixed-bit width. After training the ASR model, the operations also include selecting a target integer fixed-bit width greater than or equal to the minimum integer fixed-bit width, and for each respective weight of the plurality of weights, quantizing the respective weight from the respective float value to a respective integer associated with a value of the selected target integer fixed-bit width. The operations also include providing the quantized trained ASR model to a user device. Implementations of the disclosure may include one or more of the following optional features. In some implementations, training the ASR model using the quantity of random noise includes, for each respective channel of each respective tensor of the ASR model, determining a respective maximum value for the respective channel of the respective tensor, and adding, to the respective channel of the respective tensor, a uniform distribution of noise based on the respective maximum value and the minimum integer fixed-bit width.
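The per-channel noise just described also suggests why any target width at or above the minimum is safe to select after training: the worst-case rounding error shrinks as the bit width grows, so a model trained against noise spanning the minimum-width error has already absorbed a superset of the perturbations any higher-width quantization can introduce. A quick numeric check, again assuming a symmetric quantizer:

```python
def max_round_error(bit_width, ch_max=1.0):
    """Half of one quantization step: the worst-case rounding error for a
    symmetric quantizer at the given bit width (assumed scheme)."""
    levels = 2 ** (bit_width - 1) - 1
    return (ch_max / levels) / 2

# The 8-bit rounding error lies well inside the 4-bit noise range the
# model was trained against.
assert max_round_error(8) < max_round_error(4)
```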
In these implementations, the uniform distribution of noise may represent the entire range of noise the ASR model receives due to quantization up to the minimum integer fixed-bit width. Additionally, adding the uniform distribution of noise may include scaling the uniform distribution of noise based on the respective maximum value, while scaling the uniform distribution of noise may be further based on a sensitivity of the respective channel to scaling. In some examples, the maximum quantization level includes a 4-bit quantization. In other examples, the maximum quantization level includes a 2-bit quantization. The random noise may be drawn from a uniform distribution of noise. In some additional implementations, training the ASR model using the quantity of random noise includes adding, during forward propagation of the ASR model, the quantity of random noise. The ASR model may further include a plurality of activations each associated with a respective float value such that for each respective activation of the plurality of activations, the operations further include quantizing the respective activation from the respective float value to the respective integer associated with the selected target fixed-bit width. Another aspect of the disclosure provides a system that includes data processing hardware and memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations that include obtaining a plurality of training samples, determining a minimum integer fixed-bit wid