EP-4497128-B1 - 4-BIT CONFORMER WITH ACCURATE QUANTIZATION TRAINING FOR SPEECH RECOGNITION
Inventors
- DING, Shaojin
- RYBAKOV, Oleg
- MEADOWLARK, Phoenix
- AGRAWAL, Shivani
- HE, Yanzhang
- LEW, Lukasz
Dates
- Publication Date: 2026-05-06
- Application Date: 2023-03-20
Claims (8)
- A computer-implemented method (500) when executed on data processing hardware (62) causes the data processing hardware (62) to perform operations comprising: obtaining a plurality of training samples (152), each respective training sample (152) of the plurality of training samples (152) comprising: a respective speech utterance (154); and a respective textual utterance (156) representing a transcription of the respective speech utterance (154); training, using quantization aware training with native integer operations, an automatic speech recognition (ASR) model (200) on the plurality of training samples (152); quantizing the trained ASR model (200) to an integer target fixed-bit width (162), the quantized trained ASR model (200) comprising a plurality of weights (202), each weight (202) of the plurality of weights (202) comprising an integer with the target fixed-bit width (162); and providing the quantized trained ASR model (200) to a user device (10), wherein the target fixed-bit width (162) is four.
- The method (500) of claim 1, wherein the ASR model (200) further comprises a plurality of activations (204), each activation (204) of the plurality of activations (204) comprising: an integer with the target fixed-bit width (162); an integer with a fixed bit width greater than the target fixed-bit width (162); or a float value.
- The method (500) of any of claims 1-2, wherein quantizing the trained ASR model (200) comprises determining a scale factor (160) based on an estimated max value of an axis to be quantized and the target fixed-bit width (162).
- The method (500) of any of claims 1-3, wherein the ASR model (200) comprises one or more multi-head self attention layers (302).
- The method (500) of claim 4, wherein the one or more multi-head self attention layers (302) comprise one or more conformer layers.
- The method (500) of any of claims 1-5, wherein: the ASR model (200) comprises a plurality of encoders and a plurality of decoders; and quantizing the ASR model (200) comprises quantizing the plurality of encoders and not quantizing the plurality of decoders.
- The method (500) of any of claims 1-6, wherein: the ASR model (200) comprises an audio encoder (210); and the audio encoder (210) comprises a cascaded encoder comprising a first causal encoder and a second non-causal encoder.
- A system (100) comprising: data processing hardware (62) configured to perform operations according to the method of any preceding claim.
Description
TECHNICAL FIELD

This disclosure relates to accurate quantization training for speech recognition.

BACKGROUND

Modern automated speech recognition (ASR) systems focus on providing not only high quality (e.g., a low word error rate (WER)) but also low latency (e.g., a short delay between the user speaking and a transcription appearing). Moreover, when using an ASR system today, there is a demand that the ASR system decode utterances in a streaming fashion that corresponds to real-time or even faster than real-time. To illustrate, when an ASR system is deployed on a mobile phone that experiences direct user interactivity, an application on the mobile phone using the ASR system may require the speech recognition to be streaming such that words appear on the screen as soon as they are spoken. Here, it is also likely that the user of the mobile phone has a low tolerance for latency. Due to this low tolerance, the speech recognition strives to run on the mobile device in a manner that minimizes the impact of latency and inaccuracy that may detrimentally affect the user's experience. However, mobile phones often have limited resources, which limit the size of the ASR model. A common approach to generating smaller ASR models for mobile devices is to fine-tune large ASR models and/or to apply quantization to the model weights. An example thereof is provided in the conference paper "4-bit Quantization of LSTM-based Speech Recognition Models" by A. Fasoli et al., 27.08.2021.

SUMMARY

One aspect of the disclosure provides a method for training an automatic speech recognition (ASR) model. The computer-implemented method, when executed on data processing hardware, causes the data processing hardware to perform operations. The operations include obtaining a plurality of training samples. Each respective training sample of the plurality of training samples includes a respective speech utterance and a respective textual utterance representing a transcription of the respective speech utterance. The operations also include training, using quantization aware training with native integer operations, an ASR model on the plurality of training samples. The operations also include quantizing the trained ASR model to an integer target fixed-bit width. The quantized trained ASR model includes a plurality of weights, and each weight of the plurality of weights includes an integer with the target fixed-bit width. The operations further include providing the quantized trained ASR model to a user device.

Implementations of the disclosure may include one or more of the following optional features. In some implementations, the target fixed-bit width is four. In some examples, the ASR model further includes a plurality of activations, and each activation of the plurality of activations includes an integer with the target fixed-bit width. In other examples, each activation of the plurality of activations includes an integer with a fixed bit width greater than the target fixed-bit width. In yet other examples, each activation of the plurality of activations includes a float value. Optionally, quantizing the trained ASR model includes determining a scale factor based on an estimated max value of an axis to be quantized and the target fixed-bit width.
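By way of illustration only, the scale-factor computation described above can be sketched as symmetric per-axis quantization: the scale for each slice along the quantized axis is derived from the estimated maximum absolute value of that slice and the target fixed-bit width. The NumPy sketch below, including the helper name quantize_per_axis and the per-output-channel axis choice, is an assumption for illustration and is not asserted to be the patented implementation.

```python
import numpy as np

def quantize_per_axis(weights: np.ndarray, axis: int = 0, bit_width: int = 4):
    """Symmetric per-axis quantization sketch (illustrative, not the patented method)."""
    # Largest representable magnitude for a signed integer of `bit_width` bits,
    # e.g. 7 for 4-bit values in [-8, 7].
    qmax = 2 ** (bit_width - 1) - 1

    # Estimated max absolute value along the axis to be quantized.
    reduce_axes = tuple(i for i in range(weights.ndim) if i != axis)
    max_per_axis = np.max(np.abs(weights), axis=reduce_axes, keepdims=True)
    max_per_axis = np.maximum(max_per_axis, 1e-8)  # guard against all-zero slices

    # Scale factor derived from the estimated max value and the target bit width.
    scale = max_per_axis / qmax

    # Round onto the integer grid and clip into the 4-bit range.
    q_weights = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
    return q_weights, scale

# Example: quantize a (out_features, in_features) weight matrix per output channel.
w = np.random.randn(256, 144).astype(np.float32)
w_q, w_scale = quantize_per_axis(w, axis=0, bit_width=4)
w_approx = w_q.astype(np.float32) * w_scale  # dequantized approximation of w
```

Multiplying the stored integer weights by the per-axis scale recovers an approximation of the original float weights, which is how a 4-bit weight matrix would typically be consumed at inference time.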
In some implementations, the ASR model includes one or more multi-head attention layers. In some of these implementations, the one or more multi-head attention layers include one or more conformer layers or one or more transformer layers. The ASR model may include a plurality of encoders and a plurality of decoders, and quantizing the ASR model may include quantizing the plurality of encoders and not quantizing the plurality of decoders. In some examples, the ASR model includes an audio encoder, and the audio encoder includes a cascaded encoder that includes a first causal encoder and a second non-causal encoder.

Another aspect of the disclosure provides a system for training an ASR model. The system includes data processing hardware and memory hardware in communication with the data processing hardware. The memory hardware stores instructions that, when executed on the data processing hardware, cause the data processing hardware to perform operations. The operations include obtaining a plurality of training samples. Each respective training sample of the plurality of training samples includes a respective speech utterance and a respective textual utterance representing a transcription of the respective speech utterance. The operations also include training, using quantization aware training with native integer operations, an ASR model on the plurality of training samples. The operations also include quantizing the trained ASR model to an integer target fixed-bit width. The quantized trained ASR model includes a plurality of weights, and each weight of the plurality of weights includes an integer with the target fixed-bit width.
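To illustrate what "quantization aware training with native integer operations" can look like in a single forward step, the sketch below quantizes both the activations and the weights, performs the matrix multiply in integer arithmetic, and rescales the integer accumulator back to the float domain. It is a simplified, assumed formulation (per-tensor scales, int32 accumulation, no gradient handling) and is not asserted to be the patented training procedure.

```python
import numpy as np

def int_matmul_forward(x: np.ndarray, w: np.ndarray, bit_width: int = 4) -> np.ndarray:
    """Forward matmul using native integer operations (illustrative sketch)."""
    qmax = 2 ** (bit_width - 1) - 1

    # Per-tensor scale factors from the estimated max absolute values.
    sx = np.maximum(np.max(np.abs(x)), 1e-8) / qmax
    sw = np.maximum(np.max(np.abs(w)), 1e-8) / qmax

    # Quantize activations and weights onto the signed integer grid.
    xq = np.clip(np.round(x / sx), -qmax - 1, qmax).astype(np.int32)
    wq = np.clip(np.round(w / sw), -qmax - 1, qmax).astype(np.int32)

    # Native integer matrix multiply; accumulation happens in int32.
    acc = xq @ wq

    # Rescale the integer accumulator back to the float domain.
    return acc.astype(np.float32) * (sx * sw)

# Example: a (batch, features) activation against a (features, units) weight.
x = np.random.randn(8, 144).astype(np.float32)
w = np.random.randn(144, 256).astype(np.float32)
y = int_matmul_forward(x, w, bit_width=4)
```

During training, the rounding step would typically be treated as identity for gradient purposes (a straight-through estimator), so the float weights learn to be robust to the 4-bit grid that is applied when the model is later quantized and provided to the user device.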