US-12626131-B2 - Computer-implemented methods and systems for compressing recurrent neural network (RNN) models and accelerating RNN execution in mobile devices to achieve real-time inference
Abstract
A recurrent neural network (RNN) acceleration framework leverages both a block-based pruning approach and compiler optimizations to accelerate RNN inference on mobile devices.
Inventors
- Yanzhi Wang
- Peiyan Dong
- Zhengang Li
- Bin Ren
- Wei Niu
Assignees
- NORTHEASTERN UNIVERSITY
- COLLEGE OF WILLIAM & MARY
Dates
- Publication Date: 2026-05-12
- Application Date: 2021-01-25
Claims (14)
- 1. A computer-implemented method for compressing a recurrent neural network (RNN) model and accelerating RNN execution in a mobile device to achieve real-time inference, the method comprising the steps of: performing a block-based structured pruning of weights in a weight matrix of the RNN model through row-based column block pruning and column-based row pruning to generate a compressed RNN model; and applying a compiler-assisted RNN acceleration framework to the compressed RNN model to generate code to be executed on the mobile device to accelerate RNN inference, wherein said applying comprises reordering rows of the weight matrix of the compressed RNN model such that a plurality of groups are formed, wherein each group of the plurality of groups comprises a plurality of rows, wherein the plurality of rows of each group are adjacent and have the same pattern, and the code comprises instructions for loading, for each group, a portion of an input feature map for use by the plurality of rows of that group, thereby performing load redundancy elimination optimization.
- 2. The method of claim 1, wherein the compiler-assisted RNN acceleration framework performs compiler-assisted performance optimizations on the compressed RNN model, including a compact data format optimization.
- 3. The method of claim 2, wherein the compact data format optimization provides a compact data structure to store RNN weight matrices.
- 4. The method of claim 3, wherein the compact data structure has a Block-based Structured Pruning Compact format, wherein the compact data structure comprises a register load array, a stride array, a column index array, a filter offset array, and a reorder array.
- 5. The method of claim 1, wherein said performing the block-based structured pruning comprises using an Alternating Direction Method of Multipliers (ADMM) pruning technique.
- 6. The method of claim 1, wherein the RNN model comprises a Gated Recurrent Unit (GRU) model.
- 7. The method of claim 1, wherein the RNN model is used in an application for real-time speech recognition, natural language processing (NLP), human-machine interaction, or image recognition and characterization.
- 8. A computer system, comprising: at least one processor; memory associated with the at least one processor; and a program supported in the memory for compressing a recurrent neural network (RNN) model and accelerating RNN execution in a mobile device to achieve real-time inference, the program containing a plurality of instructions which, when executed by the at least one processor, cause the at least one processor to: perform a block-based structured pruning of weights in a weight matrix of the RNN model through row-based column block pruning and column-based row pruning to generate a compressed RNN model; and apply a compiler-assisted RNN acceleration framework to the compressed RNN model to generate code to be executed on the mobile device to accelerate RNN inference, wherein said applying comprises reordering rows of the weight matrix of the compressed RNN model such that a plurality of groups are formed, wherein each group of the plurality of groups comprises a plurality of rows, wherein the plurality of rows of each group are adjacent and have the same pattern, and the code comprises instructions for loading, for each group, a portion of an input feature map for use by the plurality of rows of that group, thereby performing load redundancy elimination optimization.
- 9. The system of claim 8, wherein the compiler-assisted RNN acceleration framework performs compiler-assisted performance optimizations on the compressed RNN model, including a compact data format optimization.
- 10. The system of claim 9, wherein the compact data format optimization provides a compact data structure to store RNN weight matrices.
- 11. The system of claim 10, wherein the compact data structure has a Block-based Structured Pruning Compact format, wherein the compact data structure comprises a register load array, a stride array, a column index array, a filter offset array, and a reorder array.
- 12. The system of claim 8, wherein said performing the block-based structured pruning comprises using an Alternating Direction Method of Multipliers (ADMM) pruning technique.
- 13. The system of claim 8, wherein the RNN model comprises a Gated Recurrent Unit (GRU) model.
- 14. The system of claim 8, wherein the RNN model is used in an application for real-time speech recognition, natural language processing (NLP), human-machine interaction, or image recognition and characterization.
Description
CROSS REFERENCE TO RELATED APPLICATIONS
This application is a national phase entry under 35 U.S.C. § 371 of International Application No. PCT/US21/14866, filed Jan. 25, 2021, entitled COMPUTER-IMPLEMENTED METHODS AND SYSTEMS FOR COMPRESSING RECURRENT NEURAL NETWORK (RNN) MODELS AND ACCELERATING RNN EXECUTION IN MOBILE DEVICES TO ACHIEVE REAL-TIME INFERENCE, which claims priority from U.S. Provisional Patent Application No. 62/965,275, filed Jan. 24, 2020, entitled RTMOBILE: A MOBILE ACCELERATION FRAMEWORK OF RNNS FOR BEYOND REAL-TIME SPEECH RECOGNITION. The contents of these applications are incorporated herein by reference in their entirety.
GOVERNMENT SUPPORT
This invention was made with government support under Grant No. 1739748 awarded by the National Science Foundation. The government has certain rights in the invention.
BACKGROUND
The present application relates to a recurrent neural network (RNN) acceleration framework that leverages both a block-based pruning approach and compiler optimizations to accelerate RNN inference on mobile devices.
BRIEF SUMMARY OF THE DISCLOSURE
In accordance with one or more embodiments, a computer-implemented method is disclosed for compressing a recurrent neural network (RNN) model and accelerating RNN execution in a mobile device to achieve real-time inference. The method includes the steps of: (a) performing a block-based structured pruning of weights in a weight matrix of the RNN model through row-based column block pruning and column-based row pruning to generate a compressed RNN model; and (b) applying a compiler-assisted RNN acceleration framework to the compressed RNN model to generate code to be executed on the mobile device to accelerate RNN inference.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 illustrates an exemplary single GRU model.
FIG. 2A shows how a weight tensor representation of a CONV layer is transformed into the weight matrix representation.
FIG. 2B shows how different structured weight pruning schemes are implemented on the weight matrix representation.
FIGS. 3A, 3B, and 3C illustrate a systematic overview of an exemplary RTMobile acceleration framework in accordance with one or more embodiments.
FIG. 4 is a graph illustrating speedup using RTMobile with different pruning rates on mobile devices.
FIG. 5 shows an exemplary algorithm for block-based structured pruning in accordance with one or more embodiments.
FIGS. 6 and 7 show Tables I and II, respectively.
FIG. 8 is a block diagram illustrating an exemplary computer system in which the methods described herein in accordance with one or more embodiments can be implemented.
DETAILED DESCRIPTION
Deep neural networks (DNNs) have become the state-of-the-art technique due to their high prediction accuracy in many artificial intelligence tasks, such as image recognition and characterization [1], speech recognition [2], and recommendation systems [3]. Among various DNN architectures, recurrent neural networks (RNNs) are widely used for speech recognition tasks because they contain cycles that carry information across neurons as inputs are read. For instance, Gated Recurrent Units (GRUs) [4], a recent and popular type of RNN, have achieved great success in automatic speech recognition.
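To make the single GRU model of FIG. 1 concrete, the standard GRU cell of [4] can be written in a few lines of NumPy. This is a minimal sketch for orientation only: the weight names (W_z, U_z, and so on) and the gate convention are the common textbook ones, not notation taken from the patent. The W and U matrices here are the RNN weight matrices that the pruning described below operates on.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, params):
    """One step of a standard GRU cell (cf. FIG. 1).

    params holds the input weights W_*, recurrent weights U_*, and
    biases b_* for the update gate (z), reset gate (r), and the
    candidate hidden state (h).
    """
    W_z, U_z, b_z = params["z"]
    W_r, U_r, b_r = params["r"]
    W_h, U_h, b_h = params["h"]

    z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)             # update gate
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)             # reset gate
    h_cand = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)  # candidate state
    return z_t * h_prev + (1.0 - z_t) * h_cand                # new hidden state
```

Note that some libraries swap the roles of z_t and (1 - z_t) in the final interpolation; either convention yields an equivalent model.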
In recent years, executing DNNs on mobile platforms has become increasingly popular as high-end mobile devices have emerged, and several recent studies have proposed techniques to accelerate large-scale DNNs in the mobile environment. However, because executing RNNs entails high computational complexity and memory consumption, it is very challenging to deploy RNNs on the embedded processors in current mobile devices and achieve real-time inference. (Real-time inference usually means 30 frames per second.)
DNN model compression provides an effective way to mitigate the computation and memory challenges brought by DNNs, and many model compression techniques have been studied in recent years. For example, weight pruning can provide a notable reduction in model size. Early work [5] on non-structured weight pruning eliminates weights at arbitrary locations, which requires the pruned model to be stored in a sparse matrix format, such as the compressed sparse column (CSC) format. Non-structured weight pruning, however, hurts processing throughput because the indices in the compressed weight representation cause stalls on GPUs and FPGAs. Structured weight pruning [6], on the other hand, is more hardware friendly: by exploiting filter pruning [7] and channel pruning [8], it produces a pruned model with a more regular shape, which eliminates the need to store weight indices. However, structured pruning reduces accuracy more than non-structured pruning. Moreover, state-of-the-art model-compression-based RNN acceleration techniques such as ESE [9] and C-LSTM still suffer from limited inference accuracy and processing throughput, which keeps them from being implemented on mobile devices. Furthermore, existing DNN acceleration f
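The block-based structured pruning recited in claims 1 and 8 sits between these two extremes: it prunes in hardware-friendly blocks while retaining more flexibility than whole-filter or whole-channel pruning. The excerpt does not spell out the exact procedure, so the following NumPy sketch encodes one plausible reading, in which column pruning is applied independently within each column block and full-row pruning is applied afterward; the block size and keep ratios are illustrative assumptions, not the patent's values.

```python
import numpy as np

def block_structured_prune(W, block_cols=16, col_keep=0.5, row_keep=0.5):
    """Hedged sketch of block-based structured pruning (claims 1/8).

    Assumed reading of the claim language:
      1. Split W column-wise into blocks of `block_cols` columns.
      2. Within each block, keep only the columns with the largest
         L2 norms ("row-based column block pruning").
      3. Zero entire rows of W whose remaining L2 norm is smallest
         ("column-based row pruning").
    """
    W = W.copy()
    rows, cols = W.shape
    for start in range(0, cols, block_cols):
        block = W[:, start:start + block_cols]   # view into W
        norms = np.linalg.norm(block, axis=0)    # per-column L2 norm
        k = max(1, int(col_keep * block.shape[1]))
        cut = np.sort(norms)[-k]                 # threshold at k-th largest
        block[:, norms < cut] = 0.0
    row_norms = np.linalg.norm(W, axis=1)
    k = max(1, int(row_keep * rows))
    cut = np.sort(row_norms)[-k]
    W[row_norms < cut, :] = 0.0
    return W
```

Because the surviving weights form whole columns within a block (and whole rows globally), their positions can be described per block rather than per nonzero, which is what makes a compact storage format possible.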
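Claims 5 and 12 name the Alternating Direction Method of Multipliers (ADMM) as the pruning technique. FIG. 5's algorithm is not reproduced in this excerpt, so the sketch below follows the generic ADMM weight-pruning recipe instead, reusing block_structured_prune from the previous sketch as the projection step; loss_grad is an assumed callback supplied by the training framework, and rho, lr, and steps are illustrative.

```python
import numpy as np

def admm_prune(W, loss_grad, rho=1e-3, lr=1e-2, steps=100, **prune_kw):
    """Generic ADMM weight-pruning loop (hedged sketch, not FIG. 5 verbatim).

    Alternates the three standard ADMM updates:
      (1) W-update: gradient steps on the task loss plus the proximal
          term (rho/2) * ||W - Z + U||^2,
      (2) Z-update: Euclidean projection of W + U onto the set of
          block-structured sparse matrices, and
      (3) dual update: U += W - Z.
    """
    Z = block_structured_prune(W, **prune_kw)             # initial projection
    U = np.zeros_like(W)
    for _ in range(steps):
        W = W - lr * (loss_grad(W) + rho * (W - Z + U))   # (1) proximal step
        Z = block_structured_prune(W + U, **prune_kw)     # (2) projection
        U = U + W - Z                                     # (3) dual ascent
    return block_structured_prune(W, **prune_kw)          # final hard prune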
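The storage side of the story, addressed by the compact data format of claims 3, 4, 10, and 11, is easy to quantify. The sketch below uses SciPy's CSC format, the same format the background cites for non-structured pruning, to show the per-nonzero index overhead; the matrix size and pruning threshold are arbitrary.

```python
import numpy as np
from scipy.sparse import csc_matrix

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)
W[np.abs(W) < 1.2] = 0.0                    # crude non-structured pruning

W_csc = csc_matrix(W)
dense_bytes = W.nbytes
csc_bytes = (W_csc.data.nbytes              # nonzero values
             + W_csc.indices.nbytes         # one row index per nonzero
             + W_csc.indptr.nbytes)         # column start offsets
print(f"dense: {dense_bytes} B, CSC: {csc_bytes} B, "
      f"nnz: {W_csc.nnz} of {W.size}")
```

Every surviving weight drags a row index along with it, and chasing those indices at run time is the source of the stalls noted above. The Block-based Structured Pruning Compact format of claims 4 and 11 (register load array, stride array, column index array, filter offset array, reorder array) instead amortizes the metadata over whole blocks and over groups of identically patterned rows; the excerpt does not define the five arrays precisely, so no BSPC layout is sketched here.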
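Finally, the load redundancy elimination of claims 1 and 8 can be modeled in a few lines. The generated mobile code itself is compiled, so the NumPy sketch below only models the idea: reorder the rows of the pruned weight matrix so rows with the same nonzero pattern sit in adjacent groups, then load each needed slice of the input feature map once per group rather than once per row. Function names are illustrative, not the framework's API.

```python
import numpy as np
from collections import defaultdict

def reorder_and_group(W):
    """Group rows of a pruned weight matrix by nonzero pattern.

    Returns (order, groups): `order` is the row permutation, and each
    group pairs a column pattern with the reordered rows sharing it.
    """
    buckets = defaultdict(list)
    for r in range(W.shape[0]):
        pattern = tuple(np.nonzero(W[r])[0])   # columns this row touches
        buckets[pattern].append(r)
    order, groups = [], []
    for pattern, rows in buckets.items():
        new_rows = list(range(len(order), len(order) + len(rows)))
        groups.append((pattern, new_rows))
        order.extend(rows)
    return np.array(order), groups

def grouped_matvec(W, x):
    """Compute W @ x, loading each input slice once per row group."""
    order, groups = reorder_and_group(W)
    Wp = W[order]                              # reordered weight matrix
    y = np.zeros(W.shape[0], dtype=x.dtype)
    for pattern, rows in groups:
        cols = np.array(pattern, dtype=int)
        if cols.size == 0:
            continue                           # fully pruned rows stay zero
        x_slice = x[cols]                      # one load, shared by the group
        y[rows] = Wp[np.ix_(rows, cols)] @ x_slice
    return y, order                            # y is in reordered row order
```

For any pruned W and input x, np.allclose(y, (W @ x)[order]) should hold, confirming that the grouped computation matches the plain matrix-vector product up to the row permutation.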