US-12626134-B2 - Method and apparatus for lightweighting of artificial intelligence model

US 12626134 B2

Abstract

The disclosure relates to a method and an apparatus for lightweighting of artificial intelligence models, and the method of lightweighting of artificial intelligence models includes identifying an outlier in an input vector of a layer, identifying at least one column corresponding to the outlier in a weight matrix, and quantizing weight values of columns which do not correspond to the outlier.

Inventors

  • Eunhyeok PARK
  • Taesu Kim
  • Changhun Lee
  • Hyungjun Kim
  • Jungyu Jin

Assignees

  • SqueezeBits Inc.

Dates

Publication Date
2026-05-12
Application Date
2024-11-25
Priority Date
2023-12-28

Claims (9)

  1. A method of lightweighting of artificial intelligence models, the method comprising: identifying, during execution of a GPU kernel, an outlier in an input vector of a layer; identifying, during execution of a GPU kernel, at least one column corresponding to the outlier in a weight matrix; quantizing, during execution of a GPU kernel, weight values of columns which do not correspond to the outlier so as to reduce device-side memory traffic and preserve kernel throughput; and uploading data of the quantized weight values of columns to a memory, wherein the outlier is one or more values identified from a result of a matrix multiplication of the input vector and the weight matrix, wherein the identifying the outlier comprises generating a Hessian value based on a product of the input vector and a transpose of the input vector, calculating sensitivity for each channel using a diagonal component of the Hessian value and a difference between weight values before and after quantization, and selecting the outlier based on the sensitivity, and wherein the method is executed by a processor configured to perform N-parallel operations, wherein among the N-parallel operations, those related to the outlier are preferentially allocated to threads having lower indexes, where N is a natural number indicating the number of threads.
  2. The method of claim 1, wherein the identifying further comprises identifying at least one element in the input vector having a value greater than or equal to a threshold value relative to an average of values included in the input vector.
  3. The method of claim 1, wherein the identifying the outlier in the input vector of the layer comprises: generating the Hessian value by using the input vector; determining the sensitivity for each channel, based on the Hessian value; and identifying a weak column corresponding to the outlier based on the sensitivity, and excluding the weak column from quantization to preserve model accuracy.
  4. The method of claim 3, wherein the weak column is maintained at full bit precision to preserve model accuracy.
  5. A method of processing input data by using lightweight artificial intelligence models, the method comprising: acquiring, during execution of a GPU kernel, an input data vector; dividing, during execution of a GPU kernel, the input data vector into a first partial vector corresponding to quantized columns of a weight matrix and a second partial vector corresponding to at least one weak column of the weight matrix; performing calculations, during execution of a GPU kernel, for the first partial vector by using the quantized columns; performing calculations, during execution of a GPU kernel, for the second partial vector by using the at least one weak column; adding, during execution of a GPU kernel, results of the calculations for the first partial vector and the second partial vector; and uploading data of the results of the calculations to a memory so as to reduce device-side memory traffic and preserve kernel throughput, wherein the weight matrix comprises weights of at least one column corresponding to an outlier in the input data vector and weights of at least one remaining column, wherein the outlier is one or more values identified from a result of a matrix multiplication of the input data vector and the weight matrix, and is determined based on a descending order of magnitudes of the values, wherein bit precision used for the weights of the at least one column and bit precision used for the weights of the at least one remaining column are different from each other, wherein the input data vector is processed by a processor configured to perform N-parallel operations, wherein the N-parallel operations are indexed from 0 to N-1, and among the N-parallel operations, those related to the outlier are preferentially allocated to threads having lower indexes, where N is a natural number indicating the number of threads.
  6. The method of claim 5, wherein the performing the calculations for the first partial vector comprises: dequantizing weights stored in the quantized columns; and performing a matrix multiplication between the dequantized weights and the first partial vector.
  7. The method of claim 5, wherein the weight matrix is stored as: a first partial matrix including the quantized columns, a second partial matrix including the weak columns, a first index table identifying columns included in the first partial matrix, and a second index table identifying columns included in the second partial matrix, wherein the first index table and the second index table enable efficient memory access.
  8. An apparatus for lightweighting of artificial intelligence models, the apparatus comprising: a storage unit; and a processor, wherein the processor is configured to: identify, during execution of a GPU kernel, an outlier in an input vector of a layer; identify, during execution of a GPU kernel, at least one column corresponding to the outlier in a weight matrix; quantize, during execution of a GPU kernel, weight values of columns which do not correspond to the outlier, so as to minimize overhead to the actual calculation speed of the artificial intelligence models operated in a GPU kernel; and upload data of the quantized weight values of columns to the processor, wherein the processor is configured to generate a Hessian value based on a product of the input vector and a transpose of the input vector, calculate sensitivity for each channel using a diagonal component of the Hessian value and a difference between weight values before and after quantization, and select the outlier based on the sensitivity, and wherein the processor performs N-parallel operations, wherein among the N-parallel operations, those related to the outlier are preferentially allocated to threads having lower indexes, where N is a natural number indicating the number of threads.
  9. An apparatus for processing input data, the apparatus comprising: a storage unit; and a processor, wherein the processor is configured to: acquire, during execution of a GPU kernel, an input data vector; divide, during execution of a GPU kernel, the input data vector into a first partial vector corresponding to quantized columns of a weight matrix and a second partial vector corresponding to at least one weak column of the weight matrix; perform calculations, during execution of a GPU kernel, for the first partial vector by using the quantized columns; perform calculations, during execution of a GPU kernel, for the second partial vector by using the at least one weak column; add, during execution of a GPU kernel, results of the calculations for the first partial vector and the second partial vector; and upload data of the results of the calculations to a memory, so as to minimize overhead to the actual calculation speed of the artificial intelligence models operated in the GPU kernel, wherein an outlier is one or more values identified from a result of a matrix multiplication of the input data vector and the weight matrix, and is determined based on a descending order of magnitudes of the values, wherein the processor is configured to generate a Hessian value using a product of the input data vector and a transpose of the input data vector, calculate sensitivity for each channel using a diagonal component of the Hessian value and a difference between pre- and post-quantization weight values, and determine at least one weak column based on the sensitivity, wherein the processor is further configured to perform N-parallel operations, wherein among the N-parallel operations, those related to the outlier are preferentially allocated to threads having lower indexes, where N is a natural number indicating the number of threads, and wherein bit precision used for the weights of the at least one column and bit precision used for the weights of at least one remaining column are different from each other.
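To make the sensitivity rule of claims 1, 3, and 4 concrete, the following is a minimal NumPy sketch of Hessian-diagonal weak-column selection. It assumes round-to-nearest symmetric quantization with a per-column scale; the function name, `n_weak`, and `n_bits` are illustrative choices, not values from the patent.

```python
import numpy as np

def find_weak_columns(x, W, n_weak=2, n_bits=4):
    """Hypothetical sketch: select the most quantization-sensitive
    ("weak") columns of W for an input vector x and keep them at
    full precision while quantizing the rest."""
    # Hessian proxy H = x x^T; only its diagonal (x_i^2) is used (claim 1).
    h_diag = x ** 2
    # Round-to-nearest symmetric quantization, one scale per column.
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.max(np.abs(W), axis=0) / qmax
    scale[scale == 0] = 1.0
    W_q = np.round(W / scale) * scale
    # Sensitivity per channel: diagonal Hessian component times the
    # squared difference between pre- and post-quantization weights.
    sensitivity = h_diag * np.sum((W - W_q) ** 2, axis=0)
    # Most sensitive columns become weak columns, kept unquantized (claim 4).
    weak = np.argsort(sensitivity)[-n_weak:]
    W_mixed = W_q.copy()
    W_mixed[:, weak] = W[:, weak]
    return weak, W_mixed
```

Here a column of W multiplies one element (channel) of x, so a large-magnitude input element inflates the sensitivity of its column and that column survives quantization intact.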

Description

CROSS-REFERENCE TO RELATED APPLICATION

The present application claims priority to Korean Patent Application No. 10-2023-0194111, filed on Dec. 28, 2023, the entire contents of which are incorporated herein for all purposes by this reference.

BACKGROUND

Technical Field

The disclosure relates to a method and an apparatus for lightweighting of artificial intelligence models and, more particularly, to a method and an apparatus for quantizing weights constituting artificial intelligence models.

Description of Related Art

A large language model (LLM) is a type of artificial intelligence model that processes natural language data and generates responses similar to those generated by humans. An LLM is constructed using deep learning technology and may be trained on a huge amount of text data. LLMs have recently attracted a lot of attention in the natural language processing field, and a representative LLM is the chat generative pre-trained transformer (GPT)-3. Because an LLM is trained on huge learning data sets, it understands the structure and meaning of language. An LLM is capable of detecting patterns of text and language rules and, based thereon, generating or understanding new text. Owing to these characteristics, an LLM may be used for various purposes such as machine translation, automatic text summarization, question answering, dialog systems, content generation, and the like.

SUMMARY

The disclosure intends to provide a method and an apparatus for lightweighting of artificial intelligence models. Further, the disclosure intends to provide a method and an apparatus for selectively quantizing weights of artificial intelligence models.
According to an embodiment of the disclosure, a method of lightweighting of artificial intelligence models includes identifying an outlier in an input vector of a layer, identifying at least one column corresponding to the outlier in a weight matrix, and quantizing weight values of columns which do not correspond to the outlier. According to an embodiment of the disclosure, identifying the outlier in the input vector of the layer includes identifying at least one element having a value larger than or equal to a threshold compared to an average of the values of the elements included in the input vector. According to an embodiment of the disclosure, identifying the outlier in the input vector of the layer includes generating a Hessian value by using the input vector, determining sensitivity for each channel based on the Hessian value, and determining a weak column corresponding to the outlier based on the sensitivity for each channel. According to an embodiment of the disclosure, the Hessian value is determined based on a product of the input vector and a transpose of the input vector, and the sensitivity for each channel is determined based on a diagonal component of the Hessian value and a difference between the weight values before and after quantization of the corresponding channel.
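The threshold-based detection described above can be sketched in a few lines of NumPy. This is an illustrative reading only: the patent does not fix the threshold, so the `ratio` hyperparameter and the use of the mean absolute value are assumptions.

```python
import numpy as np

def detect_outliers(x, ratio=8.0):
    """Hypothetical sketch: flag input elements whose magnitude is at
    least `ratio` times the average magnitude of the input vector."""
    mean_mag = np.mean(np.abs(x))
    # Indices of elements meeting the relative-threshold criterion.
    return np.nonzero(np.abs(x) >= ratio * mean_mag)[0]
```

Columns of the weight matrix matched to the returned indices would then be treated as outlier (weak) columns rather than being quantized.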
According to an embodiment of the disclosure, a method of processing input data by using lightweight artificial intelligence models includes acquiring an input data vector, dividing the input data vector into a first partial vector corresponding to quantized columns of a weight matrix and a second partial vector corresponding to at least one weak column of the weight matrix, performing calculations for the first partial vector by using the quantized columns, performing calculations for the second partial vector by using the at least one weak column, and adding up the results of the calculations for the first partial vector and the second partial vector; the weight matrix comprises weights of at least one column corresponding to an outlier in the input data vector and weights of at least one remaining column, and the bit precision of the weights of the at least one column corresponding to the outlier and the bit precision of the weights of the at least one remaining column are different from each other. According to an embodiment of the disclosure, the input data vector is processed by a processor capable of performing N-parallel calculations, the N calculations performed in parallel have indexes 0 to N-1, and calculations of indexes smaller than the size of an outlier vector corresponding to the outlier in the input data vector are allocated to process the outlier vector. According to an embodiment of the disclosure, performing the calculations for the first partial vector includes dequantizing weights included in the quantized columns and performing a matrix multiplication between the dequantized weights and the first partial vector.
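The split computation above can be sketched as a mixed-precision matrix-vector product: one partial result from the dequantized columns, one from the full-precision weak columns, summed at the end. The function and parameter names are illustrative; `q_idx` and `weak_idx` play the role of the first and second index tables of claim 7.

```python
import numpy as np

def mixed_precision_matvec(x, W_q, scale, W_weak, q_idx, weak_idx):
    """Hypothetical sketch of the divided calculation: W_q holds
    integer-quantized columns with per-column `scale`, W_weak holds
    full-precision weak columns, and the index lists map each stored
    column back to its element of the input vector x."""
    # First partial vector: dequantize, then multiply (claim 6).
    y_q = (W_q * scale) @ x[q_idx]
    # Second partial vector: weak columns stay at full precision.
    y_w = W_weak @ x[weak_idx]
    # Add the two partial results (claim 5).
    return y_q + y_w
```

When the quantized columns happen to represent their weights exactly, the result matches the full-precision product, which makes the decomposition easy to sanity-check.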