
US-20260127420-A1 - METHOD AND APPARATUS WITH NEURAL NETWORK MODEL QUANTIZATION

US20260127420A1

Abstract

A processor-implemented method including obtaining first data in an integer (INT) form by performing a first quantization on input activation data, applying a block-wise orthogonal matrix to the first data to obtain second data, and performing a second quantization on the second data to obtain third data, the block-wise orthogonal matrix including a plurality of orthogonal matrices arranged diagonally.
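The pipeline in the abstract (first quantization, block-wise orthogonal rotation, second quantization) can be illustrated with the NumPy sketch below. The 8-bit symmetric quantizer, the 2×2 Hadamard-style block, and the vector size are illustrative assumptions, not details taken from the application:

```python
import numpy as np

def quantize(x, n_bits=8):
    """Symmetric quantization to a signed INT (the first/second quantization)."""
    scale = np.abs(x).max() / (2 ** (n_bits - 1) - 1)
    q = np.clip(np.round(x / scale), -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)
    return q.astype(np.int32), scale

def block_diag_orthogonal(dim, block):
    """Arrange copies of a small orthogonal matrix along the diagonal."""
    assert dim % block.shape[0] == 0
    out = np.zeros((dim, dim))
    b = block.shape[0]
    for i in range(dim // b):
        out[i * b:(i + 1) * b, i * b:(i + 1) * b] = block
    return out

# 2x2 Hadamard-style block: elements in {-1, 1}, scaled by 1/sqrt(2) to be orthogonal
H2 = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

rng = np.random.default_rng(0)
x = rng.standard_normal(8) * np.array([1, 1, 1, 50, 1, 1, 1, 1])  # one outlier dimension

first, s1 = quantize(x)                  # first quantization -> INT form
Q = block_diag_orthogonal(8, H2)         # orthogonal matrices arranged diagonally
second = Q @ first                       # rotation spreads the outlier across dimensions
third, s2 = quantize(second)             # second quantization
```

Because each diagonal block is orthogonal, the full block-wise matrix is orthogonal, so the rotation preserves norms while mixing an outlier dimension with its neighbors before requantization.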

Inventors

  • Minjeong CHOI
  • Jaehoon Yu
  • Won-Jo Lee

Assignees

  • SAMSUNG ELECTRONICS CO., LTD.

Dates

Publication Date
May 7, 2026
Application Date
May 1, 2025
Priority Date
Nov. 7, 2024

Claims (20)

  1. A processor-implemented method, the method comprising: obtaining first data in an integer (INT) form by performing a first quantization on input activation data; applying a block-wise orthogonal matrix to the first data to obtain second data; and performing a second quantization on the second data to obtain third data, wherein the block-wise orthogonal matrix comprises a plurality of orthogonal matrices arranged diagonally.
  2. The method of claim 1, further comprising: generating the block-wise orthogonal matrix based on information related to one of the input activation data or the first data.
  3. The method of claim 2, wherein the generating of the block-wise orthogonal matrix comprises: deriving a dimension in which an outlier occurs in the input activation data or the first data; and generating the block-wise orthogonal matrix in response to the derived dimension.
  4. The method of claim 3, wherein the deriving of the dimension in which the outlier occurs is performed offline in advance based on data previously constructed in relation to a neural network model.
  5. The method of claim 2, wherein the generating of the block-wise orthogonal matrix comprises: recursively extending or reducing a base orthogonal matrix to generate at least one orthogonal matrix candidate based on the information related to the one of the input activation data or the first data; and repeatedly arranging the at least one orthogonal matrix candidate diagonally.
  6. The method of claim 5, wherein the generating of the block-wise orthogonal matrix comprises: loading a block-wise orthogonal matrix that is previously generated, from a lookup table, based on the information related to the one of the input activation data or the first data.
  7. The method of claim 1, wherein the applying the block-wise orthogonal matrix comprises: applying the block-wise orthogonal matrix and a scaling factor of the block-wise orthogonal matrix to the first data to obtain the second data.
  8. The method of claim 7, wherein the applying the block-wise orthogonal matrix comprises: in response to the scaling factor of the block-wise orthogonal matrix not being an INT, obtaining a transferred scaling factor, the transferred scaling factor being received by the block-wise orthogonal matrix in response to a portion of the scaling factor of a transpose matrix of the block-wise orthogonal matrix; and calculating the transferred scaling factor by using a shifter to obtain the second data.
  9. The method of claim 1, wherein each of the plurality of orthogonal matrices comprises at least one element of {−1, 0, 1}.
  10. The method of claim 1, further comprising: obtaining a parameter matrix in which a transpose matrix of the block-wise orthogonal matrix and a weight of a neural network model are pre-calculated; and applying the parameter matrix to the third data to output fourth data.
  11. The method of claim 1, wherein respective sizes of the plurality of orthogonal matrices are a same size.
  12. The method of claim 1, wherein two or more respective sizes of the plurality of orthogonal matrices are different sizes.
  13. A non-transitory computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the method of claim 1.
  14. An electronic apparatus, comprising: processors configured to execute instructions; and a memory storing the instructions, wherein execution of the instructions configures the processors to: perform a first quantization on input activation data to obtain first data in a form of an integer (INT); apply a block-wise orthogonal matrix to the first data to obtain second data; and perform a second quantization on the second data to obtain third data, wherein the block-wise orthogonal matrix comprises a plurality of orthogonal matrices arranged diagonally.
  15. The apparatus of claim 14, wherein the processors are further configured to: generate the block-wise orthogonal matrix based on information related to one of the input activation data or the first data, and wherein each of the plurality of orthogonal matrices comprises at least one element of {−1, 0, 1}.
  16. The apparatus of claim 14, wherein the applying the block-wise orthogonal matrix comprises: applying the block-wise orthogonal matrix and a scaling factor of the block-wise orthogonal matrix to the first data to obtain the second data.
  17. The apparatus of claim 16, wherein the applying the block-wise orthogonal matrix comprises: in response to the scaling factor of the block-wise orthogonal matrix not being an INT, obtaining a transferred scaling factor, the transferred scaling factor being received by the block-wise orthogonal matrix in response to a portion of the scaling factor of a transpose matrix of the block-wise orthogonal matrix; and calculating the transferred scaling factor by using a shifter to obtain the second data.
  18. The apparatus of claim 14, wherein the processors are further configured to: obtain a parameter matrix in which a transpose matrix of the block-wise orthogonal matrix and a weight of a neural network model are pre-calculated; and apply the parameter matrix to the third data to obtain fourth data.
  19. An electronic device, comprising: shifter circuitry comprising at least one shifter logic; and integer (INT) operation circuitry comprising at least one INT operation logic, wherein the shifter circuitry is configured to: generate first data in a form of an INT by performing a first quantization on input activation data; and generate second data by applying a block-wise orthogonal matrix to the first data, wherein the shifter circuitry is configured to: obtain third data by performing a second quantization on the second data, and wherein the block-wise orthogonal matrix includes a plurality of orthogonal matrices arranged diagonally.
  20. The device of claim 19, further comprising: matrix generation circuitry including generation logic of at least one block-wise orthogonal matrix, wherein the generation logic is configured to: generate the block-wise orthogonal matrix based on information related to the input activation data or the first data, and wherein the shifter circuitry is configured to: obtain the second data by performing computation by applying the block-wise orthogonal matrix and a scaling factor of the block-wise orthogonal matrix to the first data.
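Claim 5's recursive extension of a base orthogonal matrix, together with claim 9's restriction of block elements to {−1, 0, 1}, is reminiscent of the Sylvester construction of Hadamard matrices, whose entries are all ±1 and which are orthogonal up to a scaling factor. The sketch below illustrates that style of construction; it is an interpretation, not the claimed generation logic:

```python
import numpy as np

def extend(base):
    """Recursively double a matrix via the Sylvester pattern [[B, B], [B, -B]]."""
    return np.block([[base, base], [base, -base]])

def candidate(target_size, base=np.array([[1]])):
    """Extend the base until the candidate reaches target_size.

    target_size must be the base size times a power of two.
    """
    m = base
    while m.shape[0] < target_size:
        m = extend(m)
    return m

H4 = candidate(4)   # 4x4 matrix with entries in {-1, 1}
# H4 @ H4.T == 4 * I, so H4 / 2 is orthogonal; copies of it could then be
# arranged diagonally as in claim 5 to form the block-wise orthogonal matrix.
```

Because the unscaled entries stay in {−1, 1}, applying such a block needs only INT additions and subtractions, with the power-of-two normalization handled separately, which fits the shifter-based data path of claims 8 and 19.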

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit under 35 USC § 119(a) of Korean Patent Application No. 10-2024-0157208, filed on Nov. 7, 2024, in the Korean Intellectual Property Office, the entire disclosure of which is incorporated herein by reference for all purposes.

BACKGROUND

1. Field

The following description relates to a method and apparatus with neural network model quantization.

2. Description of Related Art

A large language model (LLM), one of the recently developed deep learning models, generates answers corresponding to text-type queries. LLMs have generally become large models including billions, or even more than 10 billion, parameters. However, the capacity of the dynamic random-access memory (DRAM) hardware available to run such an LLM is relatively limited. To overcome these hardware constraints, quantization technology is typically used to reduce the size of the LLM and to convert it into a model suitable for actual services.

SUMMARY

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.

In a general aspect, here is provided a processor-implemented method including obtaining first data in an integer (INT) form by performing a first quantization on input activation data, applying a block-wise orthogonal matrix to the first data to obtain second data, and performing a second quantization on the second data to obtain third data, the block-wise orthogonal matrix including a plurality of orthogonal matrices arranged diagonally.

The method may include generating the block-wise orthogonal matrix based on information related to one of the input activation data or the first data.
The generating of the block-wise orthogonal matrix may include deriving a dimension in which an outlier occurs in the input activation data or the first data and generating the block-wise orthogonal matrix in response to the derived dimension. The deriving of the dimension in which the outlier occurs may be performed offline in advance based on data previously constructed in relation to a neural network model.

The generating of the block-wise orthogonal matrix may include recursively extending or reducing a base orthogonal matrix to generate at least one orthogonal matrix candidate based on the information related to the one of the input activation data or the first data and repeatedly arranging the at least one orthogonal matrix candidate diagonally. The generating of the block-wise orthogonal matrix may include loading a block-wise orthogonal matrix that is previously generated, from a lookup table, based on the information related to the one of the input activation data or the first data.

The applying the block-wise orthogonal matrix may include applying the block-wise orthogonal matrix and a scaling factor of the block-wise orthogonal matrix to the first data to obtain the second data. The applying the block-wise orthogonal matrix may include, in response to the scaling factor of the block-wise orthogonal matrix not being an INT, obtaining a transferred scaling factor, the transferred scaling factor being received by the block-wise orthogonal matrix in response to a portion of the scaling factor of a transpose matrix of the block-wise orthogonal matrix, and calculating the transferred scaling factor by using a shifter to obtain the second data. Each of the plurality of orthogonal matrices may include at least one element of {−1, 0, 1}.
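The shifter described above (and in claim 8) suggests that the transferred scaling factor is handled as a power of two, so the scaling multiply reduces to a bit shift and the data path stays in INT. A minimal sketch under that assumption follows; the function name and the signed-shift encoding are hypothetical, not taken from the application:

```python
def shift_scale(q: int, shift: int) -> int:
    """Apply a power-of-two scaling factor with a shifter instead of a multiplier.

    A non-negative shift multiplies by 2**shift; a negative shift divides by
    2**(-shift) via an arithmetic right shift, keeping the computation in INT.
    """
    return q << shift if shift >= 0 else q >> (-shift)

# Scaling a quantized value by 1/4 corresponds to a shift of -2,
# and scaling by 4 corresponds to a shift of +2.
```

Restricting scaling factors to powers of two is a common hardware simplification: it replaces an INT multiplier with cheaper shift logic, which matches the shifter circuitry of claims 19 and 20.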
The method may include obtaining a parameter matrix in which a transpose matrix of the block-wise orthogonal matrix and a weight of a neural network model are pre-calculated and applying the parameter matrix to the third data to output fourth data. Respective sizes of the plurality of orthogonal matrices may be a same size. Two or more respective sizes of the plurality of orthogonal matrices may be different sizes.

In a general aspect, here is provided a non-transitory computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the method.

In a general aspect, here is provided an electronic apparatus including processors configured to execute instructions and a memory storing the instructions, wherein execution of the instructions configures the processors to perform a first quantization on input activation data to obtain first data in a form of an integer (INT), apply a block-wise orthogonal matrix to the first data to obtain second data, and perform a second quantization on the second data to obtain third data, the block-wise orthogonal matrix including a plurality of orthogonal matrices arranged diagonally. The processors may be further configured to generate the block