DE-112023006737-T5 - TENSOR MULTIPLICATION IN NEURAL NETWORK BASED ON DEQUANTIZATION WITH MIXED DATA LAYOUT
Abstract
Tensor multiplication operations in a deep neural network (DNN) can be performed using a mixed data layout based on weight dequantization. A tensor multiplication operation can involve a weight tensor containing weights in a floating-point data format. The weight tensor can be quantized to create an integer weight tensor with a smaller storage size than the weight tensor. The integer weight tensor can be partitioned into weight blocks. A weight block can further be partitioned into weight groups based on a predefined group size. A weight group can be packed by right-shifting bits. The data layout in the packed weight group can be modified by shuffling the data elements within the weight group. The packed and shuffled weight group can be dequantized to create a floating-point weight block. The tensor multiplication operation can then be performed on the floating-point weight block.
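As an informal illustration of the flow summarized above, the sketch below quantizes a small weight group to 4-bit integers, packs two values per byte with a shift, and dequantizes the packed group back to floating point before the multiplication would take place. The group size, data formats, scale, and packing scheme are assumptions chosen for illustration only, not the layout specified in the claims.

```python
import numpy as np

# Illustrative sketch only: symmetric int4 quantization, a group size of 8,
# and two 4-bit weights packed per byte are all assumptions, not the
# patent's specified layout.

def quantize_int4(w_fp32, scale):
    # Quantize float32 weights to signed 4-bit values stored in int8.
    return np.clip(np.round(w_fp32 / scale), -8, 7).astype(np.int8)

def pack_and_shuffle(group):
    # Pack a group of 8 int4 values into 4 bytes: one weight lands in the
    # low nibble, another in the high nibble (placed via a left shift and
    # recovered later via a right shift). Pairing elements from the two
    # halves of the group mixes the data layout within the packed group.
    lo, hi = group[:4], group[4:]
    return ((hi.astype(np.uint8) & 0xF) << 4) | (lo.astype(np.uint8) & 0xF)

def dequantize(packed, scale):
    # Unpack both nibbles, sign-extend the 4-bit values, and scale back
    # to float32.
    lo = (packed & 0xF).astype(np.int8)
    hi = ((packed >> 4) & 0xF).astype(np.int8)
    vals = np.concatenate([lo, hi])
    vals = np.where(vals > 7, vals - 16, vals)   # sign-extend 4-bit values
    return vals.astype(np.float32) * scale

scale = 0.1
w = np.random.randn(8).astype(np.float32)
packed = pack_and_shuffle(quantize_int4(w, scale))
w_deq = dequantize(packed, scale)   # approximate reconstruction of w
```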
Inventors
- Jiong Gong
- Weiwen Xia
- Xiaolan Yuan
Assignees
- INTEL CORPORATION
Dates
- Publication Date: 2026-05-13
- Application Date: 2023-11-22
- Priority Date: 2023-07-28
Claims (20)
1. A method for performing a tensor multiplication operation in a neural network, comprising: storing a first integer weight block generated by quantizing a weight tensor of the tensor multiplication operation, wherein the weight tensor comprises weights in a first floating-point data format and the first integer weight block comprises data elements in a first integer data format; forming a second integer weight block based on the first integer weight block, wherein the second integer weight block comprises data elements in a second integer data format; forming a third integer weight block from the second integer weight block, wherein the third integer weight block comprises data elements in a third integer data format; converting the third integer weight block into a floating-point weight block, wherein the floating-point weight block comprises data elements in a second floating-point data format; and performing the tensor multiplication operation by multiplying the floating-point weight block by a floating-point activation block.
2. The method according to Claim 1, wherein the second integer data format differs from the first integer data format, the third integer data format differs from the first integer data format and from the second integer data format, a data element in the second integer weight block comprises bits of multiple data elements in the first integer weight block, and the third integer weight block is formed by inserting zeros into the second integer weight block.
3. The method according to Claim 1, wherein the first integer weight block has multiple weight groups, a weight group is stored as a first sequence of bits, and forming the second integer weight block based on the first integer weight block comprises: generating a second sequence of bits by shifting the first sequence of bits; and forming a new weight group by combining the first sequence of bits and the second sequence of bits, the second integer weight block comprising the new weight group.
4. The method according to Claim 3, wherein: a data element in the weight group has a predetermined number of bits, and shifting the first sequence of bits comprises shifting the first sequence of bits to the right by the predetermined number of bits.
5. The method according to Claim 1, wherein the first integer weight block and the second integer weight block have the same number of data elements, and the second integer weight block has a larger storage size than the first integer weight block.
6. The method according to Claim 1, wherein a data element in the third integer weight block comprises bits of a data element in the second integer weight block and bits of one or more of the zeros inserted into the second integer weight block.
7. The method according to Claim 1, wherein the first integer weight block and the floating-point weight block have the same number of data elements.
8. The method according to Claim 1, wherein the second floating-point data format differs from the first floating-point data format, and the method further comprises: storing an activation block, wherein the activation block comprises activations in the first floating-point data format; and converting the activation block into the floating-point activation block by changing the first floating-point data format to the second floating-point data format.
9. The method according to Claim 1, wherein the first integer weight block is further generated by: partitioning a quantized weight tensor into quantized weight blocks, wherein the quantized weight tensor is generated by quantizing the weight tensor; and generating the first integer weight block from a single quantized weight block of the quantized weight blocks.
10. The method according to Claim 9, wherein generating the first integer weight block comprises: partitioning the single quantized weight block into multiple weight groups; and for each weight group of the multiple weight groups, modifying one or more positions of one or more data elements in the weight group.
11. One or more non-transitory computer-readable media storing instructions executable to perform operations for a tensor multiplication operation in a neural network, the operations comprising: storing a first integer weight block generated by quantizing a weight tensor of the tensor multiplication operation, wherein the weight tensor comprises weights in a first floating-point data format and the first integer weight block comprises data elements in a first integer data format; forming a second integer weight block based on the first integer weight block, wherein the second integer weight block comprises data elements in a second integer data format; forming a third integer weight block from the second integer weight block, wherein the third integer weight block comprises data elements in a third integer data format; converting the third integer weight block into a floating-point weight block, wherein the floating-point weight block comprises data elements in a second floating-point data format; and performing the tensor multiplication operation by multiplying the floating-point weight block by a floating-point activation block.
12. The one or more non-transitory computer-readable media according to Claim 11, wherein the first integer weight block has multiple weight groups, a weight group is stored as a first sequence of bits, and forming the second integer weight block based on the first integer weight block comprises: generating a second sequence of bits by shifting the first sequence of bits; and forming a new weight group by combining the first sequence of bits and the second sequence of bits, the second integer weight block comprising the new weight group.
13. The one or more non-transitory computer-readable media according to Claim 11, wherein the first integer weight block and the second integer weight block have the same number of data elements, and the second integer weight block has a larger storage size than the first integer weight block.
14. The one or more non-transitory computer-readable media according to Claim 11, wherein the first integer weight block and the floating-point weight block have the same number of data elements.
15. The one or more non-transitory computer-readable media according to Claim 11, wherein the second floating-point data format differs from the first floating-point data format, and the operations further comprise: storing an activation block, wherein the activation block comprises activations in the first floating-point data format; and converting the activation block into the floating-point activation block by changing the first floating-point data format to the second floating-point data format.
16. The one or more non-transitory computer-readable media according to Claim 11, wherein the first integer weight block is further generated by: partitioning a quantized weight tensor into quantized weight blocks, wherein the quantized weight tensor is generated by quantizing the weight tensor; and generating the first integer weight block from a single quantized weight block of the quantized weight blocks.
17. A device comprising: a computer processor for executing computer program instructions; and a non-transitory computer-readable memory storing computer program instructions executable by the computer processor to perform operations for carrying out a tensor multiplication operation in a neural network, the operations comprising: storing a first integer weight block generated by quantizing a weight tensor of the tensor multiplication operation, wherein the weight tensor comprises weights in a first floating-point data format and the first integer weight block comprises data elements in a first integer data format; forming a second integer weight block based on the first integer weight block, wherein the second integer weight block comprises data elements in a second integer data format; forming a third integer weight block from the second integer weight block, wherein the third integer weight block comprises data elements in a third integer data format; converting the third integer weight block into a floating-point weight block, wherein the floating-point weight block comprises data elements in a second floating-point data format; and performing the tensor multiplication operation by multiplying the floating-point weight block by a floating-point activation block.
18. The device according to Claim 17, wherein the first integer weight block has multiple weight groups, a weight group is stored as a first sequence of bits, and forming the second integer weight block based on the first integer weight block comprises: generating a second sequence of bits by shifting the first sequence of bits; and forming a new weight group by combining the first sequence of bits and the second sequence of bits, the second integer weight block comprising the new weight group.
19. The device according to Claim 17, wherein the second floating-point data format differs from the first floating-point data format, and the operations further comprise: storing an activation block, wherein the activation block comprises activations in the first floating-point data format; and converting the activation block into the floating-point activation block by changing the first floating-point data format to the second floating-point data format.
20. The device according to Claim 17, wherein the first integer weight block is further generated by: partitioning a quantized weight tensor into quantized weight blocks, wherein the quantized weight tensor is generated by quantizing the weight tensor; and generating the first integer weight block from a single quantized weight block of the quantized weight blocks.
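The succession of weight blocks recited in Claims 1 through 6 admits several concrete layouts. The sketch below shows one plausible reading under assumed data formats: 4-bit elements packed two per byte as the first block, one weight per byte as the second block formed by a right shift and recombination, and zero-extended 32-bit integers as the third block before the float conversion. The formats, the example values, and the omission of scale and zero-point handling are assumptions for illustration, not the patent's definitive implementation.

```python
import numpy as np

# Assumed formats: the first block packs two 4-bit weights per byte; the
# second block is formed by right-shifting the packed bytes and combining
# them with the originals so every byte carries one weight in its low
# nibble; the third block zero-extends each byte to 32 bits before the
# conversion to floating point. All of these choices are assumptions.

packed = np.array([0x2F, 0x81], dtype=np.uint8)   # first integer weight block

# Second block: shift the bit sequence right by the element width (4 bits)
# and interleave it with the masked original sequence, so each resulting
# data element contains bits taken from the first block.
shifted = packed >> 4
second = np.empty(2 * packed.size, dtype=np.uint8)
second[0::2] = packed & 0xF
second[1::2] = shifted

# Third block: insert zeros (zero-extend) so each element comprises the
# bits of a second-block element plus inserted zero bits.
third = second.astype(np.uint32)

# Floating-point weight block (scale and zero-point not applied here).
fp_block = third.astype(np.float32)
```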
Description
Cross-reference to related application
This application claims the benefit of International Application No. PCT/CN2023/109778, filed on July 28, 2023 and entitled “LOOKUP-TABLE BASED DEQUANTIZATION FOR NEURAL NETWORK INFERENCE WITH SHUFFLED DATA LAYOUT AND PERMUTATION INSTRUCTION”, the entire content of which is hereby incorporated by reference for all purposes.
Technical field
This disclosure relates generally to deep neural networks (DNNs), and specifically to tensor multiplication operations in DNNs based on dequantization with a mixed data layout.
Background
Deep neural networks (DNNs) are widely used in a variety of artificial intelligence applications, from computer vision to speech recognition and natural language processing, due to their ability to achieve high accuracy. However, this high accuracy comes at the cost of significant computational overhead. DNNs are extremely computationally expensive because each inference can require hundreds of millions of tensor operations, as well as large amounts of data to be read and written. Many tensor operations (e.g., multiply-accumulate (MAC) operations in convolutional layers, linear transformations in fully connected layers, etc.) involve tensor multiplications.
Brief description of the drawings
The embodiments will be readily understood from the following detailed description in conjunction with the accompanying drawings. To facilitate this description, identical reference numerals denote identical structural elements. The embodiments are shown in the figures of the accompanying drawings by way of example and are not exhaustive. The drawings show:
- an example DNN, according to various embodiments;
- a block diagram of a DNN system, according to various embodiments;
- a block diagram of a DNN module, according to various embodiments;
- a tensor multiplication operation based on dequantization with a mixed data layout, according to various embodiments;
- the quantization of a weight tensor, according to various embodiments;
- the partitioning of a weight tensor, according to various embodiments;
- the shuffling of data elements in a weight group, according to various embodiments;
- the shuffling of data elements in another weight group, according to various embodiments;
- the unpacking of a weight group, according to various embodiments;
- an extended weight group, according to various embodiments;
- a floating-point weight group, according to various embodiments;
- a flowchart of a method for performing a tensor multiplication operation in a DNN, according to various embodiments;
- a block diagram of an example computing device, according to various embodiments.
Detailed description
Overview
Over the past decade, AI (artificial intelligence)-based data processing, particularly using deep neural networks (DNNs), has grown rapidly. DNNs are widely used in computer vision, speech recognition, and image and video processing, primarily due to their ability to achieve accuracy beyond human levels. Significant improvements in the size and accuracy of DNN models, coupled with the rapid increase in the computing power of execution platforms, have enabled the adoption of DNN applications even on resource-constrained mobile and edge devices with limited power availability. For example, large language models (LLMs) can improve language understanding, language generation, and language automation, which can lead to significant advancements across many areas of industry.
However, efficient inference of LLMs can be challenging due to their enormous parameter counts and can require substantial computational resources. While large-scale models with hundreds of billions or trillions of parameters may require GPU (graphics processing unit) accelerators for optimal performance, domain-specific, medium-sized models can still benefit from the capabilities of the central processing unit (CPU) and deliver satisfactory results. During inference, the model may be fed a text prompt, typically containing tens of thousands of tokens, and the response tokens are then generated sequentially in an autoregressive manner. The majority of the workload is spent on tensor multiplications in the inner-product operations, which are primarily limited by the memory bandwidth required to load large parameters (e.g., weights).

Many DNNs incorporate tensor multiplication operations. A tensor multiplication operation can be performed on an input tensor and a weight tensor. A tensor is a data structure with multiple elements across one or more dimensions. Examples of tensors include a vector, which can be a one-dimensional tensor, and a matrix, which can be a two-dimensional tensor. There can also be three-dimensional tensors and even higher-dimensional tensors. The input tensor contains one or more activations (also called "input activations" or "input elements"). The weight tensor contains one or more weights. The values o