EP-4738096-A1 - MIXED-PRECISION MAC TREE STRUCTURE FOR MAXIMIZING MEMORY BANDWIDTH USAGE TO ACCELERATE OPERATIONS OF GENERATIVE LARGE LANGUAGE MODEL

Abstract

Provided is a mixed-precision multiply-and-accumulate (MAC) tree structure for maximizing memory bandwidth usage in order to accelerate the operations of a generative large language model. A MAC tree-based computational device according to one embodiment may comprise: a plurality of floating-point multipliers that are connected in parallel and process multiplication operations for data transmitted from an external memory; a plurality of first converters for converting the output of each of the plurality of floating-point multipliers from floating-point to fixed-point; a fixed-point adder tree that is connected to the plurality of first converters and processes the addition of the multiplication results of the plurality of floating-point multipliers; a fixed-point accumulator for accumulating the outputs of the fixed-point adder tree; and a second converter for converting the output of the fixed-point accumulator from fixed point to floating point.

Inventors

KIM, JUNG-HOON

Assignees

Hyperaccel Co., Ltd.

Dates

Publication Date: 2026-05-06
Application Date: 2024-05-20

Claims (8)

1. A multiply-and-accumulation (MAC) tree-based operator comprising: a plurality of floating-point (FP) multipliers connected in parallel and configured to process a multiplication operation on data delivered from an external memory; a plurality of first converters configured to convert output of each of the plurality of FP multipliers from floating point to fixed point; a fixed-point (FXP) adder tree connected to the plurality of first converters and configured to process summation of multiplication results of the plurality of FP multipliers; an FXP accumulator configured to accumulate output of the FXP adder tree; and a second converter configured to convert output of the FXP accumulator from the fixed point to the floating point, wherein the MAC tree-based operator corresponds to one of a plurality of MAC tree-based operators included in a hardware accelerator for acceleration of an artificial intelligence (AI) model, and at least one of the number of the plurality of MAC tree-based operators included in the hardware accelerator and the number of the plurality of FP multipliers included in the MAC tree-based operator is determined based on a memory bandwidth provided for the hardware accelerator.
2. The MAC tree-based operator of claim 1, wherein the plurality of MAC tree-based operators is configured to perform a matrix multiplication operation for at least one partition among a plurality of partitions that implements the AI model.
3. The MAC tree-based operator of claim 2, wherein the external memory includes a high bandwidth memory in which the at least one partition is stored and a local memory unit included in the hardware accelerator.
4. The MAC tree-based operator of claim 1, wherein each of the plurality of FP multipliers comprises: a mixed-precision FXP exponent adder for addition of the exponent; and a mixed-precision FXP mantissa multiplier for multiplication of the mantissa.
5. The MAC tree-based operator of claim 1, wherein each of the plurality of FP multipliers is configured to process multiplication between a first operand and a second operand with the same bit precision in response to a high-precision mode being selected and to compute a first result value.
6. A multiply-and-accumulation (MAC) tree-based operator comprising: a plurality of floating-point (FP) multipliers connected in parallel and configured to process a multiplication operation on data delivered from an external memory; a plurality of first converters configured to convert output of each of the plurality of FP multipliers from floating point to fixed point; a fixed-point (FXP) adder tree connected to the plurality of first converters and configured to process summation of multiplication results of the plurality of FP multipliers; an FXP accumulator configured to accumulate output of the FXP adder tree; and a second converter configured to convert output of the FXP accumulator from the fixed point to the floating point, wherein each of the plurality of FP multipliers is configured to simultaneously process first multiplication between a first operand with first bit precision and a (2-1)-th operand with second bit precision and second multiplication between the first operand and a (2-2)-th operand with the second bit precision in response to a high-performance mode being selected and to simultaneously compute a first result value of the first multiplication and a second result value of the second multiplication.
7. The MAC tree-based operator of claim 6, wherein: the first bit precision includes 16-bit precision, and the second bit precision includes 8-bit precision.
8. An operating method of a multiply-and-accumulation (MAC) tree-based operator, wherein the MAC tree-based operator comprises a plurality of floating-point (FP) multipliers connected in parallel, a plurality of first converters connected to the plurality of FP multipliers, a fixed-point (FXP) adder tree, an FXP accumulator, and a second converter, and the method comprises: processing, using the plurality of FP multipliers, a multiplication operation on data delivered from an external memory; converting, using the plurality of first converters, a result of multiplication operation of each of the plurality of FP multipliers from floating point to fixed point; processing, using the FXP adder tree, summation of the converted result of the plurality of FP multipliers; accumulating, using the FXP accumulator, output of the FXP adder tree; and converting, using the second converter, output of the FXP accumulator from the fixed point to the floating point, and the MAC tree-based operator corresponds to one of a plurality of MAC tree-based operators included in a hardware accelerator for acceleration of an artificial intelligence (AI) model, and at least one of the number of the plurality of MAC tree-based operators included in the hardware accelerator and the number of the plurality of FP multipliers included in the MAC tree-based operator is determined based on a memory bandwidth provided for the hardware accelerator.
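
The five steps of the operating method recited in the last claim above (multiply, convert to fixed point, sum in an adder tree, accumulate, convert back) can be illustrated with a small software model. This is only a behavioral sketch under stated assumptions, not the claimed hardware: the fractional width FRAC_BITS, the lane count, and all function names are hypothetical and do not come from the source, and Python floats and unbounded integers stand in for the FP and FXP datapaths.

```python
from typing import List, Sequence

FRAC_BITS = 16  # hypothetical fixed-point fractional width; not from the source


def fp_to_fxp(x: float) -> int:
    """First converter: floating point -> fixed point (round to nearest)."""
    return round(x * (1 << FRAC_BITS))


def fxp_to_fp(x: int) -> float:
    """Second converter: fixed point -> floating point."""
    return x / (1 << FRAC_BITS)


def adder_tree(values: List[int]) -> int:
    """Pairwise, tree-shaped integer summation of the converted products."""
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:  # pad an odd level with zero so pairs line up
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0] if level else 0


def mac_tree_dot(a: Sequence[float], b: Sequence[float], lanes: int = 4) -> float:
    """Dot product computed `lanes` multiplications at a time."""
    acc = 0  # FXP accumulator (integer)
    for start in range(0, len(a), lanes):
        # Parallel FP multipliers: one product per lane.
        products = [x * y for x, y in zip(a[start:start + lanes],
                                          b[start:start + lanes])]
        # First converters, then the FXP adder tree, then accumulation.
        acc += adder_tree([fp_to_fxp(p) for p in products])
    return fxp_to_fp(acc)


print(mac_tree_dot([0.5, 1.5, -2.0, 0.25, 3.0],
                   [2.0, 4.0, 0.5, 8.0, -1.0]))  # 5.0
```

In this model only the multipliers operate on floating-point values; the summation and accumulation are plain integer additions, which mirrors the division of work between the FP multipliers and the FXP adder tree and accumulator described in the claims.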

Description

TECHNICAL FIELD

Example embodiments relate to a mixed-precision multiply-and-accumulate (MAC) tree structure to maximize memory bandwidth usage for computational acceleration of a generative large language model.

BACKGROUND ART

With the rapid development of generative large language models, model size keeps increasing to achieve high precision, with model parameters ranging from millions to billions. Therefore, a large amount of data needs to be retrieved from memory at once and operations need to be performed without interruption. Also, a floating-point (FP) operator is required to support generative large language model operations without loss of accuracy. However, such an FP operator has high logic complexity and accordingly occupies a large area.

Reference material includes Korean Patent Laid-Open Publication No. 10-2022-0164573.

DETAILED DESCRIPTION OF THE INVENTION

Technical Object

Example embodiments may provide an operator of a hardware accelerator that maximizes usage of a memory bandwidth provided for acceleration of a generative large language model that is difficult to parallelize and has a large amount of data.

Technical subjects of the present invention are not limited to the aforementioned technical subjects and still other technical subjects not described herein will be clearly understood by one of ordinary skill in the art from the following description.
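
As a rough illustration of what sizing an operator to "maximize usage of a memory bandwidth" could mean, the following back-of-the-envelope sketch computes how many multipliers are needed to consume every operand bit that memory delivers per clock cycle. The source gives no formula, so everything here is an assumption: the function name, the premise that each multiplier consumes exactly one operand from memory per cycle, and all numeric values (bandwidth, clock, multipliers per tree) are hypothetical.

```python
from typing import Tuple


def size_mac_trees(bandwidth_bytes_per_s: int, clock_hz: int,
                   operand_bits: int, multipliers_per_tree: int) -> Tuple[int, int]:
    """Return (total multipliers, number of MAC trees) that keep up with memory.

    Assumption (not from the source): every multiplier consumes one operand
    of `operand_bits` from memory on every clock cycle.
    """
    bits_per_cycle = bandwidth_bytes_per_s * 8 // clock_hz
    total_multipliers = bits_per_cycle // operand_bits
    trees = total_multipliers // multipliers_per_tree
    return total_multipliers, trees


# Hypothetical figures: 512 GB/s of memory bandwidth, a 250 MHz clock,
# and 64 multipliers per MAC tree.
print(size_mac_trees(512 * 10**9, 250 * 10**6, 16, 64))  # (1024, 16)
print(size_mac_trees(512 * 10**9, 250 * 10**6, 8, 64))   # (2048, 32)
```

Under these assumptions, halving the operand width doubles the number of multiplications needed per cycle to use the same bandwidth.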

Problem Solving Means

According to an example embodiment, there is provided a multiply-and-accumulation (MAC) tree-based operator including a plurality of floating-point (FP) multipliers connected in parallel and configured to process a multiplication operation on data delivered from an external memory; a plurality of first converters configured to convert output of each of the plurality of FP multipliers from floating point to fixed point; a fixed-point (FXP) adder tree connected to the plurality of first converters and configured to process summation of multiplication results of the plurality of FP multipliers; an FXP accumulator configured to accumulate output of the FXP adder tree; and a second converter configured to convert output of the FXP accumulator from fixed point to floating point.

According to an aspect, the MAC tree-based operator may correspond to one of a plurality of MAC tree-based operators included in a hardware accelerator for acceleration of an artificial intelligence (AI) model.

According to another aspect, at least one of the number of the plurality of MAC tree-based operators included in the hardware accelerator and the number of the plurality of FP multipliers included in the MAC tree-based operator may be determined based on a memory bandwidth provided for the hardware accelerator.

According to still another aspect, the plurality of MAC tree-based operators may be configured to perform a matrix multiplication operation for at least one partition among a plurality of partitions that implements the AI model.

According to still another aspect, the external memory may include a high bandwidth memory in which the at least one partition is stored.

According to still another aspect, each of the plurality of FP multipliers may include a mixed-precision FXP exponent adder for addition of the exponent; and a mixed-precision FXP mantissa multiplier for multiplication of the mantissa.
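
The split of an FP multiplier into an exponent adder and a mantissa multiplier, both working on integers, can be sketched in software as follows. This is a minimal illustration, not the patented circuit: MANT_BITS and the function names are hypothetical, Python's math.frexp and math.ldexp stand in for field extraction and repacking, and the sketch covers a single precision only, so the mixed-precision aspect is not modeled.

```python
import math
from typing import Tuple

MANT_BITS = 11  # hypothetical significand width (FP16-like); not from the source


def decompose(x: float) -> Tuple[int, int, int]:
    """Split x into (sign, integer mantissa, exponent), x ~ sign * mant * 2**exp."""
    m, e = math.frexp(abs(x))           # abs(x) == m * 2**e with 0.5 <= m < 1, or m == 0
    mant = round(m * (1 << MANT_BITS))  # mantissa as an integer (rounded to MANT_BITS)
    sign = -1 if x < 0 else 1
    return sign, mant, e - MANT_BITS


def fp_multiply(a: float, b: float) -> float:
    """Multiply two floats using only integer operations on their fields."""
    sa, ma, ea = decompose(a)
    sb, mb, eb = decompose(b)
    exp = ea + eb    # integer (fixed-point) exponent adder
    mant = ma * mb   # integer (fixed-point) mantissa multiplier
    return sa * sb * math.ldexp(mant, exp)


print(fp_multiply(1.5, -2.25))  # -3.375
```

The result is exact here because both inputs fit in MANT_BITS bits of significand; inputs that need more bits are rounded in decompose before the multiplication.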

According to still another aspect, each of the plurality of FP multipliers may be configured to process multiplication between a first operand and a second operand with the same bit precision in response to a high-precision mode being selected and to compute a first result value.

According to still another aspect, each of the plurality of FP multipliers may be configured to simultaneously process first multiplication between a third operand with first bit precision and a (4-1)-th operand with second bit precision and second multiplication between the third operand with first bit precision and a (4-2)-th operand with the second bit precision in response to a high-performance mode being selected and to simultaneously compute a second result value of the first multiplication and a third result value of the second multiplication.

According to still another aspect, the first bit precision may include 16-bit precision, and the second bit precision may include 8-bit precision.

According to an example embodiment, there is provided an operating method of a MAC tree-based operator, wherein the MAC tree-based operator includes a plurality of FP multipliers connected in parallel, a plurality of first converters connected to the plurality of FP multipliers, a fixed-point (FXP) adder tree, an FXP accumulator, and a second converter, and the method includes processing, using the plurality of FP multipliers, a multiplication operation on data delivered from an external memory; converting, using the plurality of first converters, a result of multiplication operation of each of the plurality of FP multipliers from floating point to fixed point; processing, using th