CN-121352075-B - Multi-core heterogeneous system, method, and medium for machine learning model

CN121352075B

Abstract

The present disclosure provides a multi-core heterogeneous system, method, and medium for machine learning models, and relates to the field of artificial intelligence, in particular to model acceleration, model training, and the like. A first processor samples input data, bins the sampled data to obtain a corresponding histogram, selects hotspot bins from the histogram, and, in response to receiving an instruction to generate a lookup table, generates a lookup table comprising a plurality of activation function values and index parameters for looking up those values based on the histogram and the hotspot bins; in response to determining that the lookup table has been generated, it issues an instruction to update the lookup table. A second processor, in response to receiving the instruction to update the lookup table, updates a copy of the lookup table stored in the second processor with the lookup table generated by the first processor, and calculates input tensor data based on the stored lookup table copy.
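Read literally, the first processor's workflow in the abstract (sample → bin into a histogram → select hotspot bins → build a density-weighted lookup table) can be sketched in Python. This is an illustrative reconstruction, not the patented implementation: the bin count, hotspot count, total number of sampling points, the tanh-based GELU approximation, and the flat list of (x, f(x)) entries are all assumptions chosen for the sketch.

```python
import math
import random

def gelu(x):
    # tanh-based GELU approximation (an assumption; any activation works here)
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

def build_lut(samples, num_bins=16, num_hotspots=2, total_points=64):
    """Bin the sampled data, pick the densest bins as hotspot intervals,
    and allocate lookup-table sampling points to each hotspot interval in
    proportion to its sample count."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / num_bins or 1.0
    counts = [0] * num_bins
    for x in samples:
        i = min(int((x - lo) / width), num_bins - 1)
        counts[i] += 1
    # hotspot intervals: the num_hotspots bins holding the most samples
    hot = sorted(range(num_bins), key=lambda i: counts[i], reverse=True)[:num_hotspots]
    hot_total = sum(counts[i] for i in hot)
    lut = []  # entries: (sampling point, activation function value)
    for i in hot:
        # sampling points per hotspot interval proportional to its data volume
        n = max(2, round(total_points * counts[i] / hot_total))
        a = lo + i * width
        for k in range(n):
            x = a + k * width / (n - 1)
            lut.append((x, gelu(x)))
    return sorted(lut)
```

In this sketch the "index parameters" of the claims are implicit in the sorted sampling points; a hardware table would more likely store per-interval base addresses and strides.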

Inventors

  • WANG YINGHUI
  • ZHANG HAO
  • WU QINGSONG

Assignees

  • 瀚博半导体(上海)股份有限公司

Dates

Publication Date
2026-05-12
Application Date
2025-12-22

Claims (15)

  1. A multi-core heterogeneous system for a machine learning model, comprising: a first processor configured to: sample input data to obtain sample data; perform a binning operation on the sample data and count the amount of sample data falling into each binning interval, thereby obtaining a histogram indicating the sample data amount of each binning interval; select at least one binning interval from the histogram as a hotspot interval, wherein each selected binning interval corresponds to a respective one of the hotspot intervals; and, in response to receiving an instruction to generate a lookup table: generate a lookup table based on the histogram and the hotspot intervals, the lookup table comprising a plurality of activation function values and index parameters for looking up the activation function values, and issue an instruction to update the lookup table in response to determining that the lookup table has been generated; and a second processor comprising on-chip memory configured to store a lookup table copy, the second processor communicatively coupled to the first processor and configured to: update the lookup table copy stored in the on-chip memory with the lookup table generated by the first processor in response to receiving the instruction to update the lookup table, and calculate input tensor data based on the lookup table copy stored in the on-chip memory.
  2. The multi-core heterogeneous system of claim 1, wherein generating the lookup table based on the histogram and the hotspot intervals comprises: determining a total number of sampling points of the activation function over the hotspot intervals; determining sampling points and corresponding index parameters of the activation function in each sub-interval of the hotspot intervals based on the determined total number of sampling points, wherein the number of sampling points in each sub-interval is allocated according to the sample data amount of the corresponding binning interval; and evaluating the activation function at the determined sampling points to obtain the plurality of activation function values respectively corresponding to the sampling points.
  3. The multi-core heterogeneous system of claim 2, wherein the total number of sampling points of the activation function over the hotspot intervals is based on a computational precision to be used by the machine learning model, and wherein the first processor is further configured to regenerate the lookup table based on a new computational precision in response to receiving an instruction to update the computational precision.
  4. The multi-core heterogeneous system of claim 1, wherein the second processor further comprises: an arithmetic logic circuit; a routing circuit configured to perform the following operations on data input to the routing circuit: determining, based on the index parameters in the lookup table copy, whether an activation function value for the data is available through the lookup table copy; obtaining the activation function value for the data by looking it up in the lookup table copy in response to determining that the activation function value for the data is available through the lookup table copy; and obtaining the activation function value for the data by routing the data to the arithmetic logic circuit for evaluation in response to determining that the activation function value for the data is not available through the lookup table copy; and a tensor calculation circuit configured to calculate input tensor data based on the activation function value obtained by the routing circuit.
  5. The multi-core heterogeneous system of claim 1, wherein the second processor is further configured to: count events for which an activation function value cannot be obtained through the lookup table copy; and issue an instruction to generate a lookup table to the first processor in response to the count exceeding a predetermined threshold.
  6. The multi-core heterogeneous system of claim 4, wherein the multi-core heterogeneous system is configured to train the machine learning model, and the routing circuit is further configured to determine, with a probability, whether to obtain the activation function value of the data by table lookup or by evaluation, wherein the probability is associated with at least one of a current training round, a current loss value, and a current gradient norm of the machine learning model.
  7. The multi-core heterogeneous system of claim 2, wherein the multi-core heterogeneous system is configured to train the machine learning model, and wherein the total number of sampling points of the activation function over the hotspot intervals is determined based on a computational precision to be used by the machine learning model in inference and/or a current training round of the machine learning model.
  8. A method for a machine learning model, comprising: sampling, by a first processor, input data to obtain sample data; performing, by the first processor, a binning operation on the sample data and counting the amount of sample data falling into each binning interval, thereby obtaining a histogram indicating the sample data amount of each binning interval; selecting, by the first processor, at least one binning interval from the histogram as a hotspot interval, wherein each selected binning interval corresponds to a respective one of the hotspot intervals; in response to the first processor receiving an instruction to generate a lookup table: generating, by the first processor, a lookup table based on the histogram and the hotspot intervals, the lookup table comprising a plurality of activation function values and index parameters for looking up the activation function values, and issuing, by the first processor, an instruction to update the lookup table in response to the first processor determining that the lookup table has been generated; in response to a second processor receiving the instruction to update the lookup table, updating, by the second processor, a lookup table copy stored in the second processor with the lookup table generated by the first processor; and calculating, by the second processor, input tensor data based on the stored lookup table copy.
  9. The method of claim 8, wherein generating the lookup table based on the histogram and the hotspot intervals comprises: determining a total number of sampling points of the activation function over the hotspot intervals; determining sampling points and corresponding index parameters of the activation function in each sub-interval of the hotspot intervals based on the determined total number of sampling points, wherein the number of sampling points in each sub-interval is allocated according to the sample data amount of the corresponding binning interval; and evaluating the activation function at the determined sampling points to obtain the plurality of activation function values respectively corresponding to the sampling points.
  10. The method of claim 9, wherein the total number of sampling points of the activation function over the hotspot intervals is based on a computational precision to be used by the machine learning model, and the method further comprises: in response to the first processor receiving an instruction to update the computational precision, regenerating, by the first processor, the lookup table based on the new computational precision.
  11. The method of claim 8, further comprising: determining, by the second processor, based on the index parameters in the lookup table copy, whether an activation function value for data to be calculated is available through the lookup table copy; in response to the second processor determining that the activation function value of the data is available through the lookup table copy, obtaining, by the second processor, the activation function value of the data by looking it up in the lookup table copy; in response to the second processor determining that the activation function value of the data cannot be obtained through the lookup table copy, obtaining, by the second processor, the activation function value of the data by evaluating the activation function; and calculating, by the second processor, input tensor data based on the obtained activation function value.
  12. The method of claim 11, further comprising: counting, by the second processor, events for which an activation function value cannot be obtained through the lookup table copy; and issuing an instruction to generate a lookup table to the first processor in response to the second processor determining that the count exceeds a predetermined threshold.
  13. The method of claim 11, wherein the first processor and the second processor are used to train the machine learning model, and further comprising: determining, by the second processor, with a probability, whether to obtain the activation function value of the data by table lookup or by evaluation, wherein the probability is associated with at least one of a current training round, a current loss value, and a current gradient norm of the machine learning model.
  14. The method of claim 9, wherein the first processor and the second processor are configured to train the machine learning model, and wherein the total number of sampling points of the activation function over the hotspot intervals is determined based on a computational precision to be used by the machine learning model in inference and/or a current training round of the machine learning model.
  15. A computer-readable storage medium storing instructions which, when executed by a processor of a computer system, cause the processor to perform the method of any one of claims 8 to 14.
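As a rough illustration of the routing behavior described in claims 4, 5, 11, and 12, the sketch below serves activation values from a lookup-table copy when the input falls inside the tabulated (hotspot) range, falls back to direct evaluation (the role of the arithmetic logic circuit) otherwise, and counts misses so that regeneration of the table can be requested once a threshold is exceeded. Linear interpolation between table entries, GELU as the activation function, and the miss threshold value are assumptions of this sketch, not details specified by the claims.

```python
import math
from bisect import bisect_left

def gelu(x):
    # exact GELU via erf; stands in for the "arithmetic logic circuit" path
    return 0.5 * x * (1 + math.erf(x / math.sqrt(2)))

class LutRouter:
    """Illustrative routing circuit: table lookup on hit, direct
    evaluation on miss, plus a miss counter that signals when the
    lookup table should be regenerated."""
    def __init__(self, lut, miss_threshold=100):
        self.xs = [p[0] for p in lut]   # sampling points (sorted)
        self.ys = [p[1] for p in lut]   # activation function values
        self.misses = 0
        self.miss_threshold = miss_threshold

    def lookup(self, x):
        if self.xs[0] <= x <= self.xs[-1]:
            j = bisect_left(self.xs, x)
            if j == 0 or self.xs[j] == x:
                return self.ys[j]
            # linear interpolation between neighboring table entries
            x0, x1 = self.xs[j - 1], self.xs[j]
            y0, y1 = self.ys[j - 1], self.ys[j]
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
        # miss: value not available through the table copy
        self.misses += 1
        return gelu(x)

    def needs_regeneration(self):
        return self.misses > self.miss_threshold
```

A training-mode variant per claims 6 and 13 would additionally flip a biased coin in `lookup` to choose between the table path and the evaluation path, with the bias tied to the training round, loss, or gradient norm.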

Description

Multi-core heterogeneous system, method, and medium for machine learning model

Technical Field

The present disclosure relates to the field of artificial intelligence, and in particular to the fields of model acceleration, model training, and the like, suitable for scenarios such as the customization and deployment of a machine learning model. More particularly, it relates to a multi-core heterogeneous system for a machine learning model, a method for a machine learning model, and a computer-readable storage medium storing instructions.

Background

With the rapid development of artificial intelligence technology, large-scale machine learning models based on architectures including the Transformer have become dominant. To improve the nonlinear expression capability and training stability of a model, modern models commonly adopt advanced activation functions such as the Gaussian Error Linear Unit (GELU) and Swish/SiLU. However, these functions are mathematically defined in terms of transcendental functions such as the exponential and hyperbolic tangent, resulting in long calculation paths at the hardware level, large clock-cycle consumption, and far greater computational overhead than simple functions such as ReLU. When such activation functions are executed on existing multi-core parallel processing systems (including general-purpose Graphics Processing Units (GPUs), Neural Processing Units (NPUs), and various Tensor Processing Units (TPUs)), a hardware resource constraint and a computational efficiency bottleneck that are difficult to reconcile are encountered. The hardware architecture of modern multi-core processors is optimized mainly for large-scale matrix multiply-accumulate operations (MACs), and their chip area and power budget are largely allocated to vector processing units or tensor computing arrays.
In contrast, the Special Function Units (SFUs) or logic circuit resources for evaluating transcendental functions such as the exponential, logarithm, and hyperbolic tangent are relatively scarce (e.g., the ratio of SFUs to MACs may be as low as 1:8 in some architectures). Because GELU/SiLU and other activation functions depend heavily on such transcendental function operations, when massive amounts of data enter an activation layer concurrently during model inference, the scarce SFU resources become a blocking point in the pipeline. This forces a large number of parallel vector/tensor processing units into an idle waiting state. Such a structural mismatch of computing resources prevents the theoretical computing power of the processor from being effectively released and seriously hinders improvement of the inference throughput. For this bottleneck, existing activation function acceleration schemes typically employ a statically preset calculation strategy (e.g., a fixed piecewise fit) to accelerate the calculation. However, such designs ignore the dynamic nature of the data distribution of a machine learning model during inference. The data distributions of different network levels and different input tasks differ enormously; a static scheme often wastes hardware resources on invalid intervals where data is sparse, while key regions where data is dense cannot be supported with sufficient computational resolution. The mismatch between static resource allocation and dynamic data distribution further reduces the overall energy efficiency of the hardware system. In view of the foregoing, a method that deeply integrates data distribution features with the hierarchical storage architecture of the processor is needed to solve the technical problems of insufficient computing resources for transcendental functions, limited on-chip storage space, and the inability of static strategies to adapt to dynamic distributions.
On the other hand, the prior art also has significant drawbacks in terms of the cooperative consistency between model training and inference precision. Specifically, machine learning models typically employ high-precision numerical representations (e.g., 32-bit or 64-bit floating point, or mixed-precision floating point formats) during the training phase to ensure stable convergence of gradients and adequate expression of model parameters during backpropagation. However, to reduce the computation and storage overhead of the inference phase, the industry has generally introduced low-bit quantized inference mechanisms (such as INT8, BF16, or lower-precision fixed-point formats) to increase throughput and reduce power consumption by reducing the bit width of each activation value and weight parameter. In existing systems, this mismatch in quantization precision between training and inference creates a numerical error accumulation problem at the activation function layer. Since the activation functions (especially GELU, Swish/SiLU, etc.) are highly nonlinear and their local rate of change varies greatly over the input interval, low precision representation can lead to s