CN-121979486-A - Single-precision floating point matrix multiplication calculation unit, split mapping method and accelerator

Abstract

The invention provides a single-precision floating-point matrix multiplication calculation unit comprising a preprocessing module, a linear evaluation arithmetic logic module, a low-bit-width integer calculation array, an interpolation merging module and a format conversion module. The preprocessing module reads single-precision floating-point data and splits the mantissa domain part of each matrix into a plurality of groups of low-bit-width integer matrices. The linear evaluation arithmetic logic module performs linear combination operations on these groups of low-bit-width integer matrices to generate a plurality of groups of point-value matrices. The low-bit-width integer calculation array performs low-bit-width integer matrix multiplication on the groups of point-value matrices to obtain a plurality of groups of point-value product matrices. The interpolation merging module performs interpolation and weighted merging on the groups of point-value product matrices to generate a mantissa-domain product result. The format conversion module combines the mantissa-domain product result with the exponent domain part of each matrix to generate the single-precision floating-point matrix multiplication result. The invention improves overall throughput and energy efficiency and reduces mapping computation and system overhead.

Inventors

  • HAO YIFAN
  • XIA RUIYANG
  • ZHAO YUMING
  • ZHAO YONGWEI
  • Du Zidong
  • LIU YANG

Assignees

  • Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所)

Dates

Publication Date
20260505
Application Date
20260228

Claims (12)

  1. A single-precision floating-point matrix multiplication calculation unit, comprising: a preprocessing module for reading single-precision floating-point data and splitting the mantissa domain part of each matrix into a plurality of groups of low-bit-width integer matrices; a linear evaluation arithmetic logic module, connected to the preprocessing module, for performing linear combination operations on the plurality of groups of low-bit-width integer matrices to generate a plurality of groups of point-value matrices; a low-bit-width integer calculation array, connected to the linear evaluation arithmetic logic module, for performing low-bit-width integer matrix multiplication on the plurality of groups of point-value matrices to obtain a plurality of groups of point-value product matrices; an interpolation merging module, connected to the low-bit-width integer calculation array, for performing interpolation and weighted merging on the plurality of groups of point-value product matrices to generate a mantissa-domain product result; and a format conversion module, connected to the interpolation merging module, for combining the mantissa-domain product result with the exponent domain part of each matrix to generate a single-precision floating-point matrix multiplication result.
  2. The single-precision floating-point matrix multiplication calculation unit of claim 1, wherein the preprocessing module comprises a mantissa splitting module and a radix configuration register; the radix configuration register pre-stores the splitting radices 2^0, 2^-8 and 2^-16; and the mantissa splitting module splits the mantissa domain part of each matrix in the single-precision floating-point data into three groups of 8-bit integer matrices based on the splitting radices, to obtain three groups of low-bit-width integer matrices.
  3. The single-precision floating-point matrix multiplication calculation unit of claim 1, wherein the linear evaluation arithmetic logic module comprises a plurality of groups of parallel adder/subtractors and shifters, and is configured to perform, by means of a preset evaluation point set {0, 1, -1, 2, ∞}, linear combination on the three groups of low-bit-width integer matrices to generate five groups of point-value matrices corresponding to the evaluation point set, the linear combination being achieved only by addition, subtraction and shift operations.
  4. The single-precision floating-point matrix multiplication calculation unit of claim 1, wherein the low-bit-width integer calculation array is an accelerator-native 8-bit integer matrix multiplication array; and the five groups of point-value matrices are paired pairwise, and low-bit-width integer matrix multiplication is performed on each pair to obtain the corresponding five groups of point-value product matrices.
  5. The single-precision floating-point matrix multiplication calculation unit of claim 1, wherein the interpolation merging module comprises a coefficient register, an interpolation module and a weighted addition tree; the coefficient register pre-stores the fixed coefficients of the interpolation operation and preset radix weights; the interpolation module performs interpolation on the five groups of point-value product matrices to generate five groups of interpolation component matrices; and the weighted addition tree performs shift-and-sum according to the preset radix weights to obtain the mantissa-domain product result.
  6. The single-precision floating-point matrix multiplication calculation unit of claim 1, further comprising a global control module with built-in pipeline scheduling logic that synchronizes the clock cycles of the modules via a control bus, such that while the preprocessing module splits the (N+1)-th block matrix, the low-bit-width integer calculation array performs the multiplication of the N-th block matrix and the interpolation merging module processes the product result of the (N-1)-th block matrix.
  7. An AI accelerator comprising at least a computing unit and a multi-level cache, wherein the computing unit comprises: a preprocessing module for reading single-precision floating-point data and splitting the mantissa domain part of each matrix into a plurality of groups of low-bit-width integer matrices; a linear evaluation arithmetic logic module, connected to the preprocessing module, for performing linear combination operations on the plurality of groups of low-bit-width integer matrices to generate a plurality of groups of point-value matrices; a low-bit-width integer calculation array, connected to the linear evaluation arithmetic logic module, for performing low-bit-width integer matrix multiplication on the plurality of groups of point-value matrices to obtain a plurality of groups of point-value product matrices; an interpolation merging module, connected to the low-bit-width integer calculation array, for performing interpolation and weighted merging on the plurality of groups of point-value product matrices to generate a mantissa-domain product result; and a format conversion module, connected to the interpolation merging module, for combining the mantissa-domain product result with the exponent domain part of each matrix to generate a single-precision floating-point matrix multiplication result and transmitting that result to the multi-level cache; and wherein the multi-level cache stores the single-precision floating-point data and the single-precision floating-point matrix multiplication result.
  8. A single-precision floating-point matrix multiplication split mapping method, executed on an accelerator, comprising: splitting the mantissa domain part of each matrix of the input single-precision floating-point data into a plurality of groups of low-bit-width integer matrices; performing a fixed linear combination operation on the plurality of groups of low-bit-width integer matrices to generate a plurality of groups of point-value matrices; performing low-bit-width integer matrix multiplication on the plurality of groups of point-value matrices to obtain a plurality of groups of point-value product matrices; performing interpolation and weighted merging on the plurality of groups of point-value product matrices to obtain a mantissa-domain product result of the single-precision floating-point matrix multiplication; and combining the mantissa-domain product result with the exponent domain part of the single-precision floating-point numbers to output the final single-precision floating-point matrix multiplication result.
  9. The method of claim 8, wherein performing interpolation and weighted merging on the plurality of groups of point-value product matrices to obtain the mantissa-domain product result further comprises: performing interpolation on the plurality of groups of point-value product matrices to obtain a plurality of groups of interpolation component matrices; and performing weighted merging on the plurality of groups of interpolation component matrices to obtain the mantissa-domain product result of the single-precision floating-point matrix multiplication.
  10. The method of claim 8, wherein the splitting comprises: dividing the mantissa domain part of each matrix in the single-precision floating-point data into three groups of 8-bit integer matrices according to preset splitting radices, to obtain three groups of low-bit-width integer matrices, wherein the splitting radices are 2^0, 2^-8 and 2^-16, respectively.
  11. The method of claim 8, wherein linear combination is performed on the three groups of low-bit-width integer matrices by means of a preset evaluation point set {0, 1, -1, 2, ∞} to generate five groups of point-value matrices corresponding to the evaluation point set, the linear combination being realized only through addition, subtraction and shift operations; and wherein the plurality of groups of point-value matrices are five groups.
  12. The method of claim 11, wherein the five groups of point-value matrices are paired one-to-one, and low-bit-width integer matrix multiplication is performed on each pair to obtain the corresponding five groups of point-value product matrices.
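
The split, evaluate, multiply, interpolate and merge steps recited in the method claims can be illustrated with a minimal NumPy sketch. It rests on several assumptions that are not stated in the claims: the fifth evaluation point is taken to be ∞ (the usual choice for Toom-Cook-3, the scheme named in the Description), mantissas are treated as unsigned 24-bit integers with the low limb as the constant coefficient (equivalent to radices 2^0, 2^-8, 2^-16 up to a common scale factor of 2^16), and the interpolation sequence is one standard Toom-Cook-3 sequence rather than the coefficients actually held in the coefficient register. All function names are hypothetical. The sketch uses 64-bit integers throughout and does not model the 8-bit operand range of the array, so evaluated point values (for example at t = 2) may exceed 8 bits here.

```python
import numpy as np

def split3(m):
    """Split a matrix of 24-bit unsigned integers into three 8-bit limb
    matrices, low limb first (assumed layout)."""
    m = np.asarray(m, dtype=np.int64)
    return m & 0xFF, (m >> 8) & 0xFF, (m >> 16) & 0xFF

def evaluate(x0, x1, x2):
    """Evaluate X(t) = x0 + x1*t + x2*t^2 at t = 0, 1, -1, 2, inf
    using only additions, subtractions and shifts."""
    s = x0 + x2
    return [x0, s + x1, s - x1, x0 + (x1 << 1) + (x2 << 2), x2]

def tc3_matmul(a, b):
    """Integer matrix product a @ b via five limb-level matrix products."""
    pa = evaluate(*split3(a))
    pb = evaluate(*split3(b))
    # One matrix product per evaluation point (five in total).
    p0, p1, pm1, p2, pinf = [x @ y for x, y in zip(pa, pb)]
    # Interpolate the coefficients of C(t) = c0 + c1*t + ... + c4*t^4.
    c0 = p0
    c4 = pinf
    c2 = (p1 + pm1) // 2 - c0 - c4
    s = (p1 - pm1) // 2                      # equals c1 + c3
    t = (p2 - c0 - 4 * c2 - 16 * c4) // 2    # equals c1 + 4*c3
    c3 = (t - s) // 3
    c1 = s - c3
    # Weighted merge: shift each coefficient by its radix weight and sum.
    return c0 + (c1 << 8) + (c2 << 16) + (c3 << 24) + (c4 << 32)
```

Every division in the interpolation step is exact, so integer floor division loses nothing. For small inner dimensions the result can be compared directly with `a @ b` computed in 64-bit integers.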

Description

Single-precision floating point matrix multiplication calculation unit, split mapping method and accelerator

Technical Field

The invention relates to the technical field of hardware acceleration, and in particular to a single-precision floating-point matrix multiplication calculation unit, a split mapping method and an accelerator.

Background

With the growth of workloads such as deep learning and scientific computing, single-precision floating-point (FP32) matrix multiplication remains a basic operator in many key scenarios and directly affects model training quality and the reliability of scientific computing results. However, existing AI accelerators typically devote most of their area and energy budget to low-bit-width matrix computing arrays such as INT8 (an 8-bit signed integer data type), while FP32 often relies only on a separate, smaller computational path. When a workload must be executed in FP32, it is therefore difficult to fully reuse the high-throughput, energy-efficient resources of the low-bit-width array, which limits overall throughput and reduces energy efficiency.

In addition, to reuse the low-bit-width array, a naive mantissa fragment expansion method is often adopted to split an FP32 operation into a combination of several low-bit-width operations. This mapping either requires too many sub-operations or introduces a large amount of intermediate data movement and control overhead, which offsets the benefit of array reuse, so that end-to-end performance and energy efficiency do not improve appreciably.
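
To make the sub-operation count of the naive approach concrete, the following is a minimal sketch of a schoolbook fragment expansion, assuming 24-bit unsigned mantissas split into three 8-bit fragments. The function names and the fragment layout are illustrative assumptions, not taken from the patent. With three fragments per operand, every fragment of one matrix meets every fragment of the other, giving nine low-bit-width matrix products.

```python
import numpy as np

LIMB_BITS = 8  # assumed fragment width, matching an INT8-style array

def split_limbs(m):
    """Split 24-bit unsigned integers into three 8-bit fragments, low first."""
    m = np.asarray(m, dtype=np.int64)
    return [(m >> (LIMB_BITS * i)) & 0xFF for i in range(3)]

def naive_limb_matmul(a, b):
    """Schoolbook expansion: multiply every fragment pair and sum the
    shifted partial products. Returns the product and the number of
    fragment-level matrix multiplications performed."""
    a_limbs, b_limbs = split_limbs(a), split_limbs(b)
    rows, cols = a_limbs[0].shape[0], b_limbs[0].shape[1]
    total = np.zeros((rows, cols), dtype=np.int64)
    products = 0
    for i, ai in enumerate(a_limbs):
        for j, bj in enumerate(b_limbs):
            total += (ai @ bj) << (LIMB_BITS * (i + j))
            products += 1
    return total, products
```

For any pair of compatible matrices the returned count is 9, which is the sub-operation cost that a more economical mapping would aim to reduce.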
Disclosure of Invention

To address the shortcomings of the prior art, the invention provides a single-precision floating-point matrix multiplication calculation unit, a split mapping method and an accelerator based on matrix-level split mapping using Toom-Cook-3 (TC-3). This improves the reuse efficiency of the low-bit-width array, allows FP32 matrix multiplication to make fuller use of existing low-bit-width array resources, and improves overall throughput and energy efficiency.

One aspect of the present invention provides a single-precision floating-point matrix multiplication calculation unit, comprising: a preprocessing module for reading single-precision floating-point data and splitting the mantissa domain part of each matrix into a plurality of groups of low-bit-width integer matrices; a linear evaluation arithmetic logic module, connected to the preprocessing module, for performing linear combination operations on the plurality of groups of low-bit-width integer matrices to generate a plurality of groups of point-value matrices; a low-bit-width integer calculation array, connected to the linear evaluation arithmetic logic module, for performing low-bit-width integer matrix multiplication on the plurality of groups of point-value matrices to obtain a plurality of groups of point-value product matrices; an interpolation merging module, connected to the low-bit-width integer calculation array, for performing interpolation and weighted merging on the plurality of groups of point-value product matrices to generate a mantissa-domain product result; and a format conversion module, connected to the interpolation merging module, for combining the mantissa-domain product result with the exponent domain part of each matrix to generate and store the single-precision floating-point matrix multiplication result.
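
As an illustration of what the preprocessing and format conversion modules operate on, the sketch below separates FP32 values into sign, exponent and mantissa fields, splits the mantissa into three 8-bit fragments, and reassembles the original value. It is a sketch under assumptions: it uses the standard IEEE 754 binary32 layout, handles only normal (non-zero, non-subnormal, finite) values, and all function names are hypothetical. How the patent aligns exponents across a matrix is not covered here.

```python
import numpy as np

def decompose_fp32(x):
    """Return (sign, biased exponent, 24-bit mantissa) for normal FP32 values."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = (bits & 0x7FFFFF) | 0x800000  # restore the implicit leading 1
    return sign, exponent, mantissa

def split_mantissa(mantissa):
    """Split a 24-bit mantissa into 8-bit fragments whose relative weights
    are 2^0, 2^-8 and 2^-16 with respect to the high fragment."""
    hi = (mantissa >> 16) & 0xFF
    mid = (mantissa >> 8) & 0xFF
    lo = mantissa & 0xFF
    return hi, mid, lo

def recompose_fp32(sign, exponent, hi, mid, lo):
    """Reassemble FP32 values from the fields and fragments above."""
    mantissa = (hi << 16) | (mid << 8) | lo
    bits = (sign << 31) | (exponent << 23) | (mantissa & 0x7FFFFF)
    return bits.astype(np.uint32).view(np.float32)
```

Splitting followed by recomposition is lossless for normal inputs, which makes the round trip a convenient sanity check before any arithmetic is applied to the fragments.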
In an embodiment of the present invention, the preprocessing module includes a mantissa splitting module and a radix configuration register, where the radix configuration register pre-stores the splitting radices 2^0, 2^-8 and 2^-16, and the mantissa splitting module splits the mantissa domain part of each matrix in the single-precision floating-point data into three groups of 8-bit integer matrices based on the splitting radices, so as to obtain three groups of low-bit-width integer matrices.

In one embodiment of the present invention, the linear evaluation arithmetic logic module includes a plurality of groups of parallel adder/subtractors and shifters, and is configured to perform, by means of a preset evaluation point set {0, 1, -1, 2, ∞}, linear combination on the three groups of low-bit-width integer matrices to generate five groups of point-value matrices corresponding to the evaluation point set, wherein the linear combination is realized only through addition, subtraction and shift operations.

In one embodiment of the present invention, the low-bit-width integer calculation array is an accelerator-native 8-bit integer matrix multiplication array; and the five groups of point-value matrices are paired pairwise, and low-bit-width integer matrix multiplication is respectively