CN-121997996-A - Data processing method and related device based on accelerator
Abstract
A data processing method based on an accelerator is applied in the technical field of artificial intelligence (AI). In the method, when an AI accelerator is used to execute the operations indicated by a model, the fact that the AI accelerator is provided with computing units for executing different types of operations is exploited: a first computing unit on the AI accelerator executes a first operation on part of the input data to obtain output data. In this way, while the first computing unit continues to execute the first operation on another part of the input data, a second computing unit on the AI accelerator can simultaneously execute a second operation on the output data of the first computing unit. The first computing unit and the second computing unit thus operate concurrently, reducing the overall latency of the operations executed by the AI accelerator.
Inventors
- WANG XIANG
- GAO XUEJIAN
- ZHANG CHENGBO
- WU YI
- LI JUN
- GUO QINGHAI
Assignees
- 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Dates
- Publication Date: 2026-05-08
- Application Date: 2024-11-07
Claims (20)
- 1. An accelerator-based data processing method, the method being applied to an artificial intelligence (AI) accelerator, the AI accelerator comprising a first computing unit and a second computing unit for performing different types of operations, the method comprising: the first computing unit performs a first operation on first data to obtain second data, wherein the first data is part of the data in target data, the target data is data on which the first operation needs to be performed, and the data obtained after the first operation is performed on the target data needs to undergo a second operation; and the second computing unit performs the second operation on the second data while the first computing unit performs the first operation on third data, the third data being part of the data in the target data.
- 2. The method according to claim 1, wherein the method further comprises: the first computing unit stores the second data into a target cache; and the second computing unit reads the second data from the target cache, wherein the target cache is a cache shared by the first computing unit and the second computing unit.
- 3. The method of claim 2, wherein a plurality of caches are shared between the first computing unit and the second computing unit, and the target cache is the cache with the highest read-write speed among the plurality of caches.
- 4. The method according to any one of claims 1-3, wherein the first operation comprises a tensor operation and the second operation comprises a vector operation; or the first operation comprises a vector operation and the second operation comprises a tensor operation.
- 5. The method of any of claims 1-4, wherein the first operation is a matrix multiplication operation and the target data comprises a first matrix and a second matrix; the first data comprises a first submatrix obtained by splitting the first matrix and a second submatrix obtained by splitting the second matrix.
- 6. The method of claim 5, wherein the manner in which the first matrix and the second matrix are split is related to the size of the first matrix, the size of the second matrix, and the size of the buffer space in the first computing unit.
- 7. The method according to claim 6, wherein in the case where the split method of the first matrix and the second matrix is a first split method, the number of rows of the first sub-matrix is smaller than the number of rows of the first matrix, and the number of columns of the first sub-matrix is smaller than the number of columns of the first matrix; the number of rows of the second sub-matrix is smaller than the number of rows of the second matrix, and the number of columns of the second sub-matrix is smaller than the number of columns of the second matrix.
- 8. The method according to claim 6, wherein in the case where the split manner of the first matrix and the second matrix is the second split manner, the number of rows of the first sub-matrix is smaller than the number of rows of the first matrix, and the number of columns of the first sub-matrix is equal to the number of columns of the first matrix; the number of rows of the second sub-matrix is equal to the number of rows of the second matrix, and the number of columns of the second sub-matrix is smaller than the number of columns of the second matrix.
- 9. The method of any of claims 1-8, wherein the first data comprises a plurality of parts of sub-data; and the first computing unit performing the first operation on the first data to obtain the second data comprises: the first computing unit sequentially performs the first operation on the plurality of parts of sub-data to obtain a plurality of parts of operated data, and the plurality of parts of operated data are used to form the second data.
- 10. The method according to any one of claims 1-8, wherein the second data is obtained by the first computing unit performing the first operation once, the second data comprises a plurality of parts of sub-data, and the second computing unit performing the second operation on the second data comprises: the second computing unit sequentially performs the second operation on the plurality of parts of sub-data to obtain a plurality of parts of operated data, and the plurality of parts of operated data are used to form the data obtained by performing the second operation on the second data.
- 11. An AI accelerator, characterized in that the AI accelerator comprises a first computing unit and a second computing unit for performing different types of operations; the first computing unit is configured to perform a first operation on first data to obtain second data, wherein the first data is part of the data in target data, the target data is data on which the first operation needs to be performed, and the data obtained after the first operation is performed on the target data needs to undergo a second operation; and the second computing unit is configured to perform the second operation on the second data while the first computing unit performs the first operation on third data, where the third data is part of the data in the target data.
- 12. The AI accelerator of claim 11, wherein the first computing unit is further configured to store the second data into a target cache; the second computing unit is further configured to read the second data from the target cache; and the target cache is a cache shared by the first computing unit and the second computing unit.
- 13. The AI accelerator of claim 12, wherein a plurality of caches are shared between the first computing unit and the second computing unit, and the target cache is the cache with the highest read-write speed among the plurality of caches.
- 14. The AI accelerator of any of claims 11-13, wherein the first operation comprises a tensor operation and the second operation comprises a vector operation; or the first operation comprises a vector operation and the second operation comprises a tensor operation.
- 15. The AI accelerator of any of claims 11-14, wherein the first operation is a matrix multiplication operation and the target data includes a first matrix and a second matrix; the first data comprises a first submatrix obtained by splitting the first matrix and a second submatrix obtained by splitting the second matrix.
- 16. The AI accelerator of claim 15, wherein the manner in which the first and second matrices are split is related to a size of the first matrix, a size of the second matrix, and a size of a buffer space in the first computing unit.
- 17. The AI accelerator of claim 16, wherein, in the case where the split of the first matrix and the second matrix is a first split, the number of rows of the first sub-matrix is smaller than the number of rows of the first matrix, and the number of columns of the first sub-matrix is smaller than the number of columns of the first matrix; the number of rows of the second sub-matrix is smaller than the number of rows of the second matrix, and the number of columns of the second sub-matrix is smaller than the number of columns of the second matrix.
- 18. The AI accelerator of claim 16, wherein, in the case where the split of the first matrix and the second matrix is the second split, the number of rows of the first submatrix is smaller than the number of rows of the first matrix, and the number of columns of the first submatrix is equal to the number of columns of the first matrix; the number of rows of the second sub-matrix is equal to the number of rows of the second matrix, and the number of columns of the second sub-matrix is smaller than the number of columns of the second matrix.
- 19. The AI accelerator of any of claims 11-18, wherein the first data comprises a plurality of parts of sub-data; and the first computing unit is specifically configured to sequentially perform the first operation on the plurality of parts of sub-data to obtain a plurality of parts of operated data, where the plurality of parts of operated data are used to form the second data.
- 20. The AI accelerator of any of claims 11-18, wherein the second data is a result of a single execution of the first operation by the first computing unit, the second data comprising a plurality of parts of sub-data; and the second computing unit is specifically configured to sequentially perform the second operation on the plurality of parts of sub-data to obtain a plurality of parts of operated data, where the plurality of parts of operated data are used to form the data obtained by performing the second operation on the second data.
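Claims 5-8 describe two manners of splitting the matrices of a matrix multiplication A (M×K) × B (K×N). The following pure-Python sketch illustrates both split manners; the function names, tile sizes, and example matrices are illustrative assumptions, not taken from the patent.

```python
def matmul_ref(A, B):
    """Plain reference matrix multiplication over nested lists."""
    M, K, N = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
            for i in range(M)]

def matmul_second_split(A, B, tile_m=2, tile_n=2):
    """Second split manner (claim 8): each sub-matrix of A keeps all K
    columns and each sub-matrix of B keeps all K rows, so every block
    pair yields a finished tile of the output."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for i0 in range(0, M, tile_m):
        for j0 in range(0, N, tile_n):
            for i in range(i0, min(i0 + tile_m, M)):
                for j in range(j0, min(j0 + tile_n, N)):
                    C[i][j] = sum(A[i][k] * B[k][j] for k in range(K))
    return C

def matmul_first_split(A, B, tile_m=2, tile_n=2, tile_k=2):
    """First split manner (claim 7): sub-matrices are smaller than the
    originals in both rows and columns, so partial products along the
    shared K dimension must be accumulated."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for i0 in range(0, M, tile_m):
        for j0 in range(0, N, tile_n):
            for k0 in range(0, K, tile_k):
                for i in range(i0, min(i0 + tile_m, M)):
                    for j in range(j0, min(j0 + tile_n, N)):
                        C[i][j] += sum(A[i][k] * B[k][j]
                                       for k in range(k0, min(k0 + tile_k, K)))
    return C

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]  # 4 x 3
B = [[1, 0], [0, 1], [1, 1]]                          # 3 x 2
assert matmul_first_split(A, B) == matmul_ref(A, B)
assert matmul_second_split(A, B) == matmul_ref(A, B)
```

A practical difference between the two manners: the second split produces completed output tiles immediately, which suits handing each tile straight to the second computing unit, while the first split defers completion until all K-blocks of a tile have been accumulated but allows smaller working buffers.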
Description
Data processing method and related device based on accelerator

Technical Field

The present application relates to the technical field of artificial intelligence (AI), and in particular to an accelerator-based data processing method and a related device.

Background

Currently, deep learning models have become a hot research direction in the field of artificial intelligence. The model architecture represented by large language models is considered a potential approach to realizing general artificial intelligence, and has been widely applied in fields such as natural language generation, speech translation, intelligent customer service, and intelligent assistants. When a deep learning model runs, its input data must be computed by a series of serially connected operators before a model prediction result is finally obtained. For serially connected operators in a deep learning model, the input data of the next operator is the output data of the previous operator, so the next operator often has to wait for the previous operator to finish computing before it can start. Since a deep learning model mainly consists of a series of serial operators, its computation latency is closely related to its scale. As the scale of deep learning models continues to grow, model performance improves but computation latency increases as well. How to reduce the computation latency of a deep learning model under the existing model architecture therefore becomes a problem to be solved.

Disclosure of Invention

The present application provides an accelerator-based data processing method that can improve the overall efficiency of the operations executed by an AI accelerator.
In a first aspect, an accelerator-based data processing method is provided, the method being applied to an AI accelerator. The AI accelerator includes a first computing unit and a second computing unit for performing different types of operations. In the method, the first computing unit performs a first operation on first data to obtain second data, where the first data is part of the data in target data, the target data is data on which the first operation needs to be performed, and the data obtained after the first operation is performed on the target data needs to undergo a second operation. The target data may specifically be the input data of an AI model, or the output data of a certain neural network layer in the AI model. The second computing unit performs the second operation on the second data while the first computing unit performs the first operation on third data, the third data being part of the data in the target data. That is, the second computing unit does not need to wait until the first computing unit has performed the first operation on all of the target data; it can start computing as soon as the first computing unit has processed part of the data, thereby shortening the waiting time of the second computing unit as much as possible. In this scheme, when the AI accelerator executes the operations indicated by a model, the fact that the AI accelerator is provided with computing units for executing different types of operations is exploited: the first computing unit on the AI accelerator performs the first operation on part of the input data to obtain output data.
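The overlap described in the first aspect can be sketched as a producer/consumer pipeline. This is an illustrative analogy only: the threads stand in for the first and second computing units, the bounded queue stands in for the shared target cache, and all names are assumptions rather than the patented implementation.

```python
import threading
import queue

def run_pipeline(target_data, first_op, second_op):
    """The 'first computing unit' (producer) applies first_op to successive
    parts of the target data; the 'second computing unit' (consumer) applies
    second_op to each produced part while later parts are still in flight."""
    shared_cache = queue.Queue(maxsize=1)  # shared buffer between the units
    results = []

    def first_unit():
        for part in target_data:               # first data, third data, ...
            shared_cache.put(first_op(part))   # write second data to the cache
        shared_cache.put(None)                 # sentinel: no more parts

    def second_unit():
        while (second_data := shared_cache.get()) is not None:
            results.append(second_op(second_data))

    t1 = threading.Thread(target=first_unit)
    t2 = threading.Thread(target=second_unit)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return results

# Toy stand-ins for a tensor op followed by a vector op.
out = run_pipeline([1, 2, 3], first_op=lambda x: x * 10,
                   second_op=lambda x: x + 1)
# out == [11, 21, 31]
```

Because the queue hands each produced part over as soon as it is ready, the consumer starts after the first part rather than after the whole dataset, mirroring how the second computing unit avoids waiting for the first computing unit to finish all of the target data.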
In this way, while the first computing unit continues to perform the first operation on another part of the input data, the second computing unit on the AI accelerator can simultaneously perform the second operation on the output data of the first computing unit. The first computing unit and the second computing unit thus operate concurrently, which reduces the overall latency of the operations executed by the AI accelerator. Specifically, for two serially connected operators of different operation types in a deep learning model, the scheme splits the operators' input data while preserving the data dependency between the two operators, so that the first computing unit and the second computing unit in the AI accelerator execute the operators multiple times to process the split data. After the first computing unit executes the former operator and obtains output data, the second computing unit can execute the latter operator based on that output data. This ensures, as far as possible, that the latter operator can run during the execution of the former operator, and avoids the situation where the second computing unit can only start running the latter operator after the first computing unit has finished the former operator.
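The latency benefit of this splitting can be made concrete with a back-of-the-envelope model. This is an illustration under simplifying assumptions (fixed per-part times, an unblocked buffer between the units), not an analysis from the patent.

```python
def serial_latency(n_parts, t_first, t_second):
    """No overlap: the second unit waits for the first operation to finish
    on all n parts before it starts the second operation."""
    return n_parts * t_first + n_parts * t_second

def pipelined_latency(n_parts, t_first, t_second):
    """With splitting, the second operation on part i overlaps the first
    operation on part i+1: after a fill of t_first, the pipeline is paced
    by the slower unit, then drains with the last t_second."""
    return t_first + (n_parts - 1) * max(t_first, t_second) + t_second

print(serial_latency(4, 3, 2))     # 4*3 + 4*2 = 20
print(pipelined_latency(4, 3, 2))  # 3 + 3*3 + 2 = 14
```

With four parts and per-part times of 3 and 2, the pipelined schedule finishes in 14 time units instead of 20; the saving grows with the number of parts, approaching the point where only the slower unit's total time matters.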