
CN-121070567-B - Method and equipment for scheduling mixed computing tasks on TCU and CUDA cores

CN 121070567 B

Abstract

The application provides a method and equipment for scheduling mixed computing tasks on a TCU and a CUDA core. The method comprises: determining the value of a theoretical execution time ratio according to the ratio between the estimated execution time of the CUDA core and the estimated execution time of the TCU, wherein the CUDA core performs matrix multiplication for the CUDA core computing tasks among all computing tasks and the TCU performs matrix multiplication for the TCU computing tasks among all computing tasks; inputting the value of the theoretical execution time ratio into a TCU utilization prediction model to obtain the TCU utilization; and, if the TCU utilization is smaller than a TCU utilization threshold, sequentially and serially scheduling all CUDA core computing tasks and all TCU computing tasks in a single compute stream. The application can accurately predict and select the optimal execution mode for each computing task without extra profiling overhead, thereby remarkably improving the overall computing throughput and the utilization efficiency of hardware resources.
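As an illustrative sketch (not part of the patent text), the serial-versus-parallel decision summarized in the abstract can be expressed as follows. The function names and threshold values are hypothetical placeholders, and the utilization predictor is passed in as a callable:

```python
def choose_schedule(t_cuda_est, t_tcu_est, predict_tcu_util,
                    atomic_ops=0, total_ops=1,
                    util_threshold=0.5, atomic_threshold=0.1):
    """Return 'serial' (single compute stream) or 'parallel' (separate
    streams for TCU and CUDA-core tasks).  Thresholds are illustrative
    placeholders, not values taken from the patent."""
    # Theoretical execution time ratio: estimated CUDA-core time over TCU time.
    ratio = t_cuda_est / t_tcu_est
    # Predicted TCU utilization for this ratio.
    util = predict_tcu_util(ratio)
    if util < util_threshold:
        return "serial"
    # Atomic-operation ratio among all computation instructions.
    if atomic_ops / total_ops > atomic_threshold:
        return "serial"
    return "parallel"
```

With a predictor reporting high utilization and a workload with few atomic operations, the tasks would be decoupled into separate compute streams; otherwise they fall back to serial scheduling in one stream.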

Inventors

  • LI Shigang
  • SHI Jinliang
  • FU Rongtian
  • MA Huadong
  • WU Tong
  • XU Youxuan

Assignees

  • Beijing University of Posts and Telecommunications (北京邮电大学)

Dates

Publication Date
2026-05-12
Application Date
2025-09-02

Claims (10)

  1. A method for scheduling mixed computing tasks on a TCU and CUDA core, characterized by comprising the following steps:
     determining the value of a preset theoretical execution time ratio according to the ratio between the CUDA core estimated execution time corresponding to the CUDA core in a graphics processing unit (GPU) and the tensor core (TCU) estimated execution time corresponding to the TCU in the GPU, wherein the CUDA core performs matrix multiplication for the CUDA core computing tasks among the computing tasks;
     inputting the value of the theoretical execution time ratio into a preset TCU utilization prediction model to obtain the corresponding TCU utilization;
     if the TCU utilization is smaller than a preset TCU utilization threshold, sequentially and serially scheduling each CUDA core computing task and each TCU computing task in a single compute stream;
     if the TCU utilization is equal to or greater than the TCU utilization threshold, determining the value of a preset atomic operation ratio according to the ratio between the total number of atomic operation instructions and the total number of all computation instructions executed by the CUDA core computing tasks and the TCU computing tasks;
     if the value of the atomic operation ratio is greater than a preset atomic index threshold, sequentially and serially scheduling each CUDA core computing task and each TCU computing task in a single compute stream; and
     if the value of the atomic operation ratio is smaller than or equal to the atomic index threshold, decoupling the TCU computing tasks and the CUDA core computing tasks into different compute streams for parallel scheduling.
  2. The method for scheduling mixed computing tasks on a TCU and CUDA core according to claim 1, wherein the TCU utilization prediction model is as shown in formula (1):
     U = A·e^(−k·r) + C    (1)
     wherein r represents the theoretical execution time ratio, U represents the TCU utilization, and A, k and C are the fitted coefficients of the exponential decay model.
  3. The method for scheduling mixed computing tasks on a TCU and CUDA core according to claim 2, further comprising, before inputting the value of the theoretical execution time ratio into the preset TCU utilization prediction model:
     constructing an exponential decay model between the theoretical execution time ratio and the TCU utilization, wherein the exponential decay model is as shown in formula (2):
     U = A·e^(−k·r) + C    (2)
     wherein A, k and C represent undetermined coefficients; and
     taking the actual TCU utilizations measured under different values of the theoretical execution time ratio as observation data, and determining the values of the undetermined coefficients in the exponential decay model by function fitting to obtain the corresponding TCU utilization prediction model.
  4. The method for scheduling mixed computing tasks on a TCU and CUDA core according to any one of claims 1 to 3, further comprising, before determining the value of the preset theoretical execution time ratio according to the ratio between the CUDA core estimated execution time corresponding to the CUDA core in the GPU and the tensor core (TCU) estimated execution time corresponding to the TCU in the GPU:
     determining the partition granularity and the non-zero element threshold corresponding to the sparse matrix multiplication based on the type of the sparse matrix multiplication currently to be computed, wherein the type of sparse matrix multiplication comprises sparse-dense matrix multiplication (SpMM) and sampled dense-dense matrix multiplication (SDDMM);
     performing load distribution on the sub-matrices corresponding to each window in the sparse matrix multiplication according to the partition granularity and the non-zero element threshold, to obtain the computing groups corresponding to each window, wherein each computing group comprises a CUDA core computing block corresponding to the CUDA core and/or a TCU block corresponding to the TCU; and
     performing load balancing on each computing group to obtain a plurality of computing tasks corresponding to each computing group, wherein each computing task comprises a CUDA core computing task corresponding to a CUDA core computing block and/or a TCU computing task corresponding to a TCU block, the CUDA core performing matrix multiplication for the CUDA core computing tasks and/or the TCU performing matrix multiplication for the TCU computing tasks.
  5. The method for scheduling mixed computing tasks on a TCU and CUDA core according to claim 4, wherein determining the partition granularity and the non-zero element threshold corresponding to the sparse matrix multiplication based on the type of the sparse matrix multiplication currently to be computed comprises:
     if the type of the sparse matrix multiplication currently to be computed is SpMM, determining that the partition granularity corresponding to SpMM is non-zero element granularity, and acquiring a first non-zero element threshold corresponding to SpMM; and
     if the type of the sparse matrix multiplication currently to be computed is SDDMM, determining that the partition granularity corresponding to SDDMM is TCU block granularity, and acquiring a second non-zero element threshold corresponding to SDDMM.
  6. The method for scheduling mixed computing tasks on a TCU and CUDA core according to claim 5, wherein performing load distribution on the sub-matrices corresponding to each window in the sparse matrix multiplication according to the partition granularity and the non-zero element threshold, to obtain the computing groups corresponding to each window, comprises:
     if the type of the sparse matrix multiplication currently to be computed is SpMM, judging, based on the non-zero element granularity, whether the number of non-zero elements in each column of each sub-matrix corresponding to each window is equal to or greater than the first non-zero element threshold; and, for each sub-matrix, compressing the columns equal to or greater than the first non-zero element threshold into TCU blocks and assigning the columns smaller than the first non-zero element threshold to CUDA core computing blocks, to obtain the computing groups corresponding to each window; and
     if the type of the sparse matrix multiplication currently to be computed is SDDMM, sorting the columns of the sub-matrix corresponding to each window in descending order of the number of non-zero elements they contain; dividing the sorted columns of each window into TCU judging units based on the TCU block granularity, and judging whether the number of non-zero elements in each TCU judging unit is equal to or greater than the second non-zero element threshold; and, for each sub-matrix, assigning the elements of each TCU judging unit equal to or greater than the second non-zero element threshold to TCU blocks and assigning the elements of each TCU judging unit smaller than the second non-zero element threshold to CUDA core computing blocks, to obtain the computing groups corresponding to each window.
  7. The method for scheduling mixed computing tasks on a TCU and CUDA core according to claim 4, wherein performing load balancing on each computing group to obtain a plurality of computing tasks corresponding to each computing group comprises:
     performing a preset splitting step on the computing groups corresponding to each window to obtain a plurality of computing tasks corresponding to each computing group, wherein the splitting step comprises:
     if the computing group contains TCU blocks, judging whether the number of columns in the TCU blocks exceeds a first column number threshold, and if so, dividing each TCU block in the computing group into a plurality of TCU block computing tasks based on the first column number threshold; and
     if the computing group contains CUDA core computing blocks, judging whether the number of CUDA core computing blocks in the computing group exceeds a second number threshold, and if so, dividing the CUDA core computing blocks in the computing group into a plurality of CUDA core computing tasks based on the second number threshold, determining the CUDA core computing blocks in a CUDA core computing task whose number of CUDA core computing blocks exceeds a third number threshold as long CUDA core computing blocks, and determining the CUDA core computing blocks in a CUDA core computing task whose number of CUDA core computing blocks does not exceed the third number threshold as short CUDA core computing blocks;
     correspondingly, the method further comprises: recording, based on a plurality of auxiliary arrays, the split information corresponding to each computing task obtained by splitting in each computing group of each window, wherein the auxiliary arrays comprise: a first array for recording the number of TCU blocks corresponding to each computing task in each window; a second array for recording the number of non-zero elements corresponding to each computing task in each window; a third array for recording the index of each computing task in its original window; a fourth array for recording the index of each computing task in its original row; and a fifth array for recording whether each computing task is to execute an atomic operation.
  8. The method for scheduling mixed computing tasks on a TCU and CUDA core according to claim 4, further comprising, before determining the partition granularity and the non-zero element threshold corresponding to the sparse matrix multiplication based on the type of the sparse matrix multiplication currently to be computed:
     constructing a data access cost model for each of the sparse-dense matrix multiplication SpMM and the sampled dense-dense matrix multiplication SDDMM, wherein the data access cost model represents the ratio of the data access cost of the CUDA core to that of the TCU; and
     determining the partition granularity of SpMM based on the data access cost model corresponding to SpMM, and determining the partition granularity of SDDMM based on the data access cost model corresponding to SDDMM.
  9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for scheduling mixed computing tasks on a TCU and CUDA core according to any one of claims 1 to 8.
  10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method for scheduling mixed computing tasks on a TCU and CUDA core according to any one of claims 1 to 8.
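To illustrate the load-distribution rule of claim 6 for the SpMM case, here is a hypothetical sketch: given the per-column non-zero counts of one window's sub-matrix, columns at or above the first non-zero element threshold are compressed into TCU blocks and the remaining columns become CUDA core computing blocks. The names and the dict-based representation are assumptions, not the patent's data structures:

```python
def partition_spmm_window(col_nnz, nnz_threshold):
    """Split one window's sub-matrix columns into TCU blocks and
    CUDA-core computing blocks at non-zero-element granularity.

    col_nnz: mapping of column index -> number of non-zero elements.
    Returns (tcu_cols, cuda_cols), together forming one computing group.
    """
    tcu_cols, cuda_cols = [], []
    for col, nnz in sorted(col_nnz.items()):
        if nnz >= nnz_threshold:
            tcu_cols.append(col)    # dense enough: compress into a TCU block
        else:
            cuda_cols.append(col)   # sparse remainder: CUDA-core block
    return tcu_cols, cuda_cols
```

For example, with a threshold of 4, columns holding 5 and 7 non-zeros would go to the TCU while columns holding 1 and 2 non-zeros would go to the CUDA core.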
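The generic splitting step of claim 7, which divides oversized blocks into computing tasks no larger than a threshold and then tags CUDA core computing blocks as long or short, can be sketched as follows. The chunking-by-threshold interpretation and all names are assumptions for illustration:

```python
def split_into_tasks(units, limit):
    """Divide a list of work units (e.g., TCU-block columns or CUDA-core
    computing blocks) into computing tasks of at most `limit` units each."""
    return [units[i:i + limit] for i in range(0, len(units), limit)]

def classify_cuda_tasks(tasks, long_threshold):
    """Tag each CUDA-core computing task as 'long' or 'short' depending on
    whether it holds more than `long_threshold` blocks (claim 7's third
    number threshold)."""
    return ["long" if len(t) > long_threshold else "short" for t in tasks]
```

Alongside the split, the five auxiliary arrays of claim 7 would record, per task, the TCU block count, non-zero count, window index, row index, and whether an atomic operation is required.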

Description

Method and equipment for scheduling mixed computing tasks on TCU and CUDA cores

Technical Field

The application relates to the technical field of data processing, and in particular to a method and equipment for scheduling mixed computing tasks on a tensor core unit (TCU) and a compute unified device architecture (CUDA) core.

Background

The graphics processing unit (GPU), also known as a graphics processor, is a coprocessor for image and graphics operations and is widely used in personal computers, workstations, and some mobile devices. With the rapid development of artificial intelligence and scientific computing, sparse-dense matrix multiplication SpMM (Sparse Matrix-Matrix Multiplication) and sampled dense-dense matrix multiplication SDDMM (Sampled Dense-Dense Matrix Multiplication) have gradually become key computational kernels in large-scale data processing and analysis tasks, and are widely applied in machine learning, scientific computing, recommendation systems, and other fields. On a GPU, these computations typically involve two heterogeneous computing resources: the CUDA core (Compute Unified Device Architecture core) and the tensor core (TCU). Existing GPU implementations adopt a mixed TCU/CUDA-core computation strategy, using a single kernel function to simultaneously schedule the TCU and CUDA core computing tasks of a partitioned mixed workload, so as to improve overall parallelism and resource utilization. However, this strategy faces two major problems. First, it has been shown that executing the TCU in parallel with the CUDA core may reduce the compute frequency (e.g., through thermal throttling or resource contention), so that in some scenarios it is less efficient than using the TCU alone.
Second, in unstructured sparse scenarios, the load between the TCU and the CUDA core is often difficult to distribute uniformly within a thread block; because the two types of computing units differ in compute capability, the execution times of the thread bundles (warps) are difficult to align, part of the thread resources easily sit idle, the overall computing efficiency drops, and the resource utilization of the GPU still suffers.

Disclosure of Invention

In view of this, embodiments of the present application provide a method and equipment for scheduling mixed computing tasks on a TCU and CUDA core that obviate or mitigate one or more disadvantages of the prior art. One aspect of the present application provides a method for scheduling mixed computing tasks on a TCU and CUDA core, comprising: determining the value of a preset theoretical execution time ratio according to the ratio between the CUDA core estimated execution time corresponding to the CUDA core in a graphics processing unit (GPU) and the tensor core (TCU) estimated execution time corresponding to the TCU in the GPU, wherein the CUDA core performs matrix multiplication for the CUDA core computing tasks among the computing tasks; inputting the value of the theoretical execution time ratio into a preset TCU utilization prediction model to obtain the corresponding TCU utilization; and, if the TCU utilization is smaller than a preset TCU utilization threshold, sequentially and serially scheduling each CUDA core computing task and each TCU computing task in a single compute stream.
In some embodiments of the application, the method further comprises: if the TCU utilization is equal to or greater than the TCU utilization threshold, determining the value of a preset atomic operation ratio according to the ratio between the total number of atomic operation instructions and the total number of all computation instructions executed by the CUDA core computing tasks and the TCU computing tasks; and, if the value of the atomic operation ratio is greater than a preset atomic index threshold, sequentially and serially scheduling each CUDA core computing task and each TCU computing task in a single compute stream. In some embodiments of the application, the method further comprises: if the value of the atomic operation ratio is smaller than or equal to the atomic index threshold, decoupling the TCU computing tasks and the CUDA core computing tasks into different compute streams for parallel scheduling. In some embodiments of the application, the TCU utilization prediction model is as shown in formula (1), wherein r represents the theoretical execution time ratio and U represents the TCU utilization. In some embodiments of the present application, before inputting the value of the theoretical execution time ratio into the preset TCU utilization prediction model, the method further includes: Constructing an exponential decay model between a theoretic
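The exponential decay relationship of formulas (1) and (2) can be illustrated numerically. The coefficient values below are arbitrary placeholders, not the patent's fitted values (the patent determines A, k and C by fitting measured utilizations):

```python
import math

A, K, C = 0.9, 0.8, 0.1   # placeholder coefficients, not fitted values

def predict_tcu_util(ratio):
    """U(r) = A * e^(-k * r) + C: the predicted TCU utilization decays
    toward the floor C as the CUDA-core/TCU execution time ratio r grows."""
    return A * math.exp(-K * ratio) + C
```

A small ratio (CUDA-core work short relative to TCU work) predicts high TCU utilization, steering the scheduler toward parallel compute streams; a large ratio predicts low utilization, steering it toward serial scheduling in a single stream.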