CN-121996402-A - Model computing system, method and device and related equipment
Abstract
The application provides a model computing system, a method, a device, and related equipment. The method comprises: segmenting a first model according to the number of xPU accelerator cards to generate a plurality of second models; segmenting CPU resources according to the number of xPU accelerator cards to generate a plurality of CPU resource groups; binding each xPU accelerator card with its corresponding CPU resource group and second model to generate a plurality of computing groups; and computing the plurality of second models through the plurality of computing groups. Because the model and the CPU resources are partitioned according to the number of xPU accelerator cards, and each partitioned model, its CPU resources, and an xPU accelerator card are bound together, the respective computing advantages of the xPU accelerator cards can be exploited, the delay caused by data transmission between different memory areas can be reduced, and the computing efficiency of the overall model is improved.
Inventors
- LI CHENG
- LI WEI
- LI GEZI
- ZHANG JINGBIN
- ZHANG ZIYANG
- WANG YAOYUAN
Assignees
- Huawei Technologies Co., Ltd. (华为技术有限公司)
Dates
- Publication Date
- 20260508
- Application Date
- 20241108
Claims (14)
- 1. A model computing system, wherein the system is configured to compute a first model, the first model being divided into a plurality of second models, the system comprising a plurality of computing groups, each computing group of the plurality of computing groups comprising central processing unit (CPU) resources and heterogeneous processing unit (xPU) accelerator cards, the number of second models being equal to the number of computing groups, and each computing group being configured to compute one of the plurality of second models, wherein the number of CPU resources in each computing group is proportional to the computational power of the xPU accelerator cards in the computing group, and the computation amount of the second model assigned to each computing group is proportional to the computational power of the xPU accelerator cards in the computing group.
- 2. The system of claim 1, wherein the number of xPU accelerator cards in each computing group is 1 and the CPU resources in each computing group comprise one or more CPU cores, wherein when the CPU resources in a computing group comprise a plurality of CPU cores, the plurality of CPU cores are located in one or more CPUs.
- 3. The system of claim 2, further comprising a segmentation processor, wherein the segmentation processor is configured to segment the first model according to the number of xPU accelerator cards to generate the plurality of second models, where the computing power of each xPU accelerator card is proportional to the computation amount of the corresponding second model; and the segmentation processor is further configured to segment the CPU according to the number of xPU accelerator cards to generate a plurality of CPU resource groups, where the computing power of each xPU accelerator card is proportional to the number of CPU resources included in the corresponding CPU resource group.
- 4. The system of claim 3, wherein the segmentation processor is further configured to segment a first operator to generate a plurality of second operators, the first operator being a computation operator of the first model, each of the plurality of second operators being a computation operator of one of the plurality of second models, and each of the plurality of second operators comprising a forward computation operator, a backward computation operator, an optimizer operator, and a path operator, wherein the optimizer operator is configured to update model parameters based on gradient information obtained by the backward computation operator; and the segmentation processor is further configured to deploy each second operator to one computing group.
- 5. The system according to claim 4, wherein, when deploying each second operator to one computing group, the segmentation processor is specifically configured to: deploy the forward computation operator and the backward computation operator in the second operator corresponding to a first computing group on the xPU accelerator cards in the first computing group, wherein the first computing group is one of the computing groups, and the second operator corresponding to the first computing group is a computation operator of the second model corresponding to the first computing group; divide the optimizer operator in the second operator corresponding to the first computing group into a first optimizer operator and a second optimizer operator according to the functions of the optimizer operator; divide the path operator in the second operator corresponding to the first computing group into a first path operator and a second path operator according to the functions of the path operator; and deploy the first optimizer operator and the first path operator on the CPU resources in the first computing group, and deploy the second optimizer operator and the second path operator on the xPU accelerator cards in the first computing group.
- 6. The system of claim 5, further comprising an operator fusion engine, wherein the operator fusion engine is configured to fuse the operators to be deployed on the xPU accelerator cards in the first computing group before the segmentation processor deploys each second operator to one computing group.
- 7. The system of claim 6, wherein the operator fusion engine is further configured to fuse the operators to be deployed on the CPU before the segmentation processor deploys each second operator to one computing group.
- 8. The system of any one of claims 1 to 7, further comprising a CPU utilization calculator, wherein the CPU utilization calculator is configured to calculate the utilization rate of first CPU resources and the utilization rate of second CPU resources, the first CPU resources being the CPU resources in the system used for the computation of the plurality of second models, and the second CPU resources being the CPU resources in the system other than the first CPU resources; and the CPU utilization calculator is further configured to allocate a part of the second CPU resources to the first CPU resources when the utilization rate of the second CPU resources is less than or equal to the utilization rate of the first CPU resources.
- 9. A model computing method, the method comprising: splitting a first model according to the number of xPU accelerator cards to generate a plurality of second models, wherein the computing power of each xPU accelerator card is proportional to the computation amount of the corresponding second model; splitting the CPU according to the number of xPU accelerator cards to generate a plurality of CPU resource groups, wherein the computing power of each xPU accelerator card is proportional to the number of CPU resources contained in the corresponding CPU resource group; binding each xPU accelerator card with the corresponding CPU resource group to generate a plurality of computing groups; and computing the plurality of second models through the plurality of computing groups.
- 10. A model computing device, the device comprising: a model segmentation unit configured to segment a first model according to the number of xPU accelerator cards to generate a plurality of second models, wherein the computing power of each xPU accelerator card is proportional to the computation amount of the corresponding second model; a resource segmentation unit configured to segment the CPU according to the number of xPU accelerator cards to generate a plurality of CPU resource groups, wherein the computing power of each xPU accelerator card is proportional to the number of CPU resources contained in the corresponding CPU resource group; and a processing unit configured to bind each xPU accelerator card with the corresponding CPU resource group to generate a plurality of computing groups, and to compute the plurality of second models through the plurality of computing groups.
- 11. A computing device comprising a processor and a memory, the memory to store instructions, the processor to execute the instructions to cause the computing device to implement the method of claim 9.
- 12. A cluster of computing devices, wherein the cluster of computing devices includes at least one computing device, each of the at least one computing device including a processor and a memory, the processor of the at least one computing device to execute instructions stored in the memory of the at least one computing device to cause the cluster of computing devices to implement the method of claim 9.
- 13. A computer program product comprising instructions that, when run on at least one computing device, cause the at least one computing device to perform the method of claim 9.
- 14. A computer readable storage medium having instructions stored therein, which when executed by a computing device or cluster of computing devices, implement the method of claim 9.
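To make the proportional partitioning and binding of claims 1, 3 and 9 concrete, the following is a minimal Python sketch that is not taken from the patent itself; the names ComputeGroup, split_proportionally and build_compute_groups, the layer-wise partition of the first model, and the layer and core counts are all illustrative assumptions about how such a scheme could look.

```python
# Minimal illustrative sketch (not from the patent text) of splitting the first
# model and the CPU cores proportionally to per-card compute power and binding
# each share to one xPU accelerator card. All names below are hypothetical.
from dataclasses import dataclass

@dataclass
class ComputeGroup:
    xpu_id: int            # one xPU accelerator card per group (claim 2)
    cpu_cores: list[int]   # CPU resource group bound to this card
    layers: list[int]      # layer indices of the second model assigned to this card

def split_proportionally(items: list[int], weights: list[float]) -> list[list[int]]:
    """Split `items` into len(weights) contiguous chunks whose sizes are roughly
    proportional to `weights`."""
    total = sum(weights)
    chunks, start = [], 0
    for i, w in enumerate(weights):
        # the last chunk takes the remainder so no item is lost to rounding
        size = len(items) - start if i == len(weights) - 1 else round(len(items) * w / total)
        chunks.append(items[start:start + size])
        start += size
    return chunks

def build_compute_groups(num_layers: int, num_cpu_cores: int,
                         xpu_compute_power: list[float]) -> list[ComputeGroup]:
    layer_chunks = split_proportionally(list(range(num_layers)), xpu_compute_power)
    core_chunks = split_proportionally(list(range(num_cpu_cores)), xpu_compute_power)
    return [ComputeGroup(xpu_id=i, cpu_cores=core_chunks[i], layers=layer_chunks[i])
            for i in range(len(xpu_compute_power))]

# Example: three cards with relative compute power 1 : 2 : 1, a 16-layer model
# and 32 CPU cores; the middle card receives roughly twice the layers and cores.
groups = build_compute_groups(num_layers=16, num_cpu_cores=32,
                              xpu_compute_power=[1.0, 2.0, 1.0])
for g in groups:
    print(g.xpu_id, len(g.layers), len(g.cpu_cores))
```

In this sketch the split is contiguous and rounding remainders are absorbed by the last group; the patent does not prescribe a particular splitting rule, only that the assigned computation and CPU resources be proportional to each card's computing power.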
Description
Model computing system, method and device and related equipment

Technical Field

The present application relates to the field of heterogeneous acceleration, and in particular to a model computing system, method, apparatus, and related devices.

Background

With the continuous development of technology, the demand for model computing power in leading-edge fields such as artificial intelligence and big data analysis keeps growing and becoming more demanding. Traditional model computing architectures increasingly exhibit significant performance bottlenecks when handling ever more complex model computing tasks. Taking the computation of a model on a central processing unit (Central Processing Unit, CPU) as an example, the computing power of the CPU appears relatively limited when faced with large model computing tasks, especially massively parallel computing tasks. For example, during the training of a deep learning model, the huge data volume and complex computation requirements make it difficult for the CPU to complete the computing task within the specified time, resulting in low computing efficiency of the overall model.

Disclosure of Invention

The application provides a model computing system, a method, a device and related equipment, which not only enable independent processing of the partitioned models and exploit the computing advantages of a plurality of xPU accelerator cards, but also reduce the delay caused by data transmission between different memory areas by binding CPU resources to the xPU accelerator cards, thereby improving the computing efficiency of the overall model.

In a first aspect, the present application provides a model computing system for computing a first model, the first model being divided into a plurality of second models. The system comprises a plurality of computing groups, each computing group comprising central processing unit (CPU) resources and heterogeneous processing unit (xPU) accelerator cards, the number of second models being equal to the number of computing groups, and each computing group being configured to compute one of the plurality of second models, wherein the number of CPU resources in each computing group is proportional to the computational power of the xPU accelerator cards in the computing group, and the computation amount of the second model assigned to each computing group is proportional to the computational power of the xPU accelerator cards in the computing group.

In the above solution, xPU covers processors other than the CPU that are used to accelerate certain types of computation, including but not limited to graphics processing units (Graphics Processing Unit, GPU), neural network processors (Neural Processing Unit, NPU), and tensor processors (Tensor Processing Unit, TPU). Different processors can provide more efficient computing power than the CPU in certain computing scenarios. Taking the GPU as an example, it has a large number of cores that can process multiple tasks in parallel, so when the computing model has high requirements for data parallelism, using a GPU can greatly accelerate the computation of the model. For example, deep learning models often require a large number of matrix multiplication and matrix addition operations, which have high requirements for data parallelism and can be processed simultaneously by a GPU.
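As a brief illustration of the data-parallelism point above, and not part of the patent, the following sketch assumes the PyTorch library is available and simply places a matrix multiplication and addition on a GPU when one is present:

```python
# Illustrative only: assumes PyTorch, which the patent does not mention.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU if no GPU
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)
c = a @ b + a   # matrix multiplication and addition, executed data-parallel on the GPU
print(c.shape, c.device)
```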
In the above solution, the xPU accelerator card is a hardware computing device obtained by integrating an xPU with other auxiliary components, where the other auxiliary components include a high-speed memory for storing data and an interface circuit for interfacing with a host system. It should be understood that an xPU accelerator card is a complete computing device, so that in the process of computing a model, only the model and the corresponding operators need to be placed on the xPU accelerator card, and the xPU accelerator card can then complete the computation of the model.

In the above system, when the model needs to be computed, the parent model (the first model) is divided into a plurality of sub-models (the second models), and parallel processing of the plurality of sub-models is then achieved through a plurality of computing groups, thereby improving the computing efficiency of the model. Each computing group is composed of CPU resources and xPU accelerator cards, so the computing advantages of different xPU accelerator cards can be exploited according to the model structure. For example, a deep learning model contains a plurality of layers, and different layers may correspond to different computing characteristics; a convolution layer in the deep learning model involves a large number of matrix multiplication and matrix addition operations, so by cutting the different layers of the deep learning model, the model corresponding to the convolution layer is cut out from the deep learning model, and the calculat
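One way the binding of CPU resources to a specific xPU accelerator card described in this system could be realized is sketched below. This is not taken from the patent: it assumes a Linux host (os.sched_setaffinity), one worker process per accelerator card, and an example core layout; the function name run_group and the bindings table are hypothetical.

```python
# Hypothetical sketch of the CPU-to-card binding step; not from the patent text.
# Assumes Linux and one worker process per xPU accelerator card.
import os
import multiprocessing as mp

def run_group(xpu_id: int, cpu_cores: list[int]) -> None:
    # Pin this worker to the CPU resource group bound to its accelerator card,
    # so host-side work for the card stays on the cores (and memory) assigned to it.
    os.sched_setaffinity(0, set(cpu_cores))
    # ... here the second model assigned to this group would be run on card `xpu_id` ...
    print(f"group {xpu_id}: running on CPU cores {sorted(os.sched_getaffinity(0))}")

if __name__ == "__main__":
    # Example binding only: card 0 <-> cores 0-7, card 1 <-> cores 8-15.
    bindings = {0: list(range(0, 8)), 1: list(range(8, 16))}
    workers = [mp.Process(target=run_group, args=(i, cores)) for i, cores in bindings.items()]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

Keeping each worker's CPU affinity aligned with its accelerator card is one plausible means of reducing the cross-memory-region transfer delay that the application attributes to the binding step.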