CN-121996407-A - Intelligent GPU resource allocation method, device and equipment for AI training platform
Abstract
Embodiments of this application relate to the technical field of cloud computing, and in particular to a method, device, and electronic equipment for intelligent GPU resource allocation on an AI training platform. The method comprises: periodically collecting multidimensional GPU running-state data from each GPU node in the AI training platform, and classifying and storing the data to generate a GPU resource state library; monitoring pod creation events and, when one occurs, identifying the task type of the new pod and formulating a matching GPU scheduling policy; judging, according to the GPU resource state library, whether available resources satisfy the requirements of the GPU scheduling policy; if so, executing the scheduling and allocation operation corresponding to the policy; if not, preempting online-development-task resources that meet preset conditions and then repeating the judging step. The method achieves accurate identification and differentiated scheduling of mixed workloads and improves GPU resource utilization.
Inventors
- GAO DAI
- SU YANG
- CHEN CUNLI
Assignees
- 度小满云智科技(北京)有限公司 (Duxiaoman Cloud Intelligence Technology (Beijing) Co., Ltd.)
- 度小满科技(北京)有限公司 (Duxiaoman Technology (Beijing) Co., Ltd.)
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2025-12-25
Claims (10)
- 1. An intelligent GPU resource allocation method for an AI training platform, characterized by comprising the following steps: periodically collecting multidimensional GPU running-state data from each GPU node in the AI training platform, and classifying and storing the data to generate a GPU resource state library; monitoring pod creation events and, when a pod creation event is detected, identifying the task type of the new pod, wherein the task type is either an online development task or an offline training task; formulating a GPU scheduling policy matched to the task type; and judging, according to the GPU resource state library, whether available resources satisfy the requirements of the GPU scheduling policy; if so, executing the scheduling and allocation operation corresponding to the GPU scheduling policy; if not, preempting online-development-task resources that meet preset conditions and then repeating the judging step.
- 2. The intelligent GPU resource allocation method for an AI training platform of claim 1, wherein formulating the GPU scheduling policy matched to the task type comprises: if the task type is an online development task, generating a centralized deployment policy, wherein the centralized deployment policy preferentially allocates the pod to nodes that are already running tasks of the same kind and still have spare GPU capacity, permits GPU resource over-subscription, and configures resources by setting only a resource limit without setting a resource request; and if the task type is an offline training task, generating an exclusive deployment policy, wherein the exclusive deployment policy preferentially allocates the pod to nodes whose GPUs meet a preset idle condition or that run only tasks of the same kind, and configures resources by setting both a resource request and a resource limit.
- 3. The intelligent GPU resource allocation method for an AI training platform of claim 2, wherein, if the GPU scheduling policy is the centralized deployment policy, executing the scheduling and allocation operation corresponding to the GPU scheduling policy comprises: screening for a target node according to the centralized deployment policy, and receiving the GPU usage mode specified by the user; if the GPU usage mode is a whole-card exclusive mode, setting the GPU resource configuration parameter of the target node to the number of whole cards specified by the user; if the GPU usage mode is a shared mode, setting the GPU resource configuration parameter of the target node to a preset sharing identifier and sharing the GPU through GPU virtualization; and if the GPU usage mode is a virtual GPU mode, setting the GPU resource configuration parameter of the target node to virtual GPU shares split in a preset proportion and allocating and using the virtual GPUs through GPU slicing.
- 4. The intelligent GPU resource allocation method for an AI training platform of claim 2, further comprising: judging whether the offline training task is a single-node offline training task or a distributed offline training task; and if it is a distributed offline training task, adding a batch-scheduling annotation configuration to the exclusive deployment policy, the configuration comprising a batch-scheduling enable identifier and a minimum-available threshold for the number of associated pods required by the task, so as to ensure that all associated pods are scheduled simultaneously.
- 5. The intelligent GPU resource allocation method for an AI training platform of claim 2, wherein, if the GPU scheduling policy is the exclusive deployment policy, executing the scheduling and allocation operation corresponding to the GPU scheduling policy comprises: judging whether the number of GPUs required by the exclusive deployment policy is a single card or multiple cards; if a single card, screening for a target node according to the exclusive deployment policy and binding a GPU in the target node that meets the resource requirement to the new pod; and if multiple cards, reading GPU topology association data from the GPU resource state library, enumerating the GPU combinations that satisfy the required GPU count, scoring each combination according to its topology connection type, and binding the highest-scoring combination to the new pod.
- 6. The intelligent GPU resource allocation method for an AI training platform of claim 1, wherein preempting online-development-task resources that meet preset conditions comprises: screening a target pod that meets the preset conditions from the GPU resource state library; sending a preemption notice and a resource-release countdown to the user who owns the target pod, and waiting for a preset response time; if the resources of the target pod are released within the preset response time, starting resource reclamation; and if they are not released within the preset response time, sending a delete signal to the target pod, setting a termination deadline, forcibly terminating the processes of the target pod once the deadline is reached, and storing the corresponding preemption record, wherein the preemption record comprises a timestamp, information on the preempted pod, user information, the preemption reason, and information on the newly scheduled pod.
- 7. The intelligent GPU resource allocation method for an AI training platform of claim 6, wherein screening the target pod that meets the preset conditions from the GPU resource state library comprises: selecting the pods corresponding to online development tasks from the GPU resource state library as candidate pods; combining GPU utilization data and idle-time records to extract, from the candidate pods, the pods whose GPU utilization is below a preset utilization threshold and whose idle time exceeds a preset time threshold, as pre-selected pods; excluding from the pre-selected pods any falsely idle pods that are in an initial start-up phase, in process sleep, or waiting on input/output, to obtain the initially selected pods; and invoking a preset formula to compute a preemption cost for each initially selected pod and selecting the pod with the lowest preemption cost as the target pod, wherein the preset formula applies a positively weighted term to the pod's running time and a negatively weighted term to the idle time of the pod's GPU.
- 8. The intelligent GPU resource allocation method for an AI training platform of claim 1, further comprising: determining, from the multidimensional GPU running-state data, which GPUs of online development tasks are idle; judging whether the idle duration of an idle GPU reaches the idle-threshold standard for online development tasks; and if so, performing gradient resource reclamation on the idle GPU.
- 9. An intelligent GPU resource allocation device for an AI training platform, characterized by comprising: a data acquisition unit, for periodically collecting multidimensional GPU running-state data from each GPU node in the AI training platform and classifying and storing the data to generate a GPU resource state library; a pod-creation identification unit, for monitoring pod creation events and, when a pod creation event is detected, identifying the task type of the new pod; a scheduling-policy formulation unit, for formulating a GPU scheduling policy matched to the task type; a real-time resource judging unit, for judging, according to the GPU resource state library, whether available resources satisfy the requirements of the GPU scheduling policy, triggering the scheduling allocation unit if they do and the resource preemption unit if they do not; the scheduling allocation unit, for executing the scheduling and allocation operation corresponding to the GPU scheduling policy; and the resource preemption unit, for preempting online-development-task resources that meet preset conditions and then triggering the real-time resource judging unit.
- 10. An electronic device, comprising at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the intelligent GPU resource allocation method for an AI training platform as claimed in any one of claims 1 to 8.
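The two scheduling policies of claim 2 can be sketched as a mapping onto Kubernetes-style container resource fields. This is an illustrative sketch, not the patent's implementation: the function name `build_resource_spec` and the use of the conventional `nvidia.com/gpu` resource key are assumptions.

```python
# Illustrative sketch (not from the patent): for an online development task
# only "limits" is set, which permits over-subscription; for an offline
# training task "requests" equals "limits" (exclusive, guaranteed capacity).

def build_resource_spec(task_type: str, gpu_count: int) -> dict:
    """Return a container 'resources' stanza for the given task type."""
    gpu_key = "nvidia.com/gpu"  # conventional device-plugin resource name
    if task_type == "online_development":
        # Centralized deployment policy: limit only, no request.
        return {"limits": {gpu_key: gpu_count}}
    elif task_type == "offline_training":
        # Exclusive deployment policy: request == limit.
        return {"requests": {gpu_key: gpu_count},
                "limits": {gpu_key: gpu_count}}
    raise ValueError(f"unknown task type: {task_type}")

print(build_resource_spec("online_development", 1))
print(build_resource_spec("offline_training", 4))
```

Setting only a limit leaves the request at zero, which is what allows many development pods to be packed onto (and over-subscribe) the same node.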
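The multi-card selection in claim 5 (enumerate GPU combinations, score by topology connection type, bind the best) can be sketched as follows. The link names, score values, and function names are assumptions for illustration; the patent does not specify them.

```python
# Illustrative sketch of claim 5's topology-aware multi-card selection:
# enumerate combinations of the required size and score each combination by
# its weakest pairwise interconnect, preferring NVLink over a shared PCIe
# switch over cross-socket links.
from itertools import combinations

LINK_SCORE = {"nvlink": 3, "pcie_switch": 2, "cross_socket": 1}

def pick_gpu_combo(free_gpus, link_type, need):
    """Return the `need`-GPU combination whose worst pairwise link scores best.

    link_type[(a, b)] -> interconnect between GPUs a and b (with a < b).
    """
    best, best_score = None, -1
    for combo in combinations(sorted(free_gpus), need):
        # A combination is only as fast as its weakest pairwise link.
        score = min(LINK_SCORE[link_type[(a, b)]]
                    for a, b in combinations(combo, 2))
        if score > best_score:
            best, best_score = combo, score
    return best

links = {(0, 1): "nvlink", (2, 3): "nvlink", (0, 2): "cross_socket",
         (0, 3): "cross_socket", (1, 2): "cross_socket", (1, 3): "cross_socket"}
print(pick_gpu_combo([0, 1, 2, 3], links, 2))  # → (0, 1)
```

Scoring by the minimum link keeps a combination from being chosen just because one of its pairs happens to be fast.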
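The preemption flow of claim 6 (notify, countdown, voluntary release, then forced termination with an audit record) can be sketched as a small control loop. All callbacks (`notify`, `released`, `delete_pod`, `force_kill`, `log`) are assumed stand-ins for real cluster operations, not APIs named in the patent.

```python
# Illustrative sketch of claim 6's preemption flow: notify the owner, wait
# for voluntary release within the response window, otherwise send a delete
# signal, wait out the termination deadline, force-terminate, and log.
import time

def preempt(pod, notify, released, delete_pod, force_kill, log,
            response_s=5.0, term_s=5.0, poll_s=1.0):
    notify(pod)                      # preemption notice + release countdown
    deadline = time.monotonic() + response_s
    while time.monotonic() < deadline:
        if released(pod):            # owner released resources in time
            log({"pod": pod, "forced": False, "ts": time.time()})
            return "reclaimed"
        time.sleep(poll_s)
    delete_pod(pod)                  # delete signal with termination deadline
    time.sleep(term_s)
    force_kill(pod)                  # deadline reached: force-terminate
    log({"pod": pod, "forced": True, "ts": time.time()})
    return "forced"
```

The returned string distinguishes the cooperative path from the forced one, matching the two branches in the claim; the logged dict is a minimal stand-in for the preemption record (timestamp, pod, forced or not).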
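The preemption-cost formula of claim 7 weights the pod's running time positively and its GPU idle time negatively, then picks the lowest-cost pod. A minimal sketch, with weight values and field layout assumed for illustration:

```python
# Illustrative sketch of claim 7's preset formula: cost grows with how long
# the pod has been running (disturbing it is expensive) and shrinks with how
# long its GPU has sat idle (reclaiming it is cheap).

def preemption_cost(runtime_h: float, gpu_idle_h: float,
                    w_run: float = 1.0, w_idle: float = 2.0) -> float:
    # Positively weighted runtime, negatively weighted GPU idle time.
    return w_run * runtime_h - w_idle * gpu_idle_h

def pick_preemption_target(pods):
    """pods: list of (name, runtime_h, gpu_idle_h); return lowest-cost name."""
    return min(pods, key=lambda p: preemption_cost(p[1], p[2]))[0]

pods = [("dev-a", 10.0, 0.5),   # long-running, barely idle -> costly
        ("dev-b", 2.0, 1.5)]    # short-running, mostly idle -> cheap
print(pick_preemption_target(pods))  # → dev-b
```

With these assumed weights, dev-a scores 10 − 1 = 9 and dev-b scores 2 − 3 = −1, so the short-lived, mostly idle pod is preempted first.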
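The gradient resource reclamation of claim 8 can be read as a tiered response to increasing idle time. The specific tiers and thresholds below are assumptions for illustration; the patent only states that reclamation begins once the idle-threshold standard is reached.

```python
# Illustrative sketch of claim 8's gradient reclamation (tiers and minute
# thresholds are assumed): the longer a development GPU stays idle, the
# stronger the action, from no action through a warning to a full reclaim.

def recovery_action(idle_minutes: float) -> str:
    if idle_minutes < 30:
        return "none"          # below the idle-threshold standard
    if idle_minutes < 60:
        return "warn_user"     # tier 1: remind the owner
    if idle_minutes < 120:
        return "shrink_share"  # tier 2: reduce the GPU share
    return "reclaim_gpu"       # tier 3: reclaim the idle GPU fully

print([recovery_action(m) for m in (10, 45, 90, 180)])
# → ['none', 'warn_user', 'shrink_share', 'reclaim_gpu']
```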
Description
Intelligent GPU resource allocation method, device and equipment for AI training platform

Technical Field

Embodiments of this application relate to the technical field of cloud computing, and in particular to a method, device, and electronic equipment for intelligent GPU resource allocation on an AI training platform.

Background

The AI training platform is a one-stop technical carrier supporting the full workflow of AI model development and training. Built on a container orchestration tool (K8S, Kubernetes), it integrates core hardware resources such as GPUs (graphics processing units) with various development and training tools, providing an efficient environment in which researchers can focus on model and algorithm optimization without attending to low-level resource management; it is therefore widely adopted. The GPU is the core computational support of the AI training platform, and its allocation efficiency directly affects the pace of AI research and development and the cost of resource utilization. The GPU scheduling scheme of the existing container orchestration tool (K8S) is built around a resource Request and Limit allocation mechanism: a pod (the minimum deployable computing unit) is scheduled onto a worker node (Node) that meets its resource requirements by a First-Fit algorithm or a random strategy, and some schemes integrate batch schedulers such as Volcano to support features like batch scheduling and priority queues, meeting the basic scheduling needs of distributed training.
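The baseline First-Fit behaviour described above can be sketched in a few lines. This is an illustrative simplification, not Kubernetes source code: node state is reduced to a free-GPU count.

```python
# A minimal First-Fit sketch of the baseline scheduling described above:
# place the pod on the first node whose free GPU count covers the request,
# with no awareness of workload type or actual utilization.

def first_fit(nodes, requested_gpus):
    """nodes: list of (name, free_gpus); return first node that fits, or None."""
    for name, free in nodes:
        if free >= requested_gpus:
            return name
    return None

nodes = [("node-a", 0), ("node-b", 2), ("node-c", 8)]
print(first_fit(nodes, 2))  # → node-b
```

Because the decision looks only at the requested amount, a node whose GPUs are nominally allocated but sitting idle is invisible to it, which is exactly the shortcoming the next paragraph describes.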
However, the prior art has several obvious shortcomings: the scheduler decides only on the requested resource amount and lacks the ability to sense and distinguish mixed-workload characteristics, so that the actual GPU utilization in online development scenarios often stays below 20% while the resources remain occupied for long periods, and offline training tasks wait for long periods due to insufficient resources. How to break the pattern of resources being occupied but unused, unbalanced allocation, and scheduling deadlock in the AI training platform is therefore a problem that those skilled in the art urgently need to solve.

Disclosure of Invention

This application aims to provide at least a method, device, and electronic equipment for intelligent GPU resource allocation on an AI training platform that can accurately identify and differentially schedule mixed workloads and improve GPU resource utilization. To solve the above technical problems, at least one embodiment of the present application provides an intelligent GPU resource allocation method for an AI training platform, comprising: periodically collecting multidimensional GPU running-state data from each GPU node in the AI training platform, and classifying and storing the data to generate a GPU resource state library; monitoring pod creation events and, when a pod creation event is detected, identifying the task type of the new pod, wherein the task type is either an online development task or an offline training task; formulating a GPU scheduling policy matched to the task type; and judging, according to the GPU resource state library, whether available resources satisfy the requirements of the GPU scheduling policy; if so, executing the scheduling and allocation operation corresponding to the policy; if not, preempting online-development-task resources that meet preset conditions and then repeating the judging step. In one embodiment, formulating the GPU scheduling policy matched to the task type comprises: if the task type is an online development task, generating a centralized deployment policy, wherein the centralized deployment policy preferentially allocates the pod to nodes already running tasks of the same kind with spare GPU capacity, permits GPU resource over-subscription, and configures resources by setting only a resource limit without setting a resource request; and if the task type is an offline training task, generating an exclusive deployment policy, wherein the exclusive deployment policy preferentially allocates the pod to nodes whose GPUs meet a preset idle condition or that run only tasks of the same kind, and configures resources by setting both a resource request and a resource limit. In one embodiment, if the GPU scheduling policy is the centralized deployment policy, executing the scheduling and allocation operation corresponding to the GPU scheduling policy comprises: screening for a target node according to the centralized deployment policy, and r