
CN-114942846-B - GPU resource scheduling method, device, equipment and storage medium

CN114942846B

Abstract

The invention relates to the field of cloud technology and discloses a method, apparatus, device, and storage medium for scheduling GPU resources. The method comprises: obtaining registration information of a target container for GPU resources; creating, according to the registration information, socket information for a simulated card slot of the target container in the GPU resources; performing virtualization processing on the target container on a K8S cluster based on the socket information to obtain node resource information corresponding to the target container on the K8S cluster; partitioning the GPU resources according to the node resource information and allocating the partitioned GPU resources to the target container through the K8S cluster; and obtaining an operation instruction for the target container on the K8S cluster and, with reference to the socket information, scheduling the GPU resources corresponding to the operation instruction from among the GPU resources allocated to the target container. The invention enables multiple containers to share the resources of a single GPU card and improves the flexibility with which GPU resource demands are met.
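The sharing scheme the abstract describes, with several containers drawing on one card according to their registered memory demands, reduces at its core to placing per-process memory demands onto cards with enough free capacity. A minimal sketch in Python; the first-fit policy and all names here are illustrative assumptions, not the patent's actual algorithm:

```python
def allocate_by_memory(processes, cards):
    """Place each process's video-memory demand (MiB) onto a GPU card
    with enough free capacity, largest demand first. `processes` maps
    pid -> demand, `cards` maps card id -> free MiB. The first-fit
    policy is an assumption; the patent does not fix a packing strategy.
    """
    free = dict(cards)
    placement = {}
    for pid, demand in sorted(processes.items(), key=lambda kv: -kv[1]):
        for card, avail in free.items():
            if avail >= demand:
                placement[pid] = card          # this card hosts the process
                free[card] = avail - demand
                break
        else:
            raise RuntimeError(f"no card has {demand} MiB free for {pid}")
    return placement

# processes from two containers sharing two 4 GiB cards
plan = allocate_by_memory({"p1": 3000, "p2": 2000, "p3": 2000},
                          {"gpu0": 4096, "gpu1": 4096})
```

Here `p1` fills most of `gpu0`, so `p2` and `p3` end up sharing `gpu1` — two consumers on one physical card, which is the behavior the abstract claims.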

Inventors

  • LIU WENJIE

Assignees

  • Ping An Property &amp; Casualty Insurance Company of China, Ltd. (中国平安财产保险股份有限公司)

Dates

Publication Date
2026-05-08
Application Date
2022-05-26

Claims (8)

  1. A GPU resource scheduling method, characterized by comprising the following steps: obtaining registration information of a target container for GPU resources, wherein the registration information comprises basic information of the device where the target container is located, namely the required video memory capacity, the card slot path, and the capacity serial number of the GPU card of the target container, for determining the required GPU resource area, and creating socket information for a simulated card slot of the target container in the GPU resources according to the registration information; performing virtualization processing on the target container on a K8S cluster based on the socket information to obtain node resource information corresponding to the target container on the K8S cluster; partitioning the GPU resources according to the node resource information and allocating the partitioned GPU resources to the target container through the K8S cluster; and obtaining an operation instruction for the target container on the K8S cluster and, with reference to the socket information, scheduling the GPU resources corresponding to the operation instruction from among the GPU resources allocated to the target container, which comprises: determining the simulated card slot to which the operation instruction points; sending the operation instruction, through the simulated card slot, to the GPU partition corresponding to the GPU resources allocated to the target container; traversing the GPU partition with the operation instruction; determining, based on the traversal result, the GPU resources on which the operation instruction operates; and scheduling those GPU resources to the target container.
  2. The GPU resource scheduling method of claim 1, wherein performing virtualization processing on the target container on the K8S cluster based on the socket information to obtain node resource information corresponding to the target container on the K8S cluster comprises: determining the process information of the target container and counting the video memory demand of each process in the process information; determining the resource information of each GPU card associated with the K8S cluster and allocating node resources for each process on the corresponding GPU card based on the resource information and the video memory demands; and generating the node resource information of the target container in the K8S cluster according to the allocated node resources and the socket information.
  3. The GPU resource scheduling method of claim 2, wherein generating the node resource information of the target container in the K8S cluster according to the allocated node resources and the socket information comprises: binding, in the K8S cluster and according to the allocated node resources, the node resources of the corresponding GPU card allocated to each process with the socket information; and generating the node resource information of the target container in the K8S cluster based on the binding result.
  4. The GPU resource scheduling method of claim 2, further comprising, after partitioning the GPU resources according to the node resource information and allocating the partitioned GPU resources to the target container through the K8S cluster: monitoring the video memory usage of each process in the target container and judging, respectively, whether each monitored video memory usage exceeds the video memory demand of the corresponding process; and if it does, executing a preset restart strategy on the target container in the K8S cluster and reallocating node resources for each process on the corresponding GPU card according to the video memory usage.
  5. A GPU resource scheduling apparatus, wherein the GPU resource scheduling apparatus comprises: a creation module, configured to obtain registration information of a target container for GPU resources, wherein the registration information comprises basic information of the device where the target container is located, namely the required video memory capacity, the card slot path, and the capacity serial number of the GPU card of the target container, for determining the required GPU resource area; to create socket information for a simulated card slot of the target container in the GPU resources according to the registration information; to determine, according to the registration information, the video memory capacity to be registered by the target container in the GPU resources and the registration name of the simulated card slot; to create a socket for the target container in the GPU resources in a preset format using the registration name; and to generate the socket information based on the socket and the video memory capacity; a virtualization module, configured to perform virtualization processing on the target container on the K8S cluster based on the socket information to obtain node resource information corresponding to the target container on the K8S cluster; a partitioning module, configured to partition the GPU resources according to the node resource information and to allocate the partitioned GPU resources to the target container through the K8S cluster; and a scheduling module, configured to obtain an operation instruction for the target container on the K8S cluster and, with reference to the socket information, schedule the GPU resources corresponding to the operation instruction from among the GPU resources allocated to the target container, by determining the simulated card slot to which the operation instruction points with reference to the socket information, sending the operation instruction through the simulated card slot to the GPU partition corresponding to the GPU resources allocated to the target container, traversing the GPU partition with the operation instruction, determining the GPU resources on which the operation instruction operates based on the traversal result, and scheduling those GPU resources to the target container.
  6. The GPU resource scheduling apparatus of claim 5, further comprising a restart module configured to: monitor the video memory usage of each process in the target container and judge, respectively, whether each monitored video memory usage exceeds the video memory demand of the corresponding process; and if it does, execute a preset restart strategy on the target container in the K8S cluster and reallocate node resources for each process on the corresponding GPU card according to the video memory usage.
  7. A GPU resource scheduling device, characterized by comprising a memory and at least one processor, wherein instructions are stored in the memory; the at least one processor invokes the instructions in the memory to cause the GPU resource scheduling device to perform the steps of the GPU resource scheduling method of any one of claims 1-4.
  8. A computer-readable storage medium having instructions stored thereon which, when executed by a processor, implement the steps of the GPU resource scheduling method of any one of claims 1 to 4.
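The instruction path that claims 1 and 5 share — resolve the simulated card slot, forward the operation to the container's GPU partition, traverse it, and schedule the matching resource — can be sketched as follows. The partition layout and the matching rule are assumptions for illustration, not the patent's concrete data structures:

```python
class SimulatedSlot:
    """A stand-in for the simulated card slot: it forwards an operation
    to the GPU partition allocated to the container and traverses that
    partition to find the resource the operation targets."""

    def __init__(self, partition):
        # partition: named blocks of the container's GPU allocation
        self.partition = partition

    def dispatch(self, need_mib):
        """Traverse the partition; schedule the first block that can
        satisfy the operation and return its name (None if none can)."""
        for name, block in self.partition.items():
            if block["free_mib"] >= need_mib:
                block["free_mib"] -= need_mib   # resource is now scheduled
                return name
        return None

slot = SimulatedSlot({"block0": {"free_mib": 1024},
                      "block1": {"free_mib": 4096}})
chosen = slot.dispatch(2048)   # block0 is too small, so block1 is chosen
```

The traversal-then-schedule step mirrors the claim language: the operation instruction walks the partition, the traversal result identifies the resource, and that resource is deducted from the container's allocation.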

Description

GPU resource scheduling method, device, equipment and storage medium

Technical Field

The present invention relates to the field of cloud technologies, and in particular to a method, an apparatus, a device, and a storage medium for GPU resource scheduling.

Background

As services grow, maintaining their stability and reliability through maintenance management, elastic scaling, and the like poses ever greater challenges. To cope with continually changing operation, maintenance, and development requirements, many services are containerized and then hosted on K8S (Kubernetes, a container cluster management system) for container cluster management. Relying on K8S, containers can be used flexibly and easily and managed efficiently and safely, and rapid iterative release of development versions is also supported; this is all the more true for compute-intensive services that depend on GPUs. Since a K8S node cannot manage GPU resources the way it manages CPU and memory resources, it must rely on third-party official device plugins, such as those of the mainstream vendors AMD and NVIDIA. These are limited in that: 1) when setting GPU resource requirements, only limits can be specified; requests and limits cannot be set separately as for CPU and memory, meaning GPU resources cannot be allocated flexibly according to how much is actually requested; and 2) a GPU cannot currently be shared between containers, that is, one GPU can belong to only one container. In short, there is the problem that the efficiency of allocating and invoking GPU resources is low.

Disclosure of Invention

The invention mainly aims to solve the technical problem of the low efficiency of allocating and invoking GPU resources.
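One common way around the whole-device limitation described above is to advertise a card's memory as many small schedulable units, so that a scheduler which only counts whole devices can hand different containers different shares of the same card. A hedged sketch of the idea; the unit size and device-ID scheme are assumptions, not the patent's mechanism:

```python
def advertise_memory_units(card_id, memory_mib, unit_mib=256):
    """Expose one physical card's memory as fixed-size units, each with
    its own device ID. A container requesting N units then receives
    N * unit_mib of the card, and the remaining units stay available
    to other containers on the same card."""
    count = memory_mib // unit_mib
    return [f"{card_id}-mem-{i}" for i in range(count)]

units = advertise_memory_units("gpu0", 16384)   # a 16 GiB card -> 64 units
```

A container asking for 8 such units would be granted 2 GiB of the card, leaving 56 units for its neighbors — exactly the sharing that a one-GPU-per-container plugin cannot express.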
The first aspect of the invention provides a GPU resource scheduling method, comprising: obtaining registration information of a target container for GPU resources; creating socket information for a simulated card slot of the target container in the GPU resources according to the registration information; performing virtualization processing on the target container on a K8S cluster based on the socket information to obtain node resource information corresponding to the target container on the K8S cluster; partitioning the GPU resources according to the node resource information and allocating the partitioned GPU resources to the target container through the K8S cluster; and obtaining an operation instruction for the target container on the K8S cluster and, with reference to the socket information, scheduling the GPU resources corresponding to the operation instruction from among the GPU resources allocated to the target container. Optionally, in a first implementation of the first aspect of the present invention, creating the socket information for the simulated card slot of the target container in the GPU resources according to the registration information comprises: determining, according to the registration information, the video memory capacity to be registered by the target container in the GPU resources and the registration name of the simulated card slot; creating a socket for the target container in the GPU resources in a preset format using the registration name; and generating the socket information based on the socket and the video memory capacity.
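The first implementation above — derive a registration name, create a socket in a preset format, and bundle it with the registered memory capacity — might look like the following. The unix-socket path format and the field names are illustrative assumptions, not the patent's preset format:

```python
import os
import socket
import tempfile

def create_slot_socket(registration, base_dir):
    """Create a unix-domain socket acting as the simulated card slot
    and return the 'socket information' pairing its path with the
    registered video-memory capacity. The vgpu-<name>.sock naming
    scheme is a hypothetical stand-in for the preset format."""
    name = f"vgpu-{registration['slot_name']}.sock"
    path = os.path.join(base_dir, name)
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(path)       # the socket file is the simulated slot endpoint
    srv.listen(1)
    return srv, {"socket_path": path,
                 "memory_mib": registration["memory_mib"]}

base = tempfile.mkdtemp()
srv, sock_info = create_slot_socket(
    {"slot_name": "container-a", "memory_mib": 4096}, base)
```

The returned `sock_info` dict is what the later steps would consult: the endpoint to route operation instructions through, and the capacity registered against it.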
Optionally, in a second implementation of the first aspect of the present invention, performing virtualization processing on the target container on the K8S cluster based on the socket information to obtain node resource information corresponding to the target container on the K8S cluster comprises: determining the process information of the target container and counting the video memory demand of each process in the process information; determining the resource information of each GPU card associated with the K8S cluster and allocating node resources for each process on each GPU card based on the resource information and the video memory demands; and generating the node resource information of the target container in the K8S cluster according to the allocated node resources and the socket information. Optionally, in a third implementation of the first aspect of the present invention, generating the node resource information of the target container in the K8S cluster according to the allocated node resources and the socket information comprises: binding, in the K8S cluster and according to the allocated node resources, the node resources of the GPU card allocated to each process with the socket information; and generating the node resource information of the target container in the K8S cluster based on the binding result. Optionally, in a fourth implementation of the first aspect of the present invention, after partitioning the GPU resources according to the node resource information and allocating the partitioned GPU resources to the target container through the K8S cluster, the method fu