CN-121807573-B - Intelligent computing resource sharing scheduling method and device based on non-invasive technology

Abstract

The invention relates to the technical field of intelligent computing resource sharing and discloses an intelligent computing resource sharing scheduling method and device based on non-invasive technology. The method comprises: receiving a GPU resource pool creation request, obtaining an available GPU list, and allocating the GPUs in the available GPU list to a GPU resource pool detail table to obtain a resource pool identifier; receiving a container creation request containing the resource pool identifier, establishing an SSH session for each physical machine, and uploading a container control script to a preset script directory to obtain an SSH session with the script deployed; detecting the computing card type of the physical machine through the SSH session, generating a container creation command, and remotely executing the container control script to obtain a container record; and writing the container record and the resource pool identifier into a container resource pool association table, completing container scheduling across physical machines. Through a non-invasive architecture based on SSH remote management, the invention achieves cross-physical-machine GPU resource pooling and container scheduling with zero component installation and zero configuration modification.

Inventors

WANG WUDONG
PAN DENG
QIU ZHIHAO

Assignees

广东图灵智新技术有限公司

Dates

Publication Date: 20260508
Application Date: 20260309

Claims (9)

1. An intelligent computing resource sharing scheduling method based on non-invasive technology, comprising: receiving a GPU resource pool creation request, acquiring an available GPU list, and allocating the GPUs in the available GPU list to a GPU resource pool detail table to obtain a resource pool identifier; receiving a container creation request containing the resource pool identifier, acquiring GPUs from the GPU resource pool detail table, grouping them by physical machine, establishing an SSH session for each physical machine, and uploading a container control script to a preset script directory to obtain an SSH session with the script deployed, specifically: querying the GPU resource pool table according to the resource pool identifier in the container creation request, extracting the use state field of the resource pool record, and judging whether the value of the use state field is the unused state, thereby verifying the exclusive state of the resource pool; querying the GPU resource pool detail table based on the resource pool identifier to obtain a GPU identifier list; querying the GPU table according to the GPU identifier list to obtain the physical machine identifier field; traversing the GPU records and grouping them by physical machine identifier, storing each physical machine identifier as a key and the corresponding GPU identifier list as its value in a mapping data structure, to obtain a mapping table of physical machines and GPUs; and traversing the mapping table of physical machines and GPUs, establishing the SSH session for each physical machine, and uploading the container control script to the preset script directory to obtain the SSH session with the script deployed; and detecting the computing card type of the physical machine through the SSH session, generating a container creation command according to the computing card type and the GPU device number character string, remotely executing the container control script to obtain a container record, and writing the container record and the resource pool identifier into a container resource pool association table, completing container scheduling across physical machines.
2. The intelligent computing resource sharing scheduling method based on non-invasive technology according to claim 1, wherein receiving a GPU resource pool creation request, acquiring an available GPU list, and allocating the GPUs in the available GPU list to a GPU resource pool detail table to obtain a resource pool identifier comprises: receiving a GPU resource pool creation request, querying a cluster configuration table according to the cluster identifier in the GPU resource pool creation request, judging whether the isomorphic mode field is true, and if so, extracting the cluster identifier; querying a cluster-to-physical-machine association table based on the cluster identifier to obtain a physical machine identifier list, and querying the GPU table according to the physical machine identifier list to obtain a cluster GPU list; querying the GPU resource pool detail table based on the GPU identifiers in the cluster GPU list, screening out the GPU identifiers that are already allocated, and removing them from the cluster GPU list to obtain an available GPU list; and creating a resource pool record in the GPU resource pool table based on the available GPU list, and allocating the GPUs in the available GPU list to the GPU resource pool detail table to obtain a resource pool identifier.
3. The intelligent computing resource sharing scheduling method based on non-invasive technology according to claim 2, wherein creating a resource pool record in the GPU resource pool table based on the available GPU list, and allocating the GPUs in the available GPU list to the GPU resource pool detail table to obtain a resource pool identifier comprises: querying the GPU resource pool table to verify that the resource pool name is unique, inserting into the GPU resource pool table a resource pool record containing the resource pool name and a use state field initialized to the unused state, and obtaining a resource pool identifier generated by database auto-increment; initializing a remaining GPU count to the number of GPUs in the GPU resource pool creation request, traversing the available GPU list, extracting each GPU identifier, combining it with the resource pool identifier, inserting the pair into the GPU resource pool detail table, and decrementing the remaining GPU count until it reaches zero, thereby establishing an association between the resource pool identifier and the GPU identifiers; and querying the GPU table by the GPU identifiers in the GPU resource pool detail table to obtain the physical machine identifier field, and establishing a three-layer mapping from the resource pool identifier through the GPU identifier to the physical machine identifier.
4. The intelligent computing resource sharing scheduling method based on non-invasive technology according to claim 3, wherein querying the GPU table by the GPU identifiers in the GPU resource pool detail table to obtain the physical machine identifier field, and establishing a three-layer mapping from the resource pool identifier through the GPU identifier to the physical machine identifier comprises: extracting all records containing the resource pool identifier from the GPU resource pool detail table, and reading the GPU identifier field of each record to form a GPU identifier list; using the GPU identifier list as the query condition to search the GPU table, extracting the physical machine identifier field of the record corresponding to each GPU identifier, and forming a correspondence between GPU identifiers and physical machine identifiers; and, based on the association between the resource pool identifier and the GPU identifiers in the GPU resource pool detail table, combined with the correspondence between GPU identifiers and physical machine identifiers, establishing the three-layer mapping from the resource pool identifier through the GPU identifier to the physical machine identifier.
5. The intelligent computing resource sharing scheduling method based on non-invasive technology according to claim 1, wherein traversing the mapping table of physical machines and GPUs, establishing an SSH session for each physical machine, and uploading a container control script to a preset script directory to obtain an SSH session with the script deployed comprises: traversing the mapping table of physical machines and GPUs to extract each physical machine identifier and its GPU identifier list, querying the physical machine table to obtain the IP address, SSH port, user name, and password of the physical machine, generating a unique container name based on the number of the physical machine identifier and the GPU identifier list, creating an SSH session object, setting the connection parameters, and establishing an SSH connection with the physical machine; and opening an SFTP file transfer channel over the SSH connection, uploading the container control script to the preset script directory of the physical machine, and executing a command that sets execute permission on the script, to obtain the SSH session with the script deployed.
6. The intelligent computing resource sharing scheduling method based on non-invasive technology according to claim 5, wherein detecting the computing card type of the physical machine through the SSH session, generating a container creation command according to the computing card type and the GPU device number character string, remotely executing the container control script to obtain a container record, and writing the container record and the resource pool identifier into a container resource pool association table to complete container scheduling across physical machines comprises: traversing the GPU identifier list, querying the GPU table to obtain the device number field, joining the device numbers with commas to form the GPU device number character string, executing a computing card detection command through the SSH session, and determining the computing card type according to the command exit code; constructing a container creation command containing the container configuration parameters based on the computing card type and the GPU device number character string, remotely executing the container control script through the SSH session to generate a container configuration file and start the container, parsing the command exit code to verify that container creation succeeded, and inserting the container configuration information into a container table to obtain a container record; and updating the use state field of the resource pool record to the used state, and writing the container record and the resource pool identifier into the container resource pool association table to complete container scheduling across physical machines.
7. The intelligent computing resource sharing scheduling method based on non-invasive technology according to claim 6, wherein constructing a container creation command containing the container configuration parameters based on the computing card type and the GPU device number character string, remotely executing the container control script through the SSH session to generate a container configuration file and start the container, parsing the command exit code to verify that container creation succeeded, and inserting the container configuration information into a container table to obtain a container record comprises: extracting the image name, number of CPU cores, memory size, and disk space parameters from the container creation request, and constructing a container creation command character string containing the script path, the operation parameters, and all configuration parameters, in combination with the computing card type, the GPU device number character string, the container name, and the physical machine password; and executing the container creation command character string through the SSH session, whereupon the container control script generates, according to the computing card type, a container configuration file containing resource limits and GPU device mappings and calls a container orchestration command to start the container; reading the command's standard output stream and error output stream, extracting the command exit code to judge whether container creation succeeded, and inserting the container name, image, resource configuration, and physical machine information into the container table to obtain a container record.
8. The intelligent computing resource sharing scheduling method based on non-invasive technology according to claim 7, wherein updating the use state field of the resource pool record to the used state, and writing the container record and the resource pool identifier into the container resource pool association table to complete container scheduling across physical machines comprises: verifying that container creation succeeded on all physical machines in the mapping table of physical machines and GPUs, querying the GPU resource pool table by the resource pool identifier to obtain a resource pool record object, calling the state field setter of the resource pool record object to update the use state field from the unused state to the used state, and executing a database update operation to persist the use state field, thereby preventing the resource pool from being reused by other container creation requests; and traversing the list of successfully created container names, parsing the physical machine identifier from each container name, querying the container table by container name and physical machine IP address to obtain the container identifier of the container record, constructing a container resource pool association record containing the container identifier, the resource pool identifier, and the current timestamp, and inserting the association record into the container resource pool association table, thereby establishing a many-to-one association between containers and the resource pool so as to support releasing the resource pool when a container is deleted.
9. An intelligent computing resource sharing scheduling device based on non-invasive technology, configured to implement the steps of the intelligent computing resource sharing scheduling method based on non-invasive technology according to any one of claims 1 to 8, comprising: a receiving module, configured to receive a GPU resource pool creation request, acquire an available GPU list, and allocate the GPUs in the available GPU list to a GPU resource pool detail table to obtain a resource pool identifier; a deployment module, configured to receive a container creation request containing the resource pool identifier, acquire GPUs from the GPU resource pool detail table, group them by physical machine, establish an SSH session for each physical machine, and upload a container control script to a preset script directory to obtain an SSH session with the script deployed; and a scheduling module, configured to detect the computing card type of the physical machine through the SSH session, generate a container creation command according to the computing card type and the GPU device number character string, remotely execute the container control script to obtain a container record, and write the container record and the resource pool identifier into a container resource pool association table, completing container scheduling across physical machines.
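
The pooling and grouping steps recited in claims 1 to 4 can be illustrated with a minimal Python sketch. It is not the patented implementation: the database tables are modelled as in-memory lists, and the names `Gpu`, `available_gpus`, `allocate_pool`, and `group_by_machine` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    gpu_id: int       # primary key in the GPU table
    machine_id: int   # physical machine identifier field
    device_no: int    # device number on that machine


def available_gpus(cluster_gpus, allocated_ids):
    """Claim 2: remove GPUs already present in the pool detail table."""
    return [g for g in cluster_gpus if g.gpu_id not in allocated_ids]


def allocate_pool(pool_id, available, requested, detail_table):
    """Claim 3: add (pool_id, gpu_id) rows until the counter reaches zero."""
    if requested > len(available):
        raise ValueError("not enough available GPUs for this pool")
    remaining = requested
    for gpu in available:
        if remaining == 0:
            break
        detail_table.append((pool_id, gpu.gpu_id))
        remaining -= 1
    return detail_table


def group_by_machine(pool_id, detail_table, gpu_table):
    """Claim 1: map each physical machine identifier to its GPU identifiers."""
    by_id = {g.gpu_id: g for g in gpu_table}
    mapping = defaultdict(list)
    for row_pool_id, gpu_id in detail_table:
        if row_pool_id == pool_id:
            mapping[by_id[gpu_id].machine_id].append(gpu_id)
    return dict(mapping)
```

Each key of the returned mapping corresponds to one physical machine, so a caller would open one SSH session per key and pass that machine's GPU identifiers to the container control script.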

Description

Intelligent computing resource sharing scheduling method and device based on non-invasive technology

Technical Field

The invention relates to the technical field of intelligent computing resource sharing, and in particular to an intelligent computing resource sharing scheduling method and device based on non-invasive technology.

Background

With the rapid growth in demand for artificial intelligence and big data computing, efficient management and scheduling of computing power resources such as GPUs has become a core problem in cloud computing. The current mainstream container cluster management approach implements GPU resource pooling and container scheduling on Kubernetes, but it is significantly invasive: many components, such as kubelet, a container runtime, network plug-ins, and GPU device plug-ins, must be installed on every physical machine; system kernel parameters, network configuration, and firewall rules must be modified; deployment takes days or even weeks; the user's existing services must be stopped; and the technical threshold is high, requiring professional operations and maintenance personnel. When a user needs to aggregate the computing power of multiple heterogeneous or homogeneous GPU servers into a unified resource pool, the traditional approach faces the difficulty of unifying environments: system configurations differ between servers, some servers run services that cannot be interrupted, and network configurations differ. These factors make it difficult to deploy Kubernetes quickly, and users cannot accept the damage to their existing environment or the long service interruption.
The prior art lacks a technical solution that achieves cross-physical-machine GPU resource pooling and unified scheduling of containerized applications while keeping the user's physical host environment intact and without installing heavyweight container orchestration components. A non-invasive intelligent computing resource sharing scheduling method is therefore urgently needed, so that pooled management of GPU resources and cross-machine deployment of containers can be achieved quickly without affecting the user's existing services and system configuration.

Disclosure of Invention

The main object of the invention is to provide an intelligent computing resource sharing scheduling method and device based on non-invasive technology which, through a non-invasive architecture based on SSH remote management, achieves cross-physical-machine GPU resource pooling and container scheduling with zero component installation and zero configuration modification.
To achieve the above object, the invention provides an intelligent computing resource sharing scheduling method based on non-invasive technology, comprising the following steps: receiving a GPU resource pool creation request, acquiring an available GPU list, and allocating the GPUs in the available GPU list to a GPU resource pool detail table to obtain a resource pool identifier; receiving a container creation request containing the resource pool identifier, acquiring GPUs from the GPU resource pool detail table, grouping them by physical machine, establishing an SSH session for each physical machine, and uploading a container control script to a preset script directory to obtain an SSH session with the script deployed; and detecting the computing card type of the physical machine through the SSH session, generating a container creation command according to the computing card type and the GPU device number character string, remotely executing the container control script to obtain a container record, and writing the container record and the resource pool identifier into a container resource pool association table, completing container scheduling across physical machines.
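
The command generation step can be sketched in Python as follows. This is an illustration under stated assumptions, not the patent's script interface: the `create` sub-command, every `--flag` name, and the rule that exit code 0 indicates an NVIDIA card are all hypothetical. The source states only that the card type is determined from a command exit code and that device numbers are joined with commas. The physical machine password that claim 7 includes in the command is left out of the sketch.

```python
import shlex


def device_number_string(device_numbers):
    """Join GPU device numbers with commas, e.g. [0, 1, 3] -> "0,1,3"."""
    return ",".join(str(n) for n in device_numbers)


def card_type_from_exit_code(exit_code):
    """Assumed rule: the probe exits 0 only when the vendor tool is present."""
    return "nvidia" if exit_code == 0 else "other"


def build_create_command(script_path, card_type, devices, name,
                         image, cpus, memory_gb, disk_gb):
    """Build a shell-safe command line for the (hypothetical) control script."""
    argv = [
        script_path, "create",
        "--card-type", card_type,
        "--gpus", device_number_string(devices),
        "--name", name,
        "--image", image,
        "--cpus", str(cpus),
        "--memory-gb", str(memory_gb),
        "--disk-gb", str(disk_gb),
    ]
    return shlex.join(argv)  # quotes each argument for a POSIX shell
```

The resulting string would be executed over the SSH session, and its exit code checked before a container record is written. Quoting each argument with `shlex.join` keeps values such as image names from being reinterpreted by the remote shell.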
Optionally, in a first implementation of the first aspect of the invention, receiving a GPU resource pool creation request, acquiring an available GPU list, and allocating the GPUs in the available GPU list to a GPU resource pool detail table to obtain a resource pool identifier comprises: receiving a GPU resource pool creation request, querying a cluster configuration table according to the cluster identifier in the GPU resource pool creation request, judging whether the isomorphic mode field is true, and if so, extracting the cluster identifier; querying a cluster-to-physical-machine association table based on the cluster identifier to obtain a physical machine identifier list, and querying the GPU table according to the physical machine identifier list