CN-122019111-A - Heterogeneous resource scheduling method, system and storage medium

Abstract

The application discloses a heterogeneous resource scheduling method, system and storage medium, and relates to the technical field of computers. The method comprises: obtaining a plurality of computing power clusters based on a current computing task; obtaining first computing power clusters from the computing power clusters; obtaining a corresponding communication efficiency value for each first computing power cluster; obtaining a target computing power cluster based on the communication efficiency values; and executing the current computing task based on the target computing power cluster. Its core advantage is that it breaks through the limitation of traditional methods, which construct computing power clusters from resource quotas alone: by fully considering the underlying physical interconnection topology of the computing power nodes, it avoids cross-device communication bottlenecks, reduces the risk of a drop in the linear speedup ratio, and fundamentally solves the "computing power island" problem.
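
The flow summarized in the abstract can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration and does not come from the application: the `PLACEMENT` table, the `hop_distance` values (1, 2, 4), the exhaustive enumeration of candidate clusters, and the choice of the inverse mean pairwise distance as the communication efficiency value.

```python
from itertools import combinations

# Hypothetical toy topology: node name -> (host, PCIe switch) placement.
PLACEMENT = {
    "n0": ("hostA", "sw0"),
    "n1": ("hostA", "sw0"),
    "n2": ("hostA", "sw1"),
    "n3": ("hostB", "sw2"),
}


def hop_distance(a, b):
    """Assumed distance: 1 under one switch, 2 on one host, 4 across hosts."""
    host_a, switch_a = PLACEMENT[a]
    host_b, switch_b = PLACEMENT[b]
    if host_a != host_b:
        return 4
    return 1 if switch_a == switch_b else 2


def communication_efficiency(cluster):
    """Higher means more tightly connected: inverse of the mean pairwise distance."""
    pairs = list(combinations(cluster, 2))
    if not pairs:  # a single-node cluster has no internal communication
        return 1.0
    mean_distance = sum(hop_distance(a, b) for a, b in pairs) / len(pairs)
    return 1.0 / mean_distance


def select_target_cluster(nodes, nodes_required):
    """Enumerate candidate clusters, score each one, and return the best-scoring one."""
    candidates = list(combinations(sorted(nodes), nodes_required))
    scores = {c: communication_efficiency(c) for c in candidates}
    return max(scores, key=scores.get), scores


target, scores = select_target_cluster(PLACEMENT, nodes_required=2)
print(target, scores[target])
```

A quota-only scheduler would treat every two-node candidate as equivalent; here the pair that shares a switch scores highest. Exhaustive enumeration grows combinatorially with the number of nodes, so a real scheduler would need a more selective way of generating candidates.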

Inventors

  • Zheng Hanxun
  • Liu Youqun

Assignees

  • 中昊芯英(杭州)科技有限公司

Dates

Publication Date
20260512
Application Date
20260413

Claims (10)

  1. A heterogeneous resource scheduling method, applied to an artificial intelligence computing power service platform comprising a plurality of computing power nodes, characterized in that the method comprises: acquiring a plurality of computing power clusters based on a current computing task, wherein each computing power cluster comprises a plurality of first computing power nodes, and each first computing power node is any one of the computing power nodes; acquiring first computing power clusters from the computing power clusters, wherein a first computing power cluster is a computing power cluster for which no corresponding communication efficiency value has yet been obtained, and the communication efficiency value at least represents how tightly the first computing power nodes in the first computing power cluster are physically connected; obtaining a corresponding communication efficiency value for each first computing power cluster, the communication efficiency values being in one-to-one correspondence with the computing power clusters; acquiring a target computing power cluster based on the communication efficiency values; and executing the current computing task based on the target computing power cluster.
  2. The heterogeneous resource scheduling method of claim 1, wherein obtaining a communication efficiency value based on the first computing power cluster comprises: traversing the first computing power nodes in the first computing power cluster to obtain a second computing power node and all third computing power nodes, wherein the second computing power node is any first computing power node that has not yet obtained a corresponding physical topology value; for each second computing power node, executing the following step until every second computing power node has obtained a corresponding communication distance value: obtaining, based on the second computing power node and each third computing power node, physical topological distance values in one-to-one correspondence with the third computing power nodes, wherein the communication distance value at least represents the physical topological distance between the second computing power node and each third computing power node; and once every second computing power node has obtained its corresponding communication distance value, acquiring the communication efficiency value based on the communication distance values.
  3. The heterogeneous resource scheduling method of claim 2, wherein acquiring the communication efficiency value based on the communication distance values comprises: obtaining, based on the second computing power node and each third computing power node, time-cost quantized values in one-to-one correspondence with the third computing power nodes, wherein each time-cost quantized value at least represents the time taken to transmit data from the second computing power node to the corresponding third computing power node; and acquiring the communication efficiency value based on the time-cost quantized values and the communication distance values.
  4. The heterogeneous resource scheduling method of claim 3, wherein the time-cost quantized values include any one or more of: bandwidth, latency, hop count, signal attenuation rate, non-uniform memory access (NUMA) distance, and the theoretical completion time of a collective communication primitive.
  5. The heterogeneous resource scheduling method according to any one of claims 1 to 4, wherein the communication efficiency value is positively correlated with how tightly the first computing power nodes in the first computing power cluster are physically connected, and acquiring the target computing power cluster based on the communication efficiency values comprises: acquiring a first communication efficiency value, which is the maximum of the communication efficiency values; acquiring a second computing power cluster, which is the computing power cluster corresponding to the first communication efficiency value; and taking the second computing power cluster as the target computing power cluster.
  6. The heterogeneous resource scheduling method according to any one of claims 1 to 4, wherein the communication efficiency value is positively correlated with how tightly the first computing power nodes in the first computing power cluster are physically connected, each computing power cluster has a corresponding computing power pointer that at least indicates whether the cluster is busy or idle, and acquiring the target computing power cluster based on the communication efficiency values comprises: acquiring a plurality of second communication efficiency values from the communication efficiency values, wherein each second communication efficiency value is the communication efficiency value of a computing power cluster that is in the idle state; acquiring a third communication efficiency value, which is the maximum of the second communication efficiency values; acquiring a third computing power cluster, which is the computing power cluster corresponding to the third communication efficiency value; if the computing power pointer of the third computing power cluster indicates idle, taking the third computing power cluster as the target computing power cluster and setting its computing power pointer to busy; and if the computing power pointer of the third computing power cluster indicates busy, acquiring a new plurality of second communication efficiency values and repeating until the target computing power cluster is obtained.
  7. The heterogeneous resource scheduling method according to any one of claims 1 to 4, wherein executing the current computing task based on the target computing power cluster comprises: traversing all execution instructions contained in the current computing task to obtain a first instruction to be executed, wherein the first instruction is any instruction that must be executed to complete the current computing task; for each first instruction, executing the following steps until all execution instructions contained in the current computing task have been executed: mapping the first instruction to a second instruction based on a unified heterogeneous instruction set, wherein the second instruction is a vendor-specific driver instruction for a fourth computing power node, and the fourth computing power node is the computing power node in the target computing power cluster that is configured to execute the first instruction; and issuing the second instruction to the fourth computing power node and driving the fourth computing power node to perform the operation specified by the second instruction, so as to complete the current computing task.
  8. The heterogeneous resource scheduling method according to any one of claims 1 to 4, wherein after the current computing task is executed based on the target computing power cluster, the method further comprises: acquiring a fourth communication efficiency value and a fifth communication efficiency value based on the target computing power cluster, wherein the fourth communication efficiency value is the communication efficiency value of the target computing power cluster before it executes the current computing task, and the fifth communication efficiency value is the communication efficiency value of the target computing power cluster at any time point after the target computing power cluster is obtained and before the current computing task is completed; acquiring an abnormal value based on the fourth communication efficiency value and the fifth communication efficiency value, wherein the abnormal value at least represents the probability that the target computing power cluster is abnormal; and if the abnormal value is greater than or equal to a preset value, resetting the target computing power cluster.
  9. A heterogeneous resource scheduling system, applied to an artificial intelligence computing power service platform comprising a plurality of computing power nodes, characterized in that the heterogeneous resource scheduling system comprises: a processor configured to acquire a plurality of computing power clusters based on a current computing task, wherein the current computing task is acquired in advance; to acquire first computing power clusters from the computing power clusters, wherein a first computing power cluster is a computing power cluster for which no corresponding communication efficiency value has yet been obtained, and the communication efficiency value at least represents how tightly the first computing power nodes in the first computing power cluster are physically connected; to obtain a corresponding communication efficiency value for each first computing power cluster, the communication efficiency values being in one-to-one correspondence with the computing power clusters; and to acquire a target computing power cluster based on the communication efficiency values; and a scheduler configured to execute the current computing task based on the target computing power cluster.
  10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the heterogeneous resource scheduling method of any one of claims 1 to 8.
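
The scoring, selection and monitoring steps recited in claims 2, 3, 6 and 8 can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the `Cluster` class, the `alpha` weighting, the `1 / (1 + cost)` form of the efficiency value, the relative-drop definition of the abnormal value and the `threshold` default are all hypothetical, since the claims do not fix any formula. The sketch is single-threaded, so the claim 6 retry on a cluster that has turned busy is not exercised.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass
class Cluster:
    name: str
    nodes: tuple
    busy: bool = False  # stands in for the claim 6 "computing power pointer"


def efficiency(cluster, distance, transfer_time, alpha=0.5):
    """Claims 2-3 sketch: combine distance and time cost over ordered node pairs.

    `distance` and `transfer_time` are caller-supplied functions (hypothetical);
    `alpha` is an assumed weighting that the claims do not specify.
    """
    pairs = list(permutations(cluster.nodes, 2))
    if not pairs:
        return 1.0
    cost = sum(
        alpha * distance(a, b) + (1 - alpha) * transfer_time(a, b)
        for a, b in pairs
    )
    return 1.0 / (1.0 + cost / len(pairs))


def pick_idle_cluster(clusters, scores):
    """Claim 6 sketch: take the best-scoring idle cluster and mark it busy."""
    idle = [c for c in clusters if not c.busy]
    if not idle:
        return None
    best = max(idle, key=lambda c: scores[c.name])
    best.busy = True
    return best


def is_anomalous(before, during, threshold=0.3):
    """Claim 8 sketch: use the relative drop in efficiency as the abnormal value."""
    drop = (before - during) / before if before > 0 else 0.0
    return drop >= threshold


# Toy data: nodes are integers, distance is the index gap, transfer time is constant.
def index_gap(a, b):
    return abs(a - b)


def unit_time(a, b):
    return 1.0


clusters = [Cluster("tight", (0, 1)), Cluster("loose", (0, 5))]
scores = {c.name: efficiency(c, index_gap, unit_time) for c in clusters}
first = pick_idle_cluster(clusters, scores)
second = pick_idle_cluster(clusters, scores)
print(first.name, second.name)
```

In a concurrent scheduler the idle check and the busy update would have to be a single atomic step (for example under a lock); otherwise two tasks could claim the same cluster, which is the situation the retry branch of claim 6 guards against.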

Description

Heterogeneous resource scheduling method, system and storage medium

Technical Field

The application relates to the technical field of computers, in particular to a heterogeneous resource scheduling method, a heterogeneous resource scheduling system and a storage medium.

Background

Federated heterogeneous resource scheduling is a core collaborative management and control technology for distributed computing scenarios. It manages computing, storage, network and other resources that belong to multiple independent, autonomous management domains. Because these resources differ markedly in computing power architecture, hardware type, storage configuration, network facilities and so on, the core goal of federated heterogeneous resource scheduling is to schedule computing tasks, on demand, to the optimal heterogeneous nodes or cross-domain resources (the computing power nodes referred to below). It does so through unified resource abstraction modeling, a global collaborative decision mechanism and a dynamic adaptive allocation strategy, while strictly guaranteeing the autonomy, data security and compliance requirements of each management domain, and ultimately aims to optimize global resource utilization, task execution efficiency and system reliability as a whole.

However, existing federated heterogeneous resource scheduling methods have a technical defect: they construct computing power clusters by scalar counting of resource quotas and do not consider the underlying physical interconnection topology among computing power nodes at all. This defect is particularly pronounced in scenarios with extremely high communication efficiency requirements, such as large model training.

For example, if the selected computing power nodes are physically scattered, they must communicate across Peripheral Component Interconnect Express (PCIe) switches, so the communication bandwidth of the computing power cluster is limited by the cross-node network transmission capacity. The linear speedup ratio of the computing power cluster then drops by more than 50%, which seriously degrades task execution efficiency, readily causes the "computing power island" problem in which resources cannot cooperate efficiently, and constrains the overall performance of the heterogeneous resources.

Disclosure of Invention

The application aims to provide a heterogeneous resource scheduling method, a heterogeneous resource scheduling system and a storage medium, so as to solve the technical problem that task execution efficiency in existing heterogeneous resource scheduling methods is easily degraded by poor data transmission efficiency.

To achieve the above purpose, the present application provides the following technical solutions.

In a first aspect, the application provides a heterogeneous resource scheduling method applied to an artificial intelligence computing power service platform that includes a plurality of computing power nodes. The method includes: acquiring a plurality of computing power clusters based on a current computing task, wherein each computing power cluster comprises a plurality of first computing power nodes, and each first computing power node is any one of the computing power nodes; acquiring first computing power clusters from the computing power clusters, wherein a first computing power cluster is a computing power cluster for which no corresponding communication efficiency value has yet been obtained, and the communication efficiency value at least represents how tightly the first computing power nodes in the first computing power cluster are physically connected; obtaining a corresponding communication efficiency value for each first computing power cluster, the communication efficiency values being in one-to-one correspondence with the computing power clusters; acquiring a target computing power cluster based on the communication efficiency values; and executing the current computing task based on the target computing power cluster.

As a specific solution within this technical solution, obtaining a communication efficiency value based on the first computing power cluster includes: traversing the first computing power nodes in the first computing power cluster to obtain a second computing power node and all third computing power nodes, wherein the second computing power node is any first computing power node that has not yet obtained a corresponding physical topology value