CN-121996391-A - Non-uniform computing power scheduling and training method and device, computer equipment and storage medium

CN121996391A

Abstract

The application relates to a non-uniform computing power scheduling and training method, a device, computer equipment, and a storage medium. The method comprises: acquiring, in real time, the dynamic computing power resources of a hardware cluster and the static resources required by a model to be trained, wherein the dynamic computing power resources comprise the computing power, topological structure, and node state information of hardware nodes, and the static resources comprise the memory occupancy and computation delay of operator stages; constructing a training performance function from the dynamic computing power resources, the static resources, and the model structure of the model to be trained, wherein the training performance function is used for outputting an optimal strategy within a distributed training strategy space; determining a target distributed training strategy according to the training performance function; and executing training tasks according to the target distributed training strategy. The method and device solve the problem in the related art that cluster resource utilization is low because non-uniform computation fragments cannot be utilized, thereby improving cluster resource utilization.

Inventors

  • WANG YUANYUAN
  • TANG NANA
  • WANG YUYANG
  • ZHU XIAOFENG
  • PAN SHU
  • Fei Zheyao
  • DU JIN
  • YANG FEI

Assignees

  • Zhejiang Lab (之江实验室)

Dates

Publication Date
2026-05-08
Application Date
2026-04-08

Claims (10)

  1. A non-uniform computing power scheduling and training method, comprising: acquiring, in real time, dynamic computing power resources of a hardware cluster and static resources required by a model to be trained, wherein the dynamic computing power resources comprise the computing power, topological structure, and node state information of hardware nodes; constructing a training performance function according to the dynamic computing power resources, the static resources, and the model structure of the model to be trained, wherein the training performance function is used for outputting an optimal strategy within a distributed training strategy space; determining a target distributed training strategy according to the training performance function; and executing training tasks according to the target distributed training strategy.
  2. The non-uniform computing power scheduling and training method of claim 1, wherein acquiring the dynamic computing power resources of the hardware cluster in real time comprises: acquiring the dynamic computing power resources of the hardware cluster in real time through a pre-constructed resource monitoring agent, wherein the dynamic computing power resources are used for updating the training performance function, and wherein, in the dynamic computing power resources, the node state information of a hardware node comprises the number and availability state of its GPUs.
  3. The non-uniform computing power scheduling and training method according to claim 1, wherein the distributed training strategy space comprises distributed training strategies, and wherein each distributed training strategy is generated concurrently by a second-level distributed strategy generator.
  4. The non-uniform computing power scheduling and training method according to claim 3, wherein the second-level distributed strategy generator comprises a memory occupancy state table and a computation delay state table; the memory occupancy state table is used for storing the estimated memory occupancy of each operator under a specific configuration, and the computation delay state table is used for storing the estimated computation delay of each operator under the specific configuration.
  5. The non-uniform computing power scheduling and training method of claim 1, wherein determining a target distributed training strategy according to the training performance function comprises: determining performance indexes of different distributed training strategies in the distributed training strategy space according to the training performance function; and determining the distributed training strategy corresponding to the highest performance index as the target distributed training strategy.
  6. The non-uniform computing power scheduling and training method of claim 1, wherein executing training tasks according to the target distributed training strategy comprises: starting a training task container according to the target distributed training strategy; and controlling the training task container to load the required model parameters nearby, according to the mapping between shards and checkpoints determined by the target distributed training strategy, and to execute the distributed training task of the model to be trained.
  7. The non-uniform computing power scheduling and training method of claim 1, wherein the training performance function is expressed as T = f(M, C, S), wherein T is the number of tokens processed per second per GPU, M is the model structure of the model to be trained, C is the dynamic computing power resource, and S is the distributed training strategy space.
  8. A non-uniform computing power scheduling and training device, comprising an acquisition module, a construction module, and a scheduling module; the acquisition module is used for acquiring, in real time, dynamic computing power resources of a hardware cluster and static resources required by a model to be trained, wherein the dynamic computing power resources comprise the computing power, topological structure, and node state information of hardware nodes; the construction module is used for constructing a training performance function according to the dynamic computing power resources, the static resources, and the model structure of the model to be trained, wherein the training performance function is used for outputting an optimal strategy within a distributed training strategy space; and the scheduling module is used for determining a target distributed training strategy according to the training performance function and executing training tasks according to the target distributed training strategy.
  9. A computer device comprising a memory and a processor, wherein the memory stores a computer program and the processor is arranged to run the computer program to perform the steps of the non-uniform computing power scheduling and training method of any one of claims 1 to 7.
  10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the non-uniform computing power scheduling and training method of any one of claims 1 to 7.
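Taken together, claims 1, 5, and 7 describe a scoring loop: evaluate every candidate strategy in the distributed training strategy space with the performance function and pick the one with the highest index. The sketch below illustrates that loop; the `ClusterState` fields, the `dp`/`tp` strategy encoding, and the scoring formula are illustrative assumptions, not the patent's actual cost model.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    """Simplified dynamic computing power resources (claim 1)."""
    gpu_count: int            # available GPUs in the cluster
    gpu_tflops: float         # per-GPU compute capability
    interconnect_gbps: float  # topology proxy: inter-node link bandwidth


def performance(strategy: dict, cluster: ClusterState) -> float:
    """Toy training performance function (claim 7): a stand-in for
    tokens-per-second-per-GPU that penalizes communication-heavy
    strategies on slow interconnects."""
    dp, tp = strategy["dp"], strategy["tp"]
    if dp * tp > cluster.gpu_count:
        return 0.0  # strategy does not fit the available hardware
    comm_penalty = (tp - 1) * (100.0 / cluster.interconnect_gbps)
    return max(cluster.gpu_tflops - comm_penalty, 0.0)


def select_strategy(space: list, cluster: ClusterState) -> dict:
    """Claim 5: the strategy with the highest performance index wins."""
    return max(space, key=lambda s: performance(s, cluster))


# Candidate (data-parallel, tensor-parallel) layouts for 8 GPUs.
space = [{"dp": 8, "tp": 1}, {"dp": 4, "tp": 2}, {"dp": 2, "tp": 4}]
best = select_strategy(space, ClusterState(8, 100.0, 50.0))
```

Because the dynamic resources are re-polled continuously (claim 2), `select_strategy` can be re-run whenever the cluster state changes, which is what lets newly freed computation fragments be absorbed mid-training.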

Description

Non-uniform computing power scheduling and training method and device, computer equipment and storage medium

Technical Field

The present application relates to the field of artificial intelligence infrastructure technology, and in particular to a non-uniform computing power scheduling and training method, apparatus, computer device, and storage medium.

Background

With the rapid development of large-model technology, the distribution, type, and number of available computing resources in a multi-tenant shared cluster are in a continuously and dynamically changing, non-uniform state, because the time windows in which users submit and release tasks are highly random; this poses unprecedented challenges to resource allocation in the underlying computing infrastructure. Against this background, an efficient and reliable non-uniform computing power scheduling and training technology directly influences cluster resource utilization during training. In currently mainstream cluster scheduling systems and training frameworks, the resource allocation strategy typically treats a training task as a static request for a fixed number of computing cards of a fixed model; once allocated, the allocation typically remains unchanged throughout the training period. The disadvantage of this approach is that when a higher-priority task intervenes or some nodes are released, the system cannot actively absorb these newly emerging heterogeneous computation fragments and distribute them to ongoing training tasks, resulting in long-term low cluster resource utilization. For the problem in the related art that cluster resource utilization is low because non-uniform computation fragments cannot be utilized, no effective solution has yet been proposed.
Disclosure of Invention

This embodiment provides a method, an apparatus, a computer device, and a storage medium for non-uniform computing power scheduling and training, so as to solve the problem in the related art that cluster resource utilization is low because non-uniform computing power fragments cannot be utilized. In a first aspect, this embodiment provides a non-uniform computing power scheduling and training method, including: acquiring, in real time, dynamic computing power resources of a hardware cluster and static resources required by a model to be trained, wherein the dynamic computing power resources comprise the computing power, topological structure, and node state information of hardware nodes; constructing a training performance function according to the dynamic computing power resources, the static resources, and the model structure of the model to be trained, wherein the training performance function is used for outputting an optimal strategy within a distributed training strategy space; determining a target distributed training strategy according to the training performance function; and executing training tasks according to the target distributed training strategy. In some of these embodiments, acquiring the dynamic computing power resources of the hardware cluster in real time includes: acquiring the dynamic computing power resources of the hardware cluster in real time through a pre-constructed resource monitoring agent, wherein the dynamic computing power resources are used for updating the training performance function, and wherein, in the dynamic computing power resources, the node state information of a hardware node comprises the number and availability state of its GPUs.
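The resource monitoring agent described above can be pictured as a polling loop that reports node-state changes so the training performance function stays current. This is a minimal sketch under stated assumptions: the `poll` callable, the state-dictionary shape, and the fixed round count are all hypothetical illustrations, not the patent's agent implementation.

```python
import time
from typing import Callable, Optional


def monitor_agent(poll: Callable[[], dict],
                  on_change: Callable[[dict], None],
                  interval_s: float = 1.0,
                  rounds: int = 3) -> Optional[dict]:
    """Minimal resource-monitoring agent: polls cluster node state and
    fires a callback only when the state changes, so downstream code can
    update the training performance function (cf. the claim-2 embodiment).
    Returns the last observed state."""
    last = None
    for _ in range(rounds):
        state = poll()  # e.g. {"node1": {"gpus": 8, "free": 6}, ...}
        if state != last:
            on_change(state)  # dynamic resources changed: re-score strategies
            last = state
        time.sleep(interval_s)
    return last


# Usage with a scripted sequence standing in for real cluster queries.
states = iter([
    {"node1": {"gpus": 8, "free": 8}},
    {"node1": {"gpus": 8, "free": 8}},  # unchanged -> no callback
    {"node1": {"gpus": 8, "free": 6}},  # two GPUs taken -> callback fires
])
events = []
final = monitor_agent(lambda: next(states), events.append, interval_s=0.0)
```

A production agent would query the GPUs directly (for example via a device-management API) and run indefinitely; the change-detection structure stays the same.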
In some of these embodiments, the distributed training strategy space comprises distributed training strategies, and each distributed training strategy is generated concurrently by a second-level distributed strategy generator. In some of these embodiments, the second-level distributed strategy generator includes a memory occupancy state table and a computation delay state table; the memory occupancy state table is used for storing the estimated memory occupancy of each operator under a specific configuration, and the computation delay state table is used for storing the estimated computation delay of each operator under the specific configuration. In some of these embodiments, determining a target distributed training strategy according to the training performance function includes: determining performance indexes of different distributed training strategies in the distributed training strategy space according to the training performance function; and determining the distributed training strategy corresponding to the highest performance index as the target distributed training strategy. In some of these embodiments, executing training tasks according to the target distributed training strategy includes: starting a training task container according to the target distributed training strategy; and controlling the training task container to load the required model parameters nearby, according to the mapping between shards and checkpoints determined by the target distributed training strategy, and to execute the distributed training task of the model to be trained.
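The two state tables can be understood as memoized lookup tables keyed by (operator, configuration), so the strategy generator never re-estimates the same operator cost twice. The sketch below assumes a toy per-operator cost model with made-up coefficients and a tensor-parallel degree `tp` as the "specific configuration"; none of these numbers come from the patent.

```python
from functools import lru_cache

# Hypothetical base costs per operator at tp=1: (memory in GB, delay in ms).
OPERATORS = {"attention": (2.0, 1.5), "mlp": (3.0, 1.0)}


@lru_cache(maxsize=None)
def memory_occupancy(op: str, tp: int) -> float:
    """Memory occupancy state table entry: estimated GB for `op` under
    tp-way tensor parallelism (weights shard evenly across devices)."""
    return OPERATORS[op][0] / tp


@lru_cache(maxsize=None)
def compute_delay(op: str, tp: int) -> float:
    """Computation delay state table entry: estimated ms for `op`;
    compute shards, but each extra parallel degree adds a small
    communication overhead."""
    base_ms = OPERATORS[op][1]
    return base_ms / tp + 0.1 * (tp - 1)
```

Caching the estimates this way is what makes second-level (seconds-scale) strategy generation plausible: the performance function can sum cached per-operator costs for thousands of candidate strategies instead of profiling each one.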