CN-121979679-A - Multi-dimensional scheduling and energy efficiency optimizing system oriented to artificial intelligence computing power cluster


Abstract

The invention relates to the technical field of artificial intelligence and discloses a multi-dimensional scheduling and energy efficiency optimization system for an artificial intelligence computing power cluster. The system comprises a physical efficiency topology management module, a virtual computing topology extraction module, a topology matching optimization decision module, an online collaborative reconstruction execution module and a continuous learning evolution module. A global physical efficiency topology graph is built from static attribute data and dynamic monitoring data of all physical devices in the computing power cluster, and a dynamic virtual computing topology graph is extracted from the running task. A topology matching degree optimization model is constructed from the two graphs and solved to obtain an optimal mapping target for mapping virtual operators to physical devices, together with an expected matching degree gain. When the expected matching degree gain exceeds a gain threshold, computing power flow topology reconstruction is carried out, and the topology matching degree optimization model is continuously learned and evolved. The invention can improve the efficiency of multi-dimensional scheduling and energy efficiency optimization for artificial intelligence computing power clusters.

Inventors

WU MENGJIE

Assignees

北京愿璟科技信息有限公司

Dates

Publication Date: 20260505
Application Date: 20260123

Claims (10)

1. A multi-dimensional scheduling and energy efficiency optimization system for an artificial intelligence computing power cluster, characterized by comprising a physical efficiency topology management module, a virtual computing topology extraction module, a topology matching optimization decision module, an online collaborative reconstruction execution module and a continuous learning evolution module, wherein: the physical efficiency topology management module is used for constructing and updating a global physical efficiency topology graph in real time based on static attribute data and dynamic monitoring data of all physical devices in the computing power cluster; the virtual computing topology extraction module is used for extracting and updating a dynamic virtual computing topology graph of an artificial intelligence computing task in real time through a lightweight runtime instrumentation technique; the topology matching optimization decision module is used for constructing a topology matching degree optimization model based on the global physical efficiency topology graph and the dynamic virtual computing topology graph, and solving the model to obtain an optimal mapping target for mapping virtual operators to physical devices and an expected matching degree gain; the online collaborative reconstruction execution module is used for determining, when the expected matching degree gain exceeds a preset gain threshold, an operator subgraph set to be migrated based on the difference between the optimal mapping target and the current actual mapping, and migrating the operator subgraph set to be migrated to the target physical devices by streaming state migration to complete computing power flow topology reconstruction; and the continuous learning evolution module is used for continuously learning and evolving the topology matching degree optimization model based on actual performance and energy efficiency feedback data collected after topology reconstruction.
2. The system of claim 1, wherein constructing and updating the global physical efficiency topology graph in real time based on static attribute data and dynamic monitoring data of all physical devices in the computing power cluster comprises: calculating a real-time comprehensive efficiency-to-cost ratio score vector for each physical device based on the dynamic monitoring data and the static attribute data; calculating a real-time effective communication cost for each physical link based on the interconnection topology and real-time network monitoring data; and constructing and continuously updating a weighted global physical efficiency topology graph with physical devices as nodes, physical connections as edges, the real-time comprehensive efficiency-to-cost ratio score vectors as node weights and the real-time effective communication costs as edge weights.
3. The system of claim 2, wherein calculating the real-time comprehensive efficiency-to-cost ratio score vector for each physical device comprises: calculating a normalized instantaneous computing power score, a computing power per unit power consumption score and a heat dissipation efficiency score based on a comprehensive efficiency-to-cost ratio scoring equation set, the mathematical expression of which is as follows: [equation set not reproduced in the source text]; wherein P_score is the normalized instantaneous computing power score, f_current is the current operating frequency of the device, f_base is the reference frequency of the device, γ_throttle is a frequency reduction factor, E_score is the computing power per unit power consumption score, P_current is the real-time power consumption of the device, C_score is the heat dissipation efficiency score, T_max is the upper limit of the chip junction temperature, T_junction is the real-time core temperature, and T_coolant_in is the air inlet temperature; and combining the normalized instantaneous computing power score, the computing power per unit power consumption score and the heat dissipation efficiency score, in that order, into a three-dimensional vector to obtain the real-time comprehensive efficiency-to-cost ratio score vector.
4. The system of claim 2, wherein calculating the real-time effective communication cost for each physical link comprises: acquiring real-time network performance measurement data for each physical link based on the interconnection topology and the real-time network monitoring data; processing the bandwidth utilization and the transmission error rate in the real-time network performance measurement data with a link effective bandwidth calculation formula to obtain the effective bandwidth of each physical link; processing the one-way delay measurement and the packet loss rate in the real-time network performance measurement data with a comprehensive communication delay estimation algorithm to obtain the communication delay of each physical link; and, based on the graph-theoretic shortest path principle, with the effective bandwidth as a capacity constraint and the communication delay as the path cost, calculating the maximum available bandwidth and the minimum communication delay between any two physical device nodes in the global physical efficiency topology graph, which serve respectively as the effective bandwidth and the communication delay between those nodes.
5. The system of claim 1, wherein extracting and updating the dynamic virtual computing topology graph of the artificial intelligence computing task in real time through the lightweight runtime instrumentation technique comprises: installing a topology sensor in the runtime framework of the artificial intelligence computing task and capturing running information of the computational graph at a preset sampling frequency; identifying key operator nodes in the computational graph and generating, from the shapes and data types of their input and output tensors, a feature vector representing computational intensity; monitoring tensor transfers on the data dependency edges between operators and recording the amount of data transmitted and the communication trigger frequency as the data flow characteristics of each edge; and dynamically updating the node set, the edge set and their associated characteristics as the task execution stage evolves, to form the dynamic virtual computing topology graph.
6. The system of claim 1, wherein constructing and solving the topology matching degree optimization model based on the global physical efficiency topology graph and the dynamic virtual computing topology graph to obtain the optimal mapping target for mapping virtual operators to physical devices and the expected matching degree gain comprises: defining a mapping function from virtual operator nodes to physical device nodes; constructing an overall topology matching degree scoring function and establishing an optimization model whose objective is to maximize that function; and solving the optimization model with a simulated annealing algorithm, subject to resource capacity constraints and data dependency reachability constraints, to obtain the optimal mapping target and calculate the expected matching degree gain.
7. The system of claim 6, wherein the mathematical expression of the overall topology matching degree scoring function is as follows: [equation not reproduced in the source text]; where S(M) is the overall topology matching degree scoring function, S_node(M) is the sum of node matching scores, S_edge(M) is the sum of edge matching costs, λ is the communication cost weight coefficient, V_v is the set of virtual operators, E_v is the set of dependency edges between virtual operators, C_v is the computation density feature vector of operator v, W_v(M(v)) is the real-time comprehensive efficiency-to-cost ratio score vector of physical device node M(v), D_e.V and D_e.F are the data transmission amount and communication frequency of dependency edge e, BW_eff(M(v_i), M(v_j)) and Lat(M(v_i), M(v_j)) are the effective bandwidth and communication delay between device nodes M(v_i) and M(v_j), sim is a similarity function, M(v_i) and M(v_j) are the physical device nodes to which virtual operators v_i and v_j are respectively mapped, and v_i and v_j are virtual operators.
8. The system of claim 1, wherein determining the operator subgraph set to be migrated based on the difference between the optimal mapping target and the current actual mapping when the expected matching degree gain exceeds the preset gain threshold comprises: comparing the mapped positions of all virtual operators in the dynamic virtual computing topology graph under the optimal mapping target and under the current actual mapping to obtain a set of operators to be moved; performing connectivity analysis on the operators in that set along the data dependency edges of the dynamic virtual computing topology graph to obtain at least one connected component, each connected component forming a candidate migration subgraph; evaluating the migration cost of each candidate migration subgraph based on the amount of state data to be migrated for the operators it contains and on the predicted communication overhead of migrating the subgraph from its current mapping to the corresponding target devices in the optimal mapping target, to obtain a migration cost value for each candidate migration subgraph; and weighing the migration cost values against the expected matching degree gain to determine the final operator subgraph set to be migrated.
9. The system of claim 1, wherein migrating the operator subgraph set to be migrated to the target physical devices by streaming state migration to complete computing power flow topology reconstruction comprises: identifying, based on the optimal mapping target and the current actual mapping, the operators in the dynamic virtual computing topology graph whose mapping changes, and extracting the connected subgraphs that these operators form under the data dependency relations of the graph, to obtain the operator subgraph set to be migrated; coordinating the task processes of the operators involved in the operator subgraph set to be migrated and performing coordinated checkpointing to obtain a consistent checkpoint state; starting a new computing process on each target physical device designated by the optimal mapping target, loading the checkpoint state, and re-establishing, according to the optimal mapping target, communication connections with the physical devices hosting the non-migrated operators in the cluster; and resuming execution of all operators in the operator subgraph set to be migrated from the point of interruption so that the migrated operators rejoin the data stream seamlessly, updating the global task mapping state so that the current actual mapping equals the optimal mapping target, and thereby completing computing power flow topology reconstruction.
10. The system of claim 1, wherein continuously learning and evolving the topology matching degree optimization model based on the actual performance and energy efficiency feedback data after topology reconstruction comprises: monitoring the computing task whose topology has been reconstructed and collecting actual performance and energy efficiency indicators during the stable operation period after reconstruction to obtain a feedback data set; evaluating, based on the feedback data set, the actual matching degree gain obtained after the optimal mapping target is executed; and adjusting the parameters of the topology matching degree optimization model by feedback, based on the difference between the actual matching degree gain and the expected matching degree gain, so as to update the model.
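
Claims 6 and 7 describe scoring a mapping of virtual operators to physical devices and searching for a better mapping by simulated annealing. The scoring formula is not reproduced in this text, so the sketch below assumes one form that is consistent with the symbol definitions of claim 7: S(M) = S_node(M) - λ·S_edge(M), where S_node sums sim(C_v, W_v(M(v))) over operators and S_edge sums D_e.V / BW_eff + D_e.F · Lat over dependency edges. Cosine similarity for sim, a per-device operator-count capacity, the linear cooling schedule, the threshold value and every function and variable name are illustrative assumptions rather than the patented implementation, and the data dependency reachability constraint is left out.

```python
import math
import random

def cosine(a, b):
    """Assumed similarity function `sim`: cosine of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def score(mapping, op_feat, dev_score, edges, bw, lat, lam):
    """Assumed form of S(M) = S_node(M) - lam * S_edge(M)."""
    s_node = sum(cosine(op_feat[v], dev_score[mapping[v]]) for v in op_feat)
    s_edge = 0.0
    for (vi, vj), (volume, freq) in edges.items():
        di, dj = mapping[vi], mapping[vj]
        if di != dj:  # assumption: co-located operators incur no link cost
            s_edge += volume / bw[di][dj] + freq * lat[di][dj]
    return s_node - lam * s_edge

def anneal(current, op_feat, dev_score, edges, bw, lat, lam, capacity,
           steps=2000, t0=1.0, seed=0):
    """Return (best mapping found, expected gain over `current`)."""
    rng = random.Random(seed)
    ops, devs = list(op_feat), list(dev_score)
    args = (op_feat, dev_score, edges, bw, lat, lam)
    cur = dict(current)
    cur_s = base_s = score(cur, *args)
    best, best_s = dict(cur), cur_s
    for k in range(steps):
        temp = t0 * (1.0 - k / steps) + 1e-9  # linear cooling (assumed)
        v, d = rng.choice(ops), rng.choice(devs)
        old = cur[v]
        if d == old or sum(1 for u in ops if cur[u] == d) >= capacity[d]:
            continue  # skip no-op moves and moves that exceed capacity
        cur[v] = d
        new_s = score(cur, *args)
        if new_s >= cur_s or rng.random() < math.exp((new_s - cur_s) / temp):
            cur_s = new_s  # accept the move
            if cur_s > best_s:
                best, best_s = dict(cur), cur_s
        else:
            cur[v] = old  # reject the move

    return best, best_s - base_s

# Hypothetical two-operator, two-device example.
op_feat = {"matmul": (0.9, 0.6, 0.2), "softmax": (0.2, 0.5, 0.9)}
dev_score = {"gpu0": (0.9, 0.7, 0.3), "gpu1": (0.3, 0.4, 0.8)}
edges = {("matmul", "softmax"): (4.0, 2.0)}  # (data volume, trigger frequency)
bw = {"gpu0": {"gpu1": 100.0}, "gpu1": {"gpu0": 100.0}}
lat = {"gpu0": {"gpu1": 0.01}, "gpu1": {"gpu0": 0.01}}
capacity = {"gpu0": 2, "gpu1": 2}
current = {"matmul": "gpu1", "softmax": "gpu0"}

best, gain = anneal(current, op_feat, dev_score, edges, bw, lat, 0.5, capacity)
GAIN_THRESHOLD = 0.05  # hypothetical preset gain threshold
should_reconstruct = gain > GAIN_THRESHOLD
```

Because the search tracks the best mapping it has visited and starts from the current mapping, the returned gain is never negative; reconstruction would be triggered only when that gain clears the threshold, as in claim 1.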

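The selection of migration subgraphs in claim 8 (compare mappings, split the moved operators into connected components, then weigh cost against gain) can be sketched as follows. This is a minimal illustration under stated assumptions: the claims do not say how state size and communication overhead are combined into a cost or how the expected gain is apportioned to a subgraph, so the additive cost and the caller-supplied `comm_overhead` and `gain_of` functions are hypothetical, as are all names.

```python
from collections import defaultdict, deque

def candidate_subgraphs(current, target, dependency_edges):
    """Group operators whose mapping changes into connected components."""
    moved = {op for op, dev in target.items() if current.get(op) != dev}
    adj = defaultdict(set)
    for a, b in dependency_edges:
        if a in moved and b in moved:  # keep only edges between moved operators
            adj[a].add(b)
            adj[b].add(a)
    seen, components = set(), []
    for start in sorted(moved):
        if start in seen:
            continue
        seen.add(start)
        comp, queue = set(), deque([start])
        while queue:  # breadth-first search over moved operators
            u = queue.popleft()
            comp.add(u)
            for w in adj[u] - seen:
                seen.add(w)
                queue.append(w)
        components.append(comp)
    return components

def select_for_migration(components, state_size, comm_overhead, gain_of):
    """Keep components whose (hypothetical) gain share exceeds their cost."""
    selected = []
    for comp in components:
        # Assumed cost model: total state data plus predicted communication overhead.
        cost = sum(state_size[op] for op in comp) + comm_overhead(comp)
        if gain_of(comp) > cost:
            selected.append(comp)
    return selected

# Hypothetical four-operator chain a -> b -> c -> d.
current = {"a": "gpu0", "b": "gpu0", "c": "gpu1", "d": "gpu1"}
target = {"a": "gpu1", "b": "gpu1", "c": "gpu1", "d": "gpu0"}
edges = [("a", "b"), ("b", "c"), ("c", "d")]
state_size = {"a": 1.0, "b": 2.0, "c": 1.0, "d": 5.0}

components = candidate_subgraphs(current, target, edges)
selected = select_for_migration(
    components,
    state_size,
    comm_overhead=lambda comp: 0.5,        # flat overhead per subgraph (assumed)
    gain_of=lambda comp: 2.0 * len(comp),  # gain share per operator (assumed)
)
```

In this example operators a, b and d change device; a and b are joined by a dependency edge and form one candidate subgraph, while d is separated from them by the unmoved operator c and forms its own. Only the first subgraph passes the cost-benefit check under the assumed numbers.
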
Description

Multi-dimensional scheduling and energy efficiency optimizing system oriented to artificial intelligence computing power cluster

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a multi-dimensional scheduling and energy efficiency optimization system for an artificial intelligence computing power cluster.

Background

In the field of artificial intelligence computing power cluster scheduling, the prior art generally follows a traditional resource allocation approach: hardware resources of physical devices, such as the CPU (central processing unit), memory and GPU (graphics processing unit), are treated as independent slots, and scheduling decisions focus on whether the remaining resources meet task demands. Such schemes do not consider the deep coupling between the physical topology characteristics of the cluster and the computing characteristics of the task. They do not fully exploit the inherent graph-computation characteristics of AI training tasks, and they do not uniformly encode and jointly optimize the real-time efficiency and network state of the physical devices. As a result, the matching of tasks to resources stays at a basic adaptation level, and an optimal balance between computing power utilization and energy efficiency ratio is difficult to achieve. Existing schemes also adapt poorly to dynamic changes: device load fluctuations, temperature changes and the dynamic evolution of a task's computation and communication pattern are not perceived in real time, so wasted computing power, overheating and frequency throttling of hot devices, and cross-device communication bottlenecks readily occur.

In addition, existing computing power scheduling systems lack a principled reconstruction decision mechanism and an imperceptible migration scheme, leaving two core pain points unresolved: when to reconstruct, and how to reconstruct without the task perceiving it. On the one hand, traditional systems have no effective expected-gain evaluation, so they either redistribute resources blindly, making the migration cost exceed the performance improvement, or miss optimization opportunities because the mapping relation is not adjusted in time. On the other hand, when the mapping between a task and devices needs to be adjusted, whole-task migration or interrupting migration is usually adopted, which generates a large amount of state data transmission overhead, interrupts task execution, and seriously affects service continuity and user experience. These defects make existing scheduling systems ill-suited to the dynamic scheduling requirements of large-scale AI clusters and limit improvement of the cluster's overall performance and energy efficiency. How to improve the overall performance and energy efficiency optimization level of the computing power cluster has therefore become a problem to be solved.

Disclosure of Invention

The invention provides a multi-dimensional scheduling and energy efficiency optimization system for an artificial intelligence computing power cluster, aiming to solve the problems described in the background.
In order to achieve the above purpose, the invention provides a multi-dimensional scheduling and energy efficiency optimization system for an artificial intelligence computing power cluster, characterized by comprising a physical efficiency topology management module, a virtual computing topology extraction module, a topology matching optimization decision module, an online collaborative reconstruction execution module and a continuous learning evolution module, wherein: the physical efficiency topology management module is used for constructing and updating a global physical efficiency topology graph in real time based on static attribute data and dynamic monitoring data of all physical devices in the computing power cluster; the virtual computing topology extraction module is used for extracting and updating a dynamic virtual computing topology graph of an artificial intelligence computing task in real time through a lightweight runtime instrumentation technique; the topology matching optimization decision module is used for constructing a topology matching degree optimization model based on the global physical efficiency topology graph and the dynamic virtual computing topology graph, and solving the model to obtain an optimal mapping target for mapping virtual operators to physical devices and an expected matching degree gain; the online collaborative reconstruction execution module is used for determining, when the expected matching degree gain exceeds a preset gain threshold, an operator subgraph set to be migrated based on the difference between the optimal mapping target and the current actual mapping, and migrating the operator su