
CN-122019141-A - Distributed training parallel strategy searching method, device and system

CN 122019141 A

Abstract

The invention relates to a distributed training parallel strategy search method, device, and system. The search method comprises: collecting hardware resource information X(i) and network communication performance information Y(i) of each computing node i; extracting static meta-features of the target model and configuration-parameter features of candidate parallel strategies; searching for an applicable parallel strategy set K according to the hardware resource information X(i) of each computing node i and the static meta-features of the target model; calculating, from the hardware resource information X(i) and network communication performance information Y(i) of each computing node i, the static meta-features of the target model, and the configuration-parameter features of the applicable parallel strategy set K, the training performance indexes of executing each parallel strategy at the computing nodes i to train the target model; and screening and executing the optimal parallel strategy based on these training performance indexes according to the user-specified optimization target. The distributed training parallel strategy search method provided by the invention achieves globally optimal parallel strategy selection and good training results.

Inventors

  • GONG HANG
  • LIU CHUNYAN
  • XIANG YANG
  • LI DINGQUAN
  • SUN QIANG
  • XIAO JIE

Assignees

  • Shaoguan Data Industry Research Institute (韶关市数据产业研究院)

Dates

Publication Date
2026-05-12
Application Date
2025-12-30

Claims (10)

  1. A distributed training parallel strategy search method, comprising: S10, acquiring hardware resource information X(i) and network communication performance information Y(i) of each computing node i; S20, extracting static meta-features of the target model and configuration-parameter features of candidate parallel strategies; S30, searching for an applicable parallel strategy set K according to the hardware resource information X(i) of each computing node i and the static meta-features of the target model; S40, calculating, according to the hardware resource information X(i) and network communication performance information Y(i) of each computing node i, the static meta-features of the target model, and the configuration-parameter features of the applicable parallel strategy set K, the training performance indexes of executing each parallel strategy at the computing nodes to train the target model; S50, screening and executing the optimal parallel strategy based on the training performance indexes according to the user-specified optimization target.
  2. The distributed training parallel strategy search method according to claim 1, wherein the step S40 comprises: S41, constructing a resource graph G from the hardware resource information X(i) and network communication performance information Y(i) of each computing node i; S42, obtaining a target-model feature vector f_m from the static meta-features of the target model; S43, obtaining, for each parallel strategy k in the applicable parallel strategy set K, a strategy feature vector p_k from its configuration-parameter features; S44, performing feature fusion on the resource graph G, the target-model feature vector f_m, and each strategy feature vector p_k to obtain a joint characterization vector z_k; S45, calculating, from the joint characterization vector z_k, the training throughput y1(k), the single-step communication time y2(k), and the peak GPU memory occupancy y3(k) of training the target model under each parallel strategy k at the computing nodes i.
  3. The distributed training parallel strategy search method according to claim 1, further comprising: S60, periodically calculating the performance error LOSS of the target-model training, and if the performance error LOSS exceeds a threshold, re-executing steps S10 to S50.
  4. The distributed training parallel strategy search method according to claim 2, wherein the step S41 comprises: S411, constructing a node set V; S412, constructing an edge set E; S413, constructing a feature matrix H_V of the node set V; S414, constructing a feature matrix H_E of the edge set E; S415, constructing the resource graph G.
  5. The distributed training parallel strategy search method according to claim 2, wherein the step S44 comprises: S441, normalizing the continuous numerical features of the resource graph G, the target-model feature vector f_m, and each strategy feature vector p_k; S442, aggregating neighboring-node information to obtain, for each node i, a local topology-aware vector h_i that associates its neighboring nodes; S443, weighting and aggregating the contributions of different neighboring nodes with an attention mechanism, and performing a weighted summation of the local topology-aware vectors h_i of all nodes to obtain a global graph representation vector g; S444, splicing and fusing the global graph representation vector g, the normalized target-model feature vector f_m, and each strategy feature vector p_k to obtain the joint characterization vector z_k.
  6. The distributed training parallel strategy search method according to claim 5, wherein the specific procedure of step S442 is as follows: for each node i, L layers of graph-convolution iterations are performed, and the local topology-aware vector of node i at the l-th convolution layer can be calculated by the following formula: h_i^(l) = σ( W^(l) · [ h_i^(l-1) ‖ Σ_{j∈N(i)} φ(h_j^(l-1), e_ij) ] + b^(l) ), where h_i^(l-1) is the input value of node i at the l-th convolution layer, h_i^(0) is the initial feature vector of node i, N(i) is the set of neighboring nodes of node i, e_ij is the feature vector of the edge connecting node i with neighboring node j, φ is the edge-feature fusion module, W^(l) and b^(l) are the learnable parameters of the l-th convolution layer, σ is a nonlinear activation function, and ‖ is the vector concatenation operation.
  7. The distributed training parallel strategy search method according to claim 6, wherein the specific procedure of step S443 is as follows: the attention weight α_i of the local topology-aware vector h_i^(L) of each node i is computed as α_i = exp(q^T · W_a · h_i^(L)) / Σ_{j=1}^{m} exp(q^T · W_a · h_j^(L)); the local topology-aware vectors of all nodes are then weighted and summed to obtain the global graph representation vector g = Σ_{i=1}^{m} α_i · h_i^(L); where q is the context query vector generated from the parallel-strategy feature vector p_k, W_a is a learnable parameter, and m is the number of nodes.
  8. The distributed training parallel strategy search method according to claim 7, wherein the specific procedure of step S444 is as follows: the joint characterization vector z_k is calculated by the following formula: z_k = F( g ‖ f_m ‖ p_k ), where F is a fusion function and ‖ is the vector concatenation operation.
  9. A distributed training parallel strategy search apparatus, comprising: a multi-dimensional resource acquisition module for acquiring the hardware resource information X(i) and network communication performance information Y(i) of each computing node i; a feature extraction module for extracting the static meta-features of the target model and the configuration-parameter features of the parallel strategies; a strategy search module for searching for the applicable parallel strategy set K according to the hardware resource information X(i) of each computing node i and the static meta-features of the target model; a performance prediction module for calculating, according to the hardware resource information X(i) and network communication performance information Y(i) of each computing node i, the static meta-features of the target model, and the configuration-parameter features of the applicable parallel strategy set K, the training performance indexes of executing each parallel strategy at each computing node i to train the target model; and an execution module for screening and executing the optimal parallel strategy based on the training performance indexes according to the user-specified optimization target.
  10. A distributed training parallel strategy search system, comprising at least one computing node i and the distributed training parallel strategy search apparatus of claim 9; the apparatus searches for and executes, at the computing nodes i, the parallel strategy best adapted to training the target model, according to the target model, the characteristic information of each computing node i, and the user-specified optimization target.
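The message passing, attention pooling, and fusion recited in claims 6 to 8 can be sketched numerically. The sketch below is illustrative only: the array shapes, the additive edge-fusion stand-in for the module φ, the tanh activation, and the dot-product attention form are assumptions, not the claimed implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def gnn_layer(H, E, neighbors, W, b):
    """One graph-convolution layer in the spirit of claim 6:
    h_i' = sigma(W @ [h_i || sum_{j in N(i)} phi(h_j, e_ij)] + b),
    with phi(h_j, e_ij) = h_j + e_ij as an illustrative edge-fusion module."""
    msgs = []
    for i, nbrs in enumerate(neighbors):
        agg = sum(H[j] + E[(i, j)] for j in nbrs)   # phi, summed over N(i)
        msgs.append(np.concatenate([H[i], agg]))    # concatenation ||
    return np.tanh(np.stack(msgs) @ W.T + b)        # sigma: nonlinearity

def attention_pool(H, q):
    """Attention-weighted global graph vector in the spirit of claim 7:
    alpha_i = softmax_i(q . h_i), g = sum_i alpha_i h_i."""
    scores = H @ q
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return alpha @ H, alpha

# Toy resource graph: 3 nodes in a ring, node/edge feature dim 4.
d = 4
H0 = rng.normal(size=(3, d))
neighbors = [[1, 2], [0, 2], [0, 1]]
E = {(i, j): rng.normal(size=d) for i in range(3) for j in neighbors[i]}
W = rng.normal(size=(d, 2 * d)) * 0.1
b = np.zeros(d)

H1 = gnn_layer(H0, E, neighbors, W, b)
q = rng.normal(size=d)               # context query from strategy features
g, alpha = attention_pool(H1, q)

f_m = rng.normal(size=3)             # stand-in target-model feature vector
p_k = rng.normal(size=2)             # stand-in parallel-strategy feature vector
z_k = np.concatenate([g, f_m, p_k])  # joint characterization (claim 8 fusion)
print(z_k.shape)
```

A learned fusion function would normally replace the plain concatenation in the last step; the concatenation alone shows how the three information sources are joined into one vector.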

Description

Distributed training parallel strategy searching method, device and system

Technical Field

The invention relates to the technical field of artificial intelligence, and in particular to a distributed training parallel strategy search method, device, and system.

Background

With the explosive growth of model parameters, single-node computing power can no longer meet the needs of large-model training. Distributed training, the core technology for multi-node collaborative computing, involves various parallel strategies, including data parallelism, tensor parallelism, pipeline parallelism, and hybrid parallelism. These strategies exhibit significant differences in training performance under different hardware configurations, network topologies, and computational task characteristics. Currently, parallel strategy selection for distributed training depends heavily on manual experience: users must design a parallel scheme by hand on the basis of a deep understanding of the various resource characteristics (such as GPU computing power, GPU memory bandwidth, network bandwidth, and communication latency). This imposes a high professional threshold, lacks joint perception and modeling of the overall system state, and makes globally optimal parallel strategy selection difficult to achieve.

Disclosure of Invention

Based on the above, the invention aims to provide a distributed training parallel strategy search method, device, and system.
A distributed training parallel strategy search method, comprising: S10, acquiring hardware resource information X(i) and network communication performance information Y(i) of each computing node i; S20, extracting static meta-features of the target model and configuration-parameter features of candidate parallel strategies; S30, searching for an applicable parallel strategy set K according to the hardware resource information X(i) of each computing node i and the static meta-features of the target model; S40, calculating, according to the hardware resource information X(i) and network communication performance information Y(i) of each computing node i, the static meta-features of the target model, and the configuration-parameter features of the applicable parallel strategy set K, the training performance indexes of executing each parallel strategy at the computing nodes to train the target model; S50, screening and executing the optimal parallel strategy based on the training performance indexes according to the user-specified optimization target.
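Read as an algorithm, steps S10 to S50 amount to a predict-then-select loop over candidate strategies. The minimal Python skeleton below is illustrative: every data structure, the stand-in predictor, and the example objective are assumptions, not the patented implementation.

```python
def search_parallel_strategy(nodes, model_meta, strategies, predict, objective):
    """Steps S10-S50 as a generic search loop (illustrative sketch).
    nodes:      {i: {"X": hardware_info, "Y": network_info}}        (S10)
    model_meta: static meta-features of the target model            (S20)
    strategies: candidate strategies with configuration features    (S20)
    predict:    callable -> (throughput, comm_time, peak_mem)       (S40)
    objective:  maps those metrics to a score to maximize           (S50)
    """
    # S30: keep only strategies the cluster can actually run
    # (a per-strategy feasibility flag stands in for the real check).
    K = [s for s in strategies if s.get("feasible", True)]
    if not K:
        raise ValueError("no applicable parallel strategy")
    # S40 + S50: predict training performance, then pick the best.
    scored = [(objective(*predict(nodes, model_meta, s)), s) for s in K]
    return max(scored, key=lambda t: t[0])[1]

# Toy usage: maximize throughput subject to a GPU memory cap.
nodes = {0: {"X": {"gpu_mem": 16}, "Y": {"bw": 100}}}
strategies = [
    {"name": "data_parallel", "feasible": True},
    {"name": "tensor_parallel", "feasible": True},
    {"name": "pipeline_parallel", "feasible": False},
]
def predict(nodes, meta, s):          # stand-in for the learned predictor
    table = {"data_parallel": (120.0, 0.8, 14.0),
             "tensor_parallel": (150.0, 1.5, 18.0)}
    return table[s["name"]]
def objective(y1, y2, y3):            # user-specified optimization target
    return y1 if y3 <= 16 else float("-inf")  # reject strategies over 16 GB

best = search_parallel_strategy(nodes, {}, strategies, predict, objective)
print(best["name"])  # the throughput-maximizing strategy within the memory cap
```

The point of the skeleton is the separation of concerns the claims describe: feasibility filtering (S30), performance prediction (S40), and user-objective screening (S50) are independent, swappable components.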
According to the distributed training parallel strategy searching method, hardware resource information and network communication performance information of a computing node are collected, static element characteristics of a target model and configuration parameter characteristics of a parallel strategy are respectively extracted, then the three are spliced and fused through a T-GNN graph neural network prediction model to obtain a joint characterization vector capable of characterizing the overall state of a system, training throughput y1, single-step communication time consumption y2 and peak video memory occupation y3 of each parallel strategy training target model are calculated and applied according to the joint characterization vector, finally the optimal parallel strategy is selected and executed according to an optimization target, so that overall optimal parallel strategy selection is achieved, the whole process is automatically carried out, manual setting is not needed, and a control threshold is reduced. 
Further, the step S40 includes: S41, constructing a resource graph G from the hardware resource information X(i) and network communication performance information Y(i) of each computing node i; S42, obtaining a target-model feature vector f_m from the static meta-features of the target model; S43, obtaining, for each parallel strategy k in the applicable parallel strategy set K, a strategy feature vector p_k from its configuration-parameter features; S44, performing feature fusion on the resource graph G, the target-model feature vector f_m, and each strategy feature vector p_k to obtain a joint characterization vector z_k; S45, calculating, from the joint characterization vector z_k, the training throughput y1(k), the single-step communication time y2(k), and the peak GPU memory occupancy y3(k) of training the target model under each parallel strategy k at the computing nodes i. Further, the distributed training parallel strategy search method further comprises: S60, periodically calculating the performance error LOSS of the target-model training, and if the performance error LOSS exceeds a threshold, re-executing steps S10 to S50. Further, the step S41 includes: S411, constructing a node set V; S412, constructing an edge set E; S413, constructing a feature matrix H_V of the node set V; S414, constructing a feature matrix H_E of the edge set E; S415, constructing the resource graph G.
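The graph construction of steps S411 to S415 can be illustrated in a few lines. The feature layouts chosen for X(i) and Y(i) below are hypothetical; the patent does not fix which hardware or network quantities populate the matrices.

```python
import numpy as np

def build_resource_graph(X, Y):
    """Steps S411-S415 (illustrative): build resource graph G = (V, E, Hv, He)
    from per-node hardware info X(i) and pairwise network info Y(i, j).
    Assumed layouts: X[i] = [gpu_tflops, mem_gb], Y[(i, j)] = [bandwidth_gbps,
    latency_s]."""
    V = sorted(X)                                # S411: node set
    E = sorted(Y)                                # S412: edge set
    Hv = np.array([X[i] for i in V], float)      # S413: node feature matrix
    He = np.array([Y[e] for e in E], float)      # S414: edge feature matrix
    return {"V": V, "E": E, "Hv": Hv, "He": He}  # S415: resource graph G

# Toy cluster: two A100-class nodes on a fast link, one smaller node.
X = {0: [312.0, 80.0], 1: [312.0, 80.0], 2: [125.0, 40.0]}
Y = {(0, 1): [600.0, 0.002], (1, 2): [100.0, 0.010], (0, 2): [100.0, 0.010]}
G = build_resource_graph(X, Y)
print(G["Hv"].shape, G["He"].shape)
```

The node and edge feature matrices built here are exactly the inputs the layered graph convolution of step S442 would consume.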