CN-122021798-A - Model training acceleration method based on parameter dynamics
Abstract
The invention discloses a model training acceleration method based on parameter dynamics, and relates to the technical field of model training. In step S1, each working node locally calculates, at every training step, an instantaneous importance score for each parameter in the model; the score is determined by the current gradient information and the parameter's historical change information. By dynamically computing parameter importance scores and adaptively generating a dynamic sparse mask, the invention effectively solves the core problems of large communication overhead and computational redundancy in the prior art. The operating principle is that, in each training iteration, a working node does not treat all parameters indifferently, but dynamically screens out, according to a formula, a key subset of parameters with significant current gradients and active historical change, and generates a binary mask according to the dynamic sparsity rate.
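Condensed into formulas, the pipeline described above reads roughly as follows; the symbols λ, ρ_t, η and the additive form of the score are stand-ins consistent with the claims below, not the patent's exact notation:

$$
s_i^t = \left|g_i^t\right| + \lambda\,\Delta_i^{(K)}, \qquad
M^t = \operatorname{TopK}\!\left(s^t,\ \lceil (1-\rho_t)\,N \rceil\right), \qquad
\tilde{g}^t = g^t \odot M^t, \qquad
w^{t+1} = w^t - \eta\,\tilde{g}^t
$$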
Inventors
- SUN XIANG
- FANG ANKANG
- WANG FAPENG
- LIU YINGYING
Assignees
- 南京先进计算产业发展有限公司 (Nanjing Advanced Computing Industry Development Co., Ltd.)
Dates
- Publication Date: 2026-05-12
- Application Date: 2025-12-03
Claims (10)
- 1. A model training acceleration method based on parameter dynamics, characterized by comprising the following steps: Step S1, dynamically calculating parameter importance scores: at each training step, each working node locally calculates an instantaneous importance score for every parameter in the model, the score being determined by the current gradient information and the parameter's historical change information; Step S2, adaptively generating a dynamic sparse mask: according to the importance scores of all parameters and a dynamic sparsity rate, a binary mask is generated in which the positions corresponding to high-importance parameters are set to 1 and the remaining positions are set to 0; Step S3, sparse communication by applying the mask: the computed complete gradient is multiplied by the mask to obtain the sparse gradient, and the sparse gradient together with the corresponding parameter indices is sent to the parameter server; Step S4, importance-aware delayed-update compensation: the parameter server updates the global parameters according to the received sparse gradients and maintains a not-selected counter for each parameter; if a parameter is not selected for updating in the current iteration, its counter value is incremented, and the counter is used to amplify that parameter's importance score in subsequent iterations, ensuring that all parameters are eventually updated (an end-to-end rendering of steps S1–S4 is given in sketch 1 after the claims).
- 2. The method for accelerating model training based on parameter dynamics as claimed in claim 1, wherein the instantaneous importance score of step S1 is calculated by a specific formula in which g_i^t is the gradient of parameter i at time step t, λ is a hyperparameter, and Δ_i^(K) represents the magnitude of the parameter's change over the most recent K iterations (one plausible reading is given in sketch 2 after the claims).
- 3. The method for accelerating model training based on parameter dynamics as claimed in claim 1, wherein the dynamic sparsity rate ρ_t in step S2 is not a fixed value but is adaptively adjusted according to the number of training iterations and the model convergence state; its calculation formula involves ρ_min and ρ_max, the minimum and maximum values of the sparsity rate respectively, hyperparameters α and β, the current iteration count t, and the variance of the recent training loss values (one plausible instantiation is given in sketch 3 after the claims).
- 4. A method for accelerating model training based on parameter dynamics as set forth in claim 3, wherein the specific method of generating the binary mask in step S2 is as follows: sort the importance scores of all parameters in descending order; calculate, from the current dynamic sparsity rate ρ_t, the number k of parameters to be retained; set the mask positions corresponding to the top k parameters by importance to 1 and all remaining positions to 0 (see sketch 4 after the claims).
- 5. The method for accelerating model training based on parameter dynamics as recited in claim 1, wherein in step S3 the data transmitted between the working node and the parameter server comprises only the sparse gradient values and their one- or multi-dimensional indices in the gradient tensor (see sketch 5 after the claims).
- 6. The method for accelerating model training based on parameter dynamics as recited in claim 1, wherein the update operation of the parameter server in step S4 applies the received sparse gradient to the global parameters, scaled by the learning rate η (see sketch 5 after the claims).
- 7. The method for accelerating model training based on parameter dynamics as claimed in claim 1, wherein the maintenance and use of the not-selected counter c_i in step S4 comprises the following substeps: Step S4.1, initialize the counter c_i = 0 for all parameters; Step S4.2, after the mask of the current round is generated, for any parameter i: if the parameter is not selected by the mask, increment its counter; if it is selected, reset the counter to 0; Step S4.3, when calculating the importance score of parameter i at the next training step, introduce the counter value as a compensation factor modifying the score produced by f(·), the function that calculates importance from gradients and historical changes, together with an importance decay factor γ (see sketch 6 after the claims).
- 8. The method for accelerating model training based on parameter dynamics as claimed in claim 7, wherein the modified importance score of step S4.3 is calculated by a specific formula combining f(·), the not-selected counter, and the decay factor γ.
- 9. A model training acceleration method based on parameter dynamics as set forth in claim 3, characterized in that the hyperparameters α and β take values within prescribed ranges.
- 10. The method for accelerating model training based on parameter dynamics as recited in claim 7, wherein in step S4.2 the counter c_i of a parameter that is not selected is updated by smooth increment, wherein μ is an attenuation coefficient within a prescribed range (one plausible reading is given in sketch 6 after the claims).
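Sketch 1 (claim 1). A minimal, self-contained NumPy rendering of one worker-side iteration covering steps S1–S4. All function and variable names, the default values of λ (lam) and γ (gamma), the multiplicative compensation, and the flattened 1-D gradient layout are illustrative assumptions, not the patent's reference implementation:

```python
import numpy as np

def worker_iteration(grad, delta_k, counters, rho, lam=0.1, gamma=0.5):
    """One iteration of steps S1-S4 on a working node.

    grad:     current gradient, flattened to 1-D
    delta_k:  per-parameter change magnitude over the last K iterations
    counters: per-parameter not-selected counts (step S4)
    rho:      current dynamic sparsity rate in [0, 1)
    """
    # S1 + S4.3: importance from current gradient and recent change,
    # amplified for long-unselected parameters (the additive score and
    # the (1 + gamma * c) compensation are assumed forms)
    scores = (np.abs(grad) + lam * delta_k) * (1.0 + gamma * counters)
    # S2: binary mask keeping the top (1 - rho) fraction of parameters
    k = max(1, int(round((1.0 - rho) * grad.size)))
    keep = np.argpartition(-scores, k - 1)[:k]
    mask = np.zeros(grad.size, dtype=np.uint8)
    mask[keep] = 1
    # S3: only the kept gradient values and their indices go on the wire
    payload = (grad[keep], keep)
    # S4.2: reset counters of selected parameters, increment the rest
    counters = np.where(mask == 1, 0, counters + 1)
    return payload, counters
```

With rho = 0.99, only about 1% of the gradient entries and their indices leave the node each iteration.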
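Sketch 2 (claim 2). The claim names the ingredients of the instantaneous importance score: the gradient of the parameter at step t, a hyperparameter λ, and the change magnitude over the most recent K iterations. The additive combination and the snapshot-based change tracking below are one plausible reading, not the claimed formula:

```python
import collections
import numpy as np

class ImportanceScorer:
    """Assumed S1 score: s_i = |g_i| + lam * Delta_i(K)."""

    def __init__(self, k=10, lam=0.1):
        self.lam = lam
        self.history = collections.deque(maxlen=k + 1)  # last K+1 snapshots

    def score(self, grad, params):
        self.history.append(params.copy())
        oldest = self.history[0]              # parameters K steps back
        delta_k = np.abs(params - oldest)     # recent change magnitude
        return np.abs(grad) + self.lam * delta_k
```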
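Sketch 3 (claim 3). The claim fixes the inputs of the adaptive sparsity rate (ρ_min, ρ_max, α, β, the iteration count t, and the variance of recent losses); the sigmoid interpolation and default hyperparameter values below are an assumed instantiation in which sparsity rises as training progresses and the loss curve calms down:

```python
import numpy as np

def dynamic_sparsity(t, recent_losses, rho_min=0.5, rho_max=0.99,
                     alpha=1e-4, beta=10.0):
    """Interpolate between rho_min and rho_max from the iteration count
    t and the variance of recent loss values (assumed functional form)."""
    var = np.var(recent_losses) if len(recent_losses) > 1 else 1.0
    # more iterations and a calmer loss curve -> higher sparsity
    progress = 1.0 / (1.0 + np.exp(-(alpha * t - beta * var)))
    return rho_min + (rho_max - rho_min) * progress
```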
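Sketch 4 (claim 4). A direct transcription of the mask-generation recipe: sort the scores in descending order, retain the top k = (1 − ρ_t) · N, and set those mask positions to 1. Only the rounding used to obtain k is an added assumption:

```python
import numpy as np

def top_k_mask(scores, rho):
    """Binary mask of claim 4: top-k positions -> 1, the rest -> 0."""
    n = scores.size
    k = max(1, int(round((1.0 - rho) * n)))  # parameters to retain
    order = np.argsort(-scores)              # descending by importance
    mask = np.zeros(n, dtype=np.uint8)
    mask[order[:k]] = 1
    return mask
```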
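Sketch 5 (claims 5 and 6). The wire format of claim 5 (non-zero values plus flat indices only) and the server-side update of claim 6; the plain SGD form w ← w − η · g̃ at the received indices is an assumption consistent with the claim's description of a learning-rate-scaled update:

```python
import numpy as np

def pack_sparse(grad, mask):
    """Claim 5 payload: non-zero gradient values plus their flat indices."""
    idx = np.flatnonzero(mask).astype(np.int64)
    return grad.ravel()[idx].astype(np.float32), idx

def server_update(global_params, vals, idx, eta=0.01):
    """Claim 6 update at the received indices only (assumed SGD form)."""
    flat = global_params.ravel()   # view, assuming a contiguous array
    flat[idx] -= eta * vals        # untouched entries keep w^t
    return global_params
```

At ρ_t = 0.99, a 10-million-parameter float32 gradient shrinks from about 40 MB to roughly 100 000 (value, index) pairs, about 1.2 MB.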
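Sketch 6 (claims 7 and 10). Counter bookkeeping for steps S4.1–S4.3. The plain increment, the smooth increment read as c ← μ·c + 1 with 0 < μ < 1, and the (1 + γ·c) compensation factor are all assumed forms; claim 8's exact formula and claim 9's value ranges are not given in the text above:

```python
import numpy as np

def update_counters(counters, mask, mu=None):
    """S4.2: reset where selected; otherwise increment. With mu set,
    claim 10's smooth increment c <- mu * c + 1 keeps the counter
    bounded above by 1 / (1 - mu)."""
    bumped = counters + 1 if mu is None else mu * counters + 1.0
    return np.where(mask == 1, 0, bumped)

def compensated_score(base_score, counters, gamma=0.5):
    """S4.3: amplify the base importance f(.) of long-unselected
    parameters by an assumed (1 + gamma * c) compensation factor."""
    return base_score * (1.0 + gamma * counters)
```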
Description
Model training acceleration method based on parameter dynamics
Technical Field
The invention belongs to the technical field of model training, and particularly relates to a model training acceleration method based on parameter dynamics.
Background
With the rapid development of deep learning technology, model scale has grown explosively, evolving from early convolutional neural networks with millions of parameters to large language models and multi-modal models with trillions of parameters. Training such large-scale models exceeds the computing power of a single node, so distributed training has become the mainstream technical scheme: model parameters and training data are split across multiple working nodes for parallel computation, and a parameter server aggregates the nodes' gradients and updates the global parameters, greatly improving training efficiency. However, the communication link of distributed training has always been a bottleneck. In traditional distributed training, each working node must transmit the complete gradients of all parameters to the parameter server; the gradient data volume grows linearly with the model parameter scale, and in bandwidth-limited cluster environments communication delay often far exceeds computation delay, becoming the core factor limiting training speed. To alleviate this problem, sparse training techniques have been proposed that reduce traffic by transmitting only a fraction of the significant gradients, but existing sparse training schemes still suffer from significant drawbacks. On the one hand, most schemes adopt a fixed sparsity rate and cannot adapt to the dynamic requirements of the whole training process. In the initial stage of training, a large number of parameter updates is needed for stable convergence, and a fixed high sparsity rate may filter out key parameter gradients, slowing convergence or trapping the model in a local optimum; in the later stage of training, the model gradually converges and the demand for parameter updates decreases, so a fixed low sparsity rate transmits redundant gradients and wastes communication resources. Although some dynamic sparsity schemes attempt to adjust the sparsity rate, they do so along the single dimension of iteration count without considering the model convergence state (such as loss fluctuation), making accurate adaptation difficult. On the other hand, existing schemes judge parameter importance along a single dimension and lack a compensation mechanism for parameters that go unupdated for long periods.
Most schemes judge parameter importance only through the current gradient and ignore the parameter's historical change trend. For example, a parameter whose current gradient is small but whose change amplitude over recent iterations is large has long-term value for improving model performance, yet a gradient-only judgment may misclassify it as "unimportant" and eliminate it. Meanwhile, parameters that are not selected for updating can idle for long periods, and as training iterations proceed their deviation from the global parameters gradually grows, reducing model convergence accuracy and even causing training instability. Furthermore, existing sparse training falls short in the synergy between gradient transmission and parameter updating. Some schemes achieve gradient sparsification but still transmit a large amount of zero-valued gradients or redundant index information without fully compressing the data volume, and the parameter server does not optimize its update logic for the characteristics of sparse gradients when updating global parameters, easily producing unbalanced parameter updates that further affect the training result. In summary, a better technical solution is needed for the shortcomings of existing distributed model training technology in communication efficiency, parameter screening accuracy, and training stability; for this reason, we provide a model training acceleration method based on parameter dynamics to solve the above problems.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a model training acceleration method based on parameter dynamics, which solves the problems in the prior art of large communication overhead, computational redundancy, and unstable convergence caused by stale gradients in asynchronous stochastic gradient descent. To solve these technical problems, the invention is realized by the following technical scheme. The invention relates to a model training a