CN-121997986-A - Large model accelerated training method based on staged learning

CN121997986A

Abstract

The invention discloses a large model accelerated training method based on staged learning, and relates to the technical field of large model training. The method comprises the following steps: a, dividing the training process of the model into a consecutive core structure learning period, detail feature enrichment period, and final fine-tuning period; b, during the core structure learning period, injecting a first information bottleneck constraint after a first intermediate layer of the network, wherein the loss function of the first information bottleneck constraint is $L_{IB1} = \beta_1 \cdot D_{KL}\big(q(z_1)\,\|\,p(z)\big) + L_{CE}(z_1)$, in which $D_{KL}$ is the divergence computed from the first intermediate layer's output feature $z_1$ and the variational prior distribution $p(z)$. By constructing a three-stage progressive training framework, from core feature learning through detail feature enrichment to final fine-tuning, and applying differentiated information bottleneck constraints at each stage, the invention guides the model to follow a macroscopic-to-microscopic feature learning rule, effectively avoiding wasted computation on redundant features in the initial training stage, accelerating the model's overall convergence, and improving accuracy.
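For concreteness, the stage schedule above can be written out in a few lines of Python. This is a minimal illustrative sketch, not code from the patent: it assumes the 20%/50%/30% split of total training time and the 10.0 → 1.0 linear decay of the first constraint coefficient that the claims below specify, and the function names are hypothetical.

```python
def stage(progress: float) -> str:
    """Map training progress in [0, 1] to the three stages
    (20% core structure, 50% detail enrichment, 30% fine-tuning)."""
    if progress < 0.2:
        return "core_structure"
    if progress < 0.7:
        return "detail_enrichment"
    return "final_finetune"


def beta1(progress: float, high: float = 10.0, low: float = 1.0) -> float:
    """First constraint intensity coefficient: held at the high value during
    core structure learning, decayed linearly to the low value across the
    detail feature enrichment period, and unused (constraint removed)
    during final fine-tuning."""
    if progress < 0.2:
        return high
    if progress < 0.7:
        t = (progress - 0.2) / 0.5  # position within the decay window
        return high + t * (low - high)
    return 0.0
```

For example, stage(0.45) yields "detail_enrichment" and beta1(0.45) evaluates to 5.5, halfway through the decay.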

Inventors

  • SUN XIANG
  • WANG FAPENG
  • FANG ANKANG
  • LIU YINGYING

Assignees

  • 南京先进计算产业发展有限公司 (Nanjing Advanced Computing Industry Development Co., Ltd.)

Dates

Publication Date
2026-05-08
Application Date
2025-12-08

Claims (10)

  1. A large model accelerated training method based on staged learning, characterized by comprising the following steps: step a, dividing the training process of the model into a consecutive core structure learning period, detail feature enrichment period, and final fine-tuning period; step b, during the core structure learning period, injecting a first information bottleneck constraint after a first intermediate layer of the network, the loss function of the first information bottleneck constraint being $L_{IB1} = \beta_1 \cdot D_{KL}\big(q(z_1)\,\|\,p(z)\big) + L_{CE}(z_1)$, wherein $D_{KL}\big(q(z_1)\,\|\,p(z)\big)$ is the divergence computed from the first intermediate layer's output feature $z_1$ and the variational prior distribution $p(z)$, with the calculation formula $D_{KL} = \tfrac{1}{2}\sum_j\big(\mu_j^2 + \sigma_j^2 - \log\sigma_j^2 - 1\big)$, $L_{CE}(z_1)$ is the cross-entropy loss of the auxiliary classifier based on feature $z_1$, and $\beta_1$ is a first constraint intensity coefficient, and performing network training with the total loss function $L_{total} = L_{main} + \lambda_1 \cdot L_{IB1}$, wherein $L_{main}$ is the main task loss and $\lambda_1$ is a first weight coefficient; step c, during the detail feature enrichment period, injecting a second information bottleneck constraint after a second intermediate layer of the network, the loss function of the second information bottleneck constraint being $L_{IB2} = \beta_2 \cdot D_{KL}\big(q(z_2)\,\|\,p(z)\big) + L_{CE}(z_2)$, wherein $q(z_2)$ is based on the second intermediate layer's output feature $z_2$ and $\beta_2$ is a second constraint intensity coefficient, while the first constraint intensity coefficient $\beta_1$ decays linearly from an initial high value to a final low value, and performing network training with the total loss function $L_{total} = L_{main} + \lambda_1 \cdot L_{IB1} + \lambda_2 \cdot L_{IB2}$, wherein $\beta_1(t)$ is the coefficient after decay over time $t$ and $\lambda_2$ is a second weight coefficient; step d, during the final fine-tuning period, removing the first information bottleneck constraint and the second information bottleneck constraint, and training the network with the total loss function $L_{total} = L_{main}$ (a minimal illustrative sketch of these loss terms follows the claims).
  2. The large model accelerated training method based on staged learning of claim 1, wherein in step a the training stages are divided by proportion of the total training time: the core structure learning period occupies 20% of the total training duration, the detail feature enrichment period 50%, and the final fine-tuning period 30%.
  3. The large model accelerated training method based on staged learning of claim 1, wherein the first constraint intensity coefficient $\beta_1$ is set to a fixed high value, specifically 10.0, for imposing a strong constraint during the core structure learning period.
  4. The large model accelerated training method based on staged learning of claim 1, wherein the second constraint intensity coefficient $\beta_2$ is set to a fixed low value, specifically 1.0, for imposing a relatively weak constraint during the detail feature enrichment period.
  5. The large model accelerated training method based on staged learning of claim 1, wherein the initial high value of the linear decay is set to 10.0 and the final low value to 1.0, with the decay proceeding continuously throughout the detail feature enrichment period.
  6. The large model accelerated training method based on staged learning of claim 1, wherein the variational prior distribution $p(z)$ is a standard normal distribution, with mean 0 and identity covariance matrix $I$.
  7. The large model accelerated training method based on staged learning of claim 1, wherein the first intermediate layer and the second intermediate layer are selected from a Vision Transformer architecture, each corresponding to a specific Transformer Block in the model.
  8. The large model accelerated training method based on staged learning of claim 7, wherein the first intermediate layer is specifically the 16th Transformer Block and the second intermediate layer is specifically the 32nd Transformer Block, positioned deeper than the first intermediate layer.
  9. The large model accelerated training method based on staged learning of claim 1, wherein the first weight coefficient $\lambda_1$ and the second weight coefficient $\lambda_2$ take the same value, both specifically set to 0.1, for controlling the degree to which the information bottleneck constraints contribute to the total loss.
  10. The large model accelerated training method based on staged learning of claim 1, wherein the main task loss $L_{main}$ employs a cross-entropy loss function, used to calculate the difference between the model's main output and the ground-truth labels.
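To make the loss terms of claim 1 concrete, here is a minimal PyTorch-style sketch. It is an illustration under stated assumptions rather than the patentee's implementation: the tapped feature is assumed to be parameterized by a predicted mean and log-variance, so the divergence to the standard normal prior of claim 6 takes its usual closed form, and aux_logits stands for the output of the auxiliary classifier head, which the patent does not further specify.

```python
import torch
import torch.nn.functional as F


def kl_to_standard_normal(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    # Closed-form D_KL(N(mu, diag(sigma^2)) || N(0, I)) against claim 6's
    # prior, summed over feature dimensions and averaged over the batch.
    return 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(dim=-1).mean()


def ib_loss(mu, logvar, aux_logits, labels, beta):
    # L_IB = beta * D_KL(q(z) || p(z)) + L_CE(z): the divergence term plus
    # the auxiliary classifier's cross-entropy on the tapped feature (claim 1).
    return beta * kl_to_standard_normal(mu, logvar) + F.cross_entropy(aux_logits, labels)


def total_loss(main_logits, labels, ib_terms, lam=0.1):
    # L_total = L_main + sum_i lambda_i * L_IBi, with lambda_1 = lambda_2 = 0.1
    # (claim 9); ib_terms holds the bottleneck losses active in the current
    # stage and is empty during the final fine-tuning period (step d).
    loss = F.cross_entropy(main_logits, labels)  # main task loss (claim 10)
    for ib in ib_terms:
        loss = loss + lam * ib
    return loss
```

In step b only the first term is active (ib_terms has one entry computed with beta = 10.0); in step c both terms are active, with the first coefficient decayed as in claim 5; in step d ib_terms is empty, so the total loss reduces to the main cross-entropy.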

Description

Large model accelerated training method based on staged learning

Technical Field

The invention relates to the technical field of large model training, and in particular to a large model accelerated training method based on staged learning.

Background

Training a large AI model means gradually optimizing its parameters on massive data with substantial computational resources so that it can perform complex tasks. A deep learning framework is used to carry out multi-level feature extraction and pattern recognition; the training process generally comprises two stages, pre-training and fine-tuning; the key technologies involve distributed computing, gradient descent algorithms, attention mechanisms, and the like; and the ultimate aim is to build an artificial intelligence system with strong generalization, reasoning capability, and multi-modal processing. As deep learning has developed, model scale has grown continuously. While larger models perform well across a wide range of tasks, they also bring enormous computational cost and long training cycles: training a large model typically lasts weeks or even months on thousands of GPUs and consumes vast computing resources, so large model training acceleration has become a core research focus in academia and industry. Although the prior art can improve model performance or training stability to some extent, the core problem remains that models lack an internal mechanism that effectively guides training to follow the coarse-to-fine, primary-to-secondary cognitive rule. Especially in the initial stage of training, a model therefore spends large amounts of compute learning low-value, redundant, or even noisy features, which are corrected or forgotten later in training, wasting computational resources and prolonging the training period. We therefore provide a large model accelerated training method based on staged learning to solve the above problems.

Disclosure of Invention

The invention aims to provide a large model accelerated training method based on staged learning, which solves the prior-art problems of wasted computing resources and slow convergence in the initial training stage by injecting dynamic information bottleneck constraints in stages, combined with a progressive feature learning mechanism.
In order to solve the above technical problems, the invention is realized by the following technical scheme. The invention relates to a large model accelerated training method based on staged learning, comprising the following steps.

Step a, dividing the training process of the model into a consecutive core structure learning period, detail feature enrichment period, and final fine-tuning period.

Step b, during the core structure learning period, injecting a first information bottleneck constraint after a first intermediate layer of the network, with the loss function $L_{IB1} = \beta_1 \cdot D_{KL}\big(q(z_1)\,\|\,p(z)\big) + L_{CE}(z_1)$, where $D_{KL}\big(q(z_1)\,\|\,p(z)\big)$ is the divergence computed from the first intermediate layer's output feature $z_1$ and the variational prior distribution $p(z)$, calculated as $D_{KL} = \tfrac{1}{2}\sum_j\big(\mu_j^2 + \sigma_j^2 - \log\sigma_j^2 - 1\big)$; $L_{CE}(z_1)$ is the cross-entropy loss of the auxiliary classifier based on feature $z_1$; and $\beta_1$ is the first constraint intensity coefficient. Network training is performed with the total loss function $L_{total} = L_{main} + \lambda_1 \cdot L_{IB1}$, where $L_{main}$ is the main task loss and $\lambda_1$ is the first weight coefficient.

Step c, during the detail feature enrichment period, injecting a second information bottleneck constraint after a second intermediate layer of the network, with the loss function $L_{IB2} = \beta_2 \cdot D_{KL}\big(q(z_2)\,\|\,p(z)\big) + L_{CE}(z_2)$, where $q(z_2)$ is based on the second intermediate layer's output feature $z_2$ and $\beta_2$ is the second constraint intensity coefficient, while the first constraint intensity coefficient $\beta_1$ decays linearly from an initial high value to a final low value. Network training is performed with the total loss function $L_{total} = L_{main} + \lambda_1 \cdot L_{IB1} + \lambda_2 \cdot L_{IB2}$, where $\beta_1(t)$ is the coefficient after decay over time $t$ and $\lambda_2$ is the second weight coefficient.

Step d, during the final fine-tuning period, removing the first information bottleneck constraint and the second information bottleneck constraint and training the network with the total loss function $L_{total} = L_{main}$.

By injecting differentiated information bottleneck constraints in stages and dynamically adjusting the constraint intensity, the model learns progressively from macroscopic to microscopic features, significantly reducing redundant feature computation in the initial stage of training, accelerating the overall convergence process, and improving the model's feature learning efficiency and generalization capability. The invention further sets that in the step a, the division of the training
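Reusing beta1 and ib_loss from the sketches above, the staged procedure of steps a–d can be tied together in a single training-step routine. As before, this is a hedged sketch: the patent does not specify the model interface, so the assumption that model returns the main logits together with (mu, logvar, aux_logits) for the two tapped Transformer Blocks is hypothetical.

```python
import torch.nn.functional as F


def training_step(model, x, labels, step: int, total_steps: int):
    """One optimization step of the staged schedule (steps a-d)."""
    progress = step / total_steps
    out = model(x)  # hypothetical interface: main logits + bottleneck taps
    loss = F.cross_entropy(out.logits, labels)  # L_main (claim 10)

    if progress < 0.7:  # steps b and c: first constraint active
        loss = loss + 0.1 * ib_loss(out.mu1, out.logvar1, out.aux1,
                                    labels, beta=beta1(progress))
    if 0.2 <= progress < 0.7:  # step c: second constraint also active
        loss = loss + 0.1 * ib_loss(out.mu2, out.logvar2, out.aux2,
                                    labels, beta=1.0)
    # step d (progress >= 0.7): both constraints removed, L_total = L_main
    return loss
```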