
CN-121981216-A - Large model annealing training method and device and electronic equipment

CN121981216A

Abstract

The invention relates to the technical field of artificial intelligence and provides a large model annealing training method, a large model annealing training device, and electronic equipment. The method comprises: performing semantic feature extraction on each text datum in a target data set and a candidate corpus based on a pre-configured lightweight proxy model with frozen parameters, to obtain a target feature vector set and a candidate feature vector set; constructing a target distribution fingerprint from the target feature vector set, and determining sampling weights according to the distribution difference between the target distribution fingerprint and the subset distribution fingerprints recombined from the candidate feature vector set; and sampling each candidate data subset in the candidate corpus according to the sampling weights to generate an annealing training data set. By introducing the lightweight proxy model to construct a unified feature space and solving for the optimal sampling weights objectively and quantitatively from the distribution difference, the invention generates an annealing data set whose deep semantics are precisely aligned with the target scene, improving the directional optimization efficiency of the model in a specific domain at very low computational cost.

Inventors

  • WANG PEIYANG
  • ZHU XINYU
  • TAN CHANG

Assignees

  • 安徽飞数信息科技有限公司 (Anhui Feishu Information Technology Co., Ltd.)

Dates

Publication Date
2026-05-05
Application Date
2026-03-31

Claims (10)

  1. A large model annealing training method, comprising: acquiring a target data set and a candidate corpus; performing semantic feature extraction on each text datum in the target data set and the candidate corpus based on a pre-configured lightweight proxy model with frozen parameters, to obtain a target feature vector set and a candidate feature vector set mapped into a unified feature space; constructing a target distribution fingerprint based on the target feature vector set, and determining sampling weights for all candidate data subsets in the candidate corpus according to distribution differences between the target distribution fingerprint and the subset distribution fingerprints recombined from the candidate feature vector set; and sampling each candidate data subset according to the sampling weights to generate an annealing training data set, and performing annealing training on the base large model using the annealing training data set.
  2. The large model annealing training method according to claim 1, wherein performing semantic feature extraction on each text datum in the target data set and the candidate corpus based on the pre-configured, parameter-frozen lightweight proxy model to obtain a target feature vector set and a candidate feature vector set mapped into a unified feature space comprises: respectively inputting the text data in the target data set and the text data in the candidate corpus into the lightweight proxy model, wherein the text data are multi-source heterogeneous text data comprising multiple languages and/or multiple minority languages, the multiple languages comprising at least one of Chinese and English, and the multiple minority languages comprising at least one of Tibetan, Uyghur, and Mongolian; extracting a global average pooling vector from the final hidden layer of the lightweight proxy model, and normalizing the global average pooling vector to obtain a normalized feature vector; and respectively constructing the target feature vector set and the candidate feature vector set from the normalized feature vectors of the text data in the target data set and in the candidate corpus.
  3. The large model annealing training method according to claim 2, wherein respectively inputting the text data in the target data set and the text data in the candidate corpus into the lightweight proxy model comprises: when the currently input text data is long text, segmenting the long text with a sliding window strategy into text fragments that fit the context window length of the lightweight proxy model, and inputting the text fragments into the lightweight proxy model.
  4. The large model annealing training method according to claim 1, wherein constructing a target distribution fingerprint based on the target feature vector set, and determining the sampling weight of each candidate data subset in the candidate corpus according to the distribution difference between the target distribution fingerprint and the subset distribution fingerprints recombined from the candidate feature vector set, comprises: fitting the probability density of all feature vectors in the target feature vector set to generate the target distribution fingerprint; pre-partitioning the candidate corpus into a plurality of pre-clustered clusters serving as the candidate data subsets; taking the divergence between the target distribution fingerprint and the subset distribution fingerprint as the optimization objective and introducing a regularization term as a constraint to form an objective function, wherein the subset distribution fingerprint is a fitted distribution fingerprint obtained by weight-based recombination of the probability densities of the feature vectors of each candidate data subset; and iteratively solving the objective function by projected gradient descent, projecting the weight vector back into the set constraint space after each gradient update until the objective function converges, to obtain the sampling weight of each candidate data subset.
  5. The large model annealing training method according to claim 1, wherein sampling each candidate data subset according to the sampling weight to generate an annealing training data set comprises: calculating the sampling quantity of each candidate data subset from its sampling weight and a preset total number of training samples; drawing samples from each candidate data subset according to its sampling quantity to construct a corresponding training data subset; calculating the similarity between the normalized feature vectors of each sample pair in each training data subset, and removing redundant samples from that training data subset when the similarity exceeds a preset semantic similarity threshold; and cleaning each training data subset after redundant-sample removal to generate the annealing training data set, the cleaning comprising at least one of formatting, global random shuffling, and serialized storage.
  6. The large model annealing training method according to any one of claims 1 to 5, further comprising: during annealing training, suspending training at preset intervals and sampling output text of the base large model to obtain a text sequence; inputting the text sequence into the lightweight proxy model for semantic feature extraction, and constructing a real-time distribution fingerprint from the extracted feature vectors; and calculating a local feature deviation vector between the real-time distribution fingerprint and the target distribution fingerprint, and updating the current sampling weights according to the local feature deviation vector and a preset feedback coefficient, so that the next round of sampling is performed with the updated sampling weights.
  7. The method according to any one of claims 1 to 5, wherein performing annealing training on the base large model using the annealing training data set comprises: when loading the base model weights of the base large model, acquiring and retaining the first-moment and second-moment state information of the pre-training-phase optimizer to initialize the training environment; and performing gradient-guided training on the weight-loaded base large model using the annealing training data set under a learning rate scheduling strategy with a decaying trend.
  8. The large model annealing training method according to claim 7, wherein performing gradient-guided training on the weight-loaded base large model under a learning rate scheduling strategy with a decaying trend comprises: acquiring a preset initial annealing learning rate, a minimum learning rate, and a total number of annealing steps; during gradient-guided training, calculating the real-time learning rate for the current training step from the initial annealing learning rate, the minimum learning rate, and the total number of annealing steps; and updating the parameters of the base large model with the real-time learning rate.
  9. A large model annealing training device, comprising: a data acquisition unit for acquiring a target data set and a candidate corpus; a feature extraction unit for performing semantic feature extraction on each text datum in the target data set and the candidate corpus based on a pre-configured lightweight proxy model with frozen parameters, to obtain a target feature vector set and a candidate feature vector set mapped into a unified feature space; a weight determination unit for constructing a target distribution fingerprint based on the target feature vector set and determining the sampling weight of each candidate data subset in the candidate corpus according to the distribution difference between the target distribution fingerprint and the subset distribution fingerprints recombined from the candidate feature vector set; and an annealing training unit for sampling each candidate data subset according to the sampling weights to generate an annealing training data set, and performing annealing training on the base large model using the annealing training data set.
  10. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the large model annealing training method according to any one of claims 1 to 8.
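
The weight solver in claim 4 can be illustrated with a minimal sketch under simplifying assumptions: each distribution fingerprint is reduced to a mean feature vector, the divergence is the squared Euclidean gap between the target fingerprint and the weighted mixture of subset fingerprints, and the regularizer is a negative-entropy term. The patent fixes none of these choices and publishes no reference code.

```python
# Sketch of claim 4's solver: projected gradient descent on the probability
# simplex, minimizing the gap between the target fingerprint and the weighted
# mixture of subset fingerprints, plus an entropy-style regularizer.
import numpy as np

def project_to_simplex(v: np.ndarray) -> np.ndarray:
    """Euclidean projection onto {w : w >= 0, sum(w) = 1} (Duchi et al., 2008)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def solve_weights(target_fp, subset_fps, reg=1e-3, lr=0.1, steps=500):
    F = np.stack(subset_fps)              # (k, d): one mean fingerprint per subset
    w = np.full(len(F), 1.0 / len(F))     # start from the uniform mixture
    for _ in range(steps):
        gap = F.T @ w - target_fp         # mixture fingerprint minus target
        grad = F @ gap + reg * (np.log(w + 1e-12) + 1.0)  # + entropy regularizer
        w = project_to_simplex(w - lr * grad)  # gradient step, then projection
    return w
```

The projection step keeps the weights non-negative and summing to one, which is the natural constraint space for sampling proportions.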
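Claim 5's sampling and redundancy removal can be sketched as follows, assuming the feature vectors are already L2-normalized (so a dot product is cosine similarity); the 0.95 threshold is illustrative, not a value from the patent.

```python
# Sketch of claim 5: draw from each subset in proportion to its weight, then
# drop near-duplicates by cosine similarity of the normalized feature vectors.
import random
import numpy as np

def sample_subset(texts, vecs, n, sim_threshold=0.95):
    kept, kept_vecs = [], []
    for i in random.sample(range(len(texts)), min(n, len(texts))):
        v = vecs[i]
        # Vectors are L2-normalized, so the dot product is cosine similarity.
        if all(float(v @ u) <= sim_threshold for u in kept_vecs):
            kept.append(texts[i])
            kept_vecs.append(v)
    return kept

def build_annealing_set(subsets, subset_vecs, weights, total):
    data = []
    for texts, vecs, w in zip(subsets, subset_vecs, weights):
        data += sample_subset(texts, vecs, round(w * total))
    random.shuffle(data)   # global random shuffling before serialized storage
    return data
```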
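For the closed-loop correction in claim 6, the patent names only a "feedback coefficient", so the concrete rule below, a multiplicative update driven by the deviation between the real-time and target fingerprints, is one plausible reading rather than the claimed formula.

```python
# Sketch of claim 6's feedback step: nudge sampling weights toward subsets whose
# fingerprints point along the target-vs-realtime deviation, then renormalize.
import numpy as np

def update_weights(weights, subset_fps, target_fp, realtime_fp, beta=0.1):
    deviation = target_fp - realtime_fp        # local feature deviation vector
    scores = np.stack(subset_fps) @ deviation  # alignment of each subset with it
    w = weights * np.exp(beta * scores)        # beta: the preset feedback coefficient
    return w / w.sum()                         # renormalize onto the simplex
```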
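Claim 8's decaying learning-rate schedule can be sketched as a cosine decay from the initial annealing rate to the minimum rate over the total number of annealing steps; the claim requires only a decaying trend computed from these three quantities, and the default values below are placeholders.

```python
# Sketch of claim 8: cosine decay from the initial annealing learning rate to
# the minimum learning rate over the total annealing steps (one common choice).
import math

def annealing_lr(step, lr_init=3e-5, lr_min=3e-6, total_steps=10_000):
    t = min(step, total_steps) / total_steps   # training progress in [0, 1]
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * t))
```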

Description

Large model annealing training method and device and electronic equipment

Technical Field

The invention relates to the technical field of artificial intelligence, and in particular to a large model annealing training method and device and electronic equipment.

Background

In the pre-training of large models, the annealing step is critical and usually occurs in the final stage of training. At this stage, the model is typically trained continuously under a specific learning rate decay mechanism, achieving refined convergence and capability solidification within a specific task space. At present, annealing training data sets are constructed mainly with manual proportioning strategies or rule-based screening strategies: data mixing proportions are preset from manual prior experience and held static throughout training, and the corpus is coarsely screened with surface rules such as metadata labels, regular expressions, or keyword matching. These methods lack objective quantitative metrics and fail to capture deep implicit semantic features, so the feature distribution of the training data deviates significantly from the target evaluation scene, i.e., the distributions are misaligned. This not only wastes computational resources but also severely restricts the model's performance on complex cognitive tasks such as mathematical logic and professional code.

Disclosure of Invention

The invention provides a large model annealing training method, device, and electronic equipment to remedy the defects of existing annealing data construction, which relies on manual experience and shallow features and thus causes misaligned data distributions, difficult model convergence, and low computational efficiency. The method provided by the invention comprises: acquiring a target data set and a candidate corpus; performing semantic feature extraction on each text datum in the target data set and the candidate corpus based on a pre-configured lightweight proxy model with frozen parameters, to obtain a target feature vector set and a candidate feature vector set mapped into a unified feature space; constructing a target distribution fingerprint based on the target feature vector set, and determining sampling weights for all candidate data subsets in the candidate corpus according to distribution differences between the target distribution fingerprint and the subset distribution fingerprints recombined from the candidate feature vector set; and sampling each candidate data subset according to the sampling weights to generate an annealing training data set, and performing annealing training on the base large model using that data set.
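A minimal sketch of the "distribution fingerprint" idea in the disclosure: fit a simple density to a set of feature vectors and measure the gap between two fingerprints with a divergence. The diagonal-Gaussian form and the KL divergence are illustrative assumptions; the text only requires fitting the probability density of the vectors.

```python
# Sketch: fit a simple density ("fingerprint") to a set of feature vectors and
# measure the gap between two fingerprints. Diagonal Gaussians and KL divergence
# are illustrative assumptions, not requirements of the patent text.
import numpy as np

def fit_fingerprint(vectors):
    X = np.stack(vectors)                        # (n, d) normalized feature vectors
    return X.mean(axis=0), X.var(axis=0) + 1e-6  # mean and diagonal variance

def fingerprint_kl(fp_p, fp_q):
    """KL(p || q) between two diagonal Gaussians, used as the distribution gap."""
    (mu_p, var_p), (mu_q, var_q) = fp_p, fp_q
    return 0.5 * float(np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0))
```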
According to the large model annealing training method provided by the invention, performing semantic feature extraction on each text datum in the target data set and the candidate corpus with the pre-configured, parameter-frozen lightweight proxy model to obtain a target feature vector set and a candidate feature vector set mapped into a unified feature space comprises: respectively inputting the text data in the target data set and in the candidate corpus into the lightweight proxy model, wherein the text data are multi-source heterogeneous text data comprising multiple languages and/or multiple minority languages, the multiple languages comprising at least one of Chinese and English, and the multiple minority languages comprising at least one of Tibetan, Uyghur, and Mongolian; extracting a global average pooling vector from the final hidden layer of the lightweight proxy model and normalizing it to obtain a normalized feature vector; and respectively constructing the target feature vector set and the candidate feature vector set from the normalized feature vectors of the text data in the target data set and in the candidate corpus. According to the method, respectively inputting the text data in the target data set and in the candidate corpus into the lightweight proxy model comprises: when the currently input text data is long text, segmenting it with a sliding window strategy into text fragments that fit the context window length of the lightweight proxy model, and inputting those fragments into the lightweight proxy model.
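
A sketch of this extraction step: a small frozen encoder yields one mean-pooled, L2-normalized vector per text, and long inputs are split with a sliding window. The model name, window length, and stride below are assumptions for illustration, not choices made in the patent.

```python
# Sketch of the feature-extraction step: a frozen lightweight encoder maps each
# text to one L2-normalized, mean-pooled vector in a shared feature space.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"  # assumed
tok = AutoTokenizer.from_pretrained(MODEL)
enc = AutoModel.from_pretrained(MODEL).eval()   # frozen: eval mode + no_grad below
MAX_LEN, STRIDE = 256, 128                      # sliding window for long text

@torch.no_grad()
def embed(text: str) -> torch.Tensor:
    # Segment long input into overlapping windows that fit the context length.
    ids = tok(text, return_tensors="pt", truncation=True, max_length=MAX_LEN,
              stride=STRIDE, return_overflowing_tokens=True, padding=True)
    out = enc(input_ids=ids["input_ids"], attention_mask=ids["attention_mask"])
    mask = ids["attention_mask"].unsqueeze(-1).to(out.last_hidden_state.dtype)
    pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1)  # mean pooling
    vec = pooled.mean(0)                        # average the window vectors
    return vec / vec.norm()                     # L2 normalization
```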