CN-122021764-A - Training method, data processing method, device, equipment, storage medium and program product for large language model

CN 122021764 A

Abstract

The application provides a training method, a data processing method, a device, equipment, a storage medium, and a program product for a large language model. The method comprises: obtaining a sequence data sample; performing attention calculation on the sequence data sample through a model to be trained to obtain a first attention intensity matrix; determining a time scheduling factor corresponding to the training time step and a spatial scaling factor corresponding to each data unit; determining a mask matrix based on the time scheduling factor and the spatial scaling factor; adjusting the first attention intensity matrix based on the mask matrix to obtain a second attention intensity matrix; determining an attention feature sample of the sequence data sample based on the second attention intensity matrix; determining a loss value based on the attention feature sample; and updating parameters of the model to be trained based on the loss value to obtain the model trained at the training time step. The application can improve the performance of large language models.

Inventors

  • LI SHIYU
  • TANG YANG

Assignees

  • Tencent Technology (Shenzhen) Company Limited (腾讯科技(深圳)有限公司)

Dates

Publication Date
2026-05-12
Application Date
2026-02-05

Claims (20)

  1. A method of training a large language model, the method comprising: obtaining a sequence data sample, wherein the sequence data sample comprises a plurality of data units; performing attention computation on the sequence data sample through a model to be trained to obtain a first attention intensity matrix, wherein the first attention intensity matrix comprises a plurality of first element values, and each first element value is used for representing the attention intensity between two data units of the plurality of data units; determining a time scheduling factor corresponding to a training time step, and determining a spatial scaling factor corresponding to the position of each data unit in the sequence data sample; determining a mask matrix based on the time scheduling factor and the spatial scaling factor, wherein the mask matrix comprises a plurality of second element values, each second element value being used for adjusting the attention intensity between two data units; adjusting the first attention intensity matrix based on the mask matrix to obtain a second attention intensity matrix, and determining an attention feature sample of the sequence data sample based on the second attention intensity matrix; and determining a loss value based on the attention feature sample, and updating parameters of the model to be trained based on the loss value to obtain the model trained at the training time step.
  2. The method of claim 1, wherein the training of the model comprises a plurality of training time steps, and the time scheduling factor follows a non-decreasing trend as the training time step increases; and the determining the time scheduling factor corresponding to the training time step comprises performing any one of the following: performing linear mapping on the training time step to obtain the time scheduling factor; performing nonlinear mapping on the training time step to obtain the time scheduling factor; or querying, from a preset mapping table, a time scheduling factor value interval corresponding to the training time step, and determining the time scheduling factor based on the time scheduling factor value interval, wherein the mapping table comprises correspondences between a plurality of different training time steps and different time scheduling factor value intervals.
  3. The method of claim 2, wherein the linearly mapping the training time step to obtain the time scheduling factor comprises: taking the ratio between the training time step and a first numerical value as the time scheduling factor, wherein the first numerical value is determined in any one of the following manners: taking a preset second numerical value as the first numerical value; obtaining loss values of a verification set within a preset historical time window and calculating a loss fluctuation value of the loss values, and if the loss fluctuation value is greater than a preset fluctuation threshold, increasing the second numerical value by a preset proportion to obtain the first numerical value, or if the loss fluctuation value is less than or equal to the fluctuation threshold, taking the second numerical value as the first numerical value; or determining the maximum sequence length from the sequence lengths corresponding to the sequence data samples and determining the first numerical value based on the maximum sequence length, wherein the maximum sequence length is positively correlated with the first numerical value.
  4. The method according to claim 2, wherein the determining the time scheduling factor based on the time scheduling factor value interval corresponding to the training time step comprises: if the training time step is less than or equal to a first time step threshold, taking a preset initial value as the time scheduling factor; if the training time step is greater than the first time step threshold and less than a second time step threshold, mapping the training time step to the time scheduling factor through a preset linearly increasing function, wherein the time scheduling factor is greater than the initial value and less than a preset target value; and if the training time step is greater than or equal to the second time step threshold, taking the target value as the time scheduling factor.
  5. The method of claim 1, wherein the determining a spatial scaling factor corresponding to the position of each of the data units in the sequence data sample comprises performing the following for each of the data units: acquiring a position index of the position of the data unit in the sequence data sample; and determining the spatial scaling factor based on the position index, wherein the spatial scaling factor follows a non-increasing trend as the position index increases.
  6. The method of claim 5, wherein the determining the spatial scaling factor based on the position index comprises performing any one of the following: acquiring the sequence length of the sequence data sample and taking the ratio between the sequence length and the position index as the spatial scaling factor; or invoking a preset mapping rule and mapping the position index to the spatial scaling factor.
  7. The method of claim 6, wherein the mapping rules comprise a nonlinear mapping rule and a segmented mapping rule; and the invoking a preset mapping rule to map the position index to the spatial scaling factor comprises performing any one of the following: invoking the nonlinear mapping rule to map the position index to obtain the spatial scaling factor, wherein the attenuation rate of the spatial scaling factor decreases as the position index increases; or invoking the segmented mapping rule, dividing the index sequence in which the position index is located into a plurality of continuous index intervals, and taking the scaling value corresponding to the index interval in which the position index is located as the spatial scaling factor, wherein the scaling values corresponding to different index intervals follow a decreasing trend as the index intervals move backward.
  8. The method of claim 1, wherein the determining a mask matrix based on the time scheduling factor and the spatial scaling factor comprises performing the following for a target position in an ith row and a jth column of the mask matrix: if i is greater than or equal to j, taking a preset third numerical value as the second element value corresponding to the target position; or if i is less than j, obtaining the spatial scaling factor corresponding to the ith data unit in the sequence data sample as a target spatial scaling factor, and fusing the target spatial scaling factor and the time scheduling factor to obtain the second element value corresponding to the target position.
  9. The method of claim 8, wherein the fusing the target spatial scaling factor and the time scheduling factor to obtain the second element value corresponding to the target position comprises: calculating the product of the time scheduling factor and the target spatial scaling factor; and taking the smaller of the product and the third numerical value as the second element value corresponding to the target position.
  10. The method according to any one of claims 1 to 8, wherein the adjusting the first attention intensity matrix based on the mask matrix to obtain a second attention intensity matrix comprises performing any one of the following: multiplying the first attention intensity matrix by the mask matrix element by element to obtain the second attention intensity matrix; or performing logarithmic-domain conversion on the mask matrix to obtain a bias matrix, fusing the first attention intensity matrix and the bias matrix to obtain a fusion matrix, and performing normalization processing on the fusion matrix to obtain the second attention intensity matrix.
  11. The method according to any one of claims 1 to 8, wherein the performing attention calculation on the sequence data sample through the model to be trained to obtain a first attention intensity matrix comprises: performing linear transformation on the features of the sequence data sample through the model to be trained to obtain a query matrix, a key matrix, and a value matrix, wherein the query matrix comprises a query vector for each data unit, the key matrix comprises a key vector for each data unit, and the value matrix comprises a value vector for each data unit; and for any two data units of the plurality of data units, determining a first element value for characterizing the attention intensity between a first data unit and a second data unit based on the query vector corresponding to the first data unit in the query matrix and the key vector corresponding to the second data unit in the key matrix, wherein the two data units comprise the first data unit and the second data unit.
  12. The method according to any one of claims 1 to 8, wherein the determining a loss value based on the attention feature sample comprises: performing feature mapping processing on the attention feature sample to obtain a mapped feature sample; calculating a first similarity between the features of the mapped feature sample and a positive sample, and calculating a second similarity between the features of the mapped feature sample and a negative sample, wherein the positive sample is a data sample having a semantic association with the sequence data sample, and the negative sample is a data sample having no semantic association with the sequence data sample; and determining the loss value based on the first similarity and the second similarity.
  13. The method according to any one of claims 1 to 8, wherein the updating parameters of the model to be trained based on the loss value comprises: generating fine-tuning weight parameters based on a pre-generated first initialization matrix and second initialization matrix, and adding the fine-tuning weight parameters to the original model parameters of the model to be trained, wherein the first initialization matrix and the second initialization matrix are initialized in different manners; and keeping the original model parameters unchanged and updating the fine-tuning weight parameters based on the loss value to obtain the updated model.
  14. A method of data processing, the method comprising: acquiring sequence data to be processed; determining an attention feature of the sequence data through a pre-trained model, and performing feature mapping processing on the attention feature to obtain a mapping feature, wherein the model is obtained through training by the method of any one of claims 1 to 13; and executing a preset data processing task based on the mapping feature to obtain a processing result.
  15. The method of claim 14, wherein the executing a preset data processing task based on the mapping feature to obtain a processing result comprises: calculating the similarity between the mapping feature and each candidate feature in a preset database, wherein each candidate feature corresponds to one piece of text data; screening a preset number of candidate documents from the plurality of pieces of text data based on the similarities; and generating a reply text based on the sequence data and the preset number of candidate documents, and taking the generated reply text as the processing result.
  16. A training apparatus for a large language model, the apparatus comprising: a data acquisition module, configured to acquire a sequence data sample, wherein the sequence data sample comprises a plurality of data units; and a training module, configured to perform attention calculation on the sequence data sample through a model to be trained to obtain a first attention intensity matrix, wherein the first attention intensity matrix comprises a plurality of first element values, and each first element value is used for representing the attention intensity between two data units of the plurality of data units; the training module is further configured to determine a time scheduling factor corresponding to a training time step, and determine a spatial scaling factor corresponding to the position of each data unit in the sequence data sample; the training module is further configured to determine a mask matrix based on the time scheduling factor and the spatial scaling factor, wherein the mask matrix comprises a plurality of second element values, each second element value being used to adjust the attention intensity between two data units; the training module is further configured to adjust the first attention intensity matrix based on the mask matrix to obtain a second attention intensity matrix, and determine an attention feature sample of the sequence data sample based on the second attention intensity matrix; and the training module is further configured to determine a loss value based on the attention feature sample, and update parameters of the model to be trained based on the loss value to obtain the model trained at the training time step.
  17. A data processing apparatus, the apparatus comprising: a data acquisition module, configured to acquire sequence data to be processed; and a data processing module, configured to determine an attention feature of the sequence data through a pre-trained model and perform feature mapping processing on the attention feature to obtain a mapping feature, wherein the model is obtained through training by the method of any one of claims 1 to 13; the data processing module is further configured to execute a preset data processing task based on the mapping feature to obtain a processing result.
  18. An electronic device, the electronic device comprising: a memory, configured to store computer-executable instructions or a computer program; and a processor, configured to implement the training method for a large language model of any one of claims 1 to 13, or the data processing method of any one of claims 14 to 15, when executing the computer-executable instructions or the computer program stored in the memory.
  19. A computer-readable storage medium storing computer-executable instructions or a computer program which, when executed by a processor, implements the training method for a large language model according to any one of claims 1 to 13 or the data processing method according to any one of claims 14 to 15.
  20. A computer program product comprising computer-executable instructions or a computer program which, when executed by a processor, implements the training method for a large language model according to any one of claims 1 to 13 or the data processing method according to any one of claims 14 to 15.
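The mask construction in claims 8 and 9 can be illustrated with a minimal sketch. The concrete choices below are assumptions, not the claimed implementation: the time scheduling factor is taken as the ratio of the training time step to the total number of steps (one of the linear-mapping options of claim 3), the spatial scaling factor as the ratio of the sequence length to the position index (the first option of claim 6), and the "third numerical value" as 1.0.

```python
import numpy as np

def build_mask(seq_len, step, total_steps, full_value=1.0):
    """Sketch of the mask matrix of claims 8-9 under assumed concrete choices.

    time_factor: non-decreasing in the training step (claim 2 / claim 3).
    spatial:     non-increasing in the position index (claim 5 / claim 6).
    full_value:  the preset "third numerical value" for the i >= j region.
    """
    time_factor = min(step / total_steps, 1.0)
    mask = np.empty((seq_len, seq_len))
    for i in range(seq_len):
        spatial = seq_len / (i + 1)  # ratio of sequence length to 1-based position
        for j in range(seq_len):
            if i >= j:
                # Causal region: fully visible, preset third numerical value.
                mask[i, j] = full_value
            else:
                # Future region: the smaller of (time factor x spatial factor)
                # and the third numerical value, per claim 9.
                mask[i, j] = min(time_factor * spatial, full_value)
    return mask
```

Because the time factor grows with the training step, entries above the diagonal start small and approach `full_value` as training progresses, which is the scheduled widening of the visible range the claims describe.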

Description

Training method, data processing method, device, equipment, storage medium and program product for large language model

Technical Field

The present application relates to the field of computer technologies, and in particular, to a training method, a data processing method, a device, equipment, a storage medium, and a program product for a large language model.

Background

In the related art, during the training of a large language model, an attention mechanism is generally used to capture the associations between different data units in sequence data. To control the degree of attention between data units, a mask matrix can be used to control the visible range of the data units during attention calculation. However, the mask matrix in the related art is usually constructed based on a fixed geometric structure, which limits the attention mechanism's ability to flexibly adjust the range of context-information interaction during model parameter optimization, and in turn affects the performance of the trained large language model.

Disclosure of Invention

The embodiments of the application provide a training method, a data processing method, a device, equipment, a storage medium, and a program product for a large language model, which can improve the performance of the large language model.
The technical solution of the embodiments of the application is realized as follows. An embodiment of the application provides a training method for a large language model, comprising: obtaining a sequence data sample, wherein the sequence data sample comprises a plurality of data units; performing attention computation on the sequence data sample through a model to be trained to obtain a first attention intensity matrix, wherein the first attention intensity matrix comprises a plurality of first element values, and each first element value is used for representing the attention intensity between two data units of the plurality of data units; determining a time scheduling factor corresponding to a training time step, and determining a spatial scaling factor corresponding to the position of each data unit in the sequence data sample; determining a mask matrix based on the time scheduling factor and the spatial scaling factor, wherein the mask matrix comprises a plurality of second element values, each second element value being used for adjusting the attention intensity between two data units; adjusting the first attention intensity matrix based on the mask matrix to obtain a second attention intensity matrix, and determining an attention feature sample of the sequence data sample based on the second attention intensity matrix; and determining a loss value based on the attention feature sample, and updating parameters of the model to be trained based on the loss value to obtain the model trained at the training time step.
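The adjustment step above admits the log-domain variant of claim 10: convert the mask to an additive bias, fuse it with the raw attention scores, and normalize. A minimal sketch follows; the small epsilon guarding `log(0)` for fully suppressed positions is an assumed numerical detail, not part of the claims.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax (the normalization step of claim 10).
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adjust_attention(scores, mask, eps=1e-9):
    """Log-domain variant of claim 10 (sketch).

    scores: first attention intensity matrix (raw query-key scores).
    mask:   mask matrix of second element values in [0, 1].
    """
    bias = np.log(mask + eps)       # logarithmic-domain conversion -> bias matrix
    fused = scores + bias           # fuse scores with the bias matrix
    return softmax(fused, axis=-1)  # normalize -> second attention intensity matrix
```

A mask entry of 1 leaves the score unchanged (bias 0), while an entry near 0 pushes the fused score toward negative infinity, so the corresponding attention weight vanishes after normalization.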
An embodiment of the application provides a data processing method, comprising: acquiring sequence data to be processed; determining an attention feature of the sequence data through a pre-trained model, and performing feature mapping processing on the attention feature to obtain a mapping feature, wherein the model is obtained through training by the training method for a large language model provided by the embodiments of the application; and executing a preset data processing task based on the mapping feature to obtain a processing result. An embodiment of the application provides a training apparatus for a large language model, comprising: a data acquisition module, configured to acquire a sequence data sample, wherein the sequence data sample comprises a plurality of data units; and a training module, configured to perform attention calculation on the sequence data sample through a model to be trained to obtain a first attention intensity matrix, wherein the first attention intensity matrix comprises a plurality of first element values, and each first element value is used for representing the attention intensity between two data units of the plurality of data units. The training module is further configured to determine a time scheduling factor corresponding to a training time step, and determine a spatial scaling factor corresponding to the position of each data unit in the sequence data sample. The training module is further configured to determine a mask matrix based on the time scheduling factor and the spatial scaling factor, wherein the mask matrix comprises a plurality of second element values, each second element value being used to adjust the attention intensity between two data units. The training module is further configured to adjust the first attention intensity matrix based on the mask matrix to obtain a second attention intensity matrix, and determine an attention feature sample of the sequence data sample based on the second attention intensity matrix.
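The data processing task described above (claim 15) screens candidate documents by similarity between the mapping feature and the candidate features in a database. A minimal sketch, assuming cosine similarity as the similarity measure (the claims do not fix one):

```python
import numpy as np

def retrieve_top_k(query_feature, candidate_features, k=3):
    """Sketch of the screening step of claim 15 under an assumed cosine
    similarity: return the indices of the k candidate features most
    similar to the mapping feature of the input sequence data."""
    q = query_feature / np.linalg.norm(query_feature)
    c = candidate_features / np.linalg.norm(candidate_features, axis=1, keepdims=True)
    sims = c @ q                    # cosine similarity to each candidate
    return np.argsort(-sims)[:k]    # indices of the top-k candidates
```

The returned indices select the preset number of candidate documents, which the claimed method then feeds, together with the input sequence data, into reply-text generation.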