
CN-121981264-A - Model training method and device for variable length sequences, storage medium and electronic equipment


Abstract

The application discloses a model training method for variable length sequences. The method comprises: determining a training paradigm, training samples and a micro-batch size corresponding to fine-tuning of a target model according to a fine-tuning request for the target model; dividing the training samples into micro-batches according to the micro-batch size to obtain micro-batch samples; inputting the micro-batch samples into pipeline stages to convert them into intermediate representations; when an intermediate representation is transferred between adjacent pipeline stages, performing shape adjustment on it according to the sequence length of the training samples, the micro-batch size and the role type of the pipeline stage; during attention calculation within each pipeline stage, using an attention mask to constrain the attention weights of the intermediate representation input to that stage; and performing the shape adjustment and attention weight constraint through forward propagation in the pipeline stages, and completing backward propagation to calculate gradients and update the parameters of the target model. The application improves training efficiency and reduces computing-resource consumption.

Inventors

  • LU JIAN

Assignees

  • Industrial and Commercial Bank of China Limited

Dates

Publication Date
2026-05-05
Application Date
2026-01-16

Claims (10)

  1. A model training method for variable length sequences, the method comprising: determining a training paradigm, training samples and a micro-batch size corresponding to fine-tuning of a target model according to a fine-tuning request for the target model, wherein the training samples are variable-length sequences; dividing the training samples into micro-batches according to the micro-batch size to obtain micro-batch samples, and inputting the micro-batch samples into pipeline stages to convert the micro-batch samples into intermediate representations in the pipeline stages, wherein the pipeline stages are obtained by dividing the network layer sequence of the target model using a pipeline parallel strategy according to the model scale and parallel configuration of the target model; when an intermediate representation is transferred between adjacent pipeline stages, performing shape adjustment on the intermediate representation according to the sequence length of the training samples, the micro-batch size and the role type of the pipeline stage; during attention calculation within each pipeline stage, using an attention mask to constrain the attention weights of the intermediate representation input to the stage, wherein the attention mask is generated according to the training paradigm, the micro-batch size and the structural characteristics of the training samples within the same micro-batch; and performing the shape adjustment and the attention weight constraint through forward propagation in the pipeline stages, and completing backward propagation to calculate gradients and update the parameters of the target model.
  2. The method of claim 1, wherein performing shape adjustment on the intermediate representation according to the sequence length of the training samples, the micro-batch size and the role type of the pipeline stage comprises: determining an expected length adopted by the intermediate representation when transferred between adjacent pipeline stages according to the sequence length of the training samples and the micro-batch size; and performing a padding operation and/or an unpadding operation on the intermediate representation according to the expected length, the actual length of the micro-batch samples and the role type of the pipeline stage, so as to adjust the shape of the intermediate representation; wherein the role type of the pipeline stage is used to indicate whether the pipeline stage is associated with an upstream stage and/or a downstream stage.
  3. The method of claim 2, wherein determining the expected length adopted by the intermediate representation when transferred between adjacent pipeline stages according to the sequence length of the training samples and the micro-batch size comprises: determining the maximum length of the training samples according to the sequence length of the training samples, and determining the number of samples included in each micro-batch according to the micro-batch size; and determining the expected length adopted by the intermediate representation when transferred between adjacent pipeline stages based on the number of samples included in each micro-batch and the maximum length of the training samples.
  4. The method of claim 2, wherein performing a padding operation and/or an unpadding operation on the intermediate representation according to the expected length, the actual length of the micro-batch samples and the role type of the pipeline stage comprises: if the pipeline stage is associated only with a downstream stage, performing a padding operation on the intermediate representation obtained by the pipeline stage to pad the intermediate representation to the expected length, and transferring the intermediate representation of the expected length and the actual length of the micro-batch samples to the downstream stage; if the pipeline stage is associated with both a downstream stage and an upstream stage, performing an unpadding operation on the intermediate representation of the expected length using the actual length acquired from the upstream stage to obtain an effective part, and performing computation on the effective part; after the computation on the effective part is completed, performing a padding operation on the computed intermediate representation to pad it to the expected length, and transferring the intermediate representation of the expected length and the actual length of the micro-batch samples to the downstream stage; and if the pipeline stage is associated only with an upstream stage, performing an unpadding operation on the intermediate representation of the expected length using the actual length acquired from the upstream stage to obtain an effective part, and performing computation on the effective part.
  5. The method according to claim 1, wherein the method further comprises: determining a mask type according to the training paradigm and the structural characteristics of the training samples within the same micro-batch, wherein the mask type comprises a standard causal mask and a grouping causal mask; if the micro-batch size is larger than 1, splicing the basic calculation units within the same micro-batch end to end in the sequence dimension to obtain a sample combination; and generating an attention mask in block-diagonal form for the sample combination using the mask type (both mask forms are illustrated in the first sketch following the claims).
  6. The method of claim 5, wherein determining the mask type according to the training paradigm and the structural characteristics of the training samples within the same micro-batch comprises: if the training paradigm is supervised fine-tuning, determining that the mask type is the standard causal mask; and if the training paradigm is reinforcement learning and the structural characteristic of the training samples within the same micro-batch is that one prompt corresponds to at least two responses, determining that the mask type is the grouping causal mask.
  7. The method of claim 6, wherein the method further comprises: if the mask type is the grouping causal mask, determining the prompt length, the number of responses corresponding to the prompt and each response length based on the structural characteristics of the training samples within the same micro-batch; calculating the total key-value demand length of all training samples in the micro-batch based on the prompt length, the number of responses corresponding to the prompt and the response lengths; planning a memory layout of a key-value cache and generating a linear index sequence based on the total key-value demand length, wherein in the memory layout the prompt key-values come first and the newly added key-values of each response are arranged in sequence thereafter; calculating a base offset for each response based on the memory layout, and expanding the base offset repeatedly according to each response and its corresponding prompt length to generate an index offset sequence, wherein the index offset sequence is used for redirecting the prompt indices of non-first responses to the key-value starting position of the shared prompt; and adding the index offset sequence and the linear index sequence element by element to obtain an adjusted index sequence, and gathering the corresponding key data and value data from the key-value cache according to the adjusted index sequence to form a calculation tensor, wherein the calculation tensor is used for attention calculation in combination with the grouping causal mask (the index construction is illustrated in the second sketch following the claims).
  8. A model training apparatus for variable length sequences, the apparatus comprising: a training data determining module, used for determining a training paradigm, training samples and a micro-batch size corresponding to fine-tuning of a target model according to a fine-tuning request for the target model, wherein the training samples are variable-length sequences; a micro-batch sample determining module, used for dividing the training samples into micro-batches according to the micro-batch size to obtain micro-batch samples, and inputting the micro-batch samples into pipeline stages to convert the micro-batch samples into intermediate representations in the pipeline stages, wherein the pipeline stages are obtained by dividing the network layer sequence of the target model using a pipeline parallel strategy according to the model scale and parallel configuration of the target model; a shape adjustment module, used for performing shape adjustment on an intermediate representation according to the sequence length of the training samples, the micro-batch size and the role type of the pipeline stage when the intermediate representation is transferred between adjacent pipeline stages; a mask generation module, used for constraining the attention weights of the intermediate representation input to each pipeline stage with an attention mask during attention calculation within that stage, wherein the attention mask is generated according to the training paradigm, the micro-batch size and the structural characteristics of the training samples within the same micro-batch; and a model parameter updating module, used for performing the shape adjustment and the attention weight constraint through forward propagation in the pipeline stages and completing backward propagation to calculate gradients, so as to update the parameters of the target model.
  9. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the model training method for variable length sequences according to any one of claims 1-7.
  10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the model training method for variable length sequences according to any one of claims 1-7.
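The two mask forms referenced in claims 5 and 6 can be made concrete with a short sketch. The following is a minimal PyTorch illustration, not the patent's implementation: the function names, the "True means attention allowed" convention, and the per-sample length inputs are all assumptions.

```python
# Minimal sketch of the block-diagonal masks of claims 5-6 (illustrative).
import torch

def block_diag_causal_mask(sample_lens):
    # Standard causal mask per sample, block-diagonal across samples that
    # were spliced end to end in the sequence dimension (claim 5).
    total = sum(sample_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in sample_lens:
        mask[start:start + n, start:start + n] = torch.tril(
            torch.ones(n, n, dtype=torch.bool))
        start += n
    return mask  # True = attention allowed

def grouped_causal_mask(prompt_len, response_lens):
    # Grouping causal mask (claim 6): each response may attend to the shared
    # prompt and causally to itself, but never to a sibling response.
    total = prompt_len + sum(response_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:prompt_len, :prompt_len] = torch.tril(
        torch.ones(prompt_len, prompt_len, dtype=torch.bool))
    start = prompt_len
    for n in response_lens:
        mask[start:start + n, :prompt_len] = True             # see the prompt
        mask[start:start + n, start:start + n] = torch.tril(  # causal within
            torch.ones(n, n, dtype=torch.bool))
        start += n
    return mask
```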

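Claim 7's adjusted index sequence (a linear index sequence plus an index offset sequence, added element by element) can likewise be sketched. The following is a hypothetical reconstruction under the memory layout the claim describes, with the shared prompt's key-values stored first and each response's new key-values after it; the exact offset arithmetic is an assumption, not the patent's code.

```python
# Hypothetical sketch of the KV gather of claim 7; names are illustrative.
import torch

def build_adjusted_indices(prompt_len, response_lens):
    # Logical view: every response sees [prompt, its own tokens]. The linear
    # index enumerates that view; the offsets redirect each non-first
    # response's prompt segment back to the shared prompt's cache slots and
    # point response tokens at their own slots (claim 7).
    linear, offsets = [], []
    pos = 0                    # position in the logical, duplicated-prompt view
    cache_pos = prompt_len     # responses are stored after the prompt
    for r in response_lens:
        for j in range(prompt_len):       # prompt segment -> shared slot j
            linear.append(pos)
            offsets.append(j - pos)
            pos += 1
        for j in range(r):                # response segment -> its cache slot
            linear.append(pos)
            offsets.append(cache_pos + j - pos)
            pos += 1
        cache_pos += r
    # Element-wise addition yields the adjusted index sequence.
    return torch.tensor(linear) + torch.tensor(offsets)

def gather_kv(kv_cache, prompt_len, response_lens):
    # kv_cache: (prompt_len + sum(response_lens), ...) per the memory layout.
    idx = build_adjusted_indices(prompt_len, response_lens)
    return kv_cache.index_select(0, idx)  # calculation tensor for attention
                                          # with the grouping causal mask
```

For example, with prompt_len=2 and response_lens=[2, 1], build_adjusted_indices returns [0, 1, 2, 3, 0, 1, 4]: the second response's prompt indices are redirected to the shared prompt at cache positions 0-1 instead of duplicating its key-values.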
Description

Model training method and device for variable length sequences, storage medium and electronic equipment

Technical Field

The application relates to the technical field of artificial intelligence and model training, and is applicable to financial and scientific scenarios. In particular, it relates to a model training method and device for variable length sequences, a storage medium and electronic equipment.

Background

In fields such as natural language processing, the training samples of large models are typically text sequences of varying lengths, i.e., variable length sequences. To accommodate hardware requirements for regular data formats, the related art generally pads a batch of samples to the maximum length within the batch to form a fixed-size tensor for training. However, this introduces a large amount of meaningless computation on the padding when processing samples whose actual lengths are far below the maximum length, resulting in serious waste of computing resources. Especially when a distributed training strategy such as pipeline parallelism is employed, inter-stage tensor transfers must keep consistent shapes, which further solidifies and amplifies this redundancy. In addition, in scenarios where one prompt corresponds to multiple responses, such as reinforcement learning, the prompt portion, whose computation could otherwise be reused, is simply padded and recomputed for each response, resulting in duplicated overhead. These defects make model training for variable length sequences inefficient and costly, and restrict the efficient fine-tuning and application deployment of large models.

Disclosure of Invention

The application provides a model training method and device for variable length sequences, a storage medium and electronic equipment, which eliminate padding redundancy and invalid attention calculation in variable-length-sequence training while ensuring training accuracy, remarkably improving training efficiency and reducing computing-resource consumption.
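To make the scale of the padding redundancy described in the Background concrete, here is a small numeric illustration (not from the patent); the sample lengths are hypothetical, and the padding rule is the related-art one of padding every sample to the in-batch maximum.

```python
# Hypothetical illustration of padding waste when a micro-batch of
# variable-length samples is padded to the in-batch maximum length.
seq_lens = [37, 512, 96, 204]            # assumed sample lengths (tokens)
padded_len = max(seq_lens)               # every sample is padded to 512
useful = sum(seq_lens)                   # 849 tokens carry real content
computed = padded_len * len(seq_lens)    # 2048 token positions are processed
print(f"padding overhead: {1 - useful / computed:.0%}")  # ~59% wasted compute
```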
According to a first aspect of the present application, there is provided a model training method for variable length sequences, the method comprising: determining a training paradigm, training samples and a micro-batch size corresponding to fine-tuning of a target model according to a fine-tuning request for the target model, wherein the training samples are variable-length sequences; dividing the training samples into micro-batches according to the micro-batch size to obtain micro-batch samples, and inputting the micro-batch samples into pipeline stages to convert the micro-batch samples into intermediate representations in the pipeline stages, wherein the pipeline stages are obtained by dividing the network layer sequence of the target model using a pipeline parallel strategy according to the model scale and parallel configuration of the target model; when an intermediate representation is transferred between adjacent pipeline stages, performing shape adjustment on the intermediate representation according to the sequence length of the training samples, the micro-batch size and the role type of the pipeline stage; during attention calculation within each pipeline stage, using an attention mask to constrain the attention weights of the intermediate representation input to the stage, wherein the attention mask is generated according to the training paradigm, the micro-batch size and the structural characteristics of the training samples within the same micro-batch; and performing the shape adjustment and the attention weight constraint through forward propagation in the pipeline stages, and completing backward propagation to calculate gradients and update the parameters of the target model.

According to a second aspect of the present application, there is provided a model training apparatus for variable length sequences, the apparatus comprising: a training data determining module, used for determining a training paradigm, training samples and a micro-batch size corresponding to fine-tuning of a target model according to a fine-tuning request for the target model, wherein the training samples are variable-length sequences; a micro-batch sample determining module, used for dividing the training samples into micro-batches according to the micro-batch size to obtain micro-batch samples, and inputting the micro-batch samples into pipeline stages to convert the micro-batch samples into intermediate representations in the pipeline stages, wherein the pipeline stages are obtained by dividing the network layer sequence of the target model using a pipeline parallel strategy according to the model scale and parallel configuration of the target model; a shape adjustment module, used for performing shape adjustment on an intermediate representation according to the sequence length of the training samples, the micro-batch size and the role type of the pipeline stage when the intermediate representation is transferred between adjacent pipeline stages.
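A minimal sketch of the role-dependent shape adjustment described above (and in claims 2-4), assuming PyTorch. StageRole, expected_length, pad_to_expected, unpad_to_valid and stage_step are illustrative names, not from the patent, and the expected-length formula is an assumption.

```python
# Sketch of the role-dependent pad/unpad shape adjustment (illustrative).
import torch
import torch.nn.functional as F
from enum import Enum

class StageRole(Enum):
    FIRST = "first"    # associated only with a downstream stage
    MIDDLE = "middle"  # associated with both upstream and downstream stages
    LAST = "last"      # associated only with an upstream stage

def expected_length(seq_lens, micro_batch_size):
    # Claim 3: derived from the per-micro-batch sample count and the maximum
    # sample length; assumed here to be their product, since claim 5 splices
    # the samples end to end in the sequence dimension.
    return micro_batch_size * max(seq_lens)

def pad_to_expected(x, expected_len):
    # x: (actual_len, hidden) -> (expected_len, hidden), zero-padded at the
    # end, giving every inter-stage transfer the same fixed shape.
    return F.pad(x, (0, 0, 0, expected_len - x.shape[0]))

def unpad_to_valid(x, actual_len):
    # Recover the effective part using the actual length received alongside
    # the tensor from the upstream stage.
    return x[:actual_len]

def stage_step(role, x, actual_len, expected_len, compute):
    # Claim 4: middle and last stages unpad before computing; first and
    # middle stages re-pad before handing the tensor downstream. The actual
    # length always travels with the tensor.
    if role in (StageRole.MIDDLE, StageRole.LAST):
        x = unpad_to_valid(x, actual_len)
    x = compute(x)  # this stage's network layers (attention, FFN, ...)
    if role in (StageRole.FIRST, StageRole.MIDDLE):
        x = pad_to_expected(x, expected_len)
    return x, actual_len
```

In this reading, only the effective part ever passes through a stage's layers, while the fixed expected length is restored just for the inter-stage transfer, reconciling the shape consistency that pipeline parallelism requires with the elimination of computation on padding.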