
CN-121981192-A - Training method and device for large model, electronic equipment and storage medium

CN 121981192 A

Abstract

The disclosure provides a training method and apparatus for a large model, an electronic device, and a storage medium, and relates to the technical field of data processing, in particular to the fields of large models, artificial intelligence, and deep learning. The method comprises: obtaining a sample data set; training a large model based on the sample data set to obtain direct preference optimization (DPO) loss information and supervised fine-tuning (SFT) loss information of the large model; and carrying out gradient adjustment on the model parameters of the large model according to the DPO loss information and the SFT loss information to obtain a target large model, thereby improving model training accuracy.
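As a purely illustrative aid (not part of the patent text), the training step summarized in the abstract might look like the following PyTorch-style sketch: a DPO loss and an SFT loss are computed for a batch of task instructions with positive/negative sample pairs and combined for a single gradient update. The helper sequence_logprob, the weights w_dpo and w_sft, and the specific loss formulas are assumptions introduced for readability; the abstract does not specify them.

    import torch
    import torch.nn.functional as F

    def joint_training_step(model, ref_model, optimizer, batch,
                            w_dpo=0.5, w_sft=0.5, beta=0.1):
        """One hypothetical training step combining DPO and SFT losses.

        `batch` is assumed to carry a task instruction plus a positive
        (preferred) and a negative (rejected) sample for each example;
        `sequence_logprob` is an assumed helper returning per-sequence
        log-probabilities.
        """
        pos_logp = model.sequence_logprob(batch["instruction"], batch["positive"])
        neg_logp = model.sequence_logprob(batch["instruction"], batch["negative"])
        with torch.no_grad():  # frozen initial (reference) model
            ref_pos_logp = ref_model.sequence_logprob(batch["instruction"], batch["positive"])
            ref_neg_logp = ref_model.sequence_logprob(batch["instruction"], batch["negative"])

        # Standard DPO loss: prefer the positive over the negative sample,
        # measured as deviation from the reference model.
        margin = beta * ((pos_logp - ref_pos_logp) - (neg_logp - ref_neg_logp))
        dpo_loss = -F.logsigmoid(margin).mean()

        # SFT loss: negative log-likelihood of the positive sample only.
        sft_loss = -pos_logp.mean()

        # Weighted combination, then one gradient update of the model parameters.
        loss = w_dpo * dpo_loss + w_sft * sft_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return dpo_loss.item(), sft_loss.item()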

Inventors

  • Chen Muhan
  • Wen Dailin
  • Lv Zhonghou
  • Bao Chenfu
  • Wang Guoqiu
  • Tian Weijuan
  • Han Miao

Assignees

  • 北京百度网讯科技有限公司 (Beijing Baidu Netcom Science Technology Co., Ltd.)

Dates

Publication Date
2026-05-05
Application Date
2026-01-14

Claims (16)

  1. A method of training a large model, the method comprising: obtaining a sample data set, wherein the sample data comprises task instructions and sample pairs, and the sample pairs comprise positive samples and negative samples; training a large model based on the sample data set to obtain direct preference optimization (DPO) loss information and supervised fine-tuning (SFT) loss information of the large model; and carrying out gradient adjustment on the model parameters of the large model according to the DPO loss information and the SFT loss information so as to obtain a target large model.
  2. The method of claim 1, wherein training the large model based on the sample data set to obtain the DPO loss information and the SFT loss information of the large model comprises: training the large model based on the sample data set to obtain a prediction sample pair output by the large model, wherein the prediction sample pair comprises a predicted positive sample and a predicted negative sample; obtaining the DPO loss information of the large model according to the sample pair and the prediction sample pair; and obtaining the SFT loss information according to the positive sample and the predicted positive sample.
  3. The method of claim 1, wherein carrying out gradient adjustment on the model parameters of the large model according to the DPO loss information and the SFT loss information to obtain the target large model comprises: recognizing a training stage of the large model; weighting the DPO loss information and the SFT loss information according to the training stage to obtain global loss information of the large model; and carrying out gradient adjustment on the model parameters of the large model according to the global loss information to obtain the target large model.
  4. The method of claim 3, wherein weighting the DPO loss information and the SFT loss information according to the training stage to obtain the global loss information of the large model comprises: determining weights respectively corresponding to the DPO loss information and the SFT loss information according to the training stage; and determining the global loss information of the large model according to the DPO loss information, the SFT loss information, and their respective weights (an illustrative weighting and deviation sketch follows the claims).
  5. The method of claim 4, wherein determining the weights respectively corresponding to the DPO loss information and the SFT loss information according to the training stage comprises: in an initial training stage of the large model, determining the weight corresponding to the DPO loss information within a first setting range; and determining the weight corresponding to the SFT loss information based on the weight corresponding to the DPO loss information.
  6. The method of claim 5, further comprising: in the initial training stage, keeping the weight corresponding to the DPO loss information and the weight corresponding to the SFT loss information unchanged.
  7. The method of claim 5, wherein determining the weights respectively corresponding to the DPO loss information and the SFT loss information according to the training stage comprises: in a non-initial training stage of the large model, determining the weight corresponding to the DPO loss information within a second setting range; and determining the weight corresponding to the SFT loss information based on the weight corresponding to the DPO loss information, wherein an upper limit value of the first setting range is smaller than a lower limit value of the second setting range.
  8. The method of claim 7, further comprising: in the non-initial training stage of the large model, increasing the weight corresponding to the DPO loss information as the training stage progresses, and decreasing the weight corresponding to the SFT loss information as the training stage progresses.
  9. The method of claim 2, wherein obtaining the DPO loss information of the large model according to the sample pair and the prediction sample pair comprises: for any sample pair in the sample data, determining deviation information of the large model after the current training relative to an initial large model; and determining the DPO loss information of the large model according to the deviation information.
  10. The method of claim 9, wherein determining the deviation information of the large model after the current training relative to the initial large model comprises: determining first prediction information of the sample data output by the large model during the current training; determining second prediction information of the sample data output by the initial large model; and determining the deviation information according to the first prediction information and the second prediction information.
  11. The method of claim 10, wherein determining the deviation information according to the first prediction information and the second prediction information comprises: determining local deviation information corresponding to the sample data according to the first prediction information and the second prediction information; and weighting the local deviation information corresponding to the sample data to obtain the deviation information.
  12. The method of claim 11, wherein determining the local deviation information corresponding to the sample data according to the first prediction information and the second prediction information comprises: determining a first positive-sample prediction probability of the positive sample and a first negative-sample prediction probability of the negative sample according to the first prediction information; determining a second positive-sample prediction probability of the predicted positive sample and a second negative-sample prediction probability of the predicted negative sample according to the second prediction information; determining positive-sample deviation information corresponding to the sample data according to the first positive-sample prediction probability and the second positive-sample prediction probability; determining negative-sample deviation information corresponding to the sample data according to the first negative-sample prediction probability and the second negative-sample prediction probability; and determining the local deviation information corresponding to the sample data according to the positive-sample deviation information and the negative-sample deviation information.
  13. A training apparatus for a large model, comprising: an acquisition module configured to acquire a sample data set, wherein the sample data comprises task instructions and sample pairs, and the sample pairs comprise positive samples and negative samples; a training module configured to train the large model based on the sample data set to obtain direct preference optimization (DPO) loss information and supervised fine-tuning (SFT) loss information of the large model; and an optimization module configured to carry out gradient adjustment on the model parameters of the large model according to the DPO loss information and the SFT loss information so as to obtain a target large model.
  14. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-12.
  15. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-12.
  16. A computer program product comprising a computer program which, when executed by a processor, implements the steps of the method according to any one of claims 1-12.
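The following sketch is an illustrative, non-authoritative reading of claims 3-12: a stage-dependent weight schedule in which the DPO weight is held within a small first range during the initial stage and then grows within a larger second range while the SFT weight shrinks, and a per-sample-pair local deviation built from the positive- and negative-sample prediction probabilities of the current model and of the initial (reference) model. All function names, numeric ranges, and formulas are assumptions; the claims do not prescribe them.

    import torch.nn.functional as F

    def stage_weights(step, total_steps, warmup_frac=0.1,
                      first_range=(0.1, 0.3), second_range=(0.5, 0.9)):
        """Hypothetical schedule for claims 5-8: the DPO weight is held
        constant inside `first_range` during the initial stage, then grows
        inside `second_range`, while the SFT weight is derived from it and
        decreases accordingly.  Note first_range upper bound < second_range
        lower bound, matching claim 7."""
        if step < warmup_frac * total_steps:
            w_dpo = first_range[0]            # fixed during the initial stage (claim 6)
        else:
            progress = (step - warmup_frac * total_steps) / ((1 - warmup_frac) * total_steps)
            w_dpo = second_range[0] + progress * (second_range[1] - second_range[0])
        w_sft = 1.0 - w_dpo                   # SFT weight based on the DPO weight
        return w_dpo, w_sft

    def local_deviation(pos_logp, neg_logp, ref_pos_logp, ref_neg_logp, beta=0.1):
        """Per-sample-pair deviation of the trained model from the initial
        model (claims 10-12): positive-sample deviation minus negative-sample
        deviation, each expressed as a log-probability ratio against the
        reference model."""
        pos_dev = pos_logp - ref_pos_logp     # positive-sample deviation information
        neg_dev = neg_logp - ref_neg_logp     # negative-sample deviation information
        return beta * (pos_dev - neg_dev)

    def dpo_loss_from_deviations(deviations, sample_weights=None):
        """DPO loss as an (optionally weighted) aggregation of local
        deviations (claims 9 and 11), using the standard -log sigmoid form
        as an assumption."""
        losses = -F.logsigmoid(deviations)
        if sample_weights is not None:
            losses = losses * sample_weights
        return losses.mean()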

Description

Training method and device for large model, electronic equipment and storage medium

Technical Field

The disclosure relates to the technical field of data processing, in particular to the fields of large models, artificial intelligence, and deep learning, and especially to a training method and apparatus for a large model, an electronic device, and a storage medium.

Background

Model training is the process of repeatedly adjusting a model's internal parameters with a specific algorithm and data so that the model can accurately predict or classify new data. Current training methods for large models struggle to generate text that is natural, fluent, and consistent with human expectations and value standards in complex scenarios, so high-quality intelligent generation and decision assistance cannot be reliably achieved.

Disclosure of Invention

The disclosure provides a training method and apparatus for a large model, an electronic device, and a storage medium.

According to one aspect of the present disclosure, there is provided a training method for a large model, the method comprising: obtaining a sample data set, wherein the sample data comprises task instructions and sample pairs, and the sample pairs comprise positive samples and negative samples; training a large model based on the sample data set to obtain direct preference optimization (DPO) loss information and supervised fine-tuning (SFT) loss information of the large model; and carrying out gradient adjustment on the model parameters of the large model according to the DPO loss information and the SFT loss information so as to obtain a target large model.

According to another aspect of the present disclosure, there is provided a training apparatus for a large model, comprising: an acquisition module for acquiring a sample data set, wherein the sample data comprises task instructions and sample pairs, the sample pairs comprising positive and negative samples; a training module for training the large model based on the sample data set to obtain DPO loss information and SFT loss information of the large model; and an optimization module for carrying out gradient adjustment on the model parameters of the large model according to the DPO loss information and the SFT loss information so as to obtain a target large model.

According to a third aspect of the present disclosure, there is provided an electronic device comprising at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first-aspect embodiments.

According to a fourth aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of the first-aspect embodiments.

According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method of the first-aspect embodiments.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to limit the scope of the disclosure.
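For reference, the joint objective described in the first aspect above can be written compactly. The DPO term below uses the standard direct preference optimization formulation, which is an assumption introduced for clarity (the disclosure itself does not give explicit formulas); x denotes the task instruction, y_w the positive sample, y_l the negative sample, pi_theta the large model being trained, pi_ref the initial large model, beta a temperature hyperparameter, and w_DPO, w_SFT the stage-dependent weights discussed in the claims.

    \mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
    \mathcal{L}_{\mathrm{SFT}} = -\,\mathbb{E}_{(x,\,y_w)}\!\left[\log\pi_\theta(y_w\mid x)\right]
    \mathcal{L}_{\mathrm{global}} = w_{\mathrm{DPO}}\,\mathcal{L}_{\mathrm{DPO}} + w_{\mathrm{SFT}}\,\mathcal{L}_{\mathrm{SFT}}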
Other features of the present disclosure will become apparent from the following specification.

Drawings

The drawings are provided for a better understanding of the present solution and are not to be construed as limiting the present disclosure. In the drawings:

FIG. 1 is a schematic diagram of a training method for a large model provided by embodiments of the present disclosure;
FIG. 2 is a schematic diagram of a training method for a large model provided by embodiments of the present disclosure;
FIG. 3 is a schematic diagram of a training method for a large model provided by an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a large-model training device provided in an embodiment of the present disclosure;
FIG. 5 is a schematic block diagram of an electronic device used to implement an embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding and should be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

Data processing is the collection, s