CN-122021763-A - Training method, device, equipment, storage medium and product of large language model
Abstract
The application relates to a training method, device, equipment, storage medium, and product for a large language model. The method comprises: obtaining, based on a sample data request of the current training round, a target sample identifier corresponding to the request from a sample identifier set of a plurality of sample data; determining a target physical position corresponding to the target sample identifier based on a preset correspondence, where the preset correspondence maps the sample identifiers of the plurality of sample data to sample physical positions, and each sample physical position is determined from the physical position, in a storage device, of the original training text associated with the corresponding sample data; and, based on the target physical position, directly loading the target sample data from the storage device into memory to execute the current training round on the target large language model. The method can improve the overall training efficiency and throughput of the large language model.
Inventors
- LI YANLIANG
- LIU HUAIJUN
- LIU XIAOHUI
- HUANG GUANGWEI
Assignees
- 金蝶软件(中国)有限公司 (Kingdee Software (China) Co., Ltd.)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-01-27
Claims (12)
- 1. A method of training a large language model, the method comprising: acquiring, based on a sample data request of a current training round, a target sample identifier corresponding to the sample data request from a sample identifier set of a plurality of sample data; determining a target physical position corresponding to the target sample identifier based on a preset correspondence, wherein the preset correspondence represents the correspondence between the sample identifiers of the plurality of sample data and sample physical positions, and each sample physical position is determined according to the physical position, in a storage device, of the original training text associated with the corresponding sample data; and directly loading, based on the target physical position, the target sample data from the storage device into a training memory for executing the current training round on the target large language model.
- 2. The method of claim 1, wherein the preset correspondence includes a first correspondence and a second correspondence, and wherein, before the acquiring of the target sample identifier corresponding to the sample data request from the sample identifier set of the plurality of sample data based on the sample data request of the current training round, the method further comprises: dividing the original training text based on the text length of the original training text and a preset target sample length to obtain a plurality of sample data and relative position information for each sample datum, wherein the relative position information represents the position of the corresponding sample data within the associated original training text; constructing the first correspondence by combining the relative position information of the plurality of sample data with preset sample identifiers; and constructing the second correspondence by combining the relative position information, the physical position of the original training text, and the sample physical positions.
- 3. The method of claim 2, wherein determining the target physical position corresponding to the target sample identifier based on the preset correspondence comprises: determining, based on the first correspondence, target relative position information corresponding to the target sample identifier, wherein the target relative position information represents the position of the target sample data within the associated original training text; and determining, based on the second correspondence, the physical position of the original training text associated with the target sample data and, from that physical position together with the target relative position information, the target physical position of the target sample data.
- 4. The method according to claim 2, wherein dividing the original training text based on the text length of the original training text and a preset target sample length to obtain a plurality of sample data and the relative positions of the plurality of sample data includes: in the case that the original training texts are training texts for a pre-training task, concatenating the original training texts in their arrangement order into continuous text data; splitting the continuous text data according to the target sample length to obtain a plurality of sample data; and obtaining the relative positions of the plurality of sample data by combining the target sample length with the text length of each original training text.
- 5. The method according to claim 2, wherein dividing the original training text based on the text length of the original training text and a preset target sample length to obtain a plurality of sample data and relative position information of the plurality of sample data includes: in the case that the original training text is a training text for a fine-tuning task, performing a sample data dividing step for each original training text, the step including: when the text length of the original training text is less than or equal to the target sample length, taking the original training text as one sample datum and determining the relative position information of that sample datum based on the starting position of the original training text; and when the text length of the original training text is greater than the target sample length, sequentially dividing the original training text, from its starting position, into at least one text segment of length equal to the target sample length to obtain at least one sample datum, and determining the relative position information of each sample datum based on the starting position of the corresponding text segment.
- 6. The method of claim 5, wherein directly loading the target sample data from the storage device into the training memory based on the target physical position comprises: when the length of the target sample data is less than the target sample length, performing a padding operation on the target sample data based on the target sample length to obtain padded target sample data, wherein the padded target sample data comprises the valid data present before the padding operation and the padded invalid data; generating a mask sequence based on the positions of the valid data and the invalid data in the padded target sample data, wherein the mask sequence comprises valid bits corresponding to the valid data and invalid bits corresponding to the invalid data; and directly loading the padded target sample data and the mask sequence into the training memory.
- 7. The method according to any one of claims 1 to 6, wherein, before the acquiring of the target sample identifier corresponding to the sample data request from the sample identifier set of the plurality of sample data based on the sample data request of the current training round, the method further comprises: acquiring current model architecture information and current hardware topology information corresponding to a current training task; matching, from a current knowledge graph based on the current model architecture information and the current hardware topology information, a plurality of historical training tasks similar to the current training task and a plurality of historical training parameter strategies corresponding to those similar historical training tasks, wherein the current knowledge graph comprises the historical model architectures, historical hardware topologies, and historical training parameters of the historical training tasks, together with the correspondences among them; and determining a target training parameter strategy from the plurality of historical training parameter strategies, the target training parameter strategy being used to execute the current training task on the target large language model.
- 8. The method of claim 7, wherein determining a target training parameter strategy from the plurality of historical training parameter strategies comprises: inputting the current model architecture information, the current hardware topology information, and the plurality of historical training parameter strategies into a trained large language model; obtaining, with the trained large language model through a built-in target cost model, a target cost value for each historical training parameter strategy under the current model architecture information and the current hardware topology information, wherein the target cost model comprises at least one of a memory cost model, a communication cost model, and an efficiency cost model; and determining, with the trained large language model, the target training parameter strategy from the plurality of historical training parameter strategies based on the target cost values.
- 9. A training apparatus for a large language model, the apparatus comprising: an acquisition module, configured to acquire, based on a sample data request of the current training round, a target sample identifier corresponding to the sample data request from a sample identifier set of a plurality of sample data; a determination module, configured to determine a target physical position corresponding to the target sample identifier based on a preset correspondence, wherein the preset correspondence represents the correspondence between the sample identifiers of the plurality of sample data and sample physical positions, and each sample physical position is determined according to the physical position, in a storage device, of the original training text associated with the corresponding sample data; and a loading module, configured to directly load, based on the target physical position, the target sample data from the storage device into the training memory.
- 10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 8.
- 11. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 8.
- 12. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 8.
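The splitting rules of claims 4 and 5 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name, the handling of a trailing segment shorter than the target length, and the use of character offsets (rather than token offsets) are all assumptions.

```python
def split_text(text: str, sample_len: int):
    """Split one original training text into fixed-length samples,
    recording each sample's relative start offset within the text.

    A text no longer than sample_len becomes a single sample at
    offset 0 (claim 5, first branch); a longer text is cut into
    consecutive segments of sample_len starting from position 0
    (claim 5, second branch).
    """
    if len(text) <= sample_len:
        return [(text, 0)]
    # Each segment carries its starting offset as relative position info.
    return [(text[i:i + sample_len], i)
            for i in range(0, len(text), sample_len)]
```

The offsets returned here are the "relative position information" of claim 2: together with the physical position of the whole text in storage, they suffice to compute each sample's physical position without materializing per-sample files.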
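The padding-and-mask step of claim 6 can be sketched like this; the pad token id and the 1/0 mask encoding are conventional assumptions, not specified by the patent.

```python
def pad_with_mask(sample, target_len, pad_id=0):
    """Pad a short sample up to target_len and emit a mask sequence.

    Valid bits (1) mark the data present before padding; invalid
    bits (0) mark the padded positions, so downstream computation
    can ignore the filler.
    """
    n_valid = len(sample)
    padded = sample + [pad_id] * (target_len - n_valid)
    mask = [1] * n_valid + [0] * (target_len - n_valid)
    return padded, mask
```

Both the padded sample and its mask are what claim 6 loads directly into the training memory.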
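The strategy selection of claim 8 reduces, once cost values are available, to picking the minimum-cost candidate. The sketch below stands in for the built-in target cost model with a plain weighted sum over memory, communication, and efficiency costs; the tuple layout, the weights, and the function names are assumptions for illustration only.

```python
def total_cost(costs, weights=(1.0, 1.0, 1.0)):
    """Combine memory, communication, and efficiency cost values.

    A weighted sum is an assumed stand-in for the patent's target
    cost model, which may combine its sub-models differently.
    """
    mem, comm, eff = costs
    w_mem, w_comm, w_eff = weights
    return w_mem * mem + w_comm * comm + w_eff * eff

def pick_strategy(strategies):
    """Pick the historical training parameter strategy with the
    lowest combined target cost value.

    strategies: list of (name, (mem_cost, comm_cost, eff_cost)).
    """
    return min(strategies, key=lambda s: total_cost(s[1]))[0]
```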
Description
Training method, device, equipment, storage medium and product of large language model

Technical Field

The present application relates to the field of computer technologies, and in particular to a training method, apparatus, device, storage medium, and product for a large language model.

Background

With breakthroughs in artificial intelligence technology, large language models (Large Language Model, LLM) have become a core technology driving natural language processing and multi-modal intelligence. Large language models are usually trained on massive text data, and as model parameter scales continue to grow, the amount of data required for training grows with them, posing a serious challenge to training efficiency. In the related art, the reading and processing speed of training sample data often cannot keep up with the computing speed of high-performance computing devices, so data supply delays and excessive memory occupation easily occur, leaving computing resources idle for long periods and seriously reducing the overall training efficiency of the large language model.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a training method, apparatus, device, storage medium, and product for a large language model that can improve training efficiency.
In a first aspect, the present application provides a training method for a large language model, including: acquiring, based on a sample data request of the current training round, a target sample identifier corresponding to the sample data request from a sample identifier set of a plurality of sample data; determining a target physical position corresponding to the target sample identifier based on a preset correspondence, wherein the preset correspondence represents the correspondence between the sample identifiers of the plurality of sample data and sample physical positions, and each sample physical position is determined according to the physical position, in a storage device, of the original training text associated with the corresponding sample data; and directly loading, based on the target physical position, the target sample data from the storage device into a training memory for executing the current training round on the target large language model.
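The core of the first aspect, resolving a sample identifier to a physical position and reading the bytes directly from storage, can be sketched as below. The index structure, file layout, and function names are illustrative assumptions; the patent does not prescribe `mmap` specifically, it is used here as one common way to load data from a known byte offset without intermediate copies.

```python
import mmap

def load_sample(path, sample_id, index):
    """Resolve sample_id to its physical position via the preset
    correspondence (here a dict: id -> (byte offset, length)) and
    read the target sample data directly from the storage file.
    """
    offset, length = index[sample_id]
    with open(path, "rb") as f:
        # Map the packed training file and slice out exactly the
        # bytes for this sample; no per-sample files are needed.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[offset:offset + length]
```

In this sketch, `index` plays the role of the preset correspondence: it is built once, offline, from the relative positions of samples within each original training text and the text's own physical position in the storage device.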
In a second aspect, the present application further provides a training device for a large language model, including: an acquisition module, configured to acquire, based on a sample data request of the current training round, a target sample identifier corresponding to the sample data request from a sample identifier set of a plurality of sample data; a determination module, configured to determine a target physical position corresponding to the target sample identifier based on a preset correspondence, wherein the preset correspondence represents the correspondence between the sample identifiers of the plurality of sample data and sample physical positions, and each sample physical position is determined according to the physical position, in a storage device, of the original training text associated with the corresponding sample data; and a loading module, configured to directly load, based on the target physical position, the target sample data from the storage device into the training memory.

In a third aspect, the present application further provides a computer device, including a memory and a processor, where the memory stores a computer program, and the processor, when executing the computer program, implements the steps of the training method for a large language model provided in the first aspect of the embodiments of the present application.

In a fourth aspect, the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the training method for a large language model provided in the first aspect of the embodiments of the present application.
In a fifth aspect, the present application also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the training method for a large language model provided in the first aspect of the embodiments of the present application. According to the training method, device, equipment, storage medium, and product for a large language model, based on the sample data request of the current training round, the target sample identifier corresponding to the sample data request is obtained from the sample identifier set of a plurality of sample data, the target physical position