CN-122021734-A - Model training method, device, equipment, storage medium and product
Abstract
The application discloses a model training method, a device, equipment, a storage medium and a product in the technical field of artificial intelligence. The method comprises: respectively inputting model training data into a student model and a teacher model, and obtaining student model output data and teacher model output data, wherein the model output data comprises an output layer probability distribution and hidden layer output states; generating a joint loss value according to the student model output data and the teacher model output data; and adjusting model parameters of the student model based on the joint loss value. In the knowledge distillation process, the joint loss value is constructed from the output layer probability distributions and hidden layer output states of both the student model and the teacher model, and the model parameters of the student model are adjusted based on this joint loss value, so that the student model attends to the deep semantic information contained in the hidden states of the teacher model. This improves the student model's understanding of context semantics and enhances its ability to cope with complex tasks.
Inventors
- ZHENG HANZHONG
- HUANG SHAOMANG
- PAN JIANFENG
Assignees
- 三六零数字安全科技集团有限公司
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-01-30
Claims (10)
- 1. A model training method, the method comprising: respectively inputting model training data into a student model and a teacher model, and obtaining student model output data and teacher model output data, wherein the model output data comprises an output layer probability distribution and hidden layer output states; generating a joint loss value according to the student model output data and the teacher model output data; and adjusting model parameters of the student model based on the joint loss value.
- 2. The model training method of claim 1, further comprising, before the model training data is respectively input to the student model and the teacher model: acquiring the embedding dimension of the teacher model or the student model; constructing a soft prompt word based on the embedding dimension and a preset length; and splicing the soft prompt word with the original input data to construct the model training data.
- 3. The model training method of claim 2, wherein the generating a joint loss value according to the student model output data and the teacher model output data comprises: extracting a student output layer probability distribution and student hidden layer output states from the student model output data; extracting a teacher output layer probability distribution and teacher hidden layer output states from the teacher model output data; constructing a first loss value according to the student output layer probability distribution and the teacher output layer probability distribution; constructing a second loss value according to the student hidden layer output states and the teacher hidden layer output states; and constructing the joint loss value according to the first loss value and the second loss value.
- 4. The model training method of claim 3, wherein the constructing the joint loss value according to the first loss value and the second loss value comprises: constructing a third loss value according to the soft prompt word; and constructing the joint loss value according to the first loss value, the second loss value and the third loss value.
- 5. The model training method of claim 2, wherein the adjusting model parameters of the student model based on the joint loss value comprises: adjusting the soft prompt word and/or model parameters of the student model based on the joint loss value.
- 6. The model training method of claim 5, wherein the adjusting the soft prompt word and/or model parameters of the student model based on the joint loss value comprises: acquiring a parameter training type; and if the parameter training type is a first type, adjusting the soft prompt word and model parameters of the student model based on the joint loss value.
- 7. A model training apparatus, characterized in that the model training apparatus comprises: an acquisition module, configured to respectively input model training data into a student model and a teacher model and obtain student model output data and teacher model output data, wherein the model output data comprises an output layer probability distribution and hidden layer output states; a generation module, configured to generate a joint loss value according to the student model output data and the teacher model output data; and a training module, configured to adjust model parameters of the student model based on the joint loss value.
- 8. A model training device, characterized in that it comprises a memory, a processor, and a computer program stored on the memory and executable on the processor, the computer program being configured to implement the steps of the model training method according to any one of claims 1 to 6.
- 9. A storage medium, characterized in that the storage medium is a computer-readable storage medium on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the model training method according to any one of claims 1 to 6.
- 10. A computer program product, characterized in that the computer program product comprises a computer program which, when executed by a processor, implements the steps of the model training method according to any one of claims 1 to 6.
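The joint loss of claims 1 and 3 can be sketched numerically. This is a minimal illustration assuming a temperature-scaled KL divergence for the output-layer term (the "first loss value") and mean-squared error for the hidden-state term (the "second loss value"); the claims do not fix the specific loss functions or the weighting, so those choices and all names here are illustrative.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def joint_loss(student_logits, teacher_logits,
               student_hidden, teacher_hidden,
               temperature=2.0, alpha=0.5):
    """First loss: KL(teacher || student) between output-layer distributions.
    Second loss: MSE between hidden-layer output states.
    Joint loss: weighted sum of the two (alpha is an assumed hyperparameter)."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    first = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1).mean()
    second = np.mean((student_hidden - teacher_hidden) ** 2)
    return alpha * first + (1.0 - alpha) * second
```

In a real distillation loop the teacher's forward pass is frozen and only the student's parameters (and, per claim 5, optionally the soft prompt) receive gradients from this value.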
Description
Model training method, device, equipment, storage medium and product
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a model training method, a device, equipment, a storage medium and a product.
Background
Large models have achieved excellent performance in a variety of tasks. However, these large models tend to have huge parameter counts, slow inference speed and high resource consumption, and are difficult to deploy directly onto edge devices or into actual application scenarios. Therefore, how to compress the model scale while maintaining the model's performance on a given task has become one of the research hotspots. Knowledge distillation is a classical model compression technique whose core idea is to use a high-performance "teacher model" to guide the training process of a lightweight "student model". However, current knowledge distillation methods usually focus only on the output layer probability distribution of the teacher model as the supervision signal for student model learning and ignore the deep semantic information of the teacher model, so that the student model struggles to comprehensively understand the context structure of the input text and its performance on complex tasks is limited. This limitation directly restricts the depth and breadth of knowledge migration, so that the student model cannot fully inherit the capability of the teacher model.
Disclosure of Invention
The application mainly aims to provide a model training method, device, equipment, storage medium and product, so as to solve the technical problem that a student model in the related art struggles to fully inherit the capability of a teacher model.
To achieve the above object, the present application provides a model training method, which comprises: respectively inputting model training data into a student model and a teacher model, and obtaining student model output data and teacher model output data, wherein the model output data comprises an output layer probability distribution and hidden layer output states; generating a joint loss value according to the student model output data and the teacher model output data; and adjusting model parameters of the student model based on the joint loss value. Optionally, before the model training data is respectively input to the student model and the teacher model, the method further comprises: acquiring the embedding dimension of the teacher model or the student model; constructing a soft prompt word based on the embedding dimension and a preset length; and splicing the soft prompt word with the original input data to construct the model training data. Optionally, the generating a joint loss value according to the student model output data and the teacher model output data comprises: extracting a student output layer probability distribution and student hidden layer output states from the student model output data; extracting a teacher output layer probability distribution and teacher hidden layer output states from the teacher model output data; constructing a first loss value according to the student output layer probability distribution and the teacher output layer probability distribution; constructing a second loss value according to the student hidden layer output states and the teacher hidden layer output states; and constructing the joint loss value according to the first loss value and the second loss value.
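The soft prompt construction and splicing steps above can be sketched as follows. This is a minimal illustration assuming the soft prompt is a matrix of learnable vectors, initialized from a small Gaussian and prepended to the input token embeddings; the initialization scheme, the prepend position, and all function names are assumptions, not specified by the source.

```python
import numpy as np

def build_soft_prompt(embed_dim, prompt_len, seed=None):
    """Construct a soft prompt: `prompt_len` learnable vectors sharing the
    embedding dimension of the teacher/student model (per claim 2)."""
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, 0.02, size=(prompt_len, embed_dim))

def splice(soft_prompt, input_embeddings):
    """Splice the soft prompt with the original input embeddings
    (here: prepend along the sequence axis) to form model training data."""
    return np.concatenate([soft_prompt, input_embeddings], axis=0)
```

The spliced sequence is then fed to both models; during training, gradients can flow into the soft prompt itself, which is what makes claim 5's "adjusting the soft prompt word" possible.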
Optionally, the constructing the joint loss value according to the first loss value and the second loss value comprises: constructing a third loss value according to the soft prompt word; and constructing the joint loss value according to the first loss value, the second loss value and the third loss value. Optionally, the adjusting model parameters of the student model based on the joint loss value comprises: adjusting the soft prompt word and/or model parameters of the student model based on the joint loss value. Optionally, the adjusting the soft prompt word and/or model parameters of the student model based on the joint loss value comprises: acquiring a parameter training type; and if the parameter training type is a first type, adjusting the soft prompt word and model parameters of the student model based on the joint loss value. Optionally, after the parameter training type is acquired, the method further comprises: if the parameter training type is a second type, obtaining a model training stage; if the model training stage is a prompt word training stage, adjusting the soft prompt word based on the joint loss value; and if the model training stage is a model optimizing stage, adjusting model parameters of the student model based on the joint loss value. Optionally, if the parameter training type is the second type, the obtaining a model training stage comprises: if the parameter training type is the second type, acquiring the current training round
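The parameter-training-type dispatch described above can be sketched as a selection of which parameter groups the joint loss should update. The type and stage labels (`"first"`, `"second"`, `"prompt_training"`, `"model_optimizing"`) are illustrative names for the first/second types and the two stages named in the text, not identifiers from the source.

```python
def trainable_parameters(training_type, stage, soft_prompt, student_params):
    """Select the parameters the joint loss updates.
    First type: soft prompt and student model are tuned jointly.
    Second type: the prompt word training stage tunes only the soft prompt;
    the model optimizing stage tunes only the student model."""
    if training_type == "first":
        return [soft_prompt] + student_params
    if stage == "prompt_training":
        return [soft_prompt]
    if stage == "model_optimizing":
        return student_params
    raise ValueError(f"unknown model training stage: {stage!r}")
```

In an actual training loop the returned group would be handed to the optimizer while all other parameters stay frozen; which stage is active could be decided from the current training round, as the text begins to describe.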