CN-121998030-A - Data processing method, device, equipment, storage medium and product
Abstract
The application provides a data processing method, apparatus, device, storage medium, and product. The method includes: obtaining the reasoning path along which a reference model generates a reference answer based on a training question, and the reasoning path along which a model to be trained generates a reasoning answer based on the same training question; and determining, based on the attention information calculated by the reference model and the model to be trained while executing their respective reasoning paths, the attention intensity corresponding to each reasoning path. The model to be trained is then trained based on the difference between the reference answer and the reasoning answer and on the difference between the attention intensities of the two models' reasoning paths for the training question, to obtain a target model. According to this technical scheme, the attention intensity, which represents the degree of influence of each reasoning step on the answer, is aligned, so that the model to be trained learns the reasoning steps that play a key role in the reasoning process; the operation and processing of redundant reasoning steps are reduced, the reasoning efficiency of the model is improved, and the computational cost is reduced.
Inventors
- WANG XINGHUA
Assignees
- Tencent Technology (Shenzhen) Co., Ltd. (腾讯科技(深圳)有限公司)
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2026-01-26
Claims (16)
- 1. A data processing method, comprising: acquiring a first reasoning path along which a reference model generates a reference answer based on a training question, and acquiring a second reasoning path along which a model to be trained generates a reasoning answer based on the training question; determining a first attention intensity corresponding to the first reasoning path based on first attention information calculated by the reference model in the process of executing the first reasoning path, wherein the first attention intensity represents the degree of influence of each reasoning step in the first reasoning path on the reference answer; determining a second attention intensity corresponding to the second reasoning path based on second attention information calculated by the model to be trained in the process of executing the second reasoning path, wherein the second attention intensity represents the degree of influence of each reasoning step in the second reasoning path on the reasoning answer; and training the model to be trained based on the difference between the reference answer and the reasoning answer and the difference between the first attention intensity and the second attention intensity, to obtain a target model.
- 2. The method of claim 1, wherein acquiring the first reasoning path along which the reference model generates the reference answer based on the training question and acquiring the second reasoning path along which the model to be trained generates the reasoning answer based on the training question comprises: inputting the training question into the reference model, and performing a reasoning operation based on the training question through the reference model to obtain the first reasoning path, comprising at least two reasoning steps, and the reference answer; and inputting the training question into the model to be trained, and performing a reasoning operation based on the training question through the model to be trained to obtain the second reasoning path, comprising at least two reasoning steps, and the reasoning answer.
- 3. The method of claim 1, wherein the first attention information comprises first attention matrices respectively calculated by at least two attention modules in the reference model, and each first attention matrix comprises attention weights between any two reasoning steps in the first reasoning path; and determining the first attention intensity corresponding to the first reasoning path based on the first attention information calculated by the reference model in the process of executing the first reasoning path comprises: extracting, from each of the at least two first attention matrices, the partial attention weights between each reasoning step and its subsequent reasoning steps, and determining the sum of the extracted partial attention weights as the first attention intensity.
- 4. The method of claim 1, wherein the first attention information comprises first attention matrices respectively calculated by at least two attention modules in the reference model, each first attention matrix comprising attention weights between any two reasoning steps in the first reasoning path; and determining the first attention intensity corresponding to the first reasoning path based on the first attention information calculated by the reference model in the process of executing the first reasoning path comprises: selecting a target attention module from the at least two attention modules, wherein the first attention matrix calculated by the target attention module contains an attention weight greater than a set weight threshold; and extracting, from the first attention matrix calculated by the target attention module, the partial attention weights between each reasoning step and its subsequent reasoning steps, and determining the sum of the extracted partial attention weights as the first attention intensity.
- 5. The method of claim 4, wherein selecting the target attention module from the at least two attention modules comprises: determining a module kurtosis value corresponding to each attention module, based on a set mapping relation between the attention weights in an attention matrix and the module kurtosis value of the corresponding attention module, and on the attention weights in the first attention matrices; and extracting, from the at least two attention modules, the attention modules whose module kurtosis values are greater than a specified kurtosis threshold, to obtain the target attention module.
- 6. The method of claim 5, further comprising: calculating an average kurtosis value and a kurtosis standard deviation based on the module kurtosis values respectively corresponding to the at least two attention modules; and determining, based on an association relation among the average kurtosis value, the kurtosis standard deviation, and the kurtosis threshold, the kurtosis threshold associated with the average kurtosis value and the kurtosis standard deviation, to obtain the specified kurtosis threshold.
- 7. The method of claim 1, wherein the second attention information comprises second attention matrices respectively calculated by at least two attention modules in the model to be trained, and each second attention matrix comprises attention weights between any two reasoning steps in the second reasoning path; and determining the second attention intensity corresponding to the second reasoning path based on the second attention information calculated by the model to be trained in the process of executing the second reasoning path comprises: extracting, from each of the at least two second attention matrices, the partial attention weights between each reasoning step and its subsequent reasoning steps, and determining the sum of the extracted partial attention weights as the second attention intensity.
- 8. The method of claim 1, wherein training the model to be trained based on the difference between the reference answer and the reasoning answer and the difference between the first attention intensity and the second attention intensity to obtain the target model comprises: generating first loss data based on the difference between the reference answer and the reasoning answer, and generating second loss data based on the difference between the first attention intensity and the second attention intensity; and training the model to be trained based on the sum of the first loss data and the second loss data to obtain the target model.
- 9. The method of claim 8, wherein the first attention information comprises first attention matrices respectively calculated by at least two attention modules in the reference model, and the second attention information comprises second attention matrices respectively calculated by at least two attention modules in the model to be trained; and training the model to be trained based on the sum of the first loss data and the second loss data to obtain the target model comprises: generating third loss data based on the difference between the at least two first attention matrices and the at least two second attention matrices, and training the model to be trained based on the sum of the first loss data, the second loss data, and the third loss data, to obtain the target model.
- 10. The method of claim 9, wherein generating the third loss data based on the differences between the at least two first attention matrices and the at least two second attention matrices comprises: calculating, based on the attention weights at each element position in the at least two first attention matrices, a weight average value corresponding to each element position, to obtain a first attention average matrix; calculating, based on the attention weights at each element position in the at least two second attention matrices, a weight average value corresponding to each element position, to obtain a second attention average matrix; and generating the third loss data based on the differences between the weight average values corresponding to the respective element positions in the first attention average matrix and the second attention average matrix.
- 11. The method of claim 8, wherein training the model to be trained based on the sum of the first loss data and the second loss data to obtain the target model comprises: under the constraint of removing one reasoning step from the first reasoning path at a time, invoking the reference model to execute a set number of reasoning operations based on the training question, to obtain a set number of operation answers; selecting a key reasoning step from the first reasoning path based on the similarity between the set number of operation answers and the reference answer, and generating fourth loss data based on the difference between the key reasoning step and the reasoning steps in the second reasoning path; and training the model to be trained based on the sum of the first loss data, the second loss data, and the fourth loss data, to obtain the target model.
- 12. The method of claim 11, wherein selecting the key reasoning step from the first reasoning path based on the similarity between the set number of operation answers and the reference answer comprises: determining an operation probability distribution corresponding to the set number of operation answers, based on the proportion, among the set number of operation answers, of the number of identical answers to the set number; and determining, based on the similarity between the operation probability distribution and a reference probability distribution corresponding to the reference answer, the reasoning step whose removal from the first reasoning path yields a similarity greater than a set similarity threshold as the key reasoning step.
- 13. A data processing apparatus, comprising: an acquisition unit configured to acquire a first reasoning path along which a reference model generates a reference answer based on a training question, and to acquire a second reasoning path along which a model to be trained generates a reasoning answer based on the training question; a determining unit configured to determine a first attention intensity corresponding to the first reasoning path based on first attention information calculated by the reference model in the process of executing the first reasoning path, wherein the first attention intensity represents the degree of influence of each reasoning step in the first reasoning path on the reference answer; the determining unit being further configured to determine a second attention intensity corresponding to the second reasoning path based on second attention information calculated by the model to be trained in the process of executing the second reasoning path, wherein the second attention intensity represents the degree of influence of each reasoning step in the second reasoning path on the reasoning answer; and a training unit configured to train the model to be trained based on the difference between the reference answer and the reasoning answer and the difference between the first attention intensity and the second attention intensity, to obtain a target model.
- 14. An electronic device, comprising: One or more processors; A memory for storing one or more computer programs that, when executed by the one or more processors, cause the electronic device to implement the data processing method of any of claims 1-12.
- 15. A computer readable medium on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the data processing method of any of claims 1-12.
- 16. A computer program product, characterized in that the computer program product comprises a computer program, which is stored in a computer-readable storage medium, from which computer-readable storage medium a processor of an electronic device reads and executes the computer program, causing the electronic device to perform the data processing method of any one of claims 1-12.
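To make claims 3-6 concrete, the per-step attention-intensity computation and the kurtosis-based selection of target attention modules can be sketched as below. This is an illustrative reading, not the patent's implementation: all function names are hypothetical, the attention matrices are assumed to have shape (T, T) with row j holding step j's attention weights over earlier steps, and the "association relation" of claim 6 is assumed to be the additive threshold mean + standard deviation.

```python
import numpy as np

def module_kurtosis(attn):
    """Excess kurtosis of one module's attention weights (claim 5).
    A high value indicates a few sharply-attended positions."""
    w = attn.flatten()
    mu, sigma = w.mean(), w.std()
    if sigma == 0:
        return 0.0
    return float(((w - mu) ** 4).mean() / sigma ** 4 - 3.0)

def select_target_modules(attn_mats):
    """Keep modules whose kurtosis exceeds mean + std of all module
    kurtosis values (claim 6; the additive form is an assumption)."""
    k = np.array([module_kurtosis(a) for a in attn_mats])
    threshold = k.mean() + k.std()
    return [a for a, kv in zip(attn_mats, k) if kv > threshold]

def attention_intensity(attn_mats, num_steps):
    """Per-step intensity: sum over selected modules of the attention
    each subsequent step pays to a given step (claims 3-4). Falls back
    to all modules if none passes the kurtosis filter."""
    selected = select_target_modules(attn_mats) or attn_mats
    intensity = np.zeros(num_steps)
    for attn in selected:
        for i in range(num_steps):              # earlier step i
            for j in range(i + 1, num_steps):   # subsequent step j
                intensity[i] += attn[j, i]
    return intensity
```

Note that under this reading the final reasoning step always receives intensity 0, since no subsequent step attends to it; the answer tokens themselves would normally be appended as the last "step" so that every genuine reasoning step has successors.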
Description
Data processing method, device, equipment, storage medium and product
Technical Field
The present application relates to the field of computer technologies, and in particular to a data processing method, apparatus, device, storage medium, and product.
Background
With the rapid development of artificial intelligence technology, large language models (LLMs) are widely applied in scenarios such as text understanding, knowledge question answering, and complex logic analysis by virtue of their strong reasoning capability. An LLM can generally perform multi-stage reasoning on input information and generate relatively accurate reasoning results. In practical applications, however, the LLM reasoning process often depends on the operation of a large number of model parameters and many consecutive reasoning steps, so the whole reasoning process has a long response time and low reasoning efficiency. Against this background, knowledge distillation techniques have been developed, which migrate the complex knowledge contained in a large model into a small model, so that reasoning capability close to that of the large model is obtained at a smaller parameter scale, thereby reducing the computational complexity and deployment cost of the model. In the related art, the output of the large model is generally taken as the learning target, and the output of the small model is guided to be as close as possible to it. Although this can compress the model scale to a certain extent, the small model merely imitates the large model's outputs and still requires a relatively complex reasoning process at inference time, so its reasoning efficiency remains low. Therefore, how to train a small model based on knowledge distillation so as to improve its reasoning efficiency on multi-step reasoning tasks has become a technical problem to be solved.
Disclosure of Invention
The embodiments of the application provide a data processing method, apparatus, device, storage medium, and product. By aligning the attention intensities with which the two models characterize the degree of influence of the reasoning steps, the model to be trained learns the reasoning steps that have a decisive influence on the answer during reasoning, and the operation and processing of redundant reasoning steps are reduced, which is beneficial to improving the reasoning efficiency of the model, increasing the reasoning speed of the target model in question-answer scenarios, and reducing unnecessary computational overhead.
In a first aspect, an embodiment of the present application provides a data processing method, including: acquiring a first reasoning path along which a reference model generates a reference answer based on a training question, and acquiring a second reasoning path along which a model to be trained generates a reasoning answer based on the training question; determining a first attention intensity corresponding to the first reasoning path based on first attention information calculated by the reference model in the process of executing the first reasoning path, wherein the first attention intensity represents the degree of influence of each reasoning step in the first reasoning path on the reference answer; determining a second attention intensity corresponding to the second reasoning path based on second attention information calculated by the model to be trained in the process of executing the second reasoning path, wherein the second attention intensity represents the degree of influence of each reasoning step in the second reasoning path on the reasoning answer; and training the model to be trained based on the difference between the reference answer and the reasoning answer and the difference between the first attention intensity and the second attention intensity, to obtain a target model.
In a second aspect, an embodiment of the present application provides a data processing apparatus, including: an acquisition unit configured to acquire a first reasoning path along which a reference model generates a reference answer based on a training question, and to acquire a second reasoning path along which a model to be trained generates a reasoning answer based on the training question; a determining unit configured to determine a first attention intensity corresponding to the first reasoning path based on first attention information calculated by the reference model in the process of executing the first reasoning path, wherein the first attention intensity represents the degree of influence of each reasoning step in the first reasoning path on the reference answer; the determining unit being further configured to determine a second attention intensity corresponding to the second reasoning path based on second attention information calculated by the model to be trained in the process of executing the second reasoning path, wherein the second attention intensity is used to represent an influence degree of each inference step in th