CN-121980545-A - Training performance analysis method, device, equipment and medium for deep learning model
Abstract
Embodiments of the application provide a training performance analysis method and apparatus for a deep learning model, an electronic device, and a computer-readable storage medium. The method comprises: acquiring, while the deep learning model is being trained, first operation data of operators in the deep learning model in different dimensions, the deep learning model comprising at least two operators; associating the first operation data of the operators in the different dimensions through a preset association algorithm to obtain joint operation data of the operators; and performing training performance analysis on the deep learning model according to the joint operation data to obtain a training performance analysis result of the deep learning model. The method establishes association relations between the operators' operation data in different dimensions and associates the operator operation data with the pipeline structure. This provides a data basis for model performance analysis and for locating abnormal conditions, and improves both the efficiency of training performance analysis and the accuracy of its results.
Inventors
- Request for anonymity
- Request for anonymity
- Request for anonymity
- Request for anonymity
- Request for anonymity
Assignees
- 摩尔线程智能科技(北京)股份有限公司
Dates
- Publication Date: 20260505
- Application Date: 20251210
Claims (13)
- 1. A training performance analysis method for a deep learning model, the method comprising: acquiring, while the deep learning model is being trained, first operation data of operators in the deep learning model in different dimensions, wherein the deep learning model comprises at least two operators; associating the first operation data of the operators in the different dimensions through a preset association algorithm to obtain joint operation data of the operators; and performing training performance analysis on the deep learning model according to the joint operation data to obtain a training performance analysis result of the deep learning model.
- 2. The method according to claim 1, wherein the deep learning model comprises at least two stages, each stage consists of a plurality of consecutive model layers, and each stage comprises at least two operators, and wherein associating the first operation data of the operators in the different dimensions through the preset association algorithm to obtain the joint operation data of the operators comprises: determining, through the preset association algorithm, mapping information of the first operation data of the operators in the different dimensions, wherein the mapping information represents the correspondence between the first operation data in the different dimensions; and associating, according to the mapping information, the first operation data of the operators in the different dimensions to obtain the joint operation data of the operators.
- 3. The method of claim 2, wherein the at least two stages are processed by distributed hardware devices, training data of the deep learning model is divided into a plurality of different micro-batches, and the dimensions include a time dimension and a space dimension, and wherein determining, through the preset association algorithm, the mapping information of the first operation data of the operators in the different dimensions comprises: acquiring timestamps of the operators and of the micro-batch data, the stages corresponding to the operators, information on the hardware devices that process the stages, and a communication topology, wherein the communication topology represents structural information of the communication links between the stages; determining, according to the timestamps of the operators and the timestamps of the micro-batch data, first mapping information between operators and micro-batch data having the same timestamp; determining, according to the stages corresponding to the operators, the information on the hardware devices that process the stages, and the communication topology, second mapping information among the operators, the stages, the distributed hardware devices that process the stages, and the communication links between the stages; and determining the mapping information of the first operation data of the operators in the different dimensions according to the first mapping information and the second mapping information.
- 4. The method according to claim 1, wherein performing training performance analysis on the deep learning model according to the joint operation data to obtain the training performance analysis result of the deep learning model comprises: calculating, according to the joint operation data of the operators, index values of training performance indexes preset for the deep learning model, wherein the training performance indexes comprise one or more of bubble rate, utilization rate, effective calculation time, device idle time, scheduling waiting time, communication time, and communication masking rate; and adding the index values of the training performance indexes of the deep learning model to the training performance analysis result of the deep learning model.
- 5. The method according to claim 4, wherein calculating, according to the joint operation data of the operators, the index values of the training performance indexes preset for the deep learning model comprises: determining, in the joint operation data of the operators, the parameters required for calculating the index values of the training performance indexes and the parameter values of those parameters; and calculating the index values of the training performance indexes according to the parameters, the parameter values, and a preset index calculation formula.
- 6. The method according to claim 4, wherein the method further comprises: determining the operation efficiency of the deep learning model according to the index values of the training performance indexes and a preset operation efficiency calculation formula; and adding the operation efficiency of the deep learning model to the training performance analysis result of the deep learning model.
- 7. The method of claim 1, wherein the deep learning model comprises at least two stages, each stage comprises at least two operators, and the at least two stages are processed by distributed hardware devices, and wherein performing training performance analysis on the deep learning model according to the joint operation data to obtain the training performance analysis result of the deep learning model comprises: obtaining simulated operation data of the operators on the distributed hardware devices; comparing the simulated operation data of the operators with the joint operation data to obtain a comparison result; and adding the comparison result to the training performance analysis result of the deep learning model.
- 8. The method of claim 7, wherein obtaining the simulated operation data of the operators on the distributed hardware devices comprises: obtaining configuration information of the deep learning model, wherein the configuration information comprises structural information of the deep learning model and device information of a single hardware device for running the deep learning model; determining, based on the configuration information of the deep learning model, second operation data of the operators of the deep learning model when the operators run on the single hardware device; and determining the simulated operation data of the operators on the distributed hardware devices according to the second operation data.
- 9. The method according to claim 4, wherein the method further comprises: determining, according to the index values of the training performance indexes of the deep learning model and a preset range, a first index whose index value exceeds the preset range; performing attribution analysis on the index value of the first index according to the joint operation data of the operators to obtain an attribution result corresponding to the index value of the first index; and determining an optimization strategy for the first index according to the attribution result and preset optimization configuration information, so as to optimize the training strategy of the deep learning model.
- 10. The method according to claim 9, wherein the method further comprises: generating performance alarm information according to the first index; determining, according to the performance alarm information, device information corresponding to the first index on the distributed hardware device; and marking the region corresponding to the first index on a visual interface corresponding to the training performance indexes of the deep learning model.
- 11. A training performance analysis apparatus for a deep learning model, the apparatus comprising: an acquisition module configured to acquire, while the deep learning model is being trained, first operation data of operators in the deep learning model in different dimensions; an association module configured to associate the first operation data of the operators in the different dimensions through a preset association algorithm to obtain joint operation data of the operators; and an analysis module configured to perform training performance analysis on the deep learning model according to the joint operation data to obtain a training performance analysis result of the deep learning model.
- 12. An electronic device, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the method of any one of claims 1 to 10.
- 13. A computer readable storage medium, characterized in that instructions in the computer readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of any one of claims 1 to 10.
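The two mappings of claim 3 can be illustrated with a minimal Python sketch. It is not the claimed association algorithm, whose details the claims leave open. The record types `OperatorEvent` and `MicroBatchEvent`, the function `associate`, the shape of the joint record, and the choice to match timestamps per stage are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatorEvent:
    """One operator execution on the unified time axis (hypothetical schema)."""
    name: str
    stage: int
    timestamp: int


@dataclass(frozen=True)
class MicroBatchEvent:
    """One micro-batch being processed by a stage (hypothetical schema)."""
    micro_batch: int
    stage: int
    timestamp: int


def associate(operator_events, micro_batch_events, stage_to_device, stage_links):
    """Join time-dimension and space-dimension data into joint records.

    stage_to_device maps a stage index to a device name; stage_links is a
    list of (source_stage, destination_stage) pairs describing the topology.
    """
    # First mapping (time dimension): operator <-> micro-batch with the same
    # timestamp. Matching per stage is an assumption of this sketch, so that
    # stages running in parallel do not collide.
    micro_batch_at = {
        (m.stage, m.timestamp): m.micro_batch for m in micro_batch_events
    }

    joint = []
    for op in operator_events:
        # Second mapping (space dimension): operator -> stage -> device -> links.
        links = [link for link in stage_links if op.stage in link]
        joint.append({
            "operator": op.name,
            "timestamp": op.timestamp,
            "micro_batch": micro_batch_at.get((op.stage, op.timestamp)),
            "stage": op.stage,
            "device": stage_to_device[op.stage],
            "links": links,
        })
    return joint
```

Each joint record then carries both when an operator ran (timestamp, micro-batch) and where it ran (stage, device, communication links), which is the kind of combined view the later analysis claims rely on. An operator with no matching micro-batch gets `None` in this sketch rather than being dropped.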
Description
Training performance analysis method, device, equipment and medium for deep learning model
Technical Field
The present application relates to the field of artificial intelligence, and in particular to a training performance analysis method and apparatus for a deep learning model, an electronic device, and a computer-readable storage medium.
Background
As the parameter scale of large models grows rapidly, the computing resources, storage capacity, and training time required for training grow with it, and hybrid parallel training has become a key technology for training large models. In an actual training scenario, the performance of a large model during training needs to be analyzed. In the related art, performance analysis data is usually collected manually and analyzed offline. This offline analysis takes a long time, so model performance analysis is inefficient and the training performance of large models is not quantified precisely enough.
Disclosure of Invention
Embodiments of the application provide a training performance analysis method and apparatus for a deep learning model, an electronic device, and a computer-readable storage medium, which can improve the efficiency of training performance analysis for large models.
In a first aspect, the present application provides a training performance analysis method for a deep learning model, the method comprising: acquiring, while the deep learning model is being trained, first operation data of operators in the deep learning model in different dimensions; associating the first operation data of the operators in the different dimensions through a preset association algorithm to obtain joint operation data of the operators; and performing training performance analysis on the deep learning model according to the joint operation data to obtain a training performance analysis result of the deep learning model.
In a second aspect, the application provides a training performance analysis apparatus for a deep learning model, comprising an acquisition module, an association module, and an analysis module. The acquisition module is configured to acquire, while the deep learning model is being trained, first operation data of operators in the deep learning model in different dimensions; the association module is configured to associate the first operation data of the operators in the different dimensions through a preset association algorithm to obtain joint operation data of the operators; and the analysis module is configured to perform training performance analysis on the deep learning model according to the joint operation data to obtain a training performance analysis result of the deep learning model.
In a third aspect, the present application provides an electronic device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors to implement the training performance analysis method for a deep learning model described above.
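The analysis step of the first aspect works on the joint operation data. As a minimal Python sketch, the functions below derive a few per-device indexes from joint records and flag values outside a preset range. The source does not give the index calculation formulas, so every formula here is an assumption: the idle fraction of the window stands in for the pipeline bubble (apparently the first index listed in claim 4), and the communication masking rate is taken as the share of communication time that overlaps computation. The record keys (`kind`, `start`, `end`) and all function names are likewise hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals and return them sorted."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(intervals):
    """Total time covered by a set of possibly overlapping intervals."""
    return sum(end - start for start, end in merge_intervals(intervals))


def overlap_length(a, b):
    """Total time covered by both interval sets at once."""
    a, b = merge_intervals(a), merge_intervals(b)
    total, i, j = 0, 0, 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def device_metrics(records, window_start, window_end):
    """Index values for one device; records must lie inside the window."""
    compute = [(r["start"], r["end"]) for r in records if r["kind"] == "compute"]
    comm = [(r["start"], r["end"]) for r in records if r["kind"] == "comm"]
    window = window_end - window_start
    compute_time = total_length(compute)
    comm_time = total_length(comm)
    idle_time = window - total_length(compute + comm)
    masked = overlap_length(compute, comm)
    return {
        "effective_compute_time": compute_time,
        "communication_time": comm_time,
        "idle_time": idle_time,
        "bubble_rate": idle_time / window,          # assumed formula
        "utilization": compute_time / window,       # assumed formula
        "communication_masking_rate":               # assumed formula
            masked / comm_time if comm_time else 0.0,
    }


def out_of_range(metrics, limits):
    """Return the indexes whose value lies outside its preset (low, high) range."""
    return {
        name: value
        for name, value in metrics.items()
        if name in limits and not (limits[name][0] <= value <= limits[name][1])
    }
```

For a device that computes during 0 to 40 and 70 to 100 and communicates during 30 to 50 within a window of 0 to 100, this sketch reports 20 units of idle time and half of the communication time masked by computation. Calling `out_of_range` with a preset limit on `bubble_rate` would single out that index for the kind of attribution analysis described in claim 9.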
In a fourth aspect, the present application provides a readable storage medium storing instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the training performance analysis method for a deep learning model described above.
In a fifth aspect, the present application provides a computer program product comprising a computer program that is loaded and executed by a processor to implement the training performance analysis method for a deep learning model described above.
The embodiments of the application have the following advantages:
First, the first operation data of the operators in the deep learning model in different dimensions is acquired while the deep learning model is being trained. Operator-level performance data is thus obtained, which makes the performance analysis data finer-grained and improves the accuracy of the subsequent performance analysis result. Because the data is collected during training, it reflects the real training state of the model in real time, so performance problems that arise dynamically are identified promptly and analysis deviation caused by data lag is avoided. Second, the first operation data of the operators in the different dimensions is associated through a preset association algorithm to obtain joint operation data of the operators, and finally training performance analysis is performed on the deep learning model according to the joint operation data to obtain the training performance analysis result of the deep learning model. The first operation data after unifying the time axis can be associated thro