
CN-121981172-A - Model parameter optimization method and device, electronic equipment and medium

CN 121981172 A

Abstract

The application provides a model parameter optimization method and apparatus, an electronic device, and a medium, relating to the technical field of deep learning and applied to a working node in a deep learning model. Based on the weight value of each of at least one model weight, the gradient transmission state of each model weight is analyzed so as to generate a structured mask matrix according to the analysis result; the gradient matrix is then sparsified based on the structured mask matrix to obtain a sparse gradient matrix, which is used to update the at least one model weight of the deep learning model. Because the gradients are sparsified with a globally uniform structured mask generated from the model weights, the sparse gradient positions of all working nodes are consistent, the sparse gradient accumulation problem is avoided, and communication traffic can be stably reduced.
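The core idea of the abstract can be sketched in a few lines. This is an illustrative sketch only, not the patent's exact procedure: the "preset transmission condition" is assumed here to be a row-wise top-k on weight magnitude, and the function names are invented for the example.

```python
import numpy as np

def structured_mask(weights: np.ndarray, keep_ratio: float = 0.25) -> np.ndarray:
    """Build a 0/1 mask from weight magnitudes: in each row, the largest-magnitude
    weights (a keep_ratio fraction) get mask 1 (gradient to be transmitted),
    the rest get 0 (gradient not transmitted)."""
    k = max(1, int(weights.shape[1] * keep_ratio))
    mask = np.zeros_like(weights, dtype=np.int8)
    # indices of the k largest-magnitude weights in each row
    top = np.argsort(-np.abs(weights), axis=1)[:, :k]
    np.put_along_axis(mask, top, 1, axis=1)
    return mask

def sparsify(grad: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Zero out gradient elements whose mask is 0; the mask has the same
    matrix dimension as the gradient matrix."""
    return grad * mask
```

Since the mask depends only on the model weights, which are identical across workers in data-parallel training, every worker produces the same mask and hence the same sparse positions.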

Inventors

  • Liu Duansheng

Assignees

  • 江苏清微智能科技有限公司 (Jiangsu Tsingmicro Intelligent Technology Co., Ltd.)

Dates

Publication Date
2026-05-05
Application Date
2026-04-03

Claims (10)

  1. A method of model parameter optimization, applied to a working node in a deep learning model, the method comprising: acquiring at least one model weight of the deep learning model and a gradient matrix corresponding to the at least one model weight; analyzing the gradient transmission state of each model weight in the at least one model weight based on the weight value of each model weight, to generate a structured mask matrix according to an analysis result, wherein the matrix dimension of the structured mask matrix is the same as the matrix dimension of the gradient matrix; and sparsifying the gradient matrix based on the structured mask matrix to obtain a sparse gradient matrix, so as to update the at least one model weight of the deep learning model by using the sparse gradient matrix.
  2. The method of claim 1, wherein the gradient transmission states include a to-be-transmitted gradient state and a non-transmitted gradient state, and wherein analyzing the gradient transmission state of each model weight in the at least one model weight based on the weight value of each model weight to generate a structured mask matrix according to the analysis result comprises: determining, from the at least one model weight and based on the weight value of each model weight, a first model weight which meets a preset transmission condition, and determining the gradient transmission state of the first model weight as the to-be-transmitted gradient state; determining the gradient transmission state of a second model weight, other than the first model weight, in the at least one model weight as the non-transmitted gradient state; and determining a first mask corresponding to the first model weight and a second mask corresponding to the second model weight based on a mapping rule between gradient transmission states and masks, and generating the structured mask matrix based on the first mask and the second mask.
  3. The method of claim 2, wherein determining, from the at least one model weight and based on the weight value of each model weight, a first model weight which meets the preset transmission condition comprises: dividing the at least one model weight based on the preset transmission condition to obtain at least one model weight group; and performing numerical analysis on each model weight in a first model weight group based on the weight value of each model weight in the first model weight group, and determining a target number of first model weights, wherein the first model weight group is any model weight group of the at least one model weight group.
  4. The method of claim 1, wherein sparsifying the gradient matrix based on the structured mask matrix to obtain a sparse gradient matrix comprises: when a target mask in the structured mask matrix is a first mask, determining, based on the target mask element position of the target mask in the structured mask matrix, the gradient element at the gradient element position corresponding to the target mask element position in the gradient matrix as a gradient element to be transmitted, wherein the target mask is any mask in the structured mask matrix; when the target mask in the structured mask matrix is a second mask, adjusting, based on the target mask element position of the target mask in the structured mask matrix, the gradient element at the gradient element position corresponding to the target mask element position in the gradient matrix to a first value; and generating the sparse gradient matrix based on the gradient elements to be transmitted and the first value.
  5. The method of claim 1, wherein, after sparsifying the gradient matrix based on the structured mask matrix to obtain the sparse gradient matrix, the method further comprises: determining a residual gradient matrix based on the structured mask matrix and the gradient matrix, wherein the residual gradient matrix consists of the gradient elements in the gradient matrix corresponding to the model weights whose gradient transmission state is the non-transmitted gradient state; and accumulating the residual gradient matrix and a historical residual gradient matrix to obtain an updated residual gradient matrix, so as to update the gradient matrix according to the updated residual gradient matrix, wherein the historical residual gradient matrix is obtained based on a historical structured mask matrix and a historical gradient matrix.
  6. The method of claim 5, wherein updating the at least one model weight of the deep learning model with the sparse gradient matrix comprises: acquiring at least one target sparse gradient matrix of at least one related working node in the deep learning model, wherein the at least one related working node and the working node jointly participate in distributed training of the deep learning model; performing aggregation processing on the sparse gradient matrix and the at least one target sparse gradient matrix to obtain a model global sparse gradient matrix; and updating the at least one model weight based on the model global sparse gradient matrix to obtain at least one updated model weight.
  7. The method of claim 6, wherein, after updating the at least one model weight based on the model global sparse gradient matrix to obtain the at least one updated model weight, the method further comprises: performing inference with the deep learning model based on the at least one updated model weight, and obtaining a model performance index of the deep learning model, so as to determine whether the deep learning model meets a convergence condition according to the model performance index; and when the deep learning model does not meet the convergence condition, determining an updated gradient matrix corresponding to the at least one updated model weight based on the at least one updated model weight and the updated residual gradient matrix, and sparsifying the updated gradient matrix based on the at least one updated model weight to obtain an updated sparse gradient matrix.
  8. A model parameter optimization apparatus, the apparatus comprising: an acquisition unit configured to acquire at least one model weight of a deep learning model and a gradient matrix corresponding to the at least one model weight; a generation unit configured to analyze the gradient transmission state of each model weight in the at least one model weight based on the weight value of each model weight, so as to generate a structured mask matrix according to an analysis result, wherein the matrix dimension of the structured mask matrix is the same as the matrix dimension of the gradient matrix; and a sparsification unit configured to sparsify the gradient matrix based on the structured mask matrix to obtain a sparse gradient matrix, so as to update the at least one model weight of the deep learning model by using the sparse gradient matrix.
  9. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
  10. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-7.
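The interaction between claims 5 and 6 (residual feedback plus cross-node aggregation) can be sketched as follows. The function names and the mean-based aggregation are assumptions made for illustration; the patent does not fix a particular aggregation operator.

```python
import numpy as np

def sparsify_with_residual(grad, mask, residual):
    """One sparsification step with residual feedback (claim 5, sketched).
    Masked-out gradient elements are not discarded: they accumulate in the
    residual matrix and are folded back into the gradient at the next step."""
    grad = grad + residual           # update gradient with historical residual
    sparse = grad * mask             # elements in the to-be-transmitted state
    residual = grad * (1 - mask)     # non-transmitted elements, carried forward
    return sparse, residual

def aggregate(sparse_grads):
    """Claim 6, sketched: because every working node derives the same mask from
    the (identical) model weights, the sparse positions coincide across nodes
    and aggregation is a plain element-wise mean of the sparse gradients."""
    return np.mean(sparse_grads, axis=0)
```

The residual term is what makes the scheme lossless over time: a gradient element that is suppressed in one step still influences a later update once its accumulated value is folded back in.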

Description

Model parameter optimization method and device, electronic equipment and medium

Technical Field

The present application relates to the field of deep learning technologies, and in particular, to a method and apparatus for optimizing model parameters, an electronic device, and a medium.

Background

With the continuous growth of deep learning models and data scale, distributed data-parallel training has become a key means of improving training efficiency. During distributed training, each working node needs to perform gradient synchronization and communication; communication overhead becomes the main bottleneck limiting training efficiency, and gradient sparsification is a common way of reducing communication traffic. Current gradient sparsification methods in the related art mostly screen gradients based on gradient magnitude or a threshold, such as Deep Gradient Compression (DGC) and top-k schemes, and reduce the amount of transmitted data by retaining only a portion of the larger gradients for transmission. However, because each working node sparsifies gradients according to its local gradient values, the gradient positions selected by the working nodes are not uniform. This easily causes the Sparse Gradient Accumulation (SGA) problem when the model is updated: sparse transmission degrades into dense transmission, traffic increases greatly, and stable, efficient communication optimization for distributed model training is difficult to achieve.

Disclosure of Invention

The application provides a model parameter optimization method and apparatus, an electronic device, and a medium, which are used to solve the problem of high traffic caused by sparse gradient accumulation arising from gradient-value-based sparsification in the related art.
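The inconsistency described in the background can be seen in a small example. The top-k criterion below stands in for DGC-style gradient-magnitude screening, and the specific numbers are invented purely for illustration.

```python
import numpy as np

def topk_mask(x, k):
    """0/1 mask keeping the k largest-magnitude entries of x."""
    mask = np.zeros_like(x, dtype=np.int8)
    mask[np.argsort(-np.abs(x))[:k]] = 1
    return mask

weights = np.array([0.1, 2.0, 0.3, 1.5])   # identical on every worker
g_node0 = np.array([0.9, 0.1, 0.2, 0.8])   # local gradients differ per worker
g_node1 = np.array([0.1, 0.7, 0.9, 0.2])

# Gradient-magnitude screening (related art): each node picks different
# positions, so the union of transmitted positions drifts toward dense (SGA).
m0, m1 = topk_mask(g_node0, 2), topk_mask(g_node1, 2)

# Weight-magnitude screening (this application's idea): one globally uniform
# mask, since it depends only on the shared weights, not on local gradients.
mw = topk_mask(weights, 2)
```

Here the two nodes' gradient-based masks select disjoint positions, while the weight-based mask is the same on both nodes by construction.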
An embodiment of a first aspect of the present application provides a model parameter optimization method, applied to a working node in a deep learning model. The method includes: obtaining at least one model weight of the deep learning model and a gradient matrix corresponding to the at least one model weight; analyzing the gradient transmission state of each model weight in the at least one model weight based on the weight value of each model weight, to generate a structured mask matrix according to the analysis result, wherein the matrix dimension of the structured mask matrix is the same as that of the gradient matrix; and sparsifying the gradient matrix based on the structured mask matrix to obtain a sparse gradient matrix, so as to update the at least one model weight of the deep learning model by using the sparse gradient matrix.

In some embodiments, the gradient transmission states comprise a to-be-transmitted gradient state and a non-transmitted gradient state, and analyzing the gradient transmission state of each model weight in the at least one model weight based on the weight value of each model weight to generate a structured mask matrix according to the analysis result comprises: determining a first model weight meeting a preset transmission condition from the at least one model weight based on the weight value of each model weight, and determining the gradient transmission state of the first model weight as the to-be-transmitted gradient state; determining the gradient transmission state of a second model weight, other than the first model weight, in the at least one model weight as the non-transmitted gradient state; and determining a first mask corresponding to the first model weight and a second mask corresponding to the second model weight based on a mapping rule between gradient transmission states and masks, and generating the structured mask matrix based on the first mask and the second mask.

In some embodiments, determining the first model weight meeting the preset transmission condition from the at least one model weight based on the weight value of each model weight includes: dividing the at least one model weight based on the preset transmission condition to obtain at least one model weight group; and performing numerical analysis on each model weight in a first model weight group based on the weight value of each model weight in the first model weight group, and determining a target number of first model weights, the first model weight group being any model weight group of the at least one model weight group.

In some embodiments, sparsifying the gradient matrix based on the structured mask matrix to obtain a sparse gradient matrix comprises determining gradient elements of gradient element positions corresponding to target mask element positions in the gradient matrix to be gradient ele