CN-121984970-A - Heterogeneous data-oriented distributed zero-order compression non-convex optimization method
Abstract
The invention discloses a distributed zero-order compression non-convex optimization method for heterogeneous data, which belongs to the technical field of computer networks and comprises the steps of adopting an intelligent body to store local model parameters, sampling function values based on a random local cost function, constructing a zero-order gradient estimator of the local cost function, constructing a communication network between the intelligent bodies based on a communication compression technology, carrying out information interaction between the intelligent bodies based on the communication network, constructing the communication network between the intelligent bodies based on a generalized contraction type compressor, setting dual variables based on the local model parameters, using the local model parameters as original variables, and adopting the zero-order gradient estimator and the generalized contraction type compressor to update the original variables and the dual variables. Aiming at heterogeneous data, the invention constructs the zero-order gradient estimator through multi-round two-point function value sampling to relieve the influence of data isomerism, is suitable for the distributed zero-order non-convex optimization task with isomerism and complexity of data distribution and training targets, and simultaneously reduces communication burden by using a compression communication technology.
Inventors
- Yi Xinlei
- WANG HAONAN
- HONG YIGUANG
- Li Wangminghui
Assignees
- 同济大学
Dates
- Publication Date
- 20260505
- Application Date
- 20260403
Claims (10)
- 1. A distributed zero-order compression non-convex optimization method for heterogeneous data is characterized by comprising the following steps: Preserving local model parameters by adopting an agent, sampling function values based on a random local cost function, and constructing a zero-order gradient estimator of the local cost function; Constructing a communication network between the intelligent agents based on a communication compression technology, and carrying out information interaction between the intelligent agents based on the communication network; constructing a communication network between intelligent agents based on a generalized contraction type compressor, wherein the generalized contraction type compressor comprises an unbiased compressor, a biased contraction type compressor and a part of biased non-contraction type compressor; and setting a dual variable based on the local model parameter, taking the local model parameter as an original variable, and updating the original variable and the dual variable by adopting the zero-order gradient estimator and the generalized contraction compressor.
- 2. The distributed zero-order compression non-convex optimization method for heterogeneous data according to claim 1, wherein the local model parameters are , 。
- 3. The distributed zero-order compression non-convex optimization method for heterogeneous data according to claim 2, wherein the zero-order gradient estimator is: ; Wherein i is the number of agents, k is the number of iterations, τ is the number of rounds of two-point function value sampling, M is the number of rounds of two-point sampling together, As a parameter of the smoothness of the sheet, S p , For a uniformly sampled random vector S p , Is the local data sample extracted from distribution D i .
- 4. The distributed zero-order compression non-convex optimization method for heterogeneous data according to claim 3, wherein constructing a communication network between agents based on a generalized contraction compressor comprises: Constructing a communication network between intelligent bodies as a graph 。
- 5. The distributed zero-order compression non-convex optimization method for heterogeneous data according to claim 4, wherein the generalized contraction type compressor C is: The method meets the following conditions: wherein r >0, delta is the compression coefficient, , For the desire of randomness inside the compressor C, x is the model parameter that needs to be communicated.
- 6. The distributed zero-order compression non-convex optimization method for heterogeneous data according to claim 5, wherein the unbiased compressor includes a k-bit unbiased quantizer C 1 (x), which satisfies the first condition of the generalized contraction compressor: ; The biased-shrink type compressor includes a Top-k sparser compressor C 2 (x), and the biased-shrink type compressor satisfies a second condition of the generalized-shrink type compressor: ; The biased-shrink type compressor comprises a Rand-k random sparse compressor C 3 (x), and the Rand-k random sparse compressor meets the second condition of the generalized-shrink type compressor: ; The biased non-shrink compressor includes a norm-sign compressor C 4 (x) that satisfies a third condition of the generalized shrink compressor: 。
- 7. the distributed zero-order compression non-convex optimization method for heterogeneous data according to claim 6, wherein the definition of the k-bit unbiased quantizer is: , wherein, In order for the disturbance to occur, As a function of the sign of the element by element, In the form of a Hadamard product, In order to round down the function, For absolute value operation, b v is the number of bits required to transmit a vector, and b 1 is 64; the definition of Top-k sparsing compressor is: , wherein t j is the coordinate index of the j-th absolute maximum component of the vector x The definition of Rand-k random sparse compressor is: r j is the slave Randomly selecting coordinate indexes which are not repeated mutually; the definition of a norm-sign compressor is: , 。
- 8. The distributed zero-order compression non-convex optimization method for heterogeneous data according to claim 7, wherein the communication network between the intelligent agents is constructed based on a generalized contraction compressor, further comprising: Setting a first auxiliary variable based on the communication network; When the first auxiliary variable approaches the local model parameter, setting an indirect compression variable to replace the local model parameter so as to reduce compression errors and construct a compression variable; setting a second auxiliary variable to record information of the first auxiliary variable; the first auxiliary variable and the second auxiliary variable are updated.
- 9. The distributed zero-order compression non-convex optimization method for heterogeneous data according to claim 8, wherein the first auxiliary variable is , ; The second auxiliary variable is , ; As the first auxiliary variable approaches the local model parameter, ; Compression variable is ; Updating the first auxiliary variable and the second auxiliary variable includes: Constructing a first auxiliary variable updating algorithm: updating the first auxiliary variable; constructing a second auxiliary variable updating algorithm: The second auxiliary variable is updated.
- 10. The heterogeneous data-oriented distributed zero-order compression non-convex optimization method according to claim 9, wherein setting a dual variable based on the local model parameter, using the local model parameter as an original variable, updating the original variable and the dual variable by using the zero-order gradient estimator and the generalized contraction compressor, comprises: setting local model parameters Is the dual variable of (2) , ; Constructing an original variable updating algorithm: updating the original variable; Constructing a dual variable updating algorithm: the dual variables are updated.
Description
Heterogeneous data-oriented distributed zero-order compression non-convex optimization method Technical Field The invention relates to the technical field of computer networks, in particular to a distributed zero-order compression non-convex optimization method for heterogeneous data. Background Distributed optimization has received increasing attention for wide application in the fields of large-scale machine learning and networking systems, etc. Distributed optimization typically considers a networked system of n agents, where each agent possesses a private local cost function of(Not necessarily convex), the cost function may be deterministic or random, each agent optimizing by solving the following optimization problem: (1) To achieve consistency and to obtain optimal model parameters, wherein, As a parameter of the model, it is possible to provide,To obey data distributionIs used for the data processing of the data samples of the local data,As a random local cost function. To solve the problem (1), a number of important algorithms have been proposed, from convex optimization methods to non-convex optimization methods. Recently, the problem of random non-convex distributed optimization has received increasing attention as it is directly related to neural network training. Although the distributed optimization problem has been widely studied, existing methods are mostly first-order (gradient-based) methods. Many practical scenarios have difficulty in obtaining gradient information and can only perform a limited number of function point samples per round, such as black box model optimization and optimization based on bandit feedback. Such problems have spawned zero-order (no gradient) optimization methods, which can be generally classified by the way the gradient estimator is constructed. Some distributed zero-order optimization algorithms use two-point differencing to construct a gradient estimator when solving convex as well as non-convex problems. The scheme realizes better balance between the sampling times of the function points and the convergence speed. For example, such algorithms can achieve convergence speed of linear acceleration. In contrast, some distributed zero-order methods rely on 2p times function point sampling per round. Although the convergence rate is comparable to the first-order method, a serious sampling burden is imposed when the dimension p and the number n of agents are large. In a centralized scenario, there are also zero-order methods based on one-point sampling, and methods based on zero-order Hessian. Communication is another key factor in distributed optimization and often becomes a bottleneck in practical applications. Therefore, distributed optimization algorithm designs around efficient communications are widely studied, such as compressed communications, event triggered communications, and asynchronous communications. Among these methods, compressed communication is particularly interesting because of its simple mechanism, easy implementation, and effective reduction of communication overhead. In the distributed convex optimization problem, various kinds of compressors including quantizer, unbiased compressor, and shrink compressor have been studied and applied. These techniques are further generalized to non-convex situations as well. Another category of research effort is directed to unifying the different categories of compressors. There have been studies to propose a class of generalized compressors with bounded relative compression errors, incorporating unbiased compressors and systolic compressors into a unified framework. On this basis, the compressor with global and local absolute errors is further analyzed. In addition, there is a category of research that is directed to accelerating the convergence speed of compression algorithms. It is noted that some compression methods achieve convergence rates comparable to the first-order method, random gradient descent (SGD), and zero-order algorithms under precise communication. In distributed optimization, data heterogeneity can significantly affect algorithm convergence, which is particularly pronounced in applications such as meta-learning and large language model training and fine tuning. The problem of data heterogeneity was at the earliest a major concern in federal learning. In federal learning, multi-step local updates and system heterogeneity can further exacerbate the effects of data heterogeneity. To alleviate this problem, researchers have proposed some important algorithms in scenarios with a central node, and more generally in distributed scenarios. However, these methods generally rely on assumptions that limit the degree of data isomerism, which are collectively referred to herein as "data homogeneity assumptions". Notably, when all agents participate in each round and perform only one local update, the algorithm may degrade to a standard distributed SGD or first order approach, where no data homogeneity