CN-121998001-A - Contract recursion method for updating distributed tensor data
Abstract
The invention discloses a contract recursion method for updating distributed tensor data, relating to the technical field of computer data processing. The method comprises the following steps: initializing a dynamic self-adaptive contract cluster; sensing and aggregating the dynamic state of the cluster in real time; executing a contract-driven recursive update loop; performing fault tolerance and state restoration during recursion; and performing an iteration-termination judgment. By combining dynamic self-adaptive contracts with real-time cluster state sensing, the method realizes intelligent recursion of distributed tensor updates and remarkably improves the efficiency and stability of large-scale training. It dynamically optimizes the communication strategy according to the network and computing-power state, ensures state consistency in complex environments, precisely resolves conflicts through version vectors and intelligent arbitration, and achieves graceful fault tolerance through predictive compensation. The method is highly self-adaptive, can optimize parameters online to reduce manual tuning cost, and provides robust and efficient support for distributed machine learning.
Inventors
- SONG HEPING
- ZHANG LEI
- ZHU JIAN
Assignees
- 迈纳士 (Shanghai) Robotics Technology Co., Ltd.
- 小挚云途 (Shanghai) New Energy Vehicle Technology Co., Ltd.
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-01-29
Claims (9)
- 1. A contract recursion method for distributed tensor data update, comprising the steps of: Step one, loading and instantiating a global-layer contract and a local-layer contract as dynamic self-adaptive contracts, wherein the global-layer contract defines basic rules for collaboration across nodes and the local-layer contract customizes a corresponding tensor-slice management strategy and local recursive calculation rules according to node type; allocating an initial tensor state to each node in a cluster based on a preset tensor slicing rule, and creating an initial version vector for each tensor slice; Step two, periodically collecting running-state indexes of all nodes in the cluster, including node computing-power load, inter-node network communication quality and access-pattern indexes of each tensor slice, and aggregating them to generate a cluster state vector; Step three, dynamically adjusting contract parameters according to the cluster state vector; each node calculating a local tensor update amount according to its local-layer contract and generating a version label containing a node identifier and a logical timestamp; aggregating the local tensor update amounts within the cluster according to the rules defined by the global-layer contract, performing conflict resolution based on the version labels, generating a globally consistent new tensor state and updating each node's copy; Step four, continuously recording a recursion path log during the update process; when a node fault or communication anomaly is detected, tracing back to the nearest consistent state point according to the recursion path log, and predicting and compensating the missing update amount using historical log data; and Step five, if the current recursion process meets a preset convergence condition or triggers an abnormal termination condition, terminating the process and outputting the final tensor state; otherwise, returning to Step two to continue iterating.
- 2. The method for contract recursion of distributed tensor data update of claim 1, wherein, when the running-state indexes are collected in Step two, computing-power indexes, including the time consumed by tensor-operation calculations, memory utilization and processor occupancy, are periodically collected through a node agent; bidirectional delay and available bandwidth between nodes are measured through network probes; and slice access heat is calculated by counting slice update frequency and query-dependency frequency with counters in a tensor-slice management module.
- 3. The method for contract recursion of distributed tensor data update of claim 1, wherein in Step three the specific step of dynamically adjusting contract parameters is to dynamically switch the cluster synchronization mode based on the network delay index in the cluster state vector: switching to an asynchronous or hybrid mode when the average inter-node delay exceeds a first threshold, and switching back to the synchronous mode when the delay falls below a second threshold, where the hybrid mode refers to synchronous aggregation within a subset of computing nodes and asynchronous communication between subsets (a threshold-hysteresis sketch follows the claims).
- 4. The method for contract recursion of distributed tensor data update of claim 1, wherein, when contract parameters are dynamically adjusted in Step three, a reinforcement learning agent is adopted, a reward function is constructed from cluster iteration efficiency and state consistency, and adjustment actions for the learning-rate and aggregation-frequency parameters in the contract are dynamically output and executed.
- 5. The method for contract recursion of distributed tensor data update of claim 4, wherein the training mechanism of the reinforcement learning agent comprises: the state space is a normalized representation of the cluster state vector; the action space is the set of instructions for adjusting parameters in the contract; and the reward function R is designed as R = w1·E + w2·(1 − D) − w3·C, where E is the iteration efficiency (the reciprocal of the per-iteration time), D is the maximum divergence among the cluster's tensor state versions, C is the network communication overhead, and w1, w2 and w3 are weight coefficients (see the sketch following the claims).
- 6. The method for contract recursion of distributed tensor data update according to claim 1, wherein, when the local tensor update amounts are aggregated within the cluster in Step three, a gradient compression strategy is dynamically selected based on the network bandwidth and slice access heat in the cluster state vector: sparse compression is adopted for high-frequency-update slices, and quantization or low-rank approximate compression is adopted for low-frequency-update slices.
- 7. The method of contract recursion for distributed tensor data update according to claim 6, wherein the compression rate r in the gradient compression strategy is determined adaptively: a target compression rate is calculated according to the formula r = r_base · f(h) · g(b), where r_base is a base compression rate, f(h) is a decreasing function of the slice access heat h, and g(b) is a decreasing function of the available bandwidth b; the current compression error is calculated and accumulated after decompression at the receiving end, and the accumulated error is added as a correction to the locally computed gradient in the next round (an illustrative sketch follows the claims).
- 8. The method for contract recursion of distributed tensor data update according to claim 1, wherein, when conflict resolution is performed based on version labels in Step three, a version vector is maintained for each tensor slice; when a new update is received, version comparison is performed, and if a concurrency conflict exists, strong-consistency arbitration based on the latest timestamp is adopted for key slices, while eventual-consistency arbitration based on node priority or a merging algorithm is adopted for non-key slices, according to the arbitration rules defined by the local-layer contract (see the version-vector sketch following the claims).
- 9. The method for contract recursion of distributed tensor data update according to claim 1, wherein, when the missing update amount is predicted and compensated using historical log data in Step four, the failed node's update history for the N consecutive rounds preceding the failure is obtained from the recursion path log stored in a distributed manner, this history sequence is input into a pre-trained recurrent neural network prediction model, a predicted value of the missing round's update amount is output, and the predicted value is fused into the current round's aggregation as the compensation amount (a model sketch follows the claims).
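The version-label scheme of claims 1 and 8 can be sketched as follows; this is a minimal illustration, not the patent's implementation. The names (Update, dominates, arbitrate) are hypothetical, the version vector is a plain node-to-counter map, and "key slice" arbitration is reduced to a latest-logical-timestamp rule while non-key slices fall back to a node-priority tie-break.

```python
from dataclasses import dataclass

@dataclass
class Update:
    node_id: str
    logical_ts: int     # logical timestamp from the version label
    vector: dict        # version vector: node_id -> update counter
    delta: list         # local tensor update amount (placeholder)

def dominates(a: dict, b: dict) -> bool:
    """True if version vector a has seen everything b has seen."""
    keys = set(a) | set(b)
    return all(a.get(k, 0) >= b.get(k, 0) for k in keys)

def arbitrate(u1: Update, u2: Update, key_slice: bool) -> Update:
    """Resolve two updates addressed to the same tensor slice."""
    if dominates(u1.vector, u2.vector):
        return u1       # causally newer: no real conflict
    if dominates(u2.vector, u1.vector):
        return u2
    # Concurrent conflict: strong arbitration for key slices,
    # node-priority arbitration (here: lowest node id) otherwise.
    if key_slice:
        return max(u1, u2, key=lambda u: u.logical_ts)
    return min(u1, u2, key=lambda u: u.node_id)
```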
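The delay-driven mode switching of claim 3 amounts to hysteresis between two thresholds. A minimal sketch, assuming thresholds in milliseconds; the Mode enum, threshold values and the choice of asynchronous over hybrid mode are illustrative assumptions, not taken from the patent:

```python
from enum import Enum

class Mode(Enum):
    SYNC = "synchronous"
    ASYNC = "asynchronous"
    HYBRID = "hybrid"   # sync within node subsets, async between subsets

def next_mode(current: Mode, avg_delay_ms: float,
              high_ms: float = 50.0, low_ms: float = 10.0) -> Mode:
    """Hysteresis: leave SYNC above the first threshold,
    return to SYNC only below the second threshold."""
    if avg_delay_ms > high_ms:
        return Mode.ASYNC        # or Mode.HYBRID, per contract policy
    if avg_delay_ms < low_ms:
        return Mode.SYNC
    return current               # between thresholds: keep current mode
```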
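The reward of claim 5 is directly computable once E, D and C are measured. A minimal sketch; the weight values are illustrative placeholders, and the reading of E as the reciprocal of the per-iteration time is an assumption noted above:

```python
def reward(E: float, D: float, C: float,
           w1: float = 1.0, w2: float = 0.5, w3: float = 0.1) -> float:
    """R = w1*E + w2*(1 - D) - w3*C, where
    E: iteration efficiency (e.g., reciprocal of per-iteration time),
    D: maximum version divergence across cluster tensor states, in [0, 1],
    C: network communication overhead."""
    return w1 * E + w2 * (1.0 - D) - w3 * C
```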
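The adaptive compression rate and error-feedback loop of claims 6 and 7 can be sketched as below. The decreasing functions f and g are illustrative choices (the patent only requires that they decrease), top-k sparsification stands in for the "sparse compression" of claim 6, and for simplicity the error accumulation is shown on the sender side rather than at the receiving end:

```python
import numpy as np

def compression_rate(h: float, b: float, r_base: float = 0.1) -> float:
    """r = r_base * f(h) * g(b); f and g are illustrative decreasing functions."""
    return r_base * (1.0 / (1.0 + h)) * (1.0 / (1.0 + b))

def topk_compress(grad: np.ndarray, r: float) -> np.ndarray:
    """Sparse compression: keep only the r-fraction of largest-magnitude entries."""
    k = max(1, int(r * grad.size))
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    out = np.zeros_like(flat)
    out[idx] = flat[idx]
    return out.reshape(grad.shape)

class ErrorFeedbackCompressor:
    """Carries each round's compression error into the next round's gradient."""
    def __init__(self):
        self.residual = None

    def compress(self, grad: np.ndarray, h: float, b: float) -> np.ndarray:
        if self.residual is None:
            self.residual = np.zeros_like(grad)
        corrected = grad + self.residual       # add last round's error
        sent = topk_compress(corrected, compression_rate(h, b))
        self.residual = corrected - sent       # accumulate new error
        return sent
```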
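The prediction-compensation step of claim 9 needs only a sequence model over the last N update amounts. A PyTorch sketch, under the assumption that per-round updates are flattened to fixed-length vectors; the architecture and the UpdatePredictor name are illustrative, since the patent specifies only "a pre-trained recurrent neural network prediction model":

```python
import torch
import torch.nn as nn

class UpdatePredictor(nn.Module):
    """Predicts the missing round's update from the N preceding rounds."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.rnn = nn.GRU(input_size=dim, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, dim)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, N, dim) -- flattened updates of the last N rounds
        out, _ = self.rnn(history)
        return self.head(out[:, -1])   # predicted missing update, (batch, dim)

# Usage: fuse the prediction into the current round's aggregation.
# model = UpdatePredictor(dim=1024)   # assume pre-trained weights are loaded
# missing = model(last_n_updates)     # stands in for the failed node's update
```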
Description
Contract recursion method for updating distributed tensor data

Technical Field

The invention relates to the technical field of computer data processing, and in particular to a contract recursion method for updating distributed tensor data.

Background

Distributed tensors typically arise in large-scale machine learning (especially deep learning) training. Because the model parameter count is huge (e.g., large language models with trillions of parameters) or the data volume is huge, a single node cannot hold all the data or compute over it efficiently, so tensor shards (sharding) must be stored on multiple nodes, or processing must be distributed via data parallelism (multiple nodes replicating the same model), model parallelism (a single model split across multiple nodes), and the like. Updates to tensors (e.g., adjusting parameters according to gradients) must then be completed in a coordinated manner in the distributed environment. Distributed tensor data update refers to the process of modifying or adjusting tensors (multidimensional arrays such as the model parameters and intermediate features common in deep learning) stored across multiple nodes (e.g., servers, GPUs, TPUs) in a distributed computing environment. Its core goal is to efficiently and consistently maintain the latest state of the tensors during multi-node collaborative computing, so as to support distributed training, inference or other large-scale data processing tasks.

Existing distributed tensor data update methods, such as asynchronous updating via a parameter server or AllReduce with a fixed synchronization domain, generally perform gradient aggregation and synchronization with static, preset strategies. These methods lack the ability to sense and respond to run-time cluster dynamics (such as network fluctuation, node heterogeneity and data-access hot spots), so problems such as communication blockage and unbalanced resource utilization easily arise in complex network environments. Meanwhile, their fault-tolerance mechanisms are coarse-grained (e.g., simple rollback), which wastes computing resources and slows convergence, and their conflict resolution relies on simple rules, making it difficult to maintain state consistency while preserving efficiency. In summary, existing distributed tensor data update methods lack flexibility and intelligence, and cannot achieve an optimal balance of efficiency, consistency and robustness in dynamic, heterogeneous distributed environments. The invention therefore provides a contract recursion method for updating distributed tensor data to solve these problems.

Disclosure of Invention

In view of the above problems, the invention aims to provide a contract recursion method for updating distributed tensor data, which addresses the shortcomings of existing methods: a static, preset update strategy that is difficult to adapt to dynamic, heterogeneous cluster environments, often leading to low communication efficiency, unbalanced resource utilization, coarse-grained fault tolerance and difficulty in maintaining state consistency.
In order to achieve this purpose, the invention is realized by the following technical scheme. The contract recursion method for updating distributed tensor data comprises the following steps: Step one, loading and instantiating a global-layer contract and a local-layer contract as dynamic self-adaptive contracts, wherein the global-layer contract defines basic rules for collaboration across nodes and the local-layer contract customizes a corresponding tensor-slice management strategy and local recursive calculation rules according to node type; allocating an initial tensor state to each node in the cluster based on a preset tensor slicing rule, and creating an initial version vector for each tensor slice. Step two, periodically collecting running-state indexes of all nodes in the cluster, including node computing-power load, inter-node network communication quality and access-pattern indexes of each tensor slice, and aggregating them to generate a cluster state vector. Step three, dynamically adjusting contract parameters according to the cluster state vector; each node calculating a local tensor update amount according to its local-layer contract and generating a version label containing a node identifier and a logical timestamp; aggregating the local tensor update amounts within the cluster according to the rules defined by the global-layer contract, performing conflict resolution based on the version labels, generating a globally consistent new tensor state and updating each node's copy. Step four, continuously recording a recursion path log during the update process; when a node fault or communication anomaly is detected, tracing back to the nearest consistent state point according to the recursion path log, and