CN-121984862-A - Design method of distributed computing modeling system of high-dimensional discrete data
Abstract
The invention relates to the technical field of distributed computing and discloses a design method for a distributed computing modeling system for high-dimensional discrete data. The method comprises: obtaining feature identifier access requests in a high-dimensional discrete data stream, counting access frequencies, and marking frequently accessed feature identifiers as being in a high-frequency access state; constructing local parameter copies on non-master-node physical computing units of a distributed cluster; routing gradient update data for a feature identifier to the local parameter copy nearest in the network topology and generating a local residual tensor; calculating the direction cosine similarity between the local residual tensor and the global gradient update vector; calculating a real-time access entropy from the distribution of access sources and converting it into a dynamic synchronization threshold; and triggering parameter aggregation when the similarity is smaller than the threshold.
Inventors
- XIONG SONGQUAN
- LIU YANGGUANG
- HE JIANWEI
- CHEN SHIKUN
Assignees
- 宁波财经学院
Dates
- Publication Date
- 20260505
- Application Date
- 20260331
Claims (10)
- 1. A design method for a distributed computing modeling system for high-dimensional discrete data, characterized by comprising the following steps: Step 101, acquiring feature identifier access request data in a high-dimensional discrete data stream, and counting the access frequency of each feature identifier within a preset time window; Step 102, marking a feature identifier whose access frequency exceeds a preset load threshold as being in a high-frequency access state, allocating storage space on a non-master-node physical computing unit of a distributed computing cluster, and constructing a local parameter copy corresponding to the feature identifier; Step 103, receiving gradient update data for the feature identifier, routing the gradient update data to the local parameter copy nearest in the network topology, and performing a vector accumulation operation in the local parameter copy to generate a local residual tensor; Step 104, acquiring a global gradient update vector maintained by a global master node, and calculating a direction cosine similarity value between the local residual tensor and the global gradient update vector; Step 105, collecting access source IP address distribution data for the feature identifier, calculating a real-time access entropy representing the degree of dispersion of the request sources, and calculating a dynamic synchronization threshold based on the real-time access entropy; Step 106, executing synchronization trigger decision logic in each calculation period, namely comparing the direction cosine similarity value with the dynamic synchronization threshold, and generating a synchronization trigger instruction when the direction cosine similarity value is smaller than the dynamic synchronization threshold; and Step 107, in response to the synchronization trigger instruction, performing a parameter aggregation operation between the local parameter copy and the global master node, and updating the global model parameters using the local residual tensor.
- 2. The method for designing a distributed computing modeling system of high-dimensional discrete data according to claim 1, wherein the step of constructing a local parameter copy corresponding to the feature identifier in step 102 includes the steps of analyzing network IP addresses of all source computing nodes initiating access requests for the feature identifier and querying a pre-stored network topology mapping table according to the network IP addresses to obtain logical coordinates of the source computing nodes, the step of constructing a request source network topology set based on the logical coordinates and calculating a geometric center node position of the network topology set, and the step of traversing computing nodes in the distributed computing cluster, and selecting a computing node having the smallest logical link hop number with the geometric center node position and having a current memory occupancy lower than a preset security value as a host node of the local parameter copy.
- 3. The method for designing a distributed computing modeling system of high-dimensional discrete data according to claim 1, wherein generating the local residual tensor in step 103 includes: receiving, within a preset cache period, local gradient vectors for the feature identifier sent by a plurality of source computing nodes; performing an element-wise addition of the local gradient vectors and the cumulative residual vector currently stored in the local parameter copy to obtain an updated cumulative residual vector; and performing sparse filtering on the updated cumulative residual vector, setting each dimension component whose absolute value is lower than a preset noise cutoff to 0, to obtain the local residual tensor in step 301.
- 4. The method of claim 1, wherein the dynamic synchronization threshold in step 105 is calculated from the real-time access entropy according to a formula [formula not reproduced in the source text], wherein T is the dynamic synchronization threshold, k is a preset adjustment coefficient, E is the real-time access entropy, and ε is a preset small positive constant used to prevent the denominator from being zero; the adjustment coefficient k ranges from 0.1 to 1.0.
- 5. The method for designing a distributed computing modeling system of high-dimensional discrete data according to claim 1, wherein the parameter aggregation operation in step 107 includes: step 501, sending, by the computing node hosting the local parameter copy, an aggregation request data packet containing the local residual tensor to the global master node; step 502, parsing, by the global master node, the aggregation request data packet and merging the local residual tensor into the global model parameters by a weighted average algorithm; and step 503, calculating, by the global master node, a new global gradient update vector based on the merged global model parameters, and distributing the new global gradient update vector to all computing nodes that store local parameter copies so as to overwrite the local state of each local parameter copy.
- 6. The method according to claim 1, wherein the synchronization trigger decision logic in step 106 further includes an auxiliary trigger condition based on the gradient norm: step 601, calculating the L2 norm of the local residual tensor; step 602, comparing the L2 norm with a preset gradient accumulation upper threshold; and step 603, when the L2 norm is greater than the gradient accumulation upper threshold, directly generating the synchronization trigger instruction regardless of the result of comparing the direction cosine similarity value with the dynamic synchronization threshold.
- 7. The method for designing a distributed computing modeling system for high-dimensional discrete data according to claim 1, further comprising, after counting the access frequencies of the feature identifiers within the preset time window in step 101: step 701, calculating the arithmetic mean and the standard deviation of the access frequencies of all the feature identifiers; and step 702, determining that a feature identifier whose access frequency is greater than the arithmetic mean plus three times the standard deviation is in the high-frequency access state.
- 8. The method for designing a distributed computing modeling system of high-dimensional discrete data according to claim 1, further comprising, before calculating the direction cosine similarity value in step 104: step 801, normalizing the local residual tensor and the global gradient update vector to unit vectors, respectively, so that both are mapped into the same vector space; and step 802, directly setting the direction cosine similarity value to 0 if the norm of the global gradient update vector is detected to be 0.
- 9. The method according to claim 1, wherein the parameter aggregation operation in step 107 uses a non-blocking asynchronous communication protocol, and the local parameter copy does not wait for a confirmation response of the global master node after sending the aggregation request packet, and continues to receive and process the gradient update data that arrives later.
- 10. The method of claim 1, wherein the real-time access entropy in step 105 is obtained by calculating the Shannon entropy of the distribution of the proportions of requests for the feature identifier issued by different source computing nodes within a preset time window.
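The numerical steps recited in claims 1, 3, 6, 7, 8 and 10 can be illustrated with a minimal Python sketch. It is not the patented implementation. All function and parameter names are hypothetical, vectors are plain Python lists, the population standard deviation and a base-2 logarithm are assumed because the claims do not fix either choice, and the guard for a zero-norm local residual is an added assumption. The dynamic synchronization threshold is passed in by the caller, since the formula of claim 4 is not reproduced in this text.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def hot_features(access_log, sigmas=3.0):
    """Claim 7: flag feature IDs whose count exceeds mean + 3 * std.

    Assumption: population standard deviation (pstdev).
    """
    counts = Counter(access_log)
    freqs = list(counts.values())
    cutoff = mean(freqs) + sigmas * pstdev(freqs)
    return {fid for fid, n in counts.items() if n > cutoff}


def access_entropy(source_counts):
    """Claim 10: Shannon entropy of per-source request shares.

    Assumption: base-2 logarithm, so the result is in bits.
    """
    total = sum(source_counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in source_counts.values() if n > 0)


def accumulate_residual(residual, gradients, noise_cutoff):
    """Claim 3: element-wise accumulation, then sparse filtering."""
    acc = list(residual)
    for g in gradients:
        acc = [a + b for a, b in zip(acc, g)]
    # Components with absolute value below the cutoff are set to 0.
    return [x if abs(x) >= noise_cutoff else 0.0 for x in acc]


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def direction_cosine(local, global_vec):
    """Claim 8: cosine similarity, defined as 0 for a zero global vector.

    Assumption: a zero local residual is treated the same way.
    """
    l_norm, g_norm = l2_norm(local), l2_norm(global_vec)
    if g_norm == 0.0 or l_norm == 0.0:
        return 0.0
    dot = sum(a * b for a, b in zip(local, global_vec))
    return dot / (l_norm * g_norm)


def should_sync(local, global_vec, threshold, norm_cap):
    """Claims 1 and 6: decide whether to trigger parameter aggregation.

    `threshold` is the dynamic synchronization threshold and `norm_cap`
    the gradient accumulation upper threshold; both are caller-supplied.
    """
    if l2_norm(local) > norm_cap:  # auxiliary trigger of claim 6
        return True
    return direction_cosine(local, global_vec) < threshold
```

For example, `should_sync([1.0, 0.0], [0.0, 1.0], 0.5, 10.0)` triggers aggregation because the local residual is orthogonal to the global update, whereas a residual aligned with the global update triggers only once its L2 norm exceeds `norm_cap`.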
Description
Design method of distributed computing modeling system of high-dimensional discrete data
Technical Field
The invention relates to a design method of a distributed computing modeling system of high-dimensional discrete data, and belongs to the technical field of distributed computing.
Background
In current industrial-scale applications such as computational advertising, recommendation systems, and large language model training, modeling of high-dimensional discrete data has become a core computing task. To handle feature parameters numbering up to billions, the industry generally adopts the parameter server architecture as the mainstream approach. The design premise of this architecture is that a massive number of discrete feature keys are mapped and partitioned onto different physical computing nodes of a distributed cluster by a consistent hashing algorithm, so that storage and computing load are balanced through a statistically uniform distribution.
However, when faced with the extremely skewed power-law data streams of real production environments, approaches based on static hash mapping encounter a serious performance bottleneck. Although the prior art has attempted to relieve communication pressure by optimizing the network topology, such attempts rely solely on static or semi-static reorganization of physical nodes and cope poorly with sharp transient fluctuations in traffic characteristics. For example, Chinese patent application publication No. CN113191505B discloses a method for placing parameter servers in geo-distributed machine learning: a clustering algorithm divides the working nodes into clusters according to the physical distance and bandwidth of their links, and the deployment positions of the local and global parameter servers are selected according to the clusters. Although this reduces communication delay to some extent by shortening physical transmission paths, the method is still essentially a coarse-grained optimization based on the physical properties of the network layer and lacks semantic awareness of the transmitted data content. In high-dimensional discrete data training scenarios it cannot assess the numerical validity of gradient updates, so a large amount of low-information-gain data containing random noise still occupies the links indiscriminately, and congestion cannot be fundamentally resolved. Therefore, how to design a method that dynamically reconstructs the parameter storage topology according to the access heat and gradient semantics of the data stream, and that suppresses invalid communication and resource oscillation as far as possible while preserving model convergence accuracy, is the technical problem to be solved by the invention.
Disclosure of Invention
To solve the problems described in the background, the technical solution of the invention is a design method for a distributed computing modeling system for high-dimensional discrete data, comprising the following steps: Step 101, acquiring feature identifier access request data in a high-dimensional discrete data stream, and counting the access frequency of each feature identifier within a preset time window; Step 102, marking a feature identifier whose access frequency exceeds a preset load threshold as being in a high-frequency access state, allocating storage space on a non-master-node physical computing unit of a distributed computing cluster, and constructing a local parameter copy corresponding to the feature identifier; Step 103, receiving gradient update data for the feature identifier, routing the gradient update data to the local parameter copy nearest in the network topology, and performing a vector accumulation operation in the local parameter copy to generate a local residual tensor; Step 104, acquiring a global gradient update vector maintained by a global master node, and calculating a direction cosine similarity value between the local residual tensor and the global gradient update vector; Step 105, collecting access source IP address distribution data for the feature identifier, calculating a real-time access entropy representing the degree of dispersion of the request sources, and calculating a dynamic synchronization threshold based on the real-time access entropy; Step 106, executing synchronization trigger decision logic in each calculation period, namely comparing the direction cosine similarity value with the dynamic synchronization threshold, and generating a synchronization trigger instruction when the direction cosine similarity value is smaller than the dynamic synchronization threshold; and Step 107, in response to the synchronization trigger instruction, performing a parameter aggregation operation between the local parameter copy and the global master node, and updating the global model parameters using the local residual tensor. Preferably, the step of construc