CN-121996699-A - Distributed system oriented to deep learning recommendation model and data caching method


Abstract

The invention discloses a distributed system oriented to deep learning recommendation models and a data caching method, in the technical field of communication in distributed deep learning systems. In each measurement period, the labels of the query request data flows passing through a central switch are acquired. The size of each data flow is calculated from its label, and the hot query requests of each distributed node are determined from the flow sizes. The method measures and analyzes cluster communication traffic in real time, collects global embedding table access hotspots, and pushes the hotspot data to the cache of each node. This effectively improves the cache hit rate, reduces All-to-All traffic, and lowers the demand on GPU memory access. It addresses the dense data throughput and high access latency that arise because recommendation models are oversized and embedding operations involve frequent, random table entry accesses.

Inventors

Han Diefan
SUN YUE
HUANG HE
DU YANG

Assignees

Soochow University (苏州大学)

Dates

Publication Date: 2026-05-08
Application Date: 2026-01-20

Claims (10)

1. A distributed system oriented to a deep learning recommendation model, characterized by comprising a plurality of distributed nodes and a central switch for communication among the distributed nodes. A measurement module is deployed in the central switch, and a distributed embedding table and a cache module are deployed in the distributed nodes. The measurement module is configured to measure the size of each query request data flow passing through the central switch in each measurement period, to determine the hot query requests of each distributed node based on the data flow sizes, and to push the cache entries corresponding to the hot query requests to the distributed nodes. Each distributed node is configured to execute the query requests of each measurement period. After each measurement period ends, each distributed node merges the cache entries corresponding to the hot query requests in the cache module with the local cache entries in the distributed embedding table.
2. The distributed system oriented to a deep learning recommendation model according to claim 1, wherein the measurement module comprises a plurality of buckets arranged in a matrix, and each row is mapped by a hash function. Each bucket comprises:
- a label field for recording the label of the data flow stored in the bucket;
- a value field for accumulating the sum of the values of all query request data flows in the bucket;
- a count field for recording a count;
- a timestamp field for recording the arrival time of the current query request data flow;
- an instantaneous frequency estimation field for estimating the change in the instantaneous access rate of the current query request data flow;
- a trend priority field for recording the priority of the current query request data flow; and
- a path information field for recording the forwarding path of the current query request data flow.
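The following minimal Python sketch shows the bucket matrix of claim 2 as a data structure. All names are hypothetical illustrations, not the patent's implementation: the field names, the `MeasurementSketch` class, and the use of BLAKE2b salted with the row number to derive one hash function per row.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Bucket:
    """One bucket of the measurement matrix (field names are illustrative)."""
    label: Optional[str] = None              # label of the flow stored here
    value: float = 0.0                       # sum of query-request flow values
    count: float = 0.0                       # (time-decayed) count
    timestamp: float = 0.0                   # arrival time of the latest request
    inst_freq: float = 0.0                   # instantaneous access-rate estimate
    trend_priority: float = 0.0              # trend priority of the stored flow
    path: Optional[Tuple[int, int]] = None   # (request node, destination node)


class MeasurementSketch:
    """rows x cols matrix of buckets; row i uses its own hash function h_i."""

    def __init__(self, rows: int, cols: int) -> None:
        self.rows = rows
        self.cols = cols
        self.buckets: List[List[Bucket]] = [
            [Bucket() for _ in range(cols)] for _ in range(rows)
        ]

    def index(self, row: int, label: str) -> int:
        # Assumption: per-row hash functions obtained by salting BLAKE2b
        # with the row number (BLAKE2b accepts salts of up to 16 bytes).
        digest = hashlib.blake2b(label.encode(), digest_size=8,
                                 salt=row.to_bytes(16, "little")).digest()
        return int.from_bytes(digest, "little") % self.cols

    def candidates(self, label: str) -> List[Bucket]:
        # One candidate bucket per row for a given flow label.
        return [self.buckets[i][self.index(i, label)] for i in range(self.rows)]
```

For example, `MeasurementSketch(3, 64).candidates("node1:feat42")` returns the three buckets, one per row, that a request carrying that label would touch.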
3. The distributed system oriented to a deep learning recommendation model according to claim 2, wherein the measurement module is further configured to update the bucket according to whether the label of the current query request data flow matches the label of the data flow recorded in the label field.
4. The distributed system oriented to a deep learning recommendation model according to claim 3, wherein, if the label of the current query request data flow matches the label of the data flow recorded in the label field, the bucket is updated as follows:
- the hash value of the data flow's label selects the bucket in each row of the measurement module;
- the value recorded in that bucket's count field is updated using a time-weighted function and the timestamp;
- the value recorded in the bucket's instantaneous frequency estimation field is updated using a frequency weight parameter, the latest increment of the value field, and the corresponding timestamp variable; and
- the bucket's timestamp field is assigned (←) the arrival time of the current query request data flow.

The time-weighted function is an exponential function with base e, parameterized by a time weight parameter.
5. The distributed system oriented to a deep learning recommendation model according to claim 3, wherein, if the label of the current query request data flow does not match the label of the data flow recorded in the label field, the bucket is updated as follows.

The value recorded in the count field of the bucket to which the flow's label hashes in a given row is updated (←) using the time-weighted function and the timestamp. The time-weighted function is an exponential function with base e, parameterized by the time weight parameter. It depends on the value recorded in the bucket's timestamp field and on the arrival time of the current query request data flow.

If the replacement condition holds, the label field and the count field are reassigned to the current data flow, and the instantaneous frequency estimation field is initialized. A new priority is assigned to the trend priority field. The path information field records the request node number and the destination node number of the current query request data flow.
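Claims 4 and 5 name the variables of the bucket update but do not give their exact combination. The Python sketch below therefore assumes concrete forms, purely for illustration:
- w(Δt) = e^(−λΔt);
- a count update of C·w + 1 on a label match;
- an exponentially weighted moving average of the value increment per unit time for the instantaneous frequency field;
- eviction of the stored flow when its decayed count falls below a threshold θ.

LAMBDA, ALPHA, THETA, the "new priority" rule, and all function names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

LAMBDA = 0.5   # hypothetical time weight parameter (lambda)
ALPHA = 0.8    # hypothetical frequency weight parameter (alpha)
THETA = 1.0    # hypothetical replacement threshold on the decayed count
EPS = 1e-9     # guards against division by a zero time gap


@dataclass
class Bucket:
    label: Optional[str] = None
    value: float = 0.0
    count: float = 0.0
    timestamp: float = 0.0
    inst_freq: float = 0.0
    trend_priority: float = 0.0
    path: Optional[Tuple[int, int]] = None


def time_weight(dt: float) -> float:
    """Assumed exponential time-weighted function e^(-lambda * dt)."""
    return math.exp(-LAMBDA * dt)


def update_bucket(b: Bucket, label: str, delta_v: float, t_now: float,
                  src: int, dst: int) -> None:
    dt = t_now - b.timestamp
    decay = time_weight(dt)
    if b.label == label:
        # Label match (claim 4): decay the old count, then count this arrival.
        b.count = b.count * decay + 1.0
        b.value += delta_v
        # Moving average of the access rate (value increment per unit time).
        b.inst_freq = ALPHA * b.inst_freq + (1 - ALPHA) * delta_v / max(dt, EPS)
        b.timestamp = t_now
        return
    # Label mismatch (claim 5): only decay the stored flow's count.
    b.count *= decay
    b.timestamp = t_now  # stored count is now expressed as of t_now
    if b.count < THETA:
        # Stored flow has faded: replace it with the current flow.
        b.label = label
        b.count = 1.0
        b.value = delta_v
        b.inst_freq = 0.0
        b.trend_priority = delta_v   # hypothetical "new priority"
        b.path = (src, dst)
```

An empty bucket (no label, zero count) always takes the mismatch branch and is claimed by the first flow that hashes to it. A heavy flow holds its bucket until its decayed count drops below θ.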
6. The distributed system oriented to a deep learning recommendation model according to claim 2, wherein the measurement module measures the size of each query request data flow passing through the central switch in each measurement period by calculating it from the bucket at a given row and column of the measurement module. The calculation uses:
- the data flow label recorded in that bucket for the query request data flow;
- the value recorded in the value field;
- the value recorded in the count field;
- the value recorded in the frequency estimation field; and
- a normalized priority value.

These quantities are combined through two adjustable weight coefficients and an activation function.
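Claim 6 lists the inputs to the flow-size calculation but not how they are combined. As an assumption for illustration only, the Python sketch below scores a bucket as β1·V + β2·C·σ(F)·P̂ with a sigmoid σ, and normalizes the trend priority by the largest priority observed. It then takes the minimum over the rows whose bucket label matches the flow. BETA1, BETA2, and the function names are hypothetical.

```python
import math
from typing import List, Sequence

BETA1 = 1.0   # hypothetical adjustable weight on the value field
BETA2 = 0.5   # hypothetical adjustable weight on the trend term


def sigmoid(x: float) -> float:
    """Assumed activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def bucket_score(value: float, count: float, inst_freq: float,
                 norm_priority: float) -> float:
    """Assumed combination: BETA1*V + BETA2*C*sigmoid(F)*P_hat."""
    return BETA1 * value + BETA2 * count * sigmoid(inst_freq) * norm_priority


def flow_size(label: str, rows: Sequence[dict], max_priority: float) -> float:
    """rows holds the candidate bucket of the flow in each row, as dicts."""
    scores: List[float] = []
    for b in rows:
        if b["label"] != label:
            continue  # this row's bucket currently holds a different flow
        p_hat = b["trend_priority"] / max_priority if max_priority > 0 else 0.0
        scores.append(bucket_score(b["value"], b["count"], b["inst_freq"], p_hat))
    # Conservative choice when several rows hold the flow: smallest estimate.
    return min(scores) if scores else 0.0
```

A flow whose label is not stored in any row is reported with size 0, so it cannot become a hot query request in that period.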
7. The distributed system oriented to a deep learning recommendation model according to claim 1, wherein the execution of the query requests of each measurement period by the distributed node comprises the following. The cache module comprises k cache entries and k corresponding counters. If a query request hits a cache entry in the cache module, the central switch directly obtains the feature vector corresponding to that cache entry, and the entry's counter is incremented by 1. Otherwise, the central switch continues to query the remaining nodes.
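A minimal Python sketch of the node-side cache in claim 7 follows, assuming a dictionary keyed by query feature number. `NodeCache` and `fetch_remote` are hypothetical names. `fetch_remote` is a callback that stands in for the central switch querying the remaining nodes on a miss.

```python
from typing import Callable, Dict, List

Vector = List[float]


class NodeCache:
    """k-entry cache with one hit counter per entry (illustrative)."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.entries: Dict[int, Vector] = {}   # feature number -> embedding vector
        self.counters: Dict[int, int] = {}     # feature number -> hit count

    def lookup(self, feature_id: int,
               fetch_remote: Callable[[int], Vector]) -> Vector:
        vec = self.entries.get(feature_id)
        if vec is not None:
            # Hit: return the cached vector and bump this entry's counter.
            self.counters[feature_id] = self.counters.get(feature_id, 0) + 1
            return vec
        # Miss: fall back to querying the nodes that own the embedding shard.
        return fetch_remote(feature_id)
```

Misses do not change any counter, so at the end of a period the counters hold exactly the per-entry hit counts used by the merge step of claim 8.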
8. The distributed system oriented to a deep learning recommendation model according to claim 7, wherein the merging by the distributed node, after each measurement period ends, of the cache entries corresponding to the hot query requests in the cache module with the local cache entries in the distributed embedding table comprises:
- after the measurement period ends, receiving the k cache entries corresponding to the hot query requests pushed by the central switch, together with their count values; and
- merging these k cache entries with the local cache entries, sorting the result in descending order of count value, caching the first k cache entries, and then clearing all counter values.
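The end-of-period merge of claim 8 can be sketched in Python as follows. The claim does not say what happens when a pushed entry is already cached locally. This sketch assumes that the two counts are added and the pushed vector is kept. `merge_cache` is a hypothetical name.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Entry = Tuple[Vector, int]   # (embedding vector, count)


def merge_cache(local: Dict[int, Entry], pushed: Dict[int, Entry],
                k: int) -> Dict[int, Entry]:
    """Merge pushed hot entries into the local cache and keep the top k."""
    combined: Dict[int, Entry] = dict(local)
    for fid, (vec, cnt) in pushed.items():
        if fid in combined:
            _, old_cnt = combined[fid]
            combined[fid] = (vec, old_cnt + cnt)   # assumption: counts add up
        else:
            combined[fid] = (vec, cnt)
    # Sort by count in descending order and keep the first k entries.
    top = sorted(combined.items(), key=lambda kv: kv[1][1], reverse=True)[:k]
    # Clear every counter so the next measurement period starts from zero.
    return {fid: (vec, 0) for fid, (vec, _) in top}
```

In this sketch, a local entry survives only if its own hits during the period outrank the pushed global hotspots.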
9. A data caching method, applied to the distributed system oriented to a deep learning recommendation model according to any one of claims 1 to 8, comprising:
- acquiring, in each measurement period, the label of each query request data flow passing through the central switch;
- calculating the size of each data flow from the label of the corresponding query request data flow;
- determining the hot query requests of each distributed node based on the size of each data flow; and
- pushing the cache entries corresponding to the hot query requests to the distributed nodes for caching.
10. The data caching method according to claim 9, wherein determining the hot query requests of each distributed node based on the size of each data flow comprises:
- parsing, according to the size of each data flow, the request node number and the query feature number in the corresponding label;
- constructing a cross-node co-occurrence matrix from the request node numbers and query feature numbers;
- identifying shared hot spots and node-specific hot spots from the cross-node co-occurrence matrix;
- determining a composite priority from the shared hot spots and the node-specific hot spots; and
- generating a list ordered by composite priority and taking the query feature numbers of the first k requests as the hot query requests of each distributed node.
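The Python sketch below shows one possible reading of the hot-spot selection in claim 10. Its definitions are assumptions, not the patent's:
- flows arrive already parsed into (request node number, query feature number, size) triples;
- a feature counts as a shared hot spot when at least MIN_SHARERS nodes request it;
- the composite priority is the node's own demand plus GAMMA times the cluster-wide demand for shared features.

GAMMA, MIN_SHARERS, and `hot_requests` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

GAMMA = 0.5       # hypothetical weight of cluster-wide demand for shared features
MIN_SHARERS = 2   # hypothetical: "shared" means requested by >= 2 nodes


def hot_requests(flows: List[Tuple[int, int, float]], k: int) -> Dict[int, List[int]]:
    """flows: (request node number, query feature number, flow size) triples."""
    # Cross-node co-occurrence matrix: node -> feature -> accumulated size.
    matrix: Dict[int, Dict[int, float]] = defaultdict(lambda: defaultdict(float))
    for node, feat, size in flows:
        matrix[node][feat] += size

    # Cluster-wide demand per feature and number of distinct requesting nodes.
    total: Dict[int, float] = defaultdict(float)
    sharers: Dict[int, int] = defaultdict(int)
    for row in matrix.values():
        for feat, size in row.items():
            total[feat] += size
            sharers[feat] += 1

    result: Dict[int, List[int]] = {}
    for node, row in matrix.items():
        scores: Dict[int, float] = {}
        for feat in total:
            # Shared hot spots get a bonus; node-specific demand comes from row.
            bonus = GAMMA * total[feat] if sharers[feat] >= MIN_SHARERS else 0.0
            scores[feat] = row.get(feat, 0.0) + bonus
        # Order by composite priority and keep the first k feature numbers.
        result[node] = sorted(scores, key=scores.get, reverse=True)[:k]
    return result
```

Because every feature in the cluster is ranked for every node, a shared hot spot can be pushed to a node that did not request it during the period. A node's own frequent features still compete through the node-specific term.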

Description

Technical Field

The invention relates to a distributed system oriented to a deep learning recommendation model and to a data caching method. It belongs to the technical field of communication in distributed deep learning systems.

Background

Personalized recommendation systems have become a key supporting technology for Internet services. They are widely used in online businesses such as product recommendation, video and music recommendation, and search. As recommendation tasks become more fine-grained and data volumes keep growing, deep learning recommendation models (Deep Learning Recommendation Model, DLRM) have shown clear advantages in click-through rate prediction and ranking tasks, and have become the mainstream recommendation modeling framework.

A DLRM typically consists of multi-layer perceptrons and large-scale embedding operators, where the embedding layer maps discrete features to high-dimensional vectors. Compared with conventional neural networks, DLRMs have an extremely large number of parameters. Their embedding tables (Embedding Table, EMT) often occupy hundreds of GB or even terabytes of storage, far exceeding the high-bandwidth memory capacity of a single accelerator (such as a GPU). Moreover, hardware memory capacity grows much more slowly than model size, which makes high-performance DLRM training and inference on a single-node device infeasible. High-performance DLRM training and inference therefore typically rely on multi-node distributed systems.
However, distributed DLRM training introduces new system bottlenecks and scalability challenges, mainly the following.

Communication bottleneck: because the model is so large, the traditional data-parallel mode cannot replicate the full set of model parameters on every accelerator. The industry therefore commonly adopts a hybrid of model parallelism and data parallelism, partitioning the EMT across different nodes. Each node must then exchange embedding results through All-to-All communication, which generates a very large volume of data exchange and becomes a key bottleneck for overall system performance.

Memory bottleneck: DLRM parameters number in the trillions, which places extremely high demands on memory access bandwidth. Embedding operations involve frequent, random table entry accesses, and memory bandwidth can hardly sustain such dense data throughput. Access latency increases as a result, further reducing overall training efficiency.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing a distributed system oriented to a deep learning recommendation model and a data caching method. The system measures and analyzes cluster communication traffic in real time, collects global embedding table access hotspots, and pushes the hotspot data to the cache of each node. It thereby addresses the dense data throughput and high access latency that arise because recommendation models are oversized and embedding operations involve frequent, random table entry accesses.
To solve the above technical problems, the invention adopts the following technical solution. The invention provides a distributed system oriented to a deep learning recommendation model, comprising a plurality of distributed nodes and a central switch for communication among the distributed nodes. A measurement module is deployed in the central switch, and a distributed embedding table and a cache module are deployed in the distributed nodes. The measurement module is configured to measure the size of each query request data flow passing through the central switch in each measurement period, to determine the hot query requests of each distributed node based on the data flow sizes, and to push the cache entries corresponding to the hot query requests to the distributed nodes. Each distributed node is configured to execute the query requests of each measurement period. After each measurement period ends, each distributed node merges the cache entries corresponding to the hot query requests in the cache module with the local cache entries in the distributed embedding table.

Further, the measurement module comprises a plurality of buckets arranged in a matrix, and each row is mapped by a hash function. Each bucket comprises: a label field for recording the label of the data flow stored in the bucket; a value field for accumulating the sum of the values of all query request data flows in the bucket; a count field for recording a count; a timestamp field for recording the arrival time of the current query request data flow; an instantaneous frequency estimation field for es