US-12625633-B2 - Multi-pass distributed data shuffle
Abstract
A system and method for repartitioning data in a distributed network. The method may include executing, by one or more processors, a first pass of a data set from a plurality of first sources to a plurality of first sinks, each first sink collecting data from one or more of the first sources, and executing, by the one or more processors, a second pass of the data set from a plurality of second sources to a plurality of second sinks, each one of the plurality of first sinks corresponding to one of the plurality of second sources, and each second sink collecting data from one or more of the second sources. Executing the first and second passes causes the data set to be repartitioned such that one or more second sinks collect data that originated from two or more of the first sources.
Inventors
- Mohsen Vakilian
- Hossein Ahmadi
Assignees
- GOOGLE LLC
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2024-05-10
Claims (20)
- 1 . A method comprising: executing, by one or more processors, a first regrouping of a data set in a plurality of first sources, and a first splitting of the regrouped data set to a plurality of first sinks, wherein the plurality of first sinks is greater in quantity than the plurality of first sources; and executing, by the one or more processors, a second regrouping of the data set in the plurality of first sinks, and a second splitting of the regrouped data set to a plurality of second sinks, wherein the plurality of second sinks is greater in quantity than the plurality of first sinks.
- 2 . The method of claim 1 , wherein the second regrouping comprises a plurality of independent shuffles.
- 3 . The method of claim 2 , wherein the plurality of independent shuffles are performed sequentially.
- 4 . The method of claim 2 , wherein the plurality of independent shuffles are performed simultaneously.
- 5 . The method of claim 2 , wherein the first regrouping and first splitting of the data set is associated with a first metadata log, and wherein each independent shuffle of the second regrouping of the data set is associated with its own respective second metadata log, wherein each respective second metadata log includes metadata for only those second sinks of its own associated independent shuffle.
- 6 . The method of claim 5 , further comprising flushing details of one second metadata log when its associated independent shuffle is complete while other independent shuffles are still being executed.
- 7 . The method of claim 1 , wherein the method further comprises: prior to the first regrouping, designating each of the plurality of first sinks and the plurality of second sinks; and upon at least one first sink completing collection of a respective portion of the regrouped data set from one or more of the first sources, and before completion of the first regrouping and splitting, executing, by the one or more processors, the independent shuffle of the respective portion of the regrouped data set received by the at least one first sink.
- 8 . The method of claim 1 , wherein the plurality of second sinks is greater in quantity than the plurality of first sinks by a factor of two or greater.
- 9 . The method of claim 1 , wherein in the first splitting of the data set each first sink receives data from at least two first sources, and wherein in the second splitting of the data set each second sink receives data from at least two first sinks.
- 10 . The method of claim 9 , wherein the first sinks and second sinks are divided among a plurality of independent shuffles of the second regrouping and the second splitting of the data set, wherein each independent shuffle is associated with its own respective metadata log such that each respective metadata log includes metadata for only those second sinks of its own associated independent shuffle.
- 11 . A system comprising: one or more processors; and one or more storage devices in communication with the one or more processors, wherein the one or more storage devices contain instructions configured to cause the one or more processors to: execute a first regrouping of a data set in a plurality of first sources, and a first splitting of the regrouped data set to a plurality of first sinks, wherein the plurality of first sinks is greater in quantity than the plurality of first sources; and execute a second regrouping of the data set in the plurality of first sinks, and a second splitting of the regrouped data set to a plurality of second sinks, wherein the plurality of second sinks is greater in quantity than the plurality of first sinks.
- 12 . The system of claim 11 , wherein the second regrouping comprises a plurality of independent shuffles.
- 13 . The system of claim 12 , wherein the instructions are configured to cause the one or more processors to perform the plurality of independent shuffles sequentially.
- 14 . The system of claim 12 , wherein the instructions are configured to cause the one or more processors to perform the plurality of independent shuffles simultaneously.
- 15 . The system of claim 12 , wherein the first regrouping and first splitting of the data set is associated with a first metadata log, and wherein each independent shuffle of the second regrouping of the data set is associated with its own respective second metadata log, wherein each respective second metadata log includes metadata for only those second sinks of its own associated independent shuffle.
- 16 . The system of claim 15 , wherein the instructions are configured to cause the one or more processors to flush details of one second metadata log when its associated independent shuffle is complete while other independent shuffles are still being executed.
- 17 . The system of claim 11 , wherein the instructions are configured to cause the one or more processors to: prior to the first regrouping, designate each of the plurality of first sinks and the plurality of second sinks; and upon at least one first sink completing collection of a respective portion of the regrouped data set from one or more of the first sources, and before completion of the first regrouping and splitting, execute the independent shuffle of the respective portion of the regrouped data set received by the at least one first sink.
- 18 . The system of claim 11 , wherein the plurality of second sinks is greater in quantity than the plurality of first sinks by a factor of two or greater.
- 19 . The system of claim 11 , wherein the instructions are configured to cause the one or more processors to: in the first splitting of the data set, configure each first sink to receive data from at least two first sources, and in the second splitting of the data set, configure each second sink to receive data from at least two first sinks.
- 20 . The system of claim 19 , wherein the first sinks and second sinks are divided among a plurality of independent shuffles of the second regrouping and the second splitting of the data set, wherein each independent shuffle is associated with its own respective metadata log such that each respective metadata log includes metadata for only those second sinks of its own associated independent shuffle.
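The per-shuffle metadata logs recited in claims 5, 6, 15 and 16 can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the class and function names are invented for the example, and "flushing" is modeled as simply clearing an in-memory list. The point it shows is that each independent shuffle keeps metadata for only its own second sinks, so one shuffle's log can be flushed as soon as that shuffle completes, even while sibling shuffles are still running.

```python
class IndependentShuffle:
    """One independent shuffle of the second regrouping (hypothetical sketch)."""

    def __init__(self, second_sinks):
        self.second_sinks = set(second_sinks)
        self.metadata_log = []   # metadata for only this shuffle's second sinks
        self.complete = False

    def record(self, sink, entry):
        # A shuffle's log never holds metadata for another shuffle's sinks.
        assert sink in self.second_sinks
        self.metadata_log.append((sink, entry))

    def finish(self):
        self.complete = True


def flush_completed(shuffles):
    """Flush (here: clear) the log of every completed shuffle, leaving the
    logs of still-running shuffles untouched. Returns the number flushed."""
    flushed = 0
    for s in shuffles:
        if s.complete and s.metadata_log:
            s.metadata_log.clear()
            flushed += 1
    return flushed
```

Because the logs are disjoint, flushing one shuffle's details requires no coordination with the shuffles that are still executing.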
Description
CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a continuation of U.S. patent application Ser. No. 18/198,356, filed May 17, 2023, which is a continuation of U.S. patent application Ser. No. 17/969,296, filed Oct. 19, 2022, now issued as U.S. Pat. No. 11,675,517, which is a continuation of U.S. patent application Ser. No. 17/359,810, filed Jun. 28, 2021, now issued as U.S. Pat. No. 11,513,710, which is a continuation of U.S. patent application Ser. No. 16/672,939, filed on Nov. 4, 2019, now issued as U.S. Pat. No. 11,061,596, the disclosures of which are hereby incorporated herein by reference.

BACKGROUND

The technology of the present disclosure relates generally to a system for improving the efficiency of shuffle operations involving many sinks. In a "shuffle," blocks of data from multiple sources are redistributed among multiple sinks using a distribution scheme that causes blocks of data in each source to be distributed to multiple sinks. At the end of a shuffle, each sink may include blocks from more than one source. Shuffle data is conventionally organized by its source and mapped to its corresponding source for each sink.

FIG. 1 is a functional block diagram showing an example of a shuffle operation in which blocks of data stored at sources 10 are shuffled to sinks 30. In the example of FIG. 1 there are fourteen sources and sixteen sinks. Each sink is mapped to, and receives, data from four different sources. For example, each of sinks 31 and 32 is mapped to receive shuffled data from sources 11, 12, 13 and 14. For further example, each of sinks 33 and 34 is mapped to and receives shuffled data that is mapped from sources 11 and 12 to source 13, from source 13 to source 15, and from source 15 to sinks 33 and 34. There are 64 total mappings between sink and source for the shuffle of FIG. 1: four sources for each of the sixteen sinks. Conventionally, shuffle operations may require each source to append its data to a common log.
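The mapping arithmetic for FIG. 1 can be checked with a short sketch. The helper name below is hypothetical; it simply restates the rule that each sink must be mapped to, and later scan, every source it can receive data from.

```python
def mapping_count(num_sinks: int, sources_per_sink: int) -> int:
    # Total source-to-sink mappings: every sink scans each of the
    # sources it is mapped to, so the count is the simple product.
    return num_sinks * sources_per_sink

# The FIG. 1 example: sixteen sinks, each mapped to four different sources.
print(mapping_count(16, 4))  # 64
```

Holding sources-per-sink proportional to the number of sinks, this product grows quadratically in the sink count, which is the scaling problem the following paragraph describes.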
Therefore, shuffle operations can easily scale to accommodate additional sources, and the number of operations to complete a shuffle may increase linearly as the number of sources increases. However, since the sinks receive data from multiple sources and thus are mapped to several different sources, each sink must scan all of the sources from which it may receive data. Thus, shuffle operations do not scale as easily to accommodate additional sinks, as the number of operations to complete a shuffle may increase quadratically as the number of sinks increases. As the amount of data handled in the shuffle operation increases, the data may no longer fit in a limited number of sinks, so it becomes necessary to increase the number of sinks to which the data is repartitioned.

BRIEF SUMMARY

One aspect of the present disclosure is directed to a method of repartitioning data in a distributed network. The method may include executing, by one or more processors, a first pass of a data set from a plurality of first sources to a plurality of first sinks, each first sink collecting data from one or more of the first sources, and executing, by the one or more processors, a second pass of the data set from a plurality of second sources to a plurality of second sinks, each one of the plurality of first sinks corresponding to one of the plurality of second sources, and each second sink collecting data from one or more of the second sources. Executing the first and second passes may cause the data set to be repartitioned such that one or more second sinks collect data that originated from two or more of the first sources.

In some examples, a quantity of the plurality of first sinks may be greater than a quantity of the plurality of first sources. In some examples, each first sink may collect data from two or more of the first sources. In some examples, a quantity of the plurality of second sinks may be greater than a quantity of the plurality of second sources.
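The two-pass repartitioning described above can be sketched in a few lines. This is an illustration under an assumed routing rule (block key modulo the sink count); the disclosure does not prescribe a particular distribution scheme, and the source names and block counts are invented for the example. Each block carries its originating first source so that provenance can be inspected after both passes.

```python
from collections import defaultdict

def shuffle_pass(sources, num_sinks):
    """One pass: every (origin, key) block moves from its current source to
    the sink selected by key % num_sinks. Blocks keep their original source
    label so provenance survives multiple passes."""
    sinks = defaultdict(list)
    for blocks in sources.values():
        for origin, key in blocks:
            sinks[key % num_sinks].append((origin, key))
    return sinks

# First pass: two first sources, each holding four blocks, into four first sinks.
first_sources = {s: [(s, b) for b in range(4)] for s in ("A", "B")}
first_sinks = shuffle_pass(first_sources, num_sinks=4)

# Second pass: the first sinks act as the second sources, feeding eight
# second sinks (more sinks than sources, as in the claims).
second_sinks = shuffle_pass(first_sinks, num_sinks=8)

# Each populated second sink now holds data that originated from two or more
# of the first sources, even though no single pass read every source.
origins = {sink: {o for o, _ in blocks} for sink, blocks in second_sinks.items()}
```

Because each pass only fans data out locally, no sink ever has to scan all of the original sources, which is how the multi-pass scheme avoids the quadratic scan cost of a single large shuffle.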
In some examples, the method may further include executing N passes, N being a number having a value greater than two. For each given pass, a plurality of sinks may collect data from one or more of a plurality of sources, each source corresponding to a sink of a previous pass. Executing the N passes may cause the data set to be repartitioned such that one or more Nth sinks collect data that originated from two or more of the first sources. In some examples, for at least one pass of the N passes, each sink of the pass may collect data from two or more of the sources of the pass, and each of the two or more sources of the pass may include data that originated from different sources of an immediately preceding pass. In some examples, for at least another pass of the N passes, each sink of the pass may collect data from two or more of the sources of the pass, and each of the two or more sources of the pass may include data that originated from different sources of an immediately preceding pass. In some examples, the at least one pa