CN-117194690-B - Method and device for distributing graph data in distributed system

CN117194690BCN 117194690 BCN117194690 BCN 117194690BCN-117194690-B

Abstract

The embodiment of the specification relates to a method and a device for distributing graph data in a distributed system, wherein the distributed system comprises N-N pieces of working equipment arranged in an N-dimensional equipment matrix, the method is executed by any first piece of working equipment, the method comprises the steps of obtaining node data of a plurality of nodes in first graph data and a plurality of connecting sides, determining target numbers of the nodes according to the node data of any node by using a preset mapping function, then for any target connecting side, mapping two target numbers of two connected target nodes into row indexes and column indexes in the N-dimensional equipment matrix respectively by using a first mapping rule, determining target working equipment in the equipment matrix according to the row indexes and the column indexes, and then sending relevant data of the target connecting side and the two target nodes to the target working equipment.

Inventors

Wan Xiaopei
JING BIN
ZHU ZHONGSHU

Assignees

支付宝(杭州)信息技术有限公司

Dates

Publication Date: 20260512
Application Date: 20230927

Claims (16)

1. A method for distributing graph data in a distributed system, wherein the distributed system comprises N x N pieces of working equipment arranged in an N x N dimension equipment matrix, and any two pieces of working equipment can communicate with each other, the method is executed by any first working equipment, and the method comprises the following steps: acquiring node data of a plurality of nodes and edge data of a plurality of connecting edges in the first graph data, wherein the edge data indicates two nodes connected with the corresponding connecting edges; determining a target number of any node by using a preset mapping function according to node data of any node, wherein the mapping function is used for mapping the node data into shaping data; For any target connection edge, mapping two target numbers of two target nodes in edge data of the arbitrary target connection edge into a row index and a column index in an N-to-N-dimensional equipment matrix respectively by using a first mapping rule, and determining target working equipment in the equipment matrix according to the row index and the column index; and sending the related data of the target connecting edge and the two target nodes to the target working equipment.
2. The method of claim 1, wherein the first map data is a portion of map data randomly assigned to the first work device from among full map data.
3. The method of claim 1, wherein the mapping function is a hash function.
4. The method of claim 1, further comprising: And receiving node and connection edge data sent by other working equipment, and storing the node and connection edge data as second graph data.
5. The method of claim 2, further comprising: Determining an unlabeled node from the current second graph data as a seed node, and marking the seed node; Performing a plurality of iterations until all nodes in the second graph data are marked, wherein any one iteration comprises: Using the first mapping rule, mapping the target number of the seed node into a first index number, and determining a plurality of working devices from the device matrix according to the first index number, wherein the plurality of working devices comprise working devices in a row indicated by the first index number and working devices in a column indicated by the first index number; sending a first request to the plurality of working devices, wherein the first request is used for requesting to acquire untagged neighbor nodes and corresponding connecting edges of the seed node; storing the nodes received from the plurality of working devices into the second graph data and marking the second graph data; at least one node is determined from the received nodes as a new seed node.
6. The method of claim 1, wherein the node data includes an initial number and a type attribute of a node, and determining the target number of the node using a preset mapping function according to the node data of any node comprises: performing character string splicing on the initial number and the type attribute to obtain a first character string; and inputting the first character string into the mapping function to obtain a target number.
7. The method of claim 1, wherein the target numbers of the two target nodes are a first target number and a second target number, respectively, and the first target number is smaller than the second target number, wherein mapping the two target numbers of the two target nodes in the edge data into row indexes and column indexes in the N x N dimensional device matrix, respectively, using a first mapping rule, comprises: Mapping the first target number to the row index using a first mapping rule; the second target number is mapped to the column index using a first mapping rule.
8. The method of claim 1, wherein the first mapping rule is to modulo N the corresponding target number.
9. The method of claim 5, further comprising: each time the first working device performs an iteration, it performs data synchronization with other working devices.
10. The method of claim 9, wherein the data synchronization is accomplished through an information transfer interface MPI.
11. An apparatus for graph data distribution in a distributed system, where the distributed system includes n×n pieces of working equipment arranged in an n×n dimensional device matrix, and any two pieces of working equipment can communicate with each other, and the apparatus is disposed on any first working equipment, where the apparatus includes: The device comprises an acquisition unit, a first graph data acquisition unit and a second graph data acquisition unit, wherein the acquisition unit is configured to acquire node data of a plurality of nodes and edge data of a plurality of connecting edges, and the edge data indicates two nodes connected by the corresponding connecting edges; a first determining unit configured to determine, according to node data of any node, a target number of the node using a preset mapping function, where the mapping function is used to map the node data into shaping data; The mapping unit is configured to map two target numbers of two target nodes in the edge data of any target connecting edge into a row index and a column index in an N-dimensional equipment matrix respectively by using a first mapping rule, and determine target working equipment in the equipment matrix according to the row index and the column index; and the sending unit is configured to send the related data of the target connecting edge and the two target nodes to the target working equipment.
12. The apparatus of claim 11, further comprising: And the receiving unit is configured to receive node and connection side data sent by other working equipment and store the node and connection side data as second graph data.
13. The apparatus of claim 12, further comprising: a second determining unit configured to determine an unlabeled node from the current second graph data as a seed node and to label it; An iteration unit configured to perform a plurality of iterations until all nodes in the second graph data are marked, wherein any one iteration comprises: Using the first mapping rule, mapping the target number of the seed node into a first index number, and determining a plurality of working devices from the device matrix according to the first index number, wherein the plurality of working devices comprise working devices in a row indicated by the first index number and working devices in a column indicated by the first index number; sending a first request to the plurality of working devices, wherein the first request is used for requesting to acquire untagged neighbor nodes and corresponding connecting edges of the seed node; storing the nodes received from the plurality of working devices into the second graph data and marking the second graph data; at least one node is determined from the received nodes as a new seed node.
14. The apparatus of claim 13, further comprising: And the data synchronization unit is configured to perform data synchronization with other working devices once every time the first working device performs one round of iteration.
15. A computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any of claims 1-10.
16. A computing device comprising a memory and a processor, wherein the memory has executable code stored therein, which when executed by the processor, implements the method of any of claims 1-10.

Description

Method and device for distributing graph data in distributed system Technical Field One or more embodiments of the present disclosure relate to the field of graph processing, and in particular, to a method and apparatus for graph data distribution in a distributed system. Background In recent years, application of graph data, such as a knowledge graph, has been widely used in various scenes, and along with expansion of application scenes, the scale of graph data has also been rapidly increased, and the limit of a single machine memory has been exceeded, which brings challenges to downstream applications such as graph calculation or graph deep learning. At present, a traditional strategy in the related field adopts a distributed scheme, large-scale map data is segmented into a plurality of pieces through map segmentation, each machine loads one piece of map data, and calculation is performed in a distributed mode. In the distributed scheme, the quality of the slicing (the connectivity of the graph inside a single slice and the repeatability of the graph data among different slices) determines the memory occupation, load balancing and efficiency in the downstream task, and the conventional slicing scheme cannot meet the requirements, so that a more efficient graph slicing method needs to be designed. Disclosure of Invention One or more embodiments of the present disclosure describe a method and apparatus for graph data allocation in a distributed system, which aims to balance memory load among multiple working devices, reduce data redundancy, and improve operation efficiency. In a first aspect, a method for distributing graph data in a distributed system, where the distributed system includes n×n working devices arranged as an n×n device matrix, the method is performed by any first working device, and includes: acquiring node data of a plurality of nodes and edge data of a plurality of connecting edges in the first graph data, wherein the edge data indicates two nodes connected with the corresponding connecting edges; determining a target number of any node by using a preset mapping function according to node data of any node, wherein the mapping function is used for mapping the node data into shaping data; For any target connection edge, mapping two target numbers of two target nodes in edge data of the arbitrary target connection edge into a row index and a column index in an N-to-N-dimensional equipment matrix respectively by using a first mapping rule, and determining target working equipment in the equipment matrix according to the row index and the column index; and sending the related data of the target connecting edge and the two target nodes to the target working equipment. In one possible embodiment, the first map data is part of map data randomly allocated to the first working device from among the whole map data. In one possible implementation, the mapping function is a hash function. In one possible embodiment, the method further comprises: And receiving node and connection edge data sent by other working equipment, and storing the node and connection edge data as second graph data. In one possible embodiment, the method further comprises: Determining an unlabeled node from the current second graph data as a seed node, and marking the seed node; Performing a plurality of iterations until all nodes in the second graph data are marked, wherein any one iteration comprises: Using the first mapping rule, mapping the target number of the seed node into a first index number, and determining a plurality of working devices from the device matrix according to the first index number, wherein the plurality of working devices comprise working devices in a row indicated by the first index number and working devices in a column indicated by the first index number; sending a first request to the plurality of working devices, wherein the first request is used for requesting to acquire untagged neighbor nodes and corresponding connecting edges of the seed node; storing the nodes received from the plurality of working devices into the second graph data and marking the second graph data; at least one node is determined from the received nodes as a new seed node. In one possible implementation, the node data includes an initial number and a type attribute of a node, and determining the target number of the node according to the node data of any node by using a preset mapping function includes: performing character string splicing on the initial number and the type attribute to obtain a first character string; and inputting the first character string into the mapping function to obtain a target number. In one possible implementation manner, the target numbers of the two target nodes are a first target number and a second target number, and the first target number is smaller than the second target number, and the two target numbers of the two target nodes in the edge data are mapped into a row index and a column index