CN-122021771-A - Training optimization method, CMM-DC device pool and host device
Abstract
The present disclosure provides a training optimization method, a CMM-DC device pool, and a host device. The training optimization method is executed by a CMM-DC device pool comprising a plurality of CMM-DC devices, and comprises the steps of receiving a plurality of sub-graph data partitioned by a host device, the host device being connected to the plurality of CMM-DC devices, receiving feature data corresponding to the plurality of sub-graph data, and storing the plurality of sub-graph data and the feature data separately using the plurality of CMM-DC devices.
Inventors
- TAO PENG
- ZHU NINGNING
- YAN HAO
- MA XIANGFEI
- Pei Wengui
- YIN CHENGHAO
- QIU YUANCHENG
Assignees
- Samsung (China) Semiconductor Co., Ltd.
- Samsung Electronics Co., Ltd.
Dates
- Publication Date
- 20260512
- Application Date
- 20251219
Claims (14)
- 1. A training optimization method of a graph neural network, wherein the training optimization method is performed by a CMM-DC device pool including a plurality of CMM-DC devices, the training optimization method comprising: receiving a plurality of sub-graph data partitioned by a host device, the host device being connected to the plurality of CMM-DC devices; receiving feature data corresponding to the plurality of sub-graph data; and storing the plurality of sub-graph data and the feature data separately using the plurality of CMM-DC devices.
- 2. The training optimization method of a graph neural network according to claim 1, further comprising: in response to a sampling request of a training manager in the host device, sampling the stored corresponding sub-graph data to obtain a sampling result; and in response to an aggregation request of the training manager, performing feature aggregation on the feature data of sampled nodes in the sampling result to obtain an aggregation result.
- 3. The method of claim 2, further comprising sending the sampling result and the aggregation result as batch data to a memory of the host device.
- 4. The training optimization method of a graph neural network of claim 1, wherein each CMM-DC device stores a portion of the plurality of sub-graph data and a portion of the feature data.
- 5. The training optimization method of a graph neural network of claim 1, wherein a portion of the plurality of CMM-DC devices is used to store the plurality of sub-graph data and another portion of the plurality of CMM-DC devices is used to store feature data.
- 6. A training optimization method of a graph neural network, the training optimization method being performed by a host device, the training optimization method comprising: partitioning, by a training manager of the host device, graph data and feature data corresponding to the graph data; and storing, in a memory of the host device, sampling results sampled by samplers in a plurality of CMM-DC devices in a CMM-DC device pool and aggregation results aggregated by aggregators in the plurality of CMM-DC devices, wherein the plurality of CMM-DC devices are configured to separately store the graph data and the feature data partitioned by the training manager.
- 7. The training optimization method of a graph neural network of claim 6, wherein the sampling result and the aggregation result are stored as batch data in the memory of the host device, and wherein the training optimization method further comprises: loading the sampling result and the aggregation result into a memory of a GPU device through a data loader in a CPU of the host device.
- 8. The training optimization method of a graph neural network of claim 7, further comprising: triggering the GPU device to perform training in response to the batch data being loaded into the memory of the GPU device and the GPU device being in an available state.
- 9. A training optimization method of a graph neural network, wherein the training optimization method is performed by a GPU device, the training optimization method comprising: in response to a training instruction from a training manager in a host device, performing computation of a graph neural network model using a sampling result and an aggregation result loaded from a memory of the host device, wherein the sampling result and the aggregation result are stored as batch data in the memory of the host device, the sampling result is sampled by samplers in a plurality of CMM-DC devices in a CMM-DC device pool, the aggregation result is aggregated by aggregators in the plurality of CMM-DC devices, and the plurality of CMM-DC devices are configured to separately store graph data and feature data partitioned by the training manager.
- 10. A CMM-DC device pool, characterized in that it comprises a plurality of CMM-DC devices configured to perform the training optimization method according to any of claims 1-5.
- 11. A host device comprising a processor and a non-transitory computer readable storage medium storing instructions that, when executed by the processor, cause the processor to perform the training optimization method according to any one of claims 6 to 8.
- 12. A GPU device comprising a processor and a non-transitory computer readable storage medium storing instructions that, when executed by the processor, cause the processor to perform the training optimization method of claim 9.
- 13. A computer system comprising a CMM-DC device pool according to claim 10, a host device according to claim 11, and a GPU device according to claim 12.
- 14. A non-transitory computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the training optimization method of the graph neural network according to any one of claims 1-9.
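
The following is a minimal, hypothetical sketch, not taken from the patent and not the claimed implementation, of the host-side and GPU-side flow described in claims 6 to 9: a data loader running on the host CPU copies the batch data (sampling result and aggregation result) from host memory into GPU memory, and model computation runs once the batch is resident and the GPU device is available. It assumes PyTorch; the helper names load_batch_to_gpu and train_one_batch are invented for illustration.

```python
# Illustrative sketch only (hypothetical names): host-side data loading and
# GPU-side model computation on a pre-aggregated batch, as in claims 6-9.
import torch


def load_batch_to_gpu(batch_in_host_memory: dict, device: torch.device) -> dict:
    """Data-loader role: copy batch tensors from host memory to GPU memory."""
    return {name: t.to(device, non_blocking=True)
            for name, t in batch_in_host_memory.items()}


def train_one_batch(model: torch.nn.Module,
                    optimizer: torch.optim.Optimizer,
                    batch: dict) -> float:
    """GPU-side model computation on an already-sampled, already-aggregated batch."""
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(batch["aggregated_features"]),
                                        batch["labels"])
    loss.backward()
    optimizer.step()
    return float(loss)


if __name__ == "__main__":
    gpu = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.nn.Linear(2, 1).to(gpu)
    optim = torch.optim.SGD(model.parameters(), lr=0.01)

    # Batch data assembled in host memory (random placeholders standing in for
    # the sampling result / aggregation result produced by the device pool).
    host_batch = {"aggregated_features": torch.randn(4, 2),
                  "labels": torch.randn(4, 1)}

    gpu_batch = load_batch_to_gpu(host_batch, gpu)  # performed when the GPU is available
    print("loss:", train_one_batch(model, optim, gpu_batch))
```

Because sampling and feature aggregation have already been performed in the device pool, the GPU in this sketch only executes the dense model computation on the compact batch.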
Description
Training optimization method, CMM-DC device pool and host device

Technical Field
The present disclosure relates to the field of artificial intelligence acceleration and computer storage technologies, and more particularly, to a training optimization method, a CMM-DC device pool, and a host device.

Background
Graph Neural Networks (GNNs) are a class of deep neural networks that can efficiently process and analyze graph-structured data, and are widely used for various graph-related tasks. As the size of real-world graph data increases, it becomes difficult to train large-scale GNNs in a limited memory space. Mini-batch GNN training based on distributed sampling has become a promising solution. Mini-batch GNN training mainly comprises the following steps: data partitioning, sampling, feature acquisition, data loading, aggregation, and model training.

Conventional mini-batch GNN training splits a large-scale graph or graph data into sub-graphs or sub-graph data through data partitioning, and completes sampling, feature acquisition, aggregation, and model training on a Graphics Processing Unit (GPU), providing more storage and computing resources for GNN training. However, it still has the following disadvantages: when training large-scale or massive-scale GNNs, the limitation of GPU memory space is encountered; the GPU needs to access remote graph data and feature data from graph storage to create batches for local training, which results in large-scale data movement; data I/O in sub-graph sampling and feature retrieval takes up a large portion of training time, so GPU utilization is low; and the aggregation function involves a large amount of simple computation, so the high computing performance of the GPU cannot be effectively utilized.

Optimized mini-batch GNN training performs sampling on the CPU and improves performance by preloading the feature data. On the one hand, sampling on the CPU reduces the GPU memory space occupied by sub-graphs; on the other hand, GPU utilization can be improved by preloading features into the GPU. Optimized mini-batch GNN training still has defects: although the GPU memory space occupied by CPU-sampled sub-graphs is reduced, GPU memory remains limited when training large-scale GNNs; loading the sampled graph data and feature data still causes frequent data movement on the GPU; and the simple computation in the aggregation process wastes the high computing performance of the GPU.

The above information is provided merely as background information and is not meant to constitute prior art to the present disclosure.

Disclosure of Invention
It is an object of the present disclosure to provide a training optimization method that can reduce the use of host memory and GPU memory to meet the memory requirements for GNN training on large-scale graphs. It is an object of the present disclosure to provide a training optimization method capable of reducing data preparation time and improving training performance. It is an object of the present disclosure to provide a training optimization method capable of improving the utilization of a Central Processing Unit (CPU).
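
For orientation only, the conventional mini-batch pipeline summarized above can be sketched in a few lines of plain Python; the helper names (partition, sample_neighbors, aggregate) are invented for illustration and are not taken from the patent. The sketch makes the data-preparation steps explicit: when they run on the GPU, each step implies moving graph and feature data into limited GPU memory.

```python
# Toy sketch of the conventional mini-batch GNN data-preparation steps.
import random
from typing import Dict, List

Graph = Dict[int, List[int]]       # node id -> neighbor ids
Features = Dict[int, List[float]]  # node id -> feature vector


def partition(graph: Graph, num_parts: int) -> List[Graph]:
    # Data partitioning: naive hash split of the adjacency lists into sub-graphs.
    parts: List[Graph] = [dict() for _ in range(num_parts)]
    for node, nbrs in graph.items():
        parts[node % num_parts][node] = nbrs
    return parts


def sample_neighbors(sub_graph: Graph, seeds: List[int], fanout: int) -> Graph:
    # Sampling: pick up to `fanout` neighbors per seed node of the mini-batch.
    return {s: random.sample(sub_graph[s], min(fanout, len(sub_graph[s])))
            for s in seeds if s in sub_graph}


def aggregate(sampled: Graph, feats: Features) -> Features:
    # Feature acquisition + aggregation: fetch neighbor features and mean-pool
    # them; this is the "simple computation" that under-uses a GPU.
    out: Features = {}
    for node, nbrs in sampled.items():
        vecs = [feats[n] for n in nbrs] or [feats[node]]
        out[node] = [sum(col) / len(vecs) for col in zip(*vecs)]
    return out


if __name__ == "__main__":
    graph = {i: [(i + 1) % 8, (i + 2) % 8] for i in range(8)}
    feats = {i: [float(i), 1.0] for i in range(8)}
    sub = partition(graph, 2)[0]                              # data partitioning
    batch = sample_neighbors(sub, seeds=list(sub), fanout=1)  # sampling
    print(aggregate(batch, feats))                            # batch handed to model training
```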
According to a first aspect of the present disclosure, there is provided a training optimization method of a graph neural network, the training optimization method being performed by a CMM-DC device pool including a plurality of CMM-DC devices, the training optimization method including receiving a plurality of sub-graph data partitioned by a host device, the host device being connected to the plurality of CMM-DC devices, receiving feature data corresponding to the plurality of sub-graph data, and separately storing the plurality of sub-graph data and the feature data using the plurality of CMM-DC devices.

Optionally, the training optimization method may further include, in response to a sampling request of a training manager in the host device, sampling the stored corresponding sub-graph data to obtain a sampling result, and, in response to an aggregation request of the training manager, performing feature aggregation on the feature data of sampled nodes in the sampling result to obtain an aggregation result. Optionally, the training optimization method may further include sending the sampling result and the aggregation result as batch data to a memory of the host device. Optionally, each CMM-DC device may store a portion of the plurality of sub-graph data and a portion of the feature data. Optionally, a portion of the plurality of CMM-DC devices may be used to store the plurality of sub-graph data and another portion of the plurality of CMM-DC devices may be used to store the feature data.

According to a second aspect of the present disclosure, there is provided a training optimization method of a graph neural network, the training optimization method being performed by a host device, the training optimization method including partitioning, by a training manager of the host device, graph data and feature data corresponding to the graph data, and storing, in a memory of the host device, sampling results sampled by samplers in a plurality of CMM-DC devices in a CMM-DC device pool and aggregation results aggregated by aggregators in the plurality of CMM-DC devices, wherein the plurality of CMM-DC devices are configured to separately store the graph data and the feature data partitioned by the training manager.
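
As an illustration of the device-pool side described in the first aspect above, the sketch below models the pool with plain Python objects; FakeCmmDcDevice, FakeDevicePool, and their methods are hypothetical names, not APIs from the disclosure. Each device stores a shard of the sub-graph data and the feature data needed near it, answers sampling and aggregation requests next to that data, and the pool returns the sampling result and the aggregation result as batch data destined for host memory.

```python
# Illustrative sketch only: a toy, in-memory stand-in for the device-pool side.
import random
from typing import Dict, List


class FakeCmmDcDevice:
    """Stand-in for one CMM-DC device holding a sub-graph shard plus features."""

    def __init__(self) -> None:
        self.sub_graph: Dict[int, List[int]] = {}   # node id -> neighbor ids
        self.features: Dict[int, List[float]] = {}  # node id -> feature vector

    def sample(self, seeds: List[int], fanout: int) -> Dict[int, List[int]]:
        # Sampler: runs next to the stored sub-graph instead of on the GPU.
        out = {}
        for s in seeds:
            nbrs = self.sub_graph.get(s)
            if nbrs:
                out[s] = random.sample(nbrs, min(fanout, len(nbrs)))
        return out

    def aggregate(self, sampled: Dict[int, List[int]]) -> Dict[int, List[float]]:
        # Aggregator: mean of neighbor features for each sampled node.
        out = {}
        for node, nbrs in sampled.items():
            vecs = [self.features[n] for n in nbrs if n in self.features]
            if vecs:
                out[node] = [sum(col) / len(vecs) for col in zip(*vecs)]
        return out


class FakeDevicePool:
    """Receives partitioned data from the host and serves batch requests."""

    def __init__(self, num_devices: int) -> None:
        self.devices = [FakeCmmDcDevice() for _ in range(num_devices)]

    def load_partitions(self, sub_graphs: List[Dict[int, List[int]]],
                        features: Dict[int, List[float]]) -> None:
        # Each device stores a portion of the sub-graph data and the features
        # needed to aggregate over that portion (one possible layout).
        for dev, sg in zip(self.devices, sub_graphs):
            needed = set(sg) | {n for nbrs in sg.values() for n in nbrs}
            dev.sub_graph = sg
            dev.features = {n: features[n] for n in needed if n in features}

    def make_batch(self, seeds: List[int], fanout: int):
        sampling_result: Dict[int, List[int]] = {}
        aggregation_result: Dict[int, List[float]] = {}
        for dev in self.devices:
            sampled = dev.sample(seeds, fanout)
            sampling_result.update(sampled)
            aggregation_result.update(dev.aggregate(sampled))
        # In the described method these results go to host memory as batch data.
        return sampling_result, aggregation_result


if __name__ == "__main__":
    pool = FakeDevicePool(num_devices=2)
    graph = {i: [(i + 1) % 6, (i + 2) % 6] for i in range(6)}
    feats = {i: [float(i), 1.0] for i in range(6)}
    pool.load_partitions([{n: graph[n] for n in (0, 2, 4)},
                          {n: graph[n] for n in (1, 3, 5)}], feats)
    print(pool.make_batch(seeds=[0, 1, 2], fanout=1))
```

In this sketch, a host-side training manager would call load_partitions once after partitioning and make_batch per training iteration, so only the compact batch data, rather than raw graph and feature data, has to move toward the GPU.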