CN-116090527-B - Method and product for evaluating performance of a distributed training system in a single node environment
Abstract
The present disclosure describes a method for evaluating performance of a distributed training system in a single node environment. The method may be implemented in a computing device, which may be located in a combined processing device that may also include a universal interconnect interface and other processing devices. The computing device interacts with the other processing devices to jointly complete the computing operation designated by the user. The combined processing device may further comprise a storage device connected to the computing device and the other processing devices, respectively, for storing data of the computing device and the other processing devices.
Inventors
- Request for anonymity
- Request for anonymity
- Request for anonymity
- Request for anonymity
Assignees
- 中科寒武纪科技股份有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20211029
Claims (14)
- 1. A method for evaluating performance of a distributed training system in a single node environment, wherein the distributed training system comprises a plurality of nodes, the method comprising: determining a type of operation involved in the distributed training system; determining a time overhead for each operation based on the type of operation; and determining the performance of the distributed training system according to the time overhead of each operation; wherein the types of operations include computing operations and multi-node operations, and wherein determining the time overhead for each operation based on the type of operation includes: in response to the type of operation being a computing operation, directing the computing operation to be run in a single node to determine a computation time overhead of the computing operation; and in response to the type of operation being a multi-node operation, determining a communication time overhead and a communication alignment time overhead for communication by simulating the multi-node operation in the distributed training system.
- 2. The method of claim 1, wherein the computation time overhead of the computing operation is determined by: dividing the data quantity of a single iteration of the distributed training system by the number of the plurality of nodes to obtain a single-node calculation data quantity, and calculating a single calculation time overhead for the single-node calculation data quantity; and determining the computation time overhead of the computing operation according to the single calculation time overhead and the number of iterations of the distributed training system.
- 3. The method of claim 1, wherein the communication alignment time overhead is a difference between a first communication start time of a first node invoking a communication operation and a second communication start time of a last node invoking the communication operation in the distributed training system, and wherein determining the communication alignment time overhead by simulating the multi-node operation in the distributed training system includes taking a random number in a Gaussian distribution as the communication alignment time.
- 4. A method according to claim 3, wherein the communication alignment time overhead is a random number within a specific time range, the specific time range being 0-2 ms.
- 5. The method of claim 1, wherein the communication time overhead is a function of a communication initiation time overhead, a number of nodes, an inherent time overhead of cross-node communication, a data size, and a transmission bandwidth.
- 6. The method of claim 5, wherein when the distributed training system performs Broadcast communication over a ring topology, the function is: communication time overhead = communication start time overhead + (number of nodes - 1) × (inherent time overhead + data size / transmission bandwidth).
- 7. The method of claim 5, wherein when the distributed training system performs AllReduce communication over a ring topology, the function is: communication time overhead = communication start time overhead + 2 × (number of nodes - 1) × (inherent time overhead + data size / (number of nodes × transmission bandwidth)).
- 8. The method of claim 5, wherein when the distributed training system performs AllReduce communication via a binary tree topology, the function is: communication time overhead = communication initiation time overhead + 2 × log2(number of nodes) × (inherent time overhead + data size / transmission bandwidth).
- 9. The method of claim 1, wherein the communication time overhead is a function of a communication initiation time overhead, a number of data slices, a number of nodes, an inherent time overhead of cross-node communication, a data size, and a transmission bandwidth.
- 10. The method of claim 1, wherein the time overhead of multi-node operation in the distributed training system is emulated by a sleep program.
- 11. The method of claim 10, wherein multi-node operation in the distributed training system is simulated by the sleep program occupying respective resources of a single node.
- 12. The method of any of claims 1-11, wherein the computing operations and multi-node operations are provided in an operation queue in a single node, such that by running the operation queue, performance of the distributed training system is simulated in terms of the time overhead of each operation.
- 13. An electronic device, comprising: one or more processors; and a memory having stored therein computer-executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method of any of claims 1-12.
- 14. A computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, perform the method of any of claims 1-12.
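
The following Python sketch is not part of the patent text; it only illustrates the cost model stated in claims 2-8 and the sleep-based emulation of claims 10-11. All function names, parameter names, and the numeric values in the example are hypothetical and merely encode the relationships given in the claims.

```python
import math
import random
import time

def compute_time_overhead(single_calc_time_s: float, iterations: int) -> float:
    """Claim 2: computation overhead = time of one single-node calculation
    (measured on 1/N of the data of one iteration) multiplied by the number of iterations."""
    return single_calc_time_s * iterations

def alignment_time_overhead(max_skew_s: float = 0.002) -> float:
    """Claims 3-4: skew between the first and last node invoking the communication
    operation, drawn here as a random number in the range 0-2 ms."""
    return random.uniform(0.0, max_skew_s)

def ring_broadcast_time(startup_s, nodes, inherent_s, data_bytes, bandwidth_bytes_per_s):
    """Claim 6: Broadcast over a ring topology."""
    return startup_s + (nodes - 1) * (inherent_s + data_bytes / bandwidth_bytes_per_s)

def ring_allreduce_time(startup_s, nodes, inherent_s, data_bytes, bandwidth_bytes_per_s):
    """Claim 7: AllReduce over a ring topology (each step moves data / number of nodes)."""
    return startup_s + 2 * (nodes - 1) * (inherent_s + data_bytes / (nodes * bandwidth_bytes_per_s))

def tree_allreduce_time(startup_s, nodes, inherent_s, data_bytes, bandwidth_bytes_per_s):
    """Claim 8: AllReduce over a binary tree topology."""
    return startup_s + 2 * math.log2(nodes) * (inherent_s + data_bytes / bandwidth_bytes_per_s)

if __name__ == "__main__":
    # Hypothetical figures: 8 nodes, 100 MB of gradients, 10 GB/s links,
    # 5 us startup overhead, 20 us inherent per-hop overhead.
    t_comm = ring_allreduce_time(5e-6, 8, 2e-5, 100e6, 10e9)
    t_multi_node_op = t_comm + alignment_time_overhead()
    # Claims 10-11: in the single-node run, the multi-node operation can be
    # emulated simply by sleeping for its estimated overhead.
    time.sleep(t_multi_node_op)
    print(f"simulated AllReduce overhead: {t_multi_node_op * 1e3:.2f} ms")
```

In the operation queue of claim 12, an estimate such as t_multi_node_op (or the corresponding sleep call of claims 10-11) would stand in for each multi-node operation, while computing operations are timed by actually running them on the single node.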
Description
Method and product for evaluating performance of a distributed training system in a single node environment

Technical Field

The present disclosure relates to the field of testing, and more particularly, to performance testing of neural networks.

Background

Scalability is an important performance indicator for multi-node distributed training of deep neural network models. To calculate the scalability of the distributed training of a deep neural network, it is necessary to know the time for the network to train to error convergence in a single-node hardware environment (hereinafter referred to as the first training time T1), and the second training time T2 of the network for training the same number of iterations in a multi-node hardware environment using the same data and parameters. By definition, scalability = T1/(T2 × M), where M is the number of nodes. The nodes described herein may be accelerators, processors, accelerator cards, etc. used in neural network training. Ideally, if the first training time of a single node is T1, then the second training time of M nodes is T2 = T1/M, so the scalability of the system is ideally 1. In practice, however, the second training time of M nodes cannot reach the ideal T2 = T1/M, but is higher than T1/M, because of system start-up time, communication time, and the like; therefore, the closer the scalability of the system is to 1, the better the scalability of the system.

There are various methods of testing scalability. One is to actually measure the training time of a neural network of a specified depth by running the program in a multi-node hardware environment. Another is to use performance analysis tool software to capture a timeline of the network training while the network is trained on a single node, and then insert the communication overheads into the timeline based on the measured performance of the communication library software in the multi-node hardware environment, thereby estimating the training time of the network in the multi-machine, multi-card hardware environment. Both of the above methods require actually running programs in a multi-node hardware environment. This means that when hardware resources are insufficient (e.g., the required multi-node hardware environment is not available), the scalability of the distributed training of the deep neural network cannot be obtained. In addition, for the second method, relying on manually locating the communication activity on the timeline may lead to prediction accuracy problems and inefficiency.
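
As a purely illustrative calculation of the scalability definition above (the numbers are hypothetical and not from the patent):

```python
# Hypothetical example of scalability = T1 / (T2 * M).
T1 = 100.0  # single-node training time to convergence, in hours
M = 4       # number of nodes; the ideal multi-node time would be T1 / M = 25.0 hours
T2 = 27.0   # measured multi-node training time for the same number of iterations, in hours
print(T1 / (T2 * M))  # about 0.926, close to (but below) the ideal scalability of 1
```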
Disclosure of Invention

To at least partially address the technical problems noted in the background, aspects of the present disclosure provide a method and related products for evaluating performance of a distributed training system in a single node environment.

In one aspect, the present disclosure provides a method for evaluating performance of a distributed training system in a single node environment, wherein the distributed training system includes a plurality of nodes, the method comprising: determining a type of operation involved in the distributed training system; determining a time overhead for each operation based on the type of operation; and determining the performance of the distributed training system based on the time overhead of each operation.

In another aspect, an electronic device is provided that includes one or more processors and a memory having stored therein computer-executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method as described above. In yet another aspect, a computer-readable storage medium is provided, comprising computer-executable instructions that, when executed by one or more processors, perform the method as described above.

One benefit of the present disclosure is the ability to test the scalability of multi-node distributed training of a deep neural network system using only a single node (e.g., a processor).

Drawings

The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar or corresponding parts, and in which:

FIGS. 1a and 1b illustrate schematic diagrams of a multi-core processor according to one embodiment of the present disclosure;

FIG. 1c illustrates a method flow diagram for evaluating performance of a distributed training system in a single node environment, according to one embodiment of the present disclosure;

FIG. 2 illustrates a schematic diagram of a communication alignment operation;

FIG. 3 illustrates a flow chart for determining tim