
CN-116126777-B - Network design method of multi-core integrated system based on network calculation

CN116126777B

Abstract

The invention discloses a network design method for a multi-core integrated system based on in-network computing, in which an arithmetic unit is integrated into the network unit so that the network layer of the multi-core integrated system performs computation. The method comprises router micro-architecture design, task mapping, and hardware-software co-design. The router micro-architecture design covers the hardware implementation, flow control, pipeline design and arbitration design of the router, migrating part of the aggregation computation from the processors into the network for processing. Task mapping maps the tasks of an application dataflow graph onto the multi-core integrated system. Hardware-software co-design determines the optimal hardware and software configuration jointly, with the goal of minimizing application running time. The invention performs computation during data transmission, reducing the volume of data transmitted over the network and the communication delay, thereby accelerating the multi-core integrated system.

Inventors

  • Xu Jiaen
  • Wang Xiaohang
  • Li Shunbin
  • Deng Qingwen
  • Sun Tianning
  • Wan Zhiquan
  • Wang Zhiyu

Assignees

  • Zhejiang Lab (之江实验室)
  • South China University of Technology (华南理工大学)

Dates

Publication Date
2026-05-08
Application Date
2022-12-14

Claims (9)

  1. A network design method for a multi-core integrated system based on in-network computing, characterized by comprising the following steps: (1) router micro-architecture design: carrying out the hardware implementation, pipeline design, flow control and arbitration design of a router, and adding a parser, an arithmetic unit and a FIFO buffer queue to the router; (2) before the neural network distributed training application runs, generating a corresponding system dataflow graph, generating an allocation strategy based on the dataflow graph, and mapping the tasks in the dataflow graph onto the processor resources of the multi-core integrated system according to the allocation strategy; (2.1) designating a processor in the multi-core integrated system as the resource manager, used for computing the optimal allocation strategy and distributing tasks; (2.2) initializing a dataflow graph G = (V, E, W) according to the task partitioning of the neural network distributed training application, wherein V is the set of graph nodes, E the set of directed edges and W the set of edge weights; a graph node v_i represents a task, and for tasks v_i and v_j with a computational dependency, a directed edge e_ij from v_i to v_j is constructed whose weight w_ij is the traffic transmitted from v_i to v_j; the dataflow graph G contains two virtual vertices v_s and v_t, v_s serving as the start point connected to the first-layer tasks and v_t as the end point connected to the last-layer tasks; defining a resource set R, sequentially traversing the processor resources contained in each die of the multi-core integrated system and adding them to R, wherein each element of R comprises the position information of a processor resource in the multi-core integrated system; (2.3) traversing the dataflow graph G obtained in step (2.2) in breadth-first-search order, sequentially appending the traversed nodes to a task queue Q, and assigning the first task of Q to the resource manager; (2.4) sequentially traversing the task queue Q obtained in step (2.3): for the current task, selecting from the resource set R of step (2.2) the processor resource that is in the idle state and has the shortest distance to the processor resource allocated to the previous task, as the processor resource assigned to the current task; because both the inter-die and intra-die networks of the multi-core integrated system are mesh structures, several processor resources may be equidistant at the shortest distance; these equidistant closest processor resources are stored as a resource set R', and the processor resources in R' are selected in sequence as the assignment targets for the tasks; (2.5) continuing to traverse the task queue Q and repeating step (2.4) until all tasks in Q have been traversed, the task allocation strategy M being generated during the traversal; (2.6) based on the allocation strategy M of step (2.5), minimizing the application running time T, i.e. finding the combination (M, c) with the least running time among a plurality of combinations, wherein c is the configuration of the arithmetic units in the router; T can be expressed as the sum, along the path from v_s to v_t, of the execution time of the tasks on the processors and the communication time in the network; based on the task allocation strategy M, a linear regression model is established and prediction modules for the execution time and the communication time are built; the prediction model is obtained by linear fitting of historical running data, and the combinations are traversed to obtain the minimum application running time and the optimal allocation strategy; (3) hardware-software co-design: constructing an error model and a performance model based on the task mapping scheme of step (2), and minimizing the running time of the neural network distributed training application so as to obtain the optimal hardware and software configuration at the minimum running time.
  2. The network design method for a multi-core integrated system based on in-network computing according to claim 1, wherein a parser, an arithmetic unit and a FIFO buffer queue are added to the router; the parser is a comparator used to judge the in-network-computing flag bit in the header of a data packet; the arithmetic unit comprises an arithmetic logic unit, a buffer area and a register, the arithmetic logic unit being a plurality of parallel single-precision/double-precision floating-point adders used to compute exponents and mantissas in parallel; the buffer area is used to store the input data of the arithmetic logic unit; the register is used to store the source address, the destination address and the in-network-computing flag bit of the packet header; and the FIFO buffer queue is used to store the data packets output by the arithmetic unit.
  3. The network design method for a multi-core integrated system based on in-network computing according to claim 2, wherein the router pipeline design in step (1) is specifically: a network-computing pipeline stage is added before the routing stage; after a data packet enters the router it first enters the network-computing stage; if the packet requires in-network computing, it is forwarded to the arithmetic unit for computation; if not, it enters the next pipeline stage for routing.
  4. The network design method for a multi-core integrated system based on in-network computing according to claim 3, wherein the router flow control in step (1) is specifically: when a data packet arrives at the router, the parser compares the in-network-computing flag bit of the packet header at the input port with a preset flag bit; if they are equal, the whole packet is split, the packet header is stored in the register of the arithmetic unit, and the data part is stored in the buffer area inside the arithmetic unit; after the data part enters the buffer area, a judgment is triggered: when more than one data item is present in the buffer area, the data in the buffer area are sent to the arithmetic logic unit for floating-point addition; when exactly one data item is present, the router waits for the next data item to be computed to enter the buffer area.
  5. The network design method for a multi-core integrated system based on in-network computing according to claim 2, wherein the router arbitration design in step (1) is: the number of input ports of the crossbar switch in the router is increased from the original five to six; the arbitration component of the router uses priority arbitration; each input port is polled according to a priority arbitration algorithm, and the weight of every port with a packet-forwarding request is determined; after polling finishes, the port with the largest weight is determined by a comparator as the input to be processed by the crossbar switch, and is output after arbitration.
  6. The network design method for a multi-core integrated system based on in-network computing according to claim 1, wherein step (3) comprises the following sub-steps: (3.1) based on the task allocation strategy obtained in step (2.6), acquiring the aggregated gradient values of the neural network during distributed training on the multi-core integrated system; based on the current neural network model structure, using a set H of hardware configuration combinations comprising different types and numbers of arithmetic logic units, modeling the relation between the accuracy loss rate and the neural network quality loss as an error model; the relation between the accuracy loss value and the model quality loss value is linearly fitted by an error model trained offline; (3.2) based on the error model obtained in step (3.1) and the pruned neural network, building a performance model such that the training time T is minimized while the quality loss value remains below a threshold and the increase in router area after adding the in-network-computing function is less than ten percent of the original router area; (3.3) selecting the different hardware configurations obtained in step (3.1), substituting the pruned neural network model into the running-time model obtained in step (3.2), and using a branch-and-bound method to obtain the optimal hardware configuration and the corresponding neural network model at the minimum running time.
  7. The network design method for a multi-core integrated system based on in-network computing according to claim 6, wherein step (3.1) comprises the following sub-steps: (3.1.1) before each round of gradient aggregation, marking the gradients of the data to be aggregated and updated, and setting an initial precision-reduction rate s, so that during training the gradients of the neural network can be sorted by absolute value and the precision of s% of the gradients can be reduced; (3.1.2) selecting a small batch of the gradients marked in step (3.1.1), sorting them from small to large, finding the absolute value corresponding to the gradient ranked at s%, and setting this absolute value as the global threshold; (3.1.3) comparing the absolute values of the remaining gradients with the threshold obtained in step (3.1.2); when the absolute value of a gradient is smaller than the threshold, the precision of that gradient is reduced; (3.1.4) taking the difference between the training result obtained after the same number of iterations and the full-precision training result as the quality loss value; after each iteration the corresponding quality loss is obtained, and s is gradually increased until the quality loss exceeds the acceptable range; the current model within the acceptable error range is then taken as the optimal software model.
  8. The network design method for a multi-core integrated system based on in-network computing according to claim 1, wherein the multi-core integrated system is a hybrid system packaged from CPU, GPU and LLC dies, the LLC being used for information transfer between the CPU and the GPU; when writing data, the CPU updates the data to the LLC, and when accessing data, the GPU directly accesses the data previously written by the CPU.
  9. The network design method for a multi-core integrated system based on in-network computing according to claim 1, wherein the neural network distributed training application uses multiple machines to realize distributed computation when training a neural network on a large data set, partitioning the model or the training data set in a model-parallel or data-parallel manner.
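Steps (2.2)-(2.5) of claim 1 amount to a breadth-first traversal of the dataflow graph followed by nearest-idle-resource placement on the mesh. The sketch below illustrates that procedure under stated assumptions: the function name `map_tasks`, the dict-based graph/resource representations, and the use of Manhattan distance as the mesh hop count are illustrative choices, not taken from the patent.

```python
from collections import deque

def map_tasks(graph, resources, manager):
    """Sketch of steps (2.2)-(2.5): BFS the dataflow graph, then place
    each task on the idle mesh resource nearest the previous task's.

    graph:     adjacency dict {task: [successors]} rooted at "v_s".
    resources: dict {resource_id: (x, y)} of mesh positions.
    manager:   resource_id of the resource-manager processor.
    """
    # (2.3) breadth-first traversal builds the task queue Q.
    queue, order, seen = deque(["v_s"]), [], {"v_s"}
    while queue:
        node = queue.popleft()
        order.append(node)
        for succ in graph.get(node, []):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)

    def dist(a, b):                 # Manhattan distance = mesh hop count
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    idle = dict(resources)          # resources not yet assigned
    allocation, prev_task = {}, None
    for task in order:
        if prev_task is None:       # first task goes to the manager
            allocation[task] = manager
            idle.pop(manager, None)
        else:
            prev_pos = resources[allocation[prev_task]]
            # (2.4) nearest idle resource; equidistant ties (the set R')
            # are broken here simply by iteration order.
            best = min(idle, key=lambda r: dist(resources[r], prev_pos))
            allocation[task] = best
            idle.pop(best)
        prev_task = task
    return allocation
```

For example, mapping the chain v_s → a → b onto a three-resource mesh with the manager at (0, 0) places each task on the free resource one hop from its predecessor.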
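The precision-reduction procedure of claim 7, steps (3.1.1)-(3.1.3), sorts a small sample of gradients by magnitude, takes the absolute value at the s% rank as a global threshold, and then flags gradients below it for reduced precision. A minimal sketch follows; the function name `precision_mask`, the list-based interface, and returning a boolean mask (rather than actually down-converting the values) are assumptions for illustration.

```python
def precision_mask(gradients, sample, s):
    """Sketch of steps (3.1.1)-(3.1.3): flag gradients whose precision
    may be reduced, using the s%-rank magnitude of a small sample as a
    global threshold.

    gradients: list of gradients awaiting aggregation.
    sample:    small batch of marked gradients (step 3.1.2).
    s:         precision-reduction rate in percent (step 3.1.1).
    """
    ranked = sorted(abs(g) for g in sample)      # sort by magnitude
    k = max(int(len(ranked) * s / 100) - 1, 0)   # index of the s% rank
    threshold = ranked[k]                        # global threshold
    # (3.1.3) gradients strictly below the threshold are down-converted.
    return [abs(g) < threshold for g in gradients]
```

With s = 50 on the sample [0.05, -0.2, 0.8, -0.01], the threshold is 0.05, so only the gradient -0.01 is flagged for precision reduction; step (3.1.4) would then grow s until the quality loss becomes unacceptable.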

Description

Network design method of multi-core integrated system based on in-network computing

Technical Field

The invention relates to the field of integrated circuits, and in particular to a network design method for a multi-core integrated system based on in-network computing.

Background

A multi-chiplet integrated system integrates several chips (dies) realizing specific functions into one system chip through advanced packaging technology. Compared with the traditional monolithic integration approach, multi-chiplet integration has advantages and potential in chip performance, power-consumption optimization, cost and business model. Deep neural networks (DNNs) are deployed on multi-core integrated systems by partitioning the training data set or the model. Each chiplet processes a different subset of the training data in parallel to train a local copy of the model; after every chiplet finishes, the training results are gathered, the parameters are aggregated and broadcast to each node to update the local models, and the next iteration begins. The communication bandwidth between chiplets in a multi-core integrated system is limited, and network communication across chiplets takes longer than communication inside a chiplet. In the parameter-aggregation stage of deep neural network training, a low network transmission rate easily causes network congestion and increases network delay, which reduces overall performance; the communication performance of the network therefore becomes the performance bottleneck of training.
Existing methods such as Ring AllReduce, Hierarchical AllReduce and Double Binary Tree address the communication bottleneck of distributed deep neural network training, but they do not directly reduce the total traffic. Meanwhile, in the prior art, in-network-computing accelerators are mostly integrated in switches: a switch handles data communication among several ports, and integrating an arithmetic unit in the switch to aggregate data before forwarding reduces the amount of data transmitted. However, a switch may be connected to several local area networks simultaneously; in that case the granularity of network traffic is large, which leads to a low degree of data aggregation and a poor improvement in application performance.

Disclosure of the Invention

In view of the shortcomings of the prior art, the invention aims to provide a network design method for a multi-core integrated system based on in-network computing.
The aim of the invention is achieved by the following technical scheme: a network design method for a multi-core integrated system based on in-network computing comprises the following steps: (1) router micro-architecture design: carrying out the hardware implementation, pipeline design, flow control and arbitration design of a router, and adding a parser, an arithmetic unit and a FIFO buffer queue to the router; (2) before the neural network distributed training application runs, generating a corresponding system dataflow graph, generating an allocation strategy based on the dataflow graph, and mapping the tasks in the dataflow graph onto the processor resources of the multi-core integrated system according to the allocation strategy; (3) hardware-software co-design: constructing an error model and a performance model based on the task mapping scheme of step (2), and minimizing the running time of the neural network distributed training application so as to obtain the optimal hardware and software configuration at the minimum running time.

The router hardware in step (1) is realized by adding a parser, an arithmetic unit and a FIFO buffer queue to the router. The parser is a comparator used to judge the in-network-computing flag bit in the header of a data packet. The arithmetic unit comprises an arithmetic logic unit, a buffer area and a register; the arithmetic logic unit is a plurality of parallel single-precision/double-precision floating-point adders used to compute exponents and mantissas in parallel; the buffer area stores the input data of the arithmetic logic unit; the register stores the source address, the destination address and the in-network-computing flag bit of the packet header. The FIFO buffer queue stores the data packets output by the arithmetic unit.
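The flow-control behaviour described above (compare the packet's flag bit against a preset value, split the packet into the register and buffer area, and add once more than one operand is buffered) can be illustrated with a small sketch. The flag value, packet layout, and names `on_packet` and `NET_COMPUTE_FLAG` are assumptions for illustration, not taken from the patent.

```python
NET_COMPUTE_FLAG = 0x1  # assumed value of the preset flag bit

def on_packet(packet, unit, route):
    """Mirror the described flow control for one incoming packet.

    packet: dict with "flag", "header" and "data" (assumed layout).
    unit:   dict standing in for the arithmetic unit, with a "register",
            a "buffer" list, and an "out" FIFO list.
    route:  fallback used when the packet needs no in-network computing.
    """
    if packet["flag"] != NET_COMPUTE_FLAG:
        return route(packet)                 # ordinary routing pipeline
    unit["register"] = packet["header"]      # packet header -> register
    unit["buffer"].append(packet["data"])    # data part -> buffer area
    if len(unit["buffer"]) > 1:              # more than one operand:
        total = sum(unit["buffer"])          # floating-point addition
        unit["buffer"].clear()
        unit["out"].append({"header": unit["register"], "data": total})
    # with exactly one operand, wait for the next packet to arrive
```

For example, feeding two flagged packets carrying 1.5 and 2.5 leaves a single aggregated packet with data 4.0 in the unit's output FIFO, so only one packet travels onward instead of two.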
The router pipeline design in the step (1) is specifically that a first-stage network computing pipeline is added before routing, the data pa