CN-122027622-A - On-network computing device and system based on ring network broadcast acceleration

CN122027622A

Abstract

The application relates to the technical field of network computing and discloses an on-network computing device and system based on ring network broadcast acceleration. The device offloads the computing tasks borne by each expansion processor to distributed on-network computing modules for parallel execution, and uses a data transmission loop to broadcast the computing results of each on-network computing module efficiently, which reduces data congestion. A shared buffer pool stores the fragment data of different on-network computing modules for the same on-network computing task, so that the transmission jitter among fragment data sent by different expansion processors is absorbed. The corresponding on-network computing unit is triggered only after the data of each on-network computing task is confirmed complete. This overcomes the input jitter of on-network computing devices in the related art, remains fully compatible with conventional Ethernet switching functions, and meets the deployment requirements of the more widely used Ethernet architecture.

Inventors

  • GONG XIAOHUA
  • TIAN YIPU
  • MA LE
  • HUANG XIAOMIN

Assignees

  • 无锡众星微系统技术有限公司

Dates

Publication Date
2026-05-12
Application Date
2025-12-29

Claims (10)

  1. An on-network computing device based on ring network broadcast acceleration, the device comprising: on-network computing modules, each connected to a traffic management and switching module and to a corresponding expansion processor, and each connected to the other on-network computing modules through a data transmission loop; the on-network computing module is configured to execute the corresponding computing task, or to feed back the completion state of the computing task to the corresponding expansion processor, based on the computing request distributed by the traffic management and switching module; the traffic management and switching module is configured to distribute the computing request corresponding to a computing service to the corresponding on-network computing module based on the type of the received computing service, or to store the response data transmitted by each on-network computing module in a shared resource pool, establish an index, and trigger the corresponding on-network computing module to execute the corresponding computing task, or to coordinate the scheduling and synchronization of the distributed computing resources in the data transmission loop.
  2. The device of claim 1, wherein the traffic management and switching module comprises: a service parsing and distribution unit, configured to parse the received computing service and distribute the computing request corresponding to the parsed computing service to the corresponding on-network computing module; a shared buffer management unit, configured to store the response data of each expansion processor in the shared buffer pool, establish an index for each, and monitor the integrity of each piece of response data in the shared buffer pool; a computing trigger and synchronization unit, configured to send a computing trigger signal to the corresponding on-network computing module after the response data of any expansion processor has been fully collected, and to synchronize the execution timing of the corresponding computing task; and a resource scheduling and state coordination unit, configured to allocate the buffers and the bandwidth of the data transmission loop based on the task completion state fed back by each on-network computing module.
  3. The device of claim 2, wherein the traffic management and switching module further comprises: a multicast engine, configured to replicate the computing request corresponding to the computing service parsed by the service parsing and distribution unit.
  4. The device of claim 1, wherein the device further comprises: interface modules, each communicatively connected between one on-network computing module and the corresponding expansion processor, and configured to transfer computing services or response data between the expansion processor and the corresponding on-network computing module.
  5. The device of claim 4, wherein the interface module comprises: an ingress processing unit, configured to receive the computing service sent by the corresponding expansion processor and pass it to the corresponding on-network computing module; and an egress processing unit, configured to receive the response data of the corresponding on-network computing module and pass it to the corresponding expansion processor.
  6. The device of claim 1, wherein the on-network computing module comprises: a message distribution unit, a control unit, an arithmetic logic operation unit, and a ring network interaction unit, wherein the input of the message distribution unit is communicatively connected to the traffic management and switching module, the output of the message distribution unit is communicatively connected to the control unit and the ring network interaction unit, and the control unit is communicatively connected to the arithmetic logic operation unit and the ring network interaction unit; the message distribution unit is configured to receive the computing request transmitted by the traffic management and switching module and send it to the control unit; the control unit is configured to parse the computing request passed by the message distribution unit and generate a corresponding control signal based on the parsed computing request; the arithmetic logic operation unit is configured to perform the corresponding computing operation in response to the control signal of the control unit; and the ring network interaction unit is configured to monitor the data stream on the corresponding data transmission loop and send it to the control unit, or to inject the computing result into the data transmission loop in response to the control signal of the control unit.
  7. The device of claim 6, wherein, when the computing request passed by the traffic management and switching module includes a basic reduction task, in the on-network computing module: the message distribution unit is configured to receive a first request transmitted by the traffic management and switching module and send the first request to the control unit, wherein the first request comprises a control descriptor corresponding to the basic reduction task and the data to be reduced; the control unit is configured to parse the control descriptor of the first request, cache the parsed control descriptor, and trigger the corresponding expansion processor to send fragment data, or to generate, after receiving the fragment data forwarded by the traffic management and switching module, a control signal that directs the arithmetic logic operation unit to perform a reduction operation, or to count the response data of the corresponding expansion processor and feed back the task completion state to the traffic management and switching module; and the arithmetic logic operation unit is configured to receive the control signal of the control unit, perform the reduction operation on all the fragment data, and output a reduction result.
  8. The device of claim 6, wherein, when the computing request passed by the traffic management and switching module includes a pure broadcast task, in the on-network computing module: the message distribution unit is configured to receive a second request transmitted by the traffic management and switching module and send the second request to the control unit, wherein the second request comprises a control descriptor corresponding to the pure broadcast task and the data to be broadcast; the control unit is configured to parse the control descriptor of the second request, request buffers and bandwidth on the data transmission loop, generate the control descriptor corresponding to the data to be broadcast, and pass the data to be broadcast and the control descriptor to the ring network interaction unit; and the ring network interaction unit is configured to inject the data to be broadcast and the control descriptor received from the control unit into its corresponding data transmission loop, or to monitor the loop data stream, receive broadcast data forwarded by other on-network computing modules, pass the broadcast data to the control unit, and perform local replication and forwarding or last-hop unloading in response to the decision of the control unit.
  9. The device of claim 6, wherein, when the computing request passed by the traffic management and switching module includes a reduction and broadcast task, in the on-network computing module: the message distribution unit is configured to receive a third request transmitted by the traffic management and switching module and send the third request to the control unit, wherein the third request comprises a control descriptor corresponding to the reduction and broadcast task and the data to be reduced; the control unit is configured to parse the control descriptor of the third request, cache the corresponding control descriptor, and send the corresponding data read request to the corresponding expansion processor, or to generate, after receiving all the fragment data forwarded by the traffic management and switching module, a control signal that directs the arithmetic logic operation unit to perform a reduction operation, or to generate the control descriptor corresponding to the reduction result of the arithmetic logic operation unit and pass the reduction result and the control descriptor to the ring network interaction unit, or to count the response data of the corresponding expansion processor and feed back the task completion state to the traffic management and switching module; the arithmetic logic operation unit is configured to receive the control signal of the control unit, perform the reduction operation on all the fragment data, and output a reduction result; and the ring network interaction unit is configured to inject the data to be broadcast and the control descriptor received from the control unit into its corresponding data transmission loop, or to monitor the loop data stream, receive broadcast data forwarded by other on-network computing modules, pass the broadcast data to the control unit, and perform local replication and forwarding or last-hop unloading in response to the decision of the control unit.
  10. An on-network computing system based on ring network broadcast acceleration, the system comprising: the on-network computing device of any one of claims 1 to 9.
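
The reduction-and-broadcast flow recited in claims 7 to 9 can be illustrated with a minimal Python sketch. It is not the claimed hardware: it assumes a unidirectional ring, element-wise summation as the reduction operator, and a hop counter that decides between local replication with forwarding and last-hop unloading. The names `reduce_fragments` and `ring_broadcast` are hypothetical and do not come from the application.

```python
from typing import Dict, List, Tuple


def reduce_fragments(fragments: List[List[int]]) -> List[int]:
    """Element-wise sum of equal-length fragments (assumed reduction operator)."""
    if not fragments:
        raise ValueError("at least one fragment is required")
    length = len(fragments[0])
    if any(len(fragment) != length for fragment in fragments):
        raise ValueError("all fragments must have the same length")
    return [sum(column) for column in zip(*fragments)]


def ring_broadcast(
    num_modules: int, source: int, payload: List[int]
) -> Tuple[Dict[int, List[int]], int]:
    """Send payload once around an assumed unidirectional ring.

    Returns the copy kept by every non-source module and the index of the
    module that unloads the payload from the ring (the last hop).
    """
    if num_modules < 2:
        raise ValueError("a ring needs at least two modules")
    if not 0 <= source < num_modules:
        raise ValueError("source index out of range")
    received: Dict[int, List[int]] = {}
    hops_left = num_modules - 1  # hypothetical hop counter carried with the payload
    node = source
    while hops_left > 0:
        node = (node + 1) % num_modules  # payload arrives at the next module
        received[node] = list(payload)   # local replication
        hops_left -= 1                   # zero means this module is the last hop
    return received, node  # the last module visited unloads instead of forwarding


# Three fragments are reduced at module 1, which then broadcasts the
# result to the other three modules of a four-module ring.
result = reduce_fragments([[1, 2], [3, 4], [5, 6]])
copies, last_hop = ring_broadcast(num_modules=4, source=1, payload=result)
```

In this sketch the source module does not receive its own payload back; the module just before it on the ring acts as the last hop and removes the data from the loop.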

Description

On-network computing device and system based on ring network broadcast acceleration
Technical Field
The application relates to the technical field of network computing, and in particular to an on-network computing device and system based on ring network broadcast acceleration.
Background
With the development of artificial intelligence, large-scale distributed training, and high-performance computing, communication overhead among expansion processor (XPU) clusters has become one of the main bottlenecks of system performance. Expansion processors include CPUs, GPUs, DPUs, and other processing units that take part in high-performance computing or AI training. Communication among expansion processor clusters depends on each expansion processor actively initiating and completing all collective operations, so a large amount of intermediate data is transmitted repeatedly across the network, which increases end-to-end delay and occupies valuable computing resources of the expansion processors. On-network computing (In-Network Computing) has therefore emerged: it offloads part of the computing tasks from the expansion processors into the data path of the network switching device. However, the on-network computing devices disclosed in the related art rely on a proprietary network architecture and cannot meet the deployment requirements of the more widely used Ethernet architecture.
Disclosure of Invention
The application provides an on-network computing device and system based on ring network broadcast acceleration, to solve the problem that the on-network computing devices disclosed in the related art cannot meet the deployment requirements of the more widely used Ethernet architecture.
In a first aspect, the present application provides an on-network computing device based on ring network broadcast acceleration, the device comprising: on-network computing modules, each connected to a traffic management and switching module and to a corresponding expansion processor, and each connected to the other on-network computing modules through a data transmission loop. The on-network computing module is configured to execute the corresponding computing task, or to feed back the completion state of the computing task to the corresponding expansion processor, based on the computing request distributed by the traffic management and switching module. The traffic management and switching module is configured to distribute the computing request corresponding to a computing service to the corresponding on-network computing module based on the type of the received computing service, or to store the response data transmitted by each on-network computing module in a shared resource pool, establish an index, and trigger the corresponding on-network computing module to execute the corresponding computing task, or to coordinate the scheduling and synchronization of the distributed computing resources in the data transmission loop.
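The store, index, and trigger behaviour described for the shared pool can be sketched in Python as follows. This is only an illustration under stated assumptions: the pool is indexed by a task identifier and a source identifier, a task counts as complete once a fixed number of sources have delivered a fragment, and the trigger is a plain callback. The class `SharedBufferPool` and its parameters are hypothetical and are not defined in the application.

```python
from typing import Callable, Dict, List


class SharedBufferPool:
    """Buffers fragments per task and fires a trigger once a task is complete."""

    def __init__(
        self,
        expected_sources: int,
        on_complete: Callable[[str, List[List[int]]], None],
    ) -> None:
        if expected_sources <= 0:
            raise ValueError("expected_sources must be positive")
        self._expected = expected_sources  # assumed completeness criterion
        self._on_complete = on_complete    # stands in for the computing trigger signal
        self._index: Dict[str, Dict[int, List[int]]] = {}

    def store(self, task_id: str, source_id: int, fragment: List[int]) -> bool:
        """Store one fragment; return True if this arrival completed the task."""
        slots = self._index.setdefault(task_id, {})
        slots[source_id] = list(fragment)
        if len(slots) < self._expected:
            return False  # still waiting: arrival jitter is absorbed here
        del self._index[task_id]  # release the buffered entries
        ordered = [slots[key] for key in sorted(slots)]
        self._on_complete(task_id, ordered)
        return True


# Fragments for task "t1" arrive out of order from three sources.
triggered: List[tuple] = []
pool = SharedBufferPool(3, lambda task, frags: triggered.append((task, frags)))
first = pool.store("t1", 2, [5])
second = pool.store("t1", 0, [1])
third = pool.store("t1", 1, [3])
```

The trigger fires only on the third arrival, regardless of the order in which the fragments reach the pool, which is the jitter-absorbing property the text attributes to the shared buffer.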
In this embodiment, the computing tasks borne by each expansion processor are offloaded to the distributed on-network computing modules for parallel execution, and the on-network computing modules broadcast their computing results efficiently through the data transmission loop, which reduces data congestion. The shared buffer pool stores the fragment data of different on-network computing modules for the same on-network computing task, so that the transmission jitter among fragment data sent by different expansion processors is absorbed. The corresponding on-network computing unit is triggered only after the data of each on-network computing task is confirmed complete, which overcomes the input jitter of on-network computing devices in the related art, remains fully compatible with conventional Ethernet switching functions, and meets the deployment requirements of the more widely used Ethernet architecture.
In an optional embodiment, the traffic management and switching module includes: a service parsing and distribution unit, configured to parse the received computing service and distribute the computing request corresponding to the parsed computing service to the corresponding on-network computing module; a shared buffer management unit, configured to store the response data of each expansion processor in the shared buffer pool, establish an index for each, and monitor the integrity of each piece of response data in the shared buffer pool; a computing trigger and synchronization unit, configured to send a computing trigger signal to the corresponding on-network computing module after the response data of any expansion processor has been fully collected, and to synchronize the execution time sequence o