CN-122019466-A - Computing chip, data processing method, device, equipment and storage medium

Abstract

Embodiments of the application provide a computing chip, a data processing method, a data processing device, equipment and a storage medium, and relate to the technical field of chips. The computing chip comprises at least one computing cluster and a network-on-chip. Each computing cluster comprises a plurality of storage computing units, a cluster center, and a cluster shared memory, wherein each storage computing unit comprises a computing module and a memory module physically integrated with the computing module; the cluster shared memory is communicatively connected to the cluster center and, through the cluster center, to each computing module in the plurality of storage computing units; and the cluster center of each computing cluster is connected to the network-on-chip. Physically integrating memory with computation gives efficient access to local data, the cluster-level shared memory shortens cross-memory access paths, and the hierarchical communication architecture simplifies the interaction logic of the network-on-chip, which together improve the data transmission efficiency of the computing modules. Reducing the number of network-on-chip access requests and simplifying the network design lowers network-on-chip traffic and power consumption, and also reduces chip area and hardware cost.

Inventors

Request for anonymity
Request for anonymity

Assignees

苏州亿铸智能科技有限公司

Dates

Publication Date: 20260512
Application Date: 20260407

Claims (20)

1. A computing chip, comprising: at least one computing cluster, wherein the computing cluster comprises: a plurality of storage computing units, each storage computing unit comprising a computing module and a memory module physically integrated with the computing module; a cluster center; and a cluster shared memory which is communicatively connected to the cluster center and, through the cluster center, to each computing module in the plurality of storage computing units; and a network-on-chip, wherein the cluster center of each computing cluster is connected to the network-on-chip.
2. The computing chip of claim 1, wherein the computing module is vertically integrated with the memory module by a three-dimensional stacking technique, and the memory module is configured as a private memory of the computing module.
3. The computing chip of claim 1, wherein the cluster shared memory of the computing cluster is on-chip memory having a physical address space independent of a private cache hierarchy of the computing modules and configured to be directly accessible by all of the computing modules within the computing cluster.
4. The computing chip of claim 1, wherein the cluster shared memory of the computing cluster is connected to the network-on-chip via the cluster center.
5. The computing chip of claim 1, wherein the network-on-chip includes a switch corresponding to each of the computing clusters, the cluster center being connected to the network-on-chip through the switch.
6. The computing chip of claim 1, comprising at least two wafers, each wafer comprising at least one of the computing clusters.
7. The computing chip of claim 6, wherein each of the wafers is provided with a wafer center, the wafer center being communicatively coupled to all of the cluster centers within the corresponding wafer and configured to perform address routing for cross-cluster access requests and to provide unified access to the network-on-chip.
8. The computing chip of claim 1, wherein the cluster center is configured with an address discrimination circuit configured to, in response to an access request initiated by the computing module, direct the access request to the cluster shared memory or forward it to the network-on-chip according to a target address of the access request.
9. The computing chip of claim 8, wherein the address discrimination circuit is specifically configured to direct the access request to the cluster shared memory of the computing cluster if the target address of the access request initiated by the computing module is located in the cluster shared memory, and to forward the access request to the network-on-chip for routing if the target address is located outside the computing cluster.
10. The computing chip of claim 1, wherein the computing chip has a three-layer data access path comprising: a private access path by which the computing module accesses its corresponding memory module; an intra-cluster shared path by which the computing module accesses the cluster shared memory through the cluster center; and a global access path by which the computing module accesses resources outside the computing cluster via the cluster center and the network-on-chip.
11. The computing chip of any of claims 1 to 10, wherein the memory module is a 3D-DRAM and the cluster shared memory is an SRAM.
12. The computing chip of claim 1, wherein the path by which the computing module accesses its corresponding memory module passes through neither the cluster center nor the network-on-chip.
13. The computing chip of claim 1, wherein the cluster shared memory is configured to store data shared by a plurality of the computing modules within the computing cluster, and wherein the computing modules access the cluster shared memory without going through the network-on-chip.
14. A data processing method applied to the computing chip according to any one of claims 1 to 13, comprising: in each time step, processing the allocated local data blocks in parallel in the computing modules, and storing the obtained calculation results in the corresponding memory modules; transmitting the calculation results of the computing modules to the cluster shared memory of the same computing cluster to obtain corresponding cluster aggregate data; synchronizing the cluster aggregate data in the cluster shared memories over the network-on-chip through the cluster centers to obtain global data; and storing the global data in each cluster shared memory so that each computing module in the computing cluster can access the global data directly.
15. The data processing method according to claim 14, wherein the cluster shared memory includes a shared buffer for storing the cluster aggregate data and the global data.
16. The data processing method according to claim 14 or 15, wherein the method is applied to a diffusion-model-based image or video generation task, and the local data blocks are patches of a latent representation of the image or video in a latent space.
17. The method according to claim 16, wherein the calculation result is a denoised latent representation patch, and the cluster aggregate data is a latent frame composed of a plurality of denoised latent representation patches.
18. The data processing method of claim 17, wherein the latent frames include at least one of a global latent frame, a window latent frame, and a current latent frame, depending on the generation phase, wherein the global latent frame comprises a complete latent representation of the current target generation object, the window latent frame is composed of the latent frames of a preset number of time steps preceding the current time step, and the current latent frame is the latent representation to be processed at the current time step.
19. The method of claim 14, wherein the step of synchronizing the cluster aggregate data over the network-on-chip through the cluster centers to obtain global data comprises: sending the cluster aggregate data in the cluster shared memory to the corresponding cluster center; and performing an all-gather operation among the cluster centers over the network-on-chip, so that each cluster center obtains the cluster aggregate data of the other computing clusters.
20. The data processing method according to claim 19, wherein the all-gather operation uses a ring-based communication algorithm in which each cluster center participating in the synchronization acts as a communication node and a preset number of rounds of ring exchange are performed, so that each cluster center obtains the complete global data.
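
The ring-based collection step recited in claims 19 and 20 corresponds to what is commonly called a ring all-gather. The following is a minimal Python simulation of that pattern, offered as a sketch rather than the claimed implementation. The function name `ring_all_gather`, the in-memory `buffers` representation, and the choice of n - 1 rounds for n cluster centers are assumptions for illustration; the claims only state that a preset number of rounds is used.

```python
def ring_all_gather(cluster_data):
    """Simulate a ring all-gather among n cluster centers.

    cluster_data[i] is the cluster aggregate data held by cluster center i.
    Returns one list per cluster center, ordered by originating center.
    """
    n = len(cluster_data)
    # buffers[i][j] holds the block that originated at center j once
    # center i has received it (None until then).
    buffers = [[None] * n for _ in range(n)]
    for i, block in enumerate(cluster_data):
        buffers[i][i] = block

    # Assumed round count: n - 1. In round r, center i forwards the block
    # that originated at center (i - r) mod n to its neighbour (i + 1) mod n.
    for r in range(n - 1):
        # Collect all sends first so every center sends "simultaneously".
        sends = [
            ((i + 1) % n, (i - r) % n, buffers[i][(i - r) % n])
            for i in range(n)
        ]
        for dst, src, block in sends:
            buffers[dst][src] = block
    return buffers


# Hypothetical example: four cluster centers, each holding one latent frame.
frames = ["frame0", "frame1", "frame2", "frame3"]
gathered = ring_all_gather(frames)
```

In this sketch each cluster center sends exactly one block per round and only to its ring neighbour, so the per-round traffic on the network-on-chip stays constant regardless of the number of clusters.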

Description

Computing chip, data processing method, device, equipment and storage medium

Technical Field

The present application relates to the field of chip technologies, and in particular, to a computing chip, a data processing method, a device, an apparatus, and a storage medium.

Background

In a high-performance computing chip architecture, a Computing Module (CM) and a Memory Module (MM) are usually designed to be physically separate. In such a decoupled architecture, if the data required by one computing module is not in its own memory module but in the memory module of another computing unit, the data must be fetched through the network-on-chip (NOC) inside the chip. Any memory access that crosses memory modules must therefore traverse the on-chip high-speed interconnect network, which increases data transfer latency and generates a large amount of unnecessary NOC traffic and network congestion. In addition, the NOC has to handle massive numbers of concurrent random access requests from all computing modules, which makes the routing design and flow control mechanism complex and significantly increases the area, cost, and power consumption of the chip.

Disclosure of Invention

The main purpose of the embodiments of the application is to provide a computing chip, a data processing method, a device, equipment and a storage medium that improve the data transmission efficiency of the computing modules and reduce the access traffic and power consumption of the network-on-chip.
To achieve the above object, a first aspect of an embodiment of the present application provides a computing chip, including: at least one computing cluster, wherein the computing cluster comprises: a plurality of storage computing units, each storage computing unit comprising a computing module and a memory module physically integrated with the computing module; a cluster center; and a cluster shared memory which is communicatively connected to the cluster center and, through the cluster center, to each computing module in the plurality of storage computing units; and a network-on-chip, wherein the cluster center of each computing cluster is connected to the network-on-chip.

In some embodiments, the computing module is vertically integrated with the memory module via a three-dimensional stacking technique, and the memory module is configured as a private memory of the computing module.

In some embodiments, the cluster shared memory of the computing cluster is on-chip memory having a physical address space independent of a private cache hierarchy of the computing modules and configured to be directly accessible by all of the computing modules within the computing cluster.

In some embodiments, the cluster shared memory of the computing cluster is connected to the network-on-chip via the cluster center.

In some embodiments, the network-on-chip includes a switch corresponding to each of the computing clusters, the cluster center being connected to the network-on-chip through the switch.

In some embodiments, the computing chip comprises at least two wafers, each wafer comprising at least one of the computing clusters.

In some embodiments, each of the wafers is provided with a wafer center, the wafer center is communicatively connected to all the cluster centers in the corresponding wafer, and the wafer center is configured to perform address routing for cross-cluster access requests and to provide unified access to the network-on-chip.
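
Because the cluster center sits between the computing modules, the cluster shared memory, and the network-on-chip, each request it receives resolves to one of two destinations according to its target address. The Python sketch below illustrates that decision under stated assumptions: the names `ClusterCenter`, `Route`, `shared_base`, and `shared_size` are hypothetical, the shared memory is modelled as a single contiguous address window, and every address outside that window is treated as lying outside the computing cluster. The application does not specify an address layout, so this is an illustration only.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    CLUSTER_SHARED_MEMORY = "cluster shared memory"
    NETWORK_ON_CHIP = "network-on-chip"


@dataclass(frozen=True)
class ClusterCenter:
    # Hypothetical contiguous address window of the cluster shared memory.
    shared_base: int
    shared_size: int

    def route(self, target_address: int) -> Route:
        """Decide where a request arriving at the cluster center goes.

        Private-memory accesses never reach this function: a computing
        module reaches its own memory module without the cluster center.
        """
        end = self.shared_base + self.shared_size
        if self.shared_base <= target_address < end:
            return Route.CLUSTER_SHARED_MEMORY
        # Simplifying assumption: anything else is outside the cluster.
        return Route.NETWORK_ON_CHIP


# Hypothetical window: 1 KiB of shared memory starting at 0x1000.
center = ClusterCenter(shared_base=0x1000, shared_size=0x400)
local_hit = center.route(0x1200)
remote = center.route(0x8000)
```

Only requests that resolve to `Route.NETWORK_ON_CHIP` generate network-on-chip traffic in this model, which is the source of the traffic reduction the application describes.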
In some embodiments, the cluster center is configured with an address discrimination circuit, which is configured to, in response to an access request initiated by the computing module, direct the access request to the cluster shared memory or forward the access request to the network-on-chip according to a target address.

In some embodiments, the address discrimination circuit is specifically configured to direct the access request to the cluster shared memory if the target address of the access request initiated by the computing module is located in the cluster shared memory of the computing cluster, and to forward the access request to the network-on-chip for routing if the target address is located outside the computing cluster.

In some embodiments, the computing chip has a three-layer data access path comprising: a private access path by which the computing module accesses its corresponding memory module; an intra-cluster shared path by which the computing module accesses the cluster shared memory through the cluster center; and a global access path by which the computing module accesses resources outside the computing cluster via the cluster center and the network-on-chip.

In some embodiments, the memory module is a 3D-DRAM and the cluster shared memory is SRAM.

In some embodiments, the path by which the computing module accesses its corresponding memory module passes through neither the cluster center nor the network-on-chip.

In some embodiments, the cluster shared mem