Search

CN-121979364-A - Processor and electronic equipment

CN121979364A

Abstract

The present disclosure provides a processor and an electronic device. The processor includes a high-bandwidth memory, a memory controller, a replication engine, and an on-chip memory. The high-bandwidth memory is directly connected to the replication engine through the memory controller, and the replication engine is connected to the on-chip memory, where the transmission bandwidth of the high-bandwidth memory is matched to the transmission bandwidth of the replication engine. By optimizing the processor architecture in this way, the processor and electronic device of the present disclosure shorten the data transmission path and improve bandwidth utilization.

Inventors

  • Request for anonymity
  • Request for anonymity

Assignees

  • 苏州亿铸智能科技有限公司

Dates

Publication Date
2026-05-05
Application Date
2025-12-30

Claims (11)

  1. A processor, characterized by comprising a high-bandwidth memory, a memory controller, a replication engine, and an on-chip memory, wherein the high-bandwidth memory is directly connected to the replication engine through the memory controller, the replication engine is connected to the on-chip memory, and the transmission bandwidth of the high-bandwidth memory is matched to the transmission bandwidth of the replication engine.
  2. The processor of claim 1, further comprising a local crossbar, the memory controller comprising a plurality of transmission ports, the local crossbar connecting the replication engine to each of the transmission ports, respectively.
  3. The processor of claim 1, wherein the replication engine exchanges data with the on-chip memory through a network-on-chip, and wherein the data interaction mode of the network-on-chip is any one of the following: a crossbar mode, a mesh mode, or a broadcast mode.
  4. The processor of claim 1, wherein the on-chip memory organizes data in tiles as storage units, the size of a tile being determined by the storage space of the on-chip memory.
  5. The processor of claim 4, wherein the replication engine, upon receiving a non-aligned access request directed to the high-bandwidth memory, splits the non-aligned access request into at least two aligned access sub-requests and executes all of the aligned access sub-requests in parallel within the same clock cycle.
  6. The processor of claim 5, wherein the replication engine is configured to receive the return data of all the aligned access sub-requests and to extract, according to the request address of the non-aligned access request, the initial data to be written to the on-chip memory from the return data.
  7. The processor of any one of claims 4 to 6, wherein the replication engine obtains initial data in the coordinate space of a target tile to be written to the on-chip memory; when the size of the initial data is smaller than the size of the tile, the replication engine determines the uncovered area of the initial data, fills the uncovered area with data to form the target tile, and writes the target tile to the on-chip memory.
  8. The processor of claim 7, wherein the target tile comprises a plurality of tile rows, and wherein, when the size of the initial data is smaller than the size of the tile, determining the uncovered area of the initial data and filling the uncovered area with data to form the target tile comprises: traversing all the tile rows, and comparing the initial coordinates of the initial data with the uncovered-area coordinates of the target tile on each tile row to determine the uncovered area of the initial data and the valid data on each tile row; and generating filling data for the uncovered area of each tile row, and combining the valid data with the filling data to form the target tile.
  9. The processor of claim 6, wherein the tiles are written to and stored in the on-chip memory in a row-major or column-major layout.
  10. The processor of claim 7, wherein the replication engine is configured to determine the range of initial data to be written to the on-chip memory and the range of data stored in the high-bandwidth memory, and, when the initial data range exceeds the stored data range, to intercept the data in the region of the high-bandwidth memory where the stored data range overlaps the initial data range as the initial data to be written to the on-chip memory.
  11. An electronic device comprising the processor of any one of claims 1 to 9.
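As an illustrative sketch (not part of the claims), the row-major and column-major tile layouts referred to in claim 9 amount to two different linearizations of a tile's element coordinates. The function name and tile dimensions below are hypothetical:

```python
def tile_offset(row: int, col: int, rows: int, cols: int,
                row_major: bool = True) -> int:
    """Linear offset of element (row, col) inside a rows x cols tile.

    Row-major stores each tile row contiguously in memory;
    column-major stores each tile column contiguously.
    """
    assert 0 <= row < rows and 0 <= col < cols
    return row * cols + col if row_major else col * rows + row

# In a 4x4 tile, element (1, 2) sits at offset 6 row-major, 9 column-major.
print(tile_offset(1, 2, 4, 4, row_major=True))   # 6
print(tile_offset(1, 2, 4, 4, row_major=False))  # 9
```

The choice between the two layouts determines which traversal direction yields contiguous accesses when the on-chip memory is read back.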

Description

Processor and electronic equipment

Technical Field

The present disclosure relates to the field of processor technologies, and in particular, to a processor and an electronic device.

Background

With the rapidly growing demands of artificial intelligence, high-performance computing, and similar fields for data-processing speed and energy efficiency, the co-design of on-chip memory and external storage devices has become key to improving system performance. In current processor architectures, the replication engine is the core component for data movement, but its transmission bandwidth differs greatly from that of the external storage device, so the effective bandwidth of the replication engine is limited by the actual bandwidth of the external storage device. When the external storage device reads data, the data must first be loaded into a second-level cache, from which the replication engine then reads it, forming a multi-level redundant path. The resulting long transmission path between the replication engine and the external storage device increases power consumption and limits the bandwidth utilization of the replication engine.

Disclosure of Invention

Embodiments of the present disclosure provide a processor and an electronic device that shorten the data transmission path and improve bandwidth utilization by optimizing the processor architecture.
According to an aspect of the disclosure, a processor is provided. The processor includes a high-bandwidth memory, a memory controller, a replication engine, and an on-chip memory. The high-bandwidth memory is directly connected to the replication engine through the memory controller, and the replication engine is connected to the on-chip memory, where the transmission bandwidth of the high-bandwidth memory is matched to the transmission bandwidth of the replication engine.

In the processor provided in the embodiments of the present disclosure, the processor further includes a local crossbar, and the memory controller includes a plurality of transmission ports, where the local crossbar connects the replication engine to each of the transmission ports, respectively.

In the processor provided in the embodiments of the present disclosure, the replication engine exchanges data with the on-chip memory through a network-on-chip, and the data interaction mode of the network-on-chip is any one of the following: a crossbar mode, a mesh mode, or a broadcast mode.

In the processor provided in the embodiments of the present disclosure, the on-chip memory organizes data in tiles as storage units, and the size of a tile is determined by the storage space of the on-chip memory.

In the processor provided in the embodiments of the present disclosure, when the replication engine receives a non-aligned access request directed to the high-bandwidth memory, it splits the non-aligned access request into at least two aligned access sub-requests and executes all of the aligned access sub-requests in parallel within the same clock cycle.

In the processor provided in the embodiments of the present disclosure, the replication engine receives the return data of all the aligned access sub-requests and extracts, according to the request address of the non-aligned access request, the initial data to be written to the on-chip memory from the return data.
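The split-and-extract behavior described above can be sketched in a few lines. This is an illustrative model only: the 64-byte alignment granularity and all function names below are assumptions, not taken from the disclosure, and in hardware the aligned sub-requests would issue in parallel rather than sequentially:

```python
ALIGN = 64  # assumed access granularity of the high-bandwidth memory (bytes)

def split_unaligned(addr: int, size: int, align: int = ALIGN):
    """Split a possibly non-aligned request [addr, addr+size) into
    aligned, equal-sized sub-requests that together cover it.
    All sub-requests are independent and can issue in the same cycle."""
    start = addr - addr % align                      # round start down
    end_aligned = ((addr + size + align - 1) // align) * align  # round end up
    return [(a, align) for a in range(start, end_aligned, align)]

def extract(addr: int, size: int, subreqs, returned):
    """Concatenate the returned aligned blocks and slice out the
    originally requested bytes using the request address."""
    base = subreqs[0][0]                 # address of the first aligned block
    data = b"".join(returned)
    off = addr - base                    # offset of the request inside it
    return data[off:off + size]

# A 10-byte read at address 60 crosses a 64-byte boundary and
# becomes two aligned sub-requests: (0, 64) and (64, 64).
print(split_unaligned(60, 10))  # [(0, 64), (64, 64)]
```

Once both sub-requests return, `extract` recovers exactly the ten requested bytes from the 128 bytes fetched, which mirrors how the replication engine uses the request address to pick the initial data out of the return data.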
In the processor provided in the embodiments of the present disclosure, the replication engine obtains initial data in the coordinate space of a target tile to be written to the on-chip memory; when the size of the initial data is smaller than the size of the tile, it determines the uncovered area of the initial data, fills the uncovered area with data to form the target tile, and writes the target tile to the on-chip memory.

In the processor provided in the embodiments of the present disclosure, the target tile includes a plurality of tile rows, and when the size of the initial data is smaller than the size of the tile, determining the uncovered area of the initial data and filling the uncovered area with data to form the target tile includes: traversing all the tile rows, and comparing the initial coordinates of the initial data with the uncovered-area coordinates of the target tile on each tile row to determine the uncovered area of the initial data and the valid data on each tile row; and generating filling data for the uncovered area of each tile row, and combining the valid data with the filling data to form the target tile.

In the processor provided in the embodiments of the present disclosure, the tiles are written to and stored in the on-chip memory in a row-major or column-major layout.

According to another aspect of the disclosure, a replication engine is provided,
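The row-by-row tile-filling procedure described above can be sketched as follows. This is a minimal illustrative model: the function name, the placement coordinates `(y0, x0)`, and the zero fill value are assumptions for the sketch, not details taken from the disclosure:

```python
def fill_tile(initial, x0, y0, tile_h, tile_w, fill=0):
    """Pad a block of initial data, placed at column x0 / row y0 inside
    a tile_h x tile_w target tile, up to a full tile.

    Each tile row is traversed once: where the initial data covers the
    row, its valid elements are copied and the remainder is filled;
    rows it does not reach are entirely fill data.
    """
    h = len(initial)
    w = len(initial[0]) if h else 0
    tile = []
    for r in range(tile_h):              # traverse every tile row
        if y0 <= r < y0 + h:             # row intersects the initial data
            src = initial[r - y0]
            row = [fill] * x0 + list(src) + [fill] * (tile_w - x0 - w)
        else:                            # fully uncovered row
            row = [fill] * tile_w
        tile.append(row)
    return tile

# A 2x2 block placed at (row 1, column 1) inside a 4x4 tile:
for row in fill_tile([[1, 2], [3, 4]], 1, 1, 4, 4):
    print(row)
```

Per-row generation of the filling data, as in the loop above, matches the claimed traversal of tile rows and lets valid data and filling data be combined in a single pass before the completed tile is written to the on-chip memory.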