
CN-122018824-A - Conditional storage system and method

CN122018824A

Abstract

The invention discloses a conditional storage system and a conditional storage method. The conditional storage system comprises a wafer-level computing engine, a network-on-chip, and a distributed Engram embedded storage subsystem. The wafer-level computing engine comprises at least one core cluster, each core cluster being provided with at least one computing core and a prefetch scheduler. The distributed Engram embedded storage subsystem comprises a plurality of physical storage blocks, each mapped to a shard of an Engram embedded table. While computing the current Transformer layer, the computing cores asynchronously and in parallel compute the embedded-vector addresses corresponding to the time-step positions of the input sequence in the subsequent Transformer layer, and the prefetch scheduler asynchronously and in parallel prefetches the embedded vectors from the physical storage blocks over the network-on-chip according to those addresses.

Inventors

  • OUYANG PENG
  • WANG BO
  • LI XIUDONG

Assignees

  • Beijing Tsingmicro Intelligent Technology Co., Ltd. (北京清微智能科技股份有限公司)

Dates

Publication Date
2026-05-12
Application Date
2026-04-15

Claims (12)

  1. A conditional storage system, characterized by comprising a wafer-level computing engine, a network-on-chip, and a distributed Engram embedded storage subsystem; the wafer-level computing engine comprises at least one core cluster, each core cluster being provided with at least one computing core and a prefetch scheduler; the distributed Engram embedded storage subsystem comprises a plurality of physical storage blocks, each physical storage block being mapped to a shard of an Engram embedded table, the Engram embedded table storing the embedded vectors of the Transformer layers of a Transformer model; the computing core is configured, while computing the current Transformer layer based on the input sequence and the embedded vectors, to asynchronously and in parallel compute the embedded-vector addresses corresponding to the time-step positions of the input sequence in the subsequent Transformer layer; and the prefetch scheduler is configured to asynchronously and in parallel prefetch the embedded vectors from the physical storage blocks to the corresponding computing core over the network-on-chip according to the embedded-vector addresses.
  2. The system of claim 1, wherein each computing core is provided with a computing unit, a deterministic hash addressing unit, and a local cache; the computing unit is configured to execute the computation of the Transformer model based on the model's input sequence and the corresponding embedded vectors in the local cache, and, while computing the current Transformer layer, to asynchronously and in parallel compute the embedded-vector addresses corresponding to the time-step positions of the input sequence for the subsequent Transformer layer; and the prefetch scheduler is configured to asynchronously and in parallel prefetch the embedded vectors from the physical storage blocks to the local cache of the corresponding computing core over the network-on-chip according to the embedded-vector addresses.
  3. The system of claim 2, wherein the deterministic hash addressing unit is configured to execute a multi-head hash function in parallel on the compressed suffix of the time-step position of the input sequence to obtain the embedded-vector address corresponding to that time-step position, the embedded-vector address comprising a physical-storage-block coordinate and an intra-block offset address.
  4. The system of claim 3, wherein the prefetch scheduler stores an address mapping table recording the mapping relationship between the plurality of physical storage blocks and the plurality of shards of the Engram embedded table; the prefetch scheduler is configured to query the address mapping table according to the physical-storage-block coordinate in the embedded-vector address, determine the physical storage block corresponding to that coordinate and the shard mapped to it, locate the single embedded vector of that shard within the physical storage block according to the intra-block offset address in the embedded-vector address, and prefetch the located embedded vector to the local cache of the corresponding computing core over the network-on-chip.
  5. The system of claim 4, wherein the mapping relationship is determined by a load-balancing algorithm based on the size of the Engram embedded table and the number, capacity, and distribution of the physical storage blocks.
  6. The system of claim 2, wherein the core cluster is further provided with a shared cache; the local cache stores embedded vectors whose access frequency is greater than a first access-frequency threshold; the shared cache stores embedded vectors whose access frequency is not greater than the first access-frequency threshold; and the computing unit is configured, if the embedded vector corresponding to a Transformer layer misses in the local cache, to query the shared cache of the core cluster and read the embedded vector corresponding to that Transformer layer.
  7. The system of claim 6, wherein the computing unit is configured, if the embedded vector corresponding to a Transformer layer misses in the shared cache, to read the embedded vector from the physical storage block according to its embedded-vector address.
  8. The system of claim 1, wherein a first number of shards of the Engram embedded table is mapped to the physical storage blocks of the distributed Engram embedded storage subsystem and a second number of shards is mapped to an external storage pool; and the prefetch scheduler is configured, if the embedded vector corresponding to a Transformer layer misses in the physical storage blocks, to prefetch the embedded vector from the external storage pool according to its embedded-vector address.
  9. A conditional storage method, comprising: while a current Transformer layer of a Transformer model is computed based on an input sequence and embedded vectors, a computing core asynchronously and in parallel computes the embedded-vector addresses corresponding to the time-step positions of the input sequence in the subsequent Transformer layer, wherein the embedded vectors of the Transformer layers are stored in an Engram embedded table, the Engram embedded table is divided into a plurality of shards when the Transformer model is loaded, and each shard is mapped to a physical storage block of a distributed Engram embedded storage subsystem; and a prefetch scheduler asynchronously and in parallel prefetches the embedded vectors from the physical storage blocks to the corresponding computing core over a network-on-chip according to the embedded-vector addresses.
  10. A wafer-level chip comprising the conditional storage system of any one of claims 1 to 8.
  11. A board card comprising the wafer-level chip of claim 10.
  12. An electronic device comprising the board card of claim 11.
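As an illustration only (not part of the claims), the tiered lookup of claims 6 to 8 can be sketched as follows: the local cache is probed first, then the cluster's shared cache, then the on-chip physical storage blocks, and finally the external storage pool. All function and variable names here are assumptions for the sketch, not the patent's implementation.

```python
# Hypothetical sketch of the miss-then-fall-through lookup in claims 6-8.
# Caches and storage tiers are modeled as plain dicts keyed by the
# embedded-vector address (block coordinate, intra-block offset); the real
# system probes hardware tiers over the network-on-chip.
def fetch_embedding(addr, local_cache, shared_cache, physical_blocks, external_pool):
    """Return the embedded vector for `addr`, probing tiers fastest-first."""
    if addr in local_cache:          # hot vectors: access frequency above threshold
        return local_cache[addr]
    if addr in shared_cache:         # warm vectors shared across the core cluster
        return shared_cache[addr]
    block, offset = addr
    blk = physical_blocks.get(block)
    if blk is not None and offset in blk:
        return blk[offset]           # on-chip distributed Engram storage
    return external_pool[addr]       # cold shards mapped off-chip (claim 8)
```

A hit at any tier short-circuits the slower tiers, matching the miss-then-query behavior the claims describe.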

Description

Conditional storage system and method

Technical Field

The invention relates to the technical field of wafer-level chips, and in particular to a conditional storage system and a conditional storage method.

Background

As large-scale language models have grown explosively, conditional computing paradigms, represented by mixture-of-experts, have become the mainstay of expanding model capacity. However, current Transformer architectures lack a native knowledge-lookup primitive, forcing models to simulate retrieval of static knowledge through dynamic computation, which wastes computing resources. For example, recognizing a common multi-word entity requires progressive feature combination across multiple layers of attention and feed-forward networks, essentially using expensive runtime computation to reconstruct a static lookup table. The Engram module of the Transformer model uses conditional memory as a novel sparsification axis complementary to conditional computation. It realizes O(1) constant-time lookup of static knowledge patterns through hashed N-gram embeddings, freeing the Transformer model from low-level static pattern reconstruction and letting it focus on high-level reasoning. However, efficiently integrating Engram embedded tables of billions or even tens of billions of parameters into a hardware system poses a significant challenge: the high-bandwidth-memory capacity of conventional GPUs is limited, and if the parameters are stored in external host memory, frequent PCIe data transfers introduce significant latency and bandwidth bottlenecks that limit system throughput. Wafer-level chips (e.g., the Cerebras WSE) provide unprecedented on-chip storage and computing resources, but their architecture differs greatly from that of conventional GPUs.
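The O(1) hashed N-gram lookup idea described above can be sketched in a few lines: a static knowledge pattern is fetched from a pre-built table by hashing its N-gram key, rather than being reconstructed at runtime by attention layers. This is a minimal illustration with an assumed toy hash and assumed names (`engram_table`, `ngram_slot`, `lookup`), not the patent's implementation.

```python
# Toy sketch of O(1) hashed N-gram embedding lookup. Table size, dimension,
# and the hash itself are arbitrary illustrative choices.
EMBED_DIM = 4
TABLE_SIZE = 1024

# Hypothetical pre-trained Engram table: slot index -> embedding vector.
engram_table = [[0.0] * EMBED_DIM for _ in range(TABLE_SIZE)]

def ngram_slot(ngram):
    """Deterministic slot index for an N-gram of token ids (illustrative hash)."""
    h = 0
    for tok in ngram:
        h = (h * 1000003 + tok) % TABLE_SIZE
    return h

def lookup(ngram):
    """O(1) retrieval: one hash plus one table read, independent of model depth."""
    return engram_table[ngram_slot(ngram)]
```

The cost of `lookup` is a single hash and a single read, regardless of how many Transformer layers the model has, which is the contrast with runtime pattern reconstruction drawn in the background above.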
How to deeply combine Engram's deterministic, sparse lookup characteristics with the hardware characteristics of wafer-level chips, so as to realize efficient, low-latency storage and access of very-large-scale Transformer parameters, is an urgent problem to be solved. This section is intended to provide a background or context for the embodiments of the invention recited in the claims; the description herein is not admitted to be prior art by inclusion in this section.

Disclosure of Invention

The embodiment of the invention provides a conditional storage system for realizing efficient, low-latency storage and access of very-large-scale Transformer parameters, comprising a wafer-level computing engine, a network-on-chip, and a distributed Engram embedded storage subsystem. The wafer-level computing engine comprises at least one core cluster, each core cluster being provided with at least one computing core and a prefetch scheduler. The distributed Engram embedded storage subsystem comprises a plurality of physical storage blocks, each mapped to a shard of an Engram embedded table; the Engram embedded table stores the embedded vectors of the Transformer layers of a Transformer model. While computing the current Transformer layer based on the input sequence and the embedded vectors, the computing core asynchronously and in parallel computes the embedded-vector addresses corresponding to the time-step positions of the input sequence in the subsequent Transformer layer. The prefetch scheduler is configured to asynchronously and in parallel prefetch the embedded vectors from the physical storage blocks to the corresponding computing core over the network-on-chip according to the embedded-vector addresses.
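The compute/prefetch overlap described in the disclosure can be sketched as a double-buffered loop: while the compute core runs Transformer layer i, the prefetch scheduler concurrently fetches layer i+1's embedded vectors, hiding memory latency behind computation. The sketch below uses Python threads as a stand-in for the network-on-chip and hardware scheduler; `run_layers`, `compute_layer`, and `prefetch_vectors` are assumed names for illustration.

```python
import threading

def run_layers(num_layers, compute_layer, prefetch_vectors):
    """Compute each layer while the next layer's vectors are fetched concurrently."""
    vectors = prefetch_vectors(0)                 # warm-up fetch for layer 0
    for layer in range(num_layers):
        nxt = {}
        thread = None
        if layer + 1 < num_layers:
            def fetch(i=layer + 1):               # bind the index at thread creation
                nxt["v"] = prefetch_vectors(i)
            thread = threading.Thread(target=fetch)
            thread.start()                        # prefetch layer i+1 asynchronously
        compute_layer(layer, vectors)             # computation overlaps the prefetch
        if thread is not None:
            thread.join()                         # wait for the prefetched vectors
            vectors = nxt["v"]
```

If the per-layer compute time exceeds the fetch time, the `join` returns immediately and the memory latency is fully hidden, which is the effect the asynchronous, parallel prefetching in the disclosure aims for.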
In one embodiment, each computing core is provided with a computing unit, a deterministic hash addressing unit, and a local cache. The computing unit is configured to execute the computation of the Transformer model based on the model's input sequence and the corresponding embedded vectors in the local cache, and, while computing the current Transformer layer, to asynchronously and in parallel compute the embedded-vector addresses corresponding to the time-step positions of the input sequence for the subsequent Transformer layer. The prefetch scheduler is configured to asynchronously and in parallel prefetch the embedded vectors from the physical storage blocks to the local cache of the corresponding computing core over the network-on-chip according to the embedded-vector addresses. In an embodiment, the deterministic hash addressing unit is configured to execute a multi-head hash function in parallel on the compressed suffix of the time-step position of the input sequence to obtain the embedded-vector address corresponding to that time-step position, the embedded-vector address comprising a physical-storage-block coordinate and an intra-block offset address. In an embodiment, the prefetch scheduler stores an address mapping table, where the address mapping table is used to record mapping relationships between a plurality