
US-12619563-B2 - Dynamic random-access memory (DRAM) configured for block transfers and method thereof


Abstract

A method and system for building a block data transfer (BT) DRAM address the performance gap between memory and processor. The data conversion time per word between the analog circuits and the digital circuits inside the BT DRAM is smaller than the processor clock cycle time, which enables the average data transfer speed of a BT DRAM to match the operating speed of a processor. When continuously transferring a plurality of data blocks, a BT DRAM achieves close-to-zero-latency performance and is completely self-refreshing.

Inventors

  • Weidong Zhang

Assignees

  • SUNRISE MEMORY CORPORATION

Dates

Publication Date
2026-05-05
Application Date
2023-08-04

Claims (9)

  1. A block data transfer memory system having a system interface receiving access commands from an external processor and operated by a clock signal being a clock signal of the external processor, comprising: two cache arrays, comprising a first cache array and a second cache array, each configured to hold one or more data blocks, each data block having a block size comprising a predetermined number of data words; an input/output circuit configured for transferring a data block between a designated one of the cache arrays and the external processor through the system interface in a system transfer operation, wherein in the system transfer operation, the input/output circuit transfers the data block to the system interface in units of data words at a rate of one data word per cycle of the clock signal; a memory array configured for storing a plurality of data blocks, the memory array further configured for transferring one or more data blocks in units of the block size between the memory array and either one of the cache arrays in a memory transfer operation, each data block being transferred within an access time of the memory array; and an access controller configured for receiving access commands from the external processor and controlling both system transfer operations and memory transfer operations, wherein the access controller is configured to: (i) designate the first cache array to carry out a first plurality of system transfer operations to transfer data words of one data block and, using the input/output circuit, transfer the data words of the one data block from the first cache array to the external processor through the system interface during the first plurality of system transfer operations; and (ii) simultaneously with the first plurality of system transfer operations, designate the second cache array to carry out a memory transfer operation and transfer a data block between the memory array and the second cache array within the access time of the memory array, wherein the first plurality of system transfer operations has a duration greater than the access time of the memory array.
  2. The memory system of claim 1, wherein the predetermined number is a parameter that is configured by the external processor using the access commands over the system interface.
  3. The memory system of claim 1, wherein the memory array comprises a plurality of banks, each bank comprising a plurality of subarrays of memory cells, with subarrays in the same bank being configured to participate in a memory transfer operation simultaneously.
  4. The memory system of claim 1, wherein each cache array is configured as one or more 2-dimensional arrays of storage cells, each 2-dimensional array is organized into rows and columns, wherein the number of rows in each cache array equals the predetermined number and the number of columns in each cache array equals the bus width of a word.
  5. The memory system of claim 4, wherein each 2-dimensional array forms a data section configured to provide a data block from the one or more data blocks in one of the system transfer operations independently of other data sections.
  6. The memory system of claim 1, wherein the access controller further comprises a refresh control circuit for carrying out refresh operations in the memory array without participation by an external agent over the system interface.
  7. The memory system of claim 1, wherein the access controller configures the two cache arrays into a pipeline for carrying out successive system transfer operations involving multiple data blocks.
  8. The memory system of claim 1, wherein the memory system includes more than two cache arrays, such that more data blocks at independent addresses can be transferred and stored in the cache arrays.
  9. The memory system of claim 1, wherein the access controller implements in each of the cache arrays a write-back policy in which a data block transferred into the cache array by system transfer operations is transferred into the memory array by a memory transfer operation initiated by the access controller.
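The two-cache "ping-pong" scheme of claims 1 and 7 can be sketched as a simple timing model: while one cache array streams a block to the processor at one word per clock cycle, the other is refilled from the memory array within the array access time, so back-to-back block transfers hide the memory latency. All names and numeric parameters below are illustrative, not taken from the patent.

```python
# Illustrative timing model of the ping-pong cache scheme in claim 1.
# While one cache array drains one word per clock cycle, the idle cache
# array is refilled from the memory array within the access time.
# BLOCK_SIZE, T_CLK_NS, and T_ACCESS_NS are hypothetical values.

BLOCK_SIZE = 64          # words per block (the "predetermined number")
T_CLK_NS = 0.25          # processor clock cycle (a 4.0 GHz clock)
T_ACCESS_NS = 15.0       # memory-array access time for one block transfer

def stream_blocks(num_blocks):
    """Return total time (ns) to stream num_blocks back-to-back."""
    stream_ns = BLOCK_SIZE * T_CLK_NS   # one word per cycle (claim 1)
    # Pipelining works only if refilling the idle cache array finishes
    # before the active one drains (claim 1's duration condition).
    assert stream_ns > T_ACCESS_NS, "block must stream slower than it is fetched"
    # The first block pays the initial fetch; every later fetch is
    # hidden behind the streaming of the previous block.
    return T_ACCESS_NS + num_blocks * stream_ns

# After the first access, the processor sees one word per cycle with no
# further memory latency, i.e., the "close-to-zero-latency" behavior.
print(stream_blocks(1))    # 31.0
print(stream_blocks(100))  # 1615.0
```

With these assumed numbers, 100 consecutive blocks cost 1615 ns rather than 100 separate 31 ns accesses, which is the pipelining benefit claim 7 describes.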

Description

CROSS REFERENCE TO RELATED APPLICATIONS

This patent application claims priority to U.S. provisional patent application, Ser. No. 63/375,004, entitled “DYNAMIC RANDOM-ACCESS MEMORY (DRAM) CONFIGURED FOR BLOCK TRANSFERS AND METHOD THEREOF,” filed Sep. 8, 2022, which is incorporated herein by reference for all purposes.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to memory systems, including those built out of integrated circuits. In particular, the present invention relates to a dynamic random-access memory (DRAM) system that is configured for block data transfers.

2. Discussion of the Related Art

In the article “Hitting the Memory Wall: Implications of the Obvious,” by W. Wulf et al., published in ACM SIGARCH Computer Architecture News, Volume 23, Issue 1, March 1995, pp. 20-24 (at https://doi.org/10.1145/216585.216588), the authors reviewed a “processor-memory performance gap”: “[T]he rate of improvement in microprocessor speed exceeds the rate of improvement in DRAM memory speed, each is improving exponentially, but the exponent for microprocessors is substantially larger than that for DRAMs. The difference between diverging exponentials also grows exponentially; so, although the disparity between processor and memory speed is already an issue, downstream someplace it will be a much bigger one . . . . Even if we assume a cache hit rate of 99.8% and use the more conservative cache miss cost of 4 cycles as our starting point, performance hits the 5-cycles-per-access wall in 11-12 years. At a hit rate of 99% we hit the same wall within the decade, and at 90%, within 5 years. Note that changing the starting point (the current miss/hit cost ratio) and the cache miss rates don't change the trends: if the microprocessor/memory performance gap continues to grow at a similar rate, in 10-15 years each memory access will cost, on average, tens or even hundreds of processor cycles.
Under each scenario, system speed is dominated by memory performance.” The authors' prediction has become a reality. For example, in a state-of-the-art system, a 4.0 GHz processor executes one cycle in 0.25 ns, but when it accesses data from the main memory, it needs over 40 ns, or 160 cycles, to complete the access. Thus, the processor-memory performance gap is huge. Over the years, even though system designs continuously made progress, which included (i) three or more levels of caches—colloquially referred to as Level-1, Level-2, Level-3, . . . , and LLC (“Last Level Cache”)—with increasing capacities to improve the hit rate, and (ii) improved industry-standard DRAM interfaces (e.g., DDR2, DDR3, DDR4, DDR5 SDRAM, and HBM interfaces) with higher data transfer rates and greater bandwidths, the processor-memory performance gap keeps increasing. With AI (Artificial Intelligence) and HPC (High Performance Computing) applications, today's microprocessors (e.g., multi-core CPUs and GPUs) require even greater amounts of data transferred to and from the main memory. This processor-memory performance gap, also referred to as the “memory wall,” has become the main obstacle to performance improvement in today's computer systems. Thus, there is a long-felt need for a memory technology whose performance can scale with the processor's performance.
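The cycle-cost figure quoted above follows from simple arithmetic, using the same example numbers as the text (a 4.0 GHz processor and a roughly 40 ns main-memory access):

```python
# Stall cost of a main-memory access for the example in the text.
clock_hz = 4.0e9                   # 4.0 GHz processor
cycle_ns = 1e9 / clock_hz          # 0.25 ns per clock cycle
mem_access_ns = 40.0               # main-memory access latency
stall_cycles = mem_access_ns / cycle_ns
print(cycle_ns, stall_cycles)      # 0.25 160.0
```

That is, every cache miss that reaches main memory costs on the order of 160 processor cycles, which is the gap the BT DRAM architecture is intended to close.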
SUMMARY

According to one embodiment of the present invention, a block data transfer (BT) memory system having a system interface and operated by a clock signal includes: (a) two cache arrays, each configured to hold one or more data blocks, each data block having a predetermined number (“block size”) of data words; (b) an input/output circuit configured for transferring a data block between a designated one of the cache arrays and the system interface in a system transfer operation, wherein the input/output circuit transfers one or more data words of the data block within each cycle of the clock signal; (c) a memory array configured for storing multiple data blocks, such that one or more data blocks are transferable between the memory array and either one of the cache arrays in a memory transfer operation within an access time of the memory array; and (d) an access controller configured for controlling both system transfer operations and memory transfer operations, wherein the access controller is configured to designate which one of the cache arrays is the designated cache array and to cause a plurality of system transfer operations, equal to or greater in number than the integer multiple, to be carried out simultaneously with a memory transfer operation between the memory array and the cache array other than the designated cache array. Each set of system transfer operations carried out concurrently with a memory transfer operation may have a duration greater than the memory array access time. In one embodiment, the duration is greater than the memory access time by less than one clock cycle of the clock signal. According to one embodiment, the block size is a parameter that may be configured.
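The timing condition stated in the summary can be expressed directly: the system transfer operations for one block must take longer than the memory access time and, in one embodiment, longer by less than one clock cycle. The clock period and access time below are hypothetical values chosen to illustrate the constraint, not figures from the patent.

```python
# Sketch of the summary's timing condition, with hypothetical numbers.
import math

T_CLK_NS = 0.25       # clock cycle (a hypothetical 4.0 GHz clock)
T_ACCESS_NS = 15.9    # memory-array access time (hypothetical)

# Smallest block size whose system-transfer duration (one word per
# clock cycle) just exceeds the memory access time:
block_size = math.floor(T_ACCESS_NS / T_CLK_NS) + 1
duration_ns = block_size * T_CLK_NS

# Condition 1: the streaming duration exceeds the access time, so the
# refill of the other cache array always finishes in time.
assert duration_ns > T_ACCESS_NS
# Condition 2 (one embodiment): it exceeds it by less than one cycle,
# so almost no streaming bandwidth is wasted waiting on the array.
assert duration_ns - T_ACCESS_NS < T_CLK_NS

print(block_size, duration_ns)   # 64 16.0
```

Choosing the block size this way makes the two cache arrays alternate with essentially no idle cycles, which is what yields the close-to-zero-latency behavior described in the abstract.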