
US-20260127121-A1 - HIGH BANDWIDTH MEMORY STRUCTURES FOR COMPUTER PROCESSOR UNITS


Abstract

A computer processing unit (CPU) comprising an instruction unit and an accelerator unit includes a first configuration for concurrent instruction and accelerator operation, using an instruction-bus for instruction-data transfers and a data-bus for compute-data transfers, and a second configuration for accelerator operation that uses both buses for compute-data transfers to boost accelerator performance. A CPU comprises a configurable sense-node in cache-memory with a detect RC-delay 100 times faster than the memory bit-line settling RC-delay, allowing it to selectively detect and latch a plurality of settled bit-line voltages in quick succession during a memory access, and transmit the latched data sequentially, in evenly distributed time steps, on one or more data-buses. One or more cache-memory address signals configurably couple a plurality of data-words to one or more buses, increment one or more memory address signals, and configure DDR, QDR, and higher data-rate modes of data transfer. Disclosed embodiments enhance high performance computing data bandwidth.
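For orientation, a minimal behavioral sketch of the two bus configurations summarized above is given below; the class name, bus widths, and mode numbering are illustrative assumptions, not values taken from the disclosure.

```python
# Illustrative behavioral model of the two bus configurations described in the
# abstract. All names and widths are assumptions for the sketch.

from dataclasses import dataclass

@dataclass
class BusConfig:
    instruction_bus_bits: int = 64   # assumed width of the instruction bus
    data_bus_bits: int = 64          # assumed width of the data bus

    def compute_data_bits_per_cycle(self, mode: int) -> int:
        """Bits of compute-data transferred per clock cycle in each mode."""
        if mode == 1:
            # First configuration: instruction bus carries instruction-data,
            # data bus carries compute-data.
            return self.data_bus_bits
        if mode == 2:
            # Second configuration: both buses carry compute-data, so the
            # accelerator sees roughly double the compute-data bandwidth.
            return self.data_bus_bits + self.instruction_bus_bits
        raise ValueError("mode must be 1 or 2")

cfg = BusConfig()
print(cfg.compute_data_bits_per_cycle(1))  # 64
print(cfg.compute_data_bits_per_cycle(2))  # 128
```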

Inventors

  • Raminda U. Madurawe
  • Joseph T. DiBene, II

Assignees

  • Raminda U. Madurawe
  • Joseph T. DiBene, II

Dates

Publication Date
2026-05-07
Application Date
2024-11-04

Claims (20)

  1. A computer processing unit (CPU) for high bandwidth data processing comprising an instruction bus to transfer instruction-data and a data bus to transfer compute-data, comprised of: a configurable first mode to transfer instruction-data in the instruction bus, and transfer compute-data in the data bus; and a configurable second mode to transfer compute-data in both of said instruction bus and said data bus.
  2. The device of claim 1, wherein the data bus comprises a plurality of wires, and the instruction bus comprises the same or a higher number of wires as the data bus, and wherein: the configurable first mode transfers compute-data at a first data rate; and the configurable second mode transfers compute-data at a data rate higher than the first data rate.
  3. The device of claim 1, further comprising: an instruction processing unit; and an accelerator unit to process a function instruction; and a configurable means comprised of: interpreting a received instruction to determine the configurable mode; and configuring the configurable first mode to use the instruction processing unit and the accelerator unit for data processing at the first data rate; and configuring the configurable second mode to halt transferring instruction-data and use the accelerator unit for data processing at a data rate higher than the first data rate.
  4. The device of claim 3, further comprising: an instruction cache memory configurably coupled to the instruction bus; and a data cache memory configurably coupled to the instruction bus and the data bus; and a first control unit, and a second control unit; and the configurable means comprised of: configuring the first mode to assign the first control unit to a master control role, and assign the second control unit to a slave control role controlled by the master, and couple a portion of the instruction cache memory to the instruction bus, and couple a first portion of the data cache memory to the data bus; and configuring the second mode to assign the second control unit to the master control role, and assign the first control unit to the slave control role controlled by the master, and decouple the instruction cache memory from the instruction bus, and couple the first portion of the data cache memory to the data bus and a second portion of the data cache memory to the instruction bus.
  5. The device of claim 3, further comprising: two or more memory buffers; and a first plurality of compute-data forming a first word comprised of one or more consecutive bytes of data; and a second plurality of compute-data forming a second word comprised of consecutive bytes identical to the first word; and the configurable second mode further comprised of: a load mode to transfer the first word in the first bus, and transfer the second word in the second bus from the data cache memory to a said memory buffer; and a store mode to transfer the first word in the first bus, and transfer the second word in the second bus from a said memory buffer to the data cache memory.
  6. The device of claim 4, wherein the configurable means further comprises a means for the master control unit to change the configurable modes between the master mode and the slave mode.
  7. The device of claim 5, wherein: during the first mode, the CPU receives instruction-data from the instruction cache memory, and compute-data from the data cache memory to concurrently execute a plurality of instructions in the instruction processing unit, and execute a function instruction in the accelerator unit; and during the second mode, the CPU receives compute-data in the instruction-bus and the data-bus from the data cache memory to increase data bandwidth for a plurality of successive function computations in the accelerator unit.
  8. A high bandwidth cache memory structure in a computer processing unit (CPU) comprising: a first clock cycle time to access a cache memory array to transfer two or more data words, each data word comprising an identical plurality of data bits; and a first bus comprising a plurality of wires, the number of data bits in the data word identical to the number of wires in the first bus to transfer a data word; and a first address to select a word line in an array of memory elements, the word line comprising memory elements of at least a first and a second data word; and a second address to select one of the first and the second data words; and a first configuration to couple one of the first and the second data words to the first bus, and dynamically switching the second address at least two times within the first clock cycle time to transfer the two data words sequentially in the first bus.
  9. The device of claim 8, further comprised of: a second bus comprising a plurality of wires identical to the first bus; and a second configuration to: couple the first data word to the first bus; and couple the second data word to the second bus, independent of at least one address signal status in the second address; wherein, selecting the first and second addresses couples the first and the second data words simultaneously to the two buses to transfer two data words during the first clock cycle.
  10. The device of claim 8, wherein: the first address selects 2^N data words where N is an integer greater than one; and the second address comprises N address signals to couple one of the 2^N data words to the first bus, and dynamically incrementing the second address N times within the first clock cycle time transfers 2^N data words sequentially in the first bus.
  11. The device of claim 10, further comprising: a bit line to output each data bit value of 2^(N+M) bit line outputs comprising the selected 2^N data words, each said data word comprising 2^M data bits, where M is an even integer; and the first bus comprising 2^M wires to transfer a data word comprised of 2^M data bits; and a said bit line includes a first RC time constant to reach a detect voltage from the time the word line is selected; and a sense device comprising an output node coupled to a said wire in the first bus, and an input node comprised of: a means to selectively connect to a bit line in each of the 2^N data words; and a second RC time constant to reach a voltage nearly equal to the detect voltage from the time the input node is connected to a bit line, wherein the second RC time constant is at least 2^(N+1) times lower, and preferably 100 times lower, and more preferably 1000 times lower than the first RC time constant; wherein, dynamically incrementing the second address connects one of said 2^N data word bit lines one by one to the sense device input node to detect and transfer 2^N data bits in the first bus during the first clock cycle time.
  12. The device in claim 11, further comprising: each sense device comprised of 2^N latches, each latch comprising: an input; and an output; and a latch capture time less than the second RC time constant; and a selectable means of coupling the sense device output to each of the 2^N latch inputs one at a time matched with the dynamic incrementing of the second address to capture the detected 2^N bit line values in the 2^N latches; and a driver comprising an input and an output that buffers the input signal; and a selectable means of coupling the 2^N latch outputs one at a time in 2^N time steps to the driver input during the first clock cycle time to relay the latched data at the driver output coupled to a bus wire to increase the data transfer bandwidth by 2^N times.
  13. The device in claim 12, further comprising: the first address selecting 2^(N+1) data words comprised of a first set of 2^N data words, and a second set of 2^N data words; and a first set of 2^N latches to capture the first set of 2^N detected data words; and a second set of 2^N latches to capture the second set of 2^N detected data words; and a second bus comprising 2^M wires identical to the first bus wires; and the second address comprising (N+1) address signals; and a second configuration to selectively couple first word bit lines to the first set of latches, and the second word bit lines to the second set of latches during dynamically incrementing N address signals regardless of at least one address bit in the (N+1) bit second address; wherein, coupling the first set of 2^N latched outputs in the first bus wire, and the second set of 2^N latched outputs in the second bus wire, one pair at a time in 2^N time steps sequentially increases the data transfer bandwidth by 2^(N+1) times.
  14. The device of claim 12, wherein the first bus further comprises a segmented interconnect structure comprising: a first wire segment comprised of a first end coupled to a said driver output comprising: a means of bypassing the driver and coupling to a said bit line; and a wire segment length; and a second end capable of coupling to a second wire segment of equal segment length to relay a signal utilizing a bidirectional latch buffer comprised of: an input to receive the signal and an output to relay the signal; and a configurable means of selecting the input and the output to couple to the first and second wire segments to configure the signal direction; and a detector coupled to the input to detect an input signal transition comprising a trip-point; and a latch coupled to the detector to store a binary data value based on the transition detection, the latched value buffered at the output to relay the signal; wherein, the wire segment length and the trip-point facilitate achieving a wire segment delay 2^N times lower than the first cycle time to transfer high bandwidth memory data.
  15. A sense device to evaluate a data state of a memory element in a cache memory structure of a computer processor unit (CPU), the sense device comprising: an input node comprised of a first capacitance; and a configurable means to couple the input node to a plurality of bit lines in a memory array, each bit line having a second capacitance, the configurable means comprising: a first state to isolate the input node from the plurality of bit lines; and a second state to connect the input node to a said bit line to detect a voltage level of the bit line determined by a data state in a memory element coupled to the bit line by an address selected word line; and a plurality of cyclical isolate and connect operations for the input node to connect to the plurality of bit lines one by one to detect each of the bit line voltage levels sequentially.
  16. The device of claim 15, wherein a said bit line comprises at least two voltage levels comprised of: a first voltage level about equal to a power voltage level determined by a pre-charged bit line voltage at the power voltage level unchanged by a first data state in the memory element; and a second voltage level at a detect voltage level of a sense device determined by a pre-charged bit line voltage at the power voltage level being discharged during a bit line settling time to reach the detect voltage level by a second data state in the memory element; wherein, the detect voltage level is preferably about 75% of the power voltage level, and more preferably about 80% of the power voltage level to reduce the said bit line settling time to increase data transfer bandwidth.
  17. The device of claim 16, further comprising: an output node; and a plurality of latches configurably coupled to the output node, a said latch to store a said detected bit line data state, the plurality of latches storing the plurality of data states in said sequentially connected bit lines to the input node; wherein, latching a plurality of bit line data states facilitates detecting the plurality of bit line data states at a faster cycle time compared to a word line addressing cycle time and an equal data transfer cycle time to increase data transfer bandwidth.
  18. The device of claim 17, wherein the plurality of latches comprises non-overlapping data capture pulses, each data capture pulse synchronized with the cyclical connect operation to capture the voltage level of the bit line connected to the sense device input node in a said latch; wherein, the address selected word line memory element coupled plurality of bit lines settle at a first delay time, and the cyclical sense and latch data capture operates at a second cycle time at least two times, preferably 4 times, and more preferably 2^N times faster than the first delay time to increase data bandwidth, where N is an integer greater than two.
  19. The device of claim 17, wherein: the sense node comprises a sense time determined by a first RC time constant; and the bit line comprises a settling time determined by a second RC time constant, at least 100 times larger than the first RC time constant due to the resistance and capacitance differences between the sense node and bit line; wherein, a sense node connected to a bit line equilibrates at a voltage level nearly equal to the bit line voltage nearly 100 times faster than the settling time due to charge sharing; and wherein, during a single cache memory address cycle time, a plurality of bit line voltage levels can be detected, and latched, and transferred to an output using a single sense device.
  20. The device of claim 19, wherein a said latch output is coupled to a first wire segment to transfer the plurality of sense device latched data, the first wire segment further comprised of: a first end coupled to a latch output driver, the first end further comprising a means of bypassing the sense device and coupling to a said bit line; and a wire segment length; and a second end capable of coupling to a second wire segment of equal segment length to relay a signal utilizing a bidirectional latch buffer comprised of: an input to receive the signal and an output to relay the signal; and a configurable means of selecting the input and the output to couple to the first and second wire segments to configure the signal direction; and a detector coupled to the input to detect an input signal transition comprising a trip-point; and a latch coupled to the detector to store a binary data value based on the transition detection, the latched value buffered at the output to relay the signal; wherein, the wire segment length and the trip-point facilitate short wire delays to transfer the plurality of latched data to achieve high data transfer bandwidth.
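The charge-sharing sense operation recited in claims 15-19 can be illustrated numerically as a rough sketch; the resistance, capacitance, and voltage values below are invented for the example and are not taken from the specification.

```python
# Numerical sketch of the charge-sharing sense operation of claims 15-19.
# Component values are illustrative assumptions.

VDD = 1.0              # assumed supply (power) voltage, volts
V_DETECT = 0.8 * VDD   # detect level, ~80% of the power voltage per claim 16

# Assumed parasitics: a long, heavily loaded bit line versus a tiny sense node.
R_BITLINE, C_BITLINE = 5e3, 200e-15   # ohms, farads
R_SENSE,   C_SENSE   = 1e3, 1e-15

tau_bitline = R_BITLINE * C_BITLINE   # settling RC of the bit line
tau_sense   = R_SENSE * C_SENSE       # detect RC of the sense node
print(f"bit-line tau = {tau_bitline:.2e} s, sense tau = {tau_sense:.2e} s, "
      f"ratio = {tau_bitline / tau_sense:.0f}x")   # 1000x here, >= 100x as claimed

# Charge sharing: connecting the small sense node to a settled bit line pulls the
# sense node essentially to the bit-line voltage, since C_BITLINE >> C_SENSE.
v_bitline = V_DETECT        # bit line discharged by a '0' cell down to the detect level
v_sense_initial = VDD       # assume the sense node was pre-charged to VDD
v_shared = (C_BITLINE * v_bitline + C_SENSE * v_sense_initial) / (C_BITLINE + C_SENSE)
print(f"sense node after charge sharing: {v_shared:.4f} V (bit line at {v_bitline} V)")
```

Because the bit-line capacitance dominates, the small sense node settles to the bit-line voltage almost immediately, which is why a single sense device can sample several bit lines within one word-line access as the claims describe.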

Description

This application is related to Provisional Application Ser. No. 63/468,059 entitled “Macro-Processor Architectures”, filed on 22 May 2023, and Provisional Application Ser. No. 63/468,061 entitled “Content-Compute Processors and Architectures”, filed on 22 May 2023, both of which list as inventor Mr. Raminda U. Madurawe, the contents of which are incorporated by reference. This application is also related to application Ser. No. 18/656,824 entitled “Macroprocessor Architectures for Pipelined Flexible-Function Computing”, application Ser. No. 18/656,836 entitled “Content Compute Processors and Architectures”, and application Ser. No. 18/656,854 entitled “Interconnect Structures for Configurable CPU Pipelines”, all filed on 7 May 2024, and application Ser. No. 18/656,854 entitled “Control Units for Heterogeneous Compute Processors”, filed on 22 May 2024, all of which list as inventor Mr. Raminda U. Madurawe, the contents of which are incorporated by reference.

BACKGROUND

1. Field of the Invention

The present invention relates to a plurality of integrated circuits, and further relates to central processor units (CPU), field programmable gate arrays (FPGA), and application specific integrated circuits (ASIC). CPUs include microprocessors, microcontrollers and other instruction-based processors comprising one or more processor cores. FPGAs include other types of programmable logic devices (PLDs). ASICs include domain-specific accelerators (co-processors such as TPUs, NPUs, GPUs and DSAs) and in-memory compute units (CIM). Integrated circuits include hardware architectures (HWA) and instruction set architectures (ISA). Specifically, the invention relates to high bandwidth cache memories and segmented bus architectures for multi-core CPU systems for high performance computing (HPC). The invention includes configurable coherent cache data storage structures, data communication bus structures, and control units in HWA. A CPU comprises an instruction-bus to receive instruction-data and a data-bus to receive compute-data, wherein said instruction-bus and data-bus fetch compute-data to increase HPC bandwidth. The CPU further comprises a configurable accelerator to utilize the increased data bandwidth. A data-bus in the CPU comprises a configurable means of transferring data within a clock cycle at one of a single data rate, a double data rate, and a quadruple data rate to boost data bandwidth. Said data-bus further comprises one or more latches comprising a means of early signal transition detection to reduce signal transmission delays.

2. Prior Art

A microprocessor, also known as a CPU, is a widely used first embodiment of a programmable device in the Integrated Circuits (IC) industry. The programming is done by executing ISA-instructions. It comprises a plurality of hardware structures (arranged in the hardware architecture, HWA) to process the pre-defined instruction set (the ISA). The matched HWA-ISA duality allows a control-unit to select a plurality of dedicated hardware structures to execute all instructions using control-signals. Each activity takes one or more clock cycles. Compiled instructions reside in memory, in the form of data-strings, and when an instruction is loaded (or read) into an instruction-register (IR), an IR decoding circuit instructs the control unit to provide the hardware functions needed to execute the instruction. Hardware units manipulate compute-data associated with instructions. Instruction-data and compute-data may reside in different segments of an external hard-drive of a computer.
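As a simple illustration of the HWA-ISA duality described above, the sketch below models a control unit that decodes the contents of the instruction register and selects the dedicated hardware function that executes it; the opcodes and Python function names are invented for the example and do not reflect any particular ISA.

```python
# Toy illustration of the HWA-ISA duality: a control unit decodes the instruction
# loaded into the instruction register (IR) and drives the dedicated hardware unit
# that executes it. The ISA and the "hardware" functions are invented for this sketch.

def hw_add(a, b): return a + b          # dedicated ALU adder
def hw_mult(a, b): return a * b         # dedicated multiplier
def hw_and(a, b): return a & b          # dedicated bitwise-AND unit

CONTROL_UNIT = {"ADD": hw_add, "MULT": hw_mult, "AND": hw_and}  # ISA -> hardware mapping

def execute(instruction_register: str, op_a: int, op_b: int) -> int:
    """Decode the IR contents and select the matching hardware function."""
    opcode = instruction_register.strip().upper()
    return CONTROL_UNIT[opcode](op_a, op_b)

print(execute("ADD", 3, 4))    # 7
print(execute("MULT", 3, 4))   # 12
```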
Hereafter, the term instructions refers to instruction-data and the term data refers to compute-data. A CPU utilizes a cache memory hierarchy to fetch instructions and data from the external memory using an Operating-System (OS) that also runs on a CPU dedicated to the OS, known as the host-CPU. Some instructions move data (such as move, load and store), and some instructions compute data (such as AND, MULT, ADD). When instructions manipulate data, the instructions and data need to be synchronized. The cache memory hierarchy ensures accuracy of data, and when multiple copies of the same data reside in multiple memory locations, all data-fields must match, aka data-coherency. Only a store command can disturb the coherency.

In this discussion, it is assumed that a CPU chip has three levels of cache memory: L3-cache (L3$), L2-cache (L2$) and L1-cache (L1$). It could have fewer or more memory levels. Instructions and data move from external memory to L3$ to L2$ to L1$ sequentially to feed the CPU, and work in reverse order to save computed results back to the hard-drive. Instructions move only one way, towards L1$, while data move both ways. Drivers ensure the directionality of instruction and data movement. Local bus structures (drivers and wires) are used to move data between storage units. The number of wires and a data clocking frequency determine the data bandwidth. To feed one or more CPUs, the bus structures must provide re