CN-121996200-A - 9T-SRAM cell, floating-point in-memory computing circuit and CIM chip

CN 121996200 A

Abstract

The invention belongs to the field of integrated circuits, and particularly relates to a 9T-SRAM cell, a floating-point in-memory computing circuit and a CIM chip. In the 9T-SRAM cell, P1, P2, N1 and N2 form a latch with complementary storage nodes Q and QB; the gate of N6 is connected to Q, the source of N6 to VSS, and the drain of N6 to the source of N5; the gate of N5 serves as the multiplication input port INPUT; the gate of P3 is connected to the switch signal line KEY; and the drains of N5 and P3 are tied together as the output port OUT for the multiplication result. The floating-point in-memory computing circuit comprises a compute-in-memory array, an exponent-mantissa precision mapper, a sense amplifier array and a shift adder. The exponent-mantissa precision mapper converts the input exponent difference into a multi-bit precision mask that controls the enable state of each column of the memory array, and adjusts the precision mask bit by bit in subsequent cycles. The invention addresses the high power consumption and large area common to existing floating-point in-memory computing.
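The maskable multiplication described in the abstract can be modeled behaviorally: the cell outputs the AND of the stored bit A and the input bit B when enabled, and a constant 0 when masked by KEY. The following Python sketch is illustrative only (the function and parameter names are assumptions mirroring the patent's A, INPUT and KEY signals), not the patent's transistor-level circuit.

```python
def cell_multiply(a_stored: int, b_input: int, key: int) -> int:
    """Behavioral model of the 9T-SRAM cell's maskable single-bit multiply.

    When KEY=1 the cell is enabled and yields the product A*B;
    when KEY=0 the cell is masked and its result is constantly 0.
    """
    if key == 0:
        return 0                    # masked: cell contributes 0
    return a_stored & b_input       # enabled: single-bit product A*B

# Truth table when enabled: only A=1 and B=1 yields 1
assert cell_multiply(1, 1, 1) == 1
assert cell_multiply(1, 0, 1) == 0
assert cell_multiply(0, 1, 1) == 0
# A masked cell always yields 0, regardless of A and B
assert cell_multiply(1, 1, 0) == 0
```

Note that in the actual cell the product is sensed dynamically: OUT staying high denotes a product of 0, and OUT discharging through N5/N6 denotes a product of 1 (claim 3); the model above abstracts this to a plain 0/1 value.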

Inventors

  • ZHAO QIANG
  • ZHOU YONGLIANG

Assignees

  • Hefei Hengsen Semiconductor Co., Ltd. (合肥恒森半导体有限公司)

Dates

Publication Date
2026-05-08
Application Date
2026-01-28

Claims (10)

  1. A 9T-SRAM cell, characterized in that it serves as the basic cell of an SRAM-based CIM circuit and implements a maskable single-bit multiplication operation, wherein the 9T-SRAM cell consists of 3 PMOS transistors P1-P3 and 6 NMOS transistors N1-N6; P1, P2, N1 and N2 form a latch with complementary storage nodes Q and QB; N3 serves as a pass transistor between Q and bit line BL, and N4 as a pass transistor between QB and bit line BLB; the gates of N3 and N4 are connected to word line WL; the gate of N6 is connected to Q, the source of N6 to VSS, and the drain of N6 to the source of N5; the gate of N5 serves as the multiplication input port INPUT; the gate of P3 is connected to the switch signal line KEY; and the drains of N5 and P3 are tied together as the output port OUT for the multiplication result. When KEY=1 the 9T-SRAM cell is enabled and the OUT port outputs the multiplication result; when KEY=0 the cell is masked and the OUT port outputs a result that is constantly 0.
  2. The 9T-SRAM cell of claim 1, wherein the logic for performing the single-bit multiplication operation is as follows: when KEY=1, one operand A is pre-stored in the complementary storage nodes Q and QB, the other operand B is input to the 9T-SRAM cell through the INPUT port, and the level of the OUT port characterizes the multiplication result.
  3. The 9T-SRAM cell of claim 2, wherein the operation logic for performing the single-bit multiplication is as follows: when Q is high and QB is low, A=1 is represented, and when Q is low and QB is high, A=0 is represented; when the signal at the INPUT port is high, B=1 is represented; when the output signal at the OUT port remains high, the product is 0, and when the output signal at the OUT port falls from high to low, the product is 1.
  4. The 9T-SRAM cell of claim 2, wherein the 6T-SRAM sub-cell formed by P1, P2 and N1-N4 serves as the base unit that performs the data storage function.
  5. A floating-point in-memory computing circuit for dynamically adjusting mantissa truncation precision based on an exponent difference, characterized by comprising: a memory array formed by arranging 9T-SRAM cells according to any one of claims 1-4, wherein the 9T-SRAM cells in the same column of the memory array share bit lines BL and BLB and a switch signal line KEY; an exponent-mantissa precision mapper comprising an initial value generating circuit and a dynamic adjusting circuit, wherein the initial value generating circuit converts the input exponent difference into a multi-bit precision mask according to a preset truncation precision conversion strategy, the dynamic adjusting circuit turns the lowest bit whose value is 1 in the precision mask to 0 in each subsequent cycle, and each bit of the generated precision mask serves as the switch signal controlling the enable state of the 9T-SRAM cells in the corresponding column of the memory array; a sense amplifier array comprising individual sense amplifiers electrically connected to the output ports OUT of the 9T-SRAM cells in the memory array, each sense amplifier converting the multiplication result output by the corresponding 9T-SRAM cell into a corresponding digital quantity; and a shift adder electrically connected to the sense amplifier array and configured to shift-add the digital quantities input to it; wherein, in the first computation cycle, the 9T-SRAM cells of each column in the memory array are masked according to the initial value of the precision mask and the multiplication of the lowest bit of one multi-bit mantissa by the other multi-bit mantissa is executed; in the remaining computation cycles, the 9T-SRAM cells of each column are masked according to the dynamically updated precision mask and the multiplication of each bit, from low to high, of the one multi-bit mantissa by the other multi-bit mantissa is executed; and the multiplication results of all computation cycles are shifted and added according to their weights to obtain the product of the mantissa parts of the two multi-bit floating-point numbers.
  6. The floating-point in-memory computing circuit of claim 5, wherein the operation strategy for multiplying the mantissa parts of two multi-bit floating-point numbers comprises: S1, pre-storing the value of each bit of the mantissa part of a first multi-bit floating-point number, in order of weight, into the 9T-SRAM cells of each column in one row of the memory array; S2, inputting the exponent difference into the exponent-mantissa precision mapper to generate the precision mask controlling the initial truncation precision of the memory array in the first computation cycle, and inputting the lowest-order bit of the mantissa part of the second multi-bit floating-point number into all 9T-SRAM cells of the corresponding row of the memory array through the input signal line IN; each 9T-SRAM cell then completes its multiplication task, and the shift adder shift-adds the products of the enabled 9T-SRAM cells according to their weights, completing the multiplication of the first floating-point number by the lowest bit of the second floating-point number; S3, in each subsequent computation cycle, sequentially inputting the bits of the mantissa part of the second multi-bit floating-point number, from the next-lowest to the highest, into all 9T-SRAM cells of the corresponding row through the input signal line IN; each 9T-SRAM cell then completes its multiplication task, and the shift adder shift-adds the products of the enabled 9T-SRAM cells according to their weights, completing the multiplication of the first floating-point number by the remaining bits of the second floating-point number; and S4, assuming the second floating-point number has N bits, after the Nth computation cycle ends, shift-adding the multiplication results of all computation cycles according to their weights to obtain the product of the mantissa parts of the first and second floating-point numbers.
  7. The floating-point in-memory computing circuit for dynamically adjusting mantissa truncation precision based on an exponent difference as claimed in claim 5, wherein, in the truncation precision conversion strategy preset in the exponent-mantissa precision mapper, the larger the exponent difference of the two floating-point numbers, the more truncation bits the generated initial value of the precision mask represents; the maximum length of the generated precision mask equals the number of columns of the memory array; and the initial value of the generated precision mask is a binary mask whose high M bits are 1 and whose remaining bits are 0, where M is the number of effective mantissa bits corresponding to the exponent difference of the current operation task.
  8. The floating-point in-memory computing circuit for dynamically adjusting mantissa truncation precision based on an exponent difference as claimed in claim 5, wherein the initial value generating circuit in the exponent-mantissa precision mapper comprises a logic gate control circuit and a signal generating array; the logic gate control circuit generates a 10-bit precision precode C9-C0 from the input 4-bit exponent difference D3-D0, and the signal generating array comprises 10 signal generating units and generates the 10-bit precision mask ZC9-ZC0 from the 10-bit precision precode C9-C0; the logic gate control circuit consists of 14 inverters INV1-INV14, 8 AND gates AN1-AN8 and 4 OR gates OR1-OR4, wherein AN3 and AN5 are three-input AND gates, and the circuit connections are as follows: the inputs of INV1-INV3 are connected to VDD and their outputs to C0, C1 and C2 respectively; the two inputs of AN1 are connected to C6 and A2 respectively, and the output of AN1 to C3; the input of INV4 is connected to D1, the output of INV4 to one input of AN2, the other input of AN2 to C6, and the output of AN2 to C4; the input of INV5 is connected to D0, the output of INV5 to one input of AN3, the other two inputs of AN3 to C6 and D1, the output of AN3 to one input of OR1, the other input of OR1 to C4, and the output of OR1 to C5; the inputs of INV6 and INV7 are connected to D2 and D3 respectively, their outputs to the two inputs of AN4, and the output of AN4 to C6; the input of INV8 is connected to D3, the output of INV8 to one input of AN5, the other two inputs of AN5 to A2 and D2, the output of AN5 to one input of OR2, the other input of OR2 to C6, and the output of OR2 to C7; the two inputs of OR3 are connected to A1 and C6 respectively, and its output to C8; the inputs of INV9 and INV10 are connected to D0 and D3 respectively, their outputs to the two inputs of AN6, the output of AN6 to one input of OR4, the other input of OR4 to C8, and the output of OR4 to C9; the inputs of INV11 and INV12 are connected to D1 and D3 respectively, their outputs to the two inputs of AN7, and the output of AN7 to A1; the inputs of INV13 and INV14 are connected to D0 and D1 respectively, their outputs to the two inputs of AN8, and the output of AN8 to A2, where A1 and A2 are both intermediate signals; in the signal generating array, each signal generating unit consists of 2 inverters INV15 and INV16, 1 buffer Buffer, 1 NAND gate NAN1 and 1 OR gate OR5; the inputs of INV15 and Buffer are connected to the clock signal ST, their outputs to the two inputs of NAN1, and the outputs of NAN1 and INV16 to the two inputs of OR5 respectively; the inputs of the INV16 in the 10 signal generating units are connected to C0-C9 respectively, and the output signals at the outputs of the OR5 gates are denoted ZC0-ZC9 respectively.
  9. The floating-point in-memory computing circuit of claim 8, wherein the dynamic adjusting circuit consists of 11 D flip-flops DFF0-DFF10 cascaded in sequence, wherein the Q output of each previous-stage flip-flop is connected to the D input of the next stage; the set input of the first-stage DFF0 is connected to VDD, and the set inputs of DFF1-DFF10 are connected to ZC9-ZC0 respectively; the enable input of DFF0 is connected to a first enable signal CP1, the enable inputs of the other flip-flops are connected to a second enable signal CP, and the CP1 signal is delayed by one cycle relative to CP; the D input of DFF0 is connected to KEY; and the Q outputs of DFF1-DFF10 output the dynamic precision mask DELA9-DELA0.
  10. A CIM chip, characterized in that it is formed by packaging a floating-point in-memory computing circuit for dynamically adjusting mantissa truncation precision based on an exponent difference according to any one of claims 5-9.
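The bit-serial scheme of claims 5 and 6 — one mantissa stored across the array columns, the other fed in one bit per cycle from LSB to MSB, with the precision mask losing its lowest 1 after each cycle — can be sketched behaviorally as follows. This is an illustrative Python model, not the patent's circuit; the initial mask is passed in directly rather than derived from an exponent difference, and all names are assumptions.

```python
def update_mask(mask):
    """Clear the lowest-order 1 in the mask (the behavior claimed for
    the dynamic adjusting circuit); index 0 is the highest column."""
    for i in range(len(mask) - 1, -1, -1):
        if mask[i] == 1:
            mask[i] = 0
            break
    return mask

def mantissa_multiply(a_bits, b_bits, mask):
    """Bit-serial masked multiply: a_bits (MSB first) is stored in the
    array columns, b_bits is consumed LSB first over the computation
    cycles, and masked columns contribute 0 (behavioral sketch)."""
    mask = list(mask)
    width = len(a_bits)
    total = 0
    for cycle, b in enumerate(reversed(b_bits)):     # LSB of B first
        partial = sum((a & b) << (width - 1 - col)   # weight per column
                      for col, a in enumerate(a_bits)
                      if mask[col])                  # masked columns drop out
        total += partial << cycle                    # shift-add by bit weight
        update_mask(mask)                            # truncate one more column
    return total

# 5 x 3: the exact product is 15; because the mask shrinks after each
# cycle, low-order partial products are dropped and the result is 13.
assert mantissa_multiply([1, 0, 1], [0, 1, 1], [1, 1, 1]) == 13
```

The example illustrates the intended trade-off: progressively truncating low-weight columns saves switching energy at the cost of a bounded approximation error in the mantissa product.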

Description

9T-SRAM cell, floating-point in-memory computing circuit and CIM chip

Technical Field

The invention belongs to the field of integrated circuits, and particularly relates to a 9T-SRAM cell, a floating-point in-memory computing circuit that dynamically adjusts mantissa truncation precision based on an exponent difference, and a corresponding CIM chip.

Background

With the spread of artificial intelligence techniques such as deep learning into edge computing scenarios, the demand for energy-efficient, high-precision floating-point operation is increasingly urgent. The compute-in-memory (CIM) paradigm alleviates the "memory wall" problem by embedding computing units into the memory array, and has become one of the key technologies for energy-efficient computing. The rapid growth of Deep Neural Networks (DNNs) in size and complexity has driven their widespread adoption in applications ranging from computer vision to natural language processing. Floating-point in-memory computing (FP-CIM) is better suited to complex AI inference and training tasks thanks to its wide dynamic range and precision. While CIM implementations based on integer arithmetic show excellent performance per watt, their limited numerical range and precision are insufficient to support high-fidelity DNN training and complex inference tasks. This limitation has stimulated strong interest in FP-CIM architectures. However, a major obstacle in FP-CIM design stems from the inherent complexity of floating-point arithmetic. In particular, the mantissa computation path has become a critical bottleneck: conventional methods require full-precision multiplication after exponent alignment, which consumes significant power and silicon area.
Recent research explores optimization techniques such as split computing architectures, dynamic-precision computation, and the integration of Booth multipliers in memory. Despite these advances, an efficient mantissa multiplication unit that dynamically adjusts precision without degrading computational accuracy remains a fundamental challenge, ultimately limiting the energy efficiency of FP-CIM systems.

Disclosure of Invention

To solve the problems of high power consumption and large area common to existing floating-point in-memory computing, the invention provides a 9T-SRAM cell, a floating-point in-memory computing circuit that dynamically adjusts mantissa truncation precision based on an exponent difference, and a corresponding CIM chip. The technical scheme provided by the invention is as follows: a 9T-SRAM cell that serves as the basic cell of an SRAM-based CIM circuit and implements maskable single-bit multiplication. The 9T-SRAM cell consists of 3 PMOS transistors P1-P3 and 6 NMOS transistors N1-N6. P1, P2, N1 and N2 form a latch with complementary storage nodes Q and QB. N3 serves as a pass transistor between Q and bit line BL, and N4 as a pass transistor between QB and bit line BLB. The gates of N3 and N4 are connected to word line WL. The gate of N6 is connected to Q, the source of N6 to VSS, and the drain of N6 to the source of N5. The gate of N5 serves as the multiplication input port INPUT, the gate of P3 is connected to the switch signal line KEY, and the drains of N5 and P3 are tied together as the output port OUT for the multiplication result. When KEY=1 the 9T-SRAM cell is enabled and the OUT port outputs the multiplication result; when KEY=0 the cell is masked and the OUT port outputs a result that is constantly 0.
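The truncation strategy described above (claim 7: the larger the exponent difference, the fewer effective mantissa bits M, with the mask's high M bits set to 1) could be realized by a mapping such as the one below. The linear rule `M = width - exp_diff` is a hypothetical example for illustration; the patent derives the actual encoding from the logic-gate network of claim 8.

```python
def initial_mask(exp_diff: int, width: int = 10) -> list:
    """Hypothetical exponent-difference -> precision-mask mapping.

    A larger exponent difference leaves fewer effective mantissa
    bits M; the returned mask has its high M bits set to 1 and the
    remaining low bits set to 0 (index 0 = highest column)."""
    m = max(width - exp_diff, 1)   # assumed linear truncation rule
    return [1] * m + [0] * (width - m)

assert initial_mask(0) == [1] * 10           # no truncation needed
assert initial_mask(3) == [1] * 7 + [0] * 3  # 3 low columns masked
```

A mapping of this shape matches the claimed mask format (a contiguous run of 1s from the MSB side) regardless of the exact conversion rule chosen.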
In the present invention, the circuit connections of the 9T-SRAM cell are as follows: the sources of P1, P2 and P3 are connected to VDD; the gates of P1 and N1 are connected to the drains of P2 and N2 and to the source of N4, together serving as storage node QB; the gates of P2, N2 and N6 are connected to the drains of P1 and N1 and to the source of N3, together serving as storage node Q; the drain of N3 is connected to BL and the drain of N4 to BLB; the gates of N3 and N4 are connected to WL; the gate of N6 is connected to Q, the source of N6 to VSS, and the drain of N6 to the source of N5; the gate of N5 serves as the multiplication input port INPUT; the gate of P3 is connected to the switch signal line KEY; and the drains of N5 and P3 are tied together as the output port OUT for the multiplication result. As a further improvement of the present invention, the operating logic of the 9T-SRAM cell for performing a single-bit multiplication is as follows: when KEY=1, one operand A is pre-stored in the complementary storage nodes Q and QB, and the other operand B is input to the 9T-SRAM cell through the INPUT port.