CN-121996175-A - Multi-die-oriented distributed storage management apparatus and computing device
Abstract
The invention discloses a multi-die-oriented distributed storage management apparatus and a computing device. The distributed storage management apparatus comprises a DVM proxy module, a DVM remote sending module and a DVM remote receiving module arranged in each die. The DVM proxy module receives DVM request messages carrying invalidation address information submitted by the core nodes of its die, and uses a distributed broadcasting mechanism to send the DVM request messages to the core nodes and IO nodes of its own die and of the other dies, so that the invalidation addresses carried in the DVM request messages are invalidated and a unified memory view is realized. The invention aims to solve the cache-coherency problems of data invalidation over the large address space of a multi-die chip, including how to broadcast the invalidation information of DVM operations, how to avoid deadlock of DVM synchronization operations among dies, and how to reduce the excessively long execution delay of DVM operations in a multi-die scenario.
Inventors
- YANG QIANMING
- SHI WEI
- ZHANG YING
- LI NAN
- QIAO YURAN
- GONG RUI
- WANG YONGWEN
- WANG RUIBO
- LAI MINGCHE
- SUI BINGCAI
- HUANG PENGCHENG
- YANG MAOWANG
- TIE JUNBO
- WANG YONG
- ZHANG JIANFENG
- LIU WEI
- ZENG KUN
- ZHANG JIAN
- FU WENWEN
- FENG QUANYOU
Assignees
- National University of Defense Technology (中国人民解放军国防科技大学)
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-04-09
Claims (10)
- 1. A multi-die-oriented distributed storage management apparatus, characterized by comprising a DVM proxy module, a DVM remote sending module and a DVM remote receiving module arranged in each die, wherein: the DVM proxy module is configured to receive DVM request messages carrying invalidation address information from the core nodes of its own die and to broadcast them in a distributed manner, which includes placing the DVM request messages into a local transaction queue, broadcasting the DVM request messages in the local transaction queue to the core nodes and IO nodes of its own die so that the carried addresses are invalidated to realize a unified memory view, and generating DVM broadcast messages that are submitted to the DVM remote sending module; the DVM remote sending module is configured to send the DVM broadcast messages received from the DVM proxy module to the other dies and to collect the DVM response messages returned by those dies; and the DVM remote receiving module is configured to receive the cross-die DVM request messages sent by the DVM remote sending modules of other dies, to deliver them through the DVM proxy module of its own die to the local core nodes so as to realize a unified memory view, and to return the cross-die DVM response messages to the requesting dies.
- 2. The multi-die-oriented distributed storage management apparatus of claim 1, wherein the DVM proxy module includes three pipelines: a DVM request pipeline, a broadcast response pipeline and a DVM broadcast request pipeline; the DVM request pipeline is configured to receive and process the DVM request messages of all core nodes of its own die and of the other dies, the broadcast response pipeline is configured to collect the responses to the DVM broadcast messages and to reply DVM response messages to the core nodes, and the DVM broadcast request pipeline is configured to transmit the invalidation address information to all core nodes, to IO nodes where necessary, and to the DVM remote sending module.
- 3. The multi-die-oriented distributed storage management apparatus of claim 2, wherein the DVM request pipeline includes four stations: station M0, station M1, station R1 and station R2; the broadcast response pipeline comprises the same four stations (M0, M1, R1 and R2) and is merged with the DVM request pipeline; and the DVM broadcast request pipeline comprises two stations, S1 and S2. Station M0 is located at the link layer. From station M0 to station M1, the pipeline first judges whether the local transaction queue or the remote transaction queue has a free entry; if there is no free entry, a retransmission response message is sent to the original requesting node, and if there is a free entry, a credit grant message is sent to the original requesting node or the DVM request message is accepted; the DVM request message is then parsed, and the DVM invalidation address information and control information in the message are stored in a free entry of the local transaction queue or the remote transaction queue according to the requesting device type of the DVM request message, wherein the local transaction queue records the DVM request messages of the core nodes of its own die and the remote transaction queue records the DVM request messages of the core nodes of the other dies; from station M0 to station M1 the broadcast response pipeline also receives DVM response messages and updates the broadcast counter of the corresponding transaction entry in the local or remote transaction queue. From station M1 to station R1, the DVM request pipeline and the broadcast response pipeline select one entry among four candidate transaction classes (synchronous and asynchronous transactions of the local transaction queue and of the remote transaction queue) using an LRU arbitration algorithm. Station R2 generates the transmission vector of a DVM broadcast message, or sends a DVM response message, according to the arbitration result, wherein the transmission vector identifies the destination nodes to which the DVM broadcast message is to be sent; after the transmission vector of the DVM broadcast message is obtained, it is converted by the DVM broadcast state machine into a recognizable physical identifier TgtID used for sending the DVM broadcast message to each destination node. From station S1 to station S2, the DVM broadcast request pipeline assembles a new DVM broadcast message, filling in a destination node that has not yet been broadcast according to the DVM broadcast state machine; the DVM broadcast request pipeline then sends the DVM broadcast message at station S2 and informs the DVM broadcast state machine that the DVM broadcast message has been sent to that destination node.
- 4. The multi-die-oriented distributed storage management apparatus of claim 3, wherein the DVM broadcast state machine comprises five states: an idle state (IDLE), an active state (ACTIVE), a send-data-identifier-and-receive-data state (SENT_DBID_DBVALID), a send-DVM-broadcast-message state (SENTREQ) and a send-response-message state (SENTCOMP). The idle state (IDLE) identifies that the transaction queue entry is free, the active state (ACTIVE) identifies that the transaction queue entry has accepted a DVM request, the send-data-identifier-and-receive-data state (SENT_DBID_DBVALID) identifies that the transaction queue entry is receiving DVM invalidation information, the send-DVM-broadcast-message state (SENTREQ) identifies that the transaction queue entry is sending DVM broadcast messages, and the send-response-message state (SENTCOMP) identifies that the transaction queue entry is sending a response message. The DVM broadcast state machine enters the idle state (IDLE) after system start-up, and the transition conditions among the five states are: condition ①, the entry remains in the idle state (IDLE) while it is not allocated; condition ②, the entry transitions from the idle state (IDLE) to the active state (ACTIVE) when it is allocated; condition ③, the entry transitions unconditionally from the active state (ACTIVE) to the send-data-identifier-and-receive-data state (SENT_DBID_DBVALID); condition ④, the entry remains in the send-data-identifier-and-receive-data state (SENT_DBID_DBVALID) while it has not collected all DVM invalidation information; condition ⑤, the entry transitions to the send-DVM-broadcast-message state (SENTREQ) when it has collected all the DVM invalidation information in the send-data-identifier-and-receive-data state (SENT_DBID_DBVALID); condition ⑥, the entry transitions to the send-response-message state (SENTCOMP) when it has collected all responses to the DVM broadcast request in the send-DVM-broadcast-message state (SENTREQ); condition ⑦, the entry transitions to the idle state (IDLE) when a data check error is detected in the collected DVM invalidation information in the send-DVM-broadcast-message state (SENTREQ); and condition ⑧, the entry transitions unconditionally from the send-response-message state (SENTCOMP) to the idle state (IDLE).
- 5. The multi-die-oriented distributed storage management apparatus according to claim 1, wherein the DVM remote sending module includes two pipelines: a first pipeline, a DVM broadcast request pipeline for sending the cross-die DVM broadcast messages delivered by the DVM proxy module to the other dies, and a second pipeline, a DVM broadcast response pipeline for replying a DVM response message to the DVM proxy module of its own die after the responses to the cross-die DVM broadcast request have been collected.
- 6. The multi-die-oriented distributed storage management apparatus according to claim 5, wherein the first pipeline of the DVM remote sending module includes four stations, station R10 to station R13; station R10 is located at the link layer and opens the data path connecting the link layer to the first pipeline. From station R10 to station R11, the DVM invalidation address information and control information are stored, after message parsing, in a free entry of a sending transaction queue (DVM TRACKER 1), which records the DVM invalidation address information and the completion status of each DVM request. From station R11 to station R12, a DVM response message is sent directly if the DVM remote sending module has not completed the handshake with the remote die; if the handshake has succeeded, a new DVM broadcast message is selected from the sending transaction queue (DVM TRACKER 1) to assemble a new cross-die message. From station R12 to station R13, the routing information of a die that has not yet been broadcast is filled into the cross-die message according to the cross-die state machine, and station R13 sends the cross-die message and informs the state machine that the corresponding die has been broadcast; broadcasting continues in this way until the DVM broadcast message has been sent to every target die.
- 7. The multi-die-oriented distributed storage management apparatus of claim 5, wherein the second pipeline of the DVM remote sending module includes three stations, D1 to D3; station D1 is located at the link layer; from station D1 to station D2 the DVM broadcast response messages are matched, by message parsing, to transaction entries in the sending transaction queue (DVM TRACKER 1); from station D2 to station D3 one transaction entry that has collected the responses of all dies is picked from the sending transaction queue (DVM TRACKER 1) and the corresponding DVM response message is assembled; and station D3 is configured to send the DVM response message.
- 8. The multi-die-oriented distributed storage management apparatus according to claim 1, wherein the DVM remote receiving module includes two pipelines: a first pipeline, a cross-die request message pipeline for receiving cross-die messages from the DVM remote sending modules of other dies and delivering the request messages to the DVM proxy module of its own die, and a second pipeline, a DVM response pipeline for receiving DVM response messages, retiring the transaction queue entries corresponding to the DVM broadcast messages, and sending the cross-die DVM response messages.
- 9. The multi-die-oriented distributed storage management apparatus of claim 8, wherein the first pipeline of the DVM remote receiving module includes a receiving transaction queue (DVM TRACKER 2) and a retransmission buffer; the receiving transaction queue (DVM TRACKER 2) is configured to receive the cross-die DVM request messages of other dies, and the retransmission buffer mainly handles flow control with the DVM proxy module: when the DVM proxy module sends a retransmission response message, the retransmission buffer marks the corresponding transaction entry in the receiving transaction queue (DVM TRACKER 2) as being in a retransmission state and waits for the DVM proxy module to send a credit grant message, upon which a transaction entry in the retransmission state in the receiving transaction queue (DVM TRACKER 2) is selected and resent.
- 10. A computing device comprising a multi-die microprocessor and a memory interconnected with each other, wherein the multi-die microprocessor comprises the multi-die-oriented distributed storage management apparatus according to any one of claims 1 to 9.
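The five-state DVM broadcast state machine of claim 4 can be summarized as a small executable model. The following Python sketch is illustrative only: the class and method names are not from the patent, and it assumes that a data check error in SENTREQ (condition ⑦) aborts the transaction back to IDLE, as the claim describes.

```python
# Illustrative model of the DVM broadcast state machine of claim 4,
# covering transition conditions ① through ⑧. Names are assumptions.
from enum import Enum, auto

class DvmState(Enum):
    IDLE = auto()                # entry is free (①: stays here until allocated)
    ACTIVE = auto()              # entry has accepted a DVM request
    SENT_DBID_DBVALID = auto()   # collecting DVM invalidation information
    SENTREQ = auto()             # broadcasting the DVM message
    SENTCOMP = auto()            # sending the completion response

class DvmEntry:
    def __init__(self):
        self.state = DvmState.IDLE

    def allocate(self):
        # condition ②: IDLE -> ACTIVE on allocation,
        # condition ③: ACTIVE -> SENT_DBID_DBVALID unconditionally
        assert self.state is DvmState.IDLE
        self.state = DvmState.ACTIVE
        self.state = DvmState.SENT_DBID_DBVALID

    def collect_invalidate_info(self, complete: bool):
        # condition ④ keeps waiting; condition ⑤ advances to SENTREQ
        assert self.state is DvmState.SENT_DBID_DBVALID
        if complete:
            self.state = DvmState.SENTREQ

    def collect_broadcast_responses(self, all_collected: bool, check_error=False):
        # condition ⑦: data check error aborts to IDLE;
        # condition ⑥: all responses collected -> SENTCOMP
        assert self.state is DvmState.SENTREQ
        if check_error:
            self.state = DvmState.IDLE
        elif all_collected:
            self.state = DvmState.SENTCOMP

    def send_completion(self):
        # condition ⑧: SENTCOMP -> IDLE unconditionally
        assert self.state is DvmState.SENTCOMP
        self.state = DvmState.IDLE
```

A normal transaction thus walks IDLE → ACTIVE → SENT_DBID_DBVALID → SENTREQ → SENTCOMP → IDLE, with the error path of condition ⑦ short-circuiting back to IDLE.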
Description
Multi-die-oriented distributed storage management apparatus and computing device
Technical Field
The invention relates to the technical field of inter-die coherent interconnection, in particular to a multi-die-oriented distributed storage management apparatus and computing device.
Background
Inter-die coherent interconnection is mainly used for coherent connection between processors (or accelerators). It can guarantee cache coherency among multiple processors, break through the resource bottleneck of a single die, and coordinate and integrate the computing, storage, IO and other resources of each processor. To meet the current high-bandwidth, low-latency performance requirements between processors (or accelerators), various die-coherent interconnect technologies, such as the CCIX and CXL interconnect technologies, have been proposed by major chip manufacturers in recent years. A large processor composed of multiple dies can form a globally data-shared coherent system, provide rich hardware resources, facilitate parallel computation among the dies and shorten computation time. In practice, however, as the number of dies increases, the time cost of maintaining cache coherency among them grows higher and higher. In particular, scenarios such as process migration, multi-process synchronization and process termination, which require data invalidation over a large block of address space, place higher requirements on the latency of cache coherency.
To solve the problem of large-scale address-space data invalidation, the CHI protocol proposes DVM (Distributed Virtual Memory) operations, which can perform global invalidation and synchronization for different buffer types, thereby unifying memory views and maintaining cache coherency. DVM operation types are categorized into TLB page-table invalidation, branch-predictor invalidation, instruction-cache invalidation and synchronization operations. The first three classes are asynchronous invalidations: further DVM request messages can be initiated without waiting for their completion, and multiple operations are allowed to proceed in parallel. The synchronization operation ensures that all requesting nodes (such as processor core nodes and main IO nodes) have completed all previously received asynchronous operations; it carries no invalidation information itself. The data path of DVM mainly comprises the interactions among the requesting nodes (processor core nodes and IO nodes with invalidation demands), the DVM operation management node (DN: DVM Node) and the routing nodes (CELL) in the system. Any processor core node (core node for short) can act as both initiator and completer of a DVM request message; a CELL node mainly provides routing and link interconnection, guaranteeing that all modules connected to it can communicate with one another; and the DN node is the DVM operation management node, used for receiving and broadcasting all DVM operations in the system.
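The split between the three asynchronous invalidation classes and the synchronous operation can be illustrated with a minimal sketch. The names and the simple drain rule below are assumptions for exposition only, not the CHI protocol's actual encodings.

```python
# Illustrative classification of the DVM operation types described above.
# Enum names are assumptions; the real CHI opcode encodings differ.
from enum import Enum

class DvmOp(Enum):
    TLB_INVALIDATE = "tlbi"      # asynchronous invalidation
    BP_INVALIDATE = "bpi"        # asynchronous invalidation
    ICACHE_INVALIDATE = "ici"    # asynchronous invalidation
    SYNC = "sync"                # synchronous; carries no invalidation info

def is_async(op: DvmOp) -> bool:
    """Asynchronous ops may be issued back-to-back without waiting."""
    return op is not DvmOp.SYNC

def sync_may_complete(outstanding_async_ops: list) -> bool:
    """A SYNC completes only after every previously received
    asynchronous invalidation has been finished by all nodes."""
    return len(outstanding_async_ops) == 0
```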
A core node sends a DVM request message, which is transmitted through a CELL node to the DN node. After receiving the complete DVM message, the DN node broadcasts it to all other core nodes and to the IO nodes with invalidation requirements; each core node (or IO node) that receives the DVM message sends a response message back to the DN node, and after all response messages have been collected, the DN node sends a response message to the original requesting node to indicate that the DVM request is complete. Through this process, a DVM message carrying invalidation information can be broadcast to, and acted on by, every node in the system that needs to invalidate, so that the memory view is unified and cache coherency is achieved. Meanwhile, by using DVM operations of different types and filling address information into them, invalidation of different target objects can be achieved. It should be noted that the CHI protocol targets single-die systems and does not address the field of multi-die interconnects; the DVM operation described above can therefore only solve the cache-coherency problem of a single die's large address space. To solve the cache-coherency problem of the large address space of multiple dies, the basic problems are, first, how to broadcast the invalidation information carried by a DVM operation to the core nodes or IO nodes of all other dies (Die); second, how to avoid deadlock of DVM synchronization operations among the dies; and finally, how to reduce the excessively long execution delay of DVM operations in a multi-die scenario.
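The request, broadcast, collect-responses, complete round described above can be sketched as a toy model. Node identifiers and function names are illustrative, not from the patent; the pending set plays the role of the broadcast counter mentioned in claim 3.

```python
# Toy model of one DN-node DVM round: broadcast to every node except
# the requester, track outstanding responses, and complete the original
# request only once all responses have been collected.
class DnTransaction:
    def __init__(self, requester, all_nodes):
        self.requester = requester
        # nodes still expected to respond (cf. the broadcast counter)
        self.pending = {n for n in all_nodes if n != requester}

    def targets(self):
        return sorted(self.pending)

    def on_response(self, node):
        """Record one node's response; return the completion message
        for the original requester once all nodes have replied."""
        self.pending.discard(node)
        if not self.pending:
            return ("COMP", self.requester)
        return None

def dvm_round(all_nodes, requester):
    txn = DnTransaction(requester, all_nodes)
    comp = None
    for node in txn.targets():   # broadcast; each target node replies
        comp = txn.on_response(node) or comp
    return comp
```

The key property the model captures is that the completion message toward the original requester is generated only by the last arriving response, which is exactly what the DN node's response collection guarantees.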