CN-122019244-A - Virtual fault domain isolation and recovery method suitable for Lingqu (UnifiedBus) interconnection system
Abstract
The invention relates to the technical field of chip interconnection, and in particular to a virtual fault domain isolation and recovery method suitable for a Lingqu (UnifiedBus) interconnection system. The method comprises: when a fault event occurs, switching a state machine to a QUIESCE mode and maintaining a transaction shadow table; and, after reconnection, performing task replay according to the transaction shadow table and switching the state machine back to the ACTIVE state. In the prior art, a faulty third-party node in such an interconnection system easily causes a chain reaction. To address this, an isolation mechanism is added on the IO interconnection chip through which the third-party chip is connected: the state machine of the fault domain is switched to the QUIESCE mode, a transaction shadow table of incomplete transactions is established, and the IO interconnection chip performs proxy processing of some of the transactions, so that the fault does not spread to other devices in the interconnection system during the fault and reconnection stage. Replay is carried out after the access object recovers, thereby achieving risk isolation and recovery.
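The state machine summarized above can be illustrated with a minimal Python sketch. The state names ACTIVE and QUIESCE come from the abstract, and RECOVERING, REJOIN and FROZEN from claims 3 and 9; the transition table, the class `FaultDomain` and its method names are assumptions made for illustration only and are not taken from the patent text.

```python
from enum import Enum, auto


class DomainState(Enum):
    ACTIVE = auto()
    QUIESCE = auto()
    RECOVERING = auto()
    REJOIN = auto()
    FROZEN = auto()


# Assumed transition table. The text does not say how FROZEN is left,
# so this sketch treats it as terminal.
ALLOWED = {
    DomainState.ACTIVE: {DomainState.QUIESCE},
    DomainState.QUIESCE: {DomainState.RECOVERING, DomainState.FROZEN},
    DomainState.RECOVERING: {DomainState.REJOIN},
    DomainState.REJOIN: {DomainState.ACTIVE},
    DomainState.FROZEN: set(),
}


class FaultDomain:
    """Hypothetical per-access-object state machine held by the IO interconnection chip."""

    def __init__(self, domain_id):
        self.domain_id = domain_id
        self.state = DomainState.ACTIVE
        self.history = [DomainState.ACTIVE]

    def _move(self, target):
        # Reject any transition that the assumed table does not list.
        if target not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {target.name}"
            )
        self.state = target
        self.history.append(target)

    def on_fault(self):
        # A fault event was detected: isolate the domain.
        self._move(DomainState.QUIESCE)

    def on_freeze_timeout(self):
        # The freeze timer fired before the access object came back.
        self._move(DomainState.FROZEN)

    def on_reconnect(self):
        # The access object reconnected: walk back to ACTIVE in order.
        for target in (DomainState.RECOVERING, DomainState.REJOIN,
                       DomainState.ACTIVE):
            self._move(target)
```

In this sketch a domain that reaches FROZEN cannot be reconnected; a real design would need an explicit exit from that state, which the text here does not describe.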
Inventors
- HE WEI
- CHEN WEIRONG
Assignees
- 上海方宜万强微电子有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20260416
Claims (10)
- 1. A virtual fault domain isolation and recovery method suitable for a Lingqu (UnifiedBus) interconnection system, applied to an IO interconnection chip in the interconnection system, characterized by comprising: a step S1 of allocating a virtual fault domain identifier to an access object connected to the IO interconnection chip, and then detecting fault events of the access object; a step S2 of, when it is determined that a fault event occurs, switching, by the IO interconnection chip, a state machine of the access object to a QUIESCE mode and maintaining a transaction shadow table of the access object, wherein the state machine is used by the upper-level interconnection system to confirm the state of the access object; a step S3 of classifying incomplete transactions in the transaction shadow table and monitoring the state of the access object; and a step S4 of, after the access object reconnects, performing task replay according to the transaction shadow table and switching the state machine back to an ACTIVE state.
- 2. The virtual fault domain isolation and recovery method according to claim 1, wherein the step S1 comprises: a step S11 of, when a newly connected access object appears, performing mutual authentication with the access object and obtaining its capability information; a step S12 of allocating a virtual fault domain ID to the access object; and a step S13 of mapping a plurality of logical fault domains to the virtual fault domain ID and then detecting the fault event; and wherein, in the step S2, proxy processing is performed on the access object according to the logical fault domains.
- 3. The virtual fault domain isolation and recovery method according to claim 1, wherein a freeze timer is set for the fault event when the step S2 is executed, and wherein, when the freeze timer is triggered during the step S3, execution of the step S4 is stopped and the state machine is switched to a FROZEN state.
- 4. The virtual fault domain isolation and recovery method according to claim 1, wherein the fault event comprises at least one of: an active reset indication from a third-party object, a power state exception, a clock loss, link degradation exceeding a threshold, a heartbeat timeout, a protocol message verification failure, a context fingerprint inconsistency, a response timeout, credits not reclaimed for a long time, a hung interrupt, or a management plane read failure.
- 5. The virtual fault domain isolation and recovery method according to claim 1, wherein the step S2 comprises: a step S21 of switching the state machine to the QUIESCE mode when it is determined that the fault event occurs; a step S22 of acquiring the incomplete transactions that the access object is running and extracting transaction information; and a step S23 of creating the transaction shadow table according to the transaction information.
- 6. The virtual fault domain isolation and recovery method according to claim 5, wherein the transaction information comprises at least one of: a transaction identifier, a transaction type, an address range, the queue/channel to which the transaction belongs, a transaction phase, a completion progress, a dependency, whether proxy completion is allowed, and whether post-recovery replay is allowed.
- 7. The virtual fault domain isolation and recovery method according to claim 5, wherein in the step S21, the IO interconnection chip further performs at least one of the following operations on the access object: freezing new credits, reclaiming unused credits, marking dangling transaction identifiers, blocking new write requests, closing writable address windows, reserving read-only diagnostic windows, freezing queue head and tail updates, locking shared buffer ownership, and restricting doorbells, interrupts or events from continuing to be injected into the host system.
- 8. The virtual fault domain isolation and recovery method according to claim 1, wherein the step S3 comprises: a step S31 of dividing the incomplete transactions into a waiting completion class, a proxied completion class and a recoverable replay class according to the transaction shadow table; a step S32 of performing, by the IO interconnection chip, proxy processing on the incomplete transactions of the waiting completion class and the proxied completion class, and suspending the incomplete transactions of the recoverable replay class to await task replay; and a step S33 of classifying the incomplete transactions into reads and writes during the proxy processing and processing each accordingly.
- 9. The virtual fault domain isolation and recovery method according to claim 1, wherein the step S4 comprises: a step S41 of switching the state machine to a RECOVERING state after the access object reconnects; a step S42 of performing identity verification and context reconstruction on the access object; a step S43 of performing transaction replay according to the transaction shadow table; and a step S44 of restoring the access object and switching the state machine sequentially to a REJOIN state and then to the ACTIVE state.
- 10. A storage medium comprising computer instructions which, when executed by a computer device, perform the virtual fault domain isolation and recovery method as claimed in any one of claims 1 to 9.
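The transaction shadow table of claims 5 and 6 and the three-way classification of claim 8 can be sketched as follows. This is a minimal illustration under stated assumptions: the field names of `ShadowEntry` are a subset of the transaction information listed in claim 6, and the policy inside `classify` (proxy completion takes priority over replay, everything else waits for completion) is a hypothetical choice, since the claims do not state how a class is derived from the recorded fields.

```python
from dataclasses import dataclass
from enum import Enum


class TxnClass(Enum):
    WAIT_COMPLETION = "waiting completion"
    PROXY_COMPLETION = "proxied completion"
    REPLAY = "recoverable replay"


@dataclass
class ShadowEntry:
    """One incomplete transaction recorded when the domain enters QUIESCE."""
    txn_id: int
    txn_type: str          # "read" or "write"
    proxy_allowed: bool    # whether proxy completion is allowed
    replay_allowed: bool   # whether post-recovery replay is allowed


def classify(entry):
    # Hypothetical policy, not taken from the patent text.
    if entry.proxy_allowed:
        return TxnClass.PROXY_COMPLETION
    if entry.replay_allowed:
        return TxnClass.REPLAY
    return TxnClass.WAIT_COMPLETION


def split_shadow_table(table):
    """Group transaction ids by class, preserving table order within each class."""
    buckets = {cls: [] for cls in TxnClass}
    for entry in table:
        buckets[classify(entry)].append(entry.txn_id)
    return buckets


def replay_order(table):
    """Ids to replay after reconnection, in the order they were recorded."""
    return split_shadow_table(table)[TxnClass.REPLAY]
```

Keeping the replay list in recording order is one simple way to respect ordering between transactions; a fuller sketch would also consult the dependency field named in claim 6.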
Description
Virtual fault domain isolation and recovery method suitable for Lingqu (UnifiedBus) interconnection system
Technical Field
The invention relates to the technical field of chip interconnection, and in particular to a virtual fault domain isolation and recovery method suitable for a Lingqu (UnifiedBus) interconnection system.
Background
The Lingqu (UnifiedBus) interconnection system is a supernode-oriented interconnection protocol developed by Huawei for interconnecting large-scale computing resources. It forms a single logical address space by globally and uniformly addressing all computing, memory and storage resources, and eliminates the traditional master-slave scheduling architecture, so that each container can bypass the CPU and operating system of the peer when initiating cross-device or cross-node access, achieving microsecond-level latency. On the basis of this interconnection system, third-party chiplets, third-party chips and third-party modules can be connected into the same interconnection domain. The physical bearer of an access object may be an in-package die interconnect or a board-level or system-level chip-to-chip (C2C) interconnect, or may be modular access after bridging, switching or network extension.
For example, patent document PCT/CN2023/099530 discloses a data processing method, apparatus, device and system, and relates to the field of data processing. In that method, a scheduler obtains a job to be processed, controls at least one supernode according to the resource requirement of the job, and processes the job based on a global memory pool of the supernode. The job to be processed is a processing request associated with a distributed application.
Thus, the global memory pool is a resource shared by the nodes within the supernode, formed by uniformly addressing the storage media of those nodes. Nodes in the supernode that are connected through a high-speed interconnection technology share and access the global memory pool to process the job. This avoids MPI-based communication between the nodes in the supernode, simplifies the programming model of the applications run by the nodes, effectively reduces I/O communication between the nodes, and makes full use of the performance of the supernode. As a result, data processing time is shortened, system energy consumption is reduced, and system performance is improved.
As another example, the patent document with application number CN202511172887.7 discloses a supernode system that includes an intelligent computing resource pool and a general computing resource pool. The intelligent computing resource pool includes a first intra-pool exchange module, a first inter-pool exchange module and a plurality of GPUs, and the general computing resource pool includes a second intra-pool exchange module, a second inter-pool exchange module and a plurality of CPUs, so as to decouple heterogeneous computing resources. All GPUs, whether in the same intelligent computing resource pool or in different ones, can communicate through the first intra-pool and inter-pool exchange modules, and all CPUs, whether in the same general computing resource pool or in different ones, can communicate through the second intra-pool and inter-pool exchange modules. The system can provide intelligent computing resources and general computing resources simultaneously, and the intelligent computing resource pools and the general computing resource pools can be expanded separately, which allows a flexible ratio of resources and improves utilization.
The intelligent computing resources and the general computing resources are pooled separately and deployed in different cabinets, which decouples the deployment of heterogeneous resources, increases the GPU density of a single intelligent computing resource cabinet, reduces the impact range of single-point faults, and makes the cabinet easy to maintain.
However, in practice the inventors found that once an existing third-party chip has been connected to the system and is operating in it, an abnormality of that chip such as a local reset, power failure, controller failure, heartbeat timeout, protocol violation, link degradation or context loss still exposes the host system to risks such as hung transactions, credit blockage, broken ordering relations, shared memory pollution, hung interrupts and sudden service stoppage.
Disclosure of Invention
Aiming at the above problems in the prior art, a virtual fault domain isolation and recovery method suitable for a Lingqu (UnifiedBus) interconnection system is provided