CN-121979828-A - Interconnection system based on double-layer topology
Abstract
The invention relates to the field of chip design, and in particular to an interconnection system based on a double-layer topology. The system comprises at least one switch and at least one interconnection topology, wherein each interconnection topology comprises N physical topologies and a plurality of logical topologies, and the switch is used for connecting all computing units in the interconnection topology. The plurality of logical topologies are obtained by configuring the switch. Each logical topology comprises two unit groups and the logical paths between the two unit groups; the ith logical topology comprises T logical paths, each logical path connects two computing units belonging respectively to the two unit groups, and the computing units connected by different logical paths are different. The two unit groups can thus be interconnected through the logical topology, which expands the interconnection scale of the computing units, increases topology redundancy, and improves the reliability of the system.
Inventors
- FU XUAN
- LIU XIAOQING
- CONG GAOJIAN
- WEI LI
- LI ZHAOSHI
Assignees
- 沐曦集成电路(上海)股份有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20241028
Claims (10)
- 1. An interconnection system based on a double-layer topology, wherein the system comprises at least one switch and at least one interconnection topology, and each interconnection topology comprises N physical topologies and a plurality of logical topologies; wherein each physical topology comprises: L unit groups, each unit group comprising T computing units; intra-group physical connection lines for connecting the computing units within each unit group; and inter-group physical connection lines for connecting computing units belonging to different unit groups, the two computing units connected by each inter-group physical connection line being different; wherein the switch is used for connecting all computing units in the interconnection topology; and wherein the switch is configured to obtain the plurality of logical topologies, each logical topology comprising two unit groups and logical paths between the two unit groups, the ith logical topology comprising T logical paths, each logical path connecting two computing units belonging respectively to the two unit groups, and the computing units connected by different logical paths being different.
- 2. The system of claim 1, wherein the two unit groups in each logical topology belong to different physical topologies.
- 3. The system of claim 2, wherein the two unit groups belonging to different physical topologies in each logical topology are adjacent in physical position, adjacency being determined with the physical topologies arranged in sequence along the same direction, and wherein the first unit group and the last unit group in physical position are regarded as adjacent.
- 4. The system of claim 1, wherein the two unit groups in each logical topology belong to the same physical topology.
- 5. The system of claim 4, wherein the two unit groups belonging to the same physical topology in each logical topology are adjacent in physical position.
- 6. The system of claim 1, wherein when one inter-group physical connection line fails, a new topology is configured by: acquiring the two target unit groups connected by the failed inter-group physical connection line, configuring all inter-group physical connection lines between the two target unit groups as unavailable, and switching to a new topology, formed by the remaining physical topology and the logical topology, for data exchange.
- 7. The system of any of claims 1-5, wherein when a computing unit fails, a degraded topology is obtained as follows: if L1 unit groups are bound to form an inseparable basic unit group, the basic unit group containing the failed computing unit is configured as unavailable; the physical topology is degraded on the basis of the remaining basic unit groups that have not failed, to obtain at least one degraded physical topology; the switch is reconfigured according to the degraded physical topology to obtain a plurality of degraded logical topologies; and the degraded physical topology and the logical topology form the degraded topology.
- 8. The system of any of claims 2-3, wherein when a logical topology fails, the failed logical topology is configured as unavailable, resulting in at least one degraded topology.
- 9. The system of claim 1, wherein the unit group comprises an intra-group topology, the intra-group topology being a topology formed by the T computing units in the unit group through the intra-group physical connection lines, and wherein the intra-group topology is a ring topology, a mesh topology, or a star topology.
- 10. The system of claim 1, wherein the unit group comprises an inter-group topology, the inter-group topology being formed by point-to-point connections, through the inter-group physical connection lines, between the T computing units in the unit group and the T computing units in an adjacent unit group, respectively.
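The reconfiguration recited in claim 6 can be illustrated with a minimal sketch. It is not the patented implementation: the representation of a link as a pair of hypothetical `(group, unit)` coordinates, and the names `group_pair` and `reconfigure_after_link_failure`, are assumptions made only for illustration. The sketch shows the one rule the claim states: when a single inter-group physical connection line fails, every inter-group line between the same two unit groups is marked unavailable, and the remaining physical lines together with the logical paths form the new topology.

```python
def group_pair(link):
    """Return the unordered pair of unit groups that a link connects.

    A link is modeled (as an assumption) as ((group, unit), (group, unit)).
    """
    (group_a, _unit_a), (group_b, _unit_b) = link
    return frozenset((group_a, group_b))


def reconfigure_after_link_failure(physical_links, logical_paths, failed_link):
    """Drop every physical link between the two groups joined by failed_link.

    Returns (remaining_physical_links, logical_paths), which together
    stand for the new topology used for data exchange.
    """
    target = group_pair(failed_link)
    remaining = [link for link in physical_links if group_pair(link) != target]
    return remaining, list(logical_paths)


# Hypothetical example with T = 2 computing units per unit group.
physical_links = [
    ((0, 0), (1, 0)), ((0, 1), (1, 1)),  # inter-group lines, groups 0 and 1
    ((1, 0), (2, 0)), ((1, 1), (2, 1)),  # inter-group lines, groups 1 and 2
]
logical_paths = [((0, 0), (2, 0)), ((0, 1), (2, 1))]  # configured via the switch

remaining, logical = reconfigure_after_link_failure(
    physical_links, logical_paths, failed_link=((0, 0), (1, 0))
)
```

In this example the healthy line `((0, 1), (1, 1))` is also removed, because it joins the same two target unit groups as the failed line; only the lines between groups 1 and 2 and the two logical paths remain.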
Description
Interconnection system based on double-layer topology
Technical Field
The invention relates to the field of chip design, and in particular to an interconnection system based on a double-layer topology.
Background
With the growing demand for high-performance computing (HPC) and massively parallel processing, building large-scale interconnected clusters has become a key way to increase computing power. However, in large-scale interconnected GPU clusters, system reliability problems are increasingly pronounced because of the complexity of the physical topology and the dependence of nodes on the connections between them. Particularly in complex interconnect structures involving a large number of compute nodes, the design of the physical topology affects not only computing performance but also, directly, the stability and fault tolerance of the system.
In complex interconnect topologies such as ring, mesh, or tree structures, the connections between nodes often carry multiple dependencies. If a critical node or connection link fails, communication may be interrupted or the performance of the entire network may degrade significantly. For example, in a ring topology all nodes are connected in a closed loop. If any segment of the ring fails, the integrity of the ring is compromised, data can no longer be transmitted along the intended path, and the entire system may be paralyzed.
Accordingly, there is a need for an interconnection system that can extend the interconnection topology and increase the fault redundancy of the system.
Disclosure of Invention
To address the above technical problems, the invention adopts the following technical solution: an interconnection system based on a double-layer topology comprises at least one switch and at least one interconnection topology, wherein each interconnection topology comprises N physical topologies and a plurality of logical topologies.
Each physical topology comprises L unit groups, intra-group physical connection lines, and inter-group physical connection lines, each unit group comprising T computing units. The intra-group physical connection lines connect the computing units within each unit group; the inter-group physical connection lines connect computing units belonging to different unit groups, and the two computing units connected by each inter-group physical connection line are different.
The switch is used for connecting all computing units in the interconnection topology. The switch is configured to obtain the plurality of logical topologies; each logical topology comprises two unit groups and logical paths between the two unit groups, wherein the ith logical topology comprises T logical paths, each logical path connects two computing units belonging respectively to the two unit groups, and the computing units connected by different logical paths are different.
The invention has at least the following beneficial effects:
The interconnection system based on the double-layer topology provided by the embodiment of the invention comprises a physical topology and a logical topology configured by a switch. The logical topology is formed by the switch interconnecting the computing units in two unit groups, and interconnecting the two unit groups through the logical topology expands the interconnection scale of the GPUs.
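The structure of a logical topology described above can be sketched as a small data model. This is an illustrative sketch only, not the disclosed implementation: the `Unit` coordinates, the helper names, and in particular the index-to-index pairing of the two unit groups are assumptions. The text requires only that there be T logical paths, that each path connect one computing unit from each of the two unit groups, and that different paths connect different computing units.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """A computing unit addressed by hypothetical coordinates."""
    topology: int  # index of the physical topology
    group: int     # index of the unit group within that topology
    index: int     # index of the computing unit within the group


def unit_group(topology, group, t):
    """Create a unit group containing t computing units."""
    return [Unit(topology, group, k) for k in range(t)]


def build_logical_topology(group_a, group_b):
    """Pair the k-th unit of group_a with the k-th unit of group_b.

    The index-to-index pairing is an assumption; any one-to-one
    pairing would satisfy the constraint stated in the text.
    """
    if len(group_a) != len(group_b):
        raise ValueError("both unit groups must contain T computing units")
    return list(zip(group_a, group_b))


def paths_are_disjoint(paths):
    """Check that no computing unit is an endpoint of more than one path."""
    endpoints = [unit for path in paths for unit in path]
    return len(endpoints) == len(set(endpoints))


# Hypothetical example: T = 4, one unit group from each of two physical topologies.
T = 4
paths = build_logical_topology(unit_group(0, 0, T), unit_group(1, 0, T))
```

Here `paths` holds T logical paths, and `paths_are_disjoint(paths)` confirms that each computing unit appears in exactly one of them.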
Meanwhile, because the communication paths of the logical topology are flexible and configurable, combining it with the physical topology improves the redundancy and flexibility of the system topology. When one or more connection lines fail, the failed paths can be replaced with new paths provided by the logical paths and the physical paths, either to continue data exchange or to switch the interconnection scale; the resulting high fault tolerance improves the overall reliability and flexibility of the system.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of an interconnection system according to a first embodiment of the present invention;
FIG. 2 is a schematic diagram of a logical topology formed after switch configuration according to a second embodiment of the present invention;
FIG. 3 is a schematic diagram of a loop topology to which the topology of FIG. 2 is switched;
FIG. 4 is a schematic diagram of a degraded topology in the event of a failure of a computing unit of FIG. 3;
FIG. 5 is a schematic diagram of a redundant topology to which the logical topology of FIG. 3 switches upon failure.
Deta