CN-122001773-A - Topology grouping method and device for multi-rail InfiniBand network
Abstract
The application provides a topology grouping method and device for a multi-rail InfiniBand network. The method comprises: obtaining link information and port information of each device of the multi-rail InfiniBand network; grouping servers based on the link information and port information of the servers to obtain a plurality of server groups; constructing a corresponding super computing unit based on each server group and the Leaf switches connected to that server group; constructing a Spine switch group based on the link information and port information of the Leaf switches of each super computing unit; and constructing a Core switch group based on the link information and port information of the Spine switches of each Spine switch group. By grouping the topology of the multi-rail InfiniBand network, the method can ensure efficient utilization of cluster resources and stable output of computing performance.
Inventors
- Gong Zhuming
- Wang Tao
- Huang Yongbao
- Chen Peng
Assignees
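- China Mobile (Suzhou) Software Technology Co., Ltd. (中移(苏州)软件技术有限公司)
- China Mobile Communications Group Co., Ltd. (中国移动通信集团有限公司)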
Dates
- Publication Date: 20260508
- Application Date: 20260409
Claims (10)
- 1. A topology grouping method for a multi-rail InfiniBand network, comprising: acquiring link information and port information of each device of the multi-rail InfiniBand network in a target period, wherein the devices comprise at least a server, a Leaf switch, a Spine switch and a Core switch; grouping servers based on the link information and port information of the servers of the multi-rail InfiniBand network in the target period to obtain a plurality of server groups; constructing a corresponding super computing unit based on each server group and the Leaf switches connected to that server group, wherein the super computing unit comprises a plurality of servers and a plurality of Leaf switches, and the servers in one super computing unit are connected only to the Leaf switches in that super computing unit; constructing a Spine switch group based on the link information and port information of the Leaf switches of each super computing unit, wherein the Leaf switches in one super computing unit are connected to the Spine switches of one Spine switch group; and constructing a Core switch group based on the link information and port information of the Spine switches of each Spine switch group, wherein the Spine switches in one Spine switch group are connected to the Core switches of one Core switch group.
- 2. The method of claim 1, wherein grouping servers based on the link information and port information of the servers of the multi-rail InfiniBand network in the target period to obtain a plurality of server groups comprises: calling a CollectDeviceLinkPortInfo function to acquire information on all servers and the Leaf switches connected to them; and grouping the servers according to the information on all servers and the Leaf switches connected to them to obtain a plurality of server groups.
- 3. The method of claim 2, wherein constructing a corresponding super computing unit based on each server group and the Leaf switches connected to that server group comprises: acquiring port information of each server of a server group; determining the links between each server and the Leaf switches based on the port information of the server; determining the set of Leaf switches connected to each server based on the links between the server and the Leaf switches; deduplicating the union of the Leaf switch sets of all servers to obtain the Leaf switch set of the super computing unit; and randomly selecting a Leaf switch from the Leaf switch set of the super computing unit as the core switch of the super computing unit.
- 4. The method of claim 1, wherein constructing the Spine switch group based on the link information and port information of the Leaf switches of each super computing unit comprises: acquiring the ID of each Leaf switch in the Leaf switch set of each super computing unit; determining the links to the Spine switches connected to each Leaf switch according to the ID of that Leaf switch; determining the set of Spine switches to which each Leaf switch is connected based on those links; calling a FindLinksBetween function to obtain the unique Spine switch set for the Leaf switch set of the super computing unit, so that the connection relationship between each Leaf switch and the Spine switches is correct; and removing duplicate Spine switches from the Spine switch set by a deduplication operation to generate the Spine switch group.
- 5. The method of claim 1, wherein constructing the Core switch group based on the link information and port information of the Spine switches of each Spine switch group comprises: for the Spine switches in each Spine switch group, searching for the links to the Core switches connected to those Spine switches; calling a FindLinksBetween function to obtain the unique Core switch set for the Spine switch group, so that the connection relationship between each Spine switch and the Core switches is correct; and removing duplicate Core switches from the Core switch set by a deduplication operation to generate the Core switch group.
- 6. The method according to claim 1, wherein the method further comprises: displaying the plurality of super computing units, the Spine switch groups and the Core switch groups.
- 7. A topology grouping device for a multi-rail InfiniBand network, comprising: an obtaining unit, configured to obtain link information and port information of each device of the multi-rail InfiniBand network in a target period, wherein the devices comprise at least a server, a Leaf switch, a Spine switch and a Core switch; a processing unit, configured to group servers based on the link information and port information of the servers of the multi-rail InfiniBand network in the target period to obtain a plurality of server groups; a first grouping unit, configured to construct a corresponding super computing unit based on each server group and the Leaf switches connected to that server group, wherein the super computing unit comprises a plurality of servers and a plurality of Leaf switches, and the servers in one super computing unit are connected only to the Leaf switches in that super computing unit; a second grouping unit, configured to construct a Spine switch group based on the link information and port information of the Leaf switches of each super computing unit; and a third grouping unit, configured to construct a Core switch group based on the link information and port information of the Spine switches of each Spine switch group, wherein the Spine switches in one Spine switch group are connected to the Core switches of one Core switch group.
- 8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method according to any of claims 1-6 when executing the computer program.
- 9. A computer readable storage medium storing computer instructions which, when executed by a processor, implement the method of any one of claims 1-6.
- 10. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the method of any of claims 1-6.
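As an illustration of the Spine and Core grouping steps recited in claims 4 and 5, the following is a minimal sketch and not the claimed implementation. The claims name a FindLinksBetween function but give no signature, so `uplink_peers` below is a hypothetical stand-in. The link list of device-ID pairs and the `roles` dictionary are likewise assumed input shapes chosen for the example.

```python
def uplink_peers(links, roles, sources, peer_role):
    """Collect the de-duplicated peers of `sources` whose role is `peer_role`.

    links: iterable of (device_a, device_b) pairs (assumed shape).
    roles: dict mapping device ID to "leaf", "spine" or "core" (assumed shape).
    """
    source_set = set(sources)
    peers = {}  # dict keys give order-preserving deduplication
    for a, b in links:
        # A link may be recorded in either direction, so check both ends.
        for near, far in ((a, b), (b, a)):
            if near in source_set and roles.get(far) == peer_role:
                peers[far] = None
    return list(peers)


def build_upper_groups(links, roles, leaf_set):
    """Derive the Spine group from a Leaf set, then the Core group from it."""
    spine_group = uplink_peers(links, roles, leaf_set, "spine")
    core_group = uplink_peers(links, roles, spine_group, "core")
    return spine_group, core_group


# Hypothetical example topology: two Leaf, two Spine, one Core switch.
roles = {
    "leaf1": "leaf", "leaf2": "leaf",
    "spine1": "spine", "spine2": "spine",
    "core1": "core",
}
links = [
    ("leaf1", "spine1"), ("leaf1", "spine2"),
    ("leaf2", "spine1"), ("leaf2", "spine2"),
    ("spine1", "core1"), ("spine2", "core1"),
]
spines, cores = build_upper_groups(links, roles, ["leaf1", "leaf2"])
print(spines, cores)
```

Using dictionary keys for deduplication keeps the switches in the order they were first encountered, which gives a stable order when the groups are later displayed.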
Description
Topology grouping method and device for multi-rail InfiniBand network
Technical Field
The application relates to the technical field of computer networks, and in particular to a topology grouping method and device for a multi-rail InfiniBand network.
Background
Existing topology grouping methods for complex networks have the following technical problems when processing an InfiniBand (IB) network:
Fixed layout and limited display effect. Traditional topology visualization generally adopts a fixed-interval layout and displays devices such as servers and switches in layers to represent the hierarchical structure. However, in the multi-rail IB networks of intelligent computing centers and large-scale computing clusters, the number of nodes is huge and the connection relationships are complex. The connecting lines in the topology graph cross one another, so the display is cluttered and hard to read, which impairs the visual clarity and effective presentation of the topology.
Lack of a flexible hierarchical grouping mechanism. Because of their high-bandwidth and low-latency requirements, multi-rail IB networks employ complex full-mesh or parallel multi-rail topologies, which makes the connections between nodes and switches extremely dense. Existing hierarchical grouping methods struggle to handle the diversity and density of such structures: the display cannot reflect the actual logical levels of the network, the information easily becomes redundant, and network administrators cannot quickly understand the network or locate key devices.
Insufficient real-time monitoring and dynamic adaptability. In an intelligent computing center environment, real-time monitoring of device state and traffic is of great importance. However, most existing schemes are static displays that lack dynamic updating and response capability, so administrators cannot obtain key state information in time during device fault detection, alarm handling and performance optimization, which limits the visualization of cluster operation and the efficiency of fault diagnosis. The complexity of the multi-rail IB network makes performance bottlenecks and network faults harder to see. A traditional static display cannot directly provide real-time data such as traffic hot spots and device loads, which increases the difficulty of diagnosis and makes potential problems hard to discover in time, thereby affecting the reliability and performance of the intelligent computing center cluster.
Disclosure of Invention
In view of the above, the present application provides a topology grouping method and device for a multi-rail InfiniBand network, so as to solve the above technical problems.
In a first aspect, an embodiment of the present application provides a topology grouping method for a multi-rail InfiniBand network, including: acquiring link information and port information of each device of the multi-rail InfiniBand network in a target period, wherein the devices comprise at least a server, a Leaf switch, a Spine switch and a Core switch; grouping servers based on the link information and port information of the servers of the multi-rail InfiniBand network in the target period to obtain a plurality of server groups; constructing a corresponding super computing unit based on each server group and the Leaf switches connected to that server group, wherein the super computing unit comprises a plurality of servers and a plurality of Leaf switches, and the servers in one super computing unit are connected only to the Leaf switches in that super computing unit; constructing a Spine switch group based on the link information and port information of the Leaf switches of each super computing unit, wherein the Leaf switches in one super computing unit are connected to the Spine switches of one Spine switch group; and constructing a Core switch group based on the link information and port information of the Spine switches of each Spine switch group, wherein the Spine switches in one Spine switch group are connected to the Core switches of one Core switch group.
In one possible implementation, grouping servers based on the link information and port information of the servers of the multi-rail InfiniBand network in the target period to obtain a plurality of server groups includes: calling a CollectDeviceLinkPortInfo function to acquire information on all servers and the Leaf switches connected to them; and grouping the servers according to the information on all servers and the Leaf switches connected to them to obtain a plurality of server groups.
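The server grouping step can be illustrated with a minimal sketch. The text does not state the exact grouping rule, so the sketch assumes one plausible rule: servers that share a Leaf switch, directly or through other servers, fall into the same group. That rule yields groups whose servers connect only to Leaf switches inside the group. The function name `group_servers` and the input of (server, Leaf) ID pairs are hypothetical, and the input stands in for whatever CollectDeviceLinkPortInfo returns.

```python
from collections import defaultdict


def group_servers(server_leaf_links):
    """Group servers that share a Leaf switch, directly or transitively.

    server_leaf_links: iterable of (server_id, leaf_id) pairs (assumed shape).
    Returns a sorted list of (servers, leaves) tuples with sorted members.
    """
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for server, leaf in server_leaf_links:
        # Tag IDs with their role so a server and a switch never collide.
        root_s, root_l = find(("server", server)), find(("leaf", leaf))
        if root_s != root_l:
            parent[root_l] = root_s

    members = defaultdict(lambda: (set(), set()))
    for kind, name in list(parent):
        servers, leaves = members[find((kind, name))]
        (servers if kind == "server" else leaves).add(name)

    return sorted((sorted(s), sorted(l)) for s, l in members.values())


# Hypothetical example: two rails per server, two independent groups.
links = [
    ("s1", "leafA"), ("s1", "leafB"),
    ("s2", "leafA"), ("s2", "leafB"),
    ("s3", "leafC"), ("s4", "leafC"),
]
for servers, leaves in group_servers(links):
    print(servers, leaves)
```

Each returned tuple pairs a server group with the Leaf switch set it attaches to, which is the input the later super computing unit construction needs.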
In one possible implementation, constructing a corresponding super computing unit based on each server group and the Leaf switches connected to that server group comprises: acquiring port information of each server of a server group; determining a link between each server and a Leaf switch based on po