
CN-122002166-A - Artificial intelligence server cluster network based on hybrid photoelectric interconnection and CXL-oF protocol, memory access method, device and electronic equipment

CN122002166A

Abstract

The present disclosure relates to the technical field of artificial intelligence server cluster networks, and in particular to an artificial intelligence server cluster network, a memory access method, a device and an electronic device based on hybrid photoelectric interconnection and the CXL-oF protocol. The network includes a network manager, a hybrid physical interconnection layer and a plurality of server cabinets; the server cabinets include an electrical switch and a plurality of server nodes; the hybrid physical interconnection layer includes an electrical switching plane and a reconfigurable optical switching plane; and the server nodes include a CXL-oF bridging device, which intelligently selects the low-latency electrical switching plane or the high-bandwidth reconfigurable optical switching plane for transmission according to description characteristics such as the type and size of a CXL data packet. The network manager may dynamically reconfigure the physical topology of the reconfigurable optical switching plane. The present disclosure enables low-latency, high-bandwidth, energy-efficient data interworking in a large-scale artificial intelligence server cluster network.

Inventors

  • CUI WENPENG
  • BU JUNXIANG
  • YANG BOKUAN
  • CAI YULU
  • SUN PENGFEI
  • ZHENG ZHE
  • YANG QINGCHEN
  • CHEN YUZHE
  • LIU YU
  • GAO YAN
  • LI MING
  • MEN HAO

Assignees

  • 北京智芯微电子科技有限公司 (Beijing Smartchip Microelectronics Technology Co., Ltd.)

Dates

Publication Date
20260508
Application Date
20260409

Claims (20)

  1. An artificial intelligence server cluster network based on a hybrid photoelectric interconnect and the CXL-oF protocol, characterized in that the artificial intelligence server cluster network comprises a network manager, a hybrid physical interconnect layer and a plurality of server cabinets, wherein the server cabinets comprise an electrical switch and a plurality of server nodes, the hybrid physical interconnect layer comprises an electrical switching plane and a reconfigurable optical switching plane, the electrical switching plane comprises an electrical backbone network, the reconfigurable optical switching plane comprises at least one centralized optical path switching device, the plurality of server nodes in the server cabinets are interconnected through corresponding electrical switches based on electrical interfaces provided by the server nodes, the electrical switches in the server cabinets are connected with the electrical backbone network in the electrical switching plane, the server nodes in the server cabinets are connected with the centralized optical path switching device in the reconfigurable optical switching plane through optical interfaces, the network manager is respectively connected with the electrical switching plane and the centralized optical path switching device in the reconfigurable optical switching plane, and the server nodes comprise a CXL-oF bridging device supporting the CXL-oF protocol, wherein the CXL-oF bridging device comprises a CXL module, a photoelectric joint scheduler and a Fabric encapsulation engine.
The CXL-oF bridging device is configured to acquire a memory access instruction generated based on an AI model training task, determine a target server node according to a request address in the memory access instruction, and, if the target server node is not a local server node, generate a CXL data packet corresponding to the memory access instruction through the CXL module and write the CXL data packet into an outbound traffic queue, wherein the photoelectric joint scheduler determines a routing interface of the CXL data packet according to a description characteristic of the CXL data packet in the outbound traffic queue, and then the Fabric encapsulation engine encapsulates the CXL data packet according to the request address of the memory access instruction, the routing interface and the description characteristic of the CXL data packet to generate a corresponding Fabric transmission frame, and routes the Fabric transmission frame to the routing interface to be transmitted to the target server node through a corresponding designated switching plane, wherein the routing interface comprises an electrical interface or an optical interface, the designated switching plane corresponding to the electrical interface is the electrical switching plane, and the designated switching plane corresponding to the optical interface is the reconfigurable optical switching plane; the network manager is configured to manage the reconfigurable optical switching plane and the electrical switching plane through an out-of-band management network, including, based on an execution stage of the AI model training task, controlling the reconfiguration of the physical optical topology structure of the reconfigurable optical switching plane by issuing an OCS reconfiguration instruction corresponding to the execution stage to the reconfigurable optical switching plane in advance.
  2. The artificial intelligence server cluster network of claim 1, wherein the determining the target server node from the request address in the memory access instruction comprises: acquiring an identifier of the target server node corresponding to the request address of the memory access instruction based on a preset global physical address mapping table, wherein each table entry in the global physical address mapping table is used for describing a correspondence between a global physical address interval and an identifier of a server node in the plurality of server cabinets, and the request address corresponds to a global physical address in the global physical address interval; and determining the target server node according to the identifier of the target server node.
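As an illustration only, the global-physical-address lookup of claim 2 above can be sketched as follows; the interval granularity, field names and the `GlobalAddressMap` class are assumptions, not part of the claim:

```python
# Sketch of the global physical address mapping table of claim 2.
# Each entry maps a global physical address interval to a server node ID.
from bisect import bisect_right

class GlobalAddressMap:
    def __init__(self, entries):
        # entries: list of (start, end, node_id); intervals are assumed
        # sorted and non-overlapping, with start inclusive, end exclusive.
        self.entries = sorted(entries)
        self.starts = [e[0] for e in self.entries]

    def lookup(self, request_address):
        """Return the target server node ID for a request address."""
        i = bisect_right(self.starts, request_address) - 1
        if i >= 0:
            start, end, node_id = self.entries[i]
            if start <= request_address < end:
                return node_id
        raise KeyError("address not in any mapped interval")

# Hypothetical two-node table for illustration.
amap = GlobalAddressMap([(0x0000, 0x4000, "node-A"),
                         (0x4000, 0x8000, "node-B")])
assert amap.lookup(0x5000) == "node-B"
```

In a hardware bridging device this table would more plausibly live in a TCAM or range-match unit; the binary search here only illustrates the interval-to-identifier mapping the claim describes.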
  3. The artificial intelligence server cluster network of claim 1, wherein the description characteristics of the CXL data packet include a type and/or a size of the CXL data packet, and the photoelectric joint scheduler determining a routing interface for the CXL data packet based on the description characteristics of the CXL data packet in the outbound traffic queue comprises: monitoring the CXL data packets in the outbound traffic queue through a classification engine based on flow characteristics, and determining the routing interface of the current CXL data packet according to its type and/or size, wherein: if the current CXL data packet is a first-type data packet, the routing interface of the current CXL data packet is an electrical interface of the local server node, the first-type data packets comprising CXL.io configuration packets, CXL.cache probes, or data packets with payloads smaller than a preset length; and if the current CXL data packet is a second-type data packet and the corresponding target server node currently has an available active optical path connection, judging whether the optical transmission queue is in a congestion state, and if not, the routing interface of the current CXL data packet is an optical interface of the local server node, the second-type data packets comprising large memory page data packets based on the CXL.mem protocol or migration data packets marked as an active type.
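The classification logic of claim 3 above can be sketched as follows; the threshold value, the packet-kind labels and the function name are illustrative assumptions rather than claim language:

```python
# Sketch of the traffic-classification routing decision in claim 3.
PRESET_LENGTH = 256  # bytes; stand-in for the claim's "preset length"

def select_routing_interface(pkt, has_active_optical_path, optical_queue_congested):
    """Return 'electrical' or 'optical' for one CXL data packet."""
    # First-type packets: CXL.io config packets, CXL.cache probes, or
    # small payloads -> low-latency electrical interface of the local node.
    first_type = (pkt["kind"] in ("cxl.io.config", "cxl.cache.probe")
                  or pkt["payload_len"] < PRESET_LENGTH)
    if first_type:
        return "electrical"
    # Second-type packets: CXL.mem large memory pages or active-type
    # migration data -> optical interface, if an active optical path
    # exists and the optical transmission queue is not congested.
    second_type = pkt["kind"] in ("cxl.mem.large_page", "active_migration")
    if second_type and has_active_optical_path and not optical_queue_congested:
        return "optical"
    # Otherwise fall back to the electrical plane (cf. claim 4).
    return "electrical"
```

The dictionary-based packet descriptor is a placeholder for whatever header fields the real classification engine would inspect.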
  4. The artificial intelligence server cluster network of claim 3, wherein the CXL-oF bridging device is further configured to: when the reconfigurable optical switching plane is being reconfigured or the optical transmission queue is in a congestion state, first slice the CXL data packets set to be transmitted through the reconfigurable optical switching plane or judged to be second-type data packets, then encapsulate the sliced CXL data packets to generate corresponding Fabric transmission frames, and route the corresponding Fabric transmission frames to an electrical interface of the local server node to be transmitted to the target server node through the electrical switching plane.
  5. The artificial intelligence server cluster network of claim 1, wherein the Fabric transmission frame comprises a custom switch frame header, payload data and frame check information, the custom switch frame header sequentially comprising a routing tag field, a routing hint field, a timestamp field and a traffic type field, and the Fabric encapsulation engine encapsulating the CXL data packet according to the request address of the memory access instruction and the routing interface and description characteristics of the CXL data packet to generate the corresponding Fabric transmission frame comprises: generating the routing tag field in the custom switch frame header according to the request address of the memory access instruction and a preset global physical address mapping table; generating the routing hint field in the custom switch frame header according to the routing interface of the CXL data packet and forwarding instruction information injected by the network manager; generating the timestamp field in the custom switch frame header based on the sending time of the Fabric transmission frame; generating the traffic type field in the custom switch frame header according to the description characteristics of the CXL data packet; and taking the CXL data packet as the payload data, taking a CRC check code as the frame check information, and sequentially concatenating the custom switch frame header, the payload data and the frame check information to generate the Fabric transmission frame.
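The frame layout of claim 5 above (routing tag, routing hint, timestamp and traffic type fields, followed by the CXL packet as payload and a CRC as frame check) can be sketched as follows; the field widths, byte order and choice of CRC-32 are illustrative assumptions, since the claim does not fix them:

```python
# Sketch of the Fabric transmission frame of claim 5:
#   custom switch frame header | payload (CXL packet) | frame check (CRC)
import struct
import time
import zlib

def build_fabric_frame(routing_tag, routing_hint, traffic_type, cxl_packet: bytes):
    # Header fields in the claimed order; 32-bit tag/hint, 64-bit
    # nanosecond timestamp and 8-bit traffic type are assumed widths.
    header = struct.pack(">IIQB", routing_tag, routing_hint,
                         time.time_ns(), traffic_type)
    # CRC-32 over header + payload stands in for the frame check info.
    crc = struct.pack(">I", zlib.crc32(header + cxl_packet))
    # Concatenate header, payload and frame check sequentially.
    return header + cxl_packet + crc
```

The receiver would verify the trailing CRC against the rest of the frame before unwrapping the CXL packet.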
  6. The artificial intelligence server cluster network of claim 1, wherein the controlling, based on the execution stage of the AI model training task, of the reconfiguration of the physical optical topology of the reconfigurable optical switching plane by issuing OCS reconfiguration instructions corresponding to the execution stage to the reconfigurable optical switching plane in advance comprises: acquiring a current AI model training task based on an AI model training framework, and, before the current AI model training task is executed, acquiring a plurality of continuous execution stages of the whole execution process of the current AI model training task and an optimal physical optical topology configuration corresponding to the reconfigurable optical switching plane in each execution stage according to computational graph information of the current AI model training task; and, while the current AI model training task is executed, generating in the current execution stage an OCS reconfiguration instruction for the next execution stage in advance according to the optimal physical optical topology configuration corresponding to the reconfigurable optical switching plane in the next execution stage, and issuing the OCS reconfiguration instruction for the next execution stage to the reconfigurable optical switching plane in the current execution stage or in the gap between the current execution stage and the next execution stage, so that the centralized optical path switching device in the reconfigurable optical switching plane reconfigures its physical optical topology structure according to that instruction before the next execution stage.
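The pipelined reconfiguration of claim 6 above amounts to issuing stage k+1's OCS instruction while stage k is still running, so the optical plane has already switched when stage k+1 begins. A minimal sketch, with the function and callback names assumed:

```python
# Sketch of the pipelined OCS reconfiguration of claim 6.
def run_pipelined(stages, issue_ocs_reconfig, run_stage):
    """stages: list of (stage_name, optimal_topology) pairs, precomputed
    from the computational graph before the training task starts."""
    for k, (name, _topology) in enumerate(stages):
        if k + 1 < len(stages):
            # Issue the next stage's reconfiguration instruction during
            # the current stage, hiding the OCS switching latency.
            issue_ocs_reconfig(stages[k + 1][1])
        run_stage(name)
```

The point of the overlap is that millisecond-scale OCS switching (see the Background section) is hidden behind compute time instead of stalling the training step.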
  7. The artificial intelligence server cluster network of claim 6, wherein the acquiring, according to the computational graph information of the current AI model training task, of the plurality of continuous execution stages of the whole execution process of the current AI model training task and the optimal physical optical topology configuration corresponding to the reconfigurable optical switching plane in each execution stage comprises: obtaining a corresponding communication operator set for each execution stage by analyzing the computational graph information of the current AI model training task, wherein the communication operator set comprises one or more communication operators; for any execution stage, predicting a full node-pair communication traffic matrix at the end of that execution stage according to the communication operator set of that execution stage by using a pre-trained traffic prediction model, wherein matrix element (i, j) in the full node-pair communication traffic matrix represents the expected amount of communication data from server node i to server node j participating in executing the corresponding communication operators in that execution stage; and calculating the optimal physical optical topology configuration corresponding to the reconfigurable optical switching plane in that execution stage according to the full node-pair communication traffic matrix at the end of that execution stage.
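Claim 7 leaves the topology-optimization step unspecified. As one hedged illustration only (the greedy max-weight matching below is an assumption, not the patent's algorithm), deriving optical circuits from the full node-pair traffic matrix could look like:

```python
# Illustrative sketch: derive an OCS circuit set from the full node-pair
# communication traffic matrix of claim 7 by greedily matching the
# heaviest node pairs, one circuit per node port.
def greedy_optical_topology(traffic):
    # traffic[i][j]: expected bytes from server node i to server node j
    n = len(traffic)
    pairs = sorted(((traffic[i][j], i, j)
                    for i in range(n) for j in range(n) if i != j),
                   reverse=True)
    used, circuits = set(), []
    for volume, i, j in pairs:
        # Each node contributes one optical port, so it may join
        # at most one circuit in this simplified model.
        if volume > 0 and i not in used and j not in used:
            circuits.append((i, j))
            used.update((i, j))
    return circuits
```

Real OCS planners typically solve richer matching or integer programs; this sketch only shows how the predicted traffic matrix feeds the topology decision.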
  8. The artificial intelligence server cluster network of claim 1, wherein the network manager is further configured to: acquire a parallel strategy of the AI model training task, and divide the server nodes in the plurality of server cabinets into a plurality of Dynamic Consistency Domains (DCDs) based on the parallel strategy of the AI model training task, wherein server nodes belonging to the same DCD are used for cooperatively processing specific data or model partitions of the AI model training task; the CXL-oF bridging device further comprises a DCD manager, and the CXL-oF bridging device is further configured to: maintain strong cache consistency for memory accesses within the same DCD through a hardware snooping protocol for server nodes in the same DCD, and maintain weak cache consistency among the plurality of DCDs through the DCD manager based on a DCD directory table, wherein directory entries in the DCD directory table comprise correspondences between memory page addresses and server nodes and the memory page states of the memory pages, the memory page states comprising shared, exclusive and invalid.
  9. The artificial intelligence server cluster network of claim 8, wherein the CXL-oF bridging device is further configured to: when the memory access instruction is judged to be a cross-DCD memory write operation instruction, trigger a cross-domain consistency maintenance operation for a target memory page, wherein the target memory page refers to the memory page corresponding to the memory address to be accessed by the memory access instruction, and the cross-domain consistency maintenance operation comprises: based on the DCD directory table, sending a point-to-point back-invalidation message only to the server nodes recorded in the DCD directory table as holding the target memory page in the shared state, instead of broadcasting the back-invalidation message across the whole network, and completing the write operation after invalidation confirmations from the corresponding nodes are obtained.
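The directory-based, point-to-point back-invalidation of claims 8 and 9 above can be sketched as follows; the data-structure layout and all names are assumptions made for illustration:

```python
# Sketch of the DCD directory table of claim 8 and the cross-domain
# write of claim 9: only sharers recorded in the directory receive an
# invalidation message, instead of a network-wide broadcast.
class DCDDirectory:
    def __init__(self):
        # memory page address -> (state, set of holder node IDs);
        # state is one of "shared", "exclusive", "invalid".
        self.entries = {}

    def cross_domain_write(self, page, writer, send_invalidate):
        state, holders = self.entries.get(page, ("invalid", set()))
        if state == "shared":
            # Point-to-point back-invalidation to recorded sharers only.
            for node in holders - {writer}:
                send_invalidate(node, page)
        # After invalidation confirmations (modeled as synchronous here),
        # the writer completes the write and holds the page exclusively.
        self.entries[page] = ("exclusive", {writer})
```

A real bridging device would wait asynchronously for invalidation acknowledgements before committing the write; the synchronous update here only shows the directory transitions.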
  10. The artificial intelligence server cluster network of claim 9, wherein the server node is configured to: in response to detecting a cross-DCD write conflict event, send a cache invalidation request to the server node recorded in the DCD directory table as currently holding the target memory page in the exclusive state, so as to cause it to cancel the exclusive state and update the DCD directory table, thereby resolving the write conflict and maintaining final consistency of the data.
  11. A memory access method in an artificial intelligence server cluster network based on a hybrid photoelectric interconnect and the CXL-oF protocol, the artificial intelligence server cluster network comprising a network manager, a hybrid physical interconnect layer and a plurality of server cabinets, the server cabinets comprising an electrical switch and a plurality of server nodes, the hybrid physical interconnect layer comprising an electrical switching plane and a reconfigurable optical switching plane, the electrical switching plane comprising an electrical backbone network, the reconfigurable optical switching plane comprising at least one centralized optical path switching device, the plurality of server nodes in the server cabinets being interconnected by corresponding electrical switches based on electrical interfaces provided by themselves, the electrical switches in the server cabinets being connected to the electrical backbone network in the electrical switching plane, the server nodes in the server cabinets being connected by optical interfaces to the centralized optical path switching device in the reconfigurable optical switching plane, the network manager being respectively connected to the electrical backbone network in the electrical switching plane and the centralized optical path switching device in the reconfigurable optical switching plane, the server nodes in the server cabinets comprising a CXL-oF bridging device, and the memory access method comprising: acquiring a memory access instruction generated based on an AI model training task, determining a target server node according to a request address in the memory access instruction, and, if the target server node is not a local server node, generating a CXL data packet corresponding to the memory access instruction through the CXL module and writing the CXL data packet into an outbound traffic queue; determining, by the photoelectric joint scheduler, a routing interface of the CXL data packet according to the description characteristic of the CXL data packet in the outbound traffic queue, then encapsulating, by the Fabric encapsulation engine, the CXL data packet according to the request address of the memory access instruction and the routing interface and description characteristic of the CXL data packet to generate a corresponding Fabric transmission frame, and routing the Fabric transmission frame to the routing interface to be transmitted to the target server node through a corresponding designated switching plane, wherein the routing interface comprises an electrical interface or an optical interface, the designated switching plane corresponding to the electrical interface is the electrical switching plane, and the designated switching plane corresponding to the optical interface is the reconfigurable optical switching plane; and the network manager is used for managing the reconfigurable optical switching plane and the electrical switching plane through an out-of-band management network, including, based on an execution stage of the AI model training task, controlling the reconfiguration of the physical optical topology structure of the reconfigurable optical switching plane by issuing an OCS reconfiguration instruction corresponding to the execution stage to the reconfigurable optical switching plane in advance.
  12. The memory access method according to claim 11, wherein the determining the target server node according to the request address in the memory access instruction includes: acquiring an identifier of the target server node corresponding to the request address of the memory access instruction based on a preset global physical address mapping table, wherein each table entry in the global physical address mapping table is used for describing a correspondence between a global physical address interval and an identifier of a server node in the plurality of server cabinets, and the request address corresponds to a global physical address in the global physical address interval; and determining the target server node according to the identifier of the target server node.
  13. The memory access method according to claim 11, wherein the description characteristics of the CXL data packet include a type and/or a size of the CXL data packet, and the photoelectric joint scheduler determining a routing interface for the CXL data packet based on the description characteristics of the CXL data packet in the outbound traffic queue includes: monitoring the CXL data packets in the outbound traffic queue through a classification engine based on flow characteristics, and determining the routing interface of the current CXL data packet according to its type and/or size, wherein: if the current CXL data packet is a first-type data packet, the routing interface of the current CXL data packet is an electrical interface of the local server node, the first-type data packets comprising CXL.io configuration packets, CXL.cache probes, or data packets with payloads smaller than a preset length; and if the current CXL data packet is a second-type data packet and the corresponding target server node currently has an available active optical path connection, judging whether the optical transmission queue is in a congestion state, and if not, the routing interface of the current CXL data packet is an optical interface of the local server node, the second-type data packets comprising large memory page data packets based on the CXL.mem protocol or migration data packets marked as an active type.
  14. The memory access method of claim 13, characterized by further comprising the following steps: when the reconfigurable optical switching plane is being reconfigured or the optical transmission queue is in a congestion state, first slicing the CXL data packets set to be transmitted through the reconfigurable optical switching plane or judged to be second-type data packets, then encapsulating the sliced CXL data packets to generate corresponding Fabric transmission frames, and routing the corresponding Fabric transmission frames to an electrical interface of the local server node so as to be transmitted to the target server node through the electrical switching plane.
  15. The memory access method according to claim 11, wherein the Fabric transmission frame includes a custom switch frame header, payload data and frame check information, the custom switch frame header includes, in order, a routing tag field, a routing hint field, a timestamp field and a traffic type field, and the Fabric encapsulation engine encapsulating the CXL data packet according to the request address of the memory access instruction and the routing interface and description characteristics of the CXL data packet to generate the corresponding Fabric transmission frame includes: generating the routing tag field in the custom switch frame header according to the request address of the memory access instruction and a preset global physical address mapping table; generating the routing hint field in the custom switch frame header according to the routing interface of the CXL data packet and forwarding instruction information injected by the network manager; generating the timestamp field in the custom switch frame header based on the sending time of the Fabric transmission frame; generating the traffic type field in the custom switch frame header according to the description characteristics of the CXL data packet; and taking the CXL data packet as the payload data, taking a CRC check code as the frame check information, and sequentially concatenating the custom switch frame header, the payload data and the frame check information to generate the Fabric transmission frame.
  16. The memory access method according to claim 11, wherein the controlling of the reconfiguration of the physical optical topology of the reconfigurable optical switching plane by issuing the OCS reconfiguration instruction corresponding to the execution stage to the reconfigurable optical switching plane in advance includes: acquiring a current AI model training task based on an AI model training framework, and, before the current AI model training task is executed, acquiring a plurality of continuous execution stages of the whole execution process of the current AI model training task and an optimal physical optical topology configuration corresponding to the reconfigurable optical switching plane in each execution stage according to computational graph information of the current AI model training task; and, while the current AI model training task is executed, generating in the current execution stage an OCS reconfiguration instruction for the next execution stage in advance according to the optimal physical optical topology configuration corresponding to the reconfigurable optical switching plane in the next execution stage, and issuing the OCS reconfiguration instruction for the next execution stage to the reconfigurable optical switching plane in the current execution stage or in the gap between the current execution stage and the next execution stage, so that the centralized optical path switching device in the reconfigurable optical switching plane reconfigures its physical optical topology structure according to that instruction before the next execution stage.
  17. The memory access method according to claim 16, wherein the acquiring, according to the computational graph information of the current AI model training task, of the plurality of continuous execution stages of the whole execution process of the current AI model training task and the optimal physical optical topology configuration corresponding to the reconfigurable optical switching plane in each execution stage includes: obtaining a corresponding communication operator set for each execution stage by analyzing the computational graph information of the current AI model training task, wherein the communication operator set comprises one or more communication operators; for any execution stage, predicting a full node-pair communication traffic matrix at the end of that execution stage according to the communication operator set of that execution stage by using a pre-trained traffic prediction model, wherein matrix element (i, j) in the full node-pair communication traffic matrix represents the expected amount of communication data from server node i to server node j participating in executing the corresponding communication operators in that execution stage; and calculating the optimal physical optical topology configuration corresponding to the reconfigurable optical switching plane in that execution stage according to the full node-pair communication traffic matrix at the end of that execution stage.
  18. The memory access method of claim 11, characterized by further comprising the following steps: acquiring a parallel strategy of the AI model training task through the network manager, and dividing the server nodes in the plurality of server cabinets into a plurality of Dynamic Consistency Domains (DCDs) based on the parallel strategy of the AI model training task, wherein server nodes belonging to the same DCD are used for cooperatively processing specific data or model partitions of the AI model training task; the CXL-oF bridging device further comprises a DCD manager, and the memory access method further comprises: maintaining strong cache consistency for memory accesses within the same DCD through a hardware snooping protocol for server nodes in the same DCD, and maintaining weak cache consistency among the plurality of DCDs through the DCD manager based on a DCD directory table, wherein directory entries in the DCD directory table comprise correspondences between memory page addresses and server nodes and the memory page states of the memory pages, the memory page states comprising shared, exclusive and invalid.
  19. The memory access method of claim 18, characterized by further comprising the following steps: when the memory access instruction is judged to be a cross-DCD memory write operation instruction, triggering a cross-domain consistency maintenance operation for a target memory page, wherein the target memory page refers to the memory page corresponding to the memory address to be accessed by the memory access instruction, and the cross-domain consistency maintenance operation comprises: based on the DCD directory table, sending a point-to-point back-invalidation message only to the server nodes recorded in the DCD directory table as holding the target memory page in the shared state, instead of broadcasting the back-invalidation message across the whole network, and completing the write operation after invalidation confirmations from the corresponding nodes are obtained.
  20. The memory access method of claim 19, characterized by further comprising the following steps: in response to detecting a cross-DCD write conflict event, sending a cache invalidation request to the server node recorded in the DCD directory table as currently holding the target memory page in the exclusive state, so as to cause it to cancel the exclusive state and update the DCD directory table, thereby resolving the write conflict and maintaining final consistency of the data.

Description

Artificial intelligence server cluster network based on hybrid photoelectric interconnection and CXL-oF protocol, memory access method, device and electronic equipment

Technical Field

The disclosure relates to the technical field of artificial intelligence server cluster networks, in particular to an artificial intelligence server cluster network based on hybrid photoelectric interconnection and the CXL-oF protocol, a memory access method, a memory access device and electronic equipment.

Background

With the rapid development of deep learning technology, the scale of deep learning models is growing at an unprecedented speed toward the trillion-parameter level, and this trend poses a serious challenge to computing infrastructure, confronting traditional architecture designs with an unprecedented crisis. Against the background of the gradual slowing of Moore's law, the performance bottleneck of a large-scale AI server cluster is no longer limited to the computational power of a single node, but has rapidly shifted to the data-handling efficiency between nodes and the cross-node memory access latency, which have become the key factors restricting further improvement of AI server cluster performance. Current mainstream AI server cluster architectures rely mainly on layered electrical interconnection technology: for example, NVLink or PCIe is used for connections within servers, and communication between servers is realized through InfiniBand or RoCE Ethernet. Such an architecture maintains a certain efficiency in small-scale clusters (e.g., hundreds of accelerator cards), but runs into hard physical hurdles and energy-efficiency dilemmas when the cluster scale extends to thousands or even tens of thousands of accelerator nodes.
Firstly, as high-speed signal rates continue to increase, the physical attenuation of electrical signals at high frequency grows exponentially, and the complex equalization techniques and retimers introduced to maintain signal integrity cause extremely high SerDes interface power consumption, which severely restricts the energy efficiency and compute density of large-scale clusters. Secondly, static network topologies represented by the Fat-Tree are seriously mismatched with the highly regular collective communication patterns (such as All-Reduce) of artificial intelligence training, so that data packets must shuttle back and forth among multi-level switches, increasing network hop counts and unpredictable long-tail latency and leaving expensive computing resources idle while waiting for data. Finally, constructing a fully connected ten-thousand-card cluster with a traditional three-layer Fat-Tree architecture requires tens of thousands of high-speed copper or optical cables and thousands of switches; such wiring not only brings huge physical deployment cost but also makes cable management and fault troubleshooting extremely complex, and a failure, loosening or poor contact of any single cable can interrupt or severely degrade the performance of the whole training task, so system reliability drops sharply as the scale expands. Meanwhile, the CXL (Compute Express Link) protocol brings revolutionary memory semantics to heterogeneous computing, successfully breaking down memory islands by maintaining cache consistency at the hardware level. However, the CXL protocol was designed primarily for short-range interconnection within racks, and it is difficult for it to support transmission across data centers.
More importantly, its snoop-based consistency mechanism is very prone to "broadcast storms" at large node scales, and its underlying tree-topology assumption is in structural conflict with the flattened, mesh communication patterns required by artificial intelligence training. Facing the scalability dilemma of electrical interconnects and the distance and topology limitations of the CXL protocol, the industry is beginning to explore more radical physical-layer solutions. Optical circuit switching (OCS) technology uses MEMS (Micro-Electro-Mechanical Systems) micromirror arrays or liquid crystal technology to switch optical paths directly in the optical domain without optical-electrical-optical (O-E-O) conversion, and is therefore regarded as an ultimate physical solution to the "post-Moore era" interconnection bottleneck, able to provide transparent, high-bandwidth, low-energy transmission. Although many large companies have attempted to deploy OCS technology in data centers, applying it to AI server clusters still faces significant challenges, mainly in: 1. Reconfiguration delay: the switching time of MEMS OCS is typically on the order of milliseconds, while liquid-crystal OCS is faster but still on the order of microseconds. In contrast, packet forwarding in electrical switching is on the nanosecond scale, and for communication modes that frequently vary duri