CN-121981181-A - MoE expert deployment system and method based on wafer-level chip

Abstract

The invention discloses a wafer-level chip-based MoE expert deployment system and method. The system comprises a statistics module and a clustering module. The statistics module collects an expert co-activation probability distribution, which reflects the cross-layer cooperative activation relationships among experts in the MoE, together with the communication requirements of the computing cores on the wafer-level chip. Based on the expert co-activation probability distribution and the communication requirements, the clustering module performs clustering with the objectives of minimizing traffic across physical areas and minimizing the difference in intra-group communication load of the expert groups carried by the physical areas, thereby obtaining an expert grouping scheme for the MoE and a corresponding physical area layout mapping scheme for the wafer-level chip. Because clustering follows the cross-layer co-occurrence patterns of the experts, experts that are frequently co-activated are confined to the same physical area, so that a large amount of global communication is converted into local exchange and the long-distance, high-latency cross-area communication overhead is significantly reduced. In addition, taking the communication requirements into account optimizes the load distribution and effectively avoids local hot spots.

Inventors

OUYANG PENG
LI XIUDONG
WANG BO

Assignees

北京清微智能科技股份有限公司

Dates

Publication Date: 20260505
Application Date: 20260402

Claims (15)

1. A wafer-level chip-based MoE expert deployment system, comprising: a statistics module configured to collect an expert co-activation probability distribution reflecting the cross-layer cooperative activation relationships among experts in the MoE, and the communication requirements of the computing cores on the wafer-level chip; and a clustering module configured to perform clustering, based on the expert co-activation probability distribution and the communication requirements, with the objectives of minimizing traffic across physical areas and minimizing the difference in intra-group communication load of the expert groups carried by the physical areas, so as to obtain an expert grouping scheme for the MoE and a corresponding physical area layout mapping scheme for the wafer-level chip.
2. The system of claim 1, wherein the expert grouping scheme and the physical area layout mapping scheme obtained by clustering satisfy constraint conditions comprising a first condition and a second condition, wherein: the first condition is that the total memory capacity of each physical area is greater than or equal to the total memory occupied by the expert parameters of the expert group carried by that physical area; and the second condition is that the number of physical areas defined by the physical area layout mapping scheme does not exceed the maximum number of routing paths of a computing core of the wafer-level chip.
3. The system of claim 2, wherein the clustering module is further configured to copy the parameters of selected experts in the expert group carried by a physical area to one or more physical areas adjacent to that physical area when the physical area layout mapping scheme does not satisfy the second condition, or when the difference in intra-group communication load of the expert group carried by the physical area is greater than a preset difference threshold.
4. The system of claim 1 or 2, wherein the clustering module is further configured to establish a mapping relationship between each expert group and a physical area of the wafer-level chip based on the expert grouping scheme, and to generate the physical area layout mapping scheme according to that mapping relationship.
5. The system of claim 1, further comprising a partitioning module, the partitioning module comprising a partitioning unit configured to partition the network-on-chip of the wafer-level chip into a plurality of physical areas according to the physical area layout mapping scheme, each physical area being configured to carry one expert group.
6. The system of claim 5, wherein the partitioning module further comprises: an allocation unit configured to allocate, based on the ingress router of each physical area, at least one routing path for the expert groups carried by the remaining physical areas.
7. The system of claim 5, wherein the partitioning module further comprises: a storage unit configured to calculate, based on the physical area layout mapping scheme, the forwarding paths between the boundary router and each computing core in a physical area, and to store the forwarding paths in a routing table of the boundary router.
8. The system of claim 5, wherein the partitioning module further comprises: a flow control unit configured to determine the buffer state of a downstream node from the credit signal returned by that downstream node in the physical area, and to adjust the packet injection rate of the upstream node in the physical area according to the buffer state; and a self-routing unit configured to select, according to a preset avoidance rule, an alternative forwarding node for a data packet if a downstream node on the preset routing path of the data packet indicates congestion while the data packet is being forwarded.
9. The system of claim 5, wherein the partitioning module further comprises: a prefetch unit configured to, before a token enters the physical area, prefetch into the edge cache of the physical area the parameters of the not-yet-activated experts that the token requires for computation in the next layer, according to the expert co-activation probability distribution between the experts activated in the current layer of the MoE and the not-yet-activated experts of the next layer.
10. The system of claim 1, wherein the statistics module comprises: a first statistics unit configured to count, within a preset time window, the frequency with which each pair of experts is activated successively by the same input token in different layers of the MoE, so as to obtain the expert co-activation probability distribution.
11. The system of claim 1, wherein the statistics module comprises: a second statistics unit configured to count, within a preset time window, the amount of data that each computing core of the wafer-level chip needs to send to other computing cores, so as to obtain the communication requirements.
12. A wafer-level chip-based MoE expert deployment method, characterized by comprising the following steps: collecting an expert co-activation probability distribution reflecting the cross-layer cooperative activation relationships among experts in the MoE, and the communication requirements of the computing cores on the wafer-level chip; and, based on the expert co-activation probability distribution and the communication requirements, performing clustering with the objectives of minimizing traffic across physical areas and minimizing the difference in intra-group communication load of the expert groups carried by the physical areas, so as to obtain an expert grouping scheme for the MoE and a corresponding physical area layout mapping scheme for the wafer-level chip.
13. A chip, characterized in that it integrates the wafer-level chip-based MoE expert deployment system according to any one of claims 1 to 11.
14. A computing board card comprising the chip of claim 13.
15. An electronic device comprising the computing board card of claim 14.
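
The statistics described in claims 10 and 11 can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names, the trace format (a list of `(layer, expert)` pairs per token) and the transfer format (`(src_core, dst_core, num_bytes)`) are assumptions made for illustration, and the sketch counts every cross-layer pair of experts hit by the same token, whereas the claim may intend only successive layers.

```python
from collections import Counter, defaultdict
from itertools import combinations


def coactivation_distribution(token_traces):
    """Estimate cross-layer expert co-activation probabilities.

    token_traces: iterable of per-token routing traces observed inside one
    time window; each trace is a list of (layer, expert) pairs (assumed format).
    Returns {((layer_a, expert_a), (layer_b, expert_b)): probability}.
    """
    pair_counts = Counter()
    for trace in token_traces:
        # Sorting gives every pair a canonical (lower layer first) orientation.
        for first, second in combinations(sorted(trace), 2):
            if first[0] != second[0]:  # keep only pairs from different layers
                pair_counts[(first, second)] += 1
    total = sum(pair_counts.values())
    if total == 0:
        return {}
    return {pair: count / total for pair, count in pair_counts.items()}


def communication_demand(transfers):
    """Sum the data each core must send to other cores in one time window.

    transfers: iterable of (src_core, dst_core, num_bytes) records (assumed format).
    Returns {src_core: total bytes sent to other cores}.
    """
    demand = defaultdict(int)
    for src, dst, num_bytes in transfers:
        if src != dst:  # traffic that stays on one core is not communication
            demand[src] += num_bytes
    return dict(demand)
```

Normalizing by the total pair count is one possible choice; conditional probabilities per source expert would serve the prefetching of claim 9 more directly.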

Description

MoE expert deployment system and method based on wafer-level chip

Technical Field

The invention relates to the technical field of artificial intelligence, and in particular to a wafer-level chip-based MoE expert deployment system and method.

Background

As large language models and mixture-of-experts models continue to grow in scale, they place unprecedented demands on the computing power, memory bandwidth, and interconnection efficiency of computing devices. Traditional multi-chip acceleration schemes (such as GPU clusters) are limited by the bandwidth and latency of the chip-to-chip interconnect. When they handle the dynamic, sparse All-to-All communication characteristic of a mixture-of-experts model (Mixture of Experts, MoE), they are prone to severe communication bottlenecks and load imbalance, which restrict overall inference efficiency and system scalability. Wafer-level integration technology treats an entire silicon wafer as a single chip and integrates hundreds of thousands to millions of computing cores together with ultra-high-bandwidth on-chip memory, providing a new hardware foundation for breaking through this bottleneck. Such chips use a large-scale mesh network-on-chip to interconnect the massive number of cores, a topology that has significant advantages in physical layout and manufacturing cost.
However, this introduces new architectural challenges. First, although the total amount of memory is huge, it is distributed locally across the computing cores, so a single core can hold only a small amount of data and cannot accommodate the complete model parameters. Second, because of core area and wiring complexity, the path concurrency of the network-on-chip routers is severely limited (for example, to fewer than 32 paths), which fundamentally conflicts with the global, irregular, and highly concurrent All-to-All communication pattern generated while the MoE runs. Therefore, how to improve expert scheduling efficiency and reduce communication overhead on a wafer-level chip with limited routing resources has become a key technical problem to be solved urgently.

This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.

Disclosure of Invention

The embodiment of the invention provides a wafer-level chip-based MoE expert deployment system, which is used to improve expert scheduling efficiency and reduce communication overhead on a wafer-level chip with limited routing resources.
The wafer-level chip-based MoE expert deployment system comprises: a statistics module configured to collect an expert co-activation probability distribution reflecting the cross-layer cooperative activation relationships among experts in the MoE, and the communication requirements of the computing cores on the wafer-level chip; and a clustering module configured to perform clustering, based on the expert co-activation probability distribution and the communication requirements, with the objectives of minimizing traffic across physical areas and minimizing the difference in intra-group communication load of the expert groups carried by the physical areas, so as to obtain an expert grouping scheme for the MoE and a corresponding physical area layout mapping scheme for the wafer-level chip.

Further, the expert grouping scheme and the physical area layout mapping scheme obtained by clustering satisfy constraint conditions comprising a first condition and a second condition. The first condition is that the total memory capacity of each physical area is greater than or equal to the total memory occupied by the expert parameters of the expert group carried by that physical area. The second condition is that the number of physical areas defined by the physical area layout mapping scheme does not exceed the maximum number of routing paths of a computing core of the wafer-level chip.

Further, the clustering module is further configured to copy the parameters of selected experts in the expert group carried by a physical area to one or more physical areas adjacent to that physical area when the physical area layout mapping scheme does not satisfy the second condition, or when the difference in intra-group communication load of the expert group carried by the physical area is greater than a preset difference threshold.
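
The source names the clustering objectives and the two constraint conditions but not a specific algorithm. The following Python sketch shows one possible reading, a greedy agglomerative grouping, and should not be taken as the patented procedure. All names (`cluster_experts`, `cross_region_traffic`, `alpha`), the uniform per-area `capacity`, and the load penalty used as a stand-in for the load-balance objective are assumptions made for illustration.

```python
from itertools import combinations


def cluster_experts(coact, mem, capacity, max_regions, load=None, alpha=0.0):
    """Greedily merge expert groups that co-activate most often.

    coact: {(expert_a, expert_b): weight}, each unordered pair stored once
    mem: {expert: parameter memory}
    capacity: memory capacity of one physical area (assumed uniform)
    max_regions: upper bound on the number of areas (second condition)
    load: optional {expert: communication load}; alpha penalizes heavy groups
    """
    load = load or {expert: 0.0 for expert in mem}
    groups = [frozenset([expert]) for expert in mem]

    def weight(g1, g2):
        # Total co-activation weight between two groups, in either orientation.
        return sum(coact.get((a, b), 0.0) + coact.get((b, a), 0.0)
                   for a in g1 for b in g2)

    while len(groups) > 1:
        best, best_score = None, None
        for g1, g2 in combinations(groups, 2):
            merged = g1 | g2
            if sum(mem[e] for e in merged) > capacity:
                continue  # first condition: the group must fit in one area
            score = weight(g1, g2) - alpha * sum(load[e] for e in merged)
            if best_score is None or score > best_score:
                best, best_score = (g1, g2), score
        if best is None:
            break  # no feasible merge remains
        if len(groups) <= max_regions and best_score <= 0:
            break  # area budget met and merging no longer saves traffic
        g1, g2 = best
        groups = [g for g in groups if g not in (g1, g2)] + [g1 | g2]

    if len(groups) > max_regions:
        raise ValueError("area limit cannot be met under the memory capacity")
    return groups


def cross_region_traffic(coact, groups):
    """Sum the co-activation weight of expert pairs split across areas."""
    area_of = {e: i for i, g in enumerate(groups) for e in g}
    return sum(w for (a, b), w in coact.items() if area_of[a] != area_of[b])
```

With four equally sized experts, `capacity=2` and `max_regions=2`, the two most strongly co-activated pairs end up in separate areas and only the weakest link crosses an area boundary. Where the sketch raises `ValueError`, the text above describes copying expert parameters to adjacent physical areas instead; that replication step is not modeled here.
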
Further, the clustering module is further configured to establish a mapping relationship between each expert group and a physical area of the wafer-level chip based on the expert grouping scheme, and to generate the physical area layout mapping scheme according to that mapping relationship. Further, the sys