CN-122019113-A - Distributed attention computing system
Abstract
The invention provides a distributed attention computing system, and relates to the technical field of high-performance computing. The system comprises a partitioning unit and a scheduling unit. The partitioning unit is used for dividing an input sequence request into tasks according to a pre-established mapping relation between logical blocks and physical cores of a wafer-level chip, to obtain task distribution queues containing computation tasks; the scheduling unit is used for performing dynamic scheduling computation based on physical-core distance awareness on the first computation tasks and the second computation tasks respectively, to obtain a task scheduling sequence, and executing the task scheduling sequence. In the system provided by the embodiments of the invention, computation tasks are mapped to adjacent physical-core positions of the wafer-level chip, on which dynamic scheduling computation based on physical-core distance awareness is then performed, so that long-latency communication is avoided and the efficiency of distributed attention computation is improved.
Inventors
- OUYANG PENG
- LI XIUDONG
Assignees
- 北京清微智能科技股份有限公司 (Beijing Tsingmicro Intelligent Technology Co., Ltd.)
Dates
- Publication Date
- 20260512
- Application Date
- 20260413
Claims (12)
- 1. A distributed attention computing system, comprising a partitioning unit and a scheduling unit, wherein: the partitioning unit is configured to divide an input sequence request into tasks according to a pre-established mapping relation between logical blocks and physical cores of a wafer-level chip, to obtain task distribution queues containing computation tasks, wherein the task distribution queues comprise a first task distribution queue corresponding to a query vector and a second task distribution queue corresponding to a key value vector, first computation tasks in the first task distribution queue are mapped to physical-core horizontal coordinates with consecutive values, and second computation tasks in the second task distribution queue are mapped to physical-core vertical coordinates with consecutive values; and the scheduling unit is configured to perform dynamic scheduling computation based on physical-core distance awareness on the first computation tasks and the second computation tasks respectively, to obtain a task scheduling sequence, and to execute the task scheduling sequence.
- 2. The distributed attention computing system of claim 1, further comprising a construction unit: the construction unit is configured to map the logical blocks to the physical cores according to preset block parameters of the logical blocks and a preset mapping function, to obtain the mapping relation.
- 3. The distributed attention computing system according to claim 2, wherein the preset mapping function comprises a first mapping function based on a physical layout policy, and the construction unit is specifically configured to: divide the physical cores according to the block parameters to obtain a plurality of sub-physical blocks; allocate the logical block corresponding to each logical block identifier to a target sub-physical block according to the logical block identifiers, the number of sub-physical blocks, and the size of each sub-physical block, and map computation tasks onto each target sub-physical block, wherein the number of computation tasks mapped to a target sub-physical block equals the block parameter; arrange the members of the query vector group contiguously in the order of the logical block identifiers to obtain a first arrangement sequence, and arrange the members of the key value vector group contiguously and at equal intervals in the order of the logical block identifiers to obtain a second arrangement sequence, wherein the interval length is the first block parameter corresponding to the query vector; determine a first initial position in a first target sub-physical block for mapping the query vector group, and map each query vector group member of the first arrangement sequence in turn to successive columns of the same row as the first initial position; and determine a second initial position in a second target sub-physical block for mapping the key value vector group, and map each key value vector group member of the second arrangement sequence in turn to successive rows of the same column as the second initial position.
- 4. The distributed attention computing system according to claim 2, wherein the preset mapping function comprises a second mapping function based on a mathematical mapping expression, and the construction unit is specifically configured to: decompose the logical block identifier to obtain an index i within the query vector group, taking values in a first value range, and an index j within the key value vector group, taking values in a second value range, wherein the upper limit of the first value range is the first block parameter and the upper limit of the second value range is the second block parameter; and map the block parameters to a target physical position on the physical cores according to the index within the query vector group and the index within the key value vector group using a mapping expression whose parameters are: i, the index within the query vector group; j, the index within the key value vector group; a preset row step size parameter, a preset column cycle parameter, and the number of rows of the physical cores; and a preset column step size parameter, a preset row cycle parameter, and the number of columns of the physical cores.
- 5. The distributed attention computing system of any of claims 1 to 4, wherein the scheduling unit comprises a profit calculation subunit, an update subunit, and an acquisition subunit, wherein: the profit calculation subunit is configured to perform profit calculation based on physical-core distance awareness on the first task distribution queue and the second task distribution queue, to obtain a first profit calculation result corresponding to the query vector group and a second profit calculation result corresponding to the key value vector; the update subunit is configured to, if the first profit calculation result is determined to be smaller than the second profit calculation result, perform overlapped communication and computation for the key value vector set, and update the computation tasks in the first task distribution queue and the second task distribution queue according to the calculation result; and the acquisition subunit is configured to, if the computation tasks in the first task distribution queue and the second task distribution queue are determined to be empty, execute the computation tasks of the third parameter set until completion, to obtain the task scheduling sequence.
- 6. The distributed attention computing system of claim 5, wherein the profit calculation subunit is further configured to: traverse physical-core distances to perform candidate communication operations on the query vector group and the key value vector group, obtaining, for each traversal, the distance length value of the query vector group together with the corresponding number of unlocked first computation tasks, and the distance length value of the key value vector group together with the corresponding number of unlocked second computation tasks; acquire a first distance-corrected computation cost corresponding to the distance length value of the query vector group and a second distance-corrected computation cost corresponding to the distance length value of the key value vector group; and compute, for each traversal, the ratio of the number of unlocked first computation tasks to the first distance-corrected computation cost, taking the maximum ratio as the first profit calculation result, and the ratio of the number of unlocked second computation tasks to the second distance-corrected computation cost, taking the maximum ratio as the second profit calculation result.
- 7. The distributed attention computing system of claim 6, wherein the profit calculation subunit is further specifically configured to: perform the candidate communication operations on the query vector group and the key value vector group based on a pre-established first pre-configured path, wherein the first pre-configured path corresponds to a first communication primitive.
- 8. The distributed attention computing system of claim 5, wherein the distributed attention computing system is further configured to: in the course of executing the computation tasks of the output matrix group, perform communication operations on the output matrix group based on a pre-established second pre-configured path, wherein the second pre-configured path corresponds to a second communication primitive.
- 9. A distributed attention computing method, comprising: dividing an input sequence request into tasks according to a pre-established mapping relation between logical blocks and physical cores of a wafer-level chip, to obtain task distribution queues containing computation tasks, wherein the task distribution queues comprise a first task distribution queue corresponding to a query vector and a second task distribution queue corresponding to a key value vector, first computation tasks in the first task distribution queue are mapped to physical-core horizontal coordinates with consecutive values, and second computation tasks in the second task distribution queue are mapped to physical-core vertical coordinates with consecutive values; and performing dynamic scheduling computation based on physical-core distance awareness on the first computation tasks and the second computation tasks respectively, to obtain a task scheduling sequence, and executing the task scheduling sequence.
- 10. A wafer level chip comprising the distributed attention computing system of any one of claims 1 to 8.
- 11. A board comprising the wafer level chip of claim 10.
- 12. An electronic device comprising the board of claim 11.
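To make the mapping functions of claims 3 and 4 concrete, the following Python sketch illustrates one plausible realization. The patent gives the exact mapping expression only as a figure, so the modular form, the function names, and the parameter names below are illustrative assumptions, not the patented formula; the sketch only shows how a decomposition into in-group indices can place query tasks along consecutive columns and key value tasks along consecutive rows, as claims 1 and 3 require.

```python
# Illustrative sketch (NOT the patented expression): map a logical block
# identifier to a physical-core coordinate on an R x C wafer-level mesh.

def decompose(block_id: int, q_group_size: int) -> tuple[int, int]:
    """Split a logical block id into (i, j): i indexes within the query
    vector group, j within the key value vector group (claim 4's
    decomposition; the modulo/division split is an assumption)."""
    return block_id % q_group_size, block_id // q_group_size

def map_to_core(i, j, R, C, row_step=1, col_step=1, col_cycle=0, row_cycle=0):
    """Hypothetical modular mapping: the query index i advances along the
    columns of one row and the key value index j advances along the rows of
    one column, so first tasks land on cores with consecutive horizontal
    coordinates and second tasks on cores with consecutive vertical
    coordinates (claim 1). Step and cycle parameters mirror the parameter
    list of claim 4."""
    row = (j * row_step + i * col_cycle) % R
    col = (i * col_step + j * row_cycle) % C
    return row, col

# With unit steps and zero cycle parameters, query group members for a fixed
# j occupy consecutive columns of the same row, matching claim 3's layout.
coords = [map_to_core(i, 0, R=4, C=4) for i in range(4)]
```

Under these assumed defaults, `coords` is `[(0, 0), (0, 1), (0, 2), (0, 3)]`: four query-side tasks on one mesh row, so the intra-group communication of each task stays within one hop of its neighbor.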
Description
Distributed attention computing system
Technical Field
The invention relates to the technical field of high-performance computing, and in particular to a distributed attention computing system.
Background
The input sequence, which may in particular be the context window of a large language model consisting of tokens, becomes increasingly prominent as large language models evolve toward longer context windows (millions to tens of millions of tokens). To break through this bottleneck, wafer-level integration (System-on-Wafer) is an emerging direction: 850,000 to 900,000 AI cores can be integrated on a wafer-level chip, providing 40 GB of on-chip SRAM and 22 PB/s of on-chip bandwidth (7000 times higher than a GPU), with existing packaging technology enabling integration of millions of cores. Meanwhile, distributed attention algorithms continue to evolve: Ring-Attention overlaps computation and communication via ring communication, but its communication volume grows linearly with the number of GPUs; Mesh-Attention replaces the one-dimensional ring with two-dimensional tile partitioning, reducing the theoretical communication complexity from O(n) to O(√n) and cutting traffic by 79% in a 256-GPU configuration. Currently, mainstream LLM inference systems (e.g. vLLM, SGLang) and DNN compilers (e.g. Ladder) are optimized for GPU/TPU shared-memory architectures, while compilers targeting distributed on-chip memory, such as T10, assume that the on-chip crossbar provides constant delay.
Although the above technologies are advanced, there is a fundamental mismatch. When Ring-Attention is combined with a wafer-level chip, ring communication produces long-distance transfers spanning thousands of hops on the 2D mesh NoC (with delay differences of up to 1000 times); communication waiting time reaches 91.5% in a 128-GPU configuration, and performance falls below that of a single GPU. When Mesh-Attention is combined with a GPU cluster, logical tiles are mapped to physical GPUs without regard to topological position, so Q/KV communication within a tile can span multiple nodes; the speedup is only 2.9 times at 256 GPUs, leaving the theoretical O(√n) advantage unrealized. General DNN compilers (e.g. Ladder) assume uniform memory access and cannot handle the non-uniform delays of wafer-level chips; coarse data partitioning causes memory constraint violations, running 100 times slower than a single A100 on WSE-2. The root cause is that a topology abstraction gap lies between the logical 2D tiles of Mesh-Attention and the physical 2D mesh of the wafer-level chip: communication paths are not aligned with physical proximity, fine-grained communication cannot exploit the wafer-level ultra-high bandwidth (22 PB/s), and the mismatch between advanced hardware and software opens a performance gap.
Disclosure of Invention
In view of the problems in the prior art, embodiments of the present invention provide a distributed attention computing system that can at least partially solve these problems.
In one aspect, the invention provides a distributed attention computing system comprising a partitioning unit and a scheduling unit, wherein: the partitioning unit is configured to divide an input sequence request into tasks according to a pre-established mapping relation between logical blocks and physical cores of the wafer-level chip, to obtain task distribution queues containing computation tasks, wherein the task distribution queues comprise a first task distribution queue corresponding to a query vector and a second task distribution queue corresponding to a key value vector, first computation tasks in the first task distribution queue are mapped to physical-core horizontal coordinates with consecutive values, and second computation tasks in the second task distribution queue are mapped to physical-core vertical coordinates with consecutive values; the scheduling unit is configured to perform dynamic scheduling computation based on physical-core distance awareness on the first computation tasks and the second computation tasks respectively, to obtain a task scheduling sequence, and to execute the task scheduling sequence. The distributed attention computing system further comprises a construction unit, configured to map the logical blocks to the physical cores according to preset block parameters of the logical blocks and a preset mapping function, to obtain the mapping relation. The preset mapping function comprises a first mapping function based on a physical layout policy, and the construction unit is specifically configured to: divide the physical cores according to the block parameters to obtain a plurality of sub-physical blocks; allocate the logical blocks corresponding to the logical block identifiers to target sub-physical blocks according to the logical block identifiers, the number
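The distance-aware profit calculation of claims 5 and 6 can be sketched as follows. The cost model (a distance-corrected cost growing linearly with hop count) and all names are illustrative assumptions; the patent specifies only that each candidate communication is scored by the ratio of unlocked computation tasks to a distance-corrected cost, with the maximum ratio taken as the profit result.

```python
# Illustrative sketch of claim 6's profit rule: for each candidate
# communication operation, profit = (number of computation tasks the
# operation would unlock) / (distance-corrected communication cost);
# the maximum ratio over all traversed candidates is the profit result.

def distance_corrected_cost(hops: int, base_cost: float = 1.0) -> float:
    # Assumed cost model: cost grows linearly with NoC hop count.
    return base_cost * hops

def profit(candidates):
    """candidates: list of (hops, unlocked_task_count) pairs gathered by
    traversing candidate communication operations over physical-core
    distances. Returns the maximum ratio, i.e. the profit calculation
    result of claim 6."""
    return max(unlocked / distance_corrected_cost(hops)
               for hops, unlocked in candidates)

# Claim 5's rule: if the query-group profit is smaller than the key-value
# group profit, overlap communication and computation for the key value set
# first, then update both task distribution queues.
q_profit = profit([(1, 2), (3, 9)])   # candidates for the query vector group
kv_profit = profit([(2, 8)])          # candidates for the key value group
schedule_kv_first = q_profit < kv_profit
```

With these hypothetical candidates the query group's best ratio is 3.0 and the key value group's is 4.0, so the scheduler would overlap key value communication first, matching the update subunit's branch in claim 5.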