CN-121998000-A - Self-attention task dynamic segmentation method and system

CN121998000A

Abstract

The application discloses a self-attention task dynamic segmentation method and system, relating to the technical field of computer science. The method comprises: determining a self-attention model structure and a hardware topology, wherein the self-attention model structure is determined by input self-attention model parameters and the hardware topology is determined by input hardware parameters; determining the current inference stage of the self-attention model based on input self-attention task parameters; and determining the optimal task segmentation strategy corresponding to the current inference stage based on the current inference stage, the preconfigured task segmentation strategies, the self-attention model structure and the hardware topology. The application can flexibly determine the task segmentation strategy for each stage, flexibly select among card-granularity, die-granularity and single-core-granularity strategies, and reasonably segment and distribute the self-attention task to a multi-card multi-core system, thereby improving the computing efficiency and storage utilization of the system.

Inventors

  • GUO YANG
  • YANG JIANXUN
  • JIANG GUOYUE
  • OUYANG PENG

Assignees

  • 北京清微智能科技股份有限公司 (Beijing TsingMicro Intelligent Technology Co., Ltd.)

Dates

Publication Date
2026-05-08
Application Date
2025-12-24

Claims (10)

  1. A self-attention task dynamic segmentation method, comprising: determining a self-attention model structure and a hardware topology, wherein the self-attention model structure is determined by input self-attention model parameters and the hardware topology is determined by input hardware parameters; determining the current inference stage of the self-attention model based on input self-attention task parameters; and determining the optimal task segmentation strategy corresponding to the current inference stage based on the current inference stage, the preconfigured task segmentation strategies, the self-attention model structure and the hardware topology, so as to segment the self-attention task according to the optimal task segmentation strategy.
  2. The method of claim 1, wherein determining the optimal task segmentation strategy corresponding to the current inference stage based on the current inference stage, the preconfigured task segmentation strategies, the self-attention model structure and the hardware topology comprises: in response to the current inference stage being the prefill stage, determining the optimal task segmentation strategy corresponding to the prefill stage based on a preconfigured performance mapping between each task segmentation strategy and the prefill stage; and in response to the current inference stage being the decoding stage, determining the optimal task segmentation strategy corresponding to the decoding stage based on the preconfigured task segmentation strategies, the self-attention model structure and the hardware topology.
  3. The method of claim 2, wherein determining the optimal task segmentation strategy corresponding to the decoding stage based on the preconfigured task segmentation strategies, the self-attention model structure and the hardware topology comprises: for each preconfigured task segmentation strategy, determining the single-core task computation amount based on the hardware topology, the self-attention model structure and the task mapping position corresponding to that strategy; for each preconfigured task segmentation strategy, determining the to-be-computed scale of the self-attention model structure under that strategy based on the self-attention model structure and the single-core task computation amount; and determining the computation delay of each task segmentation strategy based on the to-be-computed scale, and determining the optimal task segmentation strategy corresponding to the decoding stage based on the computation delay of each strategy, so as to segment the self-attention task according to the optimal task segmentation strategy.
  4. The method of claim 3, wherein the self-attention model parameters include the total number of Q heads, the total number of KV heads, the batch size, the context length, the hidden-layer dimension and the head dimension; and the hardware parameters include the number of cards, the number of dies (core grains) on each card, the number of cores on each die, and the tensor parallelism degree.
  5. The method of claim 4, wherein determining the single-core task computation amount based on the hardware topology, the self-attention model structure and the task mapping position corresponding to each preconfigured task segmentation strategy comprises: in response to the task segmentation strategy being the first sub-strategy, wherein the task mapping position is a single core, determining that each core is assigned the full task of D1 heads, where D1 = ceil(Z/(K×N)); in response to the task segmentation strategy being the second sub-strategy, wherein the task mapping position is a single die and tensors are parallelized across the K cores on that die, determining that each core is assigned a task of D2 heads × (1/K), where D2 = ceil(Z/N), and the KV cache stored on each core is the data of E2 heads × (1/K), where E2 = ceil(W/N); and in response to the task segmentation strategy being the third sub-strategy, wherein the task mapping position is a whole card and tensors are parallelized across the (K×N) cores on that card, determining that each core is assigned a task of D3 heads × (1/(K×N)), where D3 = Z; wherein ceil() is the ceiling (round-up) function, Z is the number of Q heads assigned to each card, Z = X/TP, X is the total number of Q heads, TP is the tensor parallelism degree, W is the number of KV heads assigned to each card, W = Y/TP, and Y is the total number of KV heads.
  6. The method of claim 5, wherein the self-attention model structure comprises an RMSNorm layer, a Linear_In layer, a RoPE_Q layer, a RoPE_K layer, a Q·K^T layer, a Softmax layer, a ·V layer, a Linear_Out layer and a Res_Add layer, and wherein determining the to-be-computed scale of the self-attention model structure under each task segmentation strategy based on the self-attention model structure and the single-core task computation amount comprises: determining, based on the single-core task computation amount, the to-be-computed sub-scales of the RMSNorm, Linear_In, RoPE_Q, RoPE_K, Q·K^T, Softmax, ·V, Linear_Out and Res_Add layers respectively; and determining the to-be-computed scale of the self-attention model structure under each task segmentation strategy based on those to-be-computed sub-scales.
  7. The method of claim 3, wherein determining the optimal task segmentation strategy corresponding to the decoding stage based on the computation delay of each task segmentation strategy further comprises: determining whether each layer of the self-attention model structure requires communication; and for each layer requiring communication, determining the communication delay, and determining the optimal task segmentation strategy corresponding to the decoding stage based on the communication delay and the computation delay.
  8. A self-attention task dynamic segmentation system, comprising: a structure determination module configured to determine a self-attention model structure and a hardware topology, wherein the self-attention model structure is determined by input self-attention model parameters and the hardware topology is determined by input hardware parameters; a current inference stage determination module configured to determine the current inference stage of the self-attention model according to input self-attention task parameters; and an optimal task segmentation strategy determination module configured to determine the optimal task segmentation strategy corresponding to the current inference stage according to the current inference stage, the preconfigured task segmentation strategies, the self-attention model structure and the hardware topology, so as to segment the self-attention task according to the optimal task segmentation strategy.
  9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the self-attention task dynamic segmentation method according to any one of claims 1 to 7 when executing the computer program.
  10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, implements the self-attention task dynamic segmentation method according to any one of claims 1 to 7.
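As an illustrative aid (not part of the patent text), the per-core workloads defined in claim 5 can be sketched in Python. The function name, return format and argument names are assumptions; only the formulas D1 = ceil(Z/(K×N)), D2 = ceil(Z/N), E2 = ceil(W/N) and D3 = Z come from the claim:

```python
import math

def per_core_task(strategy, X, Y, TP, K, N):
    """Per-core workload under the three mapping sub-strategies of claim 5.

    X: total Q heads, Y: total KV heads, TP: tensor parallelism degree,
    K: cores per die, N: dies per card. Names and the returned dict
    format are illustrative, not taken from the patent.
    """
    Z = X // TP  # Q heads assigned to each card (Z = X/TP)
    W = Y // TP  # KV heads assigned to each card (W = Y/TP)
    if strategy == 1:
        # first sub-strategy: each core computes whole heads
        return {"heads": math.ceil(Z / (K * N)), "fraction": 1.0}
    if strategy == 2:
        # second sub-strategy: the K cores on one die split each head
        return {"heads": math.ceil(Z / N), "fraction": 1 / K,
                "kv_heads_per_core": math.ceil(W / N)}  # E2 = ceil(W/N)
    if strategy == 3:
        # third sub-strategy: all K*N cores on the card split every head
        return {"heads": Z, "fraction": 1 / (K * N)}
    raise ValueError("unknown strategy")
```

For example, with X = 32 Q heads, Y = 8 KV heads, TP = 2, K = 4 cores per die and N = 2 dies per card (so Z = 16 heads per card), the first sub-strategy assigns each core 2 whole heads, while the third assigns each core a 1/8 slice of all 16 heads on the card.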

Description

Self-attention task dynamic segmentation method and system

Technical Field

The invention relates to the technical field of computer science, in particular to task segmentation algorithms for multi-card multi-core systems, and more particularly to a self-attention task dynamic segmentation method and system.

Background

In recent years, large language models such as the GPT series, the LLaMA series, Qwen and DeepSeek have developed rapidly and exhibit strong capabilities in many areas of natural language processing, such as intelligent customer service, text generation and machine translation. The self-attention mechanism, as the core of the Transformer architecture, is key to how large language models understand and process text, as it can capture complex semantic dependencies in massive text data. However, the computation of self-attention tasks grows rapidly with model size and input text length. The computing capability of a single card can no longer meet this demand, so processing must be accelerated by the powerful computing power of multi-card multi-die systems. Advances in hardware technology have promoted the widespread use of such systems: GPU clusters, multi-core CPUs and the like provide a hardware basis for large-scale computation. However, resource allocation and task scheduling for these systems are challenging. How to reasonably segment self-attention tasks and distribute them to a multi-card multi-core system, fully exploit their parallel computing advantages and improve overall computing efficiency has become an urgent problem to be solved.
Disclosure of Invention

In order to solve at least one of the technical problems set forth in the Background section, the application provides a self-attention task dynamic segmentation method and system, which can determine an optimal task segmentation strategy and reasonably segment and distribute the self-attention task to a multi-card multi-core system, thereby improving the computing efficiency and storage utilization of the system and meeting the application requirements of large language models and the like.

In a first aspect, an embodiment of the present invention provides a self-attention task dynamic segmentation method, comprising: determining a self-attention model structure and a hardware topology, wherein the self-attention model structure is determined by input self-attention model parameters and the hardware topology is determined by input hardware parameters; determining the current inference stage of the self-attention model based on input self-attention task parameters; and determining the optimal task segmentation strategy corresponding to the current inference stage based on the current inference stage, the preconfigured task segmentation strategies, the self-attention model structure and the hardware topology, so as to segment the self-attention task according to the optimal task segmentation strategy.
In some optional implementations of this embodiment, determining the optimal task segmentation strategy corresponding to the current inference stage based on the current inference stage, the preconfigured task segmentation strategies, the self-attention model structure and the hardware topology includes: in response to the current inference stage being the prefill stage, determining the optimal task segmentation strategy corresponding to the prefill stage based on a preconfigured performance mapping between each task segmentation strategy and the prefill stage; and in response to the current inference stage being the decoding stage, determining the optimal task segmentation strategy corresponding to the decoding stage based on the preconfigured task segmentation strategies, the self-attention model structure and the hardware topology. In some optional implementations of this embodiment, determining the optimal task segmentation strategy corresponding to the decoding stage based on the preconfigured task segmentation strategies, the self-attention model structure and the hardware topology includes: for each preconfigured task segmentation strategy, determining the single-core task computation amount based on the hardware topology, the self-attention model structure and the task mapping position corresponding to that strategy; and, for each preconfigured task segmentation strategy, determining the to-be-computed scale of the self-attention model structure under each task segmentation strategy based on the self-attention model structure and the single-core task computation amount.
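The two-branch selection described above (a preconfigured performance lookup for the prefill stage, delay modeling for the decoding stage) can be sketched as follows. This is a minimal illustration under assumed interfaces, not the patent's implementation; the function name, argument names and scoring conventions are all assumptions:

```python
def choose_strategy(stage, strategies, perf_map=None, delay_fn=None):
    """Pick the optimal task segmentation strategy for the given stage.

    stage: "prefill" or "decode".
    perf_map: preconfigured {strategy: performance score} for the prefill
              stage (higher is assumed better), mirroring the performance
              mapping of claim 2.
    delay_fn: callable(strategy) -> estimated computation-plus-communication
              delay for the decoding stage (lower is better), mirroring
              claims 3 and 7. All names here are illustrative.
    """
    if stage == "prefill":
        # prefill: consult the preconfigured performance mapping
        return max(strategies, key=lambda s: perf_map[s])
    # decode: model the per-strategy delay and take the minimum
    return min(strategies, key=delay_fn)
```

For instance, if the preconfigured map scores the die-level strategy highest for prefill, it is chosen directly; for decode, whichever strategy's modeled delay (computation plus any required inter-core communication) is smallest wins.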