
CN-121979560-A - Multi-head attention task accelerator mapping configuration generation method and system


Abstract

The invention provides a method and system for generating a mapping configuration for a multi-head attention task accelerator. The method comprises: obtaining the operator-shared parameters, head dimension, and key sequence length of the multi-head attention task; obtaining the total shared-memory capacity and total register capacity of the spatial accelerator; constructing a shared-memory overhead and a register memory overhead based on the operator-shared parameters, the head dimension, and the key sequence length; constructing a shared-memory overhead constraint based on the shared-memory overhead and the total shared-memory capacity, and a register memory overhead constraint based on the register memory overhead and the total register capacity; setting the blocking factor corresponding to each task parameter in the operator-shared parameters to the minimum blocking factor and executing a constraint-range step to obtain the blocking-factor range corresponding to each task parameter; and obtaining an initial mapping space from the blocking-factor ranges, searching under a preset reinforcement learning algorithm based on the initial mapping space, and generating the optimal mapping configuration. The invention improves the efficiency of generating the optimal mapping configuration.
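Read as an algorithm, the overhead model the abstract describes (and claims 2-3 refine) can be sketched as follows. All names here are illustrative assumptions rather than symbols from the patent: b, h, q, k, d denote the blocking (tile) sizes chosen for batch size, head number, query sequence length, key sequence length, and head dimension, and overheads are counted in elements rather than bytes.

```python
def shared_memory_overhead(b, h, q, k, d):
    # Operator 1 (Q.K^T): input tile of b*h*q*d plus weight tile of b*h*k*d.
    op1 = b * h * q * d + b * h * k * d
    # Operator 2 (scores.V): input tile of b*h*q*k plus weight tile of b*h*k*d.
    op2 = b * h * q * k + b * h * k * d
    return op1 + op2

def register_memory_overhead(b, h, q, k, d):
    # Output accumulators held in registers: operator 1 yields a b*h*q*k
    # score tile, operator 2 yields a b*h*q*d output tile.
    return b * h * q * k + b * h * q * d
```

A candidate blocking is then feasible only if `shared_memory_overhead(...)` stays within the accelerator's total shared-memory capacity and `register_memory_overhead(...)` within its total register capacity — the two constraints the method builds.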

Inventors

  • WANG FUYU
  • SHEN MINGHUA
  • QIN AOXIANG
  • LIAO JIAYUAN

Assignees

  • Sun Yat-sen University (中山大学)

Dates

Publication Date
2026-05-05
Application Date
2025-12-26

Claims (10)

  1. A multi-head attention task accelerator mapping configuration generation method, characterized by comprising the following steps: acquiring the operator-shared parameters, head dimension, and key sequence length of a multi-head attention task to be accelerated; acquiring the total shared-memory capacity and total register capacity of a target spatial accelerator; constructing a shared-memory overhead and a register memory overhead based on the operator-shared parameters, the head dimension, and the key sequence length; constructing a shared-memory overhead constraint based on the shared-memory overhead and the total shared-memory capacity; constructing a register memory overhead constraint based on the register memory overhead and the total register capacity; setting the blocking factor corresponding to each task parameter in the operator-shared parameters to a preset minimum blocking factor, and executing a constraint-range step on each task parameter to obtain the blocking-factor range corresponding to that task parameter; and obtaining an initial mapping space from the blocking-factor ranges, searching under a preset reinforcement learning algorithm based on the initial mapping space, and generating an optimal mapping configuration for the target spatial accelerator; wherein the constraint-range step comprises: substituting the blocking factors corresponding to the task parameters other than the current task parameter in the operator-shared parameters into the shared-memory overhead constraint and the register memory overhead constraint and solving, so as to obtain a maximum blocking size; and generating the blocking-factor range corresponding to the current task parameter from the maximum blocking size and the current task parameter.
  2. The multi-head attention task accelerator mapping configuration generation method according to claim 1, wherein constructing the shared-memory overhead and the register memory overhead based on the operator-shared parameters, the head dimension, and the key sequence length comprises: constructing a first-operator shared-memory overhead and a second-operator shared-memory overhead based on the head dimension, the key sequence length, and the batch size, head number, and query sequence length in the operator-shared parameters; constructing a first-operator register memory overhead based on the key sequence length and the batch size, head number, and query sequence length in the operator-shared parameters; constructing a second-operator register memory overhead based on the head dimension and the batch size, head number, and query sequence length in the operator-shared parameters; constructing the shared-memory overhead from the first-operator shared-memory overhead and the second-operator shared-memory overhead; and constructing the register memory overhead from the first-operator register memory overhead and the second-operator register memory overhead.
  3. The multi-head attention task accelerator mapping configuration generation method according to claim 2, wherein constructing the first-operator shared-memory overhead and the second-operator shared-memory overhead based on the head dimension, the key sequence length, and the batch size, head number, and query sequence length in the operator-shared parameters comprises: constructing a first-operator input-tensor memory overhead based on the head dimension and the batch size, head number, and query sequence length in the operator-shared parameters; constructing a first-operator weight-tensor memory overhead based on the key sequence length, the head dimension, and the batch size and head number in the operator-shared parameters; constructing a second-operator input-tensor memory overhead based on the key sequence length and the batch size, head number, and query sequence length in the operator-shared parameters; constructing a second-operator weight-tensor memory overhead based on the key sequence length, the head dimension, and the batch size and head number in the operator-shared parameters; constructing the first-operator shared-memory overhead from the first-operator input-tensor memory overhead and the first-operator weight-tensor memory overhead; and constructing the second-operator shared-memory overhead from the second-operator input-tensor memory overhead and the second-operator weight-tensor memory overhead.
  4. The multi-head attention task accelerator mapping configuration generation method according to claim 1, wherein, in the obtaining an initial mapping space from the blocking-factor ranges, searching under a preset reinforcement learning algorithm based on the initial mapping space, and generating an optimal mapping configuration for the target spatial accelerator, the initial mapping space includes the blocking-factor ranges and a preset parallel-factor range, and obtaining the parallel-factor range comprises: acquiring the number of computing cores of the target spatial accelerator, and constructing a parallel-computing overhead constraint based on the number of computing cores; acquiring the parallel factors corresponding to the operator-shared parameters, the head dimension, and the key sequence length; and obtaining the parallel-factor range based on the parallel factors and the parallel-computing overhead constraint.
  5. The multi-head attention task accelerator mapping configuration generation method according to claim 4, wherein the searching under a preset reinforcement learning algorithm based on the initial mapping space and obtaining the optimal mapping configuration of the target spatial accelerator comprises: executing a factor-decision step based on each target-factor range in the initial mapping space to obtain the optimal mapping configuration of the target spatial accelerator; wherein the factor-decision step comprises: searching under the preset reinforcement learning algorithm over the current target-factor range to obtain a determined value for the current target-factor range, and marking the current target-factor range as decided; obtaining, under a preset factor-range update algorithm and based on the determined value, a second factor range for each target-factor range in the initial mapping space not yet marked as decided; updating the initial mapping space according to the determined values of the target-factor ranges marked as decided and the second factor ranges; and, if the initial mapping space still contains a target-factor range not marked as decided, re-executing the factor-decision step on the target-factor ranges not marked as decided, otherwise stopping the factor-decision step and generating the optimal mapping configuration of the target spatial accelerator from the determined value of each target-factor range in the initial mapping space.
  6. The multi-head attention task accelerator mapping configuration generation method according to claim 5, wherein obtaining, under a preset factor-range update algorithm and based on the determined value, a second factor range for each target-factor range not marked as decided in the initial mapping space comprises: if the current target factor is a blocking factor, obtaining the second factor range for each undecided target-factor range under the shared-memory overhead constraint and the register memory overhead constraint based on the determined value; otherwise, obtaining the second factor range for each undecided target-factor range under the parallel-computing overhead constraint based on the determined value.
  7. The multi-head attention task accelerator mapping configuration generation method according to claim 1, wherein, after the obtaining an initial mapping space from the blocking-factor ranges, searching under a preset reinforcement learning algorithm based on the initial mapping space, and generating an optimal mapping configuration for the target spatial accelerator, the method further comprises: acquiring the computation latency, memory-access latency, and memory-access power consumption of the target spatial accelerator after it executes the multi-head attention task under the optimal mapping configuration; generating a reward value from the computation latency, memory-access latency, and memory-access power consumption; and optimizing the reinforcement learning algorithm under a preset optimization algorithm based on the reward value.
  8. The multi-head attention task accelerator mapping configuration generation method according to claim 1, wherein substituting the blocking factors corresponding to the task parameters other than the current task parameter in the operator-shared parameters into the shared-memory overhead constraint and the register memory overhead constraint and solving, so as to obtain the maximum blocking size, comprises: substituting said blocking factors into the shared-memory overhead constraint and solving to obtain a first maximum blocking size; substituting said blocking factors into the register memory overhead constraint and solving to obtain a second maximum blocking size; and, if the first maximum blocking size is smaller than the second maximum blocking size, taking the first maximum blocking size as the maximum blocking size, otherwise taking the second maximum blocking size as the maximum blocking size.
  9. The multi-head attention task accelerator mapping configuration generation method according to claim 8, wherein generating the blocking-factor range corresponding to the current task parameter from the maximum blocking size and the current task parameter comprises: calculating the ratio of the current task parameter to the maximum blocking size to obtain the minimum blocking factor; calculating the ratio of the current task parameter to a preset minimum blocking size to obtain the maximum blocking factor; and generating the blocking-factor range corresponding to the current task parameter from the minimum blocking factor and the maximum blocking factor.
  10. A multi-head attention task accelerator mapping configuration generation system for implementing the multi-head attention task accelerator mapping configuration generation method according to any one of claims 1 to 9, comprising: a task-parameter acquisition module for acquiring the operator-shared parameters, head dimension, and key sequence length of the multi-head attention task to be accelerated; an accelerator-parameter acquisition module for acquiring the total shared-memory capacity and total register capacity of the target spatial accelerator; a memory-overhead construction module for constructing the shared-memory overhead and the register memory overhead based on the operator-shared parameters, the head dimension, and the key sequence length; a shared-memory overhead constraint construction module for constructing the shared-memory overhead constraint based on the shared-memory overhead and the total shared-memory capacity; a register overhead constraint construction module for constructing the register memory overhead constraint based on the register memory overhead and the total register capacity; a blocking-factor range obtaining module for setting the blocking factor corresponding to each task parameter in the operator-shared parameters to a preset minimum blocking factor and executing a constraint-range step on each task parameter to obtain the blocking-factor range corresponding to that task parameter; and an optimal-mapping-configuration generation module for obtaining an initial mapping space from the blocking-factor ranges, searching under a preset reinforcement learning algorithm based on the initial mapping space, and generating the optimal mapping configuration of the target spatial accelerator; wherein the constraint-range step comprises: substituting the blocking factors corresponding to the task parameters other than the current task parameter in the operator-shared parameters into the shared-memory overhead constraint and the register memory overhead constraint and solving, so as to obtain a maximum blocking size; and generating the blocking-factor range corresponding to the current task parameter from the maximum blocking size and the current task parameter.
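The constraint-range step of claims 1, 8, and 9 can be sketched in executable form as follows. The overhead models, capacities, and parameter names below are illustrative assumptions; the patent substitutes the fixed blocking factors into the constraints and solves, whereas this sketch simply scans candidate tile sizes.

```python
import math

def blocking_factor_range(current, task_params, min_tile, smem_cap, reg_cap,
                          smem_fn, reg_fn):
    """Freeze every other task parameter's tile at the preset minimum,
    find the largest tile for `current` that satisfies both capacity
    constraints (claim 8), then convert tile sizes into a blocking-factor
    range via the ratios of claim 9."""
    tiles = {p: min_tile for p in task_params}
    t_max = min_tile
    for t in range(min_tile, task_params[current] + 1):
        tiles[current] = t
        if smem_fn(**tiles) <= smem_cap and reg_fn(**tiles) <= reg_cap:
            t_max = t          # still fits in shared memory and registers
        else:
            break              # first violation: previous t was the maximum
    full = task_params[current]
    factor_min = math.ceil(full / t_max)     # fewest blocks: largest tiles
    factor_max = math.ceil(full / min_tile)  # most blocks: smallest tiles
    return factor_min, factor_max
```

For example, with `task_params={'q': 64, 'k': 64}`, `min_tile=2`, a toy shared-memory model `lambda q, k: q*k + k` with capacity 40, and a toy register model `lambda q, k: q` with capacity 64, the largest feasible tile for `'q'` is 19, giving the factor range (4, 32).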

Description

Multi-head attention task accelerator mapping configuration generation method and system

Technical Field

The invention belongs to the technical field of electronic information, and in particular relates to a multi-head attention task accelerator mapping configuration generation method and system.

Background

In the traditional scheme, a multi-head attention task is deployed on a spatial accelerator with a multi-level memory hierarchy using an operator-by-operator mapping strategy. Under this strategy, each operator in the multi-head attention task reads its input tensors from the accelerator's off-chip memory, transfers them to on-chip memory for computation, and writes the result back to off-chip memory. Because on-chip memory capacity is limited, intermediate results between operators must be written back to off-chip memory and frequently re-read for subsequent computation, causing a large amount of unnecessary memory-access overhead. To address this, the prior art uses a blocking mechanism to decompose tensors into small blocks that fit in on-chip memory, so that intermediate variables can be kept on-chip and directly reused by downstream operators, effectively reducing off-chip memory access. However, in the prior art the blocking-related mapping configuration is found by reinforcement-learning search over an unpruned and very large mapping space; a great deal of time is therefore needed to screen suitable mapping configurations, the optimal mapping configuration is difficult to find, and search efficiency is greatly reduced.
Disclosure of Invention

The invention aims to provide a method and system for generating a mapping configuration for a multi-head attention task accelerator, so as to solve the above technical problems and improve the efficiency of generating the optimal mapping configuration. To this end, the invention provides a multi-head attention task accelerator mapping configuration generation method comprising the following steps: acquiring the operator-shared parameters, head dimension, and key sequence length of a multi-head attention task to be accelerated; acquiring the total shared-memory capacity and total register capacity of a target spatial accelerator; constructing a shared-memory overhead and a register memory overhead based on the operator-shared parameters, the head dimension, and the key sequence length; constructing a shared-memory overhead constraint based on the shared-memory overhead and the total shared-memory capacity; constructing a register memory overhead constraint based on the register memory overhead and the total register capacity; setting the blocking factor corresponding to each task parameter in the operator-shared parameters to a preset minimum blocking factor, and executing a constraint-range step on each task parameter to obtain the blocking-factor range corresponding to that task parameter; and obtaining an initial mapping space from the blocking-factor ranges, searching under a preset reinforcement learning algorithm based on the initial mapping space, and generating an optimal mapping configuration for the target spatial accelerator. The constraint-range step comprises: substituting the blocking factors corresponding to the task parameters other than the current task parameter in the operator-shared parameters into the shared-memory overhead constraint and the register memory overhead constraint and solving, so as to obtain a maximum blocking size; and generating the blocking-factor range corresponding to the current task parameter from the maximum blocking size and the current task parameter.

In this scheme, the operator-shared parameters characterize the task parameters whose blocking factors must be kept consistent across all operators of the multi-head attention task. Based on the operator-shared parameters, the head dimension, and the key sequence length, the shared-memory overhead and the register memory overhead are constructed: the shared-memory overhead characterizes how much accelerator shared memory the blocking sizes corresponding to a group of blocking factors occupy, and the register memory overhead characterizes how much accelerator register space they occupy. The scheme then solves for the blocking-factor ranges under the resulting shared-memory overhead constraint and register memory overhead constraint, which removes invalid blocking factors that would cause memory overflow and ensures that no candidate mapping configuration in the initial mapping space causes memory overflow, ens
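The factor-decision loop of claim 5 — decide one factor, then re-tighten the ranges of the still-undecided factors — can be sketched as below. This is a hedged illustration only: the reinforcement-learning policy is replaced by a seeded random pick, and the `tighten` callback stands in for the factor-range update algorithm of claim 6; both are assumptions, not the patent's actual components.

```python
import random

def decide_factors(ranges, tighten, seed=0):
    """One pass of claim 5's factor-decision loop over `ranges`,
    a dict mapping factor name -> (lo, hi) inclusive bounds."""
    rng = random.Random(seed)
    decided = {}
    undecided = dict(ranges)
    while undecided:
        name, (lo, hi) = next(iter(undecided.items()))
        decided[name] = rng.randint(lo, hi)   # stand-in for the RL search
        del undecided[name]
        # Re-derive the remaining ranges given everything decided so far;
        # claim 6 applies the memory constraints for blocking factors and
        # the parallel-computing constraint for parallel factors. `tighten`
        # returns a {name: (lo, hi)} dict for the undecided factors only.
        undecided.update(tighten(decided, undecided))
    return decided
```

Once `undecided` is empty, `decided` holds one determined value per target-factor range, corresponding to the final mapping configuration of the claim.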