CN-121997981-A - Self-attention calculation method and device, and related products
Abstract
The application provides a self-attention calculation method, a device, and related products. The method comprises: rearranging a plurality of tokens in an input sequence to obtain a first rearranged sequence, an attention mask corresponding to the first rearranged sequence, and a grid mask, wherein the attention mask comprises a plurality of sub-attention masks, the value at any position in the grid mask indicates whether a dependency relationship exists between the tokens in the sub-attention mask corresponding to that position, and the number of first values in the grid mask corresponding to the first rearranged sequence is smaller than or equal to the number of first values in the grid mask corresponding to the input sequence; generating task scheduling information according to the positions of the first values in the grid mask; and performing self-attention calculation on the first rearranged sequence according to the task scheduling information. The method and device can solve the problem of unbalanced load among computing devices when a plurality of computing devices perform self-attention computation in parallel.
Inventors
- QIAN ZHENGHAO
- LI YIWEN
Assignees
- HUAWEI TECHNOLOGIES CO., LTD.
Dates
- Publication Date: 2026-05-08
- Application Date: 2024-11-07
Claims (20)
- 1. A method of self-attention computation, the method comprising: rearranging a plurality of tokens in an input sequence to obtain a first rearranged sequence, an attention mask corresponding to the first rearranged sequence, and a grid mask, wherein the attention mask comprises a plurality of sub-attention masks, the value at any position in the grid mask is used for indicating whether a dependency relationship exists between the tokens in the sub-attention mask corresponding to the position, the number of first values in the grid mask corresponding to the first rearranged sequence is smaller than or equal to the number of first values in the grid mask corresponding to the input sequence, and a first value in the grid mask is used for indicating that a calculation task exists at the position corresponding to the first value; generating task scheduling information according to the positions at which the value in the grid mask is the first value; and performing self-attention calculation on the first rearranged sequence according to the task scheduling information.
- 2. The method of claim 1, wherein rearranging the plurality of tokens in the input sequence to obtain the first rearranged sequence, the attention mask corresponding to the first rearranged sequence, and the grid mask comprises: clustering the attention mask corresponding to the input sequence to obtain a plurality of clustering results, wherein each of the plurality of clustering results records the information of the tokens classified into the same category in the input sequence; rearranging the plurality of tokens in the input sequence according to the information indicated in each clustering result to obtain a plurality of rearranged sequences, wherein each clustering result has a corresponding rearranged sequence, and the relative order of the tokens in the same category in a clustering result is the same as the relative order of the corresponding tokens in the rearranged sequence corresponding to that clustering result; generating attention masks corresponding to the plurality of rearranged sequences, wherein the dependency relationships between the tokens in the attention mask corresponding to each rearranged sequence are the same as the dependency relationships between the tokens in the attention mask corresponding to the input sequence; dividing the rows and columns of each of the attention masks corresponding to the rearranged sequences into a specified number of parts to obtain grid masks corresponding to the rearranged sequences; and determining a first grid mask from the grid masks corresponding to the plurality of rearranged sequences, wherein the rearranged sequence corresponding to the first grid mask is the first rearranged sequence, the attention mask corresponding to the first grid mask is the attention mask corresponding to the first rearranged sequence, and the first grid mask is the grid mask corresponding to the first rearranged sequence.
- 3. The method according to claim 1 or 2, wherein the value of a sub-attention mask at the corresponding position on the corresponding grid mask is the first value when there is a dependency relationship between the tokens within the sub-attention mask.
- 4. The method according to claim 2 or 3, wherein determining the first grid mask from the grid masks corresponding to the plurality of rearranged sequences comprises: calculating a first result of each of the plurality of grid masks, wherein the first result is the number of values in the grid mask that equal the first value; calculating a second result of each of the plurality of grid masks, wherein the second result is the variance of the number of computing tasks selectable by each computing node under the grid mask; calculating a judgment result of each of the plurality of grid masks, wherein the judgment result is a weighted average of the first result and the second result; and taking the grid mask corresponding to the smallest of the judgment results as the first grid mask. (An illustrative sketch of this selection criterion follows the claims.)
- 5. The method of claim 4, wherein the computing task selectable by each computing node is at least one of a query matrix and a key-value matrix, and the key-value matrix comprises a key matrix and a value matrix.
- 6. The method of claim 5, wherein each computing node in the task scheduling information needs to satisfy a first condition and a second condition when selecting a computing task in each computing round, the first condition being that the computing task selected by each computing node is at least one of a query matrix and a key-value matrix, wherein the key-value matrix comprises a key matrix and a value matrix, and the second condition being that the communication volume of each computing node in each computing round is less than or equal to a communication unit threshold.
- 7. The method according to claim 1, wherein the method further comprises: rearranging the plurality of tokens in the input sequence when the self-attention calculation is a non-distributed calculation, to obtain a second rearranged sequence and an attention mask corresponding to the second rearranged sequence; determining position information of one or more regions in the attention mask corresponding to the second rearranged sequence, wherein the regions are regions in the attention mask corresponding to the second rearranged sequence that satisfy a first length and a first width and whose values are all the second value; and inputting the position information of the one or more regions and the attention mask corresponding to the second rearranged sequence into a self-attention operator, and performing self-attention calculation on the second rearranged sequence, wherein the position information of the one or more regions is used for indicating that no self-attention calculation is performed at the positions corresponding to the regions in the second rearranged sequence.
- 8. The method of claim 7, wherein rearranging the plurality of tokens in the input sequence to obtain the second rearranged sequence and the attention mask corresponding to the second rearranged sequence comprises: clustering the attention mask corresponding to the input sequence to obtain a first clustering result, wherein the first clustering result records the information of the tokens classified into the same category in the input sequence; rearranging the plurality of tokens in the input sequence according to the information indicated in the first clustering result to obtain the second rearranged sequence, wherein the relative order of the tokens in the same category in the first clustering result is the same as the relative order of the corresponding tokens in the second rearranged sequence; and generating the attention mask corresponding to the second rearranged sequence, wherein the dependency relationships between the tokens in the attention mask corresponding to the second rearranged sequence are the same as the dependency relationships between the tokens in the attention mask corresponding to the input sequence.
- 9. The method of claim 8, wherein clustering the attention mask corresponding to the input sequence to obtain the first clustering result comprises: clustering the attention mask corresponding to the input sequence to obtain a plurality of clustering results and silhouette coefficients corresponding to the plurality of clustering results; and taking the clustering result corresponding to the largest of the silhouette coefficients corresponding to the plurality of clustering results as the first clustering result.
- 10. A self-attention computing device, the device comprising: an acquisition module, configured to rearrange a plurality of tokens in an input sequence to obtain a first rearranged sequence, an attention mask corresponding to the first rearranged sequence, and a grid mask, wherein the attention mask comprises a plurality of sub-attention masks, the value at any position in the grid mask is used for indicating whether a dependency relationship exists between the tokens in the sub-attention mask corresponding to the position, the number of first values in the grid mask corresponding to the first rearranged sequence is smaller than or equal to the number of first values in the grid mask corresponding to the input sequence, and a first value in the grid mask is used for indicating that a calculation task exists at the position corresponding to the first value; a generating module, configured to generate task scheduling information according to the positions at which the value in the grid mask is the first value; and a calculation module, configured to perform self-attention calculation on the first rearranged sequence according to the task scheduling information.
- 11. The apparatus of claim 10, wherein the acquisition module is configured to: cluster the attention mask corresponding to the input sequence to obtain a plurality of clustering results, wherein each of the plurality of clustering results records the information of the tokens classified into the same category in the input sequence; rearrange the plurality of tokens in the input sequence according to the information indicated in each clustering result to obtain a plurality of rearranged sequences, wherein each clustering result has a corresponding rearranged sequence, and the relative order of the tokens in the same category in a clustering result is the same as the relative order of the corresponding tokens in the rearranged sequence corresponding to that clustering result; generate attention masks corresponding to the plurality of rearranged sequences, wherein the dependency relationships between the tokens in the attention mask corresponding to each rearranged sequence are the same as the dependency relationships between the tokens in the attention mask corresponding to the input sequence; divide the rows and columns of each of the attention masks corresponding to the rearranged sequences into a specified number of parts to obtain grid masks corresponding to the rearranged sequences; and determine a first grid mask from the grid masks corresponding to the plurality of rearranged sequences, wherein the rearranged sequence corresponding to the first grid mask is the first rearranged sequence, the attention mask corresponding to the first grid mask is the attention mask corresponding to the first rearranged sequence, and the first grid mask is the grid mask corresponding to the first rearranged sequence.
- 12. The apparatus according to claim 10 or 11, wherein the value of a sub-attention mask at the corresponding position on the corresponding grid mask is the first value when there is a dependency relationship between the tokens within the sub-attention mask.
- 13. The apparatus according to claim 11 or 12, wherein the acquisition module is configured to: calculate a first result of each of the plurality of grid masks, wherein the first result is the number of values in the grid mask that equal the first value; calculate a second result of each of the plurality of grid masks, wherein the second result is the variance of the number of computing tasks selectable by each computing node under the grid mask; calculate a judgment result of each of the plurality of grid masks, wherein the judgment result is a weighted average of the first result and the second result; and take the grid mask corresponding to the smallest of the judgment results as the first grid mask.
- 14. The apparatus of claim 13, wherein the computing task selectable by each computing node is at least one of a query matrix and a key-value matrix, and the key-value matrix comprises a key matrix and a value matrix.
- 15. The apparatus according to claim 13 or 14, wherein each computing node in the task scheduling information needs to satisfy a first condition and a second condition when selecting a computing task in each computing round, the first condition being that the computing task selected by each computing node is at least one of a query matrix and a key-value matrix, wherein the key-value matrix comprises a key matrix and a value matrix, and the second condition being that the communication volume of each computing node in each computing round is less than or equal to a communication unit threshold.
- 16. The apparatus of claim 10, wherein the acquisition module is configured to rearrange the plurality of tokens in the input sequence when the self-attention calculation is a non-distributed calculation, to obtain a second rearranged sequence and an attention mask corresponding to the second rearranged sequence; the apparatus further comprises a determining module, configured to determine position information of one or more regions in the attention mask corresponding to the second rearranged sequence, wherein the regions are regions in the attention mask corresponding to the second rearranged sequence that satisfy a first length and a first width and whose values are all the second value; and the calculation module is configured to input the position information of the one or more regions and the attention mask corresponding to the second rearranged sequence into a self-attention operator, and perform self-attention calculation on the second rearranged sequence, wherein the position information of the one or more regions is used for indicating that no self-attention calculation is performed at the positions corresponding to the regions in the second rearranged sequence.
- 17. The apparatus of claim 16, wherein the acquisition module is configured to: cluster the attention mask corresponding to the input sequence to obtain a first clustering result, wherein the first clustering result records the information of the tokens classified into the same category in the input sequence; rearrange the plurality of tokens in the input sequence according to the information indicated in the first clustering result to obtain the second rearranged sequence, wherein the relative order of the tokens in the same category in the first clustering result is the same as the relative order of the corresponding tokens in the second rearranged sequence; and generate the attention mask corresponding to the second rearranged sequence, wherein the dependency relationships between the tokens in the attention mask corresponding to the second rearranged sequence are the same as the dependency relationships between the tokens in the attention mask corresponding to the input sequence.
- 18. The apparatus of claim 17, wherein the acquisition module is configured to cluster the attention mask corresponding to the input sequence to obtain a plurality of clustering results and silhouette coefficients corresponding to the plurality of clustering results, and take the clustering result corresponding to the largest of the silhouette coefficients corresponding to the plurality of clustering results as the first clustering result.
- 19. A computing device, comprising a memory and a processor, wherein the processor is configured to execute instructions stored in the memory to cause the computing device to perform the method of any one of claims 1 to 9.
- 20. A cluster of computing devices, comprising at least one computing device, each computing device comprising a processor and a memory; the processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device to cause the cluster of computing devices to perform the method of any one of claims 1 to 9.
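The following is a minimal sketch of the grid-mask selection criterion recited in claims 4 and 13. The function names (`judge`, `pick_first_grid_mask`), the weight `alpha`, and the stand-in notion that a node's selectable tasks are the first-value cells in its block-rows are all illustrative assumptions; the claims do not fix any of these choices.

```python
import numpy as np

def judge(grid: np.ndarray, num_nodes: int, alpha: float = 0.5) -> float:
    """Judgment result of one grid mask: a weighted average of the first
    result (count of first values, i.e. total work) and the second result
    (variance of per-node task counts, i.e. load imbalance)."""
    first_result = int(grid.sum())              # number of first values
    per_node = [int(rows.sum())                 # tasks in each node's
                for rows in np.array_split(grid, num_nodes, axis=0)]  # block-rows (assumption)
    second_result = float(np.var(per_node))
    return alpha * first_result + (1.0 - alpha) * second_result

def pick_first_grid_mask(grids, num_nodes: int, alpha: float = 0.5) -> int:
    """Index of the candidate grid mask with the smallest judgment result."""
    return int(np.argmin([judge(g, num_nodes, alpha) for g in grids]))
```

With equal weights this trades total work against load imbalance, matching the claims' preference for a rearrangement that is both cheap and evenly schedulable.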
Description
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a self-attention calculation method, a self-attention calculation device, and related products.
Background
The self-attention mechanism (self-attention) captures the relations between positions in an input sequence and helps a large model capture key information in the input sequence more accurately, so it is widely used in large-model training based on the Transformer architecture. However, computational complexity grows as the input sequence lengthens, so for long sequences a sequence-parallel method is generally adopted to improve computational efficiency: a long sequence is divided into multiple shorter sub-sequences that are processed in parallel on different computing nodes to optimize resource utilization. In ring attention, the computing nodes are organized in a ring topology, each node connected to its neighbors and communicating in a fixed order, and each computing node determines whether to perform attention calculation in each computing round based on the attention mask corresponding to the input sequence. However, this fixed scheduling of ring attention may leave the computational load unbalanced across the computing nodes in each computing round.
Disclosure of Invention
The embodiments of the application provide a self-attention calculation method, a self-attention calculation device, and related products, which can solve the problem of unbalanced load among computing devices when a plurality of computing devices perform self-attention calculation in parallel.
In a first aspect, the application provides a self-attention calculation method, the method comprising: rearranging a plurality of tokens in an input sequence to obtain a first rearranged sequence, an attention mask corresponding to the first rearranged sequence, and a grid mask, wherein the attention mask comprises a plurality of sub-attention masks, the value at any position in the grid mask indicates whether a dependency relationship exists between the tokens in the sub-attention mask corresponding to that position, the number of first values in the grid mask corresponding to the first rearranged sequence is smaller than or equal to the number of first values in the grid mask corresponding to the input sequence, and a first value in the grid mask indicates that a calculation task exists at the corresponding position; generating task scheduling information according to the positions at which the value in the grid mask is the first value; and performing self-attention calculation on the first rearranged sequence according to the task scheduling information.
In the above scheme, the value at any position in the grid mask indicates whether a dependency relationship exists between the tokens in the corresponding sub-attention mask. When a value in the grid mask is the first value, a calculation task exists at the position corresponding to that value.
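As a concrete illustration of the first aspect, the sketch below builds a grid mask by partitioning an attention mask into sub-attention masks and then derives task scheduling information from the positions holding the first value. The helper names (`block_grid_mask`, `schedule_tasks`) and the least-loaded-node assignment policy are assumptions for illustration; the application describes the scheduler only by the conditions it must satisfy.

```python
import numpy as np

def block_grid_mask(attention_mask: np.ndarray, num_blocks: int) -> np.ndarray:
    """Partition the rows and columns of an attention mask into num_blocks
    parts; a grid cell holds the first value (1) when any dependency exists
    inside the corresponding sub-attention mask, else the second value (0)."""
    n = attention_mask.shape[0]
    step = n // num_blocks  # assumes n is divisible by num_blocks
    grid = np.zeros((num_blocks, num_blocks), dtype=np.int8)
    for i in range(num_blocks):
        for j in range(num_blocks):
            sub = attention_mask[i * step:(i + 1) * step, j * step:(j + 1) * step]
            grid[i, j] = 1 if sub.any() else 0
    return grid

def schedule_tasks(grid: np.ndarray, num_nodes: int) -> dict:
    """Turn first-value positions (real compute tasks) into per-node task
    lists, handing each next task to the least-loaded node so that all-zero
    blocks are never dispatched and load stays balanced."""
    loads = [0] * num_nodes
    schedule = {node: [] for node in range(num_nodes)}
    for i, j in zip(*np.nonzero(grid)):   # only positions with the first value
        node = loads.index(min(loads))    # least-loaded node so far
        schedule[node].append((int(i), int(j)))
        loads[node] += 1
    return schedule

if __name__ == "__main__":
    causal = np.tril(np.ones((8, 8), dtype=np.int8))   # causal attention mask
    grid = block_grid_mask(causal, num_blocks=4)
    print(grid)                      # upper-triangular cells are 0: no task there
    print(schedule_tasks(grid, 2))   # 10 tasks split 5/5 across 2 nodes
```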
The number of first values in the grid mask corresponding to the first rearranged sequence is smaller than or equal to the number of first values in the grid mask corresponding to the input sequence. Therefore, rearranging the tokens of the input sequence can reduce the calculation tasks and improve computational efficiency. In addition, the task scheduling information in the embodiments of the application is determined according to the positions at which the value in the grid mask is the first value. By means of this task scheduling information, not only is each computing device prevented from receiving invalid computing tasks, reducing the waste of computing resources, but the number of computing tasks each computing device handles over the whole self-attention computation can also be balanced as much as possible, solving the problem of unbalanced load across computing devices. Based on the first aspect, in a possible implementation manner, the specific process of rearranging the plurality of tokens in the input sequence to obtain the first rearranged sequence, the attention mask corresponding to the first rearranged sequence, and the grid mask is as follows: clustering the attention mask corresponding to the input sequence to obtain a plurality of clustering results, wherein each of the plurality of clustering results records the information of the tokens classified into the same category in the input sequence; rearranging the plurality of tokens in the input sequence according to the information indicated in each clustering result to obtain a plurality of rearranged sequences
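The clustering-based rearrangement described above (and recited in claims 2, 8, and 9) can be sketched as follows, assuming k-means over the rows of the attention mask as the clustering algorithm and scikit-learn's silhouette score as the silhouette coefficient; neither choice is fixed by the application, and the function name `rearrange_by_clustering` is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def rearrange_by_clustering(attention_mask: np.ndarray, k_candidates=(2, 3, 4)):
    """Cluster tokens by their attention-mask rows, keep the clustering with
    the largest silhouette coefficient, and return a token permutation that
    groups same-category tokens while preserving their relative order."""
    best_labels, best_score = None, -1.0
    for k in k_candidates:
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(attention_mask)
        score = silhouette_score(attention_mask, labels)
        if score > best_score:
            best_labels, best_score = labels, score
    # Stable sort by cluster id: tokens in the same category keep their
    # relative order, matching the requirement in claims 2 and 8.
    order = np.argsort(best_labels, kind="stable")
    rearranged_mask = attention_mask[np.ix_(order, order)]
    return order, rearranged_mask
```

Reindexing both the rows and the columns of the attention mask with the same permutation (`np.ix_(order, order)`) keeps the dependency relationships between tokens unchanged, which is the invariant both claims impose on the rearranged sequence's attention mask.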