CN-121996709-A - Sequence processing method, device, computing equipment and program product

CN121996709A

Abstract

The application discloses a sequence processing method, apparatus, computing device, and program product, belonging to the technical field of artificial intelligence (AI). In the present application, for a plurality of segments included in an input sequence, the self-attention between the elements within each segment can be calculated, yielding intra-segment self-attention that contains local context information. In addition, feature information of the plurality of segments may be extracted and used to calculate the self-attention between each element and the plurality of segments, yielding inter-segment self-attention that contains global context information. On this basis, a self-attention output result determined from the intra-segment and inter-segment self-attention of the plurality of elements contains both the local context information of each element within its segment and the global context information of each element. The application therefore improves the precision of the self-attention output result while keeping the computation and memory overhead of the self-attention calculation as low as possible.

Inventors

  • ZHANG ZHENG
  • ZHOU WEI
  • MENG WEIKANG

Assignees

  • Huawei Technologies Co., Ltd. (华为技术有限公司)

Dates

Publication Date
2026-05-08
Application Date
2024-11-04

Claims (18)

  1. A sequence processing method, the method comprising: dividing an input sequence into a plurality of segments, wherein the input sequence comprises a plurality of elements, each segment comprises at least two elements, the plurality of elements in the input sequence are used to represent a plurality of data units in data to be processed, and the plurality of elements are in one-to-one correspondence with the plurality of data units; calculating self-attention among the elements in each segment to obtain the intra-segment self-attention corresponding to each element; extracting feature information of the plurality of segments, and calculating self-attention between each element in the input sequence and the plurality of segments based on the feature information of the plurality of segments, to obtain the inter-segment self-attention corresponding to each element in the input sequence; and determining a self-attention output result based on the intra-segment self-attention and the inter-segment self-attention corresponding to the plurality of elements in the input sequence.
  2. The method of claim 1, wherein: the data to be processed is text, the data unit is a token in the text, and the plurality of elements are feature vectors of a plurality of tokens in the text; or the data to be processed is an image, the data unit is an image region in the image, and the plurality of elements are feature vectors of a plurality of image regions in the image; or the data to be processed is a video, the data unit is a video unit in the video, and the plurality of elements are feature vectors of a plurality of video units in the video; or the data to be processed is voice data, the data unit is a voice segment in the voice data, and the plurality of elements are feature vectors of a plurality of voice segments in the voice data.
  3. The method according to claim 1 or 2, wherein the feature information comprises a feature key vector and a feature value vector, and extracting the feature information of the plurality of segments comprises: extracting features of a key matrix of a first segment among the plurality of segments to obtain the feature key vector of the first segment, wherein the key matrix of the first segment comprises the key vectors of the elements in the first segment; and extracting features of a value matrix of the first segment to obtain the feature value vector of the first segment, wherein the value matrix of the first segment comprises the value vectors of the elements in the first segment.
  4. The method according to claim 3, wherein calculating the self-attention between each element in the input sequence and the plurality of segments based on the feature information of the plurality of segments, to obtain the inter-segment self-attention corresponding to each element in the input sequence, comprises: performing a self-attention calculation based on a query matrix of the input sequence and the feature key vectors and feature value vectors of the plurality of segments to obtain the inter-segment self-attention corresponding to each element in the input sequence, wherein the query matrix of the input sequence comprises the query vectors of the elements in the input sequence.
  5. The method according to any one of claims 1 to 4, wherein calculating the self-attention among the elements in each segment to obtain the intra-segment self-attention corresponding to each element comprises: performing a linear self-attention calculation based on a query matrix, a key matrix, and a value matrix of a first segment among the plurality of segments to obtain the intra-segment self-attention corresponding to each element in the first segment, wherein the query matrix of the first segment comprises the query vectors of the elements in the first segment, the key matrix of the first segment comprises the key vectors of the elements in the first segment, and the value matrix of the first segment comprises the value vectors of the at least two elements in the first segment.
  6. The method of claim 5, wherein the query matrix of the first segment comprises a first query matrix and a second query matrix, and the key matrix of the first segment comprises a first key matrix and a second key matrix, the method further comprising: based on the position of each element of the first segment in the input sequence, performing position coding on an original query matrix and an original key matrix of the first segment with a first position coding operator to obtain the first query matrix and the first key matrix; and, based on the position of each element of the first segment in the input sequence, performing position coding on the original query matrix and the original key matrix of the first segment with a second position coding operator to obtain the second query matrix and the second key matrix.
  7. The method of claim 5 or 6, wherein performing the linear self-attention calculation based on the query matrix, the key matrix, and the value matrix of the first segment among the plurality of segments comprises: acquiring a positive matrix and a negative matrix corresponding to the query matrix of the first segment and a positive matrix and a negative matrix corresponding to the key matrix of the first segment, wherein each positive matrix comprises the positive elements of the corresponding matrix and each negative matrix comprises the negative elements of the corresponding matrix; determining an absolute value matrix of the negative matrix corresponding to the query matrix of the first segment and an absolute value matrix of the negative matrix corresponding to the key matrix of the first segment, wherein an absolute value matrix is obtained by taking the absolute values of the negative elements in the corresponding negative matrix; and performing the linear self-attention calculation based on the positive matrix and the absolute value matrix of the negative matrix corresponding to the query matrix of the first segment, the positive matrix and the absolute value matrix of the negative matrix corresponding to the key matrix of the first segment, and the value matrix.
  8. The method of claim 7, wherein, when the query matrix of the first segment comprises a first query matrix and a second query matrix and the key matrix of the first segment comprises a first key matrix and a second key matrix, performing the linear self-attention calculation comprises: calculating a first self-attention of each element in the first segment based on the positive matrix and the absolute value matrix of the negative matrix corresponding to the first query matrix, the positive matrix and the absolute value matrix of the negative matrix corresponding to the first key matrix, and the value matrix; calculating a second self-attention of each element in the first segment based on the positive matrix and the absolute value matrix of the negative matrix corresponding to the second query matrix, the positive matrix and the absolute value matrix of the negative matrix corresponding to the second key matrix, and the value matrix; and determining the intra-segment self-attention of each element in the first segment based on the first self-attention and the second self-attention of each element in the first segment.
  9. A sequence processing apparatus, the apparatus comprising: a segmentation module, configured to divide an input sequence into a plurality of segments, wherein the input sequence comprises a plurality of elements, each segment comprises at least two elements, the plurality of elements are used to represent a plurality of data units in data to be processed, and the plurality of elements are in one-to-one correspondence with the plurality of data units; an intra-segment self-attention calculation module, configured to calculate the self-attention between the at least two elements in each segment to obtain the intra-segment self-attention corresponding to each element; an inter-segment self-attention calculation module, configured to extract feature information of the plurality of segments and to calculate the self-attention between each element in the input sequence and the plurality of segments based on the feature information of the plurality of segments, to obtain the inter-segment self-attention corresponding to each element in the input sequence; and a self-attention fusion module, configured to determine a self-attention output result based on the intra-segment self-attention and the inter-segment self-attention corresponding to the plurality of elements in the input sequence.
  10. The apparatus of claim 9, wherein: the data to be processed is text, the data unit is a token in the text, and the plurality of elements are feature vectors of a plurality of tokens in the text; or the data to be processed is an image, the data unit is an image region in the image, and the plurality of elements are feature vectors of a plurality of image regions in the image; or the data to be processed is a video, the data unit is a video unit in the video, and the plurality of elements are feature vectors of a plurality of video units in the video; or the data to be processed is voice data, the data unit is a voice segment in the voice data, and the plurality of elements are feature vectors of a plurality of voice segments in the voice data.
  11. The apparatus according to claim 9 or 10, wherein the feature information comprises a feature key vector and a feature value vector, and the inter-segment self-attention calculation module is specifically configured to: extract features of a key matrix of a first segment among the plurality of segments to obtain the feature key vector of the first segment, wherein the key matrix of the first segment comprises the key vectors of the elements in the first segment; and extract features of a value matrix of the first segment to obtain the feature value vector of the first segment, wherein the value matrix of the first segment comprises the value vectors of the elements in the first segment.
  12. The apparatus of claim 11, wherein the inter-segment self-attention calculation module is specifically configured to: perform a self-attention calculation based on a query matrix of the input sequence and the feature key vectors and feature value vectors of the plurality of segments to obtain the inter-segment self-attention corresponding to each element in the input sequence, wherein the query matrix of the input sequence comprises the query vectors of the elements in the input sequence.
  13. The apparatus according to any one of claims 9 to 12, wherein the intra-segment self-attention calculation module is specifically configured to: perform a linear self-attention calculation based on a query matrix, a key matrix, and a value matrix of a first segment among the plurality of segments to obtain the intra-segment self-attention corresponding to each element in the first segment, wherein the query matrix of the first segment comprises the query vectors of the elements in the first segment, the key matrix of the first segment comprises the key vectors of the elements in the first segment, and the value matrix of the first segment comprises the value vectors of the at least two elements in the first segment.
  14. The apparatus of claim 13, wherein the query matrix of the first segment comprises a first query matrix and a second query matrix, the key matrix of the first segment comprises a first key matrix and a second key matrix, and the intra-segment self-attention calculation module is further configured to: based on the position of each element of the first segment in the input sequence, perform position coding on an original query matrix and an original key matrix of the first segment with a first position coding operator to obtain the first query matrix and the first key matrix; and, based on the position of each element of the first segment in the input sequence, perform position coding on the original query matrix and the original key matrix of the first segment with a second position coding operator to obtain the second query matrix and the second key matrix.
  15. The apparatus according to claim 13 or 14, wherein the intra-segment self-attention calculation module is specifically configured to: acquire a positive matrix and a negative matrix corresponding to the query matrix of the first segment and a positive matrix and a negative matrix corresponding to the key matrix of the first segment, wherein each positive matrix comprises the positive elements of the corresponding matrix and each negative matrix comprises the negative elements of the corresponding matrix; determine an absolute value matrix of the negative matrix corresponding to the query matrix of the first segment and an absolute value matrix of the negative matrix corresponding to the key matrix of the first segment, wherein an absolute value matrix is obtained by taking the absolute values of the negative elements in the corresponding negative matrix; and perform the linear self-attention calculation based on the positive matrix and the absolute value matrix of the negative matrix corresponding to the query matrix of the first segment, the positive matrix and the absolute value matrix of the negative matrix corresponding to the key matrix of the first segment, and the value matrix.
  16. A computing device, comprising a processor, the processor being configured to execute a computer program stored in a memory to implement the sequence processing method according to any one of claims 1 to 8.
  17. A computing device cluster, characterized in that it comprises a plurality of computing devices, each comprising a processor and a memory, the processors of the plurality of computing devices being configured to execute instructions stored in the memories of the plurality of computing devices, such that the computing device cluster performs the sequence processing method according to any one of claims 1 to 8.
  18. A computer program product containing instructions which, when executed by a computing device, cause the computing device to perform the sequence processing method according to any one of claims 1 to 8.
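Claims 6 and 14 position-encode the original query and key matrices twice, with a first and a second position coding operator, and claim 8 later fuses the two resulting self-attentions. The claims do not name the operators; the sketch below assumes a rotary-style choice (first operator scales by cos(theta_p), second by sin(theta_p)), under which the sum of the two attention scores depends only on the relative position of the two elements. The function name and the frequency schedule are illustrative, not from the application.

```python
import numpy as np

def positional_qk(Q, K, positions, base=10000.0):
    """Apply two position coding operators to original Q and K matrices.

    Q, K: (n, d) arrays; positions: (n,) positions in the input sequence.
    Returns (Q1, K1) from the first (cos) operator and (Q2, K2) from the
    second (sin) operator. For any query i and key j:
        Q1[i] @ K1[j] + Q2[i] @ K2[j]
          = sum_d Q[i,d] * K[j,d] * cos(theta_d * (p_i - p_j)),
    i.e. the fused score depends only on the relative position p_i - p_j.
    """
    d = Q.shape[-1]
    freqs = base ** (-np.arange(d) / d)           # one frequency per dimension
    theta = positions[:, None] * freqs[None, :]   # (n, d) phase angles
    Q1, K1 = Q * np.cos(theta), K * np.cos(theta) # first position coding operator
    Q2, K2 = Q * np.sin(theta), K * np.sin(theta) # second position coding operator
    return (Q1, K1), (Q2, K2)
```

Because the fused score is relative-position dependent, shifting all positions by a constant leaves the attention scores unchanged, which is why the two encodings can be computed per segment while still reflecting each element's position in the whole input sequence.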
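Claims 5, 7, and 15 combine linear self-attention with a positive/negative decomposition of the query and key matrices, since linear attention needs non-negative feature maps to normalize. The claims do not state how the four non-negative matrices are recombined; one plausible instantiation, assumed below, concatenates [Q+, |Q-|] and [K+, |K-|] along the feature axis, which makes every similarity non-negative and lets the key-value product be aggregated before the query is applied.

```python
import numpy as np

def linear_attention_pos_neg(Q, K, V, eps=1e-6):
    """Linear self-attention over one segment via a positive/negative split.

    Q, K, V: (n, d) arrays. Q+, |Q-|, K+, |K-| are all non-negative, so
    concatenating them along the feature axis gives the non-negative
    similarity Q+ K+^T + |Q-| |K-|^T. Forming (K'^T V) first makes the
    cost O(n * d^2) instead of the O(n^2 * d) of softmax attention.
    """
    q = np.concatenate([np.maximum(Q, 0), np.abs(np.minimum(Q, 0))], axis=-1)
    k = np.concatenate([np.maximum(K, 0), np.abs(np.minimum(K, 0))], axis=-1)
    kv = k.T @ V                   # (2d, d): aggregate keys and values first
    z = q @ k.sum(axis=0)          # (n,): per-query normalizer
    return (q @ kv) / (z[:, None] + eps)
```

Since the attention weights are non-negative and sum to (almost exactly) one per query, every output entry is a convex combination of the corresponding value-matrix column, just as in softmax attention.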

Description

Sequence processing method, device, computing equipment and program product

Technical Field

The present application relates to the field of artificial intelligence (AI), and more particularly to a sequence processing method, apparatus, computing device, and program product.

Background

Currently, AI models such as large language models (LLMs), computer vision (CV) large models, and multi-modal models are mostly based on the transformer architecture. The transformer architecture efficiently models the relation between any two elements of an input sequence through a self-attention mechanism, so that the context of the input sequence can be fully considered when the AI model executes a prediction task, which improves the precision of the prediction task. In the related art, in order to reduce the computation and memory overhead of self-attention, an input sequence may be divided into a plurality of segments, and the self-attention among the elements of each segment is calculated to obtain the self-attention output result corresponding to the input sequence. In this calculation scheme, since the input sequence is divided into segments and self-attention is computed at segment granularity, the self-attention output result contains only the local context information of each element within its segment; the accuracy of the self-attention output result is therefore limited, and so is the accuracy of the output sequence predicted from it.
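The related-art scheme described above (segment the sequence and attend only within each segment) can be written down directly; the NumPy formulation, the function name, and the segment length parameter are illustrative, not from the application. Each query attends only to the keys of its own segment, so the cost drops from O(n^2 * d) to O(n * s * d), but all context outside the segment is lost.

```python
import numpy as np

def segment_local_attention(Q, K, V, seg_len):
    """Softmax self-attention computed independently inside each segment.

    Q, K, V: (n, d) arrays; n must be divisible by seg_len.
    Each output row attends only to the seg_len rows of its own segment,
    so the work is O(n * seg_len * d), but cross-segment context is lost.
    """
    n, d = Q.shape
    out = np.empty_like(V)
    for start in range(0, n, seg_len):
        q = Q[start:start + seg_len]
        k = K[start:start + seg_len]
        v = V[start:start + seg_len]
        scores = q @ k.T / np.sqrt(d)                      # (s, s) per segment
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
        out[start:start + seg_len] = weights @ v
    return out
```

The locality is easy to verify: changing the keys or values of one segment leaves the outputs of every other segment untouched, which is exactly the precision limitation the application addresses.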
Disclosure of Invention

The application provides a sequence processing method, apparatus, computing device, and program product, which can improve the precision of a self-attention output result while keeping the computation and memory overhead of the self-attention calculation as low as possible, thereby improving the prediction precision of an output sequence. To achieve the above purpose, the application adopts the following technical scheme. According to a first aspect, a sequence processing method is provided. The method comprises: dividing an input sequence into a plurality of segments, wherein each segment comprises at least two elements, the elements of the input sequence are used to represent a plurality of data units in data to be processed, and the elements are in one-to-one correspondence with the data units; calculating the self-attention among the elements in each segment to obtain the intra-segment self-attention corresponding to each element; extracting feature information of the segments and calculating the self-attention between each element of the input sequence and the segments based on that feature information, to obtain the inter-segment self-attention corresponding to each element; and determining a self-attention output result based on the intra-segment and inter-segment self-attention corresponding to the elements of the input sequence. The plurality of elements in the input sequence have a sequential order, and each segment comprises at least two consecutive elements. One element of the input sequence is used to characterize one data unit of the data to be processed.
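The first aspect can be sketched end to end as follows. The application does not fix how the segment feature information is extracted or how the two self-attentions are fused into the output result; mean pooling of each segment's key and value matrices and an elementwise sum are assumptions made here purely for illustration, as are the function names.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dual_attention(Q, K, V, seg_len):
    """Intra-segment plus inter-segment self-attention (illustrative sketch).

    Q, K, V: (n, d) arrays; n must be divisible by seg_len.
    """
    n, d = Q.shape
    m = n // seg_len                                   # number of segments
    Qs = Q.reshape(m, seg_len, d)
    Ks = K.reshape(m, seg_len, d)
    Vs = V.reshape(m, seg_len, d)

    # 1) Intra-segment self-attention: local context within each segment.
    scores = Qs @ Ks.transpose(0, 2, 1) / np.sqrt(d)   # (m, s, s)
    intra = (softmax(scores) @ Vs).reshape(n, d)

    # 2) Segment feature key/value vectors (assumed: mean pooling).
    feat_k = Ks.mean(axis=1)                           # (m, d)
    feat_v = Vs.mean(axis=1)                           # (m, d)

    # 3) Inter-segment self-attention: each element vs. the m segment
    #    features, giving global context at O(n * m) instead of O(n^2).
    inter = softmax(Q @ feat_k.T / np.sqrt(d)) @ feat_v

    # 4) Fuse the two self-attentions (assumed: elementwise sum).
    return intra + inter
```

With this structure, every output row carries both the fine-grained context of its own segment and a compressed view of all other segments, while never forming the full n-by-n attention matrix.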
In the present application, for the plurality of segments included in the input sequence, the self-attention between the elements within each segment can be calculated, yielding intra-segment self-attention that contains local context information. In addition, feature information of the plurality of segments may be extracted and used to calculate the self-attention between each element and the plurality of segments, yielding inter-segment self-attention that contains global context information. On this basis, the self-attention output result determined from the intra-segment and inter-segment self-attention of the plurality of elements contains both the local context information of each element within its segment and the global context information of each element. The application therefore improves the precision of the self-attention output result while keeping the computation and memory overhead of the self-attention calculation as low as possible, thereby improving the precision of the output sequence predicted from that result. Optionally, in the present application, the input sequence is used to characterize the data to be processed, and each element of the input sequence is used to characterize one data unit of the data to be processed. The plurality of data units in the data to be processed have a sequential order; for example, the plurality of data units may have a temporal order.