CN-121542063-B - Data processing device, data processing method, chip and electronic equipment

CN121542063BCN 121542063 BCN121542063 BCN 121542063BCN-121542063-B

Abstract

The application discloses a data processing device, a data processing method, a chip and electronic equipment, and relates to the technical field of data processing. The data processing device comprises a storage module, a calculation module and a merging and sorting module, wherein the storage module is configured to store data, the calculation module is coupled to the storage module and configured to access an original data vector stored by the storage module, the original data vector comprises m data elements, the original data vector is split based on a target k value to obtain a plurality of first data vectors, the data elements in the first data vectors are respectively subjected to odd-even merging and sorting to obtain a plurality of first ordered data vectors, and the data elements in the first ordered data vectors are subjected to merging and sorting to obtain top-k data elements in the m data elements. The data processing device, the data processing method, the chip and the electronic equipment provided by the application can obtain the top-k data elements without a special hardware ordering network.

Inventors

LIU CHAO
OUYANG PENG
YU YI
LI XIUDONG
WANG BO

Assignees

北京清微智能科技有限公司

Dates

Publication Date: 20260508
Application Date: 20260120

Claims (20)

1. A data processing apparatus, comprising: A storage module configured to store data; a computing module including a single instruction multiple data execution unit and a vector register set, the single instruction multiple data execution unit coupled to the vector register set, the vector register set coupled to the storage module, the computing module configured to: Accessing an original data vector stored by the storage module, wherein the original data vector comprises m data elements, and m is a positive integer; splitting the original data vector based on a target k value to obtain a plurality of first data vectors, wherein k is a positive integer; Mapping the data elements in each first data vector into the vector register group respectively, and performing parallel comparison-exchange operation on the data elements in the vector register group by utilizing the single-instruction multi-data execution unit to realize parity merging and sequencing on the data elements in each first data vector respectively to obtain a plurality of first ordered data vectors; merging and sorting the data elements in the first ordered data vectors to obtain top-k data elements in the m data elements; wherein parity merge ordering of data elements in each of the first data vectors comprises: Mapping the data elements in the first data vector into the vector register group according to a first mapping rule, performing parallel comparison-exchange operation on the data elements in the same position in the vector register group by utilizing the single-instruction multi-data execution unit to obtain two first sub-data vectors, and reversely mapping the data elements in the two first sub-data vectors back to an address space of the storage module for storing the first data vector according to the first mapping rule respectively so as to update the first data vector stored in the storage module into a second data vector; Mapping the data elements in the second data vector into the vector register group according to a second mapping rule, performing parallel comparison-exchange operation on the data elements in the same position in the vector register group by utilizing the single-instruction multiple-data execution unit to obtain two second sub-data vectors, and reversely mapping the data elements in the two second sub-data vectors back to an address space where the second data vector is stored by the storage module according to the second mapping rule respectively, so that the second data vector stored in the storage module is updated into a third data vector; And the like, until the comparison-exchange operation required by one round of parity merging and sorting is completed, namely the parity merging and sorting of the first data vector is completed, and a first ordered data vector is obtained.
2. The data processing apparatus of claim 1, wherein the computing module splits the original data vector based on a target k value to obtain a plurality of first data vectors, comprising: Calculating a power integer of 2 which is greater than or equal to a k value according to a target k value; and splitting the original data vector by taking the integer of the power of 2 which is larger than or equal to the k value as a grouping granularity to obtain a plurality of first data vectors.
3. The data processing apparatus of claim 1 wherein the computation module maps data elements in one of the first data vectors into the vector register set at a time and performs parallel compare-swap operations on the data elements in the vector register set with the single instruction multiple data execution unit to perform parity merge ordering on the data elements in the first data vector to obtain one first ordered data vector; Or alternatively The computing module maps data elements in at least two first data vectors into the vector register group each time, and performs parallel comparison-exchange operation on the data elements in the vector register group by utilizing the single-instruction multi-data execution unit to implement parallel parity merging and sorting on the data elements in the at least two first data vectors, so as to obtain first ordered data vectors corresponding to the at least two first data vectors.
4. The data processing apparatus according to claim 1, wherein, The mapping of the data elements in the first data vector into the vector register set according to a first mapping rule comprises determining the data element pairs to be compared-exchanged in the first data vector based on a parity merge ordering algorithm; mapping two data elements in each of the data element pairs to the same positions of two different vector registers, respectively, wherein the data elements in each of the data element pairs are mapped into the two different vector registers; The mapping the data elements in the second data vector into the vector register set according to the second mapping rule comprises determining a data element pair needing comparison-exchange in the second data vector based on a parity merge sorting algorithm, and mapping two data elements in each data element pair to the same positions of two different vector registers, wherein the data elements in each data element pair are mapped into the two different vector registers.
5. The data processing apparatus according to claim 4, wherein, The method comprises the steps of determining an index value of each data element according to the position of each data element in the two first sub-data vectors in two vector registers, writing each data element in the two first sub-data vectors back to the storage module according to the index value, covering the data element corresponding to the index value of the first data vector stored in the storage module, and updating the first data vector stored in the storage module into a second data vector; The method comprises the steps of determining an index value of each data element according to the position of each data element in the two second sub-data vectors in two vector registers, writing each data element in the two second sub-data vectors back to the storage module according to the index value, covering the data element corresponding to the index value of the second data vector stored in the storage module, and updating the second data vector stored in the storage module into a third data vector.
6. The data processing apparatus of claim 1 wherein the single instruction multiple data execution unit performs parallel compare-swap operations that may be performed in parallel during parity merge ordering of at least two of the first data vectors using a single instruction multiple data operation instruction while the computing module processes the at least two of the first data vectors in parallel.
7. The data processing apparatus of claim 1, wherein the computing module performs merge ordering on the data elements in the first plurality of ordered data vectors to obtain top-k data elements in the m data elements, comprising: Grouping the plurality of first ordered data vectors in pairs, and respectively merging and sorting data elements in the two first ordered data vectors in each group to obtain a plurality of first target data vectors; intercepting top-b data elements of each first target data vector respectively to generate a plurality of second ordered data vectors, wherein b is a power integer of 2 which is greater than or equal to the k value; Grouping the plurality of second ordered data vectors in pairs, and respectively merging and sorting data elements of the two second ordered data vectors in each group to obtain a plurality of second target data vectors; Intercepting top-b data elements of each second target data vector respectively to generate a plurality of third ordered data vectors; and so on until 1 nth ordered data vector is generated; and intercepting a top-k data element in the nth ordered data vector, wherein the top-k data element in the nth ordered data vector is a top-k data element in the m data elements.
8. The data processing device of claim 7, wherein the computing module performs pairwise grouping on the plurality of first ordered data vectors, performs merging and sorting on data elements in the two first ordered data vectors in each group to obtain a plurality of first target data vectors, and comprises performing pairwise grouping on the plurality of first ordered data vectors, processing one group of the first ordered data vectors at a time, sequentially completing merging and sorting on the two first ordered data vectors in each group to obtain a plurality of first target data vectors, or performing pairwise grouping on the plurality of first ordered data vectors, processing at least two groups of the first ordered data vectors at a time in parallel until merging and sorting on the two first ordered data vectors in each group is completed, and finally obtaining a plurality of first target data vectors; The calculation module performs pairwise grouping on the plurality of second ordered data vectors, performs merging and sorting on data elements of the two second ordered data vectors in each group to obtain a plurality of second target data vectors, and comprises the steps of performing pairwise grouping on the plurality of second ordered data vectors, processing one group of the second ordered data vectors each time, sequentially completing merging and sorting on the two second ordered data vectors in each group to obtain a plurality of second target data vectors, or performing pairwise grouping on the plurality of second ordered data vectors, processing at least two groups of the second ordered data vectors each time in parallel until merging and sorting on the two second ordered data vectors in each group is completed, and finally obtaining a plurality of second target data vectors.
9. The data processing apparatus of claim 8, wherein the computing module merge ordering the data elements of the two first ordered data vectors within each group, comprising: Mapping data elements in two first ordered data vectors in the group into the vector register group according to a third mapping rule, performing parallel comparison-exchange operation on the data elements in the same position in the vector register group by utilizing the single-instruction multiple-data execution unit to obtain two first intermediate data vectors, and then reversely mapping the data elements in the two first intermediate data vectors back into the storage module according to the third mapping rule to obtain two second intermediate data vectors; Mapping the data elements in the two second intermediate data vectors into the vector register group according to a fourth mapping rule, performing parallel comparison-exchange operation on the data elements in the same position in the vector register group by utilizing the single-instruction multiple-data execution unit to obtain two third intermediate data vectors, and then reversely mapping the data elements in the two third intermediate data vectors into the storage module according to the fourth mapping rule to obtain two fourth intermediate vectors; and the like, until the comparison-exchange operation required by one round of merging and sorting is completed, namely completing merging and sorting of the two first ordered data vectors, and obtaining a first target data vector.
10. The data processing apparatus of claim 9, wherein the computing module merge ordering the data elements of the two second ordered data vectors within each group, comprising: Mapping data elements in two second ordered data vectors in the group into the vector register group according to a third mapping rule, performing parallel comparison-exchange operation on the data elements in the same position in the vector register group by utilizing the single-instruction multiple-data execution unit to obtain two first intermediate data vectors, and then reversely mapping the data elements in the two first intermediate data vectors back into the storage module according to the third mapping rule to obtain two second intermediate data vectors; Mapping the data elements in the two second intermediate data vectors into the vector register group according to a fourth mapping rule, performing parallel comparison-exchange operation on the data elements in the same position in the vector register group by utilizing the single-instruction multiple-data execution unit to obtain two third intermediate data vectors, and then reversely mapping the data elements in the two third intermediate data vectors into the storage module according to the fourth mapping rule to obtain two fourth intermediate vectors; and the like, until the comparison-exchange operation required by one round of merging and sorting is completed, namely completing merging and sorting of the two second ordered data vectors, and obtaining a second target data vector.
11. The data processing apparatus according to claim 10, wherein said mapping data elements in two first/second ordered data vectors within the group into the vector register group according to a third mapping rule, respectively, comprises: determining a first valid data element pair to be compared and exchanged in two ordered data vectors in the group based on a set merging and sorting algorithm; Mapping two data elements in each first valid data element pair to the same positions of two different vector registers, wherein the data elements in each first valid data element pair are mapped into the two different vector registers; The mapping the data elements in the two second intermediate data vectors into the vector register set according to a fourth mapping rule includes: determining a second valid data element pair which needs to be compared and exchanged in the two second intermediate data vectors based on the set merging and sorting algorithm; And mapping two data elements in each second valid data element pair to the same positions of two different vector registers respectively, wherein the data elements in each second valid data element pair are mapped into the two different vector registers.
12. The data processing apparatus according to claim 11, wherein the reversely mapping the data elements in the two first intermediate data vectors back to the storage module according to the third mapping rule, respectively, to obtain two second intermediate data vectors, includes: determining an index value of each data element according to the position of each data element in the two first intermediate data vectors in the two vector registers, writing each data element in the two first intermediate data vectors back to the storage module according to the index value, and covering the data elements of the index values corresponding to the two ordered data vectors stored in the storage module so that the two ordered data vectors stored in the storage module are updated into two second intermediate data vectors; The reversely mapping the two third intermediate data vectors back to the storage module according to the fourth mapping rule to obtain two fourth intermediate vectors, including: and determining an index value of each data element according to the positions of each data element in the two third intermediate data vectors in the two vector registers, writing each data element in the two third intermediate data vectors back to the storage module according to the index value, and covering the data elements of the index values corresponding to the two second intermediate data vectors stored in the storage module so that the two second intermediate data vectors stored in the storage module are updated into the two fourth intermediate data vectors.
13. The data processing apparatus of claim 10 wherein the single instruction multiple data execution unit performs parallel compare-swap operations that may be performed in parallel during merge ordering of at least two sets of the ordered data vectors using a single instruction multiple data operation instruction each time the computing module processes the at least two sets of the first/second ordered data vectors in parallel.
14. The data processing apparatus of claim 11, wherein the compare-swap operation performed on the data elements in the first valid data element pair is a compare-swap operation that contributes to the top-k data element identification; the compare-swap operation performed on the data elements in the second valid data element pair is a compare-swap operation that contributes to the top-k data element identification.
15. The data processing apparatus of claim 1, further comprising: an instruction decode unit, coupled to the memory module, configured to parse RISC-V instructions extracted from the memory module; A scheduler, coupled to the instruction decode unit and the computation module, configured to determine an execution order of the RISC-V instructions based on the instruction characteristics of the RISC-V instructions and the resource conditions of the computation module, such that the single instruction multiple data execution unit executes the RISC-V instructions in the execution order of the RISC-V instructions.
16. The data processing apparatus of claim 15, wherein the storage module comprises: A static random access memory coupled to the computing module configured to store data and RISC-V instructions.
17. The data processing apparatus of claim 16, further comprising: an instruction cache unit coupled to the static random access memory and configured to fetch and store RISC-V instructions from the static random access memory; And the instruction decoding unit is coupled to the instruction cache unit and the instruction decoding unit and is configured to orderly store RISC-V instructions acquired from the instruction cache unit, so that the instruction decoding unit acquires the RISC-V instructions from the instruction cache unit.
18. A method of data processing, comprising: Accessing an original data vector stored by a storage module, wherein the original data vector comprises m data elements, and m is a positive integer; splitting the original data vector based on a target k value to obtain a plurality of first data vectors, wherein k is a positive integer; Mapping the data elements in each first data vector into a vector register group respectively, and carrying out parallel comparison-exchange operation on the data elements in the vector register group by utilizing a single-instruction multi-data execution unit to realize parity merging and sequencing on the data elements in each first data vector respectively to obtain a plurality of first ordered data vectors; merging and sorting the data elements in the first ordered data vectors to obtain top-k data elements in the m data elements; wherein parity merge ordering of data elements in each of the first data vectors comprises: Mapping the data elements in the first data vector into the vector register group according to a first mapping rule, performing parallel comparison-exchange operation on the data elements in the same position in the vector register group by utilizing the single-instruction multi-data execution unit to obtain two first sub-data vectors, and reversely mapping the data elements in the two first sub-data vectors back to an address space of the storage module for storing the first data vector according to the first mapping rule respectively so as to update the first data vector stored in the storage module into a second data vector; Mapping the data elements in the second data vector into the vector register group according to a second mapping rule, performing parallel comparison-exchange operation on the data elements in the same position in the vector register group by utilizing the single-instruction multiple-data execution unit to obtain two second sub-data vectors, and reversely mapping the data elements in the two second sub-data vectors back to an address space where the second data vector is stored by the storage module according to the second mapping rule respectively, so that the second data vector stored in the storage module is updated into a third data vector; And the like, until the comparison-exchange operation required by one round of parity merging and sorting is completed, namely the parity merging and sorting of the first data vector is completed, and a first ordered data vector is obtained.
19. A chip comprising a data processing apparatus as claimed in any one of claims 1 to 17.
20. An electronic device comprising a chip as claimed in claim 19.

Description

Data processing device, data processing method, chip and electronic equipment Technical Field The present application relates to the field of data processing technologies, and in particular, to a data processing apparatus, a data processing method, a chip, and an electronic device. Background In a Large Language Model (LLM) technical system, a top-k operator is a core component for balancing model efficiency and performance, and the core value is that an assistance model realizes high-efficiency decision in a complex natural language processing task by dynamically screening key information. LLM represented by GPT, LLaMA, BERT is required to complete core tasks such as text generation, semantic understanding, logical reasoning and the like, the tasks require a model to rapidly locate key contents from massive candidate information (such as words and knowledge fragments), and the characteristic that top-k operators 'take top k optimal solutions' is exactly and accurately adapted to the high-frequency requirement. Especially in the mixed expert model (MoE) architecture, the top-k operator acts as a core hub to achieve sparse activation and efficient computation, which directly determines the resource allocation logic of the model. The MoE large model refers to that the model is split into a plurality of expert networks, activated expert combinations are dynamically screened for each input sample by means of a gating network, a top-k operator is a core engine of the sparse activation technology, and the expert subset participating in the current calculation is efficiently determined by sequencing expert matching scores, so that the computational bottleneck of a traditional dense model (such as a transducer) is fundamentally broken through. Since each layer of a conventional dense large model requires calculation of all parameters of that layer of the network, the amount of calculation grows exponentially with the model size (e.g., GPT-3 for 175B parameters is subject to high reasoning costs). Compared with the method, the dynamic sparse activation of the MoE large model is realized through top-k routing, and a small number of activation experts are selected to participate in calculation, so that the calculation amount of the model is obviously reduced, and the actual calculation amount of the model can be controlled to be equivalent to that of a medium-small dense large model while the parameter scale of the model is expanded to trillion or trillion levels, so that the expandability and economy of the model are obviously improved. Considering the key role of the top-k operator in the large language model, the calculation efficiency (such as the index and the score of the first k experts are rapidly extracted) of the top-k operator on hardware equipment such as GPU (graphic processor)/TPU (tensor processor) is researched and optimized, the training and reasoning speed of the MoE large model is directly influenced, and the key technology point for connecting algorithm design and engineering landing is realized. In the prior art, the data is generally ordered in parallel by using a special hardware ordering network, and then the needed top-k data is selected according to the ordered result, so that although the execution efficiency is high by means of special hardware, a software compiler is needed to carry out customized design aiming at the characteristics of the special hardware, the programming flexibility and the universality are poor, and the large model is not beneficial to the expansion and migration of different hardware platforms. This section is intended to provide a background or context to the embodiments of the application that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section. Disclosure of Invention In order to solve at least one of the above problems in the prior art, embodiments of the present application provide a data processing apparatus, a data processing method, a chip, and an electronic device. The embodiment of the application provides a data processing device which comprises a storage module configured to store data, a calculation module and a data processing module, wherein the calculation module comprises a single-instruction multi-data execution unit and a vector register group, the single-instruction multi-data execution unit is coupled to the vector register group, the vector register group is coupled to the storage module, the calculation module is configured to access an original data vector stored by the storage module, the original data vector comprises m data elements, m is a positive integer, the original data vector is split based on a target k value to obtain a plurality of first data vectors, k is a positive integer, the data elements in each first data vector are mapped into the vector register group respectively, parallel comparison-exchange operation is carried out on the data elements in the vector register gro