CN-121636221-B - Operator execution method, device, equipment, storage medium and product


Abstract

The application discloses an operator execution method, apparatus, device, storage medium, and product. The method comprises: allocating a preset number of buffers for a thread that executes an operator, the preset number being greater than or equal to 3; for any buffer, sequentially sending, by the thread, target data elements in a thread local register array into the buffer for accumulation to obtain an accumulated result in the buffer, the target data elements being the data elements corresponding to thread local register indexes spaced by the preset number in the thread local register array; and accumulating, by the thread, the accumulated results of the corresponding buffers to obtain a row accumulated result for each row, or a column accumulated result for each column, of the thread local register array. Embodiments of the application can reduce the execution time of the operator on the chip and improve operator performance.

Inventors

Request for anonymity
Request for anonymity
Request for anonymity
Request for anonymity
Request for anonymity

Assignees

Shanghai Biren Technology Co., Ltd. (上海壁仞科技股份有限公司)

Dates

Publication Date: 20260508
Application Date: 20260204

Claims (10)

1. An operator execution method, comprising: allocating a preset number of buffers for a thread that executes an operator, wherein the preset number is greater than or equal to 3; for any buffer, sequentially sending, by the thread, target data elements in a thread local register array into the buffer for accumulation to obtain an accumulated result in the buffer, wherein the target data elements are the data elements corresponding to thread local register indexes spaced by the preset number in the thread local register array, and the difference between the thread local register indexes corresponding to two target data elements received consecutively by the same buffer is the preset number; and accumulating, by the thread, the accumulated results of the corresponding buffers to obtain a row accumulated result for each row, or a column accumulated result for each column, of the thread local register array.
2. The operator execution method according to claim 1, wherein the preset number is determined according to any one of the following: the arrangement order of the thread local register indexes in the thread local register array; or the number of thread local registers in the thread local register array.
3. The operator execution method according to claim 1, wherein, for any buffer, the sequentially sending, by the thread, the target data elements in the thread local register array into the buffer for accumulation to obtain an accumulated result in the buffer comprises: determining the preset number of thread local register indexes in the current accumulation instruction period of the thread, and sending the data element corresponding to each determined thread local register index in the thread local register array to the buffer having the target buffer index for accumulation, until the last accumulation instruction period ends, to obtain in each buffer an accumulated result of the target data elements; wherein the target buffer index is the remainder of dividing the thread local register index by the preset number.
4. The operator execution method according to claim 1, wherein the accumulating, by the thread, the accumulated results of the corresponding buffers to obtain a row accumulated result for each row, or a column accumulated result for each column, of the thread local register array comprises: sending, by the thread, the accumulated results belonging to the same row to the same buffer for accumulation to obtain the row accumulated result in the thread local register array; or sending, by the thread, the accumulated results belonging to the same column to the same buffer for accumulation to obtain the column accumulated result in the thread local register array.
5. The operator execution method according to claim 1, further comprising: sending, by the thread, the row accumulated results or the column accumulated results in the thread local register array to the same buffer for accumulation to obtain a row-column reduction result of the thread local register array.
6. The operator execution method according to claim 1, further comprising: performing a cross-thread reduction calculation using the row accumulated results or the column accumulated results in the thread local register array.
7. An operator execution apparatus, comprising: an allocation module, configured to allocate a preset number of buffers for a thread that executes an operator, wherein the preset number is greater than or equal to 3; a first accumulation module, configured to, for any buffer, sequentially send, by the thread, target data elements in a thread local register array into the buffer for accumulation to obtain an accumulated result in the buffer, wherein the target data elements are the data elements corresponding to thread local register indexes spaced by the preset number in the thread local register array, and the difference between the thread local register indexes corresponding to two target data elements received consecutively by the same buffer is the preset number; and a second accumulation module, configured to accumulate, by the thread, the accumulated results of the corresponding buffers to obtain a row accumulated result for each row, or a column accumulated result for each column, of the thread local register array.
8. An operator execution device comprising a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the operator execution method according to any one of claims 1 to 6 when executing the computer program.
9. A computer-readable storage medium, wherein the computer-readable storage medium comprises a stored computer program, and wherein the computer program, when executed, controls a device in which the computer-readable storage medium is located to perform the operator execution method according to any one of claims 1 to 6.
10. A computer program product comprising computer programs/instructions which when executed by a processor implement the operator execution method of any one of claims 1 to 6.
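
The two accumulation stages of claims 1 and 4, plus the row-column reduction of claim 5, can be illustrated with a small sketch in plain Python. This is only an illustration under stated assumptions, not the patented implementation: the thread local register array is modeled as a list of rows, the register index is assumed to be the position of an element within its row, summation stands in for the accumulation, and the names `reduce_rows` and `reduce_rows_and_columns` are hypothetical.

```python
def reduce_rows(tlr_array, num_buffers=4):
    """Return one accumulated result per row of a 2-D list (toy model)."""
    if num_buffers < 3:
        raise ValueError("the claims require a preset number of at least 3")
    row_results = []
    for row in tlr_array:
        # Stage 1: the element at index i goes to buffer i % num_buffers, so
        # each buffer receives indexes spaced num_buffers apart.
        # Assumption: the register index is the position within the row.
        buffers = [0] * num_buffers
        for index, value in enumerate(row):
            buffers[index % num_buffers] += value
        # Stage 2: the partial results of this row are accumulated together.
        row_total = 0
        for partial in buffers:
            row_total += partial
        row_results.append(row_total)
    return row_results


def reduce_rows_and_columns(row_results):
    """Accumulate all row results into a single row-column reduction value."""
    total = 0
    for value in row_results:
        total += value
    return total


tlr_array = [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50]]
row_results = reduce_rows(tlr_array)
grand_total = reduce_rows_and_columns(row_results)
print(row_results, grand_total)  # prints [15, 150] 165
```

In stage 1, consecutive accumulations target different buffers, so none of them has to wait for the one issued immediately before it; only stage 2 combines the partial results.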

Description

Operator execution method, device, equipment, storage medium and product

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular to an operator execution method, apparatus, device, storage medium, and product.

Background

When a chip executes a fused operator, the operator relies on the thread local registers (Thread Local Register, TLR) in the chip's hardware architecture to perform reduction calculations on data along the row direction or the column direction (for example, the maximum of a row or column, the minimum of a row or column, or the sum of a row or column), ultimately producing one reduction value for each row or each column of the TLR array.

In existing implementations of fused operators, the reduction calculation within a thread is implemented using two TLRs and suffers from severe data dependence: a subsequent hardware instruction can be issued only after the calculation result of the previous hardware instruction has returned. This lengthens the waiting time of each hardware instruction, makes the total execution time of the operator on the chip excessive, and severely limits operator performance.

Disclosure of Invention

The application provides an operator execution method, apparatus, device, storage medium, and product to solve the problem in the prior art that severe data dependence in the intra-thread reduction calculation lengthens the waiting time of each hardware instruction, makes the total execution time of the operator on the chip excessive, and severely limits operator performance.
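
The effect of this data dependence can be made concrete with a toy timing model. The sketch below is an assumption-laden illustration, not a description of any actual chip: it assumes in-order issue of at most one accumulate instruction per cycle, a fixed result latency, and round-robin assignment of elements to buffers, and it ignores the final step that combines the buffers. The function name `last_result_cycle` and its parameters are hypothetical.

```python
def last_result_cycle(num_elements, num_buffers, latency):
    """Cycle at which the last accumulate result is available (toy model)."""
    ready_at = [0] * num_buffers  # cycle at which each buffer's value is ready
    next_issue = 0                # earliest cycle the next instruction may issue
    for index in range(num_elements):
        buf = index % num_buffers
        # An accumulate into a buffer must wait for that buffer's previous result.
        issue = max(next_issue, ready_at[buf])
        ready_at[buf] = issue + latency
        next_issue = issue + 1
    return max(ready_at)


# Eight elements, result latency of four cycles (both values are assumptions).
single_chain = last_result_cycle(num_elements=8, num_buffers=1, latency=4)
four_buffers = last_result_cycle(num_elements=8, num_buffers=4, latency=4)
print(single_chain, four_buffers)  # prints 32 11
```

With a single accumulation chain, every instruction waits out the full latency of its predecessor. With four buffers in this model, four independent chains overlap, so most of the waiting disappears.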
To achieve the above object, an embodiment of the present application provides an operator execution method, including: allocating a preset number of buffers for a thread that executes an operator, wherein the preset number is greater than or equal to 3; for any buffer, sequentially sending, by the thread, target data elements in a thread local register array into the buffer for accumulation to obtain an accumulated result in the buffer, wherein the target data elements are the data elements corresponding to thread local register indexes spaced by the preset number in the thread local register array; and accumulating, by the thread, the accumulated results of the corresponding buffers to obtain a row accumulated result for each row, or a column accumulated result for each column, of the thread local register array.

As an improvement of the above scheme, the preset number is determined according to any one of the following: the arrangement order of the thread local register indexes in the thread local register array; or the number of thread local registers in the thread local register array.

As an improvement of the above scheme, for any buffer, the sequentially sending, by the thread, the target data elements in the thread local register array into the buffer for accumulation to obtain an accumulated result in the buffer includes: determining the preset number of thread local register indexes in the current accumulation instruction period of the thread, and sending the data element corresponding to each determined thread local register index in the thread local register array to the buffer having the target buffer index for accumulation, until the last accumulation instruction period ends, to obtain in each buffer an accumulated result of the target data elements; wherein the target buffer index is the remainder of dividing the thread local register index by the preset number.
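
The mapping from register indexes to buffers described above can be traced with a short sketch. It assumes that each accumulation instruction period handles the next run of consecutive register indexes, as many as the preset number; the source does not spell out how indexes are grouped into periods, so this grouping, like the name `dispatch_schedule`, is a hypothetical choice made for illustration.

```python
def dispatch_schedule(num_registers, num_buffers):
    """Return, for each buffer, the register indexes it receives, in order."""
    if num_buffers < 3:
        raise ValueError("the source requires a preset number of at least 3")
    received = [[] for _ in range(num_buffers)]
    # Assumption: one period covers num_buffers consecutive register indexes.
    for period_start in range(0, num_registers, num_buffers):
        period_end = min(period_start + num_buffers, num_registers)
        for index in range(period_start, period_end):
            # Target buffer index = register index modulo the preset number.
            received[index % num_buffers].append(index)
    return received


schedule = dispatch_schedule(num_registers=8, num_buffers=3)
print(schedule)  # prints [[0, 3, 6], [1, 4, 7], [2, 5]]
```

Within each buffer, consecutive register indexes differ by exactly the preset number (3 here), which matches the spacing condition stated for the target data elements.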
As an improvement of the above scheme, the accumulating, by the thread, the accumulated results of the corresponding buffers to obtain a row accumulated result for each row, or a column accumulated result for each column, of the thread local register array includes: sending, by the thread, the accumulated results belonging to the same row to the same buffer for accumulation to obtain the row accumulated result in the thread local register array; or sending, by the thread, the accumulated results belonging to the same column to the same buffer for accumulation to obtain the column accumulated result in the thread local register array.

As an improvement of the above scheme, the method further includes: sending, by the thread, the row accumulated results or the column accumulated results in the thread local register array to the same buffer for accumulation to obtain a row-column reduction result of the thread local register array.

As an improvement of the above scheme, the method further includes performing a cross-thread reduction calculation using a