CN-122019172-A - Computing chip, reduction operation method, related device and medium
Abstract
The disclosure provides a computing chip, a reduction operation method, a related device, and a medium. The computing chip comprises a plurality of computing units, a data caching unit, and a global memory. The computing units execute computing tasks comprising thread bundles (warps) and work groups, and each computing unit comprises a load-store unit and a local shared memory. The data caching unit caches work-group-level reduction results from the computing units and executes reduction operations across work groups; the global memory stores the grid-level reduction results produced by the reduction operations of the data caching unit. A first-level hardware reduction module is integrated in the load-store unit; it executes reduction operations within thread bundles and work groups and submits the resulting work-group-level reduction results to the local shared memory. A second-level hardware reduction module is integrated in the data caching unit, is coupled to the first-level hardware reduction module, and executes reduction operations across work groups. The scheme reduces memory-access latency and kernel-call overhead and improves the execution efficiency of grid reduction operations.
Inventors
- ZHU Hengye
- JIANG Yuxiang
- XU Pingan
- HAO Shaopu
Assignees
- Suzhou Yizhu Intelligent Technology Co., Ltd. (苏州亿铸智能科技有限公司)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-02-02
Claims (15)
- 1. A computing chip, comprising: a plurality of computing units configured to execute computing tasks comprising thread bundles and work groups, each computing unit comprising a load-store unit and a local shared memory; a data caching unit configured to cache work-group-level reduction results from the plurality of computing units and to execute reduction operations across work groups; and a global memory configured to store grid-level reduction results produced by the reduction operations of the data caching unit; wherein a first-level hardware reduction module is integrated in the load-store unit and is configured to execute reduction operations within a thread bundle and a work group and to submit the resulting work-group-level reduction result to the local shared memory; and a second-level hardware reduction module is integrated in the data caching unit, is coupled to the first-level hardware reduction module, and is configured to execute reduction operations across work groups.
- 2. The computing chip of claim 1, wherein the first-level hardware reduction module comprises: a reduction tree hardware unit configured to execute parallel reduction computation on the data of each thread in a single thread bundle and to output a thread-bundle-level reduction result; and a first atomic operation unit configured to execute atomic reduction operations on the thread-bundle-level reduction results of a plurality of thread bundles in the same work group and to store the resulting work-group-level reduction result into the local shared memory.
- 3. The computing chip of claim 1, wherein the second-level hardware reduction module comprises: a second atomic operation unit configured to execute atomic reduction operations on a plurality of work-group-level reduction results obtained from the local shared memories of a plurality of work groups and to generate the grid-level reduction result.
- 4. The computing chip of claim 1, wherein the first-level hardware reduction module and the second-level hardware reduction module cooperate in an asynchronous pipelined manner.
- 5. The computing chip of claim 1, wherein the reduction operation is atomic accumulation, maximization, or minimization.
- 6. The computing chip of claim 2, wherein the reduction tree hardware unit comprises a plurality of comparators or adders connected in a hierarchical cascade to form a tree topology, wherein the number of comparators or adders in each stage is half that of the preceding stage.
- 7. The computing chip of claim 6, wherein the number of comparators or adders is determined based on a total number of threads in each of the thread bundles.
- 8. A reduction operation method applied to the computing chip according to any one of claims 1 to 7, the reduction operation method comprising: in each computing unit, executing, by a first-level hardware reduction module, a reduction operation on the data of each thread in the same thread bundle to generate a thread-bundle-level reduction result; executing, by the first-level hardware reduction module, an atomic reduction operation in a local shared memory on the thread-bundle-level reduction results of the thread bundles in the same work group to obtain a work-group-level reduction result of the work group; and executing, by a second-level hardware reduction module, an atomic reduction operation in a data caching unit on the work-group-level reduction results of the work groups in the same grid, generating a grid-level reduction result of the grid, and writing the grid-level reduction result back to a global memory.
- 9. The reduction operation method according to claim 8, wherein executing, by the first-level hardware reduction module in each computing unit, the reduction operation on the data of each thread in the same thread bundle to generate the thread-bundle-level reduction result comprises: performing hierarchical pairwise reduction on the data of each thread in the same thread bundle through a reduction tree hardware unit until the thread-bundle-level reduction result is generated.
- 10. The reduction operation method according to claim 8, wherein executing, by the first-level hardware reduction module, the atomic reduction operation in the local shared memory on the thread-bundle-level reduction results of the thread bundles in the same work group to obtain the work-group-level reduction result of the work group comprises: acquiring a first destination address in the local shared memory, wherein the first destination address stores the work-group-level reduction result; and, for each thread bundle in the same work group, executing, by a first atomic operation unit, an atomic reduction operation on the current value at the first destination address and the thread-bundle-level reduction result of the current thread bundle, and updating the value at the first destination address.
- 11. The reduction operation method according to claim 8, wherein executing, by the second-level hardware reduction module, the atomic reduction operation in the data caching unit on the work-group-level reduction results of the work groups in the same grid, generating the grid-level reduction result of the grid, and writing the grid-level reduction result back to the global memory comprises: acquiring a second destination address in the global memory that stores the grid-level reduction result; judging whether the data corresponding to the second destination address is cached in the data caching unit; if the cache hits, executing, by the second atomic operation unit, an atomic reduction operation on the current value in the hit cache line and the work-group-level reduction result acquired from the local shared memory; and if the cache misses, loading the data corresponding to the second destination address from the global memory into the data caching unit and then executing the atomic reduction operation.
- 12. The reduction operation method according to claim 11, wherein executing, by the second-level hardware reduction module, the atomic reduction operation in the data caching unit on the work-group-level reduction results of the work groups in the same grid, generating the grid-level reduction result of the grid, and writing the grid-level reduction result back to the global memory further comprises: after all work groups in the grid complete the atomic reduction operation, executing a global synchronization operation; and writing the grid-level reduction result obtained in the data caching unit back to the second destination address of the global memory.
- 13. A reduction operation device applied to the computing chip according to any one of claims 1 to 7, the reduction operation device comprising: a thread-bundle-level reduction unit configured to execute, in each computing unit through the first-level hardware reduction module, a reduction operation on the data of each thread in the same thread bundle to generate a thread-bundle-level reduction result; a work-group-level reduction unit configured to execute, through the first-level hardware reduction module, an atomic reduction operation in a local shared memory on the thread-bundle-level reduction results of the thread bundles in the same work group to obtain a work-group-level reduction result of the work group; and a grid-level reduction unit configured to execute, through the second-level hardware reduction module, an atomic reduction operation in the data caching unit on the work-group-level reduction results of the work groups in the same grid, to generate a grid-level reduction result of the grid, and to write it back to the global memory.
- 14. An electronic device, comprising a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for enabling communication connection between the processor and the memory, wherein the program, when executed by the processor, implements the reduction operation method according to any one of claims 8 to 12.
- 15. A computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the reduction operation method according to any one of claims 8 to 12.
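Claims 6 and 7 above fix the shape of the reduction tree: each stage holds half as many comparators or adders as the one before it, and the stage count follows from the thread-bundle width. A minimal host-side sketch (plain C, compilable under nvcc alongside the CUDA examples in the description) tabulates the per-stage adder counts; `threadsPerBundle` is an illustrative assumption of a 32-thread bundle, not a value taken from the disclosure:

```cuda
#include <cstdio>

// Stage k of the tree holds threadsPerBundle / 2^k adders, half the stage
// before it, giving log2(threadsPerBundle) stages in total.
int main() {
    const int threadsPerBundle = 32;   // assumed thread-bundle width
    int total = 0, stage = 1;
    for (int adders = threadsPerBundle / 2; adders >= 1; adders /= 2, ++stage) {
        printf("stage %d: %d adders\n", stage, adders);
        total += adders;
    }
    printf("total: %d adders across %d stages\n", total, stage - 1);
    return 0;
}
```

For a 32-thread bundle this prints five stages of 16, 8, 4, 2, and 1 adders, 31 in total, matching the `threadsPerBundle - 1` operators a full binary reduction tree requires.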
Description
Computing chip, reduction operation method, related device and medium

Technical Field

The disclosure relates to the field of computer technology, and in particular to a computing chip, a reduction operation method, a related device, and a medium.

Background

In a massively parallel computing architecture such as a GPU (Graphics Processing Unit), grid reduction (Grid Reduce) is a critical parallel computing operation used to aggregate data held by all threads in a computing grid and finally generate a global result. This operation is typically applied in high-performance computing scenarios such as deep-learning gradient synchronization, matrix operations, and statistical analysis, where its execution efficiency directly affects the performance of the overall system. A typical grid reduction gradually merges the values computed by each of thousands of threads into a single output value through a reduction such as addition, maximum, or minimum. Currently, the industry widely employs multi-kernel or multi-stage (Multi-Kernel/Multi-Pass) methods to implement this process: a massively parallel kernel is launched first, each work group (WG) internally completes a work-group-level reduction (WG reduction) to generate a local reduction result and writes it into an intermediate array in global memory; after this kernel finishes, a second, smaller kernel is launched to read all local results from the intermediate array and perform the final global reduction. When the number of partial results is still large, a third or further stage of kernel calls must even be introduced to complete the reduction step by step. However, such multi-kernel or multi-stage implementations have a number of inherent drawbacks. First, each kernel launch and teardown is accompanied by significant scheduling overhead and synchronization delay, and the time cost accumulated over multiple kernel calls severely limits overall execution efficiency. Second, the intermediate results must be transferred through global memory, which causes a large number of unnecessary memory read-write operations; this not only occupies valuable global memory bandwidth but also becomes a performance bottleneck due to the high-latency access path. In addition, the data volume processed in the later reduction stages is far smaller than in the initial stage, so the GPU's huge parallel computing resources cannot be fully utilized and serious resource idling occurs.

Disclosure of Invention

The embodiments of the disclosure provide a computing chip, a reduction operation method, a related device, and a medium, aiming to realize integrated, efficient reduction from the thread-bundle level to the grid level without global-memory round trips or multiple kernel launches, thereby significantly reducing memory-access latency and call overhead, fully exploiting the massive parallel computing capability of the GPU, and improving the execution efficiency of grid reduction operations.
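For concreteness, the conventional multi-kernel approach criticized in the Background can be sketched in CUDA as follows. This is an illustrative reconstruction, not code from the disclosure: `wgReduce` and `partials` are hypothetical names, the input length is assumed to be a multiple of the block size, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// One pass: each work group reduces its slice of `in` into a single
// partial sum written to `out[blockIdx.x]` in global memory.
__global__ void wgReduce(const float* in, float* out, int n) {
    __shared__ float s[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = s[0];  // partial round-trips via global memory
}

int main() {
    const int n = 1 << 16, block = 256, grid = n / block;  // grid == 256 partials
    float *in, *partials, *result;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&partials, grid * sizeof(float));
    cudaMalloc(&result, sizeof(float));
    // ... initialize `in` ...
    wgReduce<<<grid, block>>>(in, partials, n);      // kernel launch 1: WG-level reduction
    wgReduce<<<1, block>>>(partials, result, grid);  // kernel launch 2: final reduction
    // If `grid` exceeded `block`, a third launch would be needed here.
    cudaDeviceSynchronize();
    cudaFree(in); cudaFree(partials); cudaFree(result);
    return 0;
}
```

The round trip of `partials` through global memory and the second (and potentially third) kernel launch are exactly the overheads the disclosure targets.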
According to an aspect of the present disclosure, there is provided a computing chip, comprising: a plurality of computing units configured to execute computing tasks comprising thread bundles and work groups, each computing unit comprising a load-store unit and a local shared memory; a data caching unit configured to cache work-group-level reduction results from the plurality of computing units and to execute reduction operations across work groups; and a global memory configured to store grid-level reduction results produced by the reduction operations of the data caching unit. A first-level hardware reduction module is integrated in the load-store unit and is configured to execute reduction operations within a thread bundle and a work group and to submit the resulting work-group-level reduction result to the local shared memory. A second-level hardware reduction module is integrated in the data caching unit, is coupled to the first-level hardware reduction module, and is configured to execute reduction operations across work groups. Optionally, the first-level hardware reduction module includes: a reduction tree hardware unit configured to execute parallel reduction computation on the data of each thread in a single thread bundle and to output a thread-bundle-level reduction result; and a first atomic operation unit configured to execute atomic reduction operations on the thread-bundle-level reduction results of a plurality of thread bundles in the same work group and to store the resulting work-group-level reduction result into the local shared memory. Optionally, the second-level hardware reduction module includes: a second atomic operation unit configured to execute atomic reduction operations on a plurality of work-group-level reduction results obtained from the local shared memories of a plurality of work groups and to generate the grid-level reduction result.
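The three-level flow just described (a reduction tree within a thread bundle, an atomic merge into local shared memory, and an atomic merge across work groups in the data caching unit) has a close software analogue in a single CUDA kernel. The sketch below is such an analogue under stated assumptions, not the chip's own mechanism: it assumes a 32-thread bundle and sum as the reduction operation, `gridReduce` and `warpReduceSum` are hypothetical names, and the final global `atomicAdd` stands in for the second-level hardware reduction module that the chip integrates into the data caching unit.

```cuda
#include <cuda_runtime.h>

// Level 1: thread-bundle (warp) reduction via a shuffle tree -- a software
// analogue of the reduction tree hardware unit's hierarchical pairwise merge.
__device__ float warpReduceSum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;  // lane 0 now holds the thread-bundle-level result
}

// One kernel performs all three levels; `out` must be zero-initialized.
__global__ void gridReduce(const float* in, float* out, int n) {
    __shared__ float wgResult;  // work-group-level accumulator in local shared memory
    if (threadIdx.x == 0) wgResult = 0.0f;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = warpReduceSum((i < n) ? in[i] : 0.0f);      // level 1: within the thread bundle

    if ((threadIdx.x & 31) == 0) atomicAdd(&wgResult, v); // level 2: atomic op per thread bundle
    __syncthreads();

    // Level 3: a single atomic per work group against global memory; on the
    // chip described above, this step is instead absorbed by the second-level
    // hardware reduction module inside the data caching unit.
    if (threadIdx.x == 0) atomicAdd(out, wgResult);
}
```

Even in software, collapsing the reduction into one kernel removes the intermediate global-memory array and the extra launches; the disclosed chip goes further by moving levels 1 and 2 into the load-store unit and level 3 into the data cache.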