CN-122019214-A - Inter-thread group data synchronization method, device, computer equipment and program product
Abstract
The application relates to a method, a device, computer equipment and a program product for synchronizing data among thread groups. The method comprises: comparing the number of a current thread group with the number of a current thread bundle; writing the calculation result of the current thread bundle into a specified memory of a target thread group using the data writing mode that corresponds to the comparison result; and sending a first instruction to a data ready synchronization unit of the target thread group, wherein the first instruction instructs the data ready synchronization unit to update a first count value, and the first count value is used for judging whether the data ready synchronization unit is in a ready state. For any target thread group whose data ready synchronization unit is in a ready state, each thread bundle participating in the parallel accumulation operation in that thread group loads data from the specified memory of the target thread group and executes the accumulation operation based on the loaded data. The method can improve the execution efficiency of parallel computing.
Inventors
- Request for anonymity
- Request for anonymity
- Request for anonymity
- Request for anonymity
Assignees
- 上海壁仞科技股份有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20260413
Claims (10)
- 1. A method of inter-thread group data synchronization, the method comprising: comparing the number of the current thread group with the number of the current thread bundle to obtain a comparison result; writing the calculation result of the current thread bundle into a specified memory of a target thread group by adopting a data writing mode corresponding to the comparison result, wherein the number of the target thread group is consistent with the number of the current thread bundle; sending a first instruction to a data ready synchronization unit of the target thread group, wherein the first instruction is used for instructing the data ready synchronization unit to update a first count value, and the first count value is used for judging whether the data ready synchronization unit is in a ready state; and, for any one of the target thread groups, if the data ready synchronization unit in the target thread group is in a ready state, loading, by each thread bundle participating in the parallel accumulation operation in the target thread group, data from the specified memory of the target thread group, and executing the accumulation operation based on the loaded data.
- 2. The method of claim 1, wherein writing the calculation result of the current thread bundle into the specified memory of the target thread group by adopting the data writing mode corresponding to the comparison result comprises: in the case that the comparison result indicates that the number of the current thread group is consistent with the number of the current thread bundle, writing the calculation result of the current thread bundle into a specified memory of the current thread group by adopting a local data writing instruction.
- 3. The method of claim 1, wherein writing the calculation result of the current thread bundle into the specified memory of the target thread group by adopting the data writing mode corresponding to the comparison result comprises: in the case that the comparison result indicates that the number of the current thread group is inconsistent with the number of the current thread bundle, determining the target thread group according to the number of the current thread bundle, and writing the calculation result of the current thread bundle into the specified memory of the target thread group by adopting a remote data writing instruction.
- 4. The method according to claim 3, further comprising: after finishing data writing by adopting the remote data writing instruction, sending a second instruction to the data ready synchronization unit of the target thread group, wherein the second instruction is used for instructing the data ready synchronization unit to update a byte count value; and, in the case that the first count value is updated to 0 and the byte count value is also updated to 0, determining that the data ready synchronization unit in the target thread group is in a ready state.
- 5. The method according to any one of claims 1 to 4, further comprising: after any thread bundle completes the accumulation operation, sending a third instruction to the buffer ready synchronization unit of each thread group, wherein the third instruction instructs the buffer ready synchronization unit to update a second count value; and, when the second count value is updated to 0, determining that the buffer ready synchronization unit is in a ready state.
- 6. The method of claim 5, wherein the initial value of the first count value of the data ready synchronization unit is N, where N is the number of thread groups in the cluster; the initial value of the byte count value of the data ready synchronization unit is D × (N-1), where D is the number of bytes written in a single write; and the initial value of the second count value of the buffer ready synchronization unit is N × M, where M is the number of thread bundles participating in the accumulation operation in a thread group; and wherein N, D and M are positive integers.
- 7. An inter-thread group data synchronization apparatus, the apparatus comprising: a comparison module, used for comparing the number of the current thread group with the number of the current thread bundle to obtain a comparison result; a data writing module, used for writing the calculation result of the current thread bundle into the specified memory of a target thread group by adopting a data writing mode corresponding to the comparison result, wherein the number of the target thread group is consistent with the number of the current thread bundle; a first sending module, used for sending a first instruction to the data ready synchronization unit of the target thread group, wherein the first instruction is used for instructing the data ready synchronization unit to update a first count value, and the first count value is used for judging whether the data ready synchronization unit is in a ready state; and an operation module, used for, for any target thread group, if the data ready synchronization unit in the target thread group is in a ready state, causing each thread bundle participating in the parallel accumulation operation in the target thread group to load data from the specified memory of the target thread group and to execute the accumulation operation based on the loaded data.
- 8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
- 9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
- 10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
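The counter arithmetic recited in claims 4 to 6 can be checked with a small host-side model. The following Python sketch is illustrative only and is not part of the claimed subject matter: the class and function names are hypothetical, and it assumes that "updating" a count means decrementing it (by one for the first and second count values, and by the number of bytes written for the byte count value), which the claims imply but do not state.

```python
class DataReadyUnit:
    """Toy model of one thread group's data ready synchronization unit."""

    def __init__(self, n_groups: int, bytes_per_write: int):
        self.first_count = n_groups                         # initial value N
        self.byte_count = bytes_per_write * (n_groups - 1)  # initial value D x (N-1)

    def first_instruction(self) -> None:
        self.first_count -= 1        # assumption: "update" means decrement by one

    def second_instruction(self, n_bytes: int) -> None:
        self.byte_count -= n_bytes   # assumption: decrement by the bytes written

    def ready(self) -> bool:
        return self.first_count == 0 and self.byte_count == 0


class BufferReadyUnit:
    """Toy model of one thread group's buffer ready synchronization unit."""

    def __init__(self, n_groups: int, warps_per_group: int):
        self.second_count = n_groups * warps_per_group      # initial value N x M

    def third_instruction(self) -> None:
        self.second_count -= 1       # assumption: "update" means decrement by one

    def ready(self) -> bool:
        return self.second_count == 0


def simulate_target(n_groups: int, bytes_per_write: int, warps_per_group: int):
    """Replay the instructions that reach one target thread group."""
    data = DataReadyUnit(n_groups, bytes_per_write)
    buf = BufferReadyUnit(n_groups, warps_per_group)
    history = []

    # One local write: only the first instruction is sent.
    data.first_instruction()
    history.append(data.ready())

    # N-1 remote writes: first instruction plus second instruction (D bytes each).
    for _ in range(n_groups - 1):
        data.first_instruction()
        data.second_instruction(bytes_per_write)
        history.append(data.ready())

    # Each of the N x M accumulating thread bundles sends a third instruction.
    for _ in range(n_groups * warps_per_group):
        buf.third_instruction()

    return history, buf.ready()
```

Under these assumptions, with N = 4, D = 128 and M = 2, the data ready unit becomes ready only after the last of the four writes arrives, and the buffer ready unit becomes ready after all eight third instructions.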
Description
Inter-thread group data synchronization method, device, computer equipment and program product
Technical Field
The present application relates to the field of artificial intelligence, and in particular to a method, an apparatus, a computer device, and a program product for synchronizing data between thread groups.
Background
In a parallel computing architecture, a plurality of thread groups (TGs) are typically combined into a thread group cluster (TG cluster) that serves as a computing execution unit, and each TG is mapped to one computing unit (CU). Different TGs in a TG cluster cooperate through data communication; in a typical scenario, a plurality of TGs compute the same result in parallel and the data are then accumulated. To avoid duplicate computation, each TG receives and accumulates only the computation data of a particular thread bundle (warp): it writes its computation result into the group shared memory (GSM) of the target TG while reading the target data of other TGs from GSM.
In the related art, the collaborative computing process relies on memory barriers (mbar) to guarantee that the GSM data is ready (data ready) and that the buffer is ready (buffer ready). That is, two types of mbar resources are set up: a data ready synchronization unit (s_data_ready_mbar) and a buffer ready synchronization unit (s_buffer_ready_mbar). However, implementing collaborative computing in this way involves double-layer nested traversal and multiple branch judgment operations. Because a conditional jump instruction must be executed in each traversal, the communication delay is long, which reduces the execution efficiency of parallel computing.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method, apparatus, computer device, and program product for synchronizing data among thread groups that can reduce communication latency and thereby improve the execution efficiency of parallel computing.
In a first aspect, the present application provides a method for synchronizing data between thread groups, the method comprising: comparing the number of the current thread group with the number of the current thread bundle to obtain a comparison result; writing the calculation result of the current thread bundle into a specified memory of a target thread group by adopting a data writing mode corresponding to the comparison result, wherein the number of the target thread group is consistent with the number of the current thread bundle; sending a first instruction to a data ready synchronization unit of the target thread group, wherein the first instruction is used for instructing the data ready synchronization unit to update a first count value, and the first count value is used for judging whether the data ready synchronization unit is in a ready state; and, for any one of the target thread groups, if the data ready synchronization unit in the target thread group is in a ready state, each thread bundle participating in the parallel accumulation operation in the target thread group loads data from the specified memory of the target thread group and executes the accumulation operation based on the loaded data.
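The steps of the first aspect can be illustrated with a sequential host-side sketch. The Python code below is a simplified model written for illustration, not the implementation of the application: the names `ThreadGroup`, `write_result` and `run_cluster` are hypothetical, the partial results are made-up values, each thread group is assumed to hold one thread bundle per thread group in the cluster, and updating the first count value is assumed to mean decrementing it from an initial value equal to the number of thread groups.

```python
from dataclasses import dataclass, field


@dataclass
class ThreadGroup:
    tg_id: int
    n_groups: int
    gsm: dict = field(default_factory=dict)  # stand-in for the specified memory
    first_count: int = 0

    def __post_init__(self):
        # Assumption: the first count value starts at the number of thread groups.
        self.first_count = self.n_groups

    def data_ready(self) -> bool:
        return self.first_count == 0


def write_result(cluster, tg_id, warp_id, value):
    """Route one thread bundle's result to the thread group numbered warp_id."""
    target = cluster[warp_id]
    # Compare the thread group number with the thread bundle number.
    mode = "local" if tg_id == warp_id else "remote"
    target.gsm[tg_id] = value   # both modes end up in the target's memory
    target.first_count -= 1     # first instruction: update the first count value
    return mode


def run_cluster(n_groups):
    cluster = [ThreadGroup(i, n_groups) for i in range(n_groups)]
    modes = []
    for tg_id in range(n_groups):
        for warp_id in range(n_groups):
            value = tg_id * 10 + warp_id  # made-up partial result
            modes.append(write_result(cluster, tg_id, warp_id, value))
    # Accumulate only in thread groups whose data ready unit is ready.
    sums = [sum(tg.gsm.values()) for tg in cluster if tg.data_ready()]
    return modes, sums
```

In this model a cluster of three thread groups performs three local writes and six remote writes, after which every data ready unit has counted down to zero and each thread group accumulates the three values it received. On real hardware the writes and the accumulation run concurrently, which this sequential sketch does not attempt to capture.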
In one embodiment, writing the calculation result of the current thread bundle into the specified memory of the target thread group by adopting the data writing mode corresponding to the comparison result includes: in the case that the comparison result indicates that the number of the current thread group is consistent with the number of the current thread bundle, writing the calculation result of the current thread bundle into a specified memory of the current thread group by adopting a local data writing instruction.
In one embodiment, writing the calculation result of the current thread bundle into the specified memory of the target thread group by adopting the data writing mode corresponding to the comparison result includes: in the case that the comparison result indicates that the number of the current thread group is inconsistent with the number of the current thread bundle, determining the target thread group according to the number of the current thread bundle, and writing the calculation result of the current thread bundle into the specified memory of the target thread group by adopting a remote data writing instruction.
In one embodiment, the method further comprises: after finishing data writing by adopting the remote data writing instruction, sending a second instruction to the data ready synchronization unit of the target thread group, wherein the second instruction is used for instructing the data ready synchronization unit to update a byte count value; and in the case that the first count value is updated to 0 and the byte count value is also updated to 0, determining that the d