US-20260127701-A1 - COOPERATIVE EXECUTION OF SUBGROUP OPERATIONS
Abstract
This disclosure provides systems, devices, apparatus, and methods, including computer programs encoded on storage media, for concurrently executing disjoint write operations. A graphics processor may receive a representation of source code. The source code may include a group of operations that write to a shared dataset. The group of operations may include a first subgroup of operations and a second subgroup of operations. The processor may execute the first subgroup of operations as a first concurrent plurality of threads that write disjointly to a set of memory locations of the shared dataset during a first set of time periods. The processor may execute the second subgroup of operations as a second concurrent plurality of threads that write disjointly to a subset of the set of memory locations during a second set of time periods. The first set of time periods and the second set of time periods may not overlap.
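The abstract's core idea, partitioning a group of write operations into subgroups whose threads write to mutually disjoint locations, with the subgroups themselves running in non-overlapping time periods, can be sketched in host code. This is an illustrative simulation only, not the claimed GPU implementation; the function name `run_subgroup` and the use of Python threads as stand-ins for GPU threads are assumptions for the sketch.

```python
import threading

def run_subgroup(shared, indices, base):
    """Execute one subgroup: each thread writes to its own disjoint index of
    the shared dataset, so no locking or atomics are needed within the
    subgroup (a stand-in for the claimed concurrent plurality of threads)."""
    threads = [
        threading.Thread(target=shared.__setitem__, args=(i, base + i))
        for i in indices
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # barrier: this subgroup's time period ends before the next begins

shared = [0] * 8

# First subgroup writes disjointly to all eight memory locations ...
run_subgroup(shared, range(8), 100)

# ... then, in a non-overlapping time period, the second subgroup writes
# disjointly to a subset of those locations.
run_subgroup(shared, range(0, 8, 2), 200)

print(shared)  # [200, 101, 202, 103, 204, 105, 206, 107]
```

Because the writes within each subgroup are disjoint and the two subgroups never overlap in time, the final contents are deterministic despite the concurrency.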
Inventors
- Adimulam Ramesh Babu
- Alfredo Olegario Saucedo
- Srihari Babu Alla
- Avinash Seetharamaiah
- Jonnala gadda Nagendra Kumar
Assignees
- QUALCOMM INCORPORATED
Dates
- Publication Date: 2026-05-07
- Application Date: 2024-11-05
Claims (20)
- 1. An apparatus for graphics processing, comprising: a memory; and a processor coupled to the memory and, based at least in part on information stored in the memory, the processor is configured to: receive a representation of source code comprising a group of operations that write to a shared dataset, wherein the group of operations comprises a first subgroup of operations and a second subgroup of operations; execute the first subgroup of operations as a first concurrent plurality of threads that write disjointly to a set of memory locations of the shared dataset during a first set of time periods; and execute the second subgroup of operations as a second concurrent plurality of threads that write disjointly to a subset of the set of memory locations during a second set of time periods, wherein the first set of time periods and the second set of time periods do not overlap.
- 2. The apparatus of claim 1, wherein the first subgroup of operations comprises a plurality of iterations of a write command configured to write to the shared dataset, wherein each iteration of the plurality of iterations corresponds with one thread of the first concurrent plurality of threads, wherein each write command for each thread of the first concurrent plurality of threads writes to a disjoint memory location of the shared dataset with respect to every other write command of the first concurrent plurality of threads during the execution of the first subgroup of operations for each time period of the first set of time periods.
- 3. The apparatus of claim 1, wherein, to execute the first subgroup of operations as the first concurrent plurality of threads that write disjointly to the set of memory locations of the shared dataset during the first set of time periods, the processor is configured to: output a first indication of the first concurrent plurality of threads to a shader processor (SP) for parallel execution of the first concurrent plurality of threads during the first set of time periods, wherein, to execute the second subgroup of operations as the second concurrent plurality of threads that write disjointly to at least the subset of the set of memory locations during the second set of time periods, the processor is configured to: output a second indication of the second concurrent plurality of threads to the SP for parallel execution of the second concurrent plurality of threads during the second set of time periods.
- 4. The apparatus of claim 3, wherein a graphics processing unit (GPU) comprises a plurality of SPs, wherein the plurality of SPs comprises the SP.
- 5. The apparatus of claim 1, wherein the processor is further configured to: assign the first subgroup of operations and the second subgroup of operations to a common workgroup, wherein, to execute the first subgroup of operations as the first concurrent plurality of threads and to execute the second subgroup of operations as the second concurrent plurality of threads, the processor is configured to: output an indication of the common workgroup to a shader processor (SP) for serial execution of each subgroup of operations of the common workgroup and parallel execution of each thread of each subgroup of operations of the common workgroup.
- 6. The apparatus of claim 1, wherein the group of operations comprises an iterative loop having a write function to a shared array of elements, wherein the iterative loop iterates the write function through the shared array of elements.
- 7. The apparatus of claim 1, wherein the group of operations comprises an atomic function that writes to the shared dataset.
- 8. The apparatus of claim 7, wherein the processor is further configured to: determine that the group of operations comprises the atomic function that writes to the shared dataset; and replace the atomic function with a non-atomic function before the execution of the first subgroup of operations and the execution of the second subgroup of operations, wherein, during the execution of the first concurrent plurality of threads, the non-atomic function writes disjointly to the set of memory locations of the shared dataset during the first set of time periods, wherein, during the execution of the second concurrent plurality of threads, the non-atomic function writes disjointly to at least the subset of the set of memory locations during the second set of time periods.
- 9. A method for graphics processing, comprising: receiving a representation of source code comprising a group of operations that write to a shared dataset, wherein the group of operations comprises a first subgroup of operations and a second subgroup of operations; executing the first subgroup of operations as a first concurrent plurality of threads that write disjointly to a set of memory locations of the shared dataset during a first set of time periods; and executing the second subgroup of operations as a second concurrent plurality of threads that write disjointly to a subset of the set of memory locations during a second set of time periods, wherein the first set of time periods and the second set of time periods do not overlap.
- 10. The method of claim 9, wherein the first subgroup of operations comprises a plurality of iterations of a write command configured to write to the shared dataset, wherein each iteration of the plurality of iterations corresponds with one thread of the first concurrent plurality of threads, wherein each write command for each thread of the first concurrent plurality of threads writes to a disjoint memory location of the shared dataset with respect to every other write command of the first concurrent plurality of threads during the execution of the first subgroup of operations for each time period of the first set of time periods.
- 11. The method of claim 9, wherein executing the first subgroup of operations as the first concurrent plurality of threads that write disjointly to the set of memory locations of the shared dataset during the first set of time periods comprises: outputting a first indication of the first concurrent plurality of threads to a shader processor (SP) for parallel execution of the first concurrent plurality of threads during the first set of time periods, wherein executing the second subgroup of operations as the second concurrent plurality of threads that write disjointly to at least the subset of the set of memory locations during the second set of time periods comprises: outputting a second indication of the second concurrent plurality of threads to the SP for parallel execution of the second concurrent plurality of threads during the second set of time periods.
- 12. The method of claim 11, wherein a graphics processing unit (GPU) comprises a plurality of SPs, wherein the plurality of SPs comprises the SP.
- 13. The method of claim 9, further comprising: assigning the first subgroup of operations and the second subgroup of operations to a common workgroup, wherein executing the first subgroup of operations as the first concurrent plurality of threads and executing the second subgroup of operations as the second concurrent plurality of threads comprises: outputting an indication of the common workgroup to a shader processor (SP) for serial execution of each subgroup of operations of the common workgroup and parallel execution of each thread of each subgroup of operations of the common workgroup.
- 14. The method of claim 9, wherein the group of operations comprises an atomic function that writes to the shared dataset.
- 15. The method of claim 14, further comprising: replacing the atomic function with a non-atomic function before the execution of the first subgroup of operations and the execution of the second subgroup of operations, wherein, during the execution of the first concurrent plurality of threads, the non-atomic function writes disjointly to the set of memory locations of the shared dataset during the first set of time periods, wherein, during the execution of the second concurrent plurality of threads, the non-atomic function writes disjointly to at least the subset of the set of memory locations during the second set of time periods.
- 16. A computer-readable medium storing computer executable code, the code, when executed by a processor, causes the processor to: receive a representation of source code comprising a group of operations that write to a shared dataset, wherein the group of operations comprises a first subgroup of operations and a second subgroup of operations; execute the first subgroup of operations as a first concurrent plurality of threads that write disjointly to a set of memory locations of the shared dataset during a first set of time periods; and execute the second subgroup of operations as a second concurrent plurality of threads that write disjointly to a subset of the set of memory locations during a second set of time periods, wherein the first set of time periods and the second set of time periods do not overlap.
- 17. The computer-readable medium of claim 16, wherein the code, when executed by the processor, causes the processor to: assign the first subgroup of operations and the second subgroup of operations to a common workgroup, wherein, to execute the first subgroup of operations as the first concurrent plurality of threads and to execute the second subgroup of operations as the second concurrent plurality of threads, the code, when executed by the processor, causes the processor to: output an indication of the common workgroup to a shader processor (SP) for serial execution of each subgroup of operations of the common workgroup and parallel execution of each thread of each subgroup of operations of the common workgroup.
- 18. The computer-readable medium of claim 16, wherein the group of operations comprises an atomic function that writes to the shared dataset, wherein the code, when executed by the processor, causes the processor to: determine that the group of operations comprises the atomic function that writes to the shared dataset; and replace the atomic function with a non-atomic function before the execution of the first subgroup of operations and the execution of the second subgroup of operations, wherein, during the execution of the first concurrent plurality of threads, the non-atomic function writes disjointly to the set of memory locations of the shared dataset during the first set of time periods, wherein, during the execution of the second concurrent plurality of threads, the non-atomic function writes disjointly to at least the subset of the set of memory locations during the second set of time periods.
- 19. The computer-readable medium of claim 16, wherein, to execute the first subgroup of operations as the first concurrent plurality of threads that write disjointly to the set of memory locations of the shared dataset during the first set of time periods, the code, when executed by the processor, causes the processor to: output a first indication of the first concurrent plurality of threads to a shader processor (SP) for parallel execution of the first concurrent plurality of threads during the first set of time periods, wherein, to execute the second subgroup of operations as the second concurrent plurality of threads that write disjointly to at least the subset of the set of memory locations during the second set of time periods, the code, when executed by the processor, causes the processor to: output a second indication of the second concurrent plurality of threads to the SP for parallel execution of the second concurrent plurality of threads during the second set of time periods.
- 20. The computer-readable medium of claim 16, wherein the first subgroup of operations comprises a plurality of iterations of a write command configured to write to the shared dataset, wherein each iteration of the plurality of iterations corresponds with one thread of the first concurrent plurality of threads, wherein each write command for each thread of the first concurrent plurality of threads writes to a disjoint memory location of the shared dataset with respect to every other write command of the first concurrent plurality of threads during the execution of the first subgroup of operations for each time period of the first set of time periods.
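Claims 6 through 8 describe an iterative loop that writes to a shared array, and the replacement of an atomic function with a non-atomic one once the writes are known to be disjoint. The following host-side simulation illustrates that transformation; it is a sketch of the idea, not the claimed GPU mechanism, and the names `atomic_style` and `non_atomic_subgroups` and the choice of a lock as a stand-in for an atomic operation are assumptions.

```python
import threading

def atomic_style(data, out, lock):
    # Baseline: every write to the shared array is guarded by a lock,
    # standing in for an atomic function in the source code.
    for i, v in enumerate(data):
        with lock:
            out[i] = v * v

def non_atomic_subgroups(data, out, width=4):
    # Because iteration i writes only to out[i], the writes are provably
    # disjoint: the atomic can be replaced with a plain (non-atomic) store,
    # and the iterations launched as concurrent threads, one subgroup of
    # `width` threads per non-overlapping time period.
    for start in range(0, len(data), width):
        threads = [
            threading.Thread(target=out.__setitem__, args=(i, data[i] * data[i]))
            for i in range(start, min(start + width, len(data)))
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()  # one subgroup finishes before the next begins

data = list(range(8))
out_a, out_b = [0] * 8, [0] * 8
atomic_style(data, out_a, threading.Lock())
non_atomic_subgroups(data, out_b)
assert out_a == out_b == [i * i for i in range(8)]
```

The two variants produce identical results; the non-atomic form simply avoids per-write synchronization overhead by exploiting the disjointness that the claims recite.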
Description
TECHNICAL FIELD
The present disclosure relates generally to processing systems and, more particularly, to one or more techniques for graphics processing.
INTRODUCTION
Computing devices often perform graphics and/or display processing (e.g., utilizing a graphics processing unit (GPU), a central processing unit (CPU), a display processor, etc.) to render and display visual content. Such computing devices may include, for example, computer workstations, mobile phones such as smartphones, embedded systems, personal computers, tablet computers, and video game consoles. GPUs are configured to execute a graphics processing pipeline that includes one or more processing stages, which operate together to execute graphics processing commands and output a frame. A central processing unit (CPU) may control the operation of the GPU by issuing one or more graphics processing commands to the GPU. Modern-day CPUs are typically capable of executing multiple applications concurrently, each of which may need to utilize the GPU during execution. A display processor may be configured to convert digital information received from a CPU to analog values and may issue commands to a display panel for displaying the visual content. A device that provides content for visual presentation on a display may utilize a CPU, a GPU, and/or a display processor. Current techniques may not address efficient execution of iterative loops using multi-core processors. There is a need for improved iterative processing techniques.
BRIEF SUMMARY
The following presents a simplified summary of one or more aspects in order to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated aspects and is intended neither to identify key or critical elements of all aspects nor to delineate the scope of any or all aspects.
Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that is presented later.
In an aspect of the disclosure, a method, a computer-readable medium, and an apparatus are provided. The apparatus may include a memory and at least one processor coupled to the memory; based at least in part on information stored in the memory, the at least one processor may be configured to receive a representation of source code that includes a group of operations that write to a shared dataset. The group of operations may include a first subgroup of operations and a second subgroup of operations. The at least one processor may be configured to execute the first subgroup of operations as a first concurrent plurality of threads that write disjointly to a set of memory locations of the shared dataset during a first set of time periods. The at least one processor may be configured to execute the second subgroup of operations as a second concurrent plurality of threads that write disjointly to a subset of the set of memory locations during a second set of time periods. The first set of time periods and the second set of time periods may not overlap.
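The apparatus aspect above (and claim 5) has the subgroups assigned to a common workgroup that a shader processor executes serially, with the threads of each subgroup running in parallel. A minimal host-side model of that dispatch, assuming a thread pool as a stand-in for the SP and hypothetical helper names `dispatch_workgroup` and `make_write`, might look like:

```python
from concurrent.futures import ThreadPoolExecutor

def dispatch_workgroup(workgroup, shared):
    """Model of an SP receiving one workgroup indication: subgroups execute
    serially (non-overlapping time periods), while the threads within each
    subgroup execute in parallel and write to disjoint locations."""
    with ThreadPoolExecutor() as pool:
        for subgroup in workgroup:                     # serial across subgroups
            futures = [pool.submit(op, shared) for op in subgroup]
            for f in futures:
                f.result()                             # barrier: subgroup completes first

def make_write(loc, val):
    # One operation of a subgroup: a disjoint write to a single location.
    return lambda shared: shared.__setitem__(loc, val)

shared = [0] * 4
workgroup = [
    [make_write(i, 1) for i in range(4)],   # first subgroup: all four locations
    [make_write(i, 2) for i in (1, 3)],     # second subgroup: a subset of them
]
dispatch_workgroup(workgroup, shared)
print(shared)  # [1, 2, 1, 2]
```

Serializing the subgroups while parallelizing their threads preserves the ordering guarantee the summary describes: the second subgroup's writes to the subset always land after the first subgroup's.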
In some aspects, the techniques described herein relate to a method for graphics processing, including: receiving a representation of source code including a group of operations that write to a shared dataset, where the group of operations includes a first subgroup of operations and a second subgroup of operations; executing the first subgroup of operations as a first concurrent plurality of threads that write disjointly to a set of memory locations of the shared dataset during a first set of time periods; and executing the second subgroup of operations as a second concurrent plurality of threads that write disjointly to a subset of the set of memory locations during a second set of time periods, where the first set of time periods and the second set of time periods do not overlap.
In some aspects, the techniques described herein relate to a method, where the first subgroup of operations includes a plurality of iterations of a write command configured to write to the shared dataset, where each iteration of the plurality of iterations corresponds with one thread of the first concurrent plurality of threads, where each write command for each thread of the first concurrent plurality of threads writes to a disjoint memory location of the shared dataset with respect to every other write command of the first concurrent plurality of threads during the execution of the first subgroup of operations for each time period of the first set of time periods.
In some aspects, the techniques described herein relate to a method, where executing the first subgroup of operations as the first concurrent plurality of threads that write disjointly to the set of memory locations of the shared dataset during the first set of time periods includes: outputting a first indication of the first concurrent plurality of threads to a shader processor (SP) for parallel execution of the first concurrent plurality of threads during the first set of time periods,