CN-122018995-A - GPU instruction execution method, device, computer equipment, medium and program product

Abstract

The disclosure provides a GPU instruction execution method, apparatus, computer device, medium, and program product, relating to the field of computer technology and in particular to GPU instruction scheduling and execution. The implementation scheme is as follows: a first instruction and first register data of a first thread bundle are acquired; bits indicating a valid thread state in the first register data are rearranged to the front end to generate rearranged first register data; a first operand required by a valid thread to execute the first instruction is stored in a first reservation stack based on the rearranged first register data; a mapping table is established to record the correspondence between the number of each thread before and after rearrangement; the first operand is transmitted to a first operation unit to perform a first operation; the operation result is written back to the target storage location corresponding to the valid thread based on the mapping table; and, while the first operation is being executed, a second operand stored in a second reservation stack is transmitted, in the same clock cycle, to an unoccupied operation unit to perform a second operation.

Inventors

JIANG TAO
WANG TIANTIAN

Assignees

瀚博半导体(上海)股份有限公司

Dates

Publication Date: 20260512
Application Date: 20260409

Claims (10)

1. A method for executing GPU instructions, the method comprising:
   acquiring a first instruction to be executed by a first thread bundle and first register data, wherein the first register data indicates the state of each thread in the first thread bundle when executing the first instruction, and the bits of the first register data correspond one-to-one to the threads in the first thread bundle;
   traversing each bit of the first register data, and rearranging the bits indicating that the thread state is valid to the front end, arranged contiguously, to generate rearranged first register data;
   storing, based on the rearranged first register data, a first operand required by a thread in the first thread bundle whose state is valid to execute the first instruction into a first reservation stack;
   establishing a mapping table, wherein the mapping table records the correspondence between the number of each thread in the first thread bundle after rearrangement and its original number before rearrangement;
   transmitting the first operand in the first reservation stack to at least one first operation unit to perform a first operation corresponding to the first instruction;
   writing, based on the correspondence recorded in the mapping table, the operation result output by the at least one first operation unit back to a target storage location corresponding to a thread in the first thread bundle whose state is valid before rearrangement; and
   while the first operation is being executed, in response to a second instruction having been acquired and a second operand required for executing the second instruction having been stored in a second reservation stack, transmitting the second operand to at least one second operation unit to perform a second operation corresponding to the second instruction, wherein the at least one second operation unit comprises an unoccupied operation unit.
2. The method of claim 1, wherein storing, based on the rearranged first register data, the first operand required by a thread in the first thread bundle whose state is valid to execute the first instruction into the first reservation stack comprises: reading, based on the rearranged first register data, the first operand required by the thread in the first thread bundle whose state is valid to execute the first instruction from an original register module; and storing the first operand in the first reservation stack.
3. The method of claim 2, wherein the original register module comprises a register file module and a source data cache module, the register file module being comprised of a plurality of independently addressed memory banks, the source data cache module being configured to cache data read from the register file module.
4. The method of claim 3, wherein the first operand includes at least two operands, and wherein reading, from the original register module, the first operand required by a thread in the first thread bundle whose state is valid to execute the first instruction comprises: in response to determining that the addresses of at least two of the first operands map to a same bank of the plurality of independently addressed memory banks, reading the required first operands from the source data cache module and the register file module, respectively; and in response to determining that the addresses of all of the first operands map to different banks of the plurality of independently addressed memory banks, reading the required first operands from the register file module.
5. The method of claim 1, wherein the operation result of the first operation includes at least one sub-operation result, each sub-operation result corresponding to a thread in the first thread bundle whose state is valid, and wherein writing, based on the correspondence recorded in the mapping table, the operation result output by the at least one first operation unit back to the target storage location corresponding to the thread in the first thread bundle whose state is valid before rearrangement comprises: searching the mapping table, based on the number after rearrangement of each thread in the first thread bundle whose state is valid, to determine the original number of that thread before rearrangement; determining, based on the original number, the target storage location to which each sub-operation result is to be written back; and writing each sub-operation result into the corresponding target storage location.
6. The method of any one of claims 1 to 5, wherein, prior to generating the rearranged first register data, the method further comprises: counting the bits in the first register data indicating that the thread state is valid to obtain a total number of valid threads; and determining, based on the total number of valid threads and a maximum thread load of a single operation unit, the number of the at least one first operation unit required to execute the first instruction, wherein the number of first operation units is associated with the ratio of the total number of valid threads to the maximum thread load of the single operation unit.
7. A GPU instruction execution apparatus, the apparatus comprising:
   a data acquisition module configured to acquire a first instruction to be executed by a first thread bundle and first register data, wherein the first register data indicates the state of each thread in the first thread bundle when executing the first instruction, and the bits of the first register data correspond one-to-one to the threads in the first thread bundle;
   a data arrangement module configured to traverse each bit of the first register data and rearrange the bits indicating that the thread state is valid to the front end, arranged contiguously, to generate rearranged first register data;
   a data storage module configured to store, based on the rearranged first register data, a first operand required by a thread in the first thread bundle whose state is valid to execute the first instruction into a first reservation stack;
   a mapping table establishing module configured to establish a mapping table, wherein the mapping table records the correspondence between the number of each thread in the first thread bundle after rearrangement and its original number before rearrangement;
   a first operation module configured to transmit the first operand in the first reservation stack to at least one first operation unit to perform a first operation corresponding to the first instruction;
   a result write-back module configured to write, based on the correspondence recorded in the mapping table, the operation result output by the at least one first operation unit back to a target storage location corresponding to a thread in the first thread bundle whose state is valid before rearrangement; and
   a second operation module configured to, while the first operation is being executed, in response to a second instruction having been acquired and a second operand required for executing the second instruction having been stored in a second reservation stack, transmit the second operand to at least one second operation unit to perform a second operation corresponding to the second instruction within the same clock cycle, wherein the at least one second operation unit includes an unoccupied operation unit.
8. A computer device, the computer device comprising: at least one processor; and a memory having a computer program stored thereon, wherein the computer program, when executed by the at least one processor, causes the at least one processor to perform the method of any of claims 1-6.
9. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, causes the processor to perform the method of any of claims 1-6.
10. A computer program product, characterized in that the computer program product comprises a computer program which, when executed by a processor, causes the processor to carry out the method according to any one of claims 1-6.
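
The unit-count rule of claim 6, the bank-conflict handling of claim 4, and the use of unoccupied operation units in claim 1 can be illustrated with a small software model. The following Python sketch is not part of the claims and makes several assumptions that the claims leave open: the "ratio" of claim 6 is rounded up, an address is mapped to a bank by a modulo operation, the first operand to reach a bank is read from the register file while later conflicting operands are read from the source data cache, and operation units are numbered contiguously. All function names are hypothetical.

```python
import math

def units_required(register_bits, max_threads_per_unit):
    """Count valid threads and derive how many operation units are needed.
    Rounding the ratio up is an assumption; the claim only says the number
    is 'associated with' the ratio."""
    valid_threads = sum(1 for bit in register_bits if bit)
    return math.ceil(valid_threads / max_threads_per_unit)

def choose_read_sources(addresses, num_banks):
    """Pick a read source per operand address.
    Assumed policy: bank = address % num_banks; the first operand that hits
    a bank uses the register file, later operands hitting the same bank are
    served from the source data cache."""
    busy_banks = set()
    sources = []
    for address in addresses:
        bank = address % num_banks
        if bank in busy_banks:
            sources.append("source_data_cache")
        else:
            busy_banks.add(bank)
            sources.append("register_file")
    return sources

def assign_units(total_units, first_needed, second_needed):
    """Give the first instruction the units it needs and offer the
    remaining, unoccupied units to the second instruction."""
    first_units = list(range(min(first_needed, total_units)))
    free_units = list(range(len(first_units), total_units))
    return first_units, free_units[:second_needed]

# Example: 3 valid threads out of 8, each unit handles at most 2 threads.
needed = units_required([0, 1, 0, 0, 1, 0, 1, 0], max_threads_per_unit=2)
first_units, second_units = assign_units(total_units=4, first_needed=needed, second_needed=1)
```

In this model, the units that the first instruction does not need after compaction remain available in the same cycle, which is the effect the claims attribute to the unoccupied operation unit.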

Description

GPU instruction execution method, device, computer equipment, medium and program product

Technical Field

The present disclosure relates to the field of computer technology, in particular to GPU instruction scheduling and execution, and more particularly to a GPU instruction execution method, apparatus, computer device, computer readable storage medium, and computer program product.

Background

Currently, mainstream GPUs are based on the SIMT (single instruction, multiple threads) architecture and use a thread bundle as the basic execution unit: the threads in one thread bundle execute the same instruction in the same clock cycle. When a program reaches branch logic, the GPU adopts a predicated execution mechanism: a predicate register marks each thread as valid or invalid, the operation results of valid threads are written back to the registers, and the operation results of invalid threads are discarded.

However, this mechanism filters invalid results only in the write-back phase and cannot prevent invalid threads from going through the complete instruction execution flow. Invalid threads still occupy the computing resources of the operation units and the storage resources of the reservation stack, which reduces the effective utilization of the operation units. Meanwhile, because all operation units serving the same thread bundle must execute the same instruction in the same clock cycle, when some threads are invalid, the operation units occupied by those threads cannot be used to execute other ready instructions, which reduces instruction issue efficiency.

Therefore, how to reduce the operation resources occupied by invalid threads and improve GPU instruction execution efficiency has become an important direction of technical research.
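
The cost of the predicated execution mechanism described in the Background can be illustrated with a small model. The following Python sketch is not taken from the disclosure; it assumes a hypothetical 8-thread bundle, and the names `predicated_execute`, `predicate`, and `operands` are invented for illustration. Every lane performs the operation, and the predicate is consulted only at write-back, so the share of useful work equals the share of valid threads.

```python
def predicated_execute(op, operands, predicate):
    """Baseline model: every lane computes; the predicate only filters
    results at write-back."""
    lanes_used = 0
    written = {}
    for lane, (a, b) in enumerate(operands):
        result = op(a, b)      # the lane occupies an operation unit regardless of its state
        lanes_used += 1
        if predicate[lane]:    # only valid threads write their result back
            written[lane] = result
    utilization = len(written) / lanes_used if lanes_used else 0.0
    return written, utilization

# Hypothetical 8-thread bundle in which only threads 1, 4 and 6 are valid.
predicate = [0, 1, 0, 0, 1, 0, 1, 0]
operands = [(i, 10) for i in range(8)]
written, utilization = predicated_execute(lambda a, b: a + b, operands, predicate)
```

With three valid threads out of eight, five of the eight lane-operations in this model produce results that are discarded.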
Disclosure of Invention

The present disclosure provides a GPU instruction execution method, apparatus, computer device, computer readable storage medium, and computer program product.

According to one aspect of the disclosure, a GPU instruction execution method is provided, which includes: acquiring a first instruction to be executed by a first thread bundle and first register data, wherein the first register data indicates the state of each thread in the first thread bundle when executing the first instruction, and the bits of the first register data correspond one-to-one to the threads in the first thread bundle; traversing each bit of the first register data, and rearranging the bits indicating that the thread state is valid to the front end, arranged contiguously, to generate rearranged first register data; storing, based on the rearranged first register data, a first operand required by a thread in the first thread bundle whose state is valid to execute the first instruction into a first reservation stack; establishing a mapping table, wherein the mapping table records the correspondence between the number of each thread in the first thread bundle after rearrangement and its original number before rearrangement; transmitting the first operand in the first reservation stack to at least one first operation unit to perform a first operation corresponding to the first instruction; writing, based on the correspondence recorded in the mapping table, the operation result output by the at least one first operation unit back to a target storage location corresponding to a thread in the first thread bundle whose state is valid before rearrangement; and, while the first operation is being executed, in response to a second instruction having been acquired and a second operand required for executing the second instruction having been stored in a second reservation stack, transmitting the second operand to at least one second operation unit to perform a second operation corresponding to the second instruction, wherein the at least one second operation unit includes an unoccupied operation unit.
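
The rearrangement, mapping table, and write-back steps of this aspect can be sketched in software as follows. This is a minimal Python model under stated assumptions, not the hardware implementation: the first register data is modeled as a list of 0/1 bits, the original register module as a list of operand pairs indexed by original thread number, and invalid threads are placed after the valid ones in the mapping table (the disclosure does not say how they are ordered). The function names are hypothetical.

```python
def rearrange_register_bits(register_bits):
    """Move the valid bits to the front and build the mapping table.
    mapping_table[new_number] gives the original thread number."""
    valid = [n for n, bit in enumerate(register_bits) if bit]
    invalid = [n for n, bit in enumerate(register_bits) if not bit]
    rearranged = [1] * len(valid) + [0] * len(invalid)
    mapping_table = valid + invalid   # ordering of invalid threads is an assumption
    return rearranged, mapping_table

def execute_compacted(op, register_module, register_bits):
    """Execute op only for valid threads and write results back by original number."""
    rearranged, mapping_table = rearrange_register_bits(register_bits)
    valid_count = sum(rearranged)
    # The reservation stack receives operands for valid threads only.
    reservation_stack = [register_module[mapping_table[new]] for new in range(valid_count)]
    results = [op(a, b) for a, b in reservation_stack]
    # Write-back: look up the original thread number for each rearranged slot.
    written = {mapping_table[new]: value for new, value in enumerate(results)}
    return rearranged, mapping_table, written

register_bits = [0, 1, 0, 0, 1, 0, 1, 0]
register_module = [(i, 10) for i in range(8)]
rearranged, mapping_table, written = execute_compacted(
    lambda a, b: a + b, register_module, register_bits
)
```

Because only three operand pairs enter the reservation stack in this example, the remaining capacity is what the method makes available to a second instruction.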
In some embodiments, storing, based on the rearranged first register data, the first operand required by a thread in the first thread bundle whose state is valid to execute the first instruction into the first reservation stack includes: reading, based on the rearranged first register data, the first operand required by the thread in the first thread bundle whose state is valid to execute the first instruction from an original register module; and storing the first operand in the first reservation stack.

In some embodiments, the original register module includes a register file module composed of a plurality of independently addressed memory banks and a source data cache module for caching data read from the register file module.

In some embodiments, the first operand includes at least two operands, and reading, from the original register module, the first operand required by the thread in the first thread bundle whose state is valid to execute the first instruction includes reading the required first operand from the source data cache module and the register file module, respectively, in response to determini