CN-115393171-B - Conflict detection and queuing method for grouped mapping of the shared register file of a unified-shader graphics processor
Abstract
The invention relates to a conflict detection and queuing method for grouped mapping of the shared register file of a unified-shader graphics processor. The method comprises: 1) dividing the physical storage unit into 8 banks, allocating to each Warp 16 read operand addresses and 8 write operand addresses for operand collection, and mapping the 16 read operand addresses and the 8 write operand addresses to the 8 banks after decoding; and 2) adopting a pipelined, hierarchical grouped address mapping method, in which the 16 read operand addresses and the 8 write operand addresses are divided into groups and mapped through a pipeline. The invention provides a conflict detection and queuing mechanism that resolves the Bank conflicts that may occur in actual transmission when the register file is accessed.
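Step 1) can be illustrated with a minimal Python sketch of bank decoding and conflict detection. It is not the patented circuit: the record does not say which address bits form the 3-bit Bank field, so taking the low 3 bits is an assumption, and the names `bank_of` and `detect_conflicts` are hypothetical.

```python
from collections import defaultdict

NUM_BANKS = 8  # from the abstract: the storage unit is divided into 8 banks


def bank_of(addr):
    """Return the bank an address maps to.

    Assumption: the 3-bit Bank field is the low 3 bits of the address.
    The source only states that the field is 3 bits wide.
    """
    return addr & (NUM_BANKS - 1)


def detect_conflicts(addrs):
    """Map each bank to the address numbers targeting it.

    Only banks requested by more than one address are returned,
    since those are the ones that conflict and must be queued.
    """
    per_bank = defaultdict(list)
    for i, addr in enumerate(addrs):
        per_bank[bank_of(addr)].append(i)
    return {bank: idx for bank, idx in per_bank.items() if len(idx) > 1}
```

For example, `detect_conflicts([0, 8, 1, 2])` reports that address numbers 0 and 1 both target bank 0, so one of them has to wait a beat.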
Inventors
- TIAN ZE
- WANG DANGHUI
- YUE CHEN
Assignees
- 西安翔腾微电子科技有限公司
Dates
- Publication Date
- 20260421
- Application Date
- 20220820
- Priority Date
- 20220820
Claims (3)
- 1. A conflict detection and queuing method for grouped mapping of the shared register file of a unified-shader graphics processor, characterized by comprising the following steps:
  1) The physical storage unit is divided into 8 banks; each Warp allocates 16 read operand addresses and 8 write operand addresses for operand collection, and the 16 read operand addresses and the 8 write operand addresses are mapped to the 8 banks after being decoded;
  2) A pipelined, hierarchical grouped address mapping method is adopted, in which the 16 read operand addresses and the 8 write operand addresses are grouped and mapped through a pipeline:
  2.1) The 16 read operand addresses and the 8 write operand addresses are grouped with 4 addresses per Group, and the priority of each stage of register group in the read and write mapping operations is specified;
  2.2) The mapping mode is analyzed according to the priority relation and a truth table is listed; the logic expression of the mapping mode is derived from the truth table, and the mapping combinational logic diagram of each stage is drawn;
  2.3) Based on the performance of the pipeline, the way the read and write valid signals are generated during write mapping is analyzed; a combinational logic formula is derived from the truth table of the read and write valid signals, and its combinational logic diagram is obtained;
  2.4) The read operation and the write operation are carried out separately; the read operation requires three-level grouped mapping, whereas the write operation requires only two-level grouped mapping;
  During the read operation in step 2), the flow is as follows:
  3.1) The 16 read operand addresses are divided into 4 groups: addr_0, addr_1, addr_8 and addr_9 are group_0; addr_2, addr_3, addr_10 and addr_11 are group_1; addr_4, addr_5, addr_12 and addr_13 are group_2; addr_6, addr_7, addr_14 and addr_15 are group_3;
  3.2) Each address is decoded by a decoding unit to obtain the Bank corresponding to the address and to generate a valid signal; the valid signals pass through a mapping module Allocate_Logic_0 and are sent to the level-0 read valid-information register group Reg_rd_level_0; the data stored in Reg_rd_level_0 pass through another mapping module Allocate_Logic_1 and are sent to the level-1 read valid-information register group Reg_rd_level_1; the data stored in Reg_rd_level_1 pass through the last mapping module Allocate_Logic_2 and are sent to the Bank to execute the read operation, and are simultaneously sent to the level-2 read valid-information register group Reg_rd_level_2 for reordering of the read operands;
  3.2.1) The 3-bit-wide Bank judgment bits in the 16 read operand addresses are decoded to obtain the Bank number corresponding to each address and to generate a valid signal valid_i_bj, where i denotes the address number and j denotes the Bank number;
  3.2.2) Each Group contains 4 valid_i_bj signals; whether a read address is valid is judged from the value of valid_i_bj, and the read address is mapped into the 4 registers of Reg_rd_level_0 with the corresponding valid signals generated; the smaller the address number, the higher the priority;
  3.2.3) The data stored in the register group Reg_rd_level_0 pass through another mapping module Allocate_Logic_level_1 and are sent to Reg_rd_level_1; Reg_rd_level_1 is divided into 2 groups, each comprising 8 registers of 4-bit width together with their valid signals, and stores the corresponding address information sent by the previous stage; each register of Reg_rd_level_0 has a corresponding valid bit: if the valid bit is 1, the information in the register is valid and must be mapped to Reg_rd_level_1; if the valid bit is 0, the information is invalid and is not mapped; when information in Reg_rd_level_0 is mapped to a register of Reg_rd_level_1, the valid bit of that register in Reg_rd_level_1 is set valid;
  3.2.4) The data in Reg_rd_level_1 pass through the last mapping module Allocate_Logic_level_2 and are sent to the Bank and to Reg_rd_level_2; the data sent to the Bank are 7 bits wide and serve as the access address for executing the read operation; this 7-bit data is the address offset decoded from the real address corresponding to the address information in Reg_rd_level_1, namely the physical Block number, rather than the 4-bit index information stored in the first two stages; the data sent to Reg_rd_level_2 are still the 4-bit index information representing the read address number and are used for read operand reordering;
  3.2.5) While Reg_rd_level_1 is being mapped to Reg_rd_level_2, the amount of remaining valid information in Reg_rd_level_1 for all 8 banks must be judged simultaneously to decide whether to suspend the pipeline or to send a new rdena signal; if the amount of remaining valid information in Reg_rd_level_1 for every one of the 8 banks is less than or equal to 2, an rdena signal is sent to the outside, indicating that in the next beat all the data in Reg_rd_level_1 will have been sent to the banks, the pipeline resumes operation, and a new group of read addresses must be sent to the system to start mapping; if the amount of remaining valid information in Reg_rd_level_1 for any one Bank is greater than 2, the pipeline remains suspended until the condition is met.
- 2. The conflict detection and queuing method for grouped mapping of the shared register file of a unified-shader graphics processor according to claim 1, wherein the specific flow during the write operation in step 2) is as follows:
  3.3) The 8 write operand addresses are divided into 2 groups: addr_0, addr_1, addr_4 and addr_5 are group_0; addr_2, addr_3, addr_6 and addr_7 are group_1;
  3.4) Each write address first passes through a decoding unit to obtain the Bank corresponding to the address and to generate a valid signal; the valid signals pass through a mapping module Allocate_Logic_0 and are sent to the level-0 write valid-information register group Reg_wr_level_0; the data stored in Reg_wr_level_0 pass through another mapping module Allocate_Logic_1 and are sent to the banks for the write operation; meanwhile, the amount of remaining address information in Reg_wr_level_0 for all 8 banks must be judged to decide whether the write pipeline is suspended or a new wrena signal is sent.
- 3. The conflict detection and queuing method for grouped mapping of the shared register file of a unified-shader graphics processor according to claim 2, wherein said step 3.4) comprises the following steps:
  3.4.1) The 3-bit-wide Bank judgment bits in the 8 write operand addresses are decoded to obtain the Bank corresponding to each address and to generate a valid signal valid_i_bj, where i denotes the address number and j denotes the Bank number; each address i has a connection to all banks; when the decoding result maps to a certain Bank, the valid_i_bj corresponding to that Bank is set to 1 and the valid_i_bj corresponding to the remaining 7 banks is set to 0; the mapping logic is the same as the mapping from address decoding information to Reg_rd_level_0 in the read operation;
  3.4.2) The write address index information stored in Reg_wr_level_0 passes through the second mapping module and is sent to a Bank to perform the write operation; the data sent to the Bank are 7 bits wide and serve as the access address for the write operation; this 7-bit data is the address offset obtained by decoding the real address corresponding to the index information in Reg_wr_level_0, namely the physical Block number, and is no longer the 4-bit index information stored in Reg_wr_level_0; each register of Reg_wr_level_0 has a corresponding Valid bit; if the Valid bit is 1, the information in the register must be mapped to the Bank to perform the write operation; the lower the register number in Reg_wr_level_0, the higher the priority of the real address corresponding to its index information, and when all registers of Reg_w_level_0 are sent to the Bank to the 1 to perform the write operation, and all registers of Reg_level_0 are sent to the 1;
  3.4.3) While Reg_wr_level_0 is being mapped to the banks for writing, the amount of remaining valid information in Reg_wr_level_0 for all 8 banks must be judged simultaneously to decide whether to suspend the pipeline or to send a new wrena signal; if the amount of remaining valid information in Reg_wr_level_0 for every one of the 8 banks is less than or equal to 1, a wrena signal is sent to the outside, indicating that in the next beat all the data in Reg_wr_level_0 will have been sent to the banks, the pipeline resumes operation, and a new group of write addresses must be sent to the system to start mapping; if the amount of remaining valid information in Reg_wr_level_0 for any one Bank is greater than 1, the pipeline remains suspended until the condition is met.
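The grouping rules of steps 3.1) and 3.3), the "smaller number wins" priority rule, and the rdena/wrena conditions of steps 3.2.5) and 3.4.3) can be sketched in Python as below. This is an illustrative software model under stated assumptions, not the combinational logic of the claims: the function names are hypothetical, the closed-form group formulas are merely one way to reproduce the listed groups, and the per-bank pending counts are assumed to be available as a plain list of 8 integers.

```python
NUM_BANKS = 8
RD_THRESHOLD = 2  # step 3.2.5: rdena when every bank holds <= 2 pending reads
WR_THRESHOLD = 1  # step 3.4.3: wrena when every bank holds <= 1 pending write


def read_group(i):
    """Group of read address addr_i (step 3.1): e.g. addr_0,1,8,9 -> group_0."""
    if not 0 <= i < 16:
        raise ValueError("read address number must be in 0..15")
    return (i % 8) // 2


def write_group(i):
    """Group of write address addr_i (step 3.3): addr_0,1,4,5 -> group_0."""
    if not 0 <= i < 8:
        raise ValueError("write address number must be in 0..7")
    return (i % 4) // 2


def first_valid(valid_bits):
    """Pick the highest-priority valid entry: the smallest index whose bit is set."""
    for idx, bit in enumerate(valid_bits):
        if bit:
            return idx
    return None  # nothing valid, so nothing to map this beat


def may_accept_new_addresses(pending_per_bank, threshold):
    """Model of the rdena/wrena condition.

    True when every bank has at most `threshold` pending entries, i.e. the
    pipeline may resume and a new group of addresses may be sent; False
    means the pipeline stays suspended.
    """
    if len(pending_per_bank) != NUM_BANKS:
        raise ValueError("expected one pending count per bank")
    return all(n <= threshold for n in pending_per_bank)
```

With these definitions, `may_accept_new_addresses([3, 0, 0, 0, 0, 0, 0, 0], RD_THRESHOLD)` is false, which corresponds to the read pipeline staying suspended because one Bank still holds more than 2 pending entries.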
Description
Conflict detection and queuing method for grouped mapping of the shared register file of a unified-shader graphics processor
Technical Field
The invention belongs to the technical field of unified-shader graphics processors, and particularly relates to a conflict detection and queuing mechanism for grouped mapping of a shared register file.
Background
As research and application areas such as data mining, machine learning, high-definition video and image processing, and big data have grown in popularity in recent years, the performance growth of the traditional CPU can no longer keep pace with the computing demands of these applications. In this context, a variety of computing accelerators have been proposed, including graphics processing units (GPUs) and field-programmable gate arrays (FPGAs). The GPU is the most widely used of these; for some specific applications, a GPU can achieve a speedup of several hundred times over a CPU. As computing demands and GPU thread-level parallelism have increased, GPUs have also come to be used for general-purpose computing, evolving into general-purpose graphics processors (General-Purpose Computing on Graphics Processing Units, GPGPU). Today's graphics processor is no longer merely a dedicated graphics acceleration chip; it is rather an SoC that performs computation through large-scale thread-level parallelism (TLP). By virtue of the GPU's massive thread-level parallelism, today's computers can directly perform massively parallel computations, which typically use all computation-related hardware together with suitable algorithms to accelerate the computation greatly.
The unified shader array is the computational core of the unified-shader graphics processor and occupies a considerable share of the graphics processor's layout area. The streaming multiprocessor (SM) is the core component with which a GPU of unified shader architecture performs texture processing, the shader core is the basic shading unit of the unified shader array, and the organization of the register file is an important part of streaming multiprocessor design. Registers are the most efficient storage components among the memories on the GPU and are organized in units of register files (RF). To reduce the cost of context switching, the GPU provides its stream processors with large-scale register file resources; GPUs of different compute capability have different numbers of register files on each streaming multiprocessor. The GPU's register file is much larger than its Cache, is built mainly from SRAM, and occupies a non-negligible area. A GPU contains many shader cores, each core has a large bit width, and the number and bit scale of the register files are considerable, so the management and use of the register files are of great significance to GPU performance. In a GPU, the shader cores in each streaming multiprocessor share the RF of that SM exclusively. The thread bundle (Warp) is the basic unit of GPU scheduling and execution; each Warp must be allocated its own dedicated architectural register file, indexed by Warp id, and each architectural register has a corresponding physical register allocated in the register file. Once allocated, a register is not released until the cooperative thread array to which the Warp belongs has finished executing.
Allocation and release management of the register file is therefore very important, and a conflict detection and resolution strategy for read and write addresses is indispensable.
Disclosure of Invention
To solve the technical problems described in the background, the invention provides a conflict detection and queuing method for grouped mapping of the shared register file of a unified-shader graphics processor. It provides a conflict detection and queuing mechanism that resolves the memory bank (Bank) conflicts that may occur in actual transmission when the register file is accessed. The conflict detection and queuing method provided by the invention is characterized by comprising the following steps: 1) The physical storage unit is divided into 8 banks, each Warp allocates 16 read operand addresses and 8 write operand addresses for operand collection, and the 16 read operand addresses and the 8 write operand addresses are mapped to the 8 banks after being decoded; 2) A pipelined, hierarchical grouped address mapping method is adopted, and the mapping is carried out by grouping 16 read operand addresses and 8 write operand addresse