CN-121996318-A - Load store circuit and graphics processor

Abstract

The present disclosure provides a load store circuit and a graphics processor, relating to the technical field of graphics processing. The load store circuit comprises an instruction scheduling module, an address generation module, and a data service module, and is further provided with a buffer module that comprises an operand buffer and a valid data buffer and caches the information required for address calculation and data access. The instruction scheduling module collects data access instructions and outputs instruction information; the address generation module calculates a target access address from the instruction information and the operands; and the data service module executes the corresponding data load or store operation on a target cache unit. Because the operands and the valid data are cached in advance, overflow and stalls caused by mismatched processing rates among the modules can be avoided, and the parallel processing capability of the pipeline is improved. Because address calculation and data access proceed as independent flows, the stability of access timing and the data processing efficiency can also be improved.
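The decoupling effect described in the abstract can be illustrated with a toy cycle-level model. This is only a sketch under stated assumptions, not the disclosed circuit: the function name `simulate`, the bursty arrival pattern, and the one-operand-per-cycle consumer rate are all hypothetical choices made for illustration.

```python
from collections import deque


def simulate(arrivals, capacity):
    """Toy model of an operand buffer between a producer and a consumer.

    arrivals[i] is the number of operands the producer offers in cycle i
    (hypothetical workload). The consumer, standing in for address
    generation, drains at most one operand per cycle (assumption).
    Returns (operands_consumed, producer_stall_cycles).
    """
    buf = deque()
    pending = 0    # operands the producer has not yet placed in the buffer
    consumed = 0
    stalls = 0
    cycle = 0
    while cycle < len(arrivals) or pending or buf:
        if cycle < len(arrivals):
            pending += arrivals[cycle]
        # Producer side: fill the buffer up to its capacity.
        while pending and len(buf) < capacity:
            buf.append(cycle)
            pending -= 1
        if pending:
            stalls += 1          # producer is blocked in this cycle
        # Consumer side: take one buffered operand per cycle.
        if buf:
            buf.popleft()
            consumed += 1
        cycle += 1
    return consumed, stalls


burst = [3, 0, 0, 3, 0, 0]
print(simulate(burst, capacity=1))  # small buffer: producer stalls
print(simulate(burst, capacity=3))  # buffer absorbs the burst
```

In this model, a buffer large enough to hold one burst removes every producer stall, while both configurations consume the same number of operands. Real capacities would depend on the actual module rates, which the abstract does not specify.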

Inventors

Request for anonymity

Assignees

摩尔线程智能科技(北京)股份有限公司

Dates

Publication Date: 2026-05-08
Application Date: 2025-12-19

Claims (17)

1. A load store circuit, characterized by comprising an instruction scheduling module, an address generation module, and a data service module connected in sequence, and further comprising a buffer module, wherein: the instruction scheduling module is configured to receive a data access instruction and collect instruction information corresponding to the data access instruction; the address generation module is configured to acquire, according to the instruction information, an operand corresponding to the data access instruction, and to perform address calculation based on the instruction information and the operand to obtain a target access address corresponding to the data access instruction; the data service module is configured to execute, in a target cache unit, an access operation on valid data corresponding to the target access address; and the buffer module comprises an operand buffer and a valid data buffer, wherein the operand buffer is configured to buffer operands to be input to the address generation module, and the valid data buffer is configured to buffer valid data to be input to the data service module.
2. The load store circuit of claim 1, wherein the address generation module comprises an address generation state machine configured to enter different states according to an instruction type of the data access instruction or a total amount of data to be processed; and the states include a first computing state for performing a first address calculation and a second computing state for performing an address increment calculation based on the first address calculation.
3. The load store circuit of claim 2, wherein performing, by the address generation module, address calculation based on the instruction information and the operand comprises: performing, by time-division multiplexing the same arithmetic logic unit, the first address calculation based on the instruction information and the operand in the first computing state, or the address increment calculation based on the instruction information and the operand in the second computing state.
4. The load store circuit of claim 2, wherein the states further comprise a wait state for waiting for the operand buffer to provide an operand; and the address generation state machine entering different states according to the instruction type of the data access instruction or the total amount of data to be processed comprises: entering the wait state according to a first instruction type, wherein the first instruction type comprises an operation instruction for executing a first preset operation; entering the first computing state according to a second instruction type, wherein the second instruction type comprises an operation multiplexing instruction for executing a second preset operation multiple times in parallel; and entering the second computing state when the total amount of data to be processed during execution of the first computing state is greater than a preset data amount threshold.
5. The load store circuit of claim 4, wherein the operation instructions comprise at least one of a load instruction, a store instruction, and an atomic operation instruction; the operation multiplexing instruction comprises a load pair instruction and/or a store pair instruction.
6. The load store circuit of claim 4, wherein the states further comprise a first idle state for waiting for an instruction trigger; and the address generation state machine entering different states according to the instruction type of the data access instruction or the total amount of data to be processed further comprises at least one of the following: entering the wait state from the first idle state according to the first instruction type, and entering the first idle state according to the readiness of the operand in the wait state; entering the first computing state from the first idle state according to the second instruction type; and, in the second computing state, executing one address increment calculation in each clock cycle and incrementing an internal loop count, and entering the first idle state when the internal loop count reaches the total number of transactions corresponding to the current instruction of the second instruction type.
7. The load store circuit of claim 1, wherein the data service module comprises a data service state machine for entering different states according to a processing flow of the data access instruction or a data amount of data to be processed.
8. The load store circuit of claim 7, wherein the states comprise an address send state for sending the target access address calculated by the address generation module to the target cache unit; and the data service state machine entering different states according to the processing flow of the data access instruction or the total amount of data to be processed comprises: in the address send state, sending a target access address in each clock cycle and incrementing an internal loop counter, and transitioning to a data send state when the value of the internal loop counter reaches the total number of transactions corresponding to the current instruction; and transitioning to a second idle state upon completion of sending all target access addresses in the address send state.
9. The load store circuit of claim 7, wherein the states comprise a data send state for performing a data load or store operation corresponding to a target access address; and the data service state machine entering different states according to the processing flow of the data access instruction or the total amount of data to be processed comprises: switching to a second idle state upon completion of the data access operation in the data send state.
10. The load store circuit of claim 7, wherein the states comprise a second idle state for waiting for an instruction trigger; and the data service state machine entering different states according to the processing flow of the data access instruction or the total amount of data to be processed comprises: switching to the second idle state according to a data access instruction received in the second idle state.
11. The load store circuit of claim 1, wherein the instruction scheduling module is further configured to: collect data block information corresponding to the data access instruction; perform arbitration scheduling on data access instructions from a plurality of parallel execution units within the load store circuit, so as to schedule, in a predetermined order, a single data access instruction from among a plurality of arriving data access instructions into the address generation module; and send a task identifier corresponding to the currently arbitrated and scheduled data access instruction to a central scheduler of the processor in which the load store circuit is located, so that the central scheduler coordinates the subsequent processing flows of the address generation module and the data service module based on the task identifier.
12. The load store circuit of any one of claims 1 to 11, wherein the buffer capacities of the operand buffer and the valid data buffer are configured to be target capacities.
13. The load store circuit of claim 1, wherein the data access instructions comprise store instructions, atomic operation instructions, and store pair instructions; and the data service module is further configured to: process the valid data for the atomic operation instructions and the store pair instructions using a data packing format different from that used for the store instructions.
14. The load store circuit of claim 1 or 13, wherein the data service module is further configured to: perform an address validity check after receiving the target access address, and trigger a discard mechanism when an out-of-range access is detected.
15. The load store circuit of claim 1 or 13, wherein the data service module is further configured to: send a de-dependency control signal to a common barrier module in the processor in which the load store circuit is located when a specific barrier synchronization instruction or an error barrier instruction is processed.
16. The load store circuit of claim 1 or 13, wherein the data service module is further configured to: fuse the target access address from the address generation module with the valid data from the valid data buffer to generate a unified access request, and then send the unified access request to the target cache unit.
17. A graphics processor, comprising: the load store circuit according to any one of claims 1 to 16.
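The data service state machine of claims 7 to 10 can be sketched in software as follows. This is a hedged illustration only, not the claimed hardware. The transition out of the idle state on instruction arrival, the `has_store_data` flag that chooses between the two exits of the address send state, and the single-cycle data send state are assumptions introduced here; the claims do not state them in this form.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()       # "second idle state": waits for an instruction trigger
    ADDR_SEND = auto()  # sends one target access address per clock cycle
    DATA_SEND = auto()  # performs the data load or store operation


def run_data_service(addresses, has_store_data):
    """Trace the states visited while serving one instruction.

    addresses: target access addresses, one per transaction.
    has_store_data: hypothetical flag. True routes ADDR_SEND to DATA_SEND;
    False returns straight to IDLE (an assumed reading of claim 8).
    Returns (state_trace, addresses_sent).
    """
    state = State.IDLE
    counter = 0        # internal loop counter described in claim 8
    sent = []          # addresses delivered to the target cache unit
    trace = []
    started = False
    while True:
        trace.append(state.name)
        if state is State.IDLE:
            if started or not addresses:
                break                       # finished, or nothing to do
            started = True
            state = State.ADDR_SEND         # assumed trigger: instruction received
        elif state is State.ADDR_SEND:
            sent.append(addresses[counter])
            counter += 1
            if counter == len(addresses):   # all transactions issued
                state = State.DATA_SEND if has_store_data else State.IDLE
        else:                               # State.DATA_SEND
            state = State.IDLE              # data access operation completed
    return trace, sent


trace, sent = run_data_service([0x100, 0x104], has_store_data=True)
print(trace)
```

Each loop iteration stands for one clock cycle, so the trace shows one `ADDR_SEND` entry per transaction before the machine moves on.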

Description

Load store circuit and graphics processor

Technical Field

The present disclosure relates to the field of graphics processing technology, and in particular to a load store circuit and a graphics processor.

Background

A graphics processing unit (GPU) serves as a core component for parallel computing, and its internal load store unit (LSU) carries out the critical task of data exchange. As computing tasks demand ever higher memory access throughput, optimizing the data processing efficiency of the load store unit has become a focus of attention in the art.

In current graphics processor architectures, load store units typically adopt a multi-stage pipeline design to process instructions in a pipelined manner. Such designs, however, delay the time at which data becomes ready, and the delay grows more pronounced as the number of parallel threads in the graphics processor increases.

Therefore, there is a strong need in the art for a load store unit design that effectively controls processing latency while guaranteeing functional integrity, and that significantly shortens instruction execution paths to meet the memory access performance requirements of current graphics processors.

It should be noted that the information disclosed in the above background section is only intended to enhance understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.

Disclosure of Invention

An object of the embodiments of the present disclosure is to provide a load store circuit and a graphics processor, so as to improve the parallel processing capability of the pipeline and to improve the stability of access timing and the data processing efficiency.
Other features and advantages of the present disclosure will be apparent from the following detailed description, or may be learned in part by practice of the disclosure.

According to a first aspect of the embodiments of the present disclosure, there is provided a load store circuit, including an instruction scheduling module, an address generation module, and a data service module connected in sequence, and further including a buffer module, where: the instruction scheduling module is configured to receive a data access instruction and collect instruction information corresponding to the data access instruction; the address generation module is configured to acquire, according to the instruction information, an operand corresponding to the data access instruction, and to perform address calculation based on the instruction information and the operand to obtain a target access address corresponding to the data access instruction; the data service module is configured to execute, in a target cache unit, an access operation on valid data corresponding to the target access address; and the buffer module comprises an operand buffer and a valid data buffer, where the operand buffer is configured to buffer operands to be input to the address generation module, and the valid data buffer is configured to buffer valid data to be input to the data service module.

In some example embodiments of the present disclosure, based on the foregoing solution, the address generation module includes an address generation state machine configured to enter different states according to an instruction type of the data access instruction or a total amount of data to be processed; the states include a first computing state for performing a first address calculation and a second computing state for performing an address increment calculation based on the first address calculation.
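As a rough software analogy for the two computing states, the following sketch derives a first address and then increments it until the requested amount of data is covered. The base-plus-offset form of the first calculation, the fixed per-transaction size, and the use of that size as the data amount threshold are assumptions for illustration; the disclosure does not fix these details. The single `alu_add` helper is a hypothetical stand-in for one arithmetic unit used by both states.

```python
def alu_add(a, b):
    """Hypothetical stand-in for the one adder used by both computing states."""
    return a + b


def generate_addresses(base, offset, total_bytes, bytes_per_txn):
    """Return the target access addresses for one instruction.

    First computing state: address = base + offset (assumed form).
    Second computing state: entered only while the total amount of data
    exceeds what has been covered so far; each step adds bytes_per_txn
    to the previous address (assumed stride).
    """
    if total_bytes <= 0 or bytes_per_txn <= 0:
        raise ValueError("sizes must be positive")
    # First computing state: one address calculation.
    addresses = [alu_add(base, offset)]
    covered = bytes_per_txn
    # Second computing state: one address increment per iteration.
    while covered < total_bytes:
        addresses.append(alu_add(addresses[-1], bytes_per_txn))
        covered += bytes_per_txn
    return addresses


print([hex(a) for a in generate_addresses(0x1000, 0x20, 64, 32)])
```

A request that fits in a single transaction never enters the increment loop, which mirrors the idea that the second computing state is reached only when the data amount exceeds a threshold.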
In some example embodiments of the present disclosure, based on the foregoing solution, performing, by the address generation module, address calculation based on the instruction information and the operand includes: performing, by time-division multiplexing the same arithmetic logic unit, the first address calculation based on the instruction information and the operand in the first computing state, or the address increment calculation based on the instruction information and the operand in the second computing state.

In some example embodiments of the present disclosure, based on the foregoing solution, the states further include a wait state for waiting for the operand buffer to provide an operand; and the address generation state machine entering different states according to the instruction type of the data access instruction or the total amount of data to be processed includes: entering the wait state according to a first instruction type, wherein the first instruction type comprises an operation instruction for executing a first preset operation