
CN-122018983-A - General purpose processor instruction fetch apparatus and method

CN 122018983 A

Abstract

The invention discloses an instruction fetch apparatus and method for a general-purpose processor. The apparatus comprises an instruction fetch unit, an instruction decoding unit and an instruction dispatch unit, and further comprises a branch prediction unit, a decoupling queue, an instruction hardware prefetcher and a micro-operation cache. The branch prediction unit comprises two levels of predictors, each of which can predict branch jump addresses and directions. The decoupling queue connects the branch prediction unit and the instruction fetch unit and buffers the predicted branch jump addresses. The instruction hardware prefetcher is coupled with the branch prediction unit; it looks up an instruction prefetch table according to the predicted branch jump address, generates prefetch requests, and stores prefetched instructions into an instruction cache prefetch queue for the instruction fetch unit to fetch. The micro-operation cache is arranged downstream of the instruction decoding unit and stores decoded instructions together with their corresponding resource conflict information; the stored information can be read directly by the instruction dispatch unit.

Inventors

  • LI DONGSHENG
  • ZHANG XIRAN
  • WU YE
  • LI WEI

Assignees

  • 南京英麒智能科技有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-01-20

Claims (9)

  1. A general-purpose processor instruction fetch apparatus, comprising an instruction fetch unit, an instruction decoding unit and an instruction dispatch unit, characterized in that it further comprises: a branch prediction unit comprising two levels of predictors, each level capable of predicting branch jump addresses and directions; a decoupling queue connecting the branch prediction unit and the instruction fetch unit and buffering the predicted branch jump addresses; an instruction hardware prefetcher coupled with the branch prediction unit, for looking up an instruction prefetch table according to the predicted branch jump address, generating prefetch requests and storing prefetched instructions into an instruction cache prefetch queue for the instruction fetch unit to fetch; and a micro-operation cache arranged downstream of the instruction decoding unit, for storing decoded instructions and corresponding resource conflict information, the stored information being directly readable by the instruction dispatch unit.
  2. The apparatus of claim 1, wherein the branch prediction unit comprises a first-level predictor and a second-level predictor, the first-level predictor comprising a regular branch entry array and an indirect branch entry array, wherein the regular branch entry array stores information of branch instructions other than indirect branches, the branch instruction information comprising jump addresses and branch type attributes, and wherein the indirect branch entry array stores only type identification and index information of indirect branch instructions.
  3. The apparatus of claim 2, wherein the first-level predictor further comprises a return address stack implemented as a circular buffer for storing return addresses of function call instructions, and wherein, when the first-level predictor predicts a return-type branch and the return address stack is not empty, the target address at the top of the stack is selected as the prediction result and the top-of-stack pointer is updated.
  4. The apparatus of claim 1, wherein the instruction hardware prefetcher comprises an instruction prefetch table and an instruction prefetch controller, the instruction prefetch table comprising a plurality of entries, each entry comprising a plurality of prefetch target addresses and corresponding cache block flags.
  5. The apparatus of claim 4, wherein the instruction prefetch controller indexes the instruction prefetch table with the branch prediction address from the branch prediction unit to retrieve prefetch candidate addresses and send them to the instruction prefetch queue, discarding the most recently sent candidate address when the instruction prefetch queue is full, and sending a prefetch request to the next-level cache only when an instruction cache miss is detected and a miss status holding register is available.
  6. The apparatus of claim 4, wherein each entry of the instruction prefetch table supports associating a plurality of prefetch target addresses with the same trigger condition, and wherein a cache block flag bit indicates the number of cache blocks in the prefetch address space.
  7. The apparatus of claim 1, wherein the micro-operation cache comprises a tag array and a data array in a multi-way structure, the tag array storing entry tags and entry information, the entry information being used to index the data array, and the data array storing decoded instruction information and corresponding resource conflict information.
  8. The apparatus of claim 7, wherein each row of the data array includes a plurality of decoded instructions and their corresponding resource conflict information, the resource conflict information covering data hazards and execution-unit hardware resources, so that the instruction dispatch unit can read it directly and skip subsequent conflict detection logic.
  9. A method for fetching instructions in a general-purpose processor, comprising the steps of: S1, a branch prediction unit operates independently based on the current program counter, generates a predicted jump address and writes it into a decoupling queue; S2, the instruction fetch unit reads an instruction fetch target address from the decoupling queue and queries the micro-operation cache and the instruction cache in parallel according to that address; S3, if the micro-operation cache hits, the stored decoded instruction information and corresponding resource conflict information are sent to the instruction dispatch unit; if the micro-operation cache misses, the instruction cache data are processed by the instruction decoding unit and then sent to the instruction dispatch unit, while the decoded instruction information and corresponding resource conflict information are stored into the micro-operation cache; S4, after the branch prediction address is generated, the instruction hardware prefetcher actively queries the prefetch table, and initiates a prefetch request when an instruction cache miss is detected and a miss status holding register is available.
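The return address stack of claim 3 can be illustrated with a short sketch. The depth, the overflow policy (silently overwriting the oldest entry, as a circular buffer allows) and the method names are illustrative assumptions, not taken from the patent.

```python
class ReturnAddressStack:
    """Circular-buffer return address stack (sketch of claim 3)."""

    def __init__(self, depth=8):
        self.buf = [None] * depth
        self.top = 0      # index of the next free slot
        self.count = 0    # number of valid entries

    def push(self, return_addr):
        # On a predicted function call, record the fall-through address.
        self.buf[self.top] = return_addr
        self.top = (self.top + 1) % len(self.buf)
        # A full circular buffer overwrites its oldest entry.
        self.count = min(self.count + 1, len(self.buf))

    def predict_return(self):
        # On a predicted return-type branch: if the stack is not empty,
        # the top-of-stack address is the prediction and the pointer moves.
        if self.count == 0:
            return None
        self.top = (self.top - 1) % len(self.buf)
        self.count -= 1
        return self.buf[self.top]
```

The circular layout keeps push and pop O(1) with a single pointer update, matching the claim's "top of stack pointer is updated" behavior.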
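The prefetch-table lookup of claims 4 to 6 can be sketched as follows. The table layout (a dict keyed by trigger address), the 64-byte cache-block size, the queue depth and the drop-newest policy are all illustrative assumptions.

```python
from collections import deque  # the prefetch queue is modeled as a deque

def prefetch_on_prediction(pred_addr, prefetch_table, prefetch_q, max_q=8):
    """Sketch of claims 4-6: index the prefetch table with the predicted
    branch target; each entry holds several prefetch target addresses,
    each with a cache-block count flag."""
    entry = prefetch_table.get(pred_addr)
    if entry is None:
        return
    for target, n_blocks in entry:
        # The block-count flag expands one target into n consecutive
        # cache-block requests (64-byte line size assumed here).
        for i in range(n_blocks):
            if len(prefetch_q) >= max_q:
                return  # queue full: discard the newest candidates
            prefetch_q.append(target + 64 * i)
```

A predicted target that misses in the table simply generates no prefetch, which is what keeps this scheme from polluting the cache the way a blind sequential prefetcher would.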
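Steps S1 to S3 of the method claim can be sketched as one pass of a simplified front-end loop. All structures are plain Python stand-ins (dicts and a deque), and the helper names `predictor`, `decode` and `dispatch` are hypothetical, not the patent's.

```python
from collections import deque

def front_end_cycle(pc, predictor, decouple_q, uop_cache, icache,
                    decode, dispatch):
    """One simplified pass over steps S1-S3 of claim 9."""
    # S1: the branch predictor runs independently of fetch and pushes
    # predicted fetch targets into the decoupling queue.
    decouple_q.append(predictor(pc))

    # S2: the fetch unit pulls a target address from the decoupling queue
    # and probes the micro-op cache and the instruction cache in parallel.
    addr = decouple_q.popleft()

    # S3: on a micro-op cache hit, decoded uops (with their resource
    # conflict info) go straight to dispatch, skipping the decoder; on a
    # miss, the instruction cache line is decoded, dispatched, and filled
    # into the micro-op cache for future hits.
    if addr in uop_cache:
        uops = uop_cache[addr]
    else:
        uops = decode(icache[addr])
        uop_cache[addr] = uops
    dispatch(uops)
    return uops
```

Running the same address twice shows the point of the micro-operation cache: the decoder is exercised only on the first pass, and later passes bypass it entirely.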

Description

General purpose processor instruction fetch apparatus and method

Technical Field

The present invention relates to computer processors, and more particularly to a general-purpose processor instruction fetch apparatus and method.

Background

Modern out-of-order superscalar processor pipelines are generally more than 10 stages deep, with the front end serving as the core component of instruction supply; its efficiency directly determines overall processor performance. The front end performs instruction fetch, branch prediction, decode and dispatch, and must continuously provide a sufficient instruction stream so that the back-end execution units stay highly utilized. Traditional front-end designs usually adopt a coupled architecture: the branch predictor and the instruction fetch unit run synchronously, and the prediction result directly drives the fetch flow. Meanwhile, to improve prediction accuracy, mainstream schemes rely on a large-capacity Branch Target Buffer (BTB) and complex prediction algorithms, supplemented by an instruction prefetch mechanism based on historical access patterns. The decode stage uses multiple parallel decoders and decodes a fixed number of instructions per cycle. However, the prior art has structural defects. First, high-accuracy branch prediction relies on large storage arrays and complex computation logic, which significantly increases area and power consumption; prediction latency and accuracy are difficult to reconcile, and the misprediction rate is especially high for indirect jumps and branches with long history patterns. Second, the instruction prefetch mechanism lacks precise coordination with branch prediction, often causing cache pollution through blind prefetching, or reduced cache efficiency when a conservative policy fails to cover basic blocks spanning cache lines in time.
Third, the decoding unit is a front-end bandwidth bottleneck: its fixed throughput cannot match the back end's dynamic execution demand, and repeated decoding of the same instructions wastes computation. Fourth, the front-end sub-modules are tightly coupled; in the long-latency scenario of an instruction cache miss, branch prediction and prefetch logic are forced to stall, leaving speculative execution resources idle and incurring substantial power consumption under low load. These drawbacks are particularly prominent in RISC-architecture processors, which feature high instruction density and frequent branches, so front-end power and area overhead have become key bottlenecks limiting energy-efficiency improvement. Reducing hardware resource overhead and improving instruction supply efficiency through microarchitecture-level optimization, while guaranteeing performance, is a technical problem the industry still needs to solve.

Disclosure of Invention

The invention aims to provide a general-purpose processor instruction fetch apparatus and method that reduce hardware resource overhead and improve instruction supply efficiency while guaranteeing performance.
The general-purpose processor instruction fetch apparatus comprises an instruction fetch unit, an instruction decoding unit and an instruction dispatch unit, and further comprises: a branch prediction unit comprising two levels of predictors, each level capable of predicting branch jump addresses and directions; a decoupling queue connecting the branch prediction unit and the instruction fetch unit and buffering the predicted branch jump addresses; an instruction hardware prefetcher coupled with the branch prediction unit, which looks up an instruction prefetch table according to the predicted branch jump address, generates prefetch requests and stores prefetched instructions into an instruction cache prefetch queue for the instruction fetch unit to fetch; and a micro-operation cache arranged downstream of the instruction decoding unit, which stores decoded instructions and corresponding resource conflict information that the instruction dispatch unit can read directly. By systematically integrating these four modules (branch prediction unit, decoupling queue, instruction hardware prefetcher and micro-operation cache), an asynchronous coordination mechanism among the front-end sub-systems is constructed. The instruction hardware prefetcher couples directly to the branch prediction result and converts predicted addresses into precise prefetch requests, avoiding the blindness of conventional prefetchers that lack program-flow information, and can not only provide enough instructio