JP-7856877-B2 - Arithmetic processing unit and arithmetic processing method

Inventors

五島正裕
葛毅

Assignees

富士通株式会社
大学共同利用機関法人情報・システム研究機構

Dates

Publication Date: 2026-05-12
Application Date: 2022-03-14

Claims (12)

An arithmetic processing device comprising: a queue that holds memory access instructions each including at least one address; a reduced address generation unit that, when a memory access instruction includes multiple addresses, including discontinuous addresses, generates a reduced address by reducing the bits of the multiple addresses; a collision determination unit that determines a collision between the reduced address and an address held in the queue; and an access control unit that controls the processing of the memory access instructions held in the queue based on the determination result of the collision determination unit.
The arithmetic processing device according to claim 1, wherein the reduced address generation unit stores the generated reduced addresses in the queue.
The arithmetic processing device according to claim 1 or claim 2, wherein the reduced address generation unit generates the reduced address by applying a reduction rule both to a plurality of addresses included in a memory access instruction and to a single address included in a memory access instruction.
The arithmetic processing device according to any one of claims 1 to 3, wherein the reduced address generation unit generates the reduced address for each of multiple address groups obtained by grouping the multiple addresses included in the memory access instruction, and the collision determination unit determines whether there is a collision between the reduced addresses of the multiple address groups and the addresses held in the queue.
The arithmetic processing device according to claim 4, wherein the reduced address generation unit includes a first reduced address generation unit that reduces the bits of the multiple addresses for each of multiple groups divided in a different number, or by a different grouping method, from the aforementioned multiple address groups, and a second reduced address generation unit that reduces the bits of the multiple addresses for each of the multiple address groups; the first reduced address generated by the first reduced address generation unit is stored in the queue; and the second reduced address generated by the second reduced address generation unit is output to the collision determination unit.
The arithmetic processing device according to any one of claims 1 to 3, wherein the reduced address generation unit generates a reduced address that indicates a range of the multiple addresses included in a memory access instruction, and the collision determination unit determines a collision between an address included in the range indicated by the reduced address and an address held in the queue.
The arithmetic processing device according to claim 6, wherein the reduced address generation unit generates a third reduced address by reducing the bits of the multiple addresses and a fourth reduced address that indicates a range of the multiple addresses, and holds one or both of the generated third and fourth reduced addresses in the queue; and the collision determination unit, if a third reduced address is held in the queue, determines a collision between the third reduced address held in the queue and the third reduced address generated by the reduced address generation unit, and, if a fourth reduced address is held in the queue, determines a collision between the fourth reduced address held in the queue and the fourth reduced address generated by the reduced address generation unit.
The arithmetic processing device according to any one of claims 2 to 7, wherein the reduced address generation unit generates a reduced address represented in ternary logic that indicates, for each bit position, whether the bit values of the plurality of addresses at that position are all "0", all "1", or indeterminate.
The arithmetic processing device according to claim 8, wherein the reduced address generation unit generates the reduced address by also making indeterminate the bits below a bit position that indicates indeterminate.
The arithmetic processing device according to claim 8 or 9, wherein the reduced address generation unit generates a reduced address comprising a key address indicating one of the plurality of addresses and a mask vector represented by the exclusive OR of the bit values at each bit position of the plurality of addresses.
The arithmetic processing device according to claim 10, wherein the collision determination unit has a plurality of collision detection circuits that each determine whether the reduced address collides with one of a plurality of addresses held in the queue; each of the collision detection circuits comprises an exclusive NOR (negative exclusive OR) circuit that calculates the bitwise exclusive NOR of the key address contained in the reduced address and the key address contained in the reduced address held in the queue, a first OR circuit that calculates the bitwise OR of the mask vector contained in the reduced address and the mask vector contained in the reduced address held in the queue, a second OR circuit that calculates the bitwise OR of the output of the exclusive NOR circuit and the output of the first OR circuit, and an AND circuit that calculates the logical AND of all bits of the output of the second OR circuit; and an address collision is detected when the output of the AND circuit is "1".
An arithmetic processing method for an arithmetic processing device having a queue that holds memory access instructions each including at least one address, the method comprising: generating, by a reduced address generation unit of the arithmetic processing device, a reduced address by reducing the bits of multiple addresses when a memory access instruction includes multiple addresses that are discontinuous; determining, by a collision determination unit of the arithmetic processing device, whether there is a collision between the reduced address and an address held in the queue; and controlling, by an access control unit of the arithmetic processing device, the processing of the memory access instructions held in the queue based on the determination result of the collision determination unit.
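
The key address, mask vector, and collision check described in claims 8 to 11 can be illustrated with a minimal Python sketch. It is not the claimed circuit. The names `ReducedAddress`, `reduce_addresses`, and `collides`, the 64-bit address width, the choice of the first address as the key address, and the reading of the mask vector as "1 wherever the addresses do not all agree" are assumptions made for illustration only.

```python
from dataclasses import dataclass

ADDR_BITS = 64                      # assumed address width
ALL_ONES = (1 << ADDR_BITS) - 1


@dataclass(frozen=True)
class ReducedAddress:
    key: int    # one of the original addresses (the "key address")
    mask: int   # 1 = indeterminate bit position, 0 = all addresses agree


def reduce_addresses(addresses, extend_low=False):
    """Collapse several addresses into one key address plus a mask vector."""
    key = addresses[0]              # assumption: the first address is the key
    mask = 0
    for a in addresses:
        mask |= a ^ key             # mark every bit position that differs
    if extend_low and mask:
        # Claim 9 style: bits below the highest indeterminate bit
        # are also treated as indeterminate.
        mask = (1 << mask.bit_length()) - 1
    return ReducedAddress(key & ALL_ONES, mask & ALL_ONES)


def collides(a, b):
    """Bitwise model of the XNOR, OR, OR, AND chain of claim 11."""
    xnor = ~(a.key ^ b.key) & ALL_ONES      # 1 where the key bits are equal
    indeterminate = a.mask | b.mask         # 1 where either side is unknown
    per_bit = xnor | indeterminate          # 1 where the bit cannot rule out a match
    return per_bit == ALL_ONES              # AND over all bit positions
```

With these assumptions, reducing `[0x1000, 0x1008, 0x1010]` gives key `0x1000` and mask `0x18`. The check is conservative: `0x1018` is reported as colliding with that set even though it is not a member, while `0x2000` and `0x1004` are not.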

Description

This invention relates to an arithmetic processing unit and an arithmetic processing method.

In arithmetic processing units with SIMD (Single Instruction Multiple Data) capability, processing performance is improved by executing an operation on multiple data elements in parallel. For example, the multiple data elements used in such an operation are read from memory in parallel by a vector load instruction. In other words, arithmetic processing units with SIMD capability have an architecture that makes data transfer efficient.

For this type of arithmetic processing unit, a method is known that manages address collisions by executing a check instruction that determines whether a memory address in an address hazard state exists during the execution of a vector operation (see, for example, Patent Document 1). A method is also known that, during the execution of a vector gather instruction, merges requests by determining the overlap of addresses within a single line and notifies the scalar arithmetic unit of the accumulated value of the overlap of addresses across multiple lines (see, for example, Patent Document 2). In addition, a method is known that holds a subsequent memory access instruction if an overlap is detected between the address range of a vector scatter instruction with a region specification and the address of the subsequent memory access instruction (see, for example, Patent Document 3).

Patent Document 1: Japanese Translation of PCT International Application Publication No. 2019-517060
Patent Document 2: Japanese Patent Publication No. 2020-52862
Patent Document 3: Japanese Patent Publication No. 2002-24205

Brief description of the drawings:
A block diagram showing an example of the main components of an arithmetic processing unit in one embodiment.
An explanatory diagram showing an example of the change in the state of the payload in Figure 1.
An explanatory diagram showing an example of how the reduced address generation unit of Figure 1 generates reduced addresses.
An explanatory diagram showing an example of an address range represented by the reduced address in Figure 3.
An explanatory diagram showing an example of the address determination operation by each match determination circuit in the match determination unit of Figure 1.
A diagram showing an example of another arithmetic processing unit.
A block diagram showing an example of an arithmetic processing unit in another embodiment.
An explanatory diagram showing an example of the payload of Figure 7 and an example of how the reduced address generation unit generates a reduced address.
A circuit diagram showing an example of a matching detection circuit of Figure 7.
A block diagram showing an example of the main components of an arithmetic processing unit in another embodiment.
A block diagram showing an example of the main components of an arithmetic processing unit in another embodiment.
A block diagram showing an example of the main components of an arithmetic processing unit in another embodiment.
An explanatory diagram showing an example of the operation of the matching detection circuit of Figure 12.

The embodiments will be described below with reference to the drawings.

Figure 1 shows an example of an arithmetic processing unit in one embodiment. The arithmetic processing unit 1 shown in Figure 1 is, for example, a processor such as a CPU (Central Processing Unit) capable of executing SIMD arithmetic instructions. The arithmetic processing unit 1 includes a load/store queue 2, an access control unit 8, and a data cache 9. The load/store queue 2 includes a reduced address generation unit 3, a payload 4, and a match determination unit 5.
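
How the three parts of the load/store queue 2 fit together can be sketched as a toy Python model. This is an illustrative assumption, not the embodiment itself: the class name `LoadStoreQueue`, its method names, the instruction labels, and the key/mask form of the reduced address are all hypothetical, and what the access control unit 8 does with a reported collision is not specified in this passage, so the sketch only returns the colliding entries to the caller.

```python
class LoadStoreQueue:
    """Toy model of load/store queue 2 (hypothetical names throughout)."""

    def __init__(self):
        # Stand-in for payload 4: entries of (label, key address, mask vector).
        self.payload = []

    @staticmethod
    def reduce(addresses):
        # Stand-in for reduced address generation unit 3.
        # Assumption: key = first address, mask = bits where addresses differ.
        key = addresses[0]
        mask = 0
        for a in addresses:
            mask |= a ^ key
        return key, mask

    def find_collisions(self, key, mask):
        # Stand-in for match determination unit 5: two reduced addresses
        # collide when they agree on every bit that neither side masks out.
        return [label for label, k, m in self.payload
                if ((key ^ k) & ~(mask | m)) == 0]

    def enqueue(self, label, addresses):
        # Check the new instruction against older entries, then store it.
        # The returned list is what an access control unit could act on.
        key, mask = self.reduce(addresses)
        hits = self.find_collisions(key, mask)
        self.payload.append((label, key, mask))
        return hits
```

For example, after enqueuing a store `"st0"` to `[0x100, 0x108]`, a load `"ld1"` from `[0x108]` reports `["st0"]`, whereas a load `"ld2"` from `[0x200]` reports no collision.
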
Figure 1 shows only some of the elements used for memory access. In practice, the arithmetic processing unit 1 may also include an instruction cache, an instruction decoder, a scheduler such as a reservation station, a register file, and arithmetic units, including one capable of executing SIMD arithmetic instructions (none of which are shown).

The arithmetic processing unit 1, which has a scheduler such as a reservation station, may execute instructions in an order different from the order in which the instruction decoder decoded them (that is, the order in which the instructions are written in the program). Therefore, a load/store queue 2 that detects address collisions is provided so that load and store instructions are committed in the correct order. Address collisions are explained with reference to Figure 2.

Load and store instructions include a single address or multiple addresses. When a memory access instruction MA, such as a load instruction or a store instruction, contains multiple addresses AD (AD0-AD7), the reduced address generation unit 3 generates a reduced address CAD by reducing the multiple addresses AD. For example, the reduced address generation unit 3 reduces the multiple addresses included in a vector load instruction or vector store instruction issued by the scheduler. For example, vector load and store instructions include consecutive address vector load and store instructions where addresses are consecutive in ascending or descending order, and stride vector load an