EP-3794443-B1 - SYSTEM AND METHOD OF LOADING AND REPLICATION OF SUB-VECTOR VALUES


Inventors

MAHURIN, ERIC
PLONDKE, ERICH
HOYLE, DAVID

Dates

Publication Date: 20260513
Application Date: 20190507

Claims (15)

A processor system (100) comprising a processor (101) coupled to a memory (102), the processor (101) comprising: a processor memory system including a second level cache (104) and a first level cache (106); a vector register file (173); a vector register (172) outside of the vector register file (173) and configured to load data from the second level cache (104) responsive to a special purpose load instruction (170) without using the first level cache (106) as an intermediate stage and without using the ports of the vector register file (173); replication circuitry (176) configured to, responsive to a vector instruction (184), replicate a selected sub-vector value from the vector register (172), and provide the replicated sub-vector values to inputs of vector multiply-accumulate circuitry (178); and an execution unit (116) configured to execute the special purpose load instruction (170).
The processor system (100) of claim 1, wherein the replication circuitry (176) includes a multiplexor having an input coupled to the vector register (172) and an output coupled to the vector multiply-accumulate circuitry (178), the multiplexor configured to select any sub-vector value from the vector register (172) and to output multiple copies of the selected sub-vector value to the vector multiply-accumulate circuitry (178).
The processor system (100) of claim 1, wherein the special purpose load instruction (170) is configured to cause loading of multiple scalar values in parallel from the second level cache (104) into the vector register (172) without transferring the multiple scalar values through the first level cache (106).
The processor system (100) of claim 1, further comprising a second vector register (174) included in the vector register file (173), wherein the vector instruction (184) corresponds to a vector multiply-accumulate instruction; and wherein the vector multiply-accumulate circuitry (178) is configured to perform, responsive to the vector instruction (184), a vector multiply-accumulate operation using replicated sub-vector values and using sub-vector values in the second vector register (174).
The processor system (100) of claim 1, wherein the vector register file (173) includes a second vector register (174) and a third vector register (180) which do not have the ability to replicate selected data.
The processor system (100) of claim 4, wherein the replication circuitry (176) is further configured to replicate a second sub-vector value from the vector register (172) in parallel with replicating the selected sub-vector value.
The processor system (100) of claim 6, wherein the vector multiply-accumulate circuitry (178) is configured to perform a second vector operation in parallel with performing the vector multiply-accumulate operation, the second vector operation using the second replicated sub-vector value.
The processor system (100) of claim 6, wherein the replication circuitry (176) is configured to apply an offset to a position in the vector register (172) of the selected sub-vector value to select a position in the vector register (172) of the second sub-vector value.
The processor system (100) of claim 8, wherein the position of the selected sub-vector value is indicated by a loop parameter of a convolutional filter operation.
A method of operating a processor system (100) comprising a processor (101), the method comprising: loading data from a second level cache (104) of the processor (101) into a vector register (172) outside of a vector register file (173) of the processor (101) responsive to a special purpose load instruction (170) without using a first level cache (106) of the processor (101) as an intermediate stage and without using the ports of the vector register file (173); and responsive to a vector instruction (184): replicating a selected sub-vector value from the vector register (172); and providing the replicated sub-vector values to inputs of vector multiply-accumulate circuitry (178).
The method of claim 10, further comprising accessing a scalar value from a scalar register (126), the scalar value indicating the selected sub-vector value.
The method of claim 10, wherein the loading comprises loading of multiple scalar values in parallel from the second level cache (104) into the vector register (172) without transferring the multiple scalar values through the first level cache (106).
The method of claim 10, further comprising, responsive to the vector instruction (184): performing a vector operation using the replicated sub-vector values and sub-vector values in a second vector register (174) included in the vector register file (173); and storing results of the vector operation into a third vector register (180) included in the vector register file (173).
The method of claim 13, further comprising replicating a second sub-vector value from the vector register (172) in parallel with replicating the selected sub-vector value.
A non-transitory computer-readable medium comprising instructions that, when executed by the processor (101) of claim 1, cause the processor (101) to perform the method according to any one of claims 10 to 14.
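
The data flow recited in claims 1, 4, 6, 8 and 11 can be illustrated with a small software model. This is only a sketch under stated assumptions, not the claimed hardware: the names `replicate`, `mac`, `special_reg`, `data_reg`, `index` and `offset` are hypothetical, the lane count is reduced to 4 for readability, each sub-vector value is modeled as four one-byte values, and the two replications that the claims describe as parallel are simply computed one after the other.

```python
LANES = 4  # hypothetical lane count, kept small for readability


def replicate(vreg, index, lanes=LANES):
    """Model of the replication step: pick one sub-vector value and copy it to every lane."""
    return [vreg[index]] * lanes


def mac(acc, data, weights):
    """Per-lane multiply-accumulate: acc[i] + dot(data[i], weights[i])."""
    return [a + sum(d * w for d, w in zip(dl, wl))
            for a, dl, wl in zip(acc, data, weights)]


# Stand-in for the vector register (172) that sits outside the register file.
special_reg = [(1, 0, 0, 0), (0, 1, 0, 0), (1, 1, 1, 1), (2, 2, 2, 2)]
# Stand-in for the second vector register (174) inside the register file.
data_reg = [(1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12), (13, 14, 15, 16)]
acc = [0] * LANES

index = 2   # assumed to come from a scalar register (claim 11)
offset = 1  # assumed offset used to locate the second sub-vector value (claim 8)

first = replicate(special_reg, index)
second = replicate(special_reg, index + offset)

out_first = mac(acc, data_reg, first)    # stand-in for the result register (180)
out_second = mac(acc, data_reg, second)  # second operation using the second value
```

In this model, changing `index` between iterations of a filter loop selects a different set of weights without reloading `special_reg`, which is the role claim 9 assigns to a loop parameter of a convolutional filter operation. How an out-of-range `index + offset` is handled is not stated in the claims and is left unmodeled here.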

Description

I. Field

The present disclosure is generally related to processors, and more specifically related to loading and replication of data for vector processing.

II. Description of Related Art

Advances in technology have resulted in more powerful computing devices. For example, computing devices such as laptop and desktop computers and servers, as well as wireless computing devices such as portable wireless telephones, have improved computing capabilities and are able to perform increasingly complex operations. Increased computing capabilities have also enhanced device capabilities in various other applications. For example, vehicles may include processing devices to enable global positioning system operations or other location operations, self-driving operations, interactive communication and entertainment operations, etc. Other examples include household appliances, security cameras, metering equipment, etc., that also incorporate computing devices to enable enhanced functionality, such as communication between internet-of-things (IoT) devices.

A computing device may include one or more digital signal processors (DSPs), image processors, or other processing devices that perform vector processing, which includes performing multiple instances of a common operation (e.g., a multiply operation) to process multiple elements of vector data in parallel. For example, a vector may include multiple sub-vector values (e.g., individual elements within the vector), such as 32 four-byte values. In an illustrative multiply operation, for each four-byte value of the vector, the first byte is multiplied by a first one-byte value, the second byte is multiplied by a second one-byte value, the third byte is multiplied by a third one-byte value, and the fourth byte is multiplied by a fourth one-byte value. The four multiplication products are added together and the resulting sum is added to a corresponding four-byte value in a destination vector register.
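
The illustrative multiply operation described above can be written out as a short software model. This is a sketch only: the function name `multiply_accumulate` and the sample values are hypothetical, bytes are treated as unsigned, and Python integers do not wrap, so any overflow behavior of a real four-byte accumulator is not modeled.

```python
LANES = 32          # four-byte values per vector, as in the example above
BYTES_PER_LANE = 4  # one-byte elements inside each four-byte value


def multiply_accumulate(dest, vector, coeffs):
    """For each lane i, return dest[i] + sum_k(vector[i][k] * coeffs[k])."""
    assert len(dest) == len(vector) == LANES
    assert len(coeffs) == BYTES_PER_LANE
    result = []
    for acc, lane in zip(dest, vector):
        # Four products per lane; the same four coefficients serve every lane.
        products = [byte * c for byte, c in zip(lane, coeffs)]
        result.append(acc + sum(products))
    return result


# Hypothetical sample data: lane i holds the bytes i, i+1, i+2, i+3.
vector = [[i, i + 1, i + 2, i + 3] for i in range(LANES)]
coeffs = [1, 2, 3, 4]   # the four one-byte values
dest = [10] * LANES     # prior contents of the destination vector register

out = multiply_accumulate(dest, vector, coeffs)
total_multiplications = LANES * BYTES_PER_LANE  # 32 lanes x 4 bytes = 128
```

The loop runs the lanes one after another, whereas the text describes hardware that performs all of the multiplications at once; the arithmetic per lane is the same.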
To enable all of the resulting 128 multiplications to be performed simultaneously, each of the four one-byte values is read from a scalar register and is replicated (e.g., multiple copies of the four one-byte values are output at substantially the same time, also referred to herein as "splat" or "broadcast") from the scalar register to inputs of vector multiplication circuitry. However, loading the four one-byte values into the scalar register can cause a processing bottleneck because the scalar register is loaded via conventional processor operations that involve multiple transfers of the data (e.g., loading the four-byte value from memory to a second-level (L2) cache, from the L2 cache to a first-level (L1) cache, and from the L1 cache to a scalar register in a register file).

US 2011/040822 A1 discloses mechanisms for performing a complex matrix multiplication operation. A vector load operation is performed to load a first vector operand of the complex matrix multiplication operation to a first target vector register. The first vector operand comprises a real and imaginary part of a first complex vector value. A complex load and splat operation is performed to load a second complex vector value of a second vector operand and replicate the second complex vector value within a second target vector register. The second complex vector value has a real and imaginary part. A cross multiply add operation is performed on elements of the first target vector register and elements of the second target vector register to generate a partial product of the complex matrix multiplication operation. The partial product is accumulated with other partial products and a resulting accumulated partial product is stored in a result vector register.

US 2013/339664 A1 discloses an apparatus that includes an execution unit to execute a first instruction and a second instruction.
The execution unit includes input register space to store a first data structure to be replicated when executing the first instruction and to store a second data structure to be replicated when executing the second instruction. The first and second data structures are both packed data structures. Data values of the first packed data structure are twice as large as data values of the second packed data structure. The first data structure is four times as large as the second data structure. The execution unit also includes replication logic circuitry to replicate the first data structure when executing the first instruction to create a first replication data structure, and to replicate the second data structure when executing the second instruction to create a second replication data structure.

US 2005/193050 A1 discloses obtaining partial products, to perform multiplication of matrices in a vector processing system, by dot multiplication of vector registers containing multiple copies of elements of a first matrix and vector registers containing values from rows of a second matrix. The dot products obtained from this dot multiplicat