US-12619761-B2 - Instruction execution that broadcasts and masks data values at different levels of granularity
Abstract
An apparatus is described that includes an execution unit to execute a first instruction and a second instruction. The execution unit includes input register space to store a first data structure to be replicated when executing the first instruction and to store a second data structure to be replicated when executing the second instruction. The first and second data structures are both packed data structures. Data values of the first packed data structure are twice as large as data values of the second packed data structure. The execution unit also includes replication logic circuitry to replicate the first data structure when executing the first instruction to create a first replication data structure, and, to replicate the second data structure when executing the second instruction to create a second replication data structure. The execution unit also includes masking logic circuitry to mask the first replication data structure at a first granularity and mask the second replication data structure at a second granularity. The second granularity is twice as fine as the first granularity.
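The abstract's two-granularity behavior can be modeled with a short sketch. The following Python is illustrative only and not from the patent; the element values, the 512-bit result width, and the function name are assumptions. It shows why a mask over smaller elements is "twice as fine": twice as many mask bits govern the same result width.

```python
# Hypothetical model (not from the patent) of replicate-then-mask at two
# element granularities. Element values and the 512-bit width are assumed.

def replicate_and_mask(elements, elem_bits, result_bits=512, mask=None):
    """Replicate a packed structure to fill result_bits, then apply a
    per-element zero mask (bit i = 1 keeps element i, 0 zeroes it)."""
    count = result_bits // elem_bits      # elements in the result
    reps = count // len(elements)         # whole copies of the source
    replicated = elements * reps          # the replication data structure
    if mask is None:
        return replicated
    return [e if (mask >> i) & 1 else 0 for i, e in enumerate(replicated)]

# First instruction: 64-bit values, so the mask applies at a coarser
# 64-bit granularity (8 mask bits cover a 512-bit result).
r64 = replicate_and_mask([0xA, 0xB], elem_bits=64, mask=0b10101010)

# Second instruction: 32-bit values, so the mask is twice as fine
# (16 mask bits cover the same 512-bit result).
r32 = replicate_and_mask([1, 2, 3, 4], elem_bits=32, mask=0xFFFF)
```

In this model, halving the element size doubles the number of mask bits consumed, which is the sense in which the second granularity is twice as fine as the first.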
Inventors
- Elmoustapha Ould-Ahmed-Vall
- Robert Valentine
- Jesus Corbal
- Bret L. Toll
- Mark J. Charney
Assignees
- INTEL CORPORATION
Dates
- Publication Date
- 20260505
- Application Date
- 20241227
Claims (20)
- 1 . A processor core comprising: a decode unit to decode an instruction, the instruction having fields to specify a location in a memory of a Y*X-bit packed data structure having Y X-bit elements, having a field to specify a mask register as a source of a mask, and having a field to specify a destination vector register; and execution circuitry coupled with the decode unit, the execution circuitry to perform operations corresponding to the instruction, including to: load at least one X-bit element of the Y*X-bit packed data structure; generate a masked replication data structure from the Y*X-bit packed data structure based on applying the mask at an X-bit element granularity and based on using zeroed masking to zero masked out elements; and store a result including the masked replication data structure in the destination vector register, wherein a length of the masked replication data structure is a multiple of Y*X bits, and wherein the length of the masked replication data structure is same as a length of the destination vector register.
- 2 . The processor core of claim 1 , wherein X is either 32 or 64, and wherein Y*X is either 128 or 256.
- 3 . The processor core of claim 1 , wherein X is 32 and Y is 4.
- 4 . The processor core of claim 1 , wherein X is 32 and Y is 8.
- 5 . The processor core of claim 1 , wherein X is 64 and Y is 2.
- 6 . The processor core of claim 1 , wherein X is 64 and Y is 4.
- 7 . The processor core of claim 1 , wherein the instruction has fields to specify a base and an index corresponding to the location in the memory of the Y*X-bit packed data structure.
- 8 . The processor core of claim 1 , wherein the execution circuitry, to perform the operations corresponding to the instruction, is not to load a masked out element of the Y*X-bit packed data structure.
- 9 . The processor core of claim 1 , wherein the mask register is in a set of registers having a register that cannot be used as a mask.
- 10 . The processor core of claim 1 , wherein the processor core also allows using the mask register for merged masking in which masked out elements retain initial values they had prior to the merged masking.
- 11 . The processor core of claim 1 , wherein the mask register is one of a set of eight mask registers, and wherein the mask register is a 64-bit mask register.
- 12 . The processor core of claim 1 , wherein the execution circuitry includes: replication circuitry to replicate a data structure; and masking circuitry to apply a mask to the data structure.
- 13 . The processor core of claim 1 , wherein the processor core is a reduced instruction set computing (RISC) processor core.
- 14 . The processor core of claim 1 , wherein X is either 32 or 64, and wherein Y*X is either 128 or 256, wherein the instruction has fields to specify a base and an index corresponding to the location in the memory of the Y*X-bit packed data structure, wherein the execution circuitry, to perform the operations corresponding to the instruction, is not to load a masked out element of the Y*X-bit packed data structure, wherein the mask register is in a set of registers having a register that cannot be used as a mask, and wherein the mask register is a 64-bit mask register.
- 15 . A method comprising: decoding an instruction, the instruction having fields specifying a location in a memory of a Y*X-bit packed data structure having Y X-bit elements, having a field specifying a mask register as a source of a mask, and having a field specifying a destination vector register; and performing operations corresponding to the instruction, including: loading at least one X-bit element of the Y*X-bit packed data structure; generating a masked replication data structure from the Y*X-bit packed data structure based on applying the mask at an X-bit element granularity and based on using zeroed masking to zero masked out elements; and storing a result including the masked replication data structure in the destination vector register, wherein a length of the masked replication data structure is a multiple of Y*X bits, and wherein the length of the masked replication data structure is same as a length of the destination vector register.
- 16 . The method of claim 15 , wherein the instruction has fields to specify a base and an index corresponding to the location in the memory of the Y*X-bit packed data structure, and one of: X is 32 and Y is 4; X is 32 and Y is 8; X is 64 and Y is 2; or X is 64 and Y is 4.
- 17 . The method of claim 16 , wherein to perform the operations corresponding to the instruction includes not loading a masked out element of the Y*X-bit packed data structure.
- 18 . The method of claim 17 , wherein the mask register is one of a set of eight mask registers, and wherein the mask register is a 64-bit mask register.
- 19 . A system comprising: a dynamic random access memory (DRAM); and a processor coupled with the DRAM, the processor comprising: a decode unit to decode an instruction, the instruction having fields to specify a location in a memory of a Y*X-bit packed data structure having Y X-bit elements, having a field to specify a mask register as a source of a mask, and having a field to specify a destination vector register; and execution circuitry coupled with the decode unit, the execution circuitry to perform operations corresponding to the instruction, including to: load at least one X-bit element of the Y*X-bit packed data structure; generate a masked replication data structure from the Y*X-bit packed data structure based on applying the mask at an X-bit element granularity and based on using zeroed masking to zero masked out elements; and store a result including the masked replication data structure in the destination vector register, wherein a length of the masked replication data structure is a multiple of Y*X bits, and wherein the length of the masked replication data structure is same as a length of the destination vector register.
- 20 . The system of claim 19 , further comprising a mass storage device, wherein X is either 32 or 64, and wherein Y*X is either 128 or 256, wherein the instruction has fields to specify a base and an index corresponding to the location in the memory of the Y*X-bit packed data structure, and wherein the execution circuitry, to perform the operations corresponding to the instruction, is not to load a masked out element of the Y*X-bit packed data structure.
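The masking semantics in the claims above can be illustrated with a small sketch. The Python below is a hypothetical model, not from the patent: the function name, lane values, and 8-lane destination are assumptions. It contrasts the zeroed masking of claim 1 (masked-out elements become zero) with the merged masking of claim 10 (masked-out elements retain their initial destination values).

```python
# Illustrative model (not from the patent) of the masked replication in
# claims 1 and 10. Values and the 8-lane destination are assumptions.

def masked_broadcast(src, dest, mask, zeroing=True):
    """Replicate src (a Y-element packed structure) to the length of dest,
    then apply the per-element mask (bit i = 1 takes the broadcast element)."""
    reps = len(dest) // len(src)          # result length is a multiple of Y*X
    broadcast = src * reps                # the replication data structure
    out = []
    for i, value in enumerate(broadcast):
        if (mask >> i) & 1:
            out.append(value)             # element passes through the mask
        elif zeroing:
            out.append(0)                 # zeroed masking (claim 1)
        else:
            out.append(dest[i])           # merged masking keeps the initial value (claim 10)
    return out

src = [11, 22]                            # e.g., Y=2 elements of X=64 bits
dest = [1, 2, 3, 4, 5, 6, 7, 8]           # destination vector register as 8 lanes
z = masked_broadcast(src, dest, mask=0b00001111, zeroing=True)
m = masked_broadcast(src, dest, mask=0b00001111, zeroing=False)
```

Claim 8's refinement, not modeled here, is that hardware need not even load a source element whose result lanes are all masked out.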
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
The present application is a continuation application claiming priority from U.S. patent application Ser. No. 18/357,066 filed Jul. 21, 2023, which is a continuation application claiming priority from U.S. patent application Ser. No. 17/677,958 filed Feb. 22, 2022, now U.S. Pat. No. 11,709,961, which is a continuation application claiming priority from U.S. patent application Ser. No. 16/730,844 filed Dec. 30, 2019, now U.S. Pat. No. 11,301,581, which is a continuation application claiming priority from U.S. patent application Ser. No. 16/141,283 filed Sep. 25, 2018, now U.S. Pat. No. 10,909,259, which is a continuation application claiming priority from U.S. patent application Ser. No. 15/245,113 filed Aug. 23, 2016, now U.S. Pat. No. 10,083,316, which is a continuation application claiming priority from U.S. patent application Ser. No. 13/976,433 filed Jun. 26, 2013, now U.S. Pat. No. 9,424,327, which is a U.S. National Phase Application under 35 U.S.C. § 371 of International Application No. PCT/US2011/067095 filed Dec. 23, 2011, all of which are incorporated herein by reference in their entirety.
FIELD OF INVENTION
The present invention pertains to the computing sciences generally, and, more specifically, to an instruction execution that broadcasts and masks data values at different levels of granularity.
BACKGROUND
FIG. 1 shows a high level diagram of a processing core 100 implemented with logic circuitry on a semiconductor chip. The processing core includes a pipeline 101. The pipeline consists of multiple stages, each designed to perform a specific step in the multi-step process needed to fully execute a program code instruction. These typically include at least: 1) instruction fetch and decode; 2) data fetch; 3) execution; and 4) write-back.
The execution stage performs a specific operation identified by an instruction that was fetched and decoded in prior stage(s) (e.g., in step 1) above) upon data identified by the same instruction and fetched in another prior stage (e.g., step 2) above). The data that is operated upon is typically fetched from (general purpose) register storage space 102. New data that is created at the completion of the operation is also typically “written back” to register storage space (e.g., at stage 4) above).
The logic circuitry associated with the execution stage is typically composed of multiple “execution units” or “functional units” 103_1 to 103_N, each designed to perform its own unique subset of operations (e.g., a first functional unit performs integer math operations, a second functional unit performs floating point operations, a third functional unit performs load/store operations from/to cache/memory, etc.). The collection of all operations performed by all the functional units corresponds to the “instruction set” supported by the processing core 100.
Two types of processor architectures are widely recognized in the field of computer science: “scalar” and “vector”. A scalar processor is designed to execute instructions that perform operations on a single set of data, whereas a vector processor is designed to execute instructions that perform operations on multiple sets of data. FIGS. 2A and 2B present a comparative example that demonstrates the basic difference between a scalar processor and a vector processor. FIG. 2A shows an example of a scalar AND instruction in which a single operand set, A and B, is ANDed together to produce a singular (or “scalar”) result C (i.e., A.AND.B=C). By contrast, FIG. 2B shows an example of a vector AND instruction in which two operand sets, A/B and D/E, are respectively ANDed together in parallel to simultaneously produce a vector result C, F (i.e., A.AND.B=C and D.AND.E=F).
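The scalar-versus-vector contrast of FIGS. 2A and 2B can be sketched in a few lines. The Python below is an illustrative assumption, not from the patent; the operand values are made up, and the "parallel" element-wise operation is modeled sequentially.

```python
# Minimal sketch (assumed, not from the patent) contrasting the scalar AND
# of FIG. 2A with the element-wise vector AND of FIG. 2B.

def scalar_and(a, b):
    # one operand set -> one scalar result (A.AND.B = C)
    return a & b

def vector_and(va, vb):
    # each operand pair is ANDed conceptually in parallel
    # (A.AND.B = C and D.AND.E = F); modeled here element-wise
    return [a & b for a, b in zip(va, vb)]

c = scalar_and(0b1100, 0b1010)                        # scalar result C
c_f = vector_and([0b1100, 0b0110], [0b1010, 0b0011])  # vector result C, F
```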
As a matter of terminology, a “vector” is a data element having multiple “elements”. For example, a vector V=Q, R, S, T, U has five different elements: Q, R, S, T and U. The “size” of the exemplary vector V is five (because it has five elements).
FIG. 1 also shows the presence of vector register space 107 that is different than general purpose register space 102. Specifically, general purpose register space 102 is nominally used to store scalar values. As such, when any of the execution units perform scalar operations they nominally use operands called from (and write results back to) general purpose register storage space 102. By contrast, when any of the execution units perform vector operations they nominally use operands called from (and write results back to) vector register space 107. Different regions of memory may likewise be allocated for the storage of scalar values and vector values.
Note also the presence of masking logic 104_1 to 104_N and 105_1 to 105_N at the respective inputs to and outputs from the functional units 103_1 to 103_N. In various implementations, only one of these layers is actually implemented, although that is not a strict requirement. For any instruction that employs masking, input masking