EP-4738108-A1 - PROCESSOR PERFORMANCE ACCELERATION USING HARDWARE-ENHANCED MULTIPLY-ACCUMULATE STREAMING


Abstract

Systems and methods are provided for processor performance acceleration using hardware-enhanced multiply-accumulate streaming. In examples, a dispatcher of a processor dispatches each of two or more multiply-accumulate ("MAC") or arithmetic logic unit ("ALU") instructions (and corresponding input data values), which are directed to a pipeline processing system and received in two or more consecutive clock cycles, to one of a set of input registers among a plurality of sets of input registers based on a sub-stream among a plurality of sub-streams, into which the two or more MAC or ALU instructions have been divided. The input data values for the plurality of sub-streams are processed by a MAC device or an ALU device in consecutive clock cycles, with output values from each sub-stream being stored in a sub-stream accumulator for that sub-stream, the accumulated value of which is added to a pipeline accumulator after all sub-streams have been processed.
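
As an illustration only, the data flow summarized in the abstract can be modeled in a few lines of Python. The function name `streamed_mac`, the round-robin assignment of inputs to sub-streams, and the use of plain floating-point values are assumptions made for this sketch; the publication does not prescribe them, and the model ignores timing entirely.

```python
from typing import List, Tuple


def streamed_mac(pairs: List[Tuple[float, float]], num_substreams: int) -> float:
    """Software model: split MAC inputs over sub-streams, then reduce the partial sums."""
    # One accumulator per sub-stream (hypothetical stand-in for the sub-stream accumulators).
    substream_acc = [0.0] * num_substreams

    for cycle, (a, b) in enumerate(pairs):
        s = cycle % num_substreams   # assumed round-robin choice of sub-stream
        substream_acc[s] += a * b    # multiply-accumulate into that sub-stream only

    # After the sub-streams finish, fold each partial sum into the pipeline accumulator.
    pipeline_acc = 0.0
    for partial in substream_acc:
        pipeline_acc += partial
    return pipeline_acc


if __name__ == "__main__":
    inputs = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0), (7.0, 8.0), (9.0, 10.0)]
    print(streamed_mac(inputs, num_substreams=3))
```

Because each sub-stream accumulates independently, the final value equals the ordinary dot product of the inputs for these exactly representable values; with general floating-point data the summation order differs from a single-accumulator loop, so results may differ in the last bits.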

Inventors

BHOJARAJA, Dushyanth
THAJUDEEN, Tariq Ahmed
LOU, Dennis Clayton
RODRIGUES, Pedro H. M.
HAN, Kyung-Nam
ALEXANDER, Khary Jason

Assignees

Microsoft Technology Licensing, LLC

Dates

Publication Date: 2026-05-06
Application Date: 2025-10-06

Claims (15)

A processor-implemented method (500), comprising: receiving (520), by a dispatcher of a processor, two or more first multiplier-accumulator "MAC" instructions in two or more consecutive clock cycles of the processor, each first MAC instruction being directed to a first pipeline processing system among one or more pipeline processing systems of the processor to process input data using a corresponding first MAC operation among two or more first MAC operations; in response to receiving the two or more first MAC instructions, dispatching (530), by the dispatcher, each of the two or more first MAC instructions and corresponding each of two or more sets of input data values to one of a set of input registers among a plurality of sets of input registers based on a sub-stream among a plurality of sub-streams, wherein a number of sets of the plurality of sets of input registers corresponds to a number of sub-streams for processing the two or more first MAC instructions, the two or more sets of input data values for the plurality of sub-streams being processed by a MAC device of a processing engine in consecutive clock cycles, wherein an output value from the MAC device corresponding to each sub-stream is stored in a MAC accumulator register for that sub-stream, among a plurality of MAC accumulators corresponding to the plurality of sub-streams; adding (540), by the first pipeline processing system, a MAC value stored in the MAC accumulator register corresponding to each sub-stream to an accumulated MAC value stored in a pipeline accumulator register as the MAC device completes MAC operations for that sub-stream; in response to receiving, in a clock cycle following receipt of the two or more first MAC instructions, one of a pipeline bubble or a second MAC instruction directed to a second pipeline processing system among the one or more pipeline processing systems, initiating (545), by the dispatcher, a pipeline complete phase in which subsequent MAC instructions that are received by the dispatcher are directed away from the first pipeline processing system, wherein the pipeline bubble corresponds to an absence of a MAC instruction; and after the MAC values corresponding to all of the plurality of sub-streams have been added to the accumulated MAC value stored in the pipeline accumulator register, outputting (550), by the first pipeline processing system, the accumulated MAC value.
The processor-implemented method of claim 1, further comprising: receiving, by the processor and from a compiler, machine code; and decoding, by the processor, the machine code into the two or more first MAC instructions.
The processor-implemented method of claim 1 or claim 2, wherein the MAC device is a vector MAC "VMAC" device, wherein the two or more first MAC instructions are two or more first VMAC instructions, wherein the two or more first MAC operations are two or more first VMAC operations, wherein the MAC value and the accumulated MAC value are VMAC values, wherein each MAC accumulator register is a VMAC accumulator register, wherein the VMAC device includes a single instruction multiple data "SIMD" engine having a width corresponding to a number of concurrent VMAC operations that can be processed at a time, wherein the method further comprises: dividing, by the dispatcher, the two or more first MAC instructions and corresponding two or more sets of input data values into the plurality of sub-streams based on a combination of: VMAC operational latency in terms of a number of clock cycles of the processor that is used to complete a single VMAC operation among the two or more first VMAC operations; and the width of the SIMD.
The processor-implemented method of claim 3, further comprising: determining, by the dispatcher, whether there are dependencies within any of the two or more VMAC operations corresponding to the two or more first VMAC instructions; wherein dividing the two or more first VMAC instructions into the plurality of sub-streams is further based on dependencies identified within first VMAC operations among the two or more first VMAC operations, with dependent first VMAC operations being dispatched to the same sub-stream.
The processor-implemented method of claim 3, wherein each set of input data values among the two or more sets of input data values includes a first input data value and a second input data value; wherein dispatching each of the two or more first VMAC instructions and corresponding each of the two or more sets of input data values to the one of the set of input registers comprises: sending, by the dispatcher, the first input data value to a first input register of a corresponding sub-stream for storage and sending the second input data value to a second input register of the corresponding sub-stream for storage; wherein the method further comprises: performing a processing cycle including processing of a set of VMAC operations for each sub-stream in turn, one sub-stream at a time, until all sub-streams in the plurality of sub-streams have each had one set of VMAC operations among a plurality of sets of VMAC operations for that sub-stream processed by the VMAC device; and repeating the processing cycle for a next set of VMAC operations for each sub-stream, until processing of the two or more VMAC instructions has completed; wherein processing of the set of VMAC operations for each sub-stream in each processing cycle comprises, for each VMAC operation among the set of VMAC operations: multiplying, using a multiplier of the VMAC device, the first input data value from the first input register corresponding to that sub-stream with the second input data value from the second input register corresponding to that sub-stream, to produce a resultant product value for that sub-stream; and adding, using an adder of the VMAC device, the resultant product value for that sub-stream to an accumulated value that is stored in the VMAC accumulator register corresponding to that sub-stream, to produce a resultant sum value for that sub-stream that is stored in the VMAC accumulator register; wherein the multiplying and adding of the other VMAC operations among the set of VMAC operations for that sub-stream in that processing cycle are performed concurrently; and after the plurality of sets of VMAC operations for each sub-stream have been processed, outputting, by the VMAC accumulator register for that sub-stream, the resultant sum value as the VMAC value for that sub-stream.
The processor-implemented method of claim 3, further comprising: performing a compound operation by processing a combination VMAC operation using the accumulated VMAC value from the first pipeline processing system as one of two or more inputs for the combination VMAC operation.
The processor-implemented method of claim 6, wherein each of the two or more VMAC operations includes one of a multiplication operation, a division operation, a sum operation, a subtraction operation, a squaring operation, or an inverse operation, wherein the combination VMAC operation includes one of a mean operation, a variance operation, a standard deviation operation, a square root operation, a SoftMax operation, or a LayerNorm operation.
A processor (102a) having hardware components comprising: a dispatcher (108) including a first state machine (110); and a first pipeline processing system (112) among one or more pipeline processing systems, including: a plurality of processing engines (114), each processing engine including a multiplier-accumulator "MAC" device, which includes a MAC accumulator register; and a pipeline accumulator register (128); wherein the dispatcher performs first operations based on logic of the first state machine, the first operations comprising: receiving (520) two or more first MAC instructions in two or more consecutive clock cycles of the processor, each first MAC instruction being directed to the first pipeline processing system to process input data using a corresponding first MAC operation among two or more first MAC operations; in response to receiving the two or more first MAC instructions, dividing (525) the two or more first MAC instructions and corresponding two or more sets of input data values into a plurality of sub-streams based on MAC operational latency in terms of a number of clock cycles of the processor that is used to complete a single MAC operation among the two or more first MAC operations; and dispatching (530) each of the two or more first MAC instructions and corresponding each of the two or more sets of input data values to one of a set of processing engines among the plurality of processing engines based on a sub-stream into which that first MAC instruction was divided, wherein a number of processing engines of the set of processing engines corresponds to a number of sub-streams into which the two or more first MAC instructions are divided, the two or more sets of input data values for the plurality of sub-streams being processed by the MAC devices of the set of processing engines in consecutive clock cycles, wherein an output value from the MAC device of each processing engine is stored in the MAC accumulator register for that processing engine; in response to receiving, in a clock cycle following receipt of the two or more first MAC instructions, one of a pipeline bubble or a second MAC instruction directed to a second pipeline processing system among the one or more pipeline processing systems, initiating (545) a pipeline complete phase in which subsequent MAC instructions that are received by the dispatcher are directed away from the first pipeline processing system, wherein the pipeline bubble corresponds to an absence of a MAC instruction; and wherein the first pipeline processing system performs second operations comprising: adding (540) a MAC value stored in the MAC accumulator register of each of the set of processing engines to an accumulated MAC value stored in the pipeline accumulator register as each of the set of processing engines completes its MAC operations; and after the MAC values from all of the set of processing engines have been added to the accumulated MAC value stored in the pipeline accumulator register, outputting (550) the accumulated MAC value.
The processor of claim 8, wherein the first operations further comprise: determining whether there are dependencies within any of the two or more MAC operations corresponding to the two or more first MAC instructions; wherein dividing the two or more first MAC instructions into the plurality of sub-streams is based on dependencies identified within first MAC operations among the two or more first MAC operations, with dependent first MAC operations being dispatched to the same sub-stream.
The processor of claim 8, wherein the two or more first MAC instructions are decoded from machine code that is received by the processor from a compiler.
The processor of claim 8, wherein the MAC device for each processing engine further includes: a first input register; and a second input register; wherein each set of input data values among the two or more sets of input data values includes a first input data value and a second input data value; wherein dispatching each of the two or more first MAC instructions and corresponding each of the two or more sets of input data values to the one of the set of processing engines comprises: sending the first input data value to the first input register of a corresponding MAC device of that processing engine for storage and sending the second input data value to the second input register of the corresponding MAC device for storage.
The processor of claim 11, wherein the MAC device for each processing engine further includes: a multiplier; and an adder; wherein each processing engine performs third operations, the third operations comprising: multiplying, using the multiplier, the first input data value from the first input register with the second input data value from the second input register, to produce a resultant product value; adding, using the adder, the resultant product value to an accumulated value that is stored in the MAC accumulator register, to produce a resultant sum value that is stored in the MAC accumulator register; repeating the multiplying and adding until all MAC instructions and corresponding sets of input data values that are dispatched to that processing engine have been processed; and outputting, by the MAC accumulator register, the resultant sum value as the MAC value.
The processor of claim 8, wherein the MAC device is a scalar MAC device, wherein each of the two or more first MAC operations is a scalar MAC operation, wherein the MAC value and the accumulated MAC value are scalar MAC values.
The processor of claim 8, wherein the MAC device is a vector MAC "VMAC" device, wherein the two or more first MAC instructions are two or more first VMAC instructions, wherein each of the two or more first MAC operations is a VMAC operation, wherein the MAC value and the accumulated MAC value are VMAC values.
A processor (102) having hardware components comprising: a dispatcher (108) including a first state machine (110); and a first pipeline processing system (112) among one or more pipeline processing systems, including: a plurality of processing engines (114), each processing engine including an arithmetic logic unit "ALU" device, which includes an ALU accumulator register; and a pipeline accumulator register (128); wherein the dispatcher performs first operations based on logic of the first state machine, the first operations comprising: receiving (720) two or more first ALU instructions in two or more consecutive clock cycles of the processor, each first ALU instruction being directed to the first pipeline processing system to process input data using a corresponding first ALU operation among two or more first ALU operations; in response to receiving the two or more first ALU instructions, dispatching (730) each of the two or more first ALU instructions and corresponding each of the two or more sets of input data values to one of a set of processing engines among a set of processing engines based on a sub-stream among a plurality of sub-streams, wherein a number of processing engines of the set of processing engines corresponds to a number of sub-streams that is used to process the two or more first ALU instructions, the two or more sets of input data values for the plurality of sub-streams being processed by the ALU devices of the set of processing engines in consecutive clock cycles, wherein an output value from the ALU device of each processing engine is stored in the ALU accumulator register for that processing engine; in response to receiving, in a clock cycle following receipt of the two or more first ALU instructions, one of a pipeline bubble or a second ALU instruction directed to a second pipeline processing system among the one or more pipeline processing systems, initiating (745) a pipeline complete phase in which subsequent ALU instructions that are received by the dispatcher are directed away from the first pipeline processing system, wherein the pipeline bubble corresponds to an absence of an ALU instruction; and wherein the first pipeline processing system performs second operations comprising: adding (740) an ALU value stored in the ALU accumulator register of each of the set of processing engines to an accumulated ALU value stored in the pipeline accumulator register as each of the set of processing engines completes its ALU operations; and after the ALU values from all of the set of processing engines have been added to the accumulated ALU value stored in the pipeline accumulator register, outputting (750) the accumulated ALU value.
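
The pipeline complete phase recited in the independent claims can be pictured with a small software model. The sketch below is an assumption-laden illustration, not the claimed hardware: the name `run_first_pipeline`, the tuple encoding of an instruction as `(target_pipeline_id, a, b)`, the use of `None` for a pipeline bubble, and the round-robin sub-stream choice are all hypothetical. It shows the dispatcher accepting instructions for the first pipeline until it sees a bubble or an instruction aimed at another pipeline, after which every later instruction is directed away while the partial sums are combined.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical encoding: (target_pipeline_id, a, b); None models a pipeline bubble.
Instr = Optional[Tuple[int, float, float]]


def run_first_pipeline(
    stream: Iterable[Instr], pipeline_id: int, num_substreams: int
) -> Tuple[float, List[Tuple[int, float, float]]]:
    """Return (accumulated value, instructions directed away from this pipeline)."""
    substream_acc = [0.0] * num_substreams
    redirected: List[Tuple[int, float, float]] = []
    issued = 0
    it = iter(stream)

    for instr in it:
        if instr is None or instr[0] != pipeline_id:
            # Bubble or instruction for another pipeline: start the complete phase.
            if instr is not None:
                redirected.append(instr)
            break
        _, a, b = instr
        substream_acc[issued % num_substreams] += a * b  # assumed round-robin
        issued += 1

    # During the complete phase, later instructions never enter this pipeline,
    # even if they name it as their target.
    redirected.extend(i for i in it if i is not None)

    # Combine the per-sub-stream values into the pipeline accumulator and output.
    return sum(substream_acc), redirected


if __name__ == "__main__":
    program = [(0, 1.0, 2.0), (0, 3.0, 4.0), (0, 5.0, 6.0), None, (0, 7.0, 8.0), (1, 1.0, 1.0)]
    print(run_first_pipeline(program, pipeline_id=0, num_substreams=2))
```

In this model the instruction `(0, 7.0, 8.0)` arrives after the bubble and is therefore redirected rather than accumulated, which mirrors the claim language that subsequent instructions are directed away from the first pipeline processing system once the complete phase has begun.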

Description

BACKGROUND

With the growing popularity and increasing use of artificial intelligence ("AI") systems (such as generative AI systems like large language models ("LLMs")), the number of AI and/or machine learning ("ML") tasks continues to increase exponentially. AI/ML tasks heavily employ multiply-accumulate ("MAC") operations and/or other arithmetic logic unit ("ALU") operations. As MAC and/or ALU operations increase in complexity with the growth of the generative AI systems, the number of clock cycles (or latency) for completing each MAC or ALU operation increases. Due to such latency, processors typically have to wait for completion of the MAC or ALU operation before processing the next MAC or ALU operation. Performance of the processor is thus impacted.

It is with respect to this general technical environment that aspects of the present disclosure are directed. In addition, although relatively specific problems have been discussed, it should be understood that the examples should not be limited to solving the specific problems identified in the background.

SUMMARY

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description section. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended as an aid in determining the scope of the claimed subject matter.

The currently disclosed technology, among other things, provides for processor performance acceleration using hardware-enhanced multiply-accumulate streaming. In examples, a dispatcher of a processor receives two or more first MAC or ALU instructions in two or more consecutive clock cycles of the processor.
Each first MAC or ALU instruction is directed to a first pipeline processing system among one or more pipeline processing systems of the processor to process input data using either a corresponding first MAC operation among two or more first MAC operations or a corresponding first ALU operation among two or more first ALU operations. The dispatcher dispatches each of the two or more first MAC or ALU instructions and corresponding each of two or more sets of input data values to one of a set of input registers among a plurality of sets of input registers based on a sub-stream among a plurality of sub-streams. In some cases, a number of sets of the plurality of sets of input registers corresponds to a number of sub-streams for processing the two or more first MAC or ALU instructions, the two or more sets of input data values for the plurality of sub-streams being processed by a MAC device or an ALU device of a processing engine in consecutive clock cycles. In some instances, an output value from the MAC device or the ALU device corresponding to each sub-stream is stored in an accumulator register for that sub-stream, among a plurality of accumulators corresponding to the plurality of sub-streams. In some examples, a sub-stream accumulated value that is stored in the accumulator register corresponding to each sub-stream is added to an accumulated value that is stored in a pipeline accumulator as the MAC device completes MAC operations or the ALU device completes ALU operations for that sub-stream. 
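
To see why separate per-sub-stream accumulators can hide operation latency, consider a deliberately simple cycle-count model. Everything here is an assumption for illustration: the function names, a fully pipelined unit that can issue at most one operation per cycle, the rule that an operation cannot issue until the previous operation on the same accumulator has completed, round-robin sub-stream assignment, and the omission of the final additions into the pipeline accumulator. The publication does not state these timing rules.

```python
def serial_cycles(n_ops: int, latency: int) -> int:
    """Single accumulator: every operation waits for the previous one to complete."""
    return n_ops * latency


def streamed_cycles(n_ops: int, latency: int, num_substreams: int) -> int:
    """Cycle at which the last operation completes when work is split over sub-streams."""
    ready = [0] * num_substreams  # cycle at which each sub-stream accumulator is free
    next_issue = 0                # earliest cycle at which the unit can issue again
    finish = 0
    for i in range(n_ops):
        s = i % num_substreams                 # assumed round-robin assignment
        issue = max(next_issue, ready[s])      # wait for the issue slot and the accumulator
        ready[s] = issue + latency             # accumulator is busy until the op completes
        finish = max(finish, ready[s])
        next_issue = issue + 1                 # at most one issue per cycle
    return finish


if __name__ == "__main__":
    for substreams in (1, 2, 4):
        print(substreams, streamed_cycles(16, latency=4, num_substreams=substreams))
```

Under these assumptions, one sub-stream reproduces the fully serialized case, and once the number of sub-streams reaches the operation latency the unit can issue an operation every cycle, so the total approaches the number of operations plus the latency of the last one.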
In response to receiving, in a clock cycle following receipt of the two or more first MAC/ALU instructions, one of a pipeline bubble that corresponds to an absence of a MAC/ALU instruction or a second MAC/ALU instruction directed to a second pipeline processing system among the one or more pipeline processing systems, the dispatcher initiates a pipeline complete phase in which subsequent MAC/ALU instructions that are received by the dispatcher are directed away from the first pipeline processing system. After the accumulated values corresponding to all of the plurality of sub-streams have been added to the accumulated value that is stored in the pipeline accumulator register, the first pipeline processing system outputs the accumulated value.

The details of one or more aspects are set forth in the accompanying drawings and description below. Other features and advantages will be apparent from a reading of the following detailed description and a review of the associated drawings. It is to be understood that the following detailed description is explanatory only and is not restrictive of the invention as claimed.

BRIEF DESCRIPTION OF THE DRAWINGS

A further understanding of the nature and advantages of particular embodiments may be realized by reference to the remaining portions of the specification and the drawings, which are incorporated in and constitute a part of this disclosure.

Figs. 1A and 1B depict example systems for implementing processor performance acceleration using hardware-enhanced multiply-accumulate streaming.

Figs. 2A and 2B depict example data flows that are each managed by a dispatcher(s) when implementing processor performance acceleration using hardware-enhanced multiply-accumulate streaming.

Fig.