US-20260127000-A1 - STREAMING ENGINE WITH EARLY EXIT FROM LOOP LEVELS SUPPORTING EARLY EXIT LOOPS AND IRREGULAR LOOPS
Abstract
A streaming engine employed in a digital data processor specifies a fixed read-only data stream defined by plural nested loops. An address generator produces addresses of data elements. A stream head register stores the data elements next to be supplied to functional units for use as operands. Upon a stream break instruction specifying one of the nested loops, the streaming engine ends a current iteration of that loop. If the specified loop was not the outermost loop, the streaming engine begins an iteration of the next outer loop. If the specified loop was the outermost nested loop, the streaming engine ends the stream. The streaming engine places a vector of data elements in order in lanes within the stream head register. A stream break instruction is effective upon a next vector boundary.
Inventors
- Joseph Zbiciak
Assignees
- TEXAS INSTRUMENTS INCORPORATED
Dates
- Publication Date
- 20260507
- Application Date
- 20251104
Claims (19)
- 1 .- 7 . (canceled)
- 8 . An electronic device comprising: a memory; and a memory controller coupled to the memory and configured to: responsive to a first instruction, begin fetching a set of data elements from the memory; and responsive to a second instruction received while the fetching of the set of data elements is occurring, skip the fetching of at least a portion of the set of data elements.
- 9 . The electronic device of claim 8 , wherein: the first instruction defines the set of data elements using a plurality of nested loops including an innermost loop level and an outermost loop level, wherein the nested loop in which the fetching is occurring is an active loop; and skipping the fetching of at least a portion of the set of data elements responsive to the second instruction comprises skipping the fetching of at least one data element corresponding to the active loop.
- 10 . The electronic device of claim 9 , wherein skipping the fetching of at least one data element corresponding to the active loop comprises skipping all remaining data elements of the active loop following an occurrence of a next vector boundary.
- 11 . The electronic device of claim 10 , wherein: the memory controller comprises a storage circuit configured to store the data elements fetched from the memory; the storage circuit is divided into a plurality of lanes; and the occurrence of the next vector boundary is when a particular number of lanes are filled.
- 12 . The electronic device of claim 11 , wherein the particular number of lanes is determined based on a vector length parameter associated with the first instruction.
- 13 . The electronic device of claim 11 , wherein the particular number of lanes is less than all of the plurality of lanes.
- 14 . The electronic device of claim 13 , wherein the memory controller is configured to fill remaining lanes beyond the particular number of lanes with a pad value and mark the remaining lanes invalid.
- 15 . The electronic device of claim 11 , comprising a processor having functional units, wherein the memory controller is configured to supply the data elements stored in the storage circuit to the functional units of the processor.
- 16 . The electronic device of claim 15 , wherein the memory is part of a hierarchical memory system that includes a level one (L1) cache and a level two (L2) cache, and wherein the memory is the L2 cache.
- 17 . The electronic device of claim 9 , comprising, when the active loop is not the outermost loop, after skipping the fetching of at least one data element corresponding to the active loop, resuming the fetching of the set of data elements by fetching data elements corresponding to a nested loop having a next outer loop level with respect to the active loop.
- 18 . The electronic device of claim 9 , wherein, when the active loop is the outermost loop, skipping the fetching of at least one data element corresponding to the active loop comprises skipping all remaining data elements of the set of data elements following a next vector boundary.
- 19 . A method comprising: receiving a first instruction to fetch a set of data elements from a memory; responsive to receiving the first instruction, using a memory controller to begin fetching the set of data elements from a memory; receiving a second instruction while the fetching of the set of data elements from the memory is occurring; and responsive to receiving the second instruction, skipping the fetching of at least a portion of the set of data elements.
- 20 . The method of claim 19 , wherein: the first instruction defines the set of data elements using a plurality of nested loops including an innermost loop level and an outermost loop level, wherein the nested loop in which the fetching is occurring is an active loop; and skipping the fetching of at least a portion of the set of data elements responsive to the second instruction comprises skipping the fetching of at least one data element corresponding to the active loop.
- 21 . The method of claim 20 , wherein skipping the fetching of at least one data element corresponding to the active loop comprises: detecting an occurrence of a next vector boundary; and skipping all remaining data elements of the active loop following the occurrence of the next vector boundary.
- 22 . The method of claim 21 , comprising: storing the data elements fetched from the memory into a storage circuit divided into a plurality of lanes; and detecting the occurrence of the next vector boundary comprises determining when a particular number of lanes are filled.
- 23 . The method of claim 22 , wherein the particular number of lanes is determined based on a vector length parameter associated with the first instruction.
- 24 . The method of claim 22 , wherein the particular number of lanes is less than all of the plurality of lanes.
- 25 . The method of claim 20 , comprising, when the active loop is not the outermost loop, after skipping the fetching of at least one data element corresponding to the active loop, resuming the fetching of the set of data elements by fetching data elements corresponding to a nested loop that has a next outer loop level with respect to the active loop.
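A minimal Python model may help illustrate the claimed behavior: data elements are packed into a fixed number of lanes, a break request received mid-vector is deferred to the next vector boundary (claims 10 and 21), and lanes beyond the vector length are filled with a pad value and marked invalid (claim 14). The lane count, pad value, and all function names below are illustrative assumptions, not taken from the claims.

```python
# Illustrative model of the claimed vector-boundary break behavior.
# LANES, PAD, and all names here are assumptions for this sketch only.
LANES = 4   # lanes in the storage circuit
PAD = 0     # pad value for lanes beyond the valid data

def fill_vector(elements, vector_len=LANES):
    """Pack up to vector_len elements into LANES lanes.

    Lanes past the valid data are filled with PAD and marked invalid,
    as in claim 14."""
    n = min(len(elements), vector_len)
    lanes = list(elements[:n]) + [PAD] * (LANES - n)
    valid = [i < n for i in range(LANES)]
    return lanes, valid

def fetch_with_break(data, break_at=None, vector_len=LANES):
    """Return (lanes, valid) vectors for one loop level.

    break_at models a second (break) instruction arriving after that
    many elements have been fetched; the break takes effect only at the
    next vector boundary, after which the remaining elements of the
    active loop are skipped (claims 10 and 21)."""
    vectors = []
    i = 0
    while i < len(data):
        vectors.append(fill_vector(data[i:i + vector_len], vector_len))
        i += vector_len
        if break_at is not None and break_at < i:
            break  # next vector boundary after the break request
    return vectors
```

For example, with ten elements and a break arriving after element 5, two full vectors (elements 0 through 7) are produced and elements 8 and 9 are skipped.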
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is a continuation of U.S. patent application Ser. No. 18/361,985, filed Jul. 31, 2023, which is a continuation of U.S. patent application Ser. No. 17/163,639, filed Feb. 1, 2021, now U.S. Pat. No. 11,714,646, which is a continuation of U.S. patent application Ser. No. 15/636,669, filed Jun. 29, 2017, now U.S. Pat. No. 10,908,901, each of which is incorporated by reference herein in its entirety. U.S. patent application Ser. No. 15/636,669 is an improvement over U.S. patent application Ser. No. 14/331,986, filed Jul. 15, 2014, now U.S. Pat. No. 9,606,803, entitled HIGHLY INTEGRATED SCALABLE, FLEXIBLE DSP MEGAMODULE ARCHITECTURE, which claims priority from U.S. Provisional Patent Application Ser. No. 61/846,148, filed Jul. 15, 2013.
TECHNICAL FIELD OF THE INVENTION
The technical field of this invention is digital data processing, and more specifically control of a streaming engine used for operand fetching.
BACKGROUND OF THE INVENTION
Modern digital signal processors (DSPs) face multiple challenges. Workloads continue to increase, requiring increasing bandwidth. Systems on a chip (SOC) continue to grow in size and complexity. Memory system latency severely impacts certain classes of algorithms. As transistors get smaller, memories and registers become less reliable. As software stacks get larger, the number of potential interactions and errors becomes larger.
Memory bandwidth and scheduling are problems for digital signal processors operating on real-time data. Digital signal processors operating on real-time data typically receive an input data stream, perform a filter function on the data stream (such as encoding or decoding), and output a transformed data stream. The system is called real-time because the application fails if the transformed data stream is not available for output when scheduled. Typical video encoding requires a predictable but non-sequential input data pattern.
Often the corresponding memory accesses are difficult to achieve within available address generation and memory access resources. A typical application requires memory accesses to load data registers in a data register file and then supply the data to functional units which perform the data processing.
SUMMARY OF THE INVENTION
This invention is a digital data processor having a streaming engine which recalls from memory a stream of an instruction-specified sequence of a predetermined number of data elements in plural nested loops for use in order by data processing functional units. A predetermined coding in an operand field of an instruction specifies stream data as an operand for that instruction. Each data element has a predetermined size and data type. Data elements are packed in lanes of the defined data width in a vector stream head register. The streaming engine ends data recall upon a stream end instruction or upon recall of all data elements in the stream.
A stream start instruction begins stream recall and specifies the parameters of the data stream. Each stream start instruction preferably specifies a number of enabled nested loops within a predetermined maximum. If a stream break instruction specifies a loop level greater than the number of enabled loops, the streaming engine ends the stream. Preferably the stream break instruction is effective upon a next vector boundary, when the stream head register lanes are filled. A vector length unit may limit lane use to less than the full data width of the stream head register. In that event, filling all the lanes of the stream head register occurs when all lanes within the vector length are filled.
Each stream start instruction preferably specifies either a transpose disabled mode or a transpose enabled mode. An address generator swaps the parameters for the innermost loop and the next innermost loop when transpose is enabled.
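The nested-loop address generation with early exit described in this summary can be sketched in Python as follows. Parameter names (`counts`, `strides`, `break_after`) and the convention that loop level 0 is the innermost enabled loop are illustrative assumptions; the vector-boundary deferral of the break is omitted for simplicity.

```python
# Sketch of nested-loop stream address generation with early exit.
# Loop level 0 is the innermost enabled loop; each level has an
# iteration count and a per-iteration address stride (assumed names).
def advance(idx, counts, start=0):
    """Odometer-style advance beginning at loop level `start`; return
    False when the outermost loop wraps (stream complete)."""
    for lvl in range(start, len(counts)):
        idx[lvl] += 1
        if idx[lvl] < counts[lvl]:
            return True
        idx[lvl] = 0
    return False

def stream_addresses(base, counts, strides, break_after=None):
    """Yield element addresses for the nested loops.

    break_after, if given, is (element_count, level): after that many
    elements, a stream break naming that loop level takes effect.
    Breaking the outermost loop ends the stream; otherwise the current
    iteration of the named loop ends and the next outer loop advances.
    """
    idx = [0] * len(counts)
    fetched = 0
    while True:
        yield base + sum(i * s for i, s in zip(idx, strides))
        fetched += 1
        start = 0
        if break_after is not None and fetched >= break_after[0]:
            level = break_after[1]
            for lvl in range(level + 1):
                idx[lvl] = 0        # end the current iteration of loop `level`
            start = level + 1       # resume at the next outer loop (or end)
            break_after = None
        if not advance(idx, counts, start):
            return
```

With `counts=[4, 3]` and `strides=[1, 10]`, a break of the inner loop after two elements skips the rest of that inner iteration and resumes at the next outer iteration; a break naming the outermost loop after two elements ends the stream entirely.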
In the same manner, the streaming engine swaps the parameters for the innermost loop and the next innermost loop when transpose is enabled. Thus when transpose is enabled, a stream break instruction specifying the innermost loop ends a current iteration of the next innermost nested loop. Also when transpose is enabled, a stream break instruction specifying the next innermost loop ends the current iteration of the innermost loop.
BRIEF DESCRIPTION OF THE DRAWINGS
These and other aspects of this invention are illustrated in the drawings, in which:
FIG. 1 illustrates a dual scalar/vector datapath processor according to one embodiment of this invention;
FIG. 2 illustrates the registers and functional units in the dual scalar/vector datapath processor illustrated in FIG. 1;
FIG. 3 illustrates a global scalar register file;
FIG. 4 illustrates a local scalar register file shared by arithmetic functional units;
FIG. 5 illustrates a local scalar register file shared by multiply functional units;
FIG. 6 illustrates a local scalar register file shared by load/store units;
FIG. 7 illustrates a global vector register file;
FIG.
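The transpose-enabled parameter swap and its effect on stream break targeting, described just before the drawings list, can be sketched as follows. Both function names and the level-numbering convention (level 0 innermost) are assumptions made for illustration.

```python
# Sketch of the transpose-enabled swap of the innermost and next
# innermost loop parameters; names are illustrative assumptions.
def effective_loop_params(counts, strides, transpose_enabled):
    """Return the loop parameters the address generator actually uses:
    with transpose enabled, the innermost (level 0) and next innermost
    (level 1) loop parameters are swapped."""
    counts, strides = list(counts), list(strides)
    if transpose_enabled:
        counts[0], counts[1] = counts[1], counts[0]
        strides[0], strides[1] = strides[1], strides[0]
    return counts, strides

def effective_break_level(level, transpose_enabled):
    """Map the loop level named by a stream break instruction to the
    loop whose iteration it actually ends: with transpose enabled,
    naming the innermost loop ends an iteration of the next innermost
    loop, and vice versa."""
    if transpose_enabled and level in (0, 1):
        return 1 - level
    return level
```

This mirrors the text above: the swap happens once, before address generation, so both the address generator and the stream break logic see the exchanged inner-loop parameters.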