
CN-115904501-B - Stream engine with selectable multi-dimensional circular addressing in each dimension


Abstract

The present invention relates to a stream engine with selectable multi-dimensional circular addressing in each dimension. A stream engine (125, 2700) for use in a digital data processor specifies a fixed read-only data stream defined by a plurality of nested loops. An address generator (1901) generates the addresses of the data elements of the nested loops. A stream head register (2718, 2728) stores data elements that are subsequently provided to a functional unit for use as operands. A stream template register (2800) independently specifies a linear or circular addressing pattern for each of the nested loops.

Inventors

  • J. Zbiciak

Assignees

  • Texas Instruments Incorporated

Dates

Publication Date
2026-05-12
Application Date
2017-12-20
Priority Date
2016-12-20

Claims (20)

  1. A method, comprising: reading a series of data elements from a memory using a plurality of nested loops, wherein for each nested loop the reading comprises: receiving parameters including a first circular addressing block size parameter and a second circular addressing block size parameter; determining a respective addressing mode selected as one of a linear addressing mode, a first circular addressing mode having a first circular block size determined based on the first circular addressing block size parameter but not the second circular addressing block size parameter, or a second circular addressing mode having a second circular block size determined based on both the first circular addressing block size parameter and the second circular addressing block size parameter; and reading data elements associated with the nested loop using the respective addressing mode; and outputting the series of data elements to a processing core.
  2. The method of claim 1, wherein determining the respective addressing mode of each nested loop comprises reading a plurality of fields from a stream definition template associated with the series of data elements, wherein each of the plurality of fields specifies the addressing mode of a respective one of the nested loops.
  3. The method of claim 2, wherein the stream definition template is stored in a register.
  4. The method of claim 2, wherein the stream definition template comprises a first circular block size field containing the first circular addressing block size parameter (CBK0) and a second circular block size field containing the second circular addressing block size parameter (CBK1).
  5. The method of claim 4, wherein the second circular block size is determined based on a sum of CBK0 and CBK1.
  6. The method of claim 4, wherein the second circular block size is determined to be equal to CBK0+CBK1+1.
  7. The method of claim 1, wherein: the first circular block size is selected from block sizes of 512 bytes (B) to 512 kilobytes (KB); and the second circular block size is selected from block sizes of 1 megabyte (MB) to 64 gigabytes (GB).
  8. The method of claim 1, wherein: the first circular block size is selected from block sizes of 512 B, 1 kilobyte (KB), 2 KB, 4 KB, 8 KB, 16 KB, 32 KB, 64 KB, 128 KB, 256 KB, 512 KB, 1 megabyte (MB), 2 MB, 4 MB, 8 MB, and 16 MB; and the second circular block size is selected from block sizes of 1 KB, 2 KB, 4 KB, 8 KB, 16 KB, 32 KB, 64 KB, 128 KB, 256 KB, 512 KB, 1 MB, 2 MB, 4 MB, 8 MB, 16 MB, 32 MB, 64 MB, 128 MB, 256 MB, 512 MB, 1 gigabyte (GB), 2 GB, 4 GB, 8 GB, 16 GB, 32 GB, and 64 GB.
  9. A data processing apparatus, comprising: a processing core; a memory; a register configured to store a stream definition template comprising a plurality of addressing mode fields; and a stream engine unit coupled to the processing core and the memory and configured to receive a plurality of data elements from the memory using a plurality of nested loops and to provide the plurality of data elements as a data stream to the processing core, wherein for each nested loop, receiving the plurality of data elements from the memory comprises: receiving parameters including a first circular addressing block size parameter and a second circular addressing block size parameter; determining a respective addressing mode selected as one of a linear addressing mode, a first circular addressing mode having a first circular block size determined based on the first circular addressing block size parameter but not the second circular addressing block size parameter, or a second circular addressing mode having a second circular block size determined based on both the first circular addressing block size parameter and the second circular addressing block size parameter; and using the determined respective addressing mode to cause data elements associated with the nested loop to be read from the memory.
  10. The data processing apparatus of claim 9, wherein each of the addressing mode fields of the stream definition template corresponds to a respective one of the nested loops, and wherein the determination of the respective addressing mode of each nested loop is based on a value in the corresponding addressing mode field of the stream definition template.
  11. The data processing apparatus of claim 10, wherein the stream definition template comprises a first circular block size field containing the first circular addressing block size parameter (CBK0) and a second circular block size field containing the second circular addressing block size parameter (CBK1).
  12. The data processing apparatus of claim 11, wherein the second circular block size is determined based on a sum of CBK0 and CBK1.
  13. The data processing apparatus of claim 11, wherein the second circular block size is determined to be equal to CBK0+CBK1+1.
  14. The data processing apparatus of claim 9, wherein: the first circular block size is selected from block sizes of 512 bytes (B) to 512 kilobytes (KB); and the second circular block size is selected from block sizes of 1 megabyte (MB) to 64 gigabytes (GB).
  15. The data processing apparatus of claim 9, wherein: the first circular block size is selected from block sizes of 512 B, 1 kilobyte (KB), 2 KB, 4 KB, 8 KB, 16 KB, 32 KB, 64 KB, 128 KB, 256 KB, 512 KB, 1 megabyte (MB), 2 MB, 4 MB, 8 MB, and 16 MB; and the second circular block size is selected from block sizes of 1 KB, 2 KB, 4 KB, 8 KB, 16 KB, 32 KB, 64 KB, 128 KB, 256 KB, 512 KB, 1 MB, 2 MB, 4 MB, 8 MB, 16 MB, 32 MB, 64 MB, 128 MB, 256 MB, 512 MB, 1 gigabyte (GB), 2 GB, 4 GB, 8 GB, 16 GB, 32 GB, and 64 GB.
  16. An apparatus, comprising: a stream template register configured to store a stream template, the stream template comprising a set of addressing mode indicators, the set of addressing mode indicators comprising a respective addressing mode indicator for each loop of a set of nested loops; an address generator coupled to the stream template register to receive the set of addressing mode indicators, wherein the address generator comprises: for each loop of the set of nested loops, control word circuitry associated with the respective loop and configured to provide, in response to the respective addressing mode indicator of the loop, a respective control word specifying a memory block size for the respective loop, the memory block size selected as one of a null value, a first block size, or a second block size, wherein: the first block size is determined based on a first circular addressing parameter but not a second circular addressing parameter; and the second block size is determined based on both the first circular addressing parameter and the second circular addressing parameter; and an adder circuit coupled to the control word circuitry to receive the respective control word and configured to cycle through a memory region according to the respective control word when the memory block size is selected as one of the first block size or the second block size, such that the address generator provides a set of addresses representing the set of nested loops using the selected memory block size; and a memory interface coupled to the address generator to receive the set of addresses and configured to retrieve from memory a set of data associated with the set of addresses.
  17. The apparatus of claim 16, wherein: the stream template further includes the first block size and the second block size; and each of the set of addressing mode indicators selects between the first block size and a function of both the first block size and the second block size.
  18. The apparatus of claim 17, wherein for each loop of the set of nested loops the respective control word circuitry comprises: an adder coupled to add the first block size and the second block size to provide a sum; a multiplexer coupled to select between the first block size and the sum to provide a table index; and a look-up table unit coupled to the multiplexer and configured to provide the respective control word in response to the table index.
  19. The apparatus of claim 18, wherein, for each loop of the set of nested loops, the adder is configured to add the first block size, the second block size, and 1 to produce the sum.
  20. The apparatus of claim 17, wherein each of the set of addressing mode indicators further selects between circular addressing based on the first block size, circular addressing based on a function of the first block size and the second block size, and linear addressing.
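Claims 4-8 and 16-19 describe how the two template parameters determine the circular block sizes: both sizes are powers of two starting at 512 bytes, and the per-loop control word distinguishes linear addressing (a null value) from the two circular modes. A minimal software sketch of that decode (the function names and the wrap-mask representation of the control word are illustrative assumptions, not the patent's actual hardware):

```python
# Illustrative decode of the circular block size parameters (CBK0, CBK1).
# Assumption: block sizes are 512 bytes scaled by a power of two, which
# matches the 512 B ... 64 GB size lists in claims 7-8.
def first_block_size(cbk0):
    # first circular addressing mode depends on CBK0 only
    return 512 << cbk0

def second_block_size(cbk0, cbk1):
    # second circular addressing mode uses CBK0 + CBK1 + 1 (claim 6)
    return 512 << (cbk0 + cbk1 + 1)

def control_word(mode, cbk0, cbk1):
    # Per claims 16-18: a mux picks CBK0 or the sum as a table index,
    # and a lookup yields the control word (modeled here as a wrap
    # mask); linear addressing gets the null value.
    if mode == "linear":
        return None
    index = cbk0 if mode == "first" else cbk0 + cbk1 + 1
    return (512 << index) - 1
```

For example, `first_block_size(0)` is 512 B and `second_block_size(15, 11)` is 64 GB, reproducing the endpoints of the size lists in claim 8.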

Description

Stream engine with selectable multi-dimensional circular addressing in each dimension

The application is a divisional application of China patent application 201711379621.5, filed on December 20, 2017, entitled "Stream engine with selectable multi-dimensional circular addressing in each dimension".

RELATED APPLICATIONS

This patent application is an improvement over U.S. patent application Ser. No. 14/331,986, entitled "Highly Integrated Scalable, Flexible DSP Megamodule Architecture" (HIGHLY INTEGRATED SCALABLE, FLEXIBLE DSP MEGAMODULE ARCHITECTURE), filed on July 15, 2014, which claims priority from U.S. provisional patent application Ser. No. 61/846,148, filed on July 15, 2013.

Technical Field

The technical field of the invention is digital data processing, and more particularly control of a streaming engine used for operand retrieval.

Background

Modern digital signal processors (DSPs) face multiple challenges. Workloads continue to increase, requiring increasing bandwidth. Systems on a chip (SOC) grow in size and complexity. Memory system latency severely impacts certain classes of algorithms. As transistors get smaller, memories and registers become less reliable. As software stacks get larger, the number of potential interactions and errors becomes larger.

Memory bandwidth and scheduling are problems for digital signal processors operating on real-time data. Digital signal processors operating on real-time data typically receive an input data stream, perform a filter function on the data stream (such as encoding or decoding), and output a transformed data stream. The system is called real-time because the application fails if the transformed data stream is not available for output when scheduled. Typical video encoding requires a predictable but non-sequential input data pattern. The corresponding memory accesses are often difficult to achieve within the available address generation and memory access resources.
A typical application requires memory accesses to load data registers in a data register file and then supply the data to functional units that perform the data processing.

Disclosure of Invention

The invention is a streaming engine for use in a digital signal processor. A fixed data stream sequence is specified by storing corresponding parameters in a control register. The data stream includes a plurality of nested loops. Once started, the data stream is read-only and cannot be written to. A functional unit using the stream data has a first instruction type that only reads the data and a second instruction type that reads the data and causes the streaming engine to advance the stream. This generally corresponds to the needs of real-time filtering operations.

The streaming engine includes an address generator, which produces the addresses of the data elements, and a stream head register (STREAM HEAD REGISTER), which stores data elements that are next to be supplied to the functional units for use as operands. Each of the plurality of nested loops has an independently specified linear addressing mode or circular addressing mode with a specified circular block size. The addressing mode and circular block size are specified by corresponding fields in a stream definition template stored in a stream definition template register. The stream definition template also specifies other aspects of the predefined data stream, including data size and data type.

The preferred embodiment includes two independently defined data streams. The two data streams may be read or read/advanced independently by a set of very long instruction word (VLIW) functional units.

Drawings

These and other aspects of the invention are illustrated in the drawings, in which: FIG. 1 illustrates a dual scalar/vector datapath processor according to one embodiment of the invention; FIG. 2 illustrates the registers and functional units in the dual scalar/vector datapath processor shown in FIG. 1; FIG. 3 illustrates a global scalar register file; FIG. 4 illustrates a local scalar register file shared by the arithmetic functional units; FIG. 5 illustrates a local scalar register file shared by the multiplication functional units; FIG. 6 illustrates a local scalar register file shared by the load/store units; FIG. 7 illustrates a global vector register file; FIG. 8 illustrates a predicate register file; FIG. 9 illustrates a local vector register file shared by the arithmetic functional units; FIG. 10 illustrates a local vector register file shared by the multiplication and related functional units; FIG. 11 illustrates pipeline phases of a central processing unit according to a preferred embodiment of the invention; FIG. 12 illustrates the 16 instructions of a single fetch packet; FIG. 13 illustrates an example of instruction coding of instructions used by the invention; FIG. 14 illustrates the bit coding of a condition code extension slot
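The per-loop choice between linear and circular addressing described in the disclosure above can be modeled in software. This is a simplified sketch under stated assumptions (a circular loop wraps its own address contribution within its block; the function and parameter names are invented for illustration), not the hardware address generator:

```python
# Simplified model of multi-dimensional addressing with a selectable
# linear or circular mode per nested loop. blocks[i] is the circular
# block size in bytes for loop i, or None for linear addressing.
def generate_addresses(base, counts, dims, blocks):
    """counts[i]: iteration count of loop i; dims[i]: byte stride of
    loop i; loop 0 is the innermost loop and iterates fastest."""
    addrs = []

    def walk(level, offset):
        if level < 0:
            addrs.append(base + offset)
            return
        for i in range(counts[level]):
            step = i * dims[level]
            if blocks[level] is not None:
                step %= blocks[level]  # wrap within the circular block
            walk(level - 1, offset + step)

    walk(len(counts) - 1, 0)
    return addrs

# Linear inner loop of 4 elements with stride 2:
print(generate_addresses(100, [4], [2], [None]))  # [100, 102, 104, 106]
# Same loop with a 4-byte circular block: the address wraps after 2 steps:
print(generate_addresses(100, [4], [2], [4]))     # [100, 102, 100, 102]
```

Switching a single loop level between `None` and a block size reproduces the key property of the claims: each dimension independently selects linear or circular addressing without affecting the other dimensions.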