US-12619553-B2 - Method and apparatus to sort a vector for a bitonic sorting algorithm

US12619553B2US 12619553 B2US12619553 B2US 12619553B2US-12619553-B2

Abstract

A method is provided that includes performing, by a processor in response to a vector sort instruction, sorting of values stored in lanes of the vector to generate a sorted vector, wherein the values in a first portion of the lanes are sorted in a first order indicated by the vector sort instruction and the values in a second portion of the lanes are sorted in a second order indicated by the vector sort instruction; and storing the sorted vector in a storage location.

Inventors

Timothy David Anderson
Mujibur Rahman

Assignees

TEXAS INSTRUMENTS INCORPORATED

Dates

Publication Date: 20260505
Application Date: 20230320

Claims (20)

1 . A method comprising: first sorting, by a processing circuit in response to executing a first instruction, a first half of a set of vector values in a first direction; second sorting, by the processing circuit in response to executing a second instruction, a second half of the set of vector values in a second direction; generating, by the processing circuit in response to the first sorting and the second sorting, a first sorted set of vector values; moving, by a vector permutation logic circuit in the processing circuit, a first half of the first sorted set of vector values to a second half of a first output vector; moving, by the vector permutation logic circuit in the processing circuit, a second half of the first sorted set of vector values to a first half of the first output vector; comparing, by the processing circuit in response to executing a third instruction, the first sorted set of vector values with the first output vector to generate a second output vector and a third output vector; moving, by the vector permutation logic circuit in the processing circuit, a second half of the third output vector to a second half of a fourth output vector; moving, by the vector permutation logic circuit in the processing circuit, a first half of the second output vector to a first half of the fourth output vector; third sorting, by the processing circuit in response to executing the first instruction, a first half of the fourth output vector in the first direction; fourth sorting, by the processing circuit in response to executing the first instruction, a second half of the fourth output vector in a fourth direction; and outputting, by the processing circuit in response to the third sorting and the fourth sorting, a second sorted set of vector values.
2 . The method of claim 1 , wherein the first set of vector values includes a first 16-bit data element, and wherein the second set of vector values includes a second 16-bit data element.
3 . The method of claim 1 , wherein the first half of the first sorted set of vector values is a low half of the first sorted set of vector values, wherein the second half of the first output vector is a high half of the first output vector, wherein the second half of the first sorted set of vector values is a high half of the first sorted set of vector values, and wherein the first half of the first output vector is a low half of the first output vector.
4 . The method of claim 1 , wherein comparing the first sorted set of vector values with the first output vector to generate the second output vector comprises outputting a smaller value between each corresponding lane of the first sorted set of vector values and the first output vector.
5 . The method of claim 1 , wherein comparing the first sorted set of vector values with the first output vector to generate the third output vector comprises outputting a larger value between each corresponding lane of the first sorted set of vector values and the first output vector.
6 . The method of claim 1 , wherein the first half of the second output vector is a low half of the second output vector, wherein the first half of the fourth output vector is a low half of the fourth output vector, wherein the second half of the third output vector is a high half of the third output vector, and wherein the second half of the fourth output vector is a high half of the fourth output vector.
7 . The method of claim 1 , wherein the first direction is an increasing order, and wherein the second direction is a decreasing order.
8 . The method of claim 1 , wherein the first direction is a decreasing order, and wherein the second direction is an increasing order.
9 . A non-transitory computer-readable medium having executable instructions stored thereon, configured to be executable by one or more processors for causing the one or more processors to: perform a first sort of a first half of a set of vector values in a first direction in response to executing a first instruction; perform a second sort of a second half of the set of vector values in a second direction in response to executing a second instruction; generate a first sorted set of vector values based on the first sort and the second sort; move, using a vector permutation logic circuit, a first half of the first sorted set of vector values to a second half of a first output vector; move, using the vector permutation logic circuit, a second half of the first sorted set of vector values to a first half of the first output vector; compare the first sorted set of vector values with the first output vector to generate a second output vector and a third output vector in response to executing a third instruction; move, using the vector permutation logic circuit, a second half of the third output vector to a second half of a fourth output vector; move, using the vector permutation logic circuit, a first half of the second output vector to a first half of the fourth output vector; perform a third sort of the first half of the fourth output vector in a third direction in response to executing the first instruction; perform a fourth sort of the second half of the fourth output vector in a fourth direction in response to executing the first instruction; and generate a second sorted set of vector values based on the third sort and the fourth sort.
10 . The non-transitory computer-readable medium of claim 9 , wherein the first set of vector values includes a first 16-bit data element; and wherein the second set of vector values includes a second 16-bit data element.
11 . The non-transitory computer-readable medium of claim 9 , wherein the first half of the first sorted set of vector values is a low half of the first sorted set of vector values, wherein the second half of the first output vector is a high half of the first output vector, wherein the second half of the first sorted set of vector values is a high half of the first sorted set of vector values, and wherein the first half of the first output vector is a low half of the first output vector.
12 . The non-transitory computer-readable medium of claim 9 , wherein the instruction to compare the first sorted set of vector values with the first output vector to generate the second output vector comprise an instruction to output a smaller value between each corresponding lane of the first sorted set of vector values and the first output vector.
13 . The non-transitory computer-readable medium of claim 9 , wherein the instruction to compare the first sorted set of vector values with the first output vector to generate the third output vector comprise an instruction to output a larger value between each corresponding lane of the first sorted set of vector values and the first output vector.
14 . The non-transitory computer-readable medium of claim 9 , wherein the first half of the second output vector is a low half of the second output vector, wherein the first half of the fourth output vector is a low half of the fourth output vector, wherein the second half of the third output vector is a high half of the third output vector, and wherein the second half of the fourth output vector is a high half of the fourth output vector.
15 . The non-transitory computer-readable medium of claim 9 , wherein the first direction is an increasing order, and wherein the second direction is a decreasing order.
16 . The non-transitory computer-readable medium of claim 9 , wherein the first direction is a decreasing order, and wherein the second direction is an increasing order.
17 . The method of claim 1 , wherein an opcode of the first instruction is different from an opcode of the second instruction.
18 . The non-transitory computer-readable medium of claim 9 , wherein an opcode of the first instruction is different from an opcode of the second instruction.
19 . A system comprising: a processing circuit configurable to: perform a first sort of a first half of a set of vector values in a first direction in response to executing a first instruction; perform a second sort of a second half of the set of vector values in a second direction in response to executing a second instruction; and generate a first sorted set of vector values in response to the first sort and the second sort; and a vector permutation logic circuit coupled to the processing circuit and configurable to: move a first half of the first sorted set of vector values to a second half of a first output vector; and move a second half of the first sorted set of vector values to a first half of the first output vector, wherein the processing circuit is configurable to compare the first sorted set of vector values with the first output vector to generate a second output vector and a third output vector in response to executing a third instruction, wherein the vector permutation logic circuit is configurable to: move a second half of the third output vector to a second half of a fourth output vector; and move a first half of the second output vector to a first half of the fourth output vector, and wherein the processing circuit is configurable to: perform a third sort of a first half of the fourth output vector in the first direction; perform a fourth sort of a second half of the fourth output vector in a fourth direction; and output a second sorted set of vector values in response to the third sorting and the fourth sorting.
20 . The system of claim 19 , wherein an opcode of the first instruction is different from an opcode of the second instruction.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS This application is a continuation of U.S. patent application Ser. No. 17/665,958, filed Feb. 7, 2022 (currently pending and scheduled to grant as U.S. Pat. No. 11,609,862), which is a continuation of U.S. patent application Ser. No. 16/589,133, filed Sep. 30, 2019 (now U.S. Pat. No. 11,281,464), which claims benefit of U.S. Provisional Patent Application No. 62/852,870, filed May 24, 2019, which is incorporated herein by reference in its entirety. BACKGROUND Digital signal processors (DSP) are optimized for processing streams of data that may be derived from various input signals, such as sensor data, a video stream, a voice channel, radar signals, biomedical signals, etc. Digital signal processors operating on real-time data typically receive an input data stream, perform a filter function on the data stream (such as encoding or decoding) and output a transformed data stream. The system is called real-time because the application fails if the transformed data stream is not available for output when scheduled. Typical video encoding requires a predictable but non-sequential input data pattern. A typical application requires memory access to load data registers in a data register file and then supply data from the data registers to functional units which perform the data processing. One or more DSP processing cores can be combined with various peripheral circuits, blocks of memory, etc. on a single integrated circuit (IC) die to form a system on chip (SoC). These systems can include multiple interconnected processors that share the use of on-chip and off-chip memory. A processor can include some combination of instruction cache (ICache) and data cache (DCache) to improve processing. Furthermore, multiple processors with shared memory can be incorporated in a single embedded system. The processors can physically share the same memory without accessing data or executing code located in the same memory locations or can use some portion of the shared memory as common shared memory. SUMMARY Embodiments of the present disclosure relate to methods and apparatus for vector sorting. In one aspect, a method is provided that includes performing, by a processor in response to a vector sort instruction, sorting of values stored in lanes of the vector to generate a sorted vector, wherein the values in a first portion of the lanes are sorted in a first order indicated by the vector sort instruction and the values in a second portion of the lanes are sorted in a second order indicated by the vector sort instruction, and storing the sorted vector in a storage location. In one aspect, a processor is provided that includes comparator logic configured to compare values in lanes of a vector responsive to a vector sort instruction, and vector sort logic configured to sort the values as indicated by the vector sort instruction to generate a sorted vector based on results of comparing the values by the comparator logic, wherein the values in a first portion of the lanes are sorted in a first order indicated by the vector sort instruction and the values in a second portion of the lanes are sorted in a second order indicated by the vector sort instruction. BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 illustrates an example dual scalar/vector data path processor; FIG. 2 illustrates the registers and functional units in the dual scalar/vector data path processor illustrated in FIG. 1; FIG. 3 illustrates a global scalar register file; FIG. 4 illustrates a local scalar register file shared by arithmetic functional units; FIG. 5 illustrates a local scalar register file shared by multiply functional units; FIG. 6 illustrates a local scalar register file shared by load/store units; FIG. 7 illustrates a global vector register file; FIG. 8 illustrates a predicate register file; FIG. 9 illustrates a local vector register file shared by arithmetic functional units; FIG. 10 illustrates a local vector register file shared by multiply and correlation functional units; FIG. 11 illustrates pipeline phases of a processing unit; FIG. 12 illustrates sixteen instructions of a single fetch packet; FIG. 13 illustrates an example of the instruction coding; FIG. 14 illustrates bit coding of a condition code extension slot 0; FIG. 15 illustrates bit coding of a condition code extension slot 1; FIG. 16 illustrates bit coding of a constant extension slot 0; FIG. 17 is a partial block diagram illustrating constant extension; FIG. 18 illustrates carry control for SIMD operations; FIG. 19 illustrates a conceptual view of streaming engines; FIG. 20 illustrates a sequence of formatting operations; FIG. 21 illustrates an example of lane allocation in a vector; FIG. 22 illustrates an example of lane allocation in a vector; FIG. 23 illustrates a basic two-dimensional (2D) stream; FIG. 24 illustrates the order of elements within the example stream of FIG. 23; FIG. 25 illustrates extracting a smaller rectangle from a larger rectangle