
US-12620050-B1 - Decoding acceleration with hardware decoder

US 12620050 B1

Abstract

Techniques to speed up decoding of compressed data objects may include offloading a decoding function to a decoder accelerator. The techniques may include a processor parsing the header of a compressed data object to obtain one or more code tables, loading code data for the one or more code tables into the decoder accelerator, and providing the encoded data from the compressed data object to the decoder accelerator via a decoder bus interface. The decoder accelerator decodes the encoded data into decoded data blocks. The processor then receives the decoded data blocks from the decoder accelerator via the decoder bus interface, generates pre-transformation data blocks from the decoded data blocks by performing an inverse domain transformation, and converts the pre-transformation data blocks into a decompressed data object.
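The processor/accelerator division of labor described above can be sketched in a toy, self-contained example. Everything here is illustrative assumption, not the patent's actual interface: the `ToyDecoderAccelerator` class stands in for the hardware decoder (serial, table-driven decoding of a run-length "bitstream"), and the doubling step stands in for the parallelizable post-processing (dequantization and inverse transform) that stays on the processor.

```python
class ToyDecoderAccelerator:
    """Hypothetical stand-in for the hardware decoder: serial, table-driven decode."""
    def __init__(self):
        self.tables = None
        self.blocks = []

    def load_tables(self, tables):
        # analogous to loading code tables over the decoder bus interface
        self.tables = tables

    def send_bitstream(self, stream):
        # toy serial decode: each (symbol, run) pair expands to `run` values
        out = []
        for sym, run in stream:
            out.extend([self.tables[sym]] * run)
        self.blocks = [out[i:i + 4] for i in range(0, len(out), 4)]

    def receive_blocks(self):
        return self.blocks


def decode_object(header_tables, stream, accel):
    accel.load_tables(header_tables)   # 1. processor parses header, loads tables
    accel.send_bitstream(stream)       # 2. encoded data sent to the accelerator
    blocks = accel.receive_blocks()    # 3. accelerator-decoded blocks come back
    # 4. parallelizable post-processing stays on the processor
    #    (a trivial stand-in for dequantization + inverse transform)
    return [[v * 2 for v in b] for b in blocks]


accel = ToyDecoderAccelerator()
result = decode_object({0: 5, 1: 7}, [(0, 3), (1, 1)], accel)
```

The key point is the split: the serial, table-dependent decode runs in the accelerator object, while the per-block arithmetic that could be vectorized runs on the host side.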

Inventors

  • Xiaodan Tan
  • Paul Gilbert Meyer

Assignees

  • AMAZON TECHNOLOGIES, INC.

Dates

Publication Date
2026-05-05
Application Date
2022-05-31

Claims (20)

  1. A computing system to accelerate JPEG (Joint Photographic Experts Group) image decoding, the computing system comprising: a processor; and a decoder accelerator coupled to the processor via a decoder bus interface, wherein the processor is operable to: parse a JPEG file to obtain a header including one or more quantization tables and a plurality of Huffman code tables; load the plurality of Huffman code tables into the decoder accelerator via the decoder bus interface; transmit encoded data in the JPEG file to the decoder accelerator via the decoder bus interface; receive a set of accelerator-decoded JPEG data blocks from the decoder accelerator via the decoder bus interface; dequantize the set of accelerator-decoded JPEG data blocks that were previously decoded by the decoder accelerator using the one or more quantization tables to generate a set of dequantized JPEG data blocks; perform inverse discrete cosine transform on the set of dequantized JPEG data blocks that were previously decoded by the decoder accelerator to generate a set of spatial JPEG data blocks; and perform color conversion on the set of spatial JPEG data blocks to generate image data from the JPEG file, and wherein the decoder accelerator is operable to: store the plurality of Huffman code tables provided from the processor; receive the encoded data in the JPEG file from the processor; generate the accelerator-decoded JPEG data blocks using the Huffman code tables; and transmit the accelerator-decoded JPEG data blocks to the processor.
  2. The computing system of claim 1, wherein the processor is operable to perform the inverse discrete cosine transform on the set of dequantized JPEG data blocks in parallel with the decoder accelerator decoding another set of JPEG data blocks.
  3. The computing system of claim 1, wherein the decoder bus interface is a packetized interface supporting a plurality of packet types including a starting packet type to initialize the decoder accelerator for a new JPEG file, a load table packet type to load the plurality of Huffman code tables into the decoder accelerator, and a send bitstream data packet type to transmit the encoded data in the JPEG file to the decoder accelerator, wherein each packet being transmitted from the processor to the decoder accelerator includes a packet type identifier to identify the packet type of the packet.
  4. The computing system of claim 3, wherein each load table packet is transmitted with a table identifier to indicate whether the corresponding load table packet is loading an AC luminance Huffman code table, a DC luminance Huffman code table, an AC chrominance Huffman code table, or a DC chrominance Huffman code table.
  5. The computing system of claim 4, wherein the table identifier indicates whether a code table corresponding to a load table packet is associated with luminance or chrominance.
  6. The computing system of claim 1, wherein the processor is operable to fuse the accelerator-decoded JPEG data blocks from the decoder accelerator to match a data width of single instruction multiple data (SIMD) instructions of the processor.
  7. A hardware decoder accelerator comprising: a decoder bus interface operable to communicate with a processor; and decoder acceleration circuitry coupled to the decoder bus interface, wherein the decoder acceleration circuitry is operable to: store a plurality of code tables provided from the processor via the decoder bus interface; receive encoded data of a compressed data object from the processor via the decoder bus interface; generate accelerator-decoded data blocks using the plurality of code tables; and transmit the accelerator-decoded data blocks to the processor for post-processing operations, wherein the post-processing operations of the processor include performing dequantization and inverse discrete cosine transformation on the accelerator-decoded data blocks obtained from the hardware decoder accelerator.
  8. The hardware decoder accelerator of claim 7, wherein the decoder acceleration circuitry is operable to generate the accelerator-decoded data blocks while the processor is performing the post-processing operations in parallel on previous accelerator-decoded data blocks provided from the hardware decoder accelerator.
  9. The hardware decoder accelerator of claim 7, wherein the decoder bus interface is a packetized interface supporting a plurality of packet types.
  10. The hardware decoder accelerator of claim 9, wherein the plurality of packet types includes a starting packet type to identify a start packet from the processor to initialize the hardware decoder accelerator, wherein the start packet includes information indicating a number of packets being used to transmit the compressed data object.
  11. The hardware decoder accelerator of claim 9, wherein the plurality of packet types includes a load table packet type from the processor to load the plurality of code tables into the hardware decoder accelerator.
  12. The hardware decoder accelerator of claim 9, wherein the plurality of packet types includes a send bitstream data packet type from the processor to transfer the encoded data to the hardware decoder accelerator.
  13. The hardware decoder accelerator of claim 7, wherein the decoder acceleration circuitry is operable to: receive a plurality of load table packets, wherein each load table packet includes a table identifier to indicate which of the plurality of code tables the load table packet corresponds to.
  14. The hardware decoder accelerator of claim 13, wherein the table identifier indicates whether a code table corresponding to a load table packet is associated with an AC component or a DC component.
  15. The hardware decoder accelerator of claim 13, wherein each load table packet includes a table segment identifier to indicate which portion of code data of a corresponding code table is being provided in the load table packet.
  16. The hardware decoder accelerator of claim 13, wherein the plurality of code tables includes a code table having code data that includes a lookahead table associated with the code table.
  17. The hardware decoder accelerator of claim 13, wherein the plurality of code tables includes a code table having code data that includes a set of maximum values associated with the code table, wherein each of the set of maximum values corresponds to a maximum value for a different code length.
  18. The hardware decoder accelerator of claim 8, wherein the post-processing operations being performed in parallel include an upsampling operation.
  19. The hardware decoder accelerator of claim 8, wherein the post-processing operations being performed in parallel include a color space conversion operation.
  20. The hardware decoder accelerator of claim 7, wherein the encoded data is from a JPEG image file.
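The per-code-length maximum values recited in claim 17 are characteristic of canonical Huffman decoding, as used in the JPEG standard (ITU-T T.81, Annex F): the decoder extends the current code one bit at a time until it falls at or below the maximum code value for its length. The sketch below illustrates that procedure; the example code counts and symbols are made up, and the table layout is an assumption rather than the accelerator's actual storage format.

```python
def build_tables(counts, symbols):
    """Build canonical Huffman tables; counts[l] = number of codes of length l+1."""
    mincode, maxcode, valptr = [], [], []
    code, k = 0, 0
    for n in counts:
        valptr.append(k)        # index of first symbol at this length
        mincode.append(code)    # smallest code value at this length
        code += n
        k += n
        maxcode.append(code - 1 if n else -1)  # max code value for this length
        code <<= 1
    return mincode, maxcode, valptr


def decode_symbol(bits, pos, mincode, maxcode, valptr, symbols):
    """Extend the code one bit at a time until it falls within the maximum
    value for its length -- this dependency is why the decode loop is serial."""
    code, length = 0, 0
    while True:
        code = (code << 1) | bits[pos]
        pos += 1
        length += 1
        if maxcode[length - 1] >= 0 and code <= maxcode[length - 1]:
            idx = valptr[length - 1] + code - mincode[length - 1]
            return symbols[idx], pos


# Toy canonical code: 'a' -> 0, 'b' -> 10, 'c' -> 11
counts = [1, 2]               # one 1-bit code, two 2-bit codes
symbols = ['a', 'b', 'c']
mc, xc, vp = build_tables(counts, symbols)
bits = [1, 0, 0, 1, 1]        # encodes 'b', 'a', 'c'
out, pos = [], 0
while pos < len(bits):
    sym, pos = decode_symbol(bits, pos, mc, xc, vp, symbols)
    out.append(sym)
```

Because the length of each code is only known once it has been matched, the start of the next code cannot be located without decoding the previous one, which is exactly the serialization the hardware accelerator is meant to absorb.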
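One plausible framing for the packetized decoder bus interface of claims 3 and 9 through 15 is sketched below. The field widths, packet type codes, and table identifiers are all invented for illustration; the patent specifies the identifiers' existence, not their encoding.

```python
import struct

# Assumed packet type codes (claim 3 / claims 10-12)
PKT_START, PKT_LOAD_TABLE, PKT_BITSTREAM = 0, 1, 2
# Assumed table identifiers (claims 4, 13-14): DC/AC x luminance/chrominance
TBL_DC_LUMA, TBL_AC_LUMA, TBL_DC_CHROMA, TBL_AC_CHROMA = 0, 1, 2, 3


def start_packet(num_packets):
    # start packet carries the number of packets used to transmit the
    # compressed data object (claim 10)
    return struct.pack("<BI", PKT_START, num_packets)


def load_table_packet(table_id, segment_id, code_data):
    # table identifier says which code table this packet loads (claim 13);
    # segment identifier says which portion of that table's code data it
    # carries (claim 15)
    return struct.pack("<BBB", PKT_LOAD_TABLE, table_id, segment_id) + code_data


def bitstream_packet(payload):
    # send bitstream data packet transfers encoded data (claim 12)
    return struct.pack("<B", PKT_BITSTREAM) + payload


pkt = load_table_packet(TBL_AC_LUMA, 0, b"\x01\x02")
```

In each case the first byte is the packet type identifier required by claim 3, so the accelerator can dispatch on it before interpreting the rest of the packet.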

Description

BACKGROUND

Various techniques can be used to compress large data objects such as images and video to reduce storage space. Compressing a data object can also reduce its transmission time because fewer data bits are transmitted as compared to the uncompressed data object. An example of an effective compression technique is entropy encoding, in which the most common data symbol occurring in the data object is encoded using the fewest bits.

BRIEF DESCRIPTION OF THE DRAWINGS

Various embodiments in accordance with the present disclosure will be described with reference to the drawings, in which:

  • FIG. 1 illustrates a block diagram of an example of a computing system to accelerate data decoding;
  • FIG. 2 illustrates a conceptual diagram of an example of interactions between a processor and an accelerator;
  • FIG. 3 illustrates an example of an image encoding process;
  • FIG. 4 illustrates an example of data block organization;
  • FIG. 5 illustrates a block diagram of an example of an integrated circuit;
  • FIG. 6 illustrates a block diagram of another example of an integrated circuit;
  • FIG. 7 illustrates a conceptual diagram of an example of an inverse transform;
  • FIG. 8 illustrates a conceptual diagram of an example of a fuse input machine instruction;
  • FIG. 9 illustrates a block diagram of example logic to implement a fuse input machine instruction;
  • FIG. 10 illustrates a block diagram of example logic to implement a vector-to-workspace machine instruction;
  • FIG. 11 illustrates a conceptual diagram of an example of a workspace-to-vector machine instruction;
  • FIG. 12 illustrates a conceptual diagram of an example of a saturate-and-store machine instruction;
  • FIG. 13 illustrates a timing diagram of an example of a decoding process;
  • FIG. 14 illustrates a flow diagram of an example of a decoding process;
  • FIG. 15 illustrates a flow diagram of another example of a decoding process;
  • FIG. 16 illustrates a flow diagram of a further example of a decoding process;
  • FIG. 17 illustrates a flow diagram of an example of a machine instruction execution process;
  • FIG. 18 illustrates a conceptual diagram of an example of an instruction queue; and
  • FIG. 19 illustrates an example of a computing device, according to certain aspects of the disclosure.

DETAILED DESCRIPTION

Processor architectures can be designed to support single instruction multiple data (SIMD) instructions to increase computational parallelism and achieve higher compute throughput. A SIMD instruction is a single machine instruction (which may also be referred to as a processor instruction) that, when executed by the processor, operates on multiple data elements simultaneously. A processor can support such instructions by implementing parallel computational data paths. For example, a processor may include a number of arithmetic logic units (ALUs) that can operate in parallel to perform computations on a corresponding number of data elements simultaneously.

Although such parallel hardware architecture can significantly improve compute throughput, processors employing it may provide little improvement when decoding data objects that have been encoded using certain techniques such as entropy encoding. This is because decoding such data is mostly a serialized process due to the variable-length codes appearing in the compressed data. Such encoding techniques make it difficult to determine where the next code begins in the data stream without having first decoded the previous code. As such, the input data cannot easily be split up for parallel processing, because it is unclear where to partition the input data while keeping the code words intact.

The techniques disclosed herein provide a hardware decoder to accelerate the decoding process of compressed data objects. The hardware decoder can be coupled with a processor that supports SIMD instructions to process data objects that can benefit from both parallel and serial processing. Processing steps that are parallelizable can be performed in the processor to take advantage of the SIMD instructions, and processing steps such as decoding steps that are serial in nature can be performed in the hardware decoder. The hardware decoder can operate in a pipelined manner with the processor to reduce latency and improve throughput. In some implementations, machine instructions can be implemented to assist with preparing and organizing the decoded data provided from the hardware decoder for consumption by the SIMD instructions in the processor.

In the following description, various embodiments will be described. For purposes of explanation, specific configurations and details are set forth in order to provide a thorough understanding of the embodiments. However, it will also be apparent to one skilled in the art that the embodiments may be practiced without the specific details. Furthermore, well-known features may be omitted or simplified in order not to obscure the embodiments being described.

FIG. 1 illustrates an example of