CN-121984516-A - Hardware acceleration device and method for high-throughput lossless data compression
Abstract
The invention discloses a hardware acceleration device and method for high-throughput lossless data compression, and relates to the field of integrated circuit design. The device comprises a front-end LZ77 compression module, an intermediate buffer module, and a back-end entropy coding module. The front-end LZ77 compression module is configured to search and match the input data; the intermediate buffer module is configured to decouple the front-end and back-end processing clock domains through an asynchronous symbol buffer; and the back-end entropy coding module is configured to receive the output of the intermediate buffer module, construct a microcode-packet intermediate representation containing codeword information, and parse the microcode packets to generate a serial compressed bitstream. The invention effectively eliminates invalid memory reads and pipeline stalls, significantly improves the throughput and energy-efficiency ratio of the hardware compressor without sacrificing the compression ratio, and is suitable for offload acceleration in high-performance solid-state drive controllers, smart network interface cards, and cloud servers.
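The abstract's final step, parsing microcode packets into a serial compressed bitstream, can be illustrated with a small behavioral model of variable-length codeword packing. This is an illustrative sketch only, not the patented RTL; the function name `pack_codewords` and the MSB-first packing convention are assumptions made for the example.

```python
# Behavioral sketch (illustrative, not the patent's RTL) of splicing
# variable-bit-width codewords into a continuous bitstream, mimicking
# a barrel-shifter-plus-alignment datapath in software.

def pack_codewords(codewords):
    """codewords: iterable of (code, length) pairs, packed MSB-first."""
    acc = 0          # bit accumulator (models the shifter register)
    nbits = 0        # number of valid bits currently held in `acc`
    out = bytearray()
    for code, length in codewords:
        # Shift previous bits left and append the new codeword.
        acc = (acc << length) | (code & ((1 << length) - 1))
        nbits += length
        while nbits >= 8:                  # emit fully aligned bytes
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
    if nbits:                              # flush tail with zero padding
        out.append((acc << (8 - nbits)) & 0xFF)
    return bytes(out)

# Three codewords of 3, 2, and 3 bits concatenate to the byte 0b10101110.
assert pack_codewords([(0b101, 3), (0b01, 2), (0b110, 3)]) == b"\xae"
```

In hardware, the same accumulate-shift behavior is realized combinationally per cycle, which is why the logic depth grows with parallelism.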
Inventors
- XIAO HAO
- XU HAN
Assignees
- 合肥工业大学 (Hefei University of Technology)
Dates
- Publication Date: 2026-05-05
- Application Date: 2025-12-24
Claims (7)
- 1. A hardware acceleration device for high-throughput lossless data compression, comprising a front-end LZ77 compression module, an intermediate buffer module, and a back-end entropy coding module, wherein: the front-end LZ77 compression module comprises a near-field matching engine, a far-field matching engine, and a speculative arbiter; the near-field matching engine is configured to search and match input data within the windows of the current cycle and the previous cycle; the speculative arbiter is configured to selectively initiate a read operation on a multi-Bank hash-table data storage area according to the near-field matching length output by the near-field matching engine and the far-field matching length recorded in a metadata activity value table; the intermediate buffer module is connected between the front-end LZ77 compression module and the back-end entropy coding module and is configured to decouple the front-end and back-end processing clock domains through an asynchronous symbol buffer; and the back-end entropy coding module is configured to receive the output of the intermediate buffer module, construct a microcode-packet intermediate representation encapsulating codewords and code lengths, and parse the microcode packets to generate a serial compressed bitstream.
- 2. The apparatus of claim 1, wherein the near-field matching engine employs a fully register-based architecture comprising a shift-register set whose depth is twice the parallel window width, configured to store the current-clock-cycle data and the previous-clock-cycle data simultaneously so as to eliminate cross-cycle boundary-detection dead zones, and wherein the near-field matching engine computes the matching result using a channel-based parallel comparison array.
- 3. The apparatus of claim 1, wherein a hash computation unit concurrently computes the hash indexes of all byte positions in the parallel window within a single clock cycle and accordingly accesses the metadata activity value table and the multi-Bank hash-table data storage area, and wherein the compression-attribute metadata in the metadata activity value table comprises at least a Bank index, a true matching length, and a valid bit.
- 4. The apparatus of claim 1, wherein the asynchronous symbol buffer employs ping-pong buffer control logic configured to redirect the symbol stream output by the front-end LZ77 compression module to a free memory bank when the back-end entropy coding module is in a Huffman-tree construction state, the free memory bank having a capacity sufficient to accommodate at least the Deflate data blocks defined by two maximum transmission units.
- 5. The apparatus of claim 1, wherein the data structure of the microcode packet comprises a valid-mask field and a multi-slot payload field, and wherein the back-end entropy coding module comprises a microcode generator and a bitstream packer configured as a finite state machine that sequentially parses the valid-mask field in the microcode packet, loads the codewords of the valid slots into a barrel shifter, and performs an accumulate-shift operation according to the corresponding code lengths to generate a continuous compressed bitstream.
- 6. A method implementing a Deflate-based algorithm on the apparatus of any one of claims 1 to 5, comprising: a front-end searching step of combining near-field searching with far-field speculation logic based on the metadata activity value table, computing the hash value of the input data to index the metadata activity value table, synchronously recording the matching length as metadata when data is written back, and, during searching, first checking the metadata to arbitrate whether to read the multi-Bank hash-table data storage area, initiating a memory read only when the metadata indicates a valid match and otherwise reusing the near-field result; an intermediate transfer step of distributing the data stream in parallel to a statistics path and a storage path using a dual-path architecture, writing the current data block into a first statistics bank and the asynchronous symbol buffer, switching immediately to a second statistics bank once the block is finished, and meanwhile building a Huffman tree from the data of the first statistics bank while reading the data back from the asynchronous symbol buffer for encoding; and a back-end coding step of mapping a plurality of symbols in parallel, within a single cycle, into microcode packets encapsulating codewords and code lengths, parsing the microcode packets slot by slot with a finite state machine, and splicing the variable-length codewords into the output bitstream through shift operations.
- 7. An electronic device comprising: a memory configured to store a computer program, instructions, or a configuration bitstream; a processor coupled to the memory; and a communication interface configured to enable communication between the electronic device and other devices; wherein the processor is configured to run the computer program or instructions, or to load the configuration bitstream, so as to implement the steps of the method of claim 6.
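The speculative arbitration described in claims 1, 3, and 6 — consult the metadata activity value table first, and issue a hash-table memory read only when the recorded far-field match beats the near-field result — can be sketched as a behavioral model. This is a minimal illustrative sketch, not the patented circuit; the names `MetaEntry`, `speculative_search`, and the `NEAR_FIELD_THRESHOLD` constant are assumptions made for the example.

```python
# Behavioral sketch of metadata-arbitrated far-field search:
# the expensive multi-Bank read happens only when metadata predicts
# a longer match than the near-field engine already found.

from dataclasses import dataclass

@dataclass
class MetaEntry:
    valid: bool = False   # valid bit
    bank: int = 0         # Bank index of the stored history
    match_len: int = 0    # true matching length recorded at write-back

NEAR_FIELD_THRESHOLD = 4  # assumed: far-field read only pays off past this

def speculative_search(meta_table, h, near_len, read_bank):
    """Return (match_len, source) without blind hash-table reads.

    `read_bank(bank, h)` models the memory read plus full byte compare;
    it is invoked only when the metadata indicates a valid match longer
    than the near-field result.
    """
    entry = meta_table[h]
    if entry.valid and entry.match_len > max(near_len, NEAR_FIELD_THRESHOLD):
        far_len = read_bank(entry.bank, h)   # single arbitrated read
        if far_len > near_len:
            return far_len, "far"
    return near_len, "near"                  # reuse near-field result, no read

# Toy usage: metadata at hash 7 promises an 8-byte match, so one read fires;
# an invalid entry falls back to the near-field result with zero reads.
table = [MetaEntry() for _ in range(16)]
table[7] = MetaEntry(valid=True, bank=2, match_len=8)
reads = []
def fake_read(bank, h):
    reads.append((bank, h))
    return 8
assert speculative_search(table, 7, 3, fake_read) == (8, "far")
assert speculative_search(table, 5, 3, fake_read) == (3, "near")
assert reads == [(2, 7)]
```

The design point is that a small metadata lookup replaces the "read-before-compare" pattern, so hash collisions no longer translate into wasted memory bandwidth and dynamic power.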
Description
Hardware acceleration device and method for high-throughput lossless data compression
Technical Field
The invention relates to the technical field of integrated circuit design and high-performance data processing, and in particular to a lossless data compression hardware acceleration device for data centers, edge computing nodes, and high-bandwidth network interface controllers.
Background
With the explosive growth of cloud computing, artificial intelligence (AI), and large-scale data centers, the contradiction between data transmission bandwidth and storage cost has become increasingly prominent. To increase storage density and relieve network bandwidth pressure, lossless data compression has become a standard function in storage controllers, smart network interface cards, and edge computing nodes. Among available schemes, the Deflate algorithm is the mainstream industrial choice because it is widely applied in the GZIP, PNG, and HTTP compression standards. However, as network interface rates evolve, conventional pure-software compression schemes based on general-purpose processors (CPUs) struggle to meet the real-time requirements of line-rate processing. Even on high-performance server CPUs, single-core compression throughput is typically limited by instruction-set efficiency to the order of hundreds of MB/s, leaving a large performance gap to the bandwidth requirements of modern high-speed networks. Therefore, how to design an efficient hardware acceleration scheme for lossless compression has become a key technical problem to be solved. Although most existing hardware compression accelerators improve throughput to a certain extent, in high-throughput scenarios they still face the following three core technical bottlenecks, which restrict further gains in performance and energy-efficiency ratio:
1. The core of the LZ77 algorithm is to perform the longest string match within a history sliding window (e.g., 32 KB). To improve throughput, mainstream hardware architectures generally use a multi-byte parallel window (parallel window size, PWS, e.g., processing 32 bytes per cycle) in combination with a multi-Bank hash table to perform the search. However, existing conventional hardware architectures typically employ a "read-before-compare" mechanism: after all possible candidate locations are located by hash index, the historical data at those locations must be read out in full from on-chip memory (e.g., BRAM) and compared byte by byte with the current data. In a highly parallel design, this means that multiple memory read operations may need to be initiated per clock cycle. Moreover, because of hash collisions, a large number of these blind memory reads are ultimately determined to be invalid matches. Such invalid reads not only occupy memory bandwidth but also waste dynamic power.
2. The Deflate standard supports dynamic Huffman coding, which requires that the entire data block first be scanned to count symbol frequencies, that an optimal Huffman tree be constructed from those statistics, and that the same data block finally be encoded with that tree. This "statistics first, tree building next, encoding last" data dependence limits the rate at which data can be processed. In conventional pipeline designs, when the front-end LZ77 module outputs data while the back end is performing a tree-building operation, the front end must stall until the back end is ready; conversely, the back end cannot begin encoding until the front-end statistics are complete. Such data-dependency stalls reduce hardware utilization and make it difficult for the system to sustain full throughput in dynamic coding mode.
3. The front end of a hardware compressor typically outputs fixed-length symbols at high parallelism (e.g., 32 symbols per cycle), while the back-end Huffman encoder outputs a variable-length bitstream. Seamlessly splicing the variable-bit-width codewords generated in each cycle into a compact bitstream within a single cycle requires a barrel shifter and bit-width alignment logic with very deep combinational logic levels. As parallelism increases, the complexity of this logic grows, and it often becomes the critical path of the overall design. This limits the highest operating frequency the hardware can achieve and makes it difficult for the back-end coding module to match the high throughput of the front-end LZ77 module, making it the short board that limits overall system performance.
Disclosure of Invention
The invention aims to solve the above technical problems and provides a hardware acceleration device and method for high-throughput lossless data compression. The invention significantly improves the throughput, the energy-efficiency ratio and the wor