CN-121984517-A - Dictionary learning-based multi-source heterogeneous data compression method and device

Abstract

A multi-source heterogeneous data compression method and device based on dictionary learning, comprising: (1) collecting multi-source data files from a plurality of compression devices and performing dictionary learning on all collected files to obtain corresponding compression dictionaries; (2) clustering compression dictionaries with similar data distribution characteristics using the DBSCAN algorithm, thereby dividing the files into a plurality of file sets with similar data characteristics, and generating a corresponding static compression dictionary for each file set by a second round of dictionary learning on all data files in the set; (3) in the initial stage of data compression, using a ZSTD compressor to select a suitable static dictionary according to the data distribution characteristics of the data blocks and compress the data, performing dynamic dictionary learning on the already-compressed data to obtain a new compression dictionary, and switching dictionaries when the compression performance of the new dictionary exceeds that of the current one by a preset increment; (4) after compression is complete and the data has been transmitted to the decompression end, using a ZSTD decompressor to look up the corresponding compression dictionary in a compression dictionary resource pool, decompressing to obtain the original data, and verifying the decompressed data against the check code of the data. By adapting to the data characteristics of multi-source heterogeneous data through dictionary learning, the method further improves the compression effect.
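
As an illustration of step (4), the following sketch shows a dictionary pool lookup followed by a check-code comparison. It is a minimal sketch under stated assumptions, not the patented implementation: Python's standard `zlib` module with a preset dictionary stands in for the ZSTD compressor and decompressor, and the 8-byte header layout (`HEADER`), the use of a single dictionary ID per block, the choice of CRC-32 as the check code, and the names `pack_block` and `unpack_block` are all hypothetical.

```python
import struct
import zlib

# Hypothetical frame header: 4-byte dictionary ID, then 4-byte CRC-32 of the
# original data. A real ZSTD frame carries its own dictionary ID field.
HEADER = struct.Struct(">II")


def pack_block(data, dict_id, pool):
    """Compress `data` with the pooled dictionary and prepend the header."""
    compressor = zlib.compressobj(zdict=pool[dict_id])  # zlib stands in for ZSTD
    payload = compressor.compress(data) + compressor.flush()
    return HEADER.pack(dict_id, zlib.crc32(data)) + payload


def unpack_block(frame, pool):
    """Look up the dictionary named in the header, decompress, then verify."""
    dict_id, expected_crc = HEADER.unpack_from(frame)
    if dict_id not in pool:
        raise KeyError(f"dictionary {dict_id} is not in the pool")
    decompressor = zlib.decompressobj(zdict=pool[dict_id])
    data = decompressor.decompress(frame[HEADER.size:]) + decompressor.flush()
    if zlib.crc32(data) != expected_crc:
        raise ValueError("check code mismatch: decompression error")
    return data


# Example pool with two illustrative dictionaries.
pool = {1: b"temperature=;humidity=;", 2: b"GET /index.html HTTP/1.1"}
frame = pack_block(b"temperature=21.5;humidity=40;", 1, pool)
restored = unpack_block(frame, pool)
```

Keeping the dictionary ID in the header lets the decompression end stay stateless: it needs only the shared pool, not any record of which dictionary the compressor had selected.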

Inventors

GAO YI
ZHANG JIANAN
DONG WEI
LV JIAMEI

Assignees

Zhejiang University

Dates

Publication Date: 20260505
Application Date: 20251226

Claims (9)

1. A multi-source heterogeneous data compression method based on dictionary learning, characterized by comprising the following steps: (1) collecting multi-source data files from a plurality of compression devices, and performing dictionary learning on all the collected multi-source data files using a ZSTD compressor to obtain corresponding compression dictionaries, which serve as input to the subsequent dictionary clustering; (2) clustering compression dictionaries with similar data distribution characteristics using the DBSCAN algorithm, thereby dividing the files into a plurality of file sets with similar data characteristics, and generating a corresponding static compression dictionary for each file set by a second round of dictionary learning on all data files in the set, the static compression dictionaries being used in the initial stage of the subsequent compression process to mitigate the drop in compression ratio caused by cold start; (3) in the initial stage of data compression, using a ZSTD compressor to select a suitable static dictionary according to the data distribution characteristics of the data blocks and compress the data, and performing dynamic dictionary learning on the already-compressed data to obtain a new compression dictionary; (4) after compression is complete and the compressed data has been transmitted to a decompression end, using a ZSTD decompressor to look up the corresponding compression dictionary in a compression dictionary resource pool by means of the compression dictionary ID sequence in the frame header of the compressed data block, decompressing with that dictionary to obtain the original data, and verifying the original data against the check code of the data, thereby completing decompression of the compressed data.
2. The dictionary learning-based multi-source heterogeneous data compression method of claim 1, wherein step (1) specifically comprises: (1.1) collecting multi-source heterogeneous data files from compression devices in different scenarios, wherein the file types include common file types; and (1.2) performing dictionary learning on each collected file individually using a ZSTD compressor to obtain a compression dictionary for each file, for use in the subsequent compression dictionary clustering.
3. The method of multi-source heterogeneous data compression based on dictionary learning of claim 2, wherein the number of multi-source heterogeneous data files in the step (1.1) is greater than 200.
4. The dictionary learning-based multi-source heterogeneous data compression method of claim 1, wherein step (2) specifically comprises: (2.1) using the data distribution characteristics of the compression dictionaries, namely byte count, high-frequency symbols and entropy value, to cluster the different files with the DBSCAN algorithm so as to form a plurality of file sets with similar data distributions; and (2.2) performing dictionary learning on all original file data in each file set, finally generating a plurality of static compression dictionaries with different data distribution characteristics, which are used to mitigate the drop in compression performance caused by cold start in the initial stage of compression.
5. The dictionary learning-based multi-source heterogeneous data compression method of claim 1, wherein step (3) specifically comprises: (3.1) in the initial stage of compression, selecting the most suitable static compression dictionary from the static dictionaries according to the data distribution characteristics of the data blocks to be compressed, and compressing with it; (3.2) during compression, when the compressed data reaches a preset proportion of the whole file, performing, by the ZSTD compressor in other threads, dictionary learning on the already-compressed file data to obtain a compression dictionary for the file; and (3.3) after learning of the new compression dictionary is complete, comparing the performance of the two dictionaries on new data blocks using a ZSTD compressor, switching to the new dictionary to compress subsequent data when the compression performance of the new dictionary continuously exceeds that of the original dictionary by a preset increment, and otherwise continuing to compress with the original compression dictionary.
6. The method of multi-source heterogeneous data compression based on dictionary learning of claim 1, wherein the preset proportion in the step (3) is 20% and the preset increment is 10%.
7. The dictionary learning-based multi-source heterogeneous data compression method of claim 1, wherein step (4) specifically comprises: (4.1) parsing the frame header of the data block to be decompressed using a ZSTD decompressor to obtain the compression dictionary ID sequence in the frame header; (4.2) based on the parsed dictionary ID sequence, looking up the corresponding compression dictionary in the compression dictionary resource pool using the ZSTD decompressor, and decompressing with that dictionary to restore the original data; and (4.3) verifying the decompressed data, wherein decompression is correct if the verification result matches the check code in the frame header, and a decompression error is reported otherwise.
8. A dictionary learning-based multi-source heterogeneous data compression device, comprising a memory and one or more processors, wherein the memory stores executable code, and the one or more processors are configured to implement the dictionary learning-based multi-source heterogeneous data compression method of any one of claims 1-7 when the executable code is executed.
9. A computer-readable storage medium, having stored thereon a program which, when executed by a processor, implements a dictionary learning-based multi-source heterogeneous data compression method as claimed in any one of claims 1 to 7.
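
The clustering step recited in claim 4 can be illustrated with a short sketch. It is a hedged example under stated assumptions, not the claimed implementation: the feature vector built by `dictionary_features` (log-scaled byte count, Shannon entropy, and the frequency shares of the `top_k` most common byte values as a proxy for "high-frequency symbols") is one hypothetical encoding of the three characteristics, the `dbscan` function is a minimal textbook DBSCAN written in plain Python, and the `eps` and `min_pts` values are illustrative only.

```python
import math
from collections import Counter


def dictionary_features(dictionary, top_k=4):
    """Hypothetical feature vector: log2 byte count, entropy, top-k byte shares."""
    n = len(dictionary)
    counts = Counter(dictionary)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    shares = [c / n for _, c in counts.most_common(top_k)]
    shares += [0.0] * (top_k - len(shares))  # pad when fewer than top_k symbols
    return [math.log2(n + 1), entropy, *shares]


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one cluster label per point, -1 for noise."""
    unseen, noise = None, -1
    labels = [unseen] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unseen:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not unseen:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels


# Two text-like and two binary-like stand-ins for per-file dictionaries.
samples = [
    b"temperature=21.5;humidity=40;" * 20,
    b"temperature=19.0;humidity=55;" * 20,
    bytes(range(256)) * 3,
    bytes(range(256)) * 3 + bytes(range(10)),
]
labels = dbscan([dictionary_features(s) for s in samples], eps=1.0, min_pts=2)
```

Files whose dictionaries receive the same label would form one file set for the second round of dictionary learning; points labelled -1 are treated as noise by DBSCAN and would need separate handling.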

Description

Dictionary learning-based multi-source heterogeneous data compression method and device

Technical Field

The invention relates to a dictionary learning-based multi-source heterogeneous data compression method and device.

Background Art

With the development of fields such as the IoT and the industrial internet, multi-source heterogeneous data (including numerical values, images, text and the like) has become a core support for scenario applications. Such data is large in volume and has high real-time transmission requirements, which places strict demands on compression efficiency. However, traditional compression methods have obvious shortcomings. First, single-modality algorithms (such as DEFLATE, which suits numerical values, and JPEG, which suits images) must be switched frequently, which increases system complexity and lowers overall compression efficiency. Second, general-purpose compression frameworks do not fully exploit the characteristics of heterogeneous data or its cross-modal correlations, so redundancy is not completely eliminated. Third, fixed block-division strategies do not match the differences in generation frequency and density of multi-source data, making it difficult to balance compression ratio and real-time performance. These problems lead to low compression efficiency and high resource consumption in traditional methods and have become a bottleneck for large-scale deployment in the field, so an efficient compression scheme adapted to multi-source heterogeneous data is needed.

Disclosure of Invention

To overcome the low efficiency of multi-source heterogeneous data compression and provide a more efficient compression method, the invention provides a multi-source heterogeneous data compression method and device based on dictionary learning.
According to the method, the data distribution characteristics of the heterogeneous data are learned by means of dictionary learning, and the matched dictionary is generated so as to improve the compression effect of the data. In general, the compression framework of the system consists of a static dictionary pre-training module, a dictionary self-adaption and compression module and a decompression module. The static dictionary pre-training module is used for providing a trained static dictionary for data compression in the initial stage of compression and relieving the compression rate reduction caused by cold start in the initial stage. The dictionary self-adaption and compression module selects an optimal static dictionary based on the data distribution characteristics of the data block in the initial stage of compression, and simultaneously carries out dynamic dictionary training on the compressed data to obtain a dictionary which is more suitable for the data block and carries out dictionary switching when necessary. The decompression module retrieves the corresponding compression dictionary based on the compression block and performs a decompression process to obtain the original data. 
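
The switching behaviour of the dictionary self-adaption and compression module can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the patented implementation: Python's standard `zlib` with a preset dictionary stands in for the ZSTD compressor, "compression performance" is taken to mean relative reduction in compressed size, and the class `DictionarySwitcher`, the helper `should_start_training`, and the `required_streak` parameter (how many consecutive blocks the new dictionary must win) are hypothetical. The 20% and 10% defaults mirror the preset proportion and preset increment given in claim 6.

```python
import zlib


def compressed_size(block, dictionary):
    """Size of `block` compressed with a preset dictionary (zlib stands in for ZSTD)."""
    compressor = zlib.compressobj(zdict=dictionary)
    return len(compressor.compress(block) + compressor.flush())


def should_start_training(bytes_done, bytes_total, proportion=0.20):
    """True once the compressed share of the file reaches the preset proportion."""
    return bytes_done >= proportion * bytes_total


class DictionarySwitcher:
    """Hypothetical policy: switch after the candidate wins `required_streak` blocks."""

    def __init__(self, current, min_gain=0.10, required_streak=3):
        self.current = current          # static dictionary chosen at start
        self.candidate = None           # dynamically learned dictionary, if any
        self.min_gain = min_gain
        self.required_streak = required_streak
        self.streak = 0

    def offer(self, candidate):
        """Register a newly learned dictionary for comparison."""
        self.candidate, self.streak = candidate, 0

    def compress(self, block):
        """Measure one block with the current dictionary, then update the decision."""
        used = self.current
        size_current = compressed_size(block, self.current)
        if self.candidate is not None:
            size_new = compressed_size(block, self.candidate)
            gain = (size_current - size_new) / size_current
            self.streak = self.streak + 1 if gain > self.min_gain else 0
            if self.streak >= self.required_streak:
                # The candidate becomes current for all subsequent blocks.
                self.current, self.candidate, self.streak = self.candidate, None, 0
        return used, size_current


# A block with no internal repetition, so only a matching dictionary helps.
block = bytes((i * 37 + 11) % 251 for i in range(200))
static_dictionary = b"\x00" * 64
switcher = DictionarySwitcher(static_dictionary, required_streak=2)
switcher.offer(block)  # pretend dynamic learning produced a well-matched dictionary
```

Requiring several consecutive wins before switching is one way to read the "continuous" condition in the claims; it avoids switching on a single atypical block.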
The first aspect of the invention relates to a dictionary learning-based multi-source heterogeneous data compression method, which comprises the following steps: (1) collecting multi-source data files from a plurality of compression devices, and performing dictionary learning on all the collected multi-source data files using a ZSTD compressor to obtain corresponding compression dictionaries, which serve as input to the subsequent dictionary clustering; (2) clustering compression dictionaries with similar data distribution characteristics using the DBSCAN algorithm, thereby dividing the files into a plurality of file sets with similar data characteristics, and generating a corresponding static compression dictionary for each file set by a second round of dictionary learning on all data files in the set, the static compression dictionaries being used in the initial stage of the subsequent compression process to mitigate the drop in compression ratio caused by cold start; (3) in the initial stage of data compression, using a ZSTD compressor to select a suitable static dictionary according to the data distribution characteristics of the data blocks and compress the data, and performing dynamic dictionary learning on the already-compressed data to obtain a new compression dictionary; (4) after compression is complete and the compressed data has been transmitted to a decompression end, using a ZSTD decompressor to look up the corresponding compression dictionary in a compression dictionary resource pool by means of the compression dictionary ID sequence in the frame header of the compressed data block, decompressing with that dictionary to obtain the original data, and verifying the original data against the check code of the data, thereby completing decompression of the compressed data. Step (1) specifically includes: firstly, collecting multi-source heterogeneous data files from compression devices in different scenarios, wherein the file types comprise common