CN-121984518-A - Self-adaptive coding compression method and system for columnar database

Abstract

The invention relates to a self-adaptive coding compression method and system for a columnar database, belongs to the technical field of data compression, and addresses the low compression ratio and poor query performance caused by fixed compression algorithms and heavy reliance on manual work. The method comprises: partitioning the collected original data into data blocks, and extracting a multidimensional feature vector from each dynamic column in each data block; inputting the multidimensional feature vector of each dynamic column into a trained prediction model to obtain a prediction result, wherein the prediction result comprises a compression algorithm selection probability, a compression ratio prediction value and a decompression overhead level; and selecting an optimal compression algorithm for each dynamic column through a multi-objective decision method based on its prediction result, and then executing the compression operation. The compression speed and accuracy for massive data are thereby improved.

Inventors

She Kangping
Zhong Maoheng

Assignees

北京凝思软件股份有限公司

Dates

Publication Date: 20260505
Application Date: 20251229

Claims (10)

1. A self-adaptive coding compression method for a columnar database, characterized by comprising the following steps: partitioning the collected original data into data blocks, and extracting a multidimensional feature vector from each dynamic column in each data block; inputting the multidimensional feature vector of each dynamic column into a trained prediction model to obtain a prediction result, wherein the prediction result comprises a compression algorithm selection probability, a compression ratio prediction value and a decompression overhead level; and selecting an optimal compression algorithm for each dynamic column by a multi-objective decision method based on the prediction result of that dynamic column, and then executing the compression operation.
2. The self-adaptive coding compression method for a columnar database according to claim 1, wherein partitioning the collected original data comprises: partitioning the original data based on a preset time window, entity identifiers and a quantity threshold to form data blocks, each data block covering a single entity within a single time window and carrying a sequence number, wherein the sequence numbers are generated by splitting the data sequentially according to the quantity threshold; and generating a globally unique block identifier for each data block.
3. The method according to claim 2, wherein each data block includes the attributes and values of static columns and the attributes and value sequences of dynamic columns, and the multidimensional feature vector includes statistical features and semantic features.
4. The method of claim 3, wherein the statistical features are obtained by calculating a unique-value ratio, a monotonicity score, a differential variance and an average run length from the value sequence of a dynamic column, and the semantic features are obtained by converting the values of a plurality of static columns into normalized scalar values via a predefined mapping table.
5. The self-adaptive coding compression method for a columnar database according to claim 1, wherein the prediction model comprises, in sequence, a feature extraction module, a gating routing module, a multi-branch prediction module and a weighted fusion module; the feature extraction module maps the multidimensional feature vector of each dynamic column into a high-dimensional feature vector and passes it to the gating routing module and the multi-branch prediction module; the gating routing module calculates a routing weight for each branch of the multi-branch prediction module from the high-dimensional feature vector and passes the routing weights to the weighted fusion module; each branch of the multi-branch prediction module performs multi-task prediction in parallel on the same high-dimensional feature vector, generates its own initial prediction result and passes it to the weighted fusion module; and the weighted fusion module fuses the initial prediction results of the branches, weighted by their routing weights, to generate the final prediction result.
6. The self-adaptive coding compression method for a columnar database according to claim 1 or 5, wherein the prediction model is trained within a multi-task learning framework by minimizing a weighted multi-task loss function, and the loss function comprises a compression algorithm classification loss, a compression ratio prediction loss, a decompression overhead classification loss and a routing distribution regularization loss.
7. The method of claim 1, wherein selecting an optimal compression algorithm for the dynamic column by a multi-objective decision method comprises: screening out, based on the compression algorithm selection probabilities output by the prediction model, a plurality of candidate algorithms with the highest probabilities; filtering the candidate algorithms according to preset business constraints; if no candidate algorithm remains after filtering, taking a default compression algorithm as the optimal compression algorithm; if exactly one candidate algorithm remains, taking it directly as the optimal compression algorithm; otherwise, calculating a composite score for each remaining candidate algorithm from its compression ratio prediction value and decompression overhead level, and selecting the candidate algorithm with the highest composite score as the optimal compression algorithm for the dynamic column.
8. The method of claim 7, wherein the composite score is calculated from the compression ratio prediction value and the decompression overhead level of each filtered candidate algorithm by a formula [the formula and its symbols are not reproduced in this text], whose terms denote: the index of the filtered candidate algorithm; the compression ratio prediction value of that candidate; the estimated time consumption mapped from its decompression overhead level; the historical maximum compression ratio and decompression time, used for normalization; and the compression efficiency weight and the query performance weight, both of which are greater than 0.
9. The self-adaptive coding compression method for a columnar database according to claim 3, wherein executing the compression operation comprises: generating a table-creation statement for each data block according to the optimal compression algorithm of each dynamic column, wherein the statement explicitly specifies the compression algorithm attribute used by each dynamic column and explicitly specifies that the storage mode of the table is columnar; and creating a columnar storage table from the table-creation statement, calling the corresponding coding compression library to compress the value sequence of each dynamic column according to its optimal compression algorithm, and then importing the compressed value sequence into the columnar storage table to complete storage.
10. A self-adaptive coding compression system for a columnar database, comprising: a data feature extraction module, configured to partition the collected original data into data blocks and extract a multidimensional feature vector from each dynamic column in each data block; a compression algorithm prediction module, configured to input the multidimensional feature vector of each dynamic column into a trained prediction model to obtain a prediction result, wherein the prediction result comprises a compression algorithm selection probability, a compression ratio prediction value and a decompression overhead level; and a data compression storage module, configured to select an optimal compression algorithm for each dynamic column through a multi-objective decision method based on the prediction result of that dynamic column, and then execute the compression operation.
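
The multi-objective decision described in claims 7 and 8 can be illustrated with a minimal Python sketch. It is not the patented implementation. The scoring formula does not survive in this text, so the form used here (a weighted sum of the normalized predicted compression ratio and one minus the normalized estimated decompression time) is an assumption, as are the level-to-time mapping `LEVEL_TO_MS`, the business constraint (a cap on the overhead level), the default algorithm name and all numeric values.

```python
from dataclasses import dataclass

# Hypothetical mapping from decompression overhead level to estimated time (ms).
LEVEL_TO_MS = {0: 1.0, 1: 5.0, 2: 20.0}


@dataclass(frozen=True)
class Candidate:
    name: str
    probability: float   # selection probability output by the prediction model
    ratio: float         # predicted compression ratio
    overhead_level: int  # predicted decompression overhead level


def composite_score(c, max_ratio, max_ms, w_ratio, w_query):
    # Assumed form: reward a high normalized compression ratio and
    # penalize a high normalized decompression time.
    time_term = 1.0 - LEVEL_TO_MS[c.overhead_level] / max_ms
    return w_ratio * (c.ratio / max_ratio) + w_query * time_term


def select_algorithm(candidates, *, top_k=3, max_level=2, default="lz4",
                     max_ratio=20.0, max_ms=20.0, w_ratio=0.6, w_query=0.4):
    # Step 1: keep the top_k candidates by selection probability.
    top = sorted(candidates, key=lambda c: c.probability, reverse=True)[:top_k]
    # Step 2: apply the (hypothetical) business constraint.
    kept = [c for c in top if c.overhead_level <= max_level]
    # Step 3: fall back, short-circuit, or score the remaining candidates.
    if not kept:
        return default
    if len(kept) == 1:
        return kept[0].name
    return max(
        kept,
        key=lambda c: composite_score(c, max_ratio, max_ms, w_ratio, w_query),
    ).name


candidates = [
    Candidate("delta", 0.50, 10.0, 0),
    Candidate("zstd", 0.30, 16.0, 2),
    Candidate("rle", 0.15, 4.0, 0),
    Candidate("dict", 0.05, 18.0, 1),
]
print(select_algorithm(candidates))
print(select_algorithm(candidates, w_ratio=0.95, w_query=0.05))
```

With the default weights the sketch favours the cheap-to-decompress candidate; shifting weight toward compression efficiency moves the choice to the candidate with the higher predicted ratio, which is the trade-off the two weights are meant to express.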

Description

Self-adaptive coding compression method and system for columnar database

Technical Field

The invention relates to the technical field of data compression, and in particular to a self-adaptive coding compression method and system for a columnar database.

Background

In scenarios such as data analysis and batch queries, a columnar database stores and organizes data by column rather than by row, which significantly reduces I/O overhead and improves query efficiency. Columnar databases are therefore widely used in data warehouses, real-time analytics and similar systems. With massive data, however, two core problems remain: high storage cost, and performance degradation caused by compression and decompression latency.

In the prior art, coding compression schemes for columnar databases mainly adopt a static, fixed strategy: a coding mode is designated uniformly for a specific data type or for a whole table (for example, Parquet uses Snappy general-purpose compression for all columns by default, and ClickHouse applies Delta encoding to numeric columns in a fixed manner), or a coding mode is preset manually by operation and maintenance personnel based on their analysis of the data characteristics.

However, the prior art has the following drawbacks, which make it difficult to reach an optimal balance between storage efficiency and query performance in complex and variable data scenarios:

A fixed strategy cannot adapt finely to the data characteristics of different columns (such as the high-cardinality discrete values of a user ID column versus the continuous time-series values of a power voltage column), so the compression ratio is low or the decompression cost is too high. In complex deployments with many tables and thousands of columns, relying on operation and maintenance personnel to analyze data characteristics and select coding strategies manually is inefficient, and insufficient experience easily leads to mismatched strategy choices.
Moreover, when data characteristics change as the business develops or over time, a static strategy cannot perceive the change or adjust automatically, so a compression strategy that was effective earlier suffers a significant drop in compression ratio or query performance.

Disclosure of Invention

In view of the above analysis, embodiments of the invention aim to provide a self-adaptive coding compression method and system for a columnar database, to address the low compression ratio and poor query performance caused by fixed compression algorithms and heavy reliance on manual work.

In one aspect, an embodiment of the invention provides a self-adaptive coding compression method for a columnar database, comprising the following steps:

partitioning the collected original data into data blocks, and extracting a multidimensional feature vector from each dynamic column in each data block;

inputting the multidimensional feature vector of each dynamic column into a trained prediction model to obtain a prediction result, wherein the prediction result comprises a compression algorithm selection probability, a compression ratio prediction value and a decompression overhead level;

and selecting an optimal compression algorithm for each dynamic column by a multi-objective decision method based on the prediction result of that dynamic column, and then executing the compression operation.

As a further improvement of the method, partitioning the collected original data comprises: partitioning the original data based on a preset time window, entity identifiers and a quantity threshold to form data blocks, each data block covering a single entity within a single time window and carrying a sequence number, wherein the sequence numbers are generated by splitting the data sequentially according to the quantity threshold; and generating a globally unique block identifier for each data block.
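
The partitioning step can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the row layout `(entity_id, unix_timestamp, payload)`, the `Block` type, the fixed-width window and the `entity:window_start:seq` identifier format are all hypothetical. The identifier is unique only as long as entity identifiers are themselves unique; a UUID would be another option.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Block:
    block_id: str      # assumed format: "<entity>:<window_start>:<seq>"
    entity: str
    window_start: int  # start of the time window, in Unix seconds
    seq: int           # sequence number within the (entity, window) group
    rows: list


def partition(rows, window_seconds=3600, max_rows=1000):
    """Split (entity_id, unix_timestamp, payload) rows into data blocks."""
    # Group rows by entity and by the time window their timestamp falls into.
    groups = defaultdict(list)
    for entity, ts, payload in rows:
        window_start = ts - ts % window_seconds
        groups[(entity, window_start)].append((ts, payload))

    blocks = []
    for (entity, window_start), items in sorted(groups.items(),
                                                key=lambda kv: kv[0]):
        items.sort(key=lambda item: item[0])  # order rows by timestamp
        # Cut each group into chunks of at most max_rows rows; the chunk
        # index serves as the sequence number.
        for seq, start in enumerate(range(0, len(items), max_rows)):
            block_id = f"{entity}:{window_start}:{seq}"
            chunk = items[start:start + max_rows]
            blocks.append(Block(block_id, entity, window_start, seq, chunk))
    return blocks


rows = [("a", 10, 1), ("a", 20, 2), ("a", 30, 3), ("a", 3700, 4), ("b", 15, 5)]
blocks = partition(rows, window_seconds=3600, max_rows=2)
print([b.block_id for b in blocks])
```

In this example, entity "a" has three rows in the first hour, so with a threshold of two rows that group is split into two blocks with sequence numbers 0 and 1.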
As a further improvement of the method, each data block comprises the attributes and values of static columns and the attributes and value sequences of dynamic columns, and the multidimensional feature vector comprises statistical features and semantic features.

As a further improvement of the method, the statistical features are obtained by calculating the unique-value ratio, monotonicity score, differential variance and average run length from the value sequence of a dynamic column, and the semantic features are obtained by converting the values of a plurality of static columns into normalized scalar values through a predefined mapping table.

The prediction model comprises, in sequence, a feature extraction module, a gating routing module, a multi-branch prediction module and a weighted fusion module, wherein the feature extraction module maps the multidimensional feature vector of each dynamic column into a high-dimensional feature vector and then passes the high-dimensional feature