CN-116226281-B - Automatic database partitioning method and system based on depth map compression algorithm
Abstract
The invention discloses an automatic database partitioning method and system based on a depth map compression algorithm, wherein the method comprises the steps of obtaining the workload of query sentences in an input database, a data set and the query relation of a relation table; the method comprises the steps of constructing a graph model for selecting partitioning columns by utilizing the extracted workload characteristics, data characteristics and query relations, extracting self characteristics and sub-graph structure information of each data column node of the graph model by utilizing a depth graph compression model, inputting the self characteristics and the sub-graph structure information of each data column node into a partitioning column selection model to obtain the selected partitioning columns, and partitioning the table of a database based on the partitioning columns. The invention can effectively estimate the quality of the database partition.
Inventors
- LI GUOLIANG
- ZHOU XUANHE
Assignees
- 清华大学
Dates
- Publication Date
- 20260505
- Application Date
- 20230214
Claims (10)
- 1. An automatic database partitioning method based on a depth map compression algorithm is characterized by comprising the following steps: Acquiring the workload and the data set of query sentences in an input database and the query relation of a relation table; Constructing a graph model for selecting dividing columns by using the extracted workload characteristics, the data characteristics and the query relation, wherein the graph model takes the data columns as nodes, the node characteristics comprise table size, row selectivity, scanning operation frequency and aggregation operation frequency, edges are constructed among the nodes according to connection predicates among the data columns in the query statement, and edge weights are determined based on the occurrence frequency of the connection predicates and the predicate base; Extracting self features and sub-graph structure information of each data column node of the graph model by using a depth graph compression model, wherein the depth graph compression model performs feature extraction on a k-hop sub-graph structure of each node based on graph convolution, and maps the sub-graph structure information into a low-dimensional feature vector through forward propagation; Inputting the self-feature and sub-image structure information of each data column node into a partition column selection model to obtain a selected partition column, wherein the self-feature and sub-image structure information of each data column node is analyzed by utilizing correlation degree based on the low-dimensional feature vector to obtain a vertex set of preset conditions, the vertex set of the preset conditions comprises vertices, of which the correlation degree between each data column node and other nodes is larger than a first preset threshold value and the compression density is larger than a second preset threshold value, and table partitioning of a database is performed based on the partition column.
- 2. The method of claim 1, wherein the query performance of the database under the partition policies selected based on the partition columns is evaluated using a trained evaluation model to obtain query efficiency under different partition policies, the method further comprising: Offline training, namely acquiring system logs of different databases to generate a first data line graph, estimating the performance of the first data line graph by using an evaluation model, and updating graph weights in the evaluation model based on a performance estimation result and a gradient-descending loss value to train an initial evaluation model; And (3) online training, namely inputting the selected division columns into the initial evaluation model to generate a second data line graph, estimating delay and throughput based on the second data line graph, and optimizing network weights in the initial evaluation model according to delay and throughput estimation results to obtain a trained evaluation model.
- 3. The method of claim 2, wherein constructing a graph model for selecting a partitioning column using the extracted workload characteristics and data characteristics and the query relationship comprises: Respectively calculating the query characteristics and data characteristics of each data column node in the relation table, and And constructing edges between corresponding data column nodes according to the related data columns and the calling states in the query statement to construct a graph model.
- 4. A method according to claim 3, wherein extracting self-features and sub-graph structure information of each data column node of the graph model using a depth map compression model comprises: Performing feature selection and compression on the sub-graph structure of each data column node of the graph model by using a depth graph compression model to obtain self feature and sub-graph structure information of each data column node, and Mapping the sub-graph structure information to the low-dimensional feature vector through forward propagation to obtain a mapping result.
- 5. The method of claim 4, wherein inputting the self-feature and sub-graph structure information of each data column node into the partition column selection model to obtain the selected partition column comprises: and obtaining the division columns based on the vertex set.
- 6. An automatic database partitioning system based on a depth map compression algorithm, comprising: the data collection module is used for obtaining the workload of the query statement in the input database, the data set and the query relation of the relation table; The data preprocessing module is used for constructing a graph model for selecting dividing columns by utilizing the extracted workload characteristics, the data characteristics and the query relation, wherein the graph model takes the data columns as nodes, the node characteristics comprise table size, row selectivity, scanning operation frequency and aggregation operation frequency, edges are constructed among the nodes according to connection predicates among the data columns in the query statement, and edge weights are determined based on the occurrence frequency of the connection predicates and the predicate base; the depth map compression module is used for extracting self features and sub-graph structure information of each data column node of the map model by utilizing a depth map compression model, wherein the depth map compression model is used for carrying out feature extraction on a k-hop sub-map structure of each node based on map convolution and mapping the sub-graph structure information into a low-dimensional feature vector through forward propagation; The partitioning column selecting module is used for inputting the self-feature and sub-image structure information of each data column node into the partitioning column selecting model to obtain a selected partitioning column, wherein the self-feature and sub-image structure information of each data column node is analyzed by utilizing correlation degree based on the low-dimensional feature vector to obtain a vertex set of a preset condition, the vertex set of the preset condition comprises vertices, the correlation degree of each data column node and other nodes is larger than a first preset threshold value, and the compression density is larger than a second preset threshold value, and table partitioning of the database is carried out based on the partitioning column.
- 7. The system of claim 6, wherein after the partition column selection module, the system further comprises a performance evaluation model for evaluating query performance of the database under partition policies selected based on the partition column using the trained evaluation model to obtain query efficiency under different partition policies, comprising: Offline training, namely acquiring system logs of different databases to generate a first data line graph, estimating the performance of the first data line graph by using an evaluation model, and updating graph weights in the evaluation model based on a performance estimation result and a gradient-descending loss value to train an initial evaluation model; And (3) online training, namely inputting the selected division columns into the initial evaluation model to generate a second data line graph, estimating delay and throughput based on the second data line graph, and optimizing network weights in the initial evaluation model according to delay and throughput estimation results to obtain a trained evaluation model.
- 8. The system of claim 7, wherein the data preprocessing module comprises a data column encoding module and a graph generating module, wherein, The data column coding module is used for respectively calculating the query characteristic and the data characteristic of each data column node in the relation table, and And the graph generation module is used for constructing edges between corresponding data column nodes according to the related data columns and the calling states in the query statement so as to construct a graph model.
- 9. The system of claim 8, wherein the depth map compression module is further configured to: and carrying out feature selection and compression on the sub-graph structure of each data column node of the graph model to obtain self features and sub-graph structure information of each data column node, and mapping the sub-graph structure information to a low-dimensional feature vector through forward propagation to obtain a mapping result.
- 10. The system of claim 9, wherein the partition column selection module is further configured to: and obtaining the division columns based on the vertex set.
Description
Automatic database partitioning method and system based on depth map compression algorithm Technical Field The invention relates to the technical field of information retrieval, in particular to an automatic database partitioning method and system based on a depth map compression algorithm. Background Given a set of tables, a database partition aims at selecting a partition key (single or multiple columns) for each table and using that key to assign rows to different partitions. Database partitioning is critical in distributed databases in order to meet query performance requirements (e.g., high throughput, low latency). However, database partitioning is an NP problem, selecting partition keys from a large number of columns (e.g., millions of columns in a commercial database) to optimize query performance is very costly. A conventional exhaustive approach (e.g., dynamic programming) would enumerate each possible column combination and select the highest performing combination as the partition key, which is time consuming and difficult to use in practice. Heuristic methods search the maximum spanning tree of the table on the pattern graph, but cannot achieve high quality partitioning because they mainly take into account foreign key constraints between tables and cannot capture dependencies (e.g., equivalence links) between columns, greatly affecting query performance. Deep Reinforcement Learning (DRL) based database partitioning also has some limitations. First, it cannot capture rich data features (e.g., data distribution), column features (e.g., connection selectivity) and column dependencies, which greatly affects the advantage of using columns as partitioning keys. Second, it is difficult to accommodate different scenarios (e.g., different queries and data sets) because it only supports limited query templates and cannot capture data features. Third, they need to evaluate the quality of the partitioning strategy. The method of generating partitions and then evaluating the quality of the partitions is time consuming. For example, training the reinforcement learning model takes more than 13 hours. In addition, simple cost models do not effectively estimate the quality of a database partition, resulting in poor partition quality. Disclosure of Invention The present invention aims to solve at least one of the technical problems in the related art to some extent. Therefore, the invention aims to provide an automatic database partitioning method based on a depth map compression algorithm, which can improve query performance by selecting a partitioning strategy and can effectively estimate the quality of a database partition. It is another object of the present invention to provide an automatic database partitioning system based on a depth map compression algorithm. In order to achieve the above objective, in one aspect, the present invention provides an automatic database partitioning method based on a depth map compression algorithm, including: Acquiring the workload and the data set of query sentences in an input database and the query relation of a relation table; constructing a graph model for selecting the partition columns by using the extracted workload characteristics and the data characteristics and the query relation; Extracting self-characteristics and sub-graph structure information of each data column node of the graph model by using a depth graph compression model; And inputting the self characteristics and the sub-graph structure information of each data column node into a division column selection model to obtain a selected division column, and carrying out table partition on the database based on the division column. The automatic database partitioning method based on the depth map compression algorithm according to the embodiment of the invention can also have the following additional technical features: Further, in one embodiment of the present invention, the query performance of the database under the partition policy selected based on the partition column is evaluated by using a trained evaluation model to obtain the query efficiency under different partition policies, and the method further includes: Offline training, namely acquiring system logs of different databases to generate a first data line graph, estimating the performance of the first data line graph by using an evaluation model, and updating graph weights in the evaluation model based on a performance estimation result and a gradient-descending loss value to train an initial evaluation model; And (3) online training, namely inputting the selected division columns into the initial evaluation model to generate a second data line graph, estimating delay and throughput based on the second data line graph, and optimizing network weights in the initial evaluation model according to delay and throughput estimation results to obtain a trained evaluation model. Further, in an embodiment of the present invention, the constructing a grap