CN-116860789-B - Data distribution optimization method and distributed database system


Abstract

The application discloses a data distribution optimization method and a distributed database system. The method comprises: obtaining the data distribution, query plans, and temporary data redistribution conditions of the nodes of the distributed database system; calculating an optimal distribution key by using a preset optimization algorithm, and correcting the distribution column selection of the related tables based on the query plans so as to optimize query performance; obtaining a query request and optimizing the related query by using the optimal distribution key and the corrected distribution column, so as to schedule the related query tasks to suitable data nodes; and completing data redistribution by using the scheduling result of the query task scheduling and the optimal distribution key, so as to migrate data with affinity to the same data node. By calculating the optimal distribution key and dynamically adjusting the distribution column selection, the application can optimize query plans and operations and improve query efficiency.

Inventors

ZOU RENLI
WAN XIANGBIN
GAO XUEYU
MIAO JIAN
LV XINJIE

Assignees

瀚高基础软件股份有限公司

Dates

Publication Date: 20260508
Application Date: 20230724

Claims (7)

1. A data distribution optimization method, comprising: acquiring the data distribution, query plans, and temporary data redistribution conditions of all nodes of a distributed database system; calculating an optimal distribution key by using a preset optimization algorithm based on the data distribution, the query plans, and the temporary data redistribution conditions, and correcting the distribution column selection of the related tables based on the query plans so as to optimize query performance; acquiring a query request, and optimizing the related query by using the optimal distribution key and the corrected distribution column so as to schedule the related query tasks to suitable data nodes; and completing data redistribution based on the scheduling result of the query task scheduling and the optimal distribution key, so as to migrate data with affinity to the same data node; wherein calculating the optimal distribution key by using the preset optimization algorithm based on the data distribution, the query plans, and the temporary data redistribution conditions comprises: acquiring the relevant key value indexes hit by the query plans in the distributed database system, and discretizing the relevant key value indexes to obtain their hit probabilities; counting the redistribution of temporary data in the distributed database system and recording the statistical result; and performing a weighted calculation, with weights in a preset proportion, on the hit probabilities of the relevant key value indexes and the statistical result to obtain the optimal distribution key.
2. The data distribution optimization method of claim 1, further comprising: receiving the optimized query task, performing the query operation on the corresponding data node, and returning the result to the query requester.
3. The data distribution optimization method according to claim 1, wherein the weighted calculation on the hit probabilities of the relevant key value indexes and the statistical result, which yields the optimal distribution key, satisfies: distribution key ranking score = α × related key value index + β × temporary data redistribution state index + γ × table association condition index, wherein α, β, and γ are the corresponding weight factors; the related key value index comprises the hit counts of keywords, the indexes used by queries, and the filter conditions of queries; the temporary data redistribution state index comprises the creation and use of temporary tables and the distribution uniformity of the data; and the table association condition index comprises the association types between tables and the usage frequency of the association fields.
4. The data distribution optimization method of claim 3, wherein correcting the distribution column selection of the related tables based on the query plans to optimize query performance comprises introducing a query execution time index and a data transfer volume index into the distribution key ranking score.
5. The data distribution optimization method of claim 1, wherein the optimal distribution key is re-evaluated and revised in the event of a change in data distribution and query patterns.
6. A distributed database system comprising a processor and a memory, the memory having stored thereon a computer program which, when executed by the processor, implements the steps of the data distribution optimization method of any of claims 1 to 5.
7. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the data distribution optimization method according to any of claims 1 to 5.

Description

Data distribution optimization method and distributed database system

Technical Field

The present application relates to the field of database technologies, and in particular to a data distribution optimization method and a distributed database system.

Background

In modern distributed database systems, data nodes store the actual data, which is distributed across them in shards by a hash algorithm. Distributed systems typically provide a way to handle physical skew, ensuring that data is evenly distributed across the nodes. However, query execution often requires join (associative query) operations, which need the related data to have affinity, i.e., to reside on the same node. Addressing physical skew alone does not avoid frequent cross-node queries, which degrades query efficiency.

Currently, some distributed database systems support temporary data redistribution: the system temporarily migrates data with affinity to the same node as needed, according to the query plan and optimization strategy, to reduce the number of cross-node queries and improve query efficiency. However, existing systems lack the ability to handle data distribution intelligently. They cannot dynamically correct and optimize the distribution according to query plans, temporary data redistribution conditions, and table association conditions, so optimal query performance cannot be achieved.

Disclosure of Invention

The embodiments of the application provide a data distribution optimization method and a distributed database system, which are used to optimize query plans and operations and to improve query efficiency.
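The affinity problem described in the Background can be illustrated with a small sketch. It is not part of the application: the table and column names and the `node_for` and `cross_node_pairs` helpers are hypothetical, and CRC32 merely stands in for whatever hash algorithm a real system uses. The sketch counts the joined row pairs whose two rows sit on different nodes under a given choice of distribution columns.

```python
import zlib


def node_for(value, n_nodes):
    """Map a distribution-key value to a node with a deterministic hash (CRC32 is an assumption)."""
    return zlib.crc32(str(value).encode("utf-8")) % n_nodes


def cross_node_pairs(left_rows, right_rows, join_col, left_dist_col, right_dist_col, n_nodes):
    """Count join matches whose left and right rows are stored on different nodes."""
    count = 0
    for left in left_rows:
        left_node = node_for(left[left_dist_col], n_nodes)
        for right in right_rows:
            if left[join_col] != right[join_col]:
                continue
            if node_for(right[right_dist_col], n_nodes) != left_node:
                count += 1
    return count


# Hypothetical tables: each order references exactly one customer.
customers = [{"customer_id": c} for c in range(1, 6)]
orders = [{"order_id": 100 + i, "customer_id": 1 + i % 5} for i in range(20)]

# Both tables distributed on the join column: matching rows share a node.
colocated = cross_node_pairs(customers, orders, "customer_id", "customer_id", "customer_id", 4)

# Orders distributed on order_id instead: matches may land on different nodes.
scattered = cross_node_pairs(customers, orders, "customer_id", "customer_id", "order_id", 4)
```

When both tables are distributed on the join column, every match is node-local, so `colocated` is 0 by construction. `scattered` counts the pairs that would need cross-node access or temporary redistribution.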
The embodiments of the application provide a data distribution optimization method, which comprises the following steps: acquiring the data distribution, query plans, and temporary data redistribution conditions of all nodes of a distributed database system; calculating an optimal distribution key by using a preset optimization algorithm based on the data distribution, the query plans, and the temporary data redistribution conditions, and correcting the distribution column selection of the related tables based on the query plans so as to optimize query performance; acquiring a query request, and optimizing the related query by using the optimal distribution key and the corrected distribution column so as to schedule the related query tasks to suitable data nodes; and completing data redistribution based on the scheduling result of the query task scheduling and the optimal distribution key, so as to migrate data with affinity to the same data node.

Optionally, the method further comprises: receiving the optimized query task, performing the query operation on the corresponding data node, and returning the result to the query requester.

Optionally, calculating the optimal distribution key by using the preset optimization algorithm based on the data distribution, the query plans, and the temporary data redistribution conditions comprises: acquiring the relevant key value indexes hit by the query plans in the distributed database system, and discretizing the relevant key value indexes to obtain their hit probabilities; counting the redistribution of temporary data in the distributed database system and recording the statistical result; and performing a weighted calculation, with weights in a preset proportion, on the hit probabilities of the relevant key value indexes and the statistical result to obtain the optimal distribution key.
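As a rough illustration of the weighted calculation just described, the following sketch treats the hit probability of a column as its relative frequency among the key hits observed in query plans, and treats the redistribution statistic as the share of temporary redistributions that re-hashed data on that column. These interpretations, the input format, the function names, and the weights 0.6 and 0.4 are all assumptions made for the example; the application does not specify them.

```python
from collections import Counter


def relative_frequencies(events):
    """Turn a list of column names into {column: share of events}."""
    counts = Counter(events)
    total = sum(counts.values())
    return {col: n / total for col, n in counts.items()} if total else {}


def pick_distribution_key(plan_key_hits, redistribution_cols, w_hit=0.6, w_redist=0.4):
    """Score each candidate column and return (best_column, scores).

    plan_key_hits: one column name per key hit observed in query plans.
    redistribution_cols: one column name per temporary redistribution,
        naming the column the data was re-hashed on.
    w_hit, w_redist: the preset proportion of weights (values are assumptions).
    """
    hit_prob = relative_frequencies(plan_key_hits)
    redist_share = relative_frequencies(redistribution_cols)
    scores = {
        col: w_hit * hit_prob.get(col, 0.0) + w_redist * redist_share.get(col, 0.0)
        for col in set(hit_prob) | set(redist_share)
    }
    if not scores:
        return None, scores
    # Sort candidates first so ties resolve deterministically by name.
    best = max(sorted(scores), key=scores.get)
    return best, scores


# Made-up observations for a hypothetical orders table.
plan_key_hits = ["customer_id"] * 6 + ["order_date"] * 3 + ["region"]
redistribution_cols = ["customer_id"] * 4 + ["region"]
best, scores = pick_distribution_key(plan_key_hits, redistribution_cols)
```

With these made-up counts, `customer_id` scores highest because it dominates both the plan hits and the temporary redistributions.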
Optionally, the weighted calculation on the hit probabilities of the relevant key value indexes and the statistical result, which yields the optimal distribution key, satisfies:

distribution key ranking score = α × related key value index + β × temporary data redistribution state index + γ × table association condition index

Here α, β, and γ are the corresponding weight factors. The related key value index comprises the hit counts of keywords, the indexes used by queries, and the filter conditions of queries; the temporary data redistribution state index comprises the creation and use of temporary tables and the distribution uniformity of the data; and the table association condition index comprises the association types between tables and the usage frequency of the association fields.

Optionally, correcting the distribution column selection of the related tables based on the query plans to optimize query performance comprises: introducing a query execution time index and a data transfer volume index into the distribution key ranking score.

Optionally, when the data distribution and query patterns change, the optimal distribution key is re-evaluated and revised.

The embodiments of the application also provide a distributed database system, which comprises a processor and a memory, wherein the memory stores a computer program, and the computer program realizes the steps