CN-121979891-A - Index tuning method for big data analysis
Abstract
The invention relates to an index tuning method for big data analysis, belongs to the technical field of database indexes, and solves the problem that prior-art methods cannot handle index selection and space allocation jointly, which degrades the tuning effect. The method comprises the steps of: extracting global state data, wherein the global state data comprises a workload set, a candidate index set and data block distribution characteristics; constructing a hierarchical reinforcement learning model; training the hierarchical reinforcement learning model based on the global state data to obtain a trained hierarchical reinforcement learning model; acquiring state data to be optimized; inputting the state data to be optimized into the trained hierarchical reinforcement learning model to obtain a selected index; and allocating storage space for the selected index; wherein the hierarchical reinforcement learning model comprises a high-level policy network for index selection and a low-level policy network for allocating space for the selected index. Efficient tuning of the data skipping index is thereby achieved.
Inventors
- Tong Yulai
- Miao Jiale
- Liu Meihan
- Yan Qiuyan
- Zhang Xianghe
Assignees
- 中国矿业大学 (China University of Mining and Technology)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-04-08
Claims (10)
- 1. An index tuning method for big data analysis, characterized by comprising the following steps: extracting global state data, wherein the global state data comprises a workload set, a candidate index set and data block distribution characteristics; constructing a hierarchical reinforcement learning model, wherein the hierarchical reinforcement learning model comprises a high-level policy network for index selection and a low-level policy network for allocating space for the selected indexes; training the hierarchical reinforcement learning model based on the global state data to obtain a trained hierarchical reinforcement learning model; and acquiring state data to be optimized, inputting the state data to be optimized into the trained hierarchical reinforcement learning model to obtain a selected index, and allocating storage space for the selected index.
- 2. The index tuning method for big data analysis according to claim 1, wherein the low-level policy network is trained as follows: S311, randomly selecting one workload from the workload set as the current workload, and randomly selecting one index from the candidate index set as the current index; S312, constructing current first state data based on the current workload and the current index; S313, inputting the current first state data into the low-level policy network to obtain a first action, calculating a reward value and obtaining the first state data of the next time step; S314, if the calculated reward value is the same as the reward values of the previous k time steps, judging whether the training of the low-level policy network has reached the preset number of times; if so, ending the training; otherwise, completing the space allocation of the current index and returning to step S311; and if the calculated reward value differs from the reward values of the previous k time steps, taking the first state data of the next time step as the current first state data and returning to step S313.
- 3. The index tuning method for big data analysis according to claim 2, wherein the reward obtained by the low-level policy network at the t-th time step is calculated by the following formulas: r_t = Δcost_t − λ·(ΔS_t − s_t); Δcost_t = (cost(W, I_{t−1}) − cost(W, I_t)) / C_0; wherein Δcost_t denotes the relative load reduction; cost(W, I_{t−1}) denotes the overhead of executing the current workload W on the index set created at the (t−1)-th time step of the low-level policy network; cost(W, I_t) denotes the overhead of executing the current workload W on the index set created at the t-th time step of the low-level policy network; I_t denotes the index set created at the t-th time step of the low-level policy network; I_{t−1} denotes the index set created at the (t−1)-th time step of the low-level policy network; C_0 denotes the initial overhead; ΔS_t denotes the amount of change in storage overhead; s_t denotes the size of the storage space obtained from other indexes when the index of the t-th time step of the low-level policy network is created; and λ denotes the penalty coefficient.
- 4. The index tuning method for big data analysis according to claim 2, wherein the current first state data comprises load characteristics related to the current index, distribution characteristics of the data block where the current index is located, the attribute column where the current index is located, current storage resource information, and currently generated index information.
- 5. The index tuning method for big data analysis according to claim 1, wherein the action space of the low-level policy network comprises increasing the storage space of the current index, decreasing the storage space of the current index, and reallocating storage space between indexes.
- 6. The index tuning method for big data analysis according to claim 1, wherein the high-level policy network is trained as follows: S321, randomly selecting one workload from the workload set as the current workload; S322, constructing current second state data based on the current workload, and inputting the current second state data into the high-level policy network to obtain a second action; S323, obtaining the storage space allocated for the current index corresponding to the second action based on the trained low-level policy network, and calculating a reward value; S324, if no effective action can be executed, judging whether the training of the high-level policy network has reached the preset number of training times; if so, ending the training; otherwise, returning to step S321.
- 7. The index tuning method for big data analysis according to claim 6, wherein the reward value of the high-level policy network is calculated by the following formulas: r_t = Δcost_t − λ·(ΔS_t − s_t); Δcost_t = (cost(W, I_{t−1}) − cost(W, I_t)) / C_0; wherein Δcost_t denotes the relative load reduction; cost(W, I_{t−1}) denotes the overhead of executing the current workload W on the index set created at the (t−1)-th time step of the high-level policy network; cost(W, I_t) denotes the overhead of executing the current workload W on the index set created at the t-th time step of the high-level policy network; C_0 denotes the initial overhead; s_t denotes the size of the storage space obtained from other indexes when the index of the t-th time step of the high-level policy network is created; λ denotes the penalty coefficient; and ΔS_t denotes the amount of change in storage overhead.
- 8. The index tuning method for big data analysis according to claim 6, wherein the high-level policy network adopts a PPO model with an attention layer; and the selection features of the current workload on each data block are concatenated to serve as the keys and values of the attention layer for attention calculation.
- 9. The index tuning method for big data analysis according to claim 8, wherein the selection features of the current workload on each data block are obtained as follows: determining a bit vector for the current workload on each attribute column based on the number of partitions of each attribute column and the numerical range of each partition, so as to obtain the histogram of the current workload; and performing an AND operation between the histogram of the current workload and the histogram of each data block to obtain the selection features of the current workload on each data block.
- 10. The index tuning method for big data analysis according to claim 1, wherein the distribution characteristics of the data blocks comprise a maximum value, a minimum value, a mean value, and a histogram of each data block.
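The selection-feature computation of claims 8 and 9 can be sketched as a bitwise AND between a per-column workload histogram and each block's histogram. The following is a minimal illustration assuming boolean per-partition histograms; the function names and data layout are hypothetical, not the patent's implementation:

```python
import numpy as np

def workload_histogram(predicate_ranges, partition_bounds):
    """For each attribute column, mark which value partitions the workload's
    predicate range [lo, hi] can touch (the workload histogram of claim 9)."""
    hist = {}
    for col, (lo, hi) in predicate_ranges.items():
        bounds = partition_bounds[col]  # list of (partition_lo, partition_hi)
        hist[col] = np.array([not (hi < b_lo or lo > b_hi)
                              for b_lo, b_hi in bounds], dtype=bool)
    return hist

def selection_features(workload_hist, block_hists):
    """AND the workload histogram with each data block's histogram, then
    concatenate the per-column results into one feature row per block."""
    feats = []
    for block in block_hists:
        overlap = [workload_hist[col] & block[col] for col in workload_hist]
        feats.append(np.concatenate(overlap))
    return np.stack(feats)
```

A block whose feature row is entirely False cannot satisfy the workload's predicates, which is exactly the signal the attention layer of claim 8 would consume.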
Description
Index tuning method for big data analysis

Technical Field

The invention relates to the technical field of database indexing, and in particular to an index tuning method for big data analysis.

Background

With the development of data-intensive applications such as artificial intelligence, ever-increasing data volumes have made access to and processing of external data a new performance bottleneck. In addition, such applications deployed in the cloud typically separate storage resources from computing resources to allow independent scaling of each resource, further exacerbating the cost of cloud data access. Specifically, such applications typically use virtual machines provided by cloud vendors to build a computing cluster as the computing side and deploy big data analysis and machine learning applications on it, while an object storage service serves as the storage side holding the data required for analysis, model training and the like. The massive data scale and the separated resource deployment introduce extra network read cost: resource scalability is improved at the price of high data read overhead, which seriously affects system performance. To reduce data access overhead, such data-intensive applications typically use columnar storage formats, which avoid accessing irrelevant attribute columns when executing tasks. On this basis, the data is split horizontally into data blocks, which are typically the smallest units of I/O and analytical processing, to further mitigate read overhead. To improve read efficiency, these applications often additionally employ data skipping (DATA SKIPPING) indexes, which improve data access efficiency by avoiding reads of irrelevant data blocks.
A data skipping index needs to balance storage overhead against filtering performance: it should reduce data access cost by improving data filtering while avoiding high maintenance cost. This technique is therefore important for optimizing system performance in novel data-intensive scenarios such as big data analysis. Compared with fine-grained index structures such as B+ trees, its storage cost is adjustable and can be tuned to the specific scenario and specific user requirements. To avoid wasting storage resources, after the index structure is determined, the system needs to select the appropriate attribute columns and create indexes for them (i.e., index tuning). Existing index tuning techniques focus on optimizing either the tuning effect or the computational efficiency, but are limited to index structures with fixed storage cost, such as B+ trees. They therefore cannot handle index selection and space allocation jointly, and it is difficult to allocate proper storage space for data skipping indexes, whose storage cost is variable, so the tuning effect suffers.

Disclosure of Invention

In view of the above analysis, embodiments of the invention aim to provide an index tuning method for big data analysis, so as to solve the problem that existing methods cannot handle index selection and space allocation jointly, which affects the tuning effect.
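The data skipping behavior described above can be illustrated with a minimal per-block min/max summary (often called a zone map): blocks whose value range cannot overlap a query predicate are never read. This is a generic sketch of the technique, not the patent's index structure; the names are illustrative:

```python
def build_zone_map(blocks):
    """Per-block (min, max) statistics over the indexed column.
    Each block is a list of values; real systems store these
    statistics in block metadata rather than recomputing them."""
    return [(min(b), max(b)) for b in blocks]

def blocks_to_read(zone_map, lo, hi):
    """Indices of blocks whose [min, max] range overlaps the query
    range [lo, hi]; all other blocks are skipped without I/O."""
    return [i for i, (mn, mx) in enumerate(zone_map)
            if not (mx < lo or mn > hi)]
```

For example, with blocks `[[1, 3, 5], [10, 12], [20, 25]]`, a range query over (4, 11) touches only the first two blocks, so the third is skipped entirely; the storage cost of such summaries (and hence their precision) is what the method's space-allocation step adjusts.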
In one aspect, an embodiment of the invention provides an index tuning method for big data analysis, comprising the following steps: extracting global state data, wherein the global state data comprises a workload set, a candidate index set and data block distribution characteristics; constructing a hierarchical reinforcement learning model, wherein the hierarchical reinforcement learning model comprises a high-level policy network for index selection and a low-level policy network for allocating space for the selected indexes; training the hierarchical reinforcement learning model based on the global state data to obtain a trained hierarchical reinforcement learning model; and acquiring state data to be optimized, inputting the state data to be optimized into the trained hierarchical reinforcement learning model to obtain a selected index, and allocating storage space for the selected index.

As a further improvement of the method, the low-level policy network is trained as follows: S311, randomly selecting one workload from the workload set as the current workload, and randomly selecting one index from the candidate index set as the current index; S312, constructing current first state data based on the current workload and the current index; S313, inputting the current first state data into the low-level policy network to obtain a first action, calculating a reward value and obtaining the first state data of the next time step; S314, if the calculated reward value is the same as the reward values of the previous k time steps, judging whether the training of the low-level policy network has reached the preset number of times; if so, ending the training; otherwise, completing the space allocation of the current index and returning to step S311.