CN-122019836-A - Index recommendation method and system for cloud data analysis system


Abstract

The invention discloses an index recommendation method and system for a cloud data analysis system, belonging to the technical field of index recommendation. The method comprises: for each current load, obtaining feature representations of its query tasks on the data blocks and fusing them into the input features of that load; vectorizing the text information of each load, the input features of each load, and the current storage state information, and inputting the result into a trained index candidate evaluation model to obtain a score for every index candidate, where an index candidate represents a scheme for creating an index on one or more columns of one data block, and the index candidate evaluation model is a neural network model based on an attention mechanism; selecting the index candidate with the highest score, creating the index on the corresponding data block, and updating the storage state information; and, if the storage overhead constraint is satisfied and an index that can still be created exists, starting the next cycle, otherwise ending the loop. The invention makes full use of scarce memory resources to accelerate index scans.
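The loop described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Candidate` tuple, the `index_size` table, the `score_fn` callback (standing in for feature extraction, vectorization, and the trained evaluation model), and the reading of "an index that can be created" as "not yet built and still within the storage budget" are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical representation of an index candidate:
# (data block id, tuple of one or more column names).
Candidate = Tuple[str, Tuple[str, ...]]


def recommend_indexes(
    candidates: List[Candidate],
    index_size: Dict[Candidate, int],
    storage_budget: int,
    score_fn: Callable[[Set[Candidate], int], Dict[Candidate, float]],
) -> List[Candidate]:
    """Greedy loop: score candidates, create the best one, update storage state."""
    created: List[Candidate] = []
    used = 0  # storage state: space consumed by the indexes created so far
    while True:
        # Assumption: a candidate is creatable if it is not built yet
        # and adding it keeps total usage within the budget.
        feasible = {
            c for c in candidates
            if c not in created and used + index_size[c] <= storage_budget
        }
        if not feasible:
            break  # budget exhausted or nothing left to create
        # score_fn stands in for feature extraction, vectorization and the model.
        scores = score_fn(feasible, used)
        best = max(feasible, key=lambda c: scores[c])
        created.append(best)       # create the highest-scoring index
        used += index_size[best]   # update the storage state
    return created
```

Because the scores are recomputed after every creation, a real scoring model can react to the indexes already built, which is how a loop of this shape can take interactions between indexes into account.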

Inventors

WANG HUA
Tong Yulai
LIU JIAZHEN
LIU FENGRUI
XU JINHUI
ZHOU KE

Assignees

Huazhong University of Science and Technology (华中科技大学)

Dates

Publication Date: 20260512
Application Date: 20241112

Claims (7)

1. An index recommendation method for a cloud data analysis system, characterized by comprising the following steps: (S1) for each current load, acquiring feature representations of each query task on each data block, and fusing the feature representations as input features of the corresponding load; (S2) vectorizing text information of each load, input features of each load, and current storage state information of the cloud data analysis system to obtain a current system state feature vector; (S3) inputting the current system state feature vector into a trained index candidate evaluation model to obtain scores of all index candidates, wherein an index candidate represents a scheme for creating an index on one or more columns of a data block, and the index candidate evaluation model is a neural network model based on an attention mechanism; (S4) selecting the index candidate with the highest score, and creating an index on one or more columns of the corresponding data block according to that index candidate; and (S5) updating the storage state information, and, if the preset storage overhead constraint is satisfied and an index that can be created exists, returning to (S1) for the next cycle, otherwise ending the loop.
2. The index recommendation method for a cloud data analysis system according to claim 1, wherein the feature representation w_b of the query task q on the data block b includes the proportion of the amount of data accessed by the query task q in the data block b to the total amount of data in the data block, and the data range of the data accessed by the query task q in the data block b.
3. The index recommendation method for a cloud data analysis system according to claim 2, wherein the feature representation w_b of the query task q on the data block b is obtained by: obtaining a bitmap h_b of the data block b and a bitmap h_q of the query task q, wherein each bit in the bitmap h_b corresponds to one data range, the bit corresponding to the data range in which each data record of the data block b falls is 1, and the remaining bits are 0; each bit in the bitmap h_q corresponds to one data range, the bits corresponding to the data ranges involved in the predicates of the query task q are 1, and the remaining bits are 0; the bitmaps h_b and h_q have equal length, and the same bit position corresponds to the same data range in both; performing a bitwise AND operation on the bitmap h_b and the bitmap h_q to obtain a joint bitmap h_u, which records the data ranges of the data accessed by the query task q in the data block b; calculating the ratio p of the number of 1s in the joint bitmap h_u to the number of 1s in the bitmap h_b, to obtain the proportion of the amount of data accessed by the query task q in the data block b to the total amount of data in the data block; and combining the ratio p and the joint bitmap h_u to obtain the feature representation w_b of the query task q on the data block b.
4. The index recommendation method for a cloud data analysis system according to any one of claims 1 to 3, wherein the neural network model based on an attention mechanism comprises an attention layer, a linear layer, and an output layer; the attention layer takes the vectorized load text information as Query and the vectorized load input features as Key and Value, evaluates the degree of correlation between each Key and the Query as the weight of the corresponding Value, and performs a weighted summation over the Values; the linear layer takes the output of the attention layer and the vectorized storage state information as input; and the output layer applies SoftMax to the output of the linear layer to obtain the score of each index candidate.
5. A computer program product comprising a computer program which, when executed by a processor, implements the index recommendation method for a cloud data analysis system according to any one of claims 1 to 4.
6. A computer-readable storage medium, characterized by comprising a stored computer program, wherein the computer program, when executed by a processor, controls the device on which the computer-readable storage medium is located to execute the index recommendation method for a cloud data analysis system according to any one of claims 1 to 4.
7. An index recommendation system for a cloud data analysis system, comprising: a computer-readable storage medium storing a computer program; and a processor configured to read the computer program stored in the computer-readable storage medium and execute the index recommendation method for a cloud data analysis system according to any one of claims 1 to 4.
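The bitmap feature of claim 3 and the scoring model of claim 4 can be sketched as follows. This is a hedged illustration in plain Python, not the patented implementation: bitmaps are modeled as lists of 0/1 values, the correlation between Query and Key is assumed to be a scaled dot product (the claims do not fix the correlation function), Query and Key vectors are assumed to share one dimension, and the names `block_feature`, `score_candidates`, `weights`, and `bias` are hypothetical. The linear-layer parameters would come from training in practice.

```python
import math
from typing import List, Sequence, Tuple


def block_feature(h_b: Sequence[int], h_q: Sequence[int]) -> Tuple[float, List[int]]:
    """Claim 3 sketch: return (p, h_u) for query task q on data block b."""
    if len(h_b) != len(h_q):
        raise ValueError("bitmaps must have equal length")
    h_u = [x & y for x, y in zip(h_b, h_q)]  # bitwise AND gives the joint bitmap
    ones_b = sum(h_b)
    # Assumption: an empty block (no bits set) yields a ratio of 0.
    p = sum(h_u) / ones_b if ones_b else 0.0
    return p, h_u


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def score_candidates(
    query: Sequence[float],            # vectorized load text information
    features: Sequence[Sequence[float]],  # vectorized load input features (Key and Value)
    storage: Sequence[float],          # vectorized storage state information
    weights: Sequence[Sequence[float]],   # hypothetical linear-layer weights, one row per candidate
    bias: Sequence[float],             # hypothetical linear-layer bias, one entry per candidate
) -> List[float]:
    """Claim 4 sketch: attention layer, then linear layer, then SoftMax output."""
    # Attention: the relevance of each Key to the Query weights the matching Value.
    attn = softmax([dot(query, k) / math.sqrt(len(query)) for k in features])
    dim = len(features[0])
    context = [sum(a * v[i] for a, v in zip(attn, features)) for i in range(dim)]
    # Linear layer over the attention output concatenated with the storage state.
    x = context + list(storage)
    logits = [dot(w, x) + b for w, b in zip(weights, bias)]
    return softmax(logits)  # one score per index candidate
```

For example, `block_feature([1, 1, 0, 1], [0, 1, 1, 1])` yields the joint bitmap `[0, 1, 0, 1]` and a ratio of 2/3, since two of the three occupied ranges in the block are touched by the query.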

Description

Index recommendation method and system for cloud data analysis system

Technical Field

The invention belongs to the technical field of index recommendation, and particularly relates to an index recommendation method and an index recommendation system for a cloud data analysis system.

Background

In modern cloud computing environments, architectures that separate storage from compute are commonly adopted in the field of data analysis. Such an architecture physically or logically decouples computing resources (e.g., computing clusters) from storage resources, so that computing resources can be scaled independently to accommodate changing resource requirements. Furthermore, in order to optimize data transfer efficiency and reduce the frequency of input/output (I/O) operations, a data block, typically containing tens of thousands to millions of data records from a single data set, is defined as the basic unit of access to a remote storage resource, and is also the smallest unit of data filtering and processing. This design improves data throughput and, by reducing the number of I/O operations, the performance of the whole system.

Indexes are widely used in traditional database scenarios as an important technique for making data scans more efficient. However, in cloud data analysis scenarios with huge data volumes, secondary indexes occupy too much scarce memory, so indexes are difficult to exploit fully in such scenarios and the system's performance bottleneck is difficult to overcome.

Index recommendation is one of the important means of improving memory utilization, and can alleviate the problem of excessive index overhead to a certain extent. Academic research on this topic can be broadly divided into two categories. One category is index recommendation algorithms based on heuristic strategies, such as Drop, Relaxation, and Extend.
The core strategies of these algorithms fall into two types: starting from an empty index set and adding new candidate indexes that improve query execution efficiency until the constrained storage space is used up, or starting from the full index set and removing the indexes that contribute least to query execution efficiency until the storage overhead constraint is satisfied. Such heuristic index recommendation algorithms can account for complex effects such as interactions among indexes and therefore make better index selection decisions, but they tend to incur high computational overhead, which makes them difficult to apply at very large scale (for example, when a large number of attribute columns are involved). The other category is index recommendation algorithms based on reinforcement learning, such as Swirl and DRLinda. Such algorithms can efficiently produce better index selection decisions, at the cost of higher training overhead.

All of the above works recommend index combinations with the data set as the minimum granularity, and thus miss the opportunity to tune indexes at the data block level, so it is difficult for them to make full use of scarce memory resources to accelerate index scans. To solve this problem, some research efforts have attempted to build indexes at the data block level. For example, Alaba tries to build a single-column index for each data block, and Slalom can dynamically select the single-column indexes to build based on the workload.
Although such techniques for index recommendation at the data block level have been proposed, existing work still struggles to make effective use of scarce memory resources in cloud computing scenarios, because it supports only single-column indexes, cannot build multi-column indexes, and cannot effectively handle the way interactions among indexes affect the quality of an index recommendation algorithm.

Disclosure of Invention

In view of the defects of the prior art and the need for improvement, the invention provides an index recommendation method and an index recommendation system for a cloud data analysis system. The aim is to build suitable single-column or multi-column indexes for data blocks with different distributions, based on the differences in data distribution among the data blocks, so as to make full use of scarce memory resources to accelerate index scans.

In order to achieve the above object, according to an aspect of the present invention, there is provided an index recommendation method for a cloud data analysis system, including: (S1) for each current load, acquiring feature representations of each query task on each data block, and fusing the feature representations as input features of the corresponding load; (S2) vectorizing the text information of each load, the inpu