CN-122019542-A - Multidimensional data index optimization method for hydroelectric big data storage


Abstract

The invention relates to a multidimensional data index optimization method for hydroelectric big data storage. The method comprises: data access and feature extraction, in which the physical dimension and service dimension attributes of time-series data are extracted; dynamic weight calculation, in which the comprehensive weight of each dimension combination is calculated in real time from query frequency, data selectivity and service priority; hierarchical fusion index construction and updating, in which an adaptive grid-partitioned R-tree main index and inverted auxiliary indexes are built according to the weights and associated through bidirectional pointers; intelligent query routing and execution, in which the optimal index path is selected according to the weights and data are rapidly located and merged through the pointers; online monitoring and adaptive adjustment, in which data distribution and query performance are monitored in real time to trigger weight recalculation and incremental index reorganization; and lifecycle-aware storage management, in which differentiated index policies are applied to hot, warm and cold data.

Inventors

ZHANG KEFENG
MA WENHUA
HUANG CHENYU
CHEN YANAN
WANG WEIHAO
WANG DANDAN
XUE YUANYUAN

Assignees

National Energy Group Xinjiang Jilintai Hydropower Development Co., Ltd. (国家能源集团新疆吉林台水电开发有限公司)

Dates

Publication Date: 20260512
Application Date: 20260130

Claims (9)

1. A multidimensional data index optimization method for hydroelectric big data storage, characterized by comprising the following steps: S1, data access and feature extraction: accessing multi-source heterogeneous time-series data streams in the hydropower field in real time, and extracting physical dimension attributes and service dimension attributes of each data object, wherein the physical dimension attributes comprise a timestamp and spatial position information; S2, dynamic weight calculation: continuously collecting historical query logs, analyzing the query load pattern, and dynamically calculating and updating the comprehensive weight of each dimension and dimension combination based on its query frequency, data selectivity and predefined service priority; S3, constructing and updating a hierarchical fusion index structure based on the dynamic weights at the current moment: constructing a main index that adopts a grid-partitioned R-tree structure adapted to the dynamic weights, wherein the node partition strategy of the R-tree is dynamically adjusted according to the single spatio-temporal dimension that currently has a high weight; and constructing one or more auxiliary indexes for single service dimensions with high weight, each adopting an inverted index structure and associated through bidirectional pointers with the leaf-node data blocks of the main index; S4, intelligent query routing and execution: receiving a user query request, parsing the dimensions in the query conditions, selecting, according to the real-time weights of those dimensions, a query path that preferentially traverses the main index and the corresponding auxiliary indexes, rapidly locating the associated data through the bidirectional pointers, and merging and filtering the query results; and S5, online monitoring and adaptive adjustment: monitoring the drift of the data distribution and the query performance indicators in real time, triggering root-cause diagnosis when a change exceeds a preset threshold, and, according to the diagnosis result, adjusting the dynamic weights and performing online incremental reorganization of the hierarchical fusion index structure.
2. The method according to claim 1, wherein in S1 the extracted physical dimension attributes of each data object include a timestamp accurate to the millisecond and three-dimensional spatial coordinates composed of longitude, latitude and altitude, and the extracted service dimension attributes include a device code uniquely identifying the source device, an operating-condition status category code assigned according to preset rules, and the values of voltage, current, flow and pressure parameters obtained by real-time monitoring.
3. The method according to claim 1, wherein in S2 the query frequency of a dimension is obtained by counting the number of conditional constraint operations on each single dimension and dimension combination within a specified time window; the data selectivity is evaluated from the dispersion of the dimension's values over its value range; the predefined service priority is associated with the device importance level and the operational safety level; and a comprehensive weight value reflecting query heat and filtering efficiency is obtained in real time through a weighted calculation model based on these elements.
4. The method according to claim 3, wherein the weighted calculation model is established as follows: the continuously collected historical query logs, the statistics of dimension value-range distributions and the preset service priority rules are taken as inputs; during processing, the query frequency values, data selectivity evaluation values and service priority values of all dimensions and dimension combinations are normalized and then linearly weighted and summed using coefficients determined by domain expert experience and dynamically set by a reinforcement learning mechanism; and real-time, quantified comprehensive weight scores of all dimensions and dimension combinations are output, the comprehensive weight scores being used directly to guide the selection of the partition dimension of nodes in the index structure and the ordering of dimensions in a composite index key.
5. The method of claim 1, wherein in S3 the hierarchical fusion index structure is constructed and updated based on the comprehensive weight scores obtained in S2; the main index adopts a grid-partitioned R-tree structure, and when its tree nodes are spatially partitioned, the partition granularity and number of levels of the grid are adaptively determined according to the weight score of the spatio-temporal dimension that currently has a high weight; each auxiliary index is an inverted index constructed for a service dimension with high weight, each inverted list being associated through bidirectional pointers with the leaf-node data blocks of the main-index R-tree that store the data objects having the corresponding dimension value; and each leaf-node data block records a set of data objects together with their time range and physical storage location.
6. The method of claim 1, wherein in S4 each dimension and dimension combination in the query conditions is matched against the real-time comprehensive weights updated in S2; the query route preferentially traverses the main index and the auxiliary index corresponding to the highest-weight dimension, and rapidly associates and references the related data sets in the other indexes through the bidirectional pointers established in S3; and union and intersection operations are then performed, according to timestamp and spatial position, on the data sets obtained from the different index paths to complete the merging and final filtering of the results.
7. The method according to claim 1, wherein in S5 the drift of the data distribution is quantified by computing the rate of change of the value distribution of newly added data in the key time and service dimensions; the query performance indicators comprise average query latency and index hit rate; during monitoring, the rate of change and the performance indicators are compared with their respective preset static and dynamic thresholds, and root-cause diagnosis is started when any indicator continuously exceeds its threshold; the root-cause diagnosis performs association analysis on the anomaly and locates the core dimension causing it; targeted weight adjustment and index reorganization are then triggered according to the diagnosis result: if the diagnosis points to a specific dimension, the weight of that dimension is adjusted and local reorganization of the index over the data related to that dimension is triggered, and if the diagnosis indicates a broad change, recalculation of the dynamic weights and index reorganization over the corresponding range are triggered; and the reorganization is carried out online at data-block granularity.
8. The method according to any one of claims 1-7, further comprising: S6, lifecycle-aware index storage management: marking data objects as hot data, warm data or cold data according to the access frequency, generation time and service criticality of the data; for the hot data, maintaining in disk storage a coarse-grained structure of the main index and a subset of key auxiliary indexes; and archiving the cold data to low-cost storage, extracting its core dimension values to form a lightweight metadata index in a columnar storage format, and reconstructing partial indexes from the metadata index when a query triggers it.
9. The method of claim 8, wherein in S6 the marking of a data object is determined jointly from its access frequency within a sliding time window, its age relative to the current time and its predefined service criticality level; for data objects marked as hot data, the complete R-tree main index structure and the inverted auxiliary indexes reside in a high-speed storage medium; for warm data, the main index retains only its upper coarse-grained nodes on disk to reduce space occupation, and the auxiliary indexes are retained only for high-frequency query dimensions; for cold data, the core dimension values are extracted and stored in columns to form a lightweight metadata index while the data are archived to low-cost storage, and when such data are hit by a subsequent query, the local index structure for the dimensions required by the query is temporarily reconstructed in memory from the metadata index.
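The weighted calculation model of claims 3 and 4 (normalize the three inputs, then take a linear weighted sum) can be illustrated with a minimal Python sketch. It is not the patented implementation: the claims do not specify the normalization scheme or the coefficient values, so min-max normalization and the coefficients (0.5, 0.3, 0.2) are assumptions, as are all names (`DimensionStats`, `composite_weights`, `rank_dimensions`) and the sample figures. The reinforcement-learning adjustment of the coefficients is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionStats:
    """Raw inputs per dimension (hypothetical field names)."""
    query_count: int    # constraint operations on the dimension in the time window
    selectivity: float  # dispersion of the dimension's values over its value range
    priority: float     # predefined service priority


def min_max(values):
    """Min-max normalize to [0, 1]; a constant column maps to 0.0 everywhere."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def composite_weights(stats, coeffs=(0.5, 0.3, 0.2)):
    """Linear weighted sum of normalized frequency, selectivity and priority.

    The coefficient values are illustrative assumptions, not taken from the patent.
    """
    names = list(stats)
    freq = min_max([stats[n].query_count for n in names])
    sel = min_max([stats[n].selectivity for n in names])
    pri = min_max([stats[n].priority for n in names])
    a, b, c = coeffs
    return {n: a * f + b * s + c * p for n, f, s, p in zip(names, freq, sel, pri)}


def rank_dimensions(weights):
    """Order dimensions by descending weight (ties broken by name)."""
    return sorted(weights, key=lambda n: (-weights[n], n))


# Made-up sample statistics for three dimensions.
sample = {
    "timestamp": DimensionStats(query_count=900, selectivity=0.9, priority=2),
    "device_code": DimensionStats(query_count=600, selectivity=0.6, priority=3),
    "condition_code": DimensionStats(query_count=100, selectivity=0.1, priority=1),
}
print(rank_dimensions(composite_weights(sample)))
# prints ['timestamp', 'device_code', 'condition_code']
```

In this sketch the resulting order would stand in for the two uses named in claim 4: the first entry as the candidate partition dimension for index nodes, and the full order as the arrangement of dimensions in a composite index key.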

Description

Multidimensional data index optimization method for hydroelectric big data storage

Technical Field

The invention belongs to the technical field of database indexes, and particularly relates to a multidimensional data index optimization method for hydroelectric big data storage, which is particularly suitable for dynamic index construction, query optimization and storage management of multi-source heterogeneous time-series data in smart hydropower scenarios.

Background

With the continued advance of smart hydropower and river-basin digitization, massive data are continuously generated and aggregated during hydropower production and operation. These data are distinctly multi-source and heterogeneous, originating from monitoring systems, water regime forecasting systems, equipment condition monitoring systems, information management systems and the like. Efficient storage and fast retrieval of such multidimensional data is a core technical foundation for real-time assessment of equipment state, accurate fault early warning, optimized generation dispatching and intelligent management decisions. Faced with these requirements, traditional single-dimensional index structures such as the B-tree struggle to handle complex multi-condition combined queries.

However, generic indexing schemes gradually expose several limitations when applied to the specific business domain of hydropower. First, the dimensions of hydropower data are not independent but are tightly linked by business logic; for example, the vibration data of a generating unit are meaningful only when analyzed together with the specific operating condition, time period and water head parameters. Second, the data streams show pronounced periodicity, trend and burstiness: the data generation rate and query pattern differ greatly between flood season and non-flood season, and between peak-regulation periods and steady operation periods. Most existing index structures are built statically or support only limited offline reorganization, and cannot adjust online and adaptively to dynamic changes in data distribution and query hotspots, so index performance fluctuates widely with the business cycle. In addition, a common way to accelerate multidimensional queries is to build multiple independent indexes or composite indexes containing a large amount of redundant data. This improves query speed to some extent, but sharply raises storage cost and introduces heavy maintenance overhead when data are updated, which runs counter to the economics of big data storage. Finally, existing methods generally ignore the differences in data value across the full life cycle, applying the same high-cost index strategy to frequently accessed real-time monitoring data and to rarely queried historical archive data, which wastes valuable storage and computing resources. The field therefore needs a multidimensional data index method that deeply integrates hydropower business characteristics, adapts dynamically, and balances storage cost against query performance.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a multidimensional data index optimization method for hydroelectric big data storage, comprising the following steps: S1, data access and feature extraction: accessing multi-source heterogeneous time-series data streams in the hydropower field in real time, and extracting physical dimension attributes and service dimension attributes of each data object, wherein the physical dimension attributes comprise a timestamp and spatial position information, and the service dimension attributes comprise equipment identifiers, operating-condition status codes and operating parameters; S2, dynamic weight calculation: continuously collecting historical query logs, analyzing the query load pattern, and dynamically calculating and updating the comprehensive weight of each dimension and dimension combination based on its query frequency, data selectivity and predefined service priority; S3, constructing and updating a hierarchical fusion index structure based on the dynamic weights at the current moment: constructing a main index that adopts a grid-partitioned R-tree structure adapted to the dynamic weights, wherein the node partition strategy of the R-tree is dynamically adjusted according to the single spatio-temporal dimension that currently has a high weight; one or more auxiliary indexes are built for single service dimensions with high weight, an inverted index structure is adopted, a bi-directional pointer association is built