
CN-121979957-A - Hadoop-based data warehouse construction method and system


Abstract

The invention discloses a Hadoop-based data warehouse construction method and system, aimed at solving three problems of existing methods: insufficient real-time handling of core monitoring signals, wasted resources on non-core signals, and a data model that cannot adapt to equipment states. The method constructs an ODS-DW-DM three-layer architecture: the ODS layer divides signals into core and non-core categories according to a device mechanism model and extracts them dynamically to form standardized data; the DW layer calculates device health degree and models adaptively; the DM layer stores time-series data in tiers; and a full-link process optimizes data quality. The system comprises modules for three-layer architecture construction, ODS dynamic extraction, DW health modeling, DM tiered storage, and full-link quality optimization. The invention improves data timeliness and resource utilization, ensures data quality, and meets industrial monitoring requirements.

Inventors

  • ZHANG YANZHI
  • TAN XIAOGANG
  • LI PENGFEI
  • QU WEIQIANG
  • CHEN ZEYANG
  • CAO CHEN
  • TANG XINWEN
  • HE QIFAN
  • LIU BO
  • MA XINHONG
  • HE LIN

Assignees

  • Yellow River Hydropower Development Group Co., Ltd. (黄河水利水电开发集团有限公司)
  • Beijing Huake Tongan Monitoring Technology Co., Ltd. (北京华科同安监控技术有限公司)

Dates

Publication Date
2026-05-05
Application Date
2025-11-30

Claims (10)

  1. A Hadoop-based data warehouse construction method, characterized by comprising the following steps: constructing a three-layer data architecture based on the Hadoop ecosystem, the architecture comprising, in order, an operational data store layer (ODS), a data warehouse layer (DW) and a data mart layer (DM), which provides infrastructure support for subsequent data processing and storage; on this three-layer architecture, dividing the monitored quantities at the ODS layer into core signals and non-core signals through a preset equipment mechanism model, performing real-time stream extraction of the core signals with a Flink CDC component and synchronizing them to the ODS-layer Hive table and HBase memory area, performing adaptive periodic extraction of the non-core signals based on Hadoop YARN resource-load feedback, and dynamically adjusting the extraction period according to the YARN cluster CPU utilization, so as to form standardized raw data at the ODS layer; based on the monitored data extracted by the ODS layer, building a real-time calculation module at the DW layer with Hive UDFs and Spark MLlib, calculating the equipment health degree from vibration spectrum features, temperature trend and historical fault weight, and, according to whether the health degree indicates a normal, early-warning or fault-risk state, adopting respectively a coarse-grained dimension model, fine-grained dimension expansion or cross-equipment dimension association, thereby completing the analytical data modeling of the DW layer; based on the dimension model data constructed by the DW layer, dividing the time-series data cube at the DM layer into a real-time query layer, a history analysis layer and an archiving layer by storage tier, thereby realizing differentiated storage and fast query support at the data mart layer; and performing a first round of outlier correction on the ODS-layer raw data through a preset equipment physical-constraint rule engine, imputing missing values through an LSTM time-series prediction model trained with Spark ML, labeling the imputed values at the DW layer, and meanwhile comparing the repaired data against actual operation and maintenance records and dynamically optimizing the LSTM model parameters to improve imputation accuracy, so as to form a closed-loop data-quality optimization covering the ODS layer through the DM layer.
  2. The Hadoop-based data warehouse construction method according to claim 1, wherein, in the equipment mechanism model preset at the ODS layer, a core signal is a monitored quantity highly correlated with equipment faults and comprises at least bearing vibration and stator temperature data, and a non-core signal is a monitored quantity weakly correlated with equipment faults and comprises at least ambient humidity and auxiliary-equipment voltage data.
  3. The Hadoop-based data warehouse construction method according to claim 1, wherein the equipment health degree is a weighted combination of a vibration-spectrum matching degree, a normal-temperature-trend proportion, and a historical-fault dissimilarity weight, wherein the vibration-spectrum matching degree is the cosine similarity between the real-time vibration spectrum and a standard normal spectrum, the normal-temperature-trend proportion is the fraction of sampling points at which the real-time temperature lies within a preset normal interval, and the historical-fault dissimilarity weight is a normalized Euclidean distance between the real-time monitoring data and historical fault data.
  4. The Hadoop-based data warehouse construction method according to claim 1, wherein the DM-layer real-time query layer stores data through an Alluxio component with a memory caching strategy enabled, and when the cache hit rate falls below a preset threshold the cache space is automatically expanded; the DM-layer history analysis layer stores data in HBase, and in the HBase column-family design the frequently queried vibration-value and temperature-value fields are assigned to one column family, while the infrequently queried equipment-remark and maintenance-record fields are assigned to another column family.
  5. The Hadoop-based data warehouse construction method according to claim 1, wherein the rule engine used for outlier correction of the ODS-layer raw data encodes at least preset equipment physical constraints, including that the turbine rotation speed is proportional to the square of the output, and that the bearing temperature is not lower than the ambient temperature plus a preset difference; when ODS-layer raw data are detected to violate a physical constraint, the outlier is corrected to the most recent reasonable value that satisfies the constraint.
  6. The Hadoop-based data warehouse construction method according to claim 1, wherein the LSTM time-series prediction model used for missing-value imputation takes as input the normal ODS-layer monitoring data from a preset period preceding the missing values and outputs imputed data for the missing period; model parameter optimization comprises adjusting the LSTM time step and the number of hidden-layer neurons, a parameter-iteration update is triggered when the error rate between the imputed data and the actual operation and maintenance records exceeds a preset value, and the updated model is used for subsequent DW-layer missing-value imputation.
  7. The Hadoop-based data warehouse construction method according to claim 1, wherein priorities are assigned to the tasks of each step through the Hadoop YARN resource scheduler, the ODS-layer core-signal extraction task having the highest priority, the DW-layer equipment health-degree calculation task a medium priority, and the DM-layer time-series data cube archiving task the lowest priority, so that high-priority tasks obtain computing resources preferentially and the timeliness of data processing is guaranteed.
  8. A Hadoop-based data warehouse construction system, comprising: a three-layer architecture construction module for constructing a three-layer data architecture based on the Hadoop ecosystem, the architecture comprising, in order, an operational data store layer (ODS), a data warehouse layer (DW) and a data mart layer (DM), which provides infrastructure support for subsequent data processing and storage; an ODS-layer dynamic data extraction module for dividing the monitored quantities at the ODS layer into core signals and non-core signals through a preset equipment mechanism model, performing real-time stream extraction of the core signals with a Flink CDC component and synchronizing them to the ODS-layer Hive table and HBase memory area, performing adaptive periodic extraction of the non-core signals based on Hadoop YARN resource-load feedback, and dynamically adjusting the extraction period according to the YARN cluster CPU utilization to form standardized raw data at the ODS layer; a DW-layer health-degree-driven modeling module for building a real-time calculation module with Hive UDFs and Spark MLlib on the basis of the monitored data extracted by the ODS layer, calculating the equipment health degree from vibration spectrum features, temperature trend and historical fault weight, and adopting respectively a coarse-grained dimension model, fine-grained dimension expansion or cross-equipment dimension association according to whether the health degree indicates a normal, early-warning or fault-risk state, to complete the analytical data modeling of the DW layer; a DM-layer time-series tiered storage module for dividing the time-series data cube at the DM layer into a real-time query layer, a history analysis layer and an archiving layer by storage tier, based on the dimension model data constructed by the DW layer, thereby realizing differentiated storage and fast query support at the data mart layer; and a full-link data-quality self-healing module for performing a first round of outlier correction on the ODS-layer raw data through a preset equipment physical-constraint rule engine, imputing missing values through an LSTM time-series prediction model trained with Spark ML, labeling the imputed values at the DW layer, and meanwhile comparing the repaired data against actual operation and maintenance records and dynamically optimizing the LSTM model parameters to improve imputation accuracy, forming a closed-loop data-quality optimization covering the ODS layer through the DM layer.
  9. A computer device, comprising: a memory having computer-readable instructions stored therein; and a processor which, when executing the computer-readable instructions, implements the steps of the Hadoop-based data warehouse construction method according to any one of claims 1 to 7.
  10. A computer-readable storage medium, characterized in that computer-readable instructions are stored thereon which, when executed by a processor, implement the steps of the Hadoop-based data warehouse construction method according to any one of claims 1 to 7.
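
A minimal sketch of the adaptive extraction of claim 1 is given below: the extraction period for non-core signals stretches as the YARN cluster gets busier. The linear scaling rule, the 300 s base period, and the bounds are illustrative assumptions; the claim specifies only that the period tracks YARN cluster CPU utilization.

```python
def adaptive_extraction_period(cpu_utilization: float,
                               base_period_s: int = 300,
                               min_period_s: int = 60,
                               max_period_s: int = 1800) -> int:
    """Scale the non-core-signal extraction period with YARN CPU load.

    High load -> extract less often (longer period); low load -> more
    often. Scaling rule and defaults are illustrative assumptions.
    """
    if not 0.0 <= cpu_utilization <= 1.0:
        raise ValueError("cpu_utilization must be in [0, 1]")
    # Linearly stretch the base period as the cluster gets busier.
    period = int(base_period_s * (1 + 2 * cpu_utilization))
    return max(min_period_s, min(max_period_s, period))
```

For example, an idle cluster keeps the 300 s base period, a half-loaded cluster doubles it, and the result is always clamped to the [60 s, 1800 s] band.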
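
The core/non-core routing of claim 2 can be sketched as a simple lookup that sends each signal down the real-time or periodic extraction path. The signal names and the default handling of unlisted signals are assumptions for illustration.

```python
# Illustrative signal catalogue; a real mechanism model would come from
# the device's engineering specification.
CORE_SIGNALS = {"bearing_vibration", "stator_temperature"}
NON_CORE_SIGNALS = {"ambient_humidity", "auxiliary_voltage"}

def classify_signal(name: str) -> str:
    """Route a monitoring signal to the real-time (Flink CDC) path or
    the YARN-adaptive periodic path."""
    if name in CORE_SIGNALS:
        return "realtime_stream"    # Flink CDC -> Hive table + HBase
    # Assumption: unknown signals are treated conservatively as non-core.
    return "adaptive_periodic"      # period driven by YARN CPU load
```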
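
The weighted health-degree calculation of claim 3 can be sketched as follows. The claim fixes the three components (spectrum cosine similarity, in-range temperature fraction, normalized Euclidean distance to fault data); the weights and the d/(1+d) normalization are illustrative assumptions.

```python
import math

def health_degree(rt_spectrum, std_spectrum,
                  temperatures, normal_range,
                  rt_features, fault_features,
                  weights=(0.4, 0.3, 0.3)) -> float:
    """Weighted device health degree (components per claim 3)."""
    # s1: spectrum matching degree = cosine similarity of the real-time
    # spectrum against the standard normal spectrum.
    dot = sum(a * b for a, b in zip(rt_spectrum, std_spectrum))
    n1 = math.sqrt(sum(a * a for a in rt_spectrum))
    n2 = math.sqrt(sum(b * b for b in std_spectrum))
    s1 = dot / (n1 * n2) if n1 and n2 else 0.0

    # s2: fraction of temperature samples inside the preset normal interval.
    lo, hi = normal_range
    s2 = sum(lo <= t <= hi for t in temperatures) / len(temperatures)

    # s3: Euclidean distance to the historical fault signature, squashed
    # to [0, 1) -- farther from known faults means healthier (assumed rule).
    dist = math.sqrt(sum((a - b) ** 2
                         for a, b in zip(rt_features, fault_features)))
    s3 = dist / (1.0 + dist)

    w1, w2, w3 = weights
    return w1 * s1 + w2 * s2 + w3 * s3
```

A device whose spectrum matches the standard exactly and whose temperatures are all in range, but whose features coincide with a historical fault, scores 0.4 + 0.3 + 0 = 0.7 under these assumed weights.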
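
The two-family HBase layout of claim 4 amounts to a field-to-column-family mapping that separates hot and cold access paths; the family and field names below are assumptions.

```python
# Illustrative column-family layout for the DM history-analysis layer.
# The claim requires only that high-frequency fields and low-frequency
# fields live in separate families.
COLUMN_FAMILY_MAP = {
    "hot":  ["vibration_value", "temperature_value"],   # high-frequency queries
    "cold": ["device_remark", "maintenance_record"],    # low-frequency queries
}

def family_for(field: str) -> str:
    """Return the column family a field is assigned to."""
    for family, fields in COLUMN_FAMILY_MAP.items():
        if field in fields:
            return family
    raise KeyError(f"unmapped field: {field}")
```

Keeping hot fields in their own family means frequent scans touch fewer store files, which is the usual motivation for this split.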
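
One rule of the physical-constraint engine of claim 5 (bearing temperature not below ambient plus a preset difference) can be sketched as follows. The fallback order follows the claim's "most recent reasonable value"; clamping to the constraint boundary when no valid history exists is an added assumption.

```python
def correct_bearing_temp(bearing_temp, ambient_temp,
                         min_delta=0.0, history=None):
    """First-round outlier correction for one physical constraint:
    a running bearing cannot be colder than ambient (+ preset margin).
    """
    floor = ambient_temp + min_delta
    if bearing_temp >= floor:
        return bearing_temp          # reading satisfies the constraint
    for past in reversed(history or []):
        if past >= floor:
            return past              # most recent reasonable value
    return floor                     # assumption: no valid history, clamp
```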
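
The parameter-iteration trigger of claim 6 can be sketched as an error-rate check of imputed values against operation and maintenance records; the mean-relative-error metric and the 10% default threshold are assumptions, as the claim only requires a preset error-rate threshold.

```python
def should_retrain(imputed, actual, error_threshold=0.1) -> bool:
    """Trigger an LSTM parameter-iteration update when the mean relative
    error of imputed values vs. maintenance records exceeds a threshold."""
    errors = [abs(p - a) / abs(a)
              for p, a in zip(imputed, actual) if a != 0]
    if not errors:
        return False                 # nothing comparable: keep the model
    return sum(errors) / len(errors) > error_threshold
```

When this returns True, the time step and hidden-layer size would be adjusted and the model retrained before it imputes further DW-layer gaps.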
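
The YARN priority scheme of claim 7 can be sketched as a static priority table; the numeric values are assumptions (higher means scheduled first), and the medium DW priority follows the claim's highest/medium/lowest ordering.

```python
# Illustrative task-priority table for YARN scheduling.
TASK_PRIORITY = {
    "ods_core_extraction": 10,   # highest: real-time core-signal stream
    "dw_health_calculation": 5,  # medium: health-degree computation
    "dm_cube_archiving": 1,      # lowest: archival storage
}

def pick_next(pending):
    """Return the pending task with the highest scheduling priority."""
    return max(pending, key=TASK_PRIORITY.__getitem__)
```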

Description

Hadoop-based data warehouse construction method and system

Technical Field

The invention relates to the field of data warehouses, in particular to a Hadoop-based data warehouse construction method and system.

Background

In industrial digital transformation, the massive monitoring data generated by equipment such as hydropower units and wind power units are scattered across multiple systems, and their value is mined by integrating them in a data warehouse. A data warehouse is a distributed system oriented to analysis and decision-making: it integrates multi-source data in a standardized way, stores it in a structured way, supports multi-dimensional analysis, and thereby supports operation and maintenance optimization and safety early warning. Because industrial data often reach the TB/PB scale and both real-time processing and long-term storage must be accommodated, building data warehouses on the Hadoop ecosystem (HDFS, Spark/Flink, Hive/HBase) has become mainstream and can effectively meet the storage and processing requirements of massive monitoring data.

Existing Hadoop-based methods for constructing industrial monitoring data warehouses have two key pain points. On the one hand, there is a marked contradiction between the insufficient real-time handling of core signals and the resource waste on non-core signals: core signals (such as bearing vibration) must be collected in real time to keep early warnings timely, while non-core signals (such as ambient humidity) do not need high-frequency processing; yet most existing methods adopt a uniform extraction strategy (such as fixed-period full extraction), so core-signal delays degrade early warning while high-frequency extraction of non-core signals wastes YARN cluster resources. On the other hand, the data model cannot adapt to changes in equipment state, leaving efficiency and dimensional completeness unbalanced: existing methods mostly use static dimension modeling (such as a fixed hourly time dimension), where fine-grained dimensions increase cost while the equipment is in a normal state, and coarse-grained dimensions lack key analysis dimensions (such as load fluctuation amplitude) in early-warning or fault states, impairing the accuracy of operation and maintenance decisions.

Disclosure of the Invention

The invention provides a Hadoop-based data warehouse construction method and system that aim to resolve the contradiction between the insufficient real-time handling of core monitoring signals and the resource waste on non-core signals, while also adapting the data model to changes in equipment state so as to balance analysis efficiency and dimensional completeness. The technical scheme of the invention is as follows.

A Hadoop-based data warehouse construction method comprises the following steps: constructing a three-layer data architecture based on the Hadoop ecosystem, the architecture comprising, in order, an operational data store layer (ODS), a data warehouse layer (DW) and a data mart layer (DM), which provides infrastructure support for subsequent data processing and storage; on this three-layer architecture, dividing the monitored quantities at the ODS layer into core signals and non-core signals through a preset equipment mechanism model, performing real-time stream extraction of the core signals with a Flink CDC component and synchronizing them to the ODS-layer Hive table and HBase memory area, performing adaptive periodic extraction of the non-core signals based on Hadoop YARN resource-load feedback, and dynamically adjusting the extraction period according to the YARN cluster CPU utilization to form standardized raw data at the ODS layer; based on the monitored data extracted by the ODS layer, building a real-time calculation module at the DW layer with Hive UDFs and Spark MLlib, calculating the equipment health degree from vibration spectrum features, temperature trend and historical fault weight, and adopting respectively a coarse-grained dimension model, fine-grained dimension expansion or cross-equipment dimension association according to whether the health degree indicates a normal, early-warning or fault-risk state, thereby completing the analytical data modeling of the DW layer; based on the dimension model data constructed by the DW layer, dividing the time-series data cube at the DM layer into a real-time query layer, a history analysis layer and an archiving layer by storage tier, thereby realizing differentiated storage and fast query support at the data mart layer; and performing a first round of outlier correction on the ODS-layer raw data through a preset equipment physical-constraint rule engine, imputing missing values through an LSTM time-series prediction model trained with Spark ML, labeling the imputed values at the DW layer, and meanwhile comparing the repaired data against actual operation and maintenance records and dynamically optimizing the LSTM model parameters to improve imputation accuracy.