CN-121996211-A - Data center user information dynamic management system based on distributed architecture

Abstract

The invention discloses a data center user information dynamic management system based on a distributed architecture. The system comprises a distributed architecture layer, a core function module layer and a data interaction layer, wherein the distributed architecture layer provides distributed computing, storage and communication support for the system. A cluster of distributed computing nodes processes tasks in parallel and load balancing dynamically allocates requests, which addresses the low processing efficiency and response lag caused by rapid growth in equipment and data volume. Distributed storage with sharding and redundant backup avoids the single-point-of-failure risk of centralized data storage and keeps the system running stably. A distributed message queue synchronizes cross-regional data in real time and supports unified management and control of multi-regional data centers. In addition, a blockchain makes the whole asset life cycle traceable, an LSTM model improves fault early-warning accuracy, and multidimensional accounting together with an optimization algorithm optimizes energy consumption, so that the system meets the efficiency, security and energy-saving management requirements of a modern data center.

Inventors

WENG ZHIMING
Luo Jiazhao
CHEN CAN
CHEN DENGBIAO
XU YONGPING

Assignees

Fujian Digital Fujian Cloud Computing Operation Co., Ltd. (福建省数字福建云计算运营有限公司)

Dates

Publication Date: 20260508
Application Date: 20251210

Claims (8)

1. A data center user information dynamic management system based on a distributed architecture, characterized by comprising a distributed architecture layer, a core function module layer and a data interaction layer; the distributed architecture layer provides distributed computing, storage and communication support for the system and comprises distributed computing nodes, distributed storage nodes, load balancing nodes and data synchronization nodes; the core function module layer realizes the full life cycle management function of the data center based on the distributed architecture layer and comprises an asset management and life cycle tracking module, a multi-dimensional monitoring module, an intelligent alarm module, an energy consumption management and optimization module, an automatic operation and maintenance and work order system module, and a security and authority management module; the data interaction layer is responsible for bidirectional data transmission between modules, between the system and external equipment, and between the system and tools.
2. The data center user information dynamic management system based on the distributed architecture according to claim 1, characterized in that the distributed computing nodes are deployed as a cluster and split the computing tasks of all functional modules across a plurality of nodes for parallel processing; the distributed storage nodes use a sharded storage mechanism to split and store equipment asset data, monitoring data and operation logs according to preset rules, and guarantee data security through redundant backup; the load balancing nodes monitor the load states of all computing nodes and storage nodes in real time and dynamically distribute task requests; and the data synchronization nodes synchronize data across nodes and regions in real time through distributed message queues.
3. The data center user information dynamic management system based on the distributed architecture according to claim 1, characterized in that the asset management and life cycle tracking module collects the model numbers, serial numbers, purchasing information and maintenance records of cabinets, servers and switches by means of distributed crawlers and interface docking, establishes a distributed asset ledger, and uses blockchain technology to record the whole-process operation of each device from purchasing, warehousing, racking, operation and maintenance to scrapping, generating a tamper-proof life cycle tracking chain that supports cross-node queries of asset change history; the multi-dimensional monitoring module deploys monitoring agents at the distributed computing nodes, collects in real time the overall temperature and humidity of the machine room and the CPU utilization, memory occupancy, hard disk read-write speed and network traffic of each cabinet and individual device, preprocesses the collected raw data at the distributed computing nodes, and screens out valid data for transmission to the core storage nodes, thereby reducing the data transmission bandwidth occupied, wherein the specific logic steps are as follows:
S101, matching an adaptation agent from the system's 'equipment-agent mapping library' according to equipment type and automatically distributing it to the target node through Ansible; the agent reads parameters from the distributed configuration center to determine the acquisition rules and abnormal thresholds; the agent also sends a heartbeat packet to an edge node every 10 seconds, and if no 'acknowledgement frame' is received 3 consecutive times, an 'agent offline alarm' is triggered and the node ID is recorded;
S102, acquiring temperature data and hardware indexes, and packaging the acquired data into JSON;
S103, for a single device, taking 10 consecutive samples of the same index and first calculating the mean $\mu$ and the standard deviation $\sigma$; if a sample satisfies $|x_k - \mu| > 3\sigma$, it is replaced with the mean of the preceding 3 valid samples; noise is then removed with a moving average of window size $n = 5$ to obtain the filtered value $y_k$; if $y_k$ is within the normal range and its difference $\Delta y$ from the last valid value is less than the minimum change threshold, the data is judged redundant and discarded; if $y_k$ exceeds the threshold or its change rate meets the change-rate criterion, the data is judged valid;
S104, compressing the valid data with the Zstandard algorithm and calculating the compression rate from the original size $S_{org}$ and the compressed size $S_{com}$; the actual compression rate can reach 75%;
S105, transmitting alarm-type high-priority data over WebSocket and normal data in batches over HTTP/2; the distributed computing node obtains the storage node loads through the load balancing nodes, calculates a weight for each node from its storage utilization rate $R_i$ and write response time $T_i$, and transmits the data to the node with the largest weight;
S106, the distributed computing node calculates the MD5 value of the compressed data and encapsulates the MD5 value and the data into a transmission packet; after receiving the packet, the distributed storage node recalculates the MD5 value, decompresses the data if the values match and has the packet retransmitted if they do not; the distributed storage node shards the data by 'equipment type-acquisition date-index type' and builds a distributed index containing the equipment ID, the timestamp and the storage location to support fast queries;
S107, after an agent is detected to be offline, automatically restarting it remotely through SSH, generating an 'agent repair work order' when the agent is restarted and assigning the work order to operation and maintenance personnel; when a distributed computing node is disconnected from the distributed storage node, temporarily storing the valid data on a local SSD, retransmitting it in batches in ascending timestamp order after the connection is restored, and deleting the local cache once the retransmission is complete.
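The S103 preprocessing in claim 3 can be sketched roughly as follows. This is an illustrative sketch and not the patented implementation: the function names, the use of the population standard deviation, and the `low`, `high` and `min_delta` parameters are assumptions. Note that with exactly 10 samples no point can lie more than 3 population standard deviations from the mean, so the strict $3\sigma$ test only fires with a longer window or a smaller multiplier; the multiplier `k` is therefore left as a parameter.

```python
from statistics import mean, pstdev


def replace_outliers(samples, k=3.0):
    """Replace samples with |x - mu| > k*sigma by the mean of up to 3 preceding cleaned samples."""
    mu = mean(samples)
    sigma = pstdev(samples)  # population standard deviation (an assumption)
    cleaned = []
    for x in samples:
        if sigma > 0 and abs(x - mu) > k * sigma:
            previous = cleaned[-3:]
            # Fall back to the window mean when no earlier sample exists.
            x = mean(previous) if previous else mu
        cleaned.append(x)
    return cleaned


def moving_average(values, n=5):
    """Trailing moving average; the first n-1 outputs use a shorter window."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - n + 1): i + 1]
        out.append(sum(window) / len(window))
    return out


def is_valid(y, last_valid, low, high, min_delta):
    """Keep a value if it leaves the normal range or changed enough since the last kept value."""
    if y < low or y > high:
        return True
    return last_valid is None or abs(y - last_valid) >= min_delta


# Example: one spike among ten temperature readings, using a stricter multiplier.
raw = [20.0] * 9 + [100.0]
filtered = moving_average(replace_outliers(raw, k=2.0))
```

Values for which `is_valid` returns `False` would be dropped before compression in S104, which is where the bandwidth saving described in the claim comes from.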
4. The data center user information dynamic management system based on the distributed architecture according to claim 1, characterized in that the intelligent alarm module analyzes monitoring data with a distributed machine learning model, trains a prediction model on historical fault data to give early warning of equipment faults, and establishes a multi-dimensional alarm association rule base; when several related indexes, such as a temperature rise and a drop in fan speed, trigger alarms at the same time, the alarm level is automatically raised; alarm information is pushed in real time to the terminal of the responsible person through a distributed message pushing mechanism, and the whole process of alarm confirmation, assignment, handling and closing is recorded to form closed-loop management, wherein the specific logic steps are as follows:
S201, collecting equipment monitoring data and fault records from the past 3 years, removing outliers with the $3\sigma$ principle, that is, removing data with $|x - \mu| > 3\sigma$, and normalizing the valid data to the interval [0, 1] according to a formula;
S202, using the Spark MLlib framework to split the normalized data into a training set and a test set at a ratio of 7:3, and constructing an LSTM network; during training, the weight $W$ and bias $b$ are optimized by back propagation so that the fault risk probability output by the model becomes more accurate; in the core prediction formula, $H_t$ is the LSTM hidden layer output and $P$ is the probability of a fault within the next 1 hour; the model is verified on the test set and is considered usable if its accuracy meets the requirement;
S203, receiving the preprocessed data of the multi-dimensional monitoring module through a Kafka message queue and storing it in a Redis cache classified by equipment ID; for each device, extracting the time series data of the last 5 minutes from Redis, feeding it into the trained LSTM model and calculating the fault risk probability $P$ according to the formula; if $P \ge 80\%$, a 'primary predictive alarm' is triggered; if $50\% < P < 80\%$, a 'tertiary early warning prompt' is triggered; if $P \le 50\%$, no alarm is triggered;
S204, comparing real-time indexes with preset thresholds and triggering a 'secondary single-dimension alarm' if a single index exceeds its threshold; when the system receives several single-dimension alarms at the same time, it automatically queries the association rule base and matches the association rules;
S205, packaging the alarm information in JSON format and selecting a push channel according to equipment ownership and alarm level: primary alarms are pushed to the operation and maintenance manager and engineers through WebSocket, with short messages sent at the same time; tertiary early warnings are pushed only to the corresponding engineers through enterprise WeChat; the pushed content contains the packaged alarm JSON data, so that the responsible persons receive it in real time.
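The threshold logic of S203 and the channel selection of S205 in claim 4 can be sketched as below. The claim's normalization formula is not reproduced in this text, so min-max scaling is assumed here as one common way of mapping data to [0, 1]; the function names and channel identifiers are likewise hypothetical.

```python
def min_max_normalize(values):
    """Scale values to [0, 1] with min-max scaling (an assumed choice of formula)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series carries no spread; map it to zeros.
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def classify_risk(p):
    """Map a fault risk probability to the alarm level described in S203."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if p >= 0.8:
        return "primary"   # primary predictive alarm
    if p > 0.5:
        return "tertiary"  # tertiary early warning prompt
    return None            # P <= 50%: no alarm


# Push channels per alarm level as described in S205 (identifiers are illustrative).
PUSH_CHANNELS = {
    "primary": ("websocket", "sms"),
    "tertiary": ("enterprise_wechat",),
}


def channels_for(level):
    """Return the push channels for an alarm level, or an empty tuple if none apply."""
    return PUSH_CHANNELS.get(level, ())
```

In a full pipeline the probability passed to `classify_risk` would come from the trained LSTM model; the sketch deliberately leaves the model out.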
5. The data center user information dynamic management system based on the distributed architecture according to claim 1, characterized in that the energy consumption management and optimization module collects the total energy consumption, IT equipment energy consumption and cooling system energy consumption of the data center in real time through distributed metering nodes, calculates the PUE value in real time using the parallel computing capability of the distributed computing nodes, generates daily, weekly and monthly energy consumption trend charts from the historical energy consumption data held in distributed storage based on a time series analysis algorithm, supports user-defined energy consumption statistical dimensions, calculates energy cost by machine room, cabinet and enterprise, and gives energy-saving suggestions through a distributed optimization algorithm, wherein the specific logic steps are as follows:
S301, arranging intelligent metering nodes at the main power distribution room, the IT equipment cabinets and the cooling system of the data center and collecting energy consumption data in real time, including the total energy consumption $E_{total}$, the IT equipment energy consumption $E_{IT}$ and the cooling system energy consumption $E_{cool}$, with the collection frequency set to once per minute; the metering nodes transmit the raw data to the distributed computing nodes through the LoRaWAN protocol; the distributed computing nodes verify the collected data, reject outliers, and fill in missing data with the mean of the last 5 valid readings: $E_{fill} = \frac{1}{5}\sum_{i=1}^{5} E_{t-i}$, where $E_{fill}$ is the filled energy consumption value and $E_{t-1}$ to $E_{t-5}$ are the previous 5 valid readings;
S302, packaging the verified energy consumption data with 'acquisition time, equipment type, position' labels and storing it in a distributed database for subsequent calculation and retrieval;
S303, the distributed computing node calculates the PUE in real time according to the formula $\mathrm{PUE} = E_{total} / E_{IT}$, where $E_{total}$ is the total energy consumption of the data center and $E_{IT}$ is the energy consumption of the IT equipment; the closer the PUE is to 1, the higher the energy efficiency; the computing node synchronizes the PUE value to the distributed cache every 5 minutes;
S304, retrieving the historical energy consumption data held in distributed storage, analyzing the trend with the ARIMA time series algorithm, and generating daily/weekly/monthly energy consumption trend line charts with ECharts to show peak consumption periods intuitively;
S305, calculating the mean $\mu_{PUE}$ and the standard deviation $\sigma_{PUE}$ of the PUE over the last 7 days; if the PUE of the current day satisfies $\mathrm{PUE} > \mu_{PUE} + 1.5\sigma_{PUE}$, the PUE is judged to have risen abnormally and an 'energy efficiency early warning' is triggered;
S306, calculating the energy consumption of each dimension, including machine room energy consumption and enterprise energy consumption: the machine room energy consumption is the sum of the energy consumption of the M cabinets in the machine room, and the enterprise energy consumption is the sum of the energy consumption of the N servers leased by the enterprise;
S307, recording the electricity unit price for each time period and calculating the cost of each dimension according to the formula $C = E \times P$, where $C$ is the energy cost, $E$ is the energy consumption of the statistical dimension, and $P$ is the electricity unit price of the corresponding time period;
S308, optimizing the operation strategy of the IT equipment and the parameters of the cooling system with a genetic algorithm, taking 'PUE minimization' and 'cost minimization' as the objectives, and calculating the energy-saving potential of each optimization scheme according to the formula $\Delta E = E_{current} - E_{optimize}$, where $E_{current}$ is the current energy consumption, $E_{optimize}$ is the predicted energy consumption after optimization, and $\Delta E$ is the energy-saving potential; suggestions are output according to the optimization result and pushed synchronously to the operation and maintenance terminal, and the change in energy consumption after a suggestion is executed is recorded to verify the energy-saving effect in a closed loop.
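The arithmetic in S301, S303, S305, S307 and S308 of claim 5 is simple enough to sketch directly. The sketch assumes the standard definition PUE = E_total / E_IT, uses the population standard deviation for the 7-day window (the claim does not say which estimator is meant), and all function names are hypothetical.

```python
from statistics import mean, pstdev


def fill_missing(last_five_valid):
    """S301: fill a missing reading with the mean of the last 5 valid readings."""
    if len(last_five_valid) != 5:
        raise ValueError("expected exactly 5 valid readings")
    return mean(last_five_valid)


def pue(e_total, e_it):
    """S303: power usage effectiveness, total energy divided by IT energy."""
    if e_it <= 0:
        raise ValueError("IT energy consumption must be positive")
    return e_total / e_it


def pue_warning(today, last_seven_days):
    """S305: flag today's PUE if it exceeds mean + 1.5 * std of the last 7 days."""
    mu = mean(last_seven_days)
    sigma = pstdev(last_seven_days)  # population std (an assumption)
    return today > mu + 1.5 * sigma


def energy_cost(energy_kwh, unit_price):
    """S307: cost C = E * P for one statistical dimension and one tariff period."""
    return energy_kwh * unit_price


def saving_potential(e_current, e_optimized):
    """S308: energy-saving potential, delta E = E_current - E_optimize."""
    return e_current - e_optimized
```

For example, a facility drawing 1500 kWh in total while its IT equipment draws 1000 kWh has a PUE of 1.5; the 500 kWh difference is overhead such as cooling and power distribution.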
6. The data center user information dynamic management system based on the distributed architecture according to claim 1, characterized in that the automatic operation and maintenance and work order system module integrates a distributed cluster of automation tools; through distributed deployment of the Ansible and Puppet tools, operating system installation, application deployment and configuration delivery for newly racked servers are automated; based on preset fault handling rules, simple and predictable faults are repaired automatically by edge computing nodes, which also generate operation and maintenance work orders automatically; the work order system uses a distributed circulation mechanism to support the creation, approval, assignment and execution tracking of fault repairs, equipment racking and unracking applications, and asset changes, and synchronizes work order status to the relevant nodes in real time.
7. The data center user information dynamic management system based on the distributed architecture according to claim 1, characterized in that the security and authority management module adopts a role-based distributed authority control model that distinguishes the roles of administrator, operation and maintenance personnel, visitor and auditor and configures different cross-node access authorities for each role; records the login, configuration modification and data query operations of all users through a distributed log collection system, and stores the log data in encrypted shards; and supports integration with a distributed video monitoring system so that real-time pictures of the machine room can be viewed across nodes, strengthening physical security management and control.
8. The data center user information dynamic management system based on the distributed architecture according to claim 1, characterized in that the data interaction layer combines RESTful API, WebSocket and distributed message queue communication to support bidirectional data interaction between the system and data center equipment, third-party operation and maintenance tools and user terminals; encrypts the data transmission process with an encrypted transmission protocol to ensure data transmission security; and supports adaptive protocol conversion so as to be compatible with the communication protocols of different equipment and tools.

Description

Data center user information dynamic management system based on distributed architecture

Technical Field

The invention relates to the technical field of dynamic management of data center user information, and in particular to a data center user information dynamic management system based on a distributed architecture.

Background

As data centers continue to grow in scale, the number of devices increases sharply and cross-regional data center layouts are becoming the trend. Traditional centralized data center management systems face several challenges:

1. Under a centralized architecture, computing and storage resources are deployed centrally; once the number of devices and the data volume reach a certain scale, system processing efficiency drops sharply and response slows.
2. Centralized data storage carries a single-point-of-failure risk: once the core storage device fails, the whole system may be paralyzed.
3. Data synchronization between cross-regional data centers lags, so unified management cannot be achieved.
4. Traditional systems offer limited asset traceability, insufficient fault early-warning accuracy and coarse-grained energy consumption management, and struggle to meet the efficiency, security and energy-saving management requirements of modern data centers.

Disclosure of Invention

In view of the technical problems described in the background, the invention provides a data center user information dynamic management system based on a distributed architecture.
The invention provides a data center user information dynamic management system based on a distributed architecture, which comprises a distributed architecture layer, a core function module layer and a data interaction layer; the distributed architecture layer provides distributed computing, storage and communication support for the system and comprises distributed computing nodes, distributed storage nodes, load balancing nodes and data synchronization nodes; the core function module layer realizes the full life cycle management function of the data center based on the distributed architecture layer and comprises an asset management and life cycle tracking module, a multi-dimensional monitoring module, an intelligent alarm module, an energy consumption management and optimization module, an automatic operation and maintenance and work order system module, and a security and authority management module; the data interaction layer is responsible for bidirectional data transmission between modules, between the system and external equipment, and between the system and tools.

The distributed computing nodes are deployed as a cluster and split the computing tasks of all functional modules across a plurality of nodes for parallel processing; the distributed storage nodes use a sharded storage mechanism to split and store equipment asset data, monitoring data and operation logs according to preset rules, and guarantee data security through redundant backup; the load balancing nodes monitor the load states of all computing nodes and storage nodes in real time and dynamically distribute task requests; and the data synchronization nodes synchronize data across nodes and regions in real time through distributed message queues.
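As a rough illustration of how sharded storage with redundant copies and load-aware dispatch fit together, consider the sketch below. The patent does not specify the sharding rule, the replica layout or the balancing policy; the hash-based placement, the next-node replica layout, the least-loaded choice and all names here are assumptions made only for illustration.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a record key to a shard with a stable hash (assumed sharding rule)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def replica_nodes(shard, nodes, replicas=2):
    """Place a shard on consecutive nodes so that each shard has redundant copies."""
    count = min(replicas, len(nodes))
    return [nodes[(shard + i) % len(nodes)] for i in range(count)]


def pick_least_loaded(loads):
    """Dispatch a request to the node currently reporting the lowest load."""
    return min(loads, key=loads.get)


# Example: place one monitoring record and pick a node for the next task.
storage_nodes = ["store-a", "store-b", "store-c"]
shard = shard_for("server-0042/2025-12-10/cpu", num_shards=len(storage_nodes))
copies = replica_nodes(shard, storage_nodes)
target = pick_least_loaded({"compute-1": 0.7, "compute-2": 0.2, "compute-3": 0.5})
```

With two replicas on distinct nodes, the loss of any single storage node leaves at least one copy of every shard available, which is the single-point-of-failure property the architecture aims for.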
Preferably, the asset management and life cycle tracking module collects the model numbers, serial numbers, purchasing information and maintenance records of cabinets, servers and switches by means of distributed crawlers and interface docking, establishes a distributed asset ledger, and uses blockchain technology to record the whole-process operation of each device from purchasing, warehousing, racking, operation and maintenance to scrapping, generating a tamper-proof life cycle tracking chain that supports cross-node queries of asset change history; the multi-dimensional monitoring module deploys monitoring agents at the distributed computing nodes, collects in real time the overall temperature and humidity of the machine room and the CPU utilization, memory occupancy, hard disk read-write speed and network traffic of each cabinet and individual device, preprocesses the collected raw data at the distributed computing nodes, and screens out valid data for transmission to the core storage nodes, thereby reducing the data transmission bandwidth occupied, wherein the specific logic steps are as follows: S101, matching an adaptation agent from a system 'equipment-agent mapping library' according to equipment types, automatically distributing the adaptation agent to a target node through Ansible, reading distributed configuration center parameters by the agent, determining acquisition rules and abnormal thresholds, simultaneously sending heartb