CN-121418413-B - Cluster service node self-adaptive deployment and monitoring method, device and storage medium

Abstract

The invention relates to a cluster service node self-adaptive deployment and monitoring method comprising the following steps: uploading cluster service node information to a management server; connecting the management server to each cluster service node; installing acquisition clients on the cluster service nodes in batches from the management server; each cluster service node requesting the management server to issue the corresponding acquisition script; the acquisition clients uploading acquired data to the management server; and the management server analyzing the acquired data to generate alarm information. The invention achieves full-process closed-loop management of cluster service node monitoring, from alarm triggering through operation and maintenance intervention to state recovery, effectively safeguards the platform's long-term stable operation and its operation and maintenance efficiency, and solves the lack of closed-loop operation and maintenance management.

Inventors

LV SHANSHAN
LI JINLI
ZHANG FUHONG

Assignees

KylinSoft Co., Ltd. (麒麟软件有限公司)

Dates

Publication Date: 20260505
Application Date: 20251226

Claims (8)

1. A cluster service node self-adaptive deployment and monitoring method, characterized by comprising the following steps:
Step S1, uploading the information of the cluster service nodes to be deployed in batches to a management server;
Step S2, the management server reads the IP address, user name and password from the information of each cluster service node and connects to each cluster service node by remote login;
Step S3, the management server issues an installation script of the acquisition client to each cluster service node, thereby installing the acquisition clients on the cluster service nodes in batches;
Step S4, the acquisition client on each cluster service node reads the service type and IP address parameter information of its node and requests the management server to issue the acquisition script matching the service type, comprising the following steps:
Step S41, uploading the acquisition script files of all service types to the management server and storing the acquisition script information in a database, implemented as follows:
1.1, writing different acquisition scripts according to service type and defining acquisition script information, wherein the acquisition script information comprises a type name, a script path, an acquisition time interval and a checksum, and is encrypted and packaged with AES to form an acquisition script file;
1.2, uploading the acquisition script file to the management server, which decrypts and unpacks it with AES;
1.3, the management server stores the acquisition script under the web shared directory http://<management server IP>/monitor/scripts;
1.4, the management server stores the corresponding acquisition script information in the database, and a node obtains the script download path by reading the acquisition script information in order to download the script;
Step S42, the acquisition client on each node reads the locally stored service type and initiates a script check request to the management server, passing the service type and IP address parameter information;
Step S43, the management server receives the request parameters, queries the database by service type for the acquisition script information (script name, script path, acquisition time interval and checksum), and feeds that information back to the corresponding node;
Step S44, the node receives and stores the feedback information, compares it with the node's own acquisition script information, and judges whether the same script exists locally; if not, step S45 is executed; otherwise step S5 is executed;
Step S45, the acquisition client on the node initiates an acquisition script download request to the management server, passing the acquisition script information of the required script: script name, script path, acquisition time interval and checksum;
Step S46, the management server receives the request and returns the acquisition script to the corresponding node according to the acquisition script information;
Step S5, the acquisition client periodically executes the acquisition script, uploads the acquired data to the management server, and stores the acquired data in the database, wherein the acquisition script execution logic comprises:
Step S51, constructing an empty dictionary for the acquired data;
Step S52, monitoring system-level common information: collecting and calculating the system's current CPU usage, memory usage and disk usage, and storing them in the dictionary;
Step S53, acquiring the Nginx service state of the node, returning active if the service is running and False otherwise, and storing the returned value in the dictionary;
Step S54, checking whether the NFS mount point directory /opt/kylin-server/html exists, returning True if it does and False otherwise, and storing the result in the dictionary;
Step S55, acquiring the Keepalived service state of the cluster service node, returning active if the service is running and False otherwise, and storing the returned value in the dictionary;
Step S56, extracting the virtual IP address from the /etc/keepalive.conf configuration file and storing it in the dictionary;
Step S57, querying the /32 IP address on the default network card with the ip command to obtain the virtual IP (VIP) currently bound by the system, and storing it in the dictionary;
Step S58, obtaining, through the get_ss_count function, the number of Nginx connections on port 443 in the states close-wait, time-wait and established, and storing the counts in the dictionary;
Step S59, obtaining the number of Nginx error log entries per minute through the get_errorlog_per_minute function and storing it in the dictionary;
Step S510, aggregating all collected metrics and returning the result as a dictionary, wherein if a command fails to execute, the corresponding value is None;
Step S6, the management server starts a monitoring service, analyzes the acquired data and generates alarm information.
2. The method for adaptively deploying and monitoring cluster service nodes according to claim 1, wherein in the step S1, the cluster service node information includes service name, service type, cluster name, IP address, user name, password, detection connection address, and port.
3. The method for adaptively deploying and monitoring cluster service nodes according to claim 1, wherein in step S3, the management server implements batch installation of the acquisition clients on the cluster service nodes by performing the following operations on each cluster service node: step S31, pulling the acquisition client installation script and the corresponding data from a shared directory of the management server and storing them under a specific directory of the cluster service node; and step S32, executing the acquisition client installation script on the cluster service node, designating the management server IP with the parameter -w, the cluster service node IP with a further parameter, and the shared directory information with the parameter -o.
4. The method for adaptively deploying and monitoring cluster service nodes according to claim 3, wherein execution of the client installation script comprises: step S321, initializing information and reading the input parameter values to obtain the cluster service node IP, the management server IP and the shared directory information; step S322, downloading and decompressing the installation package of the acquisition client from the shared directory of the management server; step S323, creating a local yum source; step S324, installing the acquisition client with yum; if the installation succeeds, executing step S325; if the installation fails, executing step S327; step S325, modifying the acquisition client configuration file agent; step S326, starting the acquisition client service ismp-guardian-agent; step S327, recording the installation log to /var/log/guardian_register.log.
5. The method for adaptively deploying and monitoring cluster service nodes according to claim 1, wherein step S6 comprises the following steps: step S61, uploading the alarm rules to the management server; step S62, deploying and starting the management server monitoring service ismp-guardian; step S63, the management server monitoring service starts a scheduler; step S64, each time the scheduler triggers a scheduling period, the management server monitoring service triggers the Groovy rule engine, reads the alarm rules from the database in turn, and judges whether the state of each alarm rule is enabled; if it is enabled, step S65 is executed; if it is not enabled, rule verification is not triggered and the next alarm rule is read; step S65, the management server monitoring service reads the corresponding acquired data in turn according to the service type corresponding to the alarm rule; and step S66, the management server monitoring service analyzes the acquired data, judges whether the alarm check rule is matched, and triggers an alarm if it is matched.
6. The method for adaptively deploying and monitoring cluster service nodes according to claim 5, wherein in step S61, the alarm rules comprise a load balancing connection availability exception rule, a VIP and IP binding state persistent exception rule, a keep service exception rule, a nginx_time_wait_count value exception rule, an errorlog_per_minute value exception rule, and a nginx_close_wait_count value exception rule.
7. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1-6 when the computer program is executed.
8. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1-6.
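
As an illustration of the script check in steps S42 to S46 of claim 1, the following Python sketch shows one way a node might decide whether an acquisition script has to be downloaded. It is a minimal sketch under stated assumptions, not the patented implementation: the ScriptInfo class, the function names, the in-memory dictionary standing in for the database, the use of SHA-256 for the checksum, the unit of the interval, and the verification after download are all hypothetical and are not specified by the claims.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class ScriptInfo:
    """Acquisition script information as listed in step 1.1 of claim 1."""
    type_name: str
    script_path: str
    interval_seconds: int  # the unit is an assumption
    checksum: str


def file_checksum(body: bytes) -> str:
    """Hex digest of a script body; SHA-256 is an assumption."""
    return hashlib.sha256(body).hexdigest()


def lookup(db: Dict[str, ScriptInfo], service_type: str) -> Optional[ScriptInfo]:
    """S43: the server looks up script information by service type."""
    return db.get(service_type)


def needs_download(local: Optional[ScriptInfo], remote: ScriptInfo) -> bool:
    """S44: download unless identical script information is stored locally."""
    return local != remote


def sync_script(service_type: str,
                local: Optional[ScriptInfo],
                db: Dict[str, ScriptInfo],
                download: Callable[[ScriptInfo], bytes]) -> Optional[ScriptInfo]:
    """S42 to S46: return the script information the node should hold."""
    remote = lookup(db, service_type)
    if remote is None or not needs_download(local, remote):
        return local  # nothing to fetch; the node proceeds to step S5
    body = download(remote)  # S45/S46: request and receive the script
    if file_checksum(body) != remote.checksum:  # extra check, not in the claims
        raise ValueError("downloaded script does not match its checksum")
    return remote
```

Comparing the whole record rather than only the checksum means that a changed acquisition interval or script path also triggers a new download, which is one possible reading of "the same script" in step S44.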

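The acquisition script logic of steps S51 to S510 in claim 1 can be sketched in Python as follows. This is a hedged illustration only. The function names get_ss_count and get_errorlog_per_minute come from the claim, but their signatures here are assumptions, as are the systemd unit names nginx and keepalived, the error log path /var/log/nginx/error.log, the dictionary keys, and the use of /proc for CPU and memory figures. For brevity the sketch scans all interfaces for /32 addresses instead of only the default network card, and it counts sockets by local port 443 rather than by owning process.

```python
import os
import re
import shutil
import subprocess
import time

IP_RE = r"\d{1,3}(?:\.\d{1,3}){3}"
KEEPALIVED_CONF = "/etc/keepalive.conf"       # path as written in the claim
NGINX_ERROR_LOG = "/var/log/nginx/error.log"  # assumed log location
NFS_MOUNT = "/opt/kylin-server/html"          # directory named in step S54


def run(cmd):
    """Return the stripped stdout of cmd; raise if the command fails."""
    done = subprocess.run(cmd, capture_output=True, text=True, timeout=10, check=True)
    return done.stdout.strip()


def safe(fn, *args):
    """S510: a probe that fails yields None instead of aborting the run."""
    try:
        return fn(*args)
    except Exception:
        return None


def read_text(path):
    with open(path, encoding="utf-8", errors="replace") as fh:
        return fh.read()


def cpu_percent(interval=0.2):
    """S52: CPU usage from two /proc/stat samples (Linux only)."""
    def sample():
        vals = [int(v) for v in read_text("/proc/stat").splitlines()[0].split()[1:9]]
        return sum(vals), vals[3] + vals[4]  # total ticks, idle + iowait ticks
    total0, idle0 = sample()
    time.sleep(interval)
    total1, idle1 = sample()
    return round(100 * (1 - (idle1 - idle0) / (total1 - total0)), 1)


def mem_percent():
    """S52: memory usage from /proc/meminfo (Linux only)."""
    pairs = (line.split(":", 1) for line in read_text("/proc/meminfo").splitlines())
    info = {key: int(value.split()[0]) for key, value in pairs}
    return round(100 * (1 - info["MemAvailable"] / info["MemTotal"]), 1)


def disk_percent(path="/"):
    """S52: disk usage of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return round(100 * usage.used / usage.total, 1)


def service_state(unit):
    """S53/S55: 'active' while the systemd unit runs, otherwise False."""
    return "active" if safe(run, ["systemctl", "is-active", unit]) == "active" else False


def parse_vips(conf_text):
    """S56: addresses listed in virtual_ipaddress { ... } blocks."""
    blocks = re.findall(r"virtual_ipaddress\s*\{([^}]*)\}", conf_text)
    return [ip for block in blocks for ip in re.findall(IP_RE, block)]


def parse_bound_vips(ip_addr_output):
    """S57: addresses with a /32 prefix in `ip -o -4 addr show` output."""
    return re.findall(rf"inet ({IP_RE})/32", ip_addr_output)


def get_ss_count(ss_output, state, port=443):
    """S58: count sockets in the given state whose local port is port."""
    rows = (line.split() for line in ss_output.splitlines())
    return sum(1 for f in rows
               if len(f) >= 5 and f[0] == state and f[3].endswith(f":{port}"))


def get_errorlog_per_minute(log_lines, minute):
    """S59: count error-log lines stamped with minute ('YYYY/MM/DD HH:MM')."""
    return sum(1 for line in log_lines if line.startswith(minute))


def collect(minute):
    """Run every probe and return the result dictionary (S51 to S510)."""
    data = {}  # S51
    data["cpu_percent"] = safe(cpu_percent)
    data["mem_percent"] = safe(mem_percent)
    data["disk_percent"] = safe(disk_percent)
    data["nginx_state"] = service_state("nginx")
    data["nfs_mounted"] = os.path.isdir(NFS_MOUNT)
    data["keepalived_state"] = service_state("keepalived")
    data["configured_vips"] = safe(lambda: parse_vips(read_text(KEEPALIVED_CONF)))
    data["bound_vips"] = safe(
        lambda: parse_bound_vips(run(["ip", "-o", "-4", "addr", "show"])))
    ss_out = safe(run, ["ss", "-tan"])
    for key, state in (("close_wait", "CLOSE-WAIT"), ("time_wait", "TIME-WAIT"),
                       ("established", "ESTAB")):
        data[f"nginx_{key}_count"] = safe(get_ss_count, ss_out, state)
    data["errorlog_per_minute"] = safe(lambda: get_errorlog_per_minute(
        read_text(NGINX_ERROR_LOG).splitlines(), minute))
    return data
```

The parsing helpers take command output or file content as plain strings, so they can be exercised without a running Nginx or Keepalived instance. In this sketch a service probe whose command fails is reported as False rather than None, because `systemctl is-active` exits with a non-zero status for any unit that is not running.
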
Description

Cluster service node self-adaptive deployment and monitoring method, device and storage medium

Technical Field

The invention relates to the technical field of cluster monitoring, and in particular to a cluster service node self-adaptive deployment and monitoring method, device and storage medium.

Background

The Galaxy Kylin server cluster is a large-scale cluster running the Galaxy Kylin Advanced Server Operating System, with advantages such as full-platform autonomy and controllability, a built-in intrinsic security system, and high availability. To improve the efficiency of unified management, upgrading, and operation and maintenance of large-scale cluster nodes, the Galaxy Kylin server cluster is built on the Galaxy Kylin Advanced Server Operating System, adopts a distributed architecture, and deploys the Galaxy Kylin system upgrade management platform, which supports key functions such as the platform's core business logic, basic service capabilities, data storage and message transmission. However, large-scale cluster node systems are complicated to deploy and have complex topologies, and when a fault or potential fault occurs the fault point cannot be detected immediately, so monitoring is time-consuming, labor-intensive, incomplete and inefficient. Therefore, to comprehensively improve the monitoring efficiency of large-scale cluster service nodes and ensure the continuous and stable operation of tens of thousands of hosts, a self-adaptive deployment and monitoring device suitable for Galaxy Kylin server cluster service nodes needs to be developed; deployed on the Galaxy Kylin system upgrade management platform, it provides timely warning of anomalies and assists rapid fault location.

At present, for operating-system-based server clusters, no mature self-adaptive deployment and monitoring scheme exists, and monitoring remains at a stage of weak automation. Related schemes deploy monitoring agents on some of the nodes in the server cluster and monitor the basic state of a limited number of nodes one by one. Specifically, operation and maintenance personnel install the same monitoring agent on the server cluster nodes one by one and perform complex configuration; the monitoring agents execute the same monitoring script at preset time intervals to collect node operation data; and all nodes use uniform monitoring metrics and collection frequencies. Existing monitoring techniques have the following disadvantages:
1. Low deployment and configuration efficiency: the monitoring agent must be installed and configured manually on each node, so in a large-scale cluster environment the deployment period is long, configuration consistency is difficult to ensure, and changes in cluster scale cannot be responded to quickly.
2. Poorly targeted monitoring strategy: a single uniform script is issued for monitoring, so differentiated monitoring cannot be implemented according to the characteristics of different node types such as database, message queue, cache, front end and load balancing, and the in-depth monitoring requirements of the various services are difficult to meet.
3. Missing operation and maintenance state tracking: the system only provides an abnormality alarm function, lacks a mechanism for tracking the operation and maintenance handling process, cannot automatically verify the effectiveness of repair measures, and can hardly form a complete operation and maintenance management closed loop.

Disclosure of Invention

To overcome the defects of the prior art, the invention provides a cluster service node self-adaptive deployment and monitoring method comprising the following steps:
Step S1, uploading the information of the cluster service nodes to be deployed in batches to a management server;
Step S2, the management server reads the IP address, user name and password from the information of each cluster service node and connects to each cluster service node by remote login;
Step S3, the management server issues an installation script of the acquisition client to each cluster service node, thereby installing the acquisition clients on the cluster service nodes in batches;
Step S4, the acquisition client on each cluster service node reads the service type and IP address parameter information of its node and requests the management server to issue the acquisition script matching the service type;
Step S5, the acquisition client periodically executes the acquisition script, uploads the acquired data to the management server, and stores the acquired data in a database;
Step S6, the management server starts a monitoring service, analyz