CN-122001739-A - Cluster monitoring system

Abstract

The invention provides a cluster monitoring system arranged in a monitored cluster, where the monitored cluster comprises a plurality of sub-monitoring elements. The monitoring system comprises a plurality of sub-monitoring units, each comprising a first monitoring subunit and a second monitoring subunit, and the sub-monitoring units are in one-to-one correspondence with the sub-monitoring elements in the monitored cluster. The first monitoring subunit performs fault monitoring on the corresponding sub-monitoring element. The second monitoring subunit monitors a sub-monitoring element other than the corresponding one, and the sub-monitoring elements monitored by the second monitoring subunits are different from one another. Each sub-monitoring element is therefore monitored by both a first monitoring subunit and a second monitoring subunit, and these two subunits do not belong to the same sub-monitoring unit. This reduces the cases in which the monitored cluster cannot be monitored because a first monitoring subunit is accidentally damaged, and improves the safety of the monitored cluster.

Inventors

  • CUI MENG
  • LIU HAILONG
  • ZHANG WENLING
  • Shen Cunjing
  • SONG QIONG

Assignees

  • 网联清算有限公司 (NetsUnion Clearing Corporation)

Dates

Publication Date
2026-05-08
Application Date
2024-11-01

Claims (10)

  1. A cluster monitoring system, wherein the system is installed in a monitored cluster, the monitored cluster comprising a plurality of sub-monitoring elements, the system comprising: a plurality of sub-monitoring units, each sub-monitoring unit comprising a first monitoring subunit and a second monitoring subunit; the sub-monitoring units are in one-to-one correspondence with the sub-monitoring elements in the monitored cluster, wherein the first monitoring subunit is used for performing fault monitoring on the corresponding sub-monitoring element, and the second monitoring subunit is used for monitoring one sub-monitoring element other than the corresponding sub-monitoring element; and the sub-monitoring elements monitored by the second monitoring subunits are different from each other.
  2. The system of claim 1, wherein the first monitoring subunit and the second monitoring subunit are further configured to determine the monitoring data obtained by monitoring, and to determine whether the monitoring data is fault data.
  3. The system of claim 2, wherein the system further comprises: a plurality of alarm modules, each alarm module corresponding to a first monitoring subunit or a second monitoring subunit and being used for generating fault alarm information when the monitoring data is determined to be fault data.
  4. The system of claim 3, wherein the system further comprises: a notification module used for acquiring the fault alarm information, determining a subscription object corresponding to the fault alarm information, and pushing the fault alarm information to the subscription object.
  5. The system of claim 4, wherein the notification module is further configured to: in response to simultaneously receiving fault alarm information sent by the first monitoring subunit and the second monitoring subunit, where the two pieces of fault alarm information belong to the same sub-monitoring element, combine the two pieces of fault alarm information into one piece of fault alarm information for pushing.
  6. The system of claim 3, wherein the system further comprises: a plurality of sub-database modules, each corresponding to one of the alarm modules and used for receiving the fault alarm information generated by the corresponding alarm module, with data shared among all the sub-database modules; the sub-database module is further configured to determine whether received fault alarm information is repeated fault alarm information, and to delete the received fault alarm information in response to determining that it is repeated fault alarm information.
  7. The system of any one of claims 1-6, wherein the system further comprises: a display module used for displaying the running state and the fault alarm information of the monitored cluster.
  8. The system of any one of claims 1-6, wherein the system further comprises: an automatic treatment module used for, when the monitoring data is determined to be fault data, determining a treatment scheme based on the fault data and treating the corresponding sub-monitoring element based on the treatment scheme.
  9. The system of claim 8, wherein the treatment scheme comprises one or more of automatic isolation and automatic recovery.
  10. The system of claim 8, wherein the automatic treatment module is further configured to: generate a maintenance work order based on the treatment scheme, and send the maintenance work order to a maintenance end for processing.
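As an illustration of the alarm merging in claim 5, the following is a minimal Python sketch and not the patented implementation. The `Alarm` record, its fields, and the `merge_alarms` and `push` helpers are hypothetical names introduced here; the claims do not specify data formats, timing windows, or how subscription objects are stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    """Hypothetical alarm record; field names are assumptions, not from the claims."""
    element: str  # sub-monitoring element the alarm is about
    source: str   # "first" or "second" monitoring subunit
    detail: str   # free-text fault description


def merge_alarms(alarms):
    """Combine alarms that concern the same element into one alarm per element."""
    grouped = {}
    for alarm in alarms:
        grouped.setdefault(alarm.element, []).append(alarm)
    merged = []
    for element, group in grouped.items():
        sources = sorted({a.source for a in group})
        details = sorted({a.detail for a in group})
        merged.append(Alarm(element, "+".join(sources), "; ".join(details)))
    return merged


def push(alarms, subscriptions):
    """Return (subscriber, alarm) pairs for the merged alarms."""
    return [
        (subscriber, alarm)
        for alarm in merge_alarms(alarms)
        for subscriber in subscriptions.get(alarm.element, [])
    ]


incoming = [
    Alarm("node-b", "first", "disk fault"),
    Alarm("node-b", "second", "disk fault"),
    Alarm("node-c", "first", "memory fault"),
]
merged = merge_alarms(incoming)
pushed = push(incoming, {"node-b": ["ops-team"]})
```

In this sketch the two alarms about `node-b` collapse into a single alarm before being pushed, so a subscriber receives one notification per faulty element rather than one per monitoring subunit.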

Description

Cluster monitoring system

Technical Field

The disclosure relates to the technical field of security management, and in particular to a cluster monitoring system.

Background

A cluster is composed of a plurality of mutually independent servers. Each server is called a node of the cluster, and the servers cooperate over the network to provide services to users. A machine fault is a server fault; common faults include disk faults, memory faults, network card faults, CPU faults, battery faults and the like. After a machine fault, the cluster services deployed on the machine cannot operate normally and the stability of the services is directly affected, so faults need to be identified, responded to and repaired in a timely manner.

A machine fault is usually found by manual inspection, or by an automatic alarm raised after a monitoring system finds that some index of the machine is abnormal. After receiving the fault notification or cluster alarm, a cluster manager manually isolates the cluster service and, once isolation is complete, manually notifies operation and maintenance personnel that the machine can be shut down for maintenance. The operation and maintenance personnel manually create a maintenance approval process; after the process is approved, they manually repair the machine and then manually notify the cluster manager, who resumes the cluster service.

Disclosure of Invention

The present disclosure aims to solve, at least to some extent, one of the technical problems in the related art. To this end, an object of the present disclosure is to propose a cluster monitoring system. A second object of the present disclosure is to propose a cluster monitoring method. A third object of the present disclosure is to provide a cluster monitoring device. A fourth object of the present disclosure is to propose an electronic device.
A fifth object of the present disclosure is to propose a non-transitory computer readable storage medium. A sixth object of the present disclosure is to propose a computer program product.

In order to achieve the above purpose, an embodiment of a first aspect of the present disclosure provides a cluster monitoring system. The system is installed in a monitored cluster, and the monitored cluster includes a plurality of sub-monitoring elements. The system includes a plurality of sub-monitoring units, each of which includes a first monitoring subunit and a second monitoring subunit, and the sub-monitoring units are in one-to-one correspondence with the sub-monitoring elements in the monitored cluster. The first monitoring subunit is configured to perform fault monitoring on the corresponding sub-monitoring element, the second monitoring subunit is configured to monitor one sub-monitoring element other than the corresponding sub-monitoring element, and the sub-monitoring elements monitored by the second monitoring subunits are different from each other.

According to one embodiment of the disclosure, the first monitoring subunit and the second monitoring subunit are further configured to determine the monitoring data obtained by monitoring, and to determine whether the monitoring data is fault data.

According to one embodiment of the disclosure, the system further comprises a plurality of alarm modules, wherein each alarm module corresponds to a first monitoring subunit or a second monitoring subunit and is used for generating fault alarm information when the monitoring data is determined to be fault data.
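The cross-monitoring arrangement of the first aspect can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the disclosed implementation: the disclosure only requires that each second monitoring subunit watch one other element and that these targets all differ, and the ring offset of one used here is just one assignment that satisfies those conditions. The names `assign_monitors`, `monitors_of`, `unit-0` and `node-a` are hypothetical.

```python
def assign_monitors(elements):
    """Map each unit to (own element, other element) using a ring offset of 1.

    The ring offset is an assumption; any assignment in which every second
    subunit targets a distinct element other than its own would also qualify.
    """
    n = len(elements)
    if n < 2:
        raise ValueError("cross-monitoring needs at least two elements")
    return {
        f"unit-{i}": (elements[i], elements[(i + 1) % n])
        for i in range(n)
    }


def monitors_of(assignment):
    """Map each element to the list of units that watch it."""
    watchers = {}
    for unit, (own, other) in assignment.items():
        watchers.setdefault(own, []).append(unit)    # first monitoring subunit
        watchers.setdefault(other, []).append(unit)  # second monitoring subunit
    return watchers


nodes = ["node-a", "node-b", "node-c"]
assignment = assign_monitors(nodes)
watchers = monitors_of(assignment)
```

With this assignment every element is watched by two different units, so the loss of one unit's first monitoring subunit still leaves its element covered by the second monitoring subunit of another unit.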
According to one embodiment of the disclosure, the system further comprises a notification module, wherein the notification module is used for acquiring the fault alarm information, determining a subscription object corresponding to the fault alarm information, and pushing the fault alarm information to the subscription object.

According to one embodiment of the disclosure, the notification module is further configured to, in response to simultaneously receiving fault alarm information sent by the first monitoring subunit and the second monitoring subunit, where the two pieces of fault alarm information belong to the same sub-monitoring element, combine the two pieces of fault alarm information into one piece of fault alarm information for pushing.

According to one embodiment of the disclosure, the system further comprises a plurality of sub-database modules. Each sub-database module corresponds to one alarm module and is used for receiving the fault alarm information generated by the corresponding alarm module, and data is shared among all the sub-database modules. The sub-database modules are further used for determining whether received fault alarm information is repeated fault alarm information, and for deleting the received fault alarm information in response to determining that it is repeated fault alarm information.

According to one