US-12619512-B2 - System failure monitoring device and system failure monitoring method

Abstract

Provided is a system failure monitoring device capable of easily estimating a root cause of a system failure. A data collection unit acquires pod/calculation node information, which is configuration information regarding a pod that executes a process according to a request, and tracing data regarding the request processed by the pod. A determination unit including a processing time calculation unit and an abnormality degree calculation determination unit determines, for each of the requests, whether the request is an abnormal request relevant to an abnormality of a monitoring target system, based on the tracing data and the pod/calculation node information. A presentation unit including an abnormal request distribution calculation unit and an abnormal request visualization unit generates and presents visualized data indicating an abnormal request distribution obtained by plotting the abnormal requests in a request space defined by a coordinate axis related to the requests.
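The abnormal request distribution mentioned in the abstract can be pictured as a count of abnormal requests per cell of the request space. The following is only a minimal sketch under stated assumptions: the two coordinate axes are taken, hypothetically, to be the source pod and the destination pod of each request, the abnormality flag is assumed to have been computed already by the determination unit, and the names `Request` and `abnormal_request_distribution` are illustrative and do not appear in the patent.

```python
from collections import Counter
from typing import NamedTuple


class Request(NamedTuple):
    source_pod: str       # hypothetical coordinate axis 1
    destination_pod: str  # hypothetical coordinate axis 2
    is_abnormal: bool     # assumed output of the determination unit


def abnormal_request_distribution(requests):
    """Count abnormal requests per (source, destination) cell of the request space."""
    return Counter(
        (r.source_pod, r.destination_pod) for r in requests if r.is_abnormal
    )


# Illustrative data only; pod names are made up.
requests = [
    Request("frontend-1", "cart-1", True),
    Request("frontend-1", "cart-1", True),
    Request("frontend-2", "cart-1", False),
    Request("frontend-2", "cart-2", True),
]
distribution = abnormal_request_distribution(requests)
```

A visualization layer could then render `distribution` as a grid in which cells with many abnormal requests stand out, which is one plausible reading of how a concentrated region hints at a root cause.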

Inventors

Shinya Furukawa
Masaki Kimura
Kazumasa Tobe
Seiji Aguchi

Assignees

HITACHI, LTD.

Dates

Publication Date: 2026-05-05
Application Date: 2024-09-04
Priority Date: 2024-03-19

Claims (6)

1. A system failure monitoring device that monitors a monitoring target system including a plurality of components that execute processes according to requests, the system failure monitoring device comprising:
  a collection unit that collects configuration information regarding each of the components and request information regarding each of the requests processed by each of the components;
  a determination unit that determines, for each of the requests, whether the request is an abnormal request relevant to an abnormality of the monitoring target system, based on the configuration information and the request information; and
  a presentation unit that generates and presents visualized data indicating an abnormal request distribution obtained by plotting the abnormal request in a request space defined by a coordinate axis related to the requests,
  wherein the determination unit includes:
    a processing time calculation unit that calculates, for each of the requests, a processing time required for a process according to the request, based on the configuration information and the request information; and
    a determination execution unit that, for each of the requests, determines whether the request is the abnormal request, based on a comparison value obtained by comparing the processing time of the request with a processing time of a comparison target request having a predetermined homogeneous relationship with the request,
  wherein the processing time of the request is a time obtained by subtracting a stand-by time from a response time, the response time being taken from when a component receives the request to when the component responds to the request, and the stand-by time being taken from when the component transmits a subordinate request corresponding to the request to another one of the components to when the component receives a response from the another component,
  wherein the comparison target request is a request having a same transmission source component, which is one of the components as a transmission source, and a same transmission destination component, which is another one of the components as a transmission destination, as the request,
  wherein one of the components provides a specific service,
  wherein the monitoring target system includes a plurality of components that provide a same service as the specific service, and
  wherein, when a number of requests having the same transmission source component and the same transmission destination component as the request is smaller than a predetermined required number, the determination execution unit substitutes, as the comparison target request, a request having the same service provided by the transmission source component and the same service provided by the transmission destination component as the request.
2. The system failure monitoring device according to claim 1, wherein the request information is error log information indicating a history of abnormalities related to the requests acquired in the monitoring target system, and the determination unit determines, for each of the requests, whether the request is an abnormal request based on the error log information.
3. The system failure monitoring device according to claim 1, wherein the presentation unit selects a display coordinate axis that is a coordinate axis to be used in the visualized data from a plurality of coordinate axes different from each other, based on a total distance that is a sum of distances between positions of respective abnormal requests in the request space and a position of a centroid between the abnormal requests for each of the plurality of coordinate axes.
4. The system failure monitoring device according to claim 3, wherein the coordinate axes are settable by a user.
5. The system failure monitoring device according to claim 1, wherein the determination unit calculates, for each of the requests, an abnormality degree obtained by evaluating a degree of relevance of the request to the abnormality of the monitoring target system based on the configuration information and the request information, and determines whether the request is an abnormal request based on the abnormality degree, and the visualized data is a heat map indicating a plot of the abnormal request with visual information according to the abnormality degree of the abnormal request.
6. A system failure monitoring method performed by a system failure monitoring device that monitors a monitoring target system including a plurality of components that execute processes according to requests, the system failure monitoring method comprising:
  collecting configuration information regarding each of the components and request information regarding each of the requests processed by each of the components;
  determining, for each of the requests, whether the request is an abnormal request relevant to an abnormality of the monitoring target system, based on the configuration information and the request information; and
  generating and presenting visualized data indicating an abnormal request distribution obtained by plotting the abnormal request in a request space defined by a coordinate axis related to the requests,
  wherein the determination step includes:
    calculating, for each of the requests, a processing time required for a process according to the request, based on the configuration information and the request information; and
    determining, for each of the requests, whether the request is the abnormal request, based on a comparison value obtained by comparing the processing time of the request with a processing time of a comparison target request having a predetermined homogeneous relationship with the request,
  wherein the processing time of the request is a time obtained by subtracting a stand-by time from a response time, the response time being taken from when a component receives the request to when the component responds to the request, and the stand-by time being taken from when the component transmits a subordinate request corresponding to the request to another one of the components to when the component receives a response from the another component,
  wherein the comparison target request is a request having a same transmission source component, which is one of the components as a transmission source, and a same transmission destination component, which is another one of the components as a transmission destination, as the request,
  wherein one of the components provides a specific service,
  wherein the monitoring target system includes a plurality of components that provide a same service as the specific service, and
  wherein, when a number of requests having the same transmission source component and the same transmission destination component as the request is smaller than a predetermined required number, the determination execution unit substitutes, as the comparison target request, a request having the same service provided by the transmission source component and the same service provided by the transmission destination component as the request.
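The processing-time definition and the comparison-target fallback recited in claims 1 and 6 can be illustrated with a short sketch. Several details here are assumptions rather than anything the claims specify: the `Span` record and its field names are hypothetical, the request under evaluation is excluded when counting same-pair requests, the "predetermined required number" is arbitrarily set to 3, and the comparison value is taken to be the ratio of the request's processing time to the median processing time of the comparison targets (the claims only say the value is obtained "by comparing").

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Span:
    # All field names are hypothetical, chosen for illustration.
    src: str               # transmission source component
    dst: str               # transmission destination component
    src_service: str       # service provided by the source component
    dst_service: str       # service provided by the destination component
    response_time: float   # from receiving the request to responding to it
    standby_time: float    # time spent waiting on subordinate requests

    @property
    def processing_time(self) -> float:
        # Processing time = response time - stand-by time, as recited in claim 1.
        return self.response_time - self.standby_time


def comparison_targets(target, spans, required=3):
    """Prefer requests with the same source/destination components; if fewer
    than `required` exist, fall back to requests with the same services."""
    same_pair = [s for s in spans
                 if s is not target and (s.src, s.dst) == (target.src, target.dst)]
    if len(same_pair) >= required:
        return same_pair
    return [s for s in spans
            if s is not target
            and (s.src_service, s.dst_service) == (target.src_service, target.dst_service)]


def comparison_value(target, spans, required=3):
    """Assumed comparison value: ratio to the median of the comparison targets."""
    peers = comparison_targets(target, spans, required)
    return target.processing_time / median(p.processing_time for p in peers)


# Illustrative data only; component and service names are made up.
target = Span("fe-1", "cart-1", "frontend", "cart", 150.0, 100.0)
spans = [
    target,
    Span("fe-1", "cart-1", "frontend", "cart", 30.0, 20.0),
    Span("fe-2", "cart-2", "frontend", "cart", 25.0, 15.0),
    Span("fe-2", "cart-1", "frontend", "cart", 42.0, 30.0),
    Span("cart-1", "db-1", "cart", "db", 20.0, 0.0),
]
ratio = comparison_value(target, spans)
```

In this example only one other request shares the `fe-1` to `cart-1` pair, so the sketch falls back to all frontend-to-cart requests. Subtracting the stand-by time keeps a slow downstream component from inflating the processing time attributed to the component that is merely waiting on it.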

Description

CROSS-REFERENCE TO RELATED APPLICATION

The present application claims priority from Japanese application JP2024-043660, filed on Mar. 19, 2024, the content of which is hereby incorporated by reference into this application.

TECHNICAL FIELD

The present disclosure relates to a system failure monitoring device and a system failure monitoring method.

BACKGROUND ART

With the spread of distributed systems such as microservice architectures, managing their operation is becoming increasingly difficult. For example, when a failure occurs in a distributed system, it is necessary to quickly identify whether the failure originates in an application or in the infrastructure serving as its platform, and an administrator who operates the distributed system needs to shorten the time required to identify the root cause.

To address this problem, it is important to introduce a technology that monitors the operation of a system and detects signs of a problem at an early stage. Such a technology can significantly shorten the time required for root cause analysis and can improve system stability.

In this regard, PTL 1 discloses a technology of calculating a feature from data collected using a monitoring tool, and determining a cause of a failure in a microservice (a causal relationship between a failure of an infrastructure and a failure of an application) based on the feature. In this technology, a relationship between a feature and a teacher label is learned, and a cause of a failure in a microservice is determined using a feature acquired in an actual environment.

CITATION LIST

Patent Literature

PTL 1: JP 2021-144401 A

SUMMARY OF INVENTION

Technical Problem

In order to meet demands for various applications and services, a distributed system is generally designed to execute applications in a distributed manner using virtual processing units.
In this case, it is expected that resources can be utilized more effectively, and scalability and flexibility can be improved. However, in such a distributed system, a large number of virtual processing units are involved, and the interactions and dependency relationships between different applications executed in a virtual environment are complicated, making it difficult to identify the cause of a failure when one occurs and to analyze the failure efficiently.

In a case where the technology described in PTL 1 is applied to the above-described distributed system, it is necessary to acquire information indicating a failure of an application and information indicating a failure of a processing node, evaluate the hierarchical relationship and the dependency relationship between applications executed on different processing nodes, and also calculate a similarity between error messages in the different pieces of failure information. In addition, these evaluation values are held as features, and training data in which the features are associated with acquired teacher labels is created. Using this training data, a failure prediction model that determines whether two pieces of failure information are relevant to each other is generated.

In this case, a cause of a failure in the virtualized system can be efficiently analyzed, but there are the following problems. Specifically, since failure data and training data must be accumulated, whenever a new application is introduced it is necessary to collect and accumulate failure data and training data related to the new application. If the data is insufficient, the accuracy and reliability of the failure prediction model will be reduced.
In addition, even in a case where an application is updated, the updated application may exhibit failure causes and behaviors that are not covered by the failure prediction model built for the application before the update, and thus the failure prediction model must be updated in order to maintain accuracy and reliability. For this reason, keeping up with development styles such as agile development, in which applications are updated quickly, incurs a recurring cost for updating the failure prediction model.

The present disclosure has been made in view of the aforementioned problems, and an object of the present disclosure is to provide a system failure monitoring device and a system failure monitoring method capable of easily estimating a root cause of a system failure.

Solution to Problem

A system failure monitoring device according to an aspect of the present disclosure is a system failure monitoring device that monitors a monitoring target system including a plurality of components that execute processes according to requests, the system failure monitoring device including: a collection unit that collects configuration information regarding each of the components and request information regarding each of the requests processed by each of the components; a determination unit that determines, for each of the requests, wheth