CN-122020388-A - Abnormal event root cause positioning method and device, electronic equipment and storage medium
Abstract
The application provides a method and an apparatus for locating the root cause of an abnormal event, an electronic device, and a storage medium. The method comprises: obtaining time-series monitoring data of a target business indicator; performing anomaly detection on the time-series monitoring data to determine the time window in which an abnormal event occurs; obtaining multi-dimensional business feature data corresponding to the time window; performing causal analysis on the multi-dimensional business feature data to screen, from a plurality of dimensions, a candidate dimension subset having a potential causal association with the abnormal event; performing contribution analysis on the dimensions in the candidate dimension subset to quantify the contribution value of each dimension to the abnormal event of the target business indicator within the time window; and determining, according to the contribution values, the core root cause dimension that caused the abnormal event. Root cause locating efficiency and result reliability are thereby improved.
Inventors
- ZHOU GUOJING
Assignees
- 北京奇艺世纪科技有限公司
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-02-05
Claims (10)
- 1. A method for locating the root cause of an abnormal event, the method comprising: acquiring time-series monitoring data of a target business indicator; performing anomaly detection based on the time-series monitoring data to determine a time window in which an abnormal event occurs; acquiring multi-dimensional business feature data corresponding to the time window; performing causal analysis on the multi-dimensional business feature data to screen, from a plurality of dimensions, a candidate dimension subset having a potential causal association with the abnormal event; performing contribution analysis on the dimensions in the candidate dimension subset to quantify the contribution value of each dimension to the abnormal event of the target business indicator occurring within the time window; and determining, according to the contribution values, the core root cause dimension that causes the abnormal event.
- 2. The method of claim 1, wherein performing anomaly detection based on the time-series monitoring data to determine a time window in which an abnormal event occurs comprises: processing the time-series monitoring data using a fusion prediction model to generate a dynamic anomaly determination threshold; determining that an abnormal event has occurred when the actual value of the target business indicator deviates from the anomaly determination threshold at a preset number of consecutive monitoring points; and determining the time window corresponding to the preset number of monitoring points as the time window in which the abnormal event occurs.
- 3. The method of claim 2, wherein processing the time-series monitoring data using a fusion prediction model to generate a dynamic anomaly determination threshold comprises: fitting a macroscopic variation component of the target business indicator through a trend-period prediction sub-model; performing nonlinear fluctuation learning on the residual of the macroscopic variation component through a residual learning sub-model to obtain nonlinear fluctuation data; performing weighted fusion of the macroscopic variation component and the nonlinear fluctuation data to obtain a target predicted value; extracting change features of the time-series monitoring data and calculating an anomaly score based on the change features; and generating the anomaly determination threshold according to the target predicted value, the anomaly score, and a preset reference fluctuation parameter.
- 4. The method of claim 1, wherein performing causal analysis on the multi-dimensional business feature data to screen, from a plurality of dimensions, a candidate dimension subset having a potential causal association with the abnormal event comprises: acquiring a predefined business causal graph, wherein the business causal graph represents the causal dependencies among the dimensions as a directed acyclic graph; selecting a plurality of analysis objects from the plurality of dimensions, wherein each analysis object is a single dimension or a combination of several dimensions; calculating, for each analysis object and based on the business causal graph, the conditional probability that the abnormal event occurs when an intervention is applied to the analysis object; determining whether a causal relationship exists between the analysis object and the abnormal event based on the conditional probability and the probability that the abnormal event occurs without intervention; and adding each analysis object determined to have a causal relationship to the candidate dimension subset.
- 5. The method of claim 4, wherein selecting a plurality of analysis objects from the plurality of dimensions, each analysis object being a single dimension or a combination of several dimensions, comprises: dividing the plurality of dimensions into a plurality of dimension groups, wherein each dimension group contains one or more dimensions; calculating, for each dimension group, the information gain of the dimension group with respect to the abnormal event; sorting the dimension groups in descending order of information gain to obtain a dimension group sequence; and selecting a preset number of dimension groups from the front of the dimension group sequence as target groups, wherein each target group is one analysis object.
- 6. The method of claim 1, wherein performing contribution analysis on the dimensions in the candidate dimension subset to quantify the contribution value of each dimension to the abnormal event of the target business indicator occurring within the time window comprises: enumerating, for each target dimension in the candidate dimension subset, all possible subsets of the candidate dimension subset that do not include the target dimension; calculating, for each such subset, the marginal effect on the abnormal event when the target dimension is added to the subset; and calculating the contribution value of the target dimension to the abnormal event of the target business indicator within the time window based on all the marginal effects calculated for the target dimension.
- 7. The method of claim 1, wherein acquiring the time-series monitoring data of the target business indicator comprises: receiving streaming data from a business system in real time; performing dimension completion and real-time aggregation on the streaming data to obtain aggregated result data carrying a plurality of dimension labels, wherein each dimension label corresponds to one dimension; and synchronizing the aggregated result data to a time-series database at a predetermined time granularity to form the time-series monitoring data.
- 8. An apparatus for locating the root cause of an abnormal event, the apparatus comprising: a first acquisition module configured to acquire time-series monitoring data of a target business indicator; a detection module configured to perform anomaly detection based on the time-series monitoring data to determine a time window in which an abnormal event occurs; a second acquisition module configured to acquire multi-dimensional business feature data corresponding to the time window; a screening module configured to perform causal analysis on the multi-dimensional business feature data to screen, from a plurality of dimensions, a candidate dimension subset having a potential causal association with the abnormal event; a quantization module configured to perform contribution analysis on the dimensions in the candidate dimension subset to quantify the contribution value of each dimension to the abnormal event of the target business indicator occurring within the time window; and a determining module configured to determine, according to the contribution values, the core root cause dimension that causes the abnormal event.
- 9. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus; the memory is configured to store a computer program; and the processor is configured to implement the method for locating the root cause of an abnormal event according to any one of claims 1 to 7 when executing the program stored in the memory.
- 10. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for locating the root cause of an abnormal event according to any one of claims 1 to 7.
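The contribution analysis recited in claim 6 (enumerate every subset that excludes the target dimension, measure the marginal effect of adding the target, then combine those effects) has the same shape as a Shapley value computation. The Python sketch below is illustrative only. The claim does not state how the marginal effects are combined, so the Shapley weighting used here is an assumption, and the names `shapley_contributions`, `toy_effect`, and the three example dimensions are hypothetical.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_contributions(
    dimensions: Sequence[str],
    anomaly_effect: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Return a Shapley-style contribution value for each candidate dimension.

    `anomaly_effect(coalition)` is an assumed scoring function: how much of the
    abnormal event is explained when only `coalition` is taken into account.
    """
    n = len(dimensions)
    contributions: Dict[str, float] = {}
    for target in dimensions:
        others = [d for d in dimensions if d != target]
        total = 0.0
        # Enumerate every subset of the candidates that excludes the target.
        for size in range(n):
            # Assumed Shapley weight for a subset of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                # Marginal effect of adding the target dimension to the subset.
                marginal = anomaly_effect(coalition | {target}) - anomaly_effect(coalition)
                total += weight * marginal
        contributions[target] = total
    return contributions


def toy_effect(coalition: FrozenSet[str]) -> float:
    """Hypothetical effect function with one interaction term."""
    effect = 0.0
    if "payment_channel" in coalition:
        effect += 60.0
    if "device_type" in coalition:
        effect += 30.0
    if {"payment_channel", "device_type"} <= coalition:
        effect += 10.0  # joint effect, shared between the two dimensions
    return effect


candidates = ["payment_channel", "device_type", "region"]
values = shapley_contributions(candidates, toy_effect)
# The dimension with the largest contribution is taken as the core root cause.
core_root_cause = max(values, key=values.get)
print(values, core_root_cause)
```

Exhaustive enumeration visits 2^(n-1) subsets per dimension, which is one plausible reason for first narrowing the dimensions to a small candidate subset through causal analysis.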
Description
Abnormal event root cause positioning method and device, electronic equipment and storage medium
Technical Field
The application relates to the technical field of intelligent operation and maintenance and business monitoring, and in particular to a method and an apparatus for locating the root cause of an abnormal event, an electronic device, and a storage medium.
Background
With the rapid development of internet services, core services of various platforms, such as member subscription and online transactions, are characterized by high concurrency, many dimensions, and sensitivity to fluctuation. Keeping such key business indicators stable is critical: once abnormal fluctuation occurs, failing to locate the root cause in time may lead to lost revenue and a degraded user experience. Real-time monitoring of business indicators, together with fast and accurate root cause location when anomalies occur, has therefore become a core challenge in the field of intelligent operation and maintenance (Artificial Intelligence for IT Operations, AIOps for short).
Currently, a common practice is to use a time-series database (e.g., Prometheus) in conjunction with static thresholds for monitoring. Specifically, the system collects and stores time-series data of business indicators at a fixed time granularity, and operation and maintenance personnel set a fixed numerical range for each key indicator as the alarm threshold based on historical experience. When the indicator data collected in real time continuously exceeds the threshold range, the monitoring system triggers an alarm.
After the alarm is raised, operation and maintenance or data analysis personnel usually have to log in to the relevant systems manually, query the various dimension data related to the time period (such as user region, device type, and payment channel), and gradually infer and locate the main dimensions that may have caused the anomaly through manual comparison, screening, and elimination.
In the prior art, however, this locating process is slow and depends heavily on manual experience, which makes it difficult to meet the requirements of modern large-scale internet services for real-time, accurate, and automated monitoring.
Disclosure of Invention
The embodiments of the application aim to provide a method and an apparatus for locating the root cause of an abnormal event, an electronic device, and a storage medium, so as to solve the problem that the root cause locating process in the prior art is slow and depends heavily on manual experience, and therefore struggles to meet the requirements of modern large-scale internet services for real-time, accurate, and automated monitoring.
The specific technical solution is as follows:
In a first aspect, the application provides a method for locating the root cause of an abnormal event, comprising: acquiring time-series monitoring data of a target business indicator; performing anomaly detection based on the time-series monitoring data to determine a time window in which an abnormal event occurs; acquiring multi-dimensional business feature data corresponding to the time window; performing causal analysis on the multi-dimensional business feature data to screen, from a plurality of dimensions, a candidate dimension subset having a potential causal association with the abnormal event; performing contribution analysis on the dimensions in the candidate dimension subset to quantify the contribution value of each dimension to the abnormal event of the target business indicator occurring within the time window; and determining, according to the contribution values, the core root cause dimension that causes the abnormal event.
In one possible implementation, performing anomaly detection based on the time-series monitoring data to determine a time window in which an abnormal event occurs comprises: processing the time-series monitoring data using a fusion prediction model to generate a dynamic anomaly determination threshold; determining that an abnormal event has occurred when the actual value of the target business indicator deviates from the anomaly determination threshold at a preset number of consecutive monitoring points; and determining the time window corresponding to the preset number of monitoring points as the time window in which the abnormal event occurs.
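The consecutive-deviation rule described above can be illustrated with a minimal Python sketch. It assumes that the dynamic threshold is available as a per-point lower and upper bound and that "deviating" means falling outside that band; the text fixes neither detail. The function name `find_anomaly_window`, the parameter `min_consecutive`, and the sample values are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def find_anomaly_window(
    actual: Sequence[float],
    lower: Sequence[float],
    upper: Sequence[float],
    min_consecutive: int = 3,
) -> Optional[Tuple[int, int]]:
    """Return (start, end) indices of the first run of `min_consecutive`
    monitoring points that lie outside their threshold band, or None."""
    run = 0
    for i, (value, lo, hi) in enumerate(zip(actual, lower, upper)):
        if value < lo or value > hi:
            run += 1
            if run == min_consecutive:
                # The window spans exactly the points that triggered the rule.
                return (i - min_consecutive + 1, i)
        else:
            run = 0  # a single in-band point resets the count
    return None


# Hypothetical metric values and per-point (dynamic) threshold bands.
actual = [100, 102, 99, 70, 68, 65, 101]
lower = [90, 91, 90, 89, 90, 91, 90]
upper = [110, 111, 110, 109, 110, 111, 110]
window = find_anomaly_window(actual, lower, upper, min_consecutive=3)
print(window)
```

Requiring several consecutive out-of-band points, rather than alarming on one, trades a short detection delay for fewer alarms caused by isolated spikes.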
In one possible implementation, processing the time-series monitoring data using the fusion prediction model to generate a dynamic anomaly determination threshold comprises: fitting a macroscopic variation component of the target business indicator through a trend-period prediction sub-model; performing nonlinear fluctuation learning on the residual of the macroscopic variation component through a residual learning sub-model to obtain nonlinear fluctuation data; performing weighted fusion of the macroscopic variation