CN-121233375-B - AIOps abnormality detection and root cause positioning method

Abstract

The invention discloses an AIOps abnormality detection and root cause positioning method and relates to the technical field of abnormality detection. The method comprises: step 1, under a unified timeline, processing a preset time window per service and instance, generating monitoring-class abnormal segments from monitoring metrics, and extracting client and server spans from distributed traces for segment pairing; step 2, performing stitching with the distributed traces as a guide to obtain a candidate evidence chain set, and taking the endmost segment of each evidence chain unit as the candidate root cause direction; and step 3, outputting a root cause list for the candidate evidence chain set according to deterministic rules, together with the related template text and the time range of adjacent monitoring-class abnormal segments. The invention can accurately identify abnormality propagation paths in multi-source heterogeneous data and cross-verify upstream symptoms against downstream resource failures, thereby providing clear root cause targets and evidence-based explanations.

Inventors

WANG HONGLAN
LIU QINGYANG

Assignees

宁波三扬信息科技有限公司

Dates

Publication Date: 20260508
Application Date: 20250917

Claims (8)

1. An AIOps abnormality detection and root cause positioning method, characterized by comprising the following steps: step 1, under a unified timeline, processing a preset time window per service and instance, generating monitoring-class abnormal segments from monitoring metrics, generating log-class abnormal segments from running logs through template merging and trigger word recognition, and extracting client and server spans from distributed traces for segment pairing; step 2, performing stitching with the distributed traces as a guide: pairing client spans with server spans that share the same trace identifier to generate call segments, and recording the source service, source instance, destination service, destination instance and occurrence time; searching, on the source side and the destination side, for monitoring-class abnormal segments that overlap with the call segment or are adjacent to it by no more than one second, and for log-class abnormal segments containing the phrases new connection refused, insufficient threads, queue full or disk unwritable, and marking anchor types for the classifiable templates; marking a call segment as a stitchable call segment when a log-class abnormal segment containing a downstream-call failure or timeout trigger word exists on its source side and a capacity anchor or storage anchor exists on its destination side; starting from any stitchable call segment, generating evidence chain units by connecting segments through destination instance and time continuity; pruning, according to conflict rules, evidence chain units that cover the same source and destination services at the same time but have opposite anchor semantics, to obtain a candidate evidence chain set; and taking the endmost segment of each evidence chain unit as the candidate root cause direction; step 3, outputting a root cause list for the candidate evidence chain set according to deterministic rules: if a capacity anchor or storage anchor exists at the endmost segment of an evidence chain, taking its destination service and destination instance as the root cause target and giving the related template text and the time range of adjacent monitoring-class abnormal segments; specifically, when the anchor type at the endmost segment of an evidence chain unit is a capacity anchor, a storage anchor or an access control anchor, taking the destination service identifier and destination instance identifier of the endmost segment as the root cause target; when only a change anchor exists at the endmost segment of the evidence chain unit and the source side of its upstream adjacent segment contains a downstream-call failure or timeout trigger word, taking the destination service identifier and destination instance identifier of the endmost segment as the root cause target; when the evidence chain unit has no anchor but the destination sides of three call segments in succession are aligned to monitoring-class abnormal segments, taking the destination service identifier and destination instance identifier of the last call segment as the root cause target; and explaining each root cause target, the explanation at least comprising a trigger template text, a participation sequence, a direct sequence, a direct sequence of the adjacent source service and a source and a monitoring class, and a sequence of the source identifier.
2. The AIOps abnormality detection and root cause positioning method according to claim 1, wherein in step 1, for the set of instance identifiers under the same service identifier in the same second, the monitored values of all instances are sorted by magnitude and the median is taken as the peer-group reference; when the values of an instance over three adjacent seconds are continuously greater than twice the peer-group reference or continuously less than half the peer-group reference, the three seconds are merged into one monitoring-class abnormal segment, and the ending second of the segment is then extended for as long as the condition continues to hold.
3. The AIOps abnormality detection and root cause positioning method according to claim 2, wherein in step 1, template merging is performed on the running log data: numeric strings, hexadecimal strings, network addresses, file paths and unique identifiers are replaced with placeholders, and the remaining fixed word sequence is used as a template; when the same template appears at least ten times within any continuous ten-second interval, or the template text contains any of the trigger words failure, timeout, refusal, unreachable, exhaustion, full, insufficient permission, rollback or crash, a log-class abnormal segment is generated from the second of first appearance to the second of last appearance; span records within the current time window are retained from the distributed tracing data, and the trace identifier, span identifier, parent span identifier, service identifier, instance identifier, span type, start second, end second, status field and destination network address field are saved; and when a monitoring-class abnormal segment and a log-class abnormal segment have the same service identifier and instance identifier and overlap in time or are adjacent by no more than two seconds, a segment pairing relation is established and recorded in the abnormal segment set.
4. The AIOps abnormality detection and root cause positioning method according to claim 3, wherein in step 2, generating call segments comprises: screening records whose span type is client or server from the distributed tracing data; among records with the same trace identifier, generating a call segment for each pair in which the parent span identifier of the server record equals the span identifier of the client record, the call segment including a source service identifier, a source instance identifier, a destination service identifier, a destination instance identifier, a start second and an end second; also generating a call segment when a client record and a server record are not paired in this way but have the same trace identifier, the difference between the start second and the end second is not more than one second, and the destination network address fields are identical; and de-duplicating multiple call segments that have the same source service identifier, source instance identifier, destination service identifier and destination instance identifier within the same second, retaining the one with the earliest start second.
5. The AIOps abnormality detection and root cause positioning method according to claim 4, wherein in step 2, an anchor segment is generated by identifying the anchor type from the template text in a log-class abnormal segment, wherein new connection refused, certificate invalid, insufficient permission and key expired are classified as access control anchors; insufficient threads, queue full, connection pool exhausted and file handle exhaustion are classified as capacity anchors; disk unwritable, directory not present, device read-only and verification failure are classified as storage anchors; and configuration taking effect, restart completed and rollback succeeded are classified as change anchors; an alignment result is formed by searching, on the source instance identifier and the destination instance identifier respectively, for monitoring-class abnormal segments and anchor segments that overlap with the interval from the start second to the end second of the call segment or are adjacent to it by no more than one second; and when a log-class abnormal segment containing a downstream-call failure, timeout or retry trigger word exists on the source side and a capacity anchor or storage anchor exists on the destination side, the call segment is marked as a stitchable call segment.
6. The AIOps abnormality detection and root cause positioning method according to claim 5, wherein in step 2, messaging-system-related keys are searched in the span attributes of the distributed tracing data; when the target name of a sender span and the target name of a receiver span are consistent and their start seconds differ by no more than one second, an intermediary call segment with the target name as its intermediary identifier is generated and aligned according to the same rules as a call segment; meanwhile, log-class abnormal segments containing the keywords published to queue, consumed from queue, topic, partition or offset are searched in the running log data, and the queue or topic name is extracted as an intermediary identifier; and when this intermediary identifier is consistent with that of the intermediary call segment and the two overlap in time or are adjacent by no more than one second, the intermediary call segment is marked as a stitchable call segment.
7. The AIOps abnormality detection and root cause positioning method according to claim 6, wherein in step 2, generating an evidence chain unit comprises: taking any stitchable call segment as the starting point and expanding from early to late; at the destination instance identifier of the current call segment, searching for a next stitchable call segment that has that instance as its source instance identifier and whose start second is the same as, or adjacent by no more than one second to, the end second of the current call segment, and appending it; stopping the expansion if the appended call segment has neither an anchor segment nor a monitoring-class abnormal segment on its destination side, or if the difference between its start second and the end second of the preceding segment exceeds one second; and forming the evidence chain unit jointly from the resulting sequence of consecutive call segments and the aligned monitoring-class abnormal segments and anchor segments.
8. The AIOps abnormality detection and root cause positioning method according to claim 7, wherein in step 2, when two evidence chain units cover the same source service identifier and destination service identifier in the same second but their anchor types are opposite in semantics, the evidence chain unit containing an access control anchor, capacity anchor or storage anchor is retained and the other evidence chain unit is deleted; when any call segment in an evidence chain unit is not aligned to a monitoring-class abnormal segment or anchor segment on the source side or the destination side, that call segment and all subsequent segments are deleted; the remainder is fixed as the candidate evidence chain set; the destination instance identifier of the endmost segment of each evidence chain unit and the anchor type on that segment are output as the root cause candidate; and if the endmost segment has no anchor segment but has a monitoring-class abnormal segment, the destination instance identifier and the monitoring-class abnormal segment are output as the root cause candidate.
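
As an illustration of the peer-group median rule recited in claim 2, the following Python sketch flags instances whose per-second value stays above twice, or below half, the median of their service peers for at least three consecutive seconds. The input layout (`samples[second][instance_id]`), the function name and the returned tuple shape are assumptions made for this example and do not come from the patent. The sketch also assumes that a run must stay on one side of the reference (all high or all low). It is a minimal sketch under those assumptions, not the claimed implementation.

```python
from statistics import median
from typing import Dict, List, Tuple

# Hypothetical layout: samples[second][instance_id] -> metric value,
# covering all instances of ONE service within the time window.
Samples = Dict[int, Dict[str, float]]
Segment = Tuple[str, int, int, str]  # (instance_id, start_second, end_second, direction)


def monitoring_abnormal_segments(samples: Samples, min_len: int = 3) -> List[Segment]:
    """Find runs of >= min_len consecutive seconds where an instance deviates from its peers."""
    # Step 1: per second, compare every instance with the peer-group median.
    flags: Dict[str, Dict[int, str]] = {}
    for second, by_instance in samples.items():
        reference = median(by_instance.values())
        for instance, value in by_instance.items():
            if value > 2 * reference:
                flags.setdefault(instance, {})[second] = "high"
            elif value < reference / 2:
                flags.setdefault(instance, {})[second] = "low"

    # Step 2: merge consecutive flagged seconds with the same direction into segments.
    segments: List[Segment] = []
    for instance, by_second in sorted(flags.items()):
        run_start = prev = direction = None
        for second in sorted(by_second):
            current = by_second[second]
            if prev is not None and second == prev + 1 and current == direction:
                prev = second  # the condition still holds, so extend the ending second
                continue
            if prev is not None and prev - run_start + 1 >= min_len:
                segments.append((instance, run_start, prev, direction))
            run_start, prev, direction = second, second, current
        if prev is not None and prev - run_start + 1 >= min_len:
            segments.append((instance, run_start, prev, direction))
    return segments
```

A run shorter than three seconds produces no segment, and a longer run yields a single segment whose ending second is the last second on which the condition held.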

Description

AIOps abnormality detection and root cause positioning method

Technical Field

The invention relates to the technical field of abnormality detection, and in particular to an AIOps abnormality detection and root cause positioning method.

Background

With the wide adoption of cloud computing, large-scale distributed systems and microservice architectures, application systems and infrastructure generate large volumes of monitoring metrics, running logs and distributed tracing data at run time. This data contains key information such as system state, request paths, performance metrics and abnormal phenomena, and is widely used for operations monitoring and troubleshooting. The industry generally relies on a variety of automated monitoring and analysis tools to process such data in an attempt to automate abnormality detection and root cause positioning. However, existing methods still have obvious shortcomings and struggle to meet the need for fast, accurate fault diagnosis in complex service scenarios.

In the prior art, abnormalities in monitoring metrics are usually detected with threshold alarms or statistical methods. For example, many systems configure a static threshold for a monitoring metric and trigger an alarm when the value leaves the preset range. Although simple to implement, this approach is prone to false alarms and missed alarms in dynamically changing environments, and performs especially poorly under severe load fluctuation or in multi-tenant scenarios. To improve on this, some techniques introduce a time series prediction model, forecast the future trend of a metric, and judge abnormality from the residuals.
However, this approach still focuses on judging abnormality in a single metric series, lacks holistic correlation analysis across metrics and instances, and has difficulty revealing fault propagation links at the system level.

Prior-art log analysis techniques focus mainly on pattern matching and keyword recognition. On the one hand, existing log analysis methods often rely on regular expressions or fixed templates for classification, but system upgrades and application iterations frequently change the log format, so methods based on static rules are costly to maintain and struggle to remain accurate across diverse logs. On the other hand, although studies have proposed using word vectors or deep learning models to represent log semantics and improve automatic recognition, these methods often cannot be integrated effectively where multi-source log data coexist, which makes it difficult to form unified abnormal segments and increases positioning complexity.

Disclosure of Invention

In view of this, the invention provides an AIOps abnormality detection and root cause positioning method that processes monitoring metrics, running logs and distributed tracing data cooperatively under a unified timeline: it first generates and aligns monitoring-class abnormal segments and log-class abnormal segments, then generates call segments guided by client spans and server spans, forms evidence chain units through anchor type labeling and time continuity constraints, and finally performs conflict pruning within the candidate evidence chain set and outputs a root cause list.
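
The call segment generation mentioned above, in which client spans and server spans are paired, might be sketched in Python as follows. The `Span` and `CallSegment` types and the `pair_spans` function are hypothetical names introduced for illustration. The sketch covers only the parent-identifier match described in claim 4 (the server record's parent span identifier equals the client record's span identifier within the same trace) and assumes the segment takes its start and end seconds from the client span, which the source does not specify.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Span:  # hypothetical record shape, modeled on the fields listed in claim 3
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    service: str
    instance: str
    kind: str  # "client" or "server"
    start_second: int
    end_second: int


@dataclass(frozen=True)
class CallSegment:  # hypothetical name for one paired client/server call
    src_service: str
    src_instance: str
    dst_service: str
    dst_instance: str
    start_second: int
    end_second: int


def pair_spans(spans: List[Span]) -> List[CallSegment]:
    """Pair each server span with the client span that is its parent in the same trace."""
    # Index client spans by (trace_id, span_id) so a server span can look up its caller.
    clients: Dict[Tuple[str, str], Span] = {
        (s.trace_id, s.span_id): s for s in spans if s.kind == "client"
    }
    segments: List[CallSegment] = []
    for server in spans:
        if server.kind != "server" or server.parent_span_id is None:
            continue
        client = clients.get((server.trace_id, server.parent_span_id))
        if client is None:
            continue  # unpaired; the address-based fallback of claim 4 is not sketched here
        segments.append(CallSegment(
            client.service, client.instance, server.service, server.instance,
            # Assumption: use the client's view of when the call started and ended.
            client.start_second, client.end_second,
        ))
    return sorted(segments, key=lambda g: (g.start_second, g.src_instance, g.dst_instance))
```

Keying the lookup on both the trace identifier and the span identifier keeps spans from different traces from being paired even if their span identifiers happen to collide.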
The method can accurately identify abnormality propagation paths in multi-source heterogeneous data, cross-verify upstream symptoms against downstream resource failures, and provide clear root cause targets and evidence-based explanations, effectively reducing false alarms and missed alarms, improving the accuracy and determinism of abnormality positioning, significantly shortening diagnosis time, and raising the level of automation and intelligence in operating complex distributed systems.

The technical scheme adopted by the invention is as follows: an AIOps abnormality detection and root cause positioning method, comprising: step 1, under a unified timeline, processing a preset time window per service and instance, generating monitoring-class abnormal segments from monitoring metrics, generating log-class abnormal segments from running logs through template merging and trigger word recognition, and extracting client and server spans from distributed traces for segment pairing; step 2, performing stitching with the distributed traces as a guide: pairing client spans with server spans that share the same trace identifier to generate call segments, and recording the source service, source instance, destination service, destination instance and occurrence time; searching, on the source side and the destination side, for monitoring-class abnormal segments that overlap with the call segment or are adjacent to it by no more than