CN-115577275-A - Time sequence data anomaly monitoring system and method based on LOF and isolated forest

Abstract

The invention provides a time sequence data anomaly monitoring system and method based on LOF and isolated forest, and relates to the technical field of data processing. The system comprises: a data acquisition module for acquiring time sequence data; a data preprocessing module for preprocessing the time sequence data; an LOF anomaly score acquisition module for dividing the time sequence data into a plurality of subsets, adaptively adjusting the window length and the K neighbor distance, establishing an LOF model, and calculating the anomaly score LOF_K(p); an isolated forest anomaly score acquisition module for establishing an isolated forest model, inputting the time sequence data into the isolated forest model, and calculating the anomaly score s(p, n); and a fusion module for fusing the anomaly score LOF_K(p) and the anomaly score s(p, n) and obtaining a final anomaly value based on hierarchical clustering. The invention combines the isolated forest's advantages of higher accuracy on group anomalies and fast execution on massive data with LOF's advantage of high accuracy on single-point anomalies and contextual anomalies, further improves LOF, and performs markedly better on time sequence data than other algorithms.

Inventors

CHAN YIK KEUNG
LI YANING
YANG XIAODONG
Pan Zixing

Assignees

INSTITUTE OF COMPUTING SHANDONG INDUSTRIAL TECH RESEARCH INSTITUTE

Dates

Publication Date: 20230106
Application Date: 20221111
Priority Date: 20221111

Claims (10)

1. A time sequence data anomaly monitoring system based on LOF and isolated forest, characterized by comprising: a data acquisition module for acquiring time sequence data; a data preprocessing module for preprocessing the time sequence data to obtain preprocessed time sequence data; an LOF anomaly score acquisition module for dividing the preprocessed time sequence data into a plurality of subsets using a window, using the standard deviation to reflect the degree of dispersion of the data in the different subsets, adaptively adjusting the window length and the K neighbor distance, establishing an LOF model, calculating the anomaly score LOF_K(p), and sending LOF_K(p) to a fusion module; an isolated forest anomaly score acquisition module for establishing an isolated forest model, inputting the preprocessed time sequence data into the isolated forest model, calculating the anomaly score s(p, n), and sending s(p, n) to the fusion module; and a fusion module for fusing the anomaly score LOF_K(p) and the anomaly score s(p, n), and adjusting the confidence of the anomaly score LOF_K(p) and of the anomaly score s(p, n) based on hierarchical clustering to obtain a final anomaly value.
2. The time sequence data anomaly monitoring system based on LOF and isolated forest of claim 1, wherein the LOF anomaly score acquisition module adaptively adjusts the window length and the K neighbor distance through the following steps: step 1: presetting the K neighbor distance and the fixed window size; step 2: segmenting the time sequence data, cutting a first window of the fixed window size from the time sequence data, and provisionally cutting a further segment of the fixed window size after it to obtain a second window; step 3: calculating the standard deviation of the subset in the second window and the standard deviation of the subset in the first window; step 4: comparing the standard deviation of the subset in the second window with that of the subset in the first window: when the standard deviation of the subset in the second window is smaller than that of the first window, increasing the length of the second window and increasing the K neighbor distance; when the standard deviation of the subset in the second window is larger than that of the first window, reducing the length of the second window and reducing the K neighbor distance; step 5: repeating steps 1 to 4 until the entire time sequence data set has been segmented.
3. The system of claim 2, wherein when the standard deviation of the subset in the current window is smaller than that of the previous window, the time sequence data in the current window fluctuates less, the probability of outliers appearing is correspondingly lower, and a larger neighborhood distance is used to measure the density in the neighborhood; when the standard deviation of the subset in the current window is larger than that of the previous window, the time sequence data in the current window fluctuates more, the probability of outliers appearing is correspondingly higher, and a smaller neighborhood distance is used to measure the density in the neighborhood.
4. The time sequence data anomaly monitoring system based on LOF and isolated forest of claim 2, wherein the window length and the K neighbor distance are expressed as follows: the expression for the window length is: the expression for the K neighbor distance is: wherein K_g is the preset K neighbor distance; w_g is the preset fixed window size; Q represents a dimension; std_{i,j} is the standard deviation of the time sequence data in the current window; and std_{i+1,j} is the standard deviation of the time sequence data in the next window.
5. The time sequence data anomaly monitoring system based on LOF and isolated forest of claim 2, wherein the standard deviation is calculated as: where W is the length of the current window, x_t is a sample point in the subset, i ∈ m, and E(s_i) is the mean of the subset in the window.
6. The time sequence data anomaly monitoring system based on LOF and isolated forest of claim 4, wherein the LOF anomaly score acquisition module establishes the LOF model and calculates the anomaly score LOF_K(p) through the following specific steps: let points p and o be points in the i-th subset s_i, where o is the K-th nearest point to p and o ∈ s_i; then: step 1, calculating the window length W; step 2, calculating the K neighbor distance K; step 3, calculating the Euclidean distance between points p and o, where Q is taken in turn from the dimensions {1, 2, …, n} so that the distance is calculated over the different dimensions of the points; step 4, calculating the K neighbor distance d_K(p) = d(p, o); step 5, finding the K-distance neighborhood N_K(p) = {d(p, o_1) ≤ d(p, o)}, where o_1 denotes all points, centered on p, lying within the K neighbor distance; step 6, calculating the reachable distance of points p and o: d_K(p, o) = max{d_K(p), d(p, o)}; step 7, calculating the local reachable density of points p and o; step 8, calculating the local outlier factor of points p and o; the obtained LOF_K(p) is the outlier score of sample point p.
7. The time sequence data anomaly monitoring system based on LOF and isolated forest of claim 1, wherein the isolated forest anomaly score acquisition module establishes the isolated forest model and calculates the anomaly score s(p, n) through the following specific steps: step 1: extracting samples from the time sequence data, randomly selecting a feature of a certain dimension, and constructing a decision tree; step 2: calculating the distance h(p) from the sample point to the root and the average value E(h(p)) of h(p); step 3: calculating the average path length l(n) of the decision tree; step 4: calculating the anomaly score of the sample point p based on the average path length l(n) and the average value E(h(p)) of h(p); step 5: extracting samples and features multiple times, constructing multiple decision trees, and repeating steps 1 to 4 to obtain the anomaly score s(p, n) of the isolated forest model.
8. The time sequence data anomaly monitoring system based on LOF and isolated forest of claim 1, wherein the fusion module fuses the anomaly score LOF_K(p) and the anomaly score s(p, n) and adjusts the confidence of the anomaly score LOF_K(p) and of the anomaly score s(p, n) based on hierarchical clustering to obtain the final anomaly value through the following specific steps: normalizing the anomaly score LOF_K(p) and the anomaly score s(p, n); clustering the data under each window using hierarchical clustering and calculating the inter-cluster distance; setting an inter-cluster distance threshold; if the inter-cluster distance is larger than the inter-cluster distance threshold and the amount of data in each cluster is greater than or equal to 2, increasing the confidence of the isolated forest detector; otherwise, increasing the confidence of the LOF detector.
9. The time sequence data anomaly monitoring system based on LOF and isolated forest of claim 8, wherein data exceeding the inter-cluster distance threshold is screened out as anomalous data.
10. A time sequence data anomaly monitoring method based on LOF and isolated forest, characterized by comprising the following steps: acquiring time sequence data and preprocessing the time sequence data; dividing the preprocessed time sequence data into a plurality of subsets using a window, using the standard deviation to reflect the degree of dispersion of the data in the different subsets, adaptively adjusting the window length and the K neighbor distance, establishing an LOF model, and calculating the anomaly score LOF_K(p); establishing an isolated forest model, inputting the preprocessed time sequence data into the isolated forest model, and calculating the anomaly score s(p, n); and fusing the anomaly score LOF_K(p) and the anomaly score s(p, n), and adjusting the confidence of the anomaly score LOF_K(p) and of the anomaly score s(p, n) based on hierarchical clustering to obtain a final anomaly value.
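
As an illustration of the LOF steps listed in claim 6, the following is a minimal sketch of the standard local outlier factor with a fixed K. It is not the adaptive variant claimed above: the formulas for the window length, the adaptive K and the density are not reproduced in this text, so the sketch assumes the textbook definitions, in which the reachable distance uses the K-distance of the neighbor o. The function names `euclidean`, `k_neighbors` and `lof_scores` and the sample data are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_neighbors(points, i, k):
    """Return (K-distance, neighbor indices) of point i; ties at the K-distance are kept."""
    dists = sorted((euclidean(points[i], q), j) for j, q in enumerate(points) if j != i)
    k_dist = dists[k - 1][0]
    return k_dist, [j for d, j in dists if d <= k_dist]


def lof_scores(points, k):
    """Standard LOF for every point (assumption: fewer than k + 1 exact duplicates)."""
    n = len(points)
    info = [k_neighbors(points, i, k) for i in range(n)]

    def reach_dist(i, j):
        # Reachable distance of i with respect to neighbor j (textbook form).
        return max(info[j][0], euclidean(points[i], points[j]))

    # Local reachable density: inverse of the mean reachable distance to the neighbors.
    lrd = []
    for i in range(n):
        neigh = info[i][1]
        lrd.append(len(neigh) / sum(reach_dist(i, j) for j in neigh))

    # LOF: mean density of the neighbors divided by the point's own density.
    return [sum(lrd[j] for j in info[i][1]) / (len(info[i][1]) * lrd[i]) for i in range(n)]


# Hypothetical one-dimensional window: ten evenly spaced values and one far-away value.
points = [(float(i),) for i in range(10)] + [(30.0,)]
scores = lof_scores(points, k=3)
print(scores[10], max(scores[:10]))
```

Scores close to 1 indicate a point whose density matches that of its neighbors; the far-away value receives a score well above 1. The pairwise distance computation is what makes this approach quadratic in the number of points.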

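The isolated forest steps of claim 7 can be sketched in the same spirit. The claim does not reproduce its formulas in this text, so the sketch below assumes the usual isolation forest definitions: l(n) = 2H(n-1) - 2(n-1)/n, with H approximated by the natural logarithm plus Euler's constant, and s(p, n) = 2^(-E(h(p))/l(n)). All function names, the sample size, the tree count and the sample series are hypothetical choices for illustration.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def average_path_length(n):
    """l(n): expected path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    # H(n-1) is approximated by ln(n-1) + Euler's constant.
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random dimension at a random value within its range."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # sketch simplification: stop if the chosen dimension is constant
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {
        "dim": dim,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(tree, point, depth=0):
    """h(p): depth at which the point lands, plus l(size) for leaves that were not split further."""
    if "size" in tree:
        return depth + average_path_length(tree["size"])
    branch = "left" if point[tree["dim"]] < tree["split"] else "right"
    return path_length(tree[branch], point, depth + 1)


def build_forest(points, n_trees=100, sample_size=32, seed=0):
    """Build n_trees trees, each on a random subsample (assumes at least two points)."""
    rng = random.Random(seed)
    size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(size))
    forest = [build_tree(rng.sample(points, size), 0, max_depth, rng) for _ in range(n_trees)]
    return forest, size


def anomaly_score(forest, size, point):
    """s(p, n) = 2 ** (-E(h(p)) / l(n)); values closer to 1 suggest an anomaly."""
    mean_depth = sum(path_length(t, point) for t in forest) / len(forest)
    return 2.0 ** (-mean_depth / average_path_length(size))


# Hypothetical series: 60 distinct values in [10, 12) plus one far-away value.
series = [(10.0 + ((i * 7) % 60) / 30.0,) for i in range(60)] + [(25.0,)]
forest, size = build_forest(series)
print(anomaly_score(forest, size, (25.0,)), anomaly_score(forest, size, (11.0,)))
```

Because each tree only sees a small subsample and splits are random, the cost grows roughly linearly with the data size, at the price of some run-to-run variation in the scores.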
Description

Time sequence data anomaly monitoring system and method based on LOF and isolated forest

Technical Field

The invention belongs to the technical field of data processing, and particularly relates to a time sequence data anomaly monitoring system and method based on LOF and isolated forest.

Background

The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.

Monitoring outliers is important in engineering. For time sequence data such as blood pressure and flow, outliers disturb the distribution of the data, and accurately monitoring anomalous data can avert a subsequent chain of catastrophic accidents. Anomaly monitoring of time sequence data is therefore of great significance in the field of data mining. For common time sequence data, the exact definition of an outlier differs from case to case, so existing detection methods do not behave stably on this type of task.

At present, mainstream data anomaly monitoring methods include statistics-based methods, clustering-based methods, density-based methods, isolated-forest-based methods, and methods that fuse LOF with isolated forest. The inventors found that the prior art has at least the following defects:

1. Statistics-based methods: such methods compute the probability of each object by building a probability distribution model, and objects with low probability are usually regarded as outliers. However, these methods depend heavily on the choice of model, and different models suit different tasks; if the wrong model is selected, the detection is likely to misjudge many points as anomalous.

2. Clustering-based methods: such methods gather data into different clusters based on the feature distribution of the data, and outliers usually lie far from the cluster centers. The difficulty lies in selecting the number of clusters: each clustering model only suits a specific data type, and different cluster counts produce completely different results.

3. Density-based methods: the typical example is the LOF algorithm, which determines density by defining the distance between points and calculates an outlier factor from the density to reflect the degree of anomaly. The method is highly accurate, but its time complexity is O(N^2), which is inefficient for large volumes of data. In addition, when a group anomaly is encountered, the density of the region in which the group anomaly lies is high, so the anomaly scores in that region are biased.

4. Isolated-forest-based methods: the isolated forest algorithm has near-linear time complexity, is very efficient compared with other methods, can quickly process massive data, suits high-dimensional data, and gives reasonable predictions when anomalous clusters are encountered. However, because the splitting process randomly selects dimensions, the accuracy of the detection result is reduced.

5. Methods that fuse isolated forest with LOF: most existing fusion methods simply integrate the two detectors. Common integration methods include:

(1) Using the detectors in layers: the isolated forest detector first performs a coarse-grained screening of the data, and the more reliable LOF detector then screens out ambiguous anomalies. However, this algorithm does not consider the temporal correlation of the data and does not perform well on time sequence data.

(2) Applying the idea of Boosting to the two detectors: a higher confidence is given to the more reliable LOF detector, a lower confidence is given to the unstable isolated forest detector, and the weighted anomaly scores are added. However, this algorithm merely performs a simple weighted integration and does not fully exploit the advantages of the two detectors; the higher-confidence LOF detector still performs poorly on group anomalies.

In summary, existing data anomaly monitoring methods usually target data with stable fluctuation, obvious anomalies, and a large proportion of single-point anomalies. On data whose anomaly patterns are complex and which contain many group anomalies and contextual anomalies, existing methods are unstable, struggle to recall a large share of the anomalous data, and struggle to ensure accuracy.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a time sequence data anomaly monitoring system and method based on LOF and isolated forest, which effectively combine the advantages that the isolated forest is more accurate to group abnormit