CN-116151361-B - Unsupervised concept drift detection method based on stack self-encoder and Page-Hinckley test
Abstract
The invention discloses an unsupervised concept drift detection method based on a stack self-encoder and the Page-Hinckley test. It aims to monitor concept drift in dynamic data streams so that the downstream decision model can be adjusted in time and generalizes better to new data. The method comprises four steps. (1) Data window partitioning: individual instances in the data stream are organized into different data windows to support the subsequent density estimation of the data distribution within each window. (2) Training a stack self-encoder that characterizes the distribution: hidden statistical features characterizing the data distribution are extracted, realizing an indirect density estimation of the window data. (3) Distribution difference measurement: the reconstruction error of the stack self-encoder is used as the measure of the distribution difference between window data. (4) Adaptive threshold setting: a dynamic threshold is established through the Page-Hinckley test; following the idea of hypothesis testing, drift is reported when the reconstruction error exceeds the threshold, and the downstream decision model is quickly adjusted at the same time.
Inventors
- ZHAO YUNLONG
- ZHAN SHU
Assignees
- 南京航空航天大学
Dates
- Publication Date
- 20260512
- Application Date
- 20230222
Claims (5)
- 1. An unsupervised concept drift detection method based on a stack self-encoder and Page-Hinckley test, characterized by comprising the following steps: first, data window partitioning, in which individual instances in a data stream are organized into different data windows to support the subsequent density estimation of the data distribution within the windows; second, training a stack self-encoder that characterizes the distribution, which learns the underlying distribution of the data and extracts hidden statistical features characterizing the data distribution, thereby realizing an indirect density estimation of the window data; third, distribution difference measurement, in which the reconstruction error of the stack self-encoder is used as the measure of the distribution difference between window data; and fourth, adaptive threshold setting, in which a dynamic threshold is established through the Page-Hinckley test, drift is reported when the reconstruction error exceeds the threshold following the idea of hypothesis testing, and the downstream decision model is quickly adjusted at the same time.
- 2. The unsupervised concept drift detection method based on a stack self-encoder and Page-Hinckley test as claimed in claim 1, wherein the data window partitioning uses three windows: a reference window S_ref, a current window S_cur, and a distribution characterization window S_single. The data stream is represented as stream = {X_1, X_2, ..., X_n}, where X_i ∈ R^P is the feature vector in the P-dimensional feature space at time i. Assuming that the most recent concept drift occurred at time t_c, the reference window S_ref contains the first w_1 instances that arrive after that drift, w_1 being the window capacity; when concept drift occurs again, the reference window is updated accordingly. The distribution characterization window S_single contains the minimum data set used to characterize the data distribution when training the self-encoder; its capacity is w_2, and it slides continuously over the reference window to form the training set of the stack self-encoder. The current window S_cur always contains the latest w_2 instances; it is initialized to the w_2 instances that follow the reference window and slides continuously in the arrival direction of new data instances under a first-in-first-out (FIFO) mechanism. The reference window represents the reference data distribution, and the current window represents the latest data distribution.
- 3. The unsupervised concept drift detection method based on a stack self-encoder and Page-Hinckley test as claimed in claim 1, wherein the stack self-encoder characterizing the distribution is trained using the data in the reference window S_ref itself as the supervisory signal. After S_ref is filled, the distribution characterization window S_single is initialized to the first w_2 instances in S_ref and slides over the reference window in the data arrival direction under a FIFO mechanism with a sliding step of L. The training set of the stack self-encoder is Train: the data set held in S_single at each slide position i is flattened into a one-dimensional vector and added to Train. The stack self-encoder consists of an encoder and a decoder, which together have L+1 layers; the input dimension and the output dimension are both D. The value of the j-th neuron of the l-th layer is calculated as a_j^(l) = σ(Σ_{k=1}^{N_{l-1}} w_jk^(l) · a_k^(l-1) + b_j^(l)), where T_n is a data element in Train used as the input of the self-encoder, a_j^(0) is the j-th component of the input vector, N_{l-1} is the number of neurons in layer l-1, w_jk^(l) is the weight between layer l-1 and layer l, b_j^(l) is the bias of layer l, and σ(z) is the activation function. LeakyReLU is selected as the activation function: σ(z) = max(0, z) + leak * min(0, z), where leak is a constant. The training goal of the stack self-encoder is to make the network output as close as possible to the network input, i.e., to minimize the loss function Dist(T_n), also known as the reconstruction error. Training of the self-encoder is guided by the back-propagation algorithm and a gradient descent optimization algorithm, which update the weights and biases using the learning rate η over a batch of size B; the current data batch is {T_t, T_{t+1}, ..., T_{t+B-1}}, because training is typically done on a batch of data rather than on a single instance.
- 4. The unsupervised concept drift detection method based on a stack self-encoder and Page-Hinckley test as claimed in claim 1, wherein the distribution difference measurement relies on the data dependence of the self-encoder: the trained self-encoder is considered to correspond one-to-one with the reference data distribution represented by the reference window. The current window S_cur slides continuously in the data arrival direction with a step of L under a FIFO mechanism so that it always contains the latest w_2 instances. The data in S_cur are likewise flattened into a one-dimensional vector and input into the stack self-encoder, which attempts to fit the current data distribution and yields a reconstruction error. If the data in the current window and the data in the reference window follow the same distribution, the statistical information characterizing the data distribution is correctly extracted as the outputs of the hidden-layer neurons of the encoder, the decoder decodes this hidden statistical information into data essentially consistent with the original data, and the reconstruction error fluctuates around a small value; otherwise, the reconstruction error shows a large upward jump.
- 5. The unsupervised concept drift detection method based on a stack self-encoder and Page-Hinckley test as claimed in claim 1, wherein the Page-Hinckley test (PH Test) is used to dynamically calculate the adaptive threshold. The dynamic threshold setting combines the characteristics of historical data to update the current threshold, thereby overcoming the shortcomings of a static threshold in terms of adaptability to different scenarios, threshold configuration, and maintenance cost. The PH Test was originally designed to detect abrupt changes in the mean of a Gaussian signal and was later used for online change detection in signal processing. Here the PH Test is used to judge whether drift has occurred by observing how the reconstruction error of the current data block deviates from its overall pattern: if concept drift occurs, the self-encoder can no longer recognize the current distribution state and restore the original data, so the reconstruction error increases, and a reconstruction error greater than the threshold α_t at the current time indicates that data drift has occurred. The PH Test defines p_i as the observation of the random variable p at time i and maintains the average of all observations before time t; m_t is a cumulative variable that stores the accumulated difference between this average and the previous observations, where δ is a non-negative real number close to 0 representing the maximum allowed change. Whether the mean of the variable has suddenly increased is detected by continuously observing the difference between m_t and M_t, where M_t = max{m_1, m_2, ..., m_t}; when the difference is greater than the threshold α_t, the PH Test reports the change. The threshold α_t depends on a hyperparameter λ, so that α_t at each moment adjusts itself according to the historical observations. A warning threshold and a drift threshold are defined respectively. Reaching the warning threshold means that there are signs of a change in the data distribution, requiring a new model M_new to be trained incrementally on the data samples that arrive afterwards and held in reserve; reaching the drift threshold marks the occurrence of drift, whereupon the old model is replaced with M_new to accommodate the new data after the drift.
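The window flattening of claims 2 and 3 and the threshold test of claim 5 can be illustrated with a minimal Python sketch. It is not the claimed implementation. The names build_training_set, PageHinckley, and warn_ratio are hypothetical. Because the formulas for α_t and for the warning and drift thresholds are not reproduced in the text above, the sketch assumes α_t = λ × (running mean of the observations) and a warning threshold equal to a fixed fraction of α_t. It also assumes that the running mean includes the current observation and that m_t accumulates (mean − p_i + δ), so that an upward jump in the reconstruction error pushes m_t below its running maximum M_t.

```python
from dataclasses import dataclass


def build_training_set(s_ref, w2, step):
    """Slide a window of w2 instances over s_ref with the given step and
    flatten each window position into one 1-D vector (claims 2 and 3)."""
    train = []
    for start in range(0, len(s_ref) - w2 + 1, step):
        window = s_ref[start:start + w2]
        train.append([value for instance in window for value in instance])
    return train


@dataclass
class PageHinckley:
    """Page-Hinckley test for an upward shift in reconstruction errors."""
    delta: float = 0.005       # tolerated change (delta in claim 5)
    lam: float = 3.0           # ASSUMPTION: alpha_t = lam * running mean
    warn_ratio: float = 0.5    # ASSUMPTION: warning level = warn_ratio * alpha_t
    n: int = 0                 # number of observations seen
    mean: float = 0.0          # running mean of the observations
    m_t: float = 0.0           # cumulative variable m_t
    M_t: float = float("-inf") # running maximum of m_1 .. m_t

    def update(self, p):
        """Consume one reconstruction error; return 'stable', 'warning' or 'drift'."""
        self.n += 1
        self.mean += (p - self.mean) / self.n   # mean includes p (assumption)
        self.m_t += self.mean - p + self.delta  # assumed sign convention
        self.M_t = max(self.M_t, self.m_t)
        alpha_t = self.lam * self.mean          # assumed adaptive threshold
        gap = self.M_t - self.m_t
        if gap > alpha_t:
            return "drift"
        if gap > self.warn_ratio * alpha_t:
            return "warning"
        return "stable"


# Four 2-dimensional instances, window capacity 2, sliding step 1.
train = build_training_set([[0, 1], [2, 3], [4, 5], [6, 7]], w2=2, step=1)
print(train)

# Fifty small reconstruction errors followed by two larger ones.
detector = PageHinckley()
statuses = [detector.update(p) for p in [0.10] * 50 + [0.30] * 2]
print(statuses[-3:])
```

With these assumed settings the first fifty errors are reported as stable, the first larger error raises a warning, and the second is reported as drift. In a full pipeline the detector state would presumably be reset and the reference window refilled once drift is reported, as claim 2 describes for the reference window.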
Description
Unsupervised concept drift detection method based on stack self-encoder and Page-Hinckley test
Technical Field
The invention belongs to the technical field of dynamic data stream mining and relates to an unsupervised concept drift detection method, in particular to an unsupervised concept drift detection method based on a stack self-encoder and the Page-Hinckley test.
Background
In recent decades, big data and the Internet of Things have gradually permeated every field of society, involving more and more sensors and systems. These sensors and systems continuously generate large amounts of data in different formats, which are ultimately transmitted to terminals as data streams for real-time online analysis and processing. Because data streams arrive continuously at high speed, are large in volume, and have complex structure and types, data stream mining methods are usually constrained by memory and running time, and the same data instance cannot be accessed an arbitrary number of times. How to process such data in real time with efficient data stream mining techniques under limited memory and computing resources, so as to obtain useful information of potential value, has attracted great interest.
Conventional learning methods, including many online learning methods, typically assume that data are generated in a static environment, i.e., that the training data and the test data are independent and identically distributed. However, data in real-world environments behave dynamically. The underlying distribution of the data does not stay constant as products, markets, and customer behavior change, which means that the statistical properties of the target variables that machine learning models attempt to predict can change unpredictably over time, a phenomenon known as concept drift.
For example, in a spam email classification system, as the accuracy of the spam classifier increases, spammers may modify their strategy in an attempt to deceive the classifier. As important decision tools, machine learning systems must be able to detect and adapt to such changes in the learning environment.
Concept drift is defined as follows. Given a time period [0, t], let D_{0,t} = {d_0, d_1, d_2, ..., d_t}, where d_i = (X_i, y_i) is the data instance arriving at time i, and the time intervals between adjacent instances are not necessarily the same. X_i ∈ R^P is the feature vector in the P-dimensional feature space at time i, and y_i is the label corresponding to X_i. D_{0,t} follows the distribution F_{0,t}(X, y). When F_{0,t}(X, y) ≠ F_{t,∞}(X, y), concept drift is considered to occur at time t+1.
Current stream data mining methods capable of adapting to concept drift can be divided into active adaptation and passive adaptation. The idea of active adaptation is to design a drift detection method that continuously monitors whether the concept has changed and to adjust the model to the evolving environment only when a change is detected. This approach is more efficient as a strategy and more widely applicable. The drift detection methods developed so far fall broadly into supervised, semi-supervised, and unsupervised methods. Supervised methods assume that the labels of all future and historical data are immediately available; most of them judge whether drift has occurred by tracking the output of the decision model and raise a drift alarm when the performance of the decision model degrades. Unsupervised methods assume that no labels are available; they directly track the source of drift, namely distribution drift, and judge whether drift has occurred by studying the statistical properties of the features.
Semi-supervised methods lie between the two, assuming that after a certain moment a small number of data labels are available while the remaining labels are not.
Some methods in the prior art can detect the position at which concept drift occurs in a data stream and adjust the downstream decision model in time. However, these methods have the following disadvantages. 1) About 85% of existing concept drift detection methods are supervised; although such methods achieve high detection accuracy, their assumption that labels are immediately available is too optimistic, and they are not suitable for real-world situations. 2) Although retraining the model with labeled data is unavoidable, using labels in the drift detection stage wastes label resources. 3) Unsupervised methods focus on changes in the distribution of the feature attributes, so how to model the data distribution is the key problem, and traditional density estimation methods have difficulty accurately fitting the distribution followed by a limited data sample. In summary, research and development of an unsupervised concept drift detection method capabl