CN-122019990-A - Machine learning-based big data intelligent cleaning method and system

Abstract

The invention provides a machine learning-based intelligent big data cleaning method and system, relating to the technical field of data processing. The method comprises: step 1, collecting and preprocessing a user transaction behavior data stream, performing feature engineering, and extracting multidimensional feature vectors to obtain a feature data set to be cleaned; and step 2, constructing a data feature space based on the feature data set to be cleaned, mapping the multidimensional feature vectors into the data feature space, inputting them into an anomaly detection model built on the isolation forest algorithm, and calculating a real-time anomaly score for each feature vector. The method calculates the anomaly scores of the feature vectors with the isolation forest algorithm, generates a dynamic decision threshold by combining the constructed quadrilateral anomaly decision domain with the projection of a high-dimensional confidence ellipsoid, cleans out unreal transaction data, and iteratively updates the model as the data feature distribution changes, thereby improving the accuracy, adaptability, and data reliability of big data cleaning.
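
The anomaly scoring described above can be illustrated with a minimal pure-Python sketch. It assumes the standard isolation forest score s(x) = 2^(-E[h(x)] / c(n)), where E[h(x)] is the mean path length of a point over all trees and c(n) is the usual normalizing path length; the text here does not state the exact formula, so this is one common reading and not necessarily the patented implementation. The function names, the tree count, and the sample size are illustrative assumptions.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def c_factor(n):
    """Normalizing path length: average depth of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively partition points on a random feature at a random cut value."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # cannot split further on this feature
        return {"size": len(points)}
    cut = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < cut]
    right = [p for p in points if p[dim] >= cut]
    return {
        "dim": dim,
        "cut": cut,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(x, node, depth=0):
    """Depth at which x lands in a leaf, plus a correction for unsplit leaf points."""
    if "dim" not in node:
        return depth + c_factor(node["size"])
    child = node["left"] if x[node["dim"]] < node["cut"] else node["right"]
    return path_length(x, child, depth + 1)


def anomaly_scores(data, n_trees=100, sample_size=64, seed=0):
    """Score each feature vector in (0, 1); higher means more anomalous.

    Assumes `data` is a list of at least two equal-length numeric tuples.
    """
    rng = random.Random(seed)
    psi = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(psi))
    trees = [build_tree(rng.sample(data, psi), 0, max_depth, rng) for _ in range(n_trees)]
    scores = []
    for x in data:
        mean_len = sum(path_length(x, t) for t in trees) / n_trees
        scores.append(2.0 ** (-mean_len / c_factor(psi)))
    return scores
```

Calling `anomaly_scores(feature_vectors)` on a list of standardized feature tuples returns one score per vector; points that are isolated after only a few random splits receive scores closer to 1.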

Inventors

  • WANG TENGFEI
  • FAN BO
  • FENG TAO

Assignees

  • 济南银华信息技术有限公司

Dates

Publication Date
20260512
Application Date
20251205

Claims (10)

  1. A machine learning-based intelligent big data cleaning method, characterized by comprising the following steps:
     Step 1, performing preprocessing and feature engineering on a user transaction behavior data stream, and extracting multidimensional feature vectors to obtain a feature data set to be cleaned;
     Step 2, constructing a data feature space based on the feature data set to be cleaned, and mapping the multidimensional feature vectors into the data feature space;
     Step 3, performing statistical analysis on the real-time anomaly scores of all feature vectors to determine an initial decision threshold;
     Step 4, determining a first feature dimension and a second feature dimension in the data feature space to establish an anchor point set, and connecting the anchor points in the anchor point set to form a quadrilateral anomaly decision domain;
     Step 5, constructing a high-dimensional confidence ellipsoid based on the covariance matrix and the mean vector of the feature vectors, projecting the high-dimensional confidence ellipsoid onto the two-dimensional plane formed by the first feature dimension and the second feature dimension to obtain a two-dimensional confidence ellipse, calculating the area proportion of the overlap between the quadrilateral anomaly decision domain and the two-dimensional confidence ellipse to obtain a threshold adjustment factor, and weighting the threshold adjustment factor and the initial decision threshold to generate a dynamic decision threshold;
     Step 6, comparing the real-time anomaly scores with the dynamic decision threshold to identify and clean unreal transaction data and obtain a clean data stream, continuously monitoring changes in the feature distribution of the clean data stream, and triggering an update of the anomaly detection model and the dynamic decision threshold when the distribution change exceeds a preset threshold.
  2. The machine learning-based intelligent big data cleaning method according to claim 1, wherein step 1 comprises:
     collecting a user transaction behavior data stream, wherein the user transaction behavior data stream comprises a transaction timestamp, a transaction amount, a buyer identifier, a seller identifier, logistics status information, and a user operation behavior sequence;
     performing data cleaning on the user transaction behavior data stream, including removing duplicate records, filling missing values, and correcting format errors;
     extracting time-dimension features, amount-dimension features, behavior-sequence-dimension features, and relationship-network-dimension features from the cleaned user transaction behavior data stream to form multidimensional feature vectors; and
     standardizing the multidimensional feature vectors to obtain the feature data set to be cleaned.
  3. The machine learning-based intelligent big data cleaning method according to claim 2, wherein step 2 comprises:
     constructing a data feature space based on the feature data set to be cleaned, and mapping each multidimensional feature vector to its corresponding coordinate position in the data feature space; and
     constructing an anomaly detection model with the isolation forest algorithm, wherein a preset number of isolation trees are generated in the data feature space, each isolation tree isolates the feature vectors into individual nodes through recursive random partitioning, the average path length of each feature vector over all isolation trees is calculated, and an anomaly score is generated from the ratio of the average path length to a preset path length reference.
  4. The machine learning-based intelligent big data cleaning method according to claim 3, wherein the anomaly scores are sorted by value, and the value at a predetermined quantile of the sorted scores is selected as the initial decision threshold.
  5. The machine learning-based intelligent big data cleaning method according to claim 4, wherein step 4 comprises:
     analyzing, for each feature dimension, how often it is selected and how much information gain it produces during isolation tree node splitting in the anomaly detection model, calculating the anomaly discrimination contribution of each feature dimension, and selecting the two feature dimensions with the highest anomaly discrimination contribution as the first feature dimension and the second feature dimension, respectively;
     establishing an anchor point set based on the projected distribution of all feature vectors in the feature data set to be cleaned on the two-dimensional feature subspace formed by the first and second feature dimensions, wherein the anchor point set consists of four anchor points: a first anchor point at the intersection of the lower quartile of the first feature dimension and the lower quartile of the second feature dimension, a second anchor point at the intersection of the upper quartile of the first feature dimension and the lower quartile of the second feature dimension, a third anchor point at the intersection of the upper quartile of the first feature dimension and the upper quartile of the second feature dimension, and a fourth anchor point at the intersection of the lower quartile of the first feature dimension and the upper quartile of the second feature dimension; and
     connecting the four anchor points in order of spatial position to form the quadrilateral anomaly decision domain.
  6. The machine learning-based intelligent big data cleaning method according to claim 5, wherein step 5 comprises:
     performing eigenvalue decomposition on the covariance matrix of all feature vectors in the feature data set to be cleaned to obtain covariance eigenvectors and eigenvalues;
     constructing a high-dimensional confidence ellipsoid based on the covariance eigenvectors and eigenvalues, wherein the covariance eigenvectors determine the principal axis directions of the high-dimensional confidence ellipsoid, and the product of the square root of each eigenvalue and a preset confidence coefficient determines the semi-axis length along the corresponding principal axis;
     projecting the high-dimensional confidence ellipsoid onto the two-dimensional plane formed by the first feature dimension and the second feature dimension to obtain a two-dimensional confidence ellipse;
     calculating the area of the overlap between the quadrilateral anomaly decision domain and the two-dimensional confidence ellipse on the two-dimensional plane, calculating the proportion of the overlap area to the total area of the quadrilateral anomaly decision domain, and taking the proportion as the threshold adjustment factor; and
     performing a weighted calculation on the threshold adjustment factor and the initial decision threshold with a preset weight coefficient to generate the dynamic decision threshold.
  7. The machine learning-based intelligent big data cleaning method according to claim 6, wherein step 6 comprises:
     comparing the real-time anomaly score with the dynamic decision threshold in real time, and identifying user transaction behavior data whose real-time anomaly score is greater than the dynamic decision threshold as unreal transaction data;
     removing the records marked as unreal transaction data from the user transaction behavior data stream to obtain a clean user transaction behavior data stream; and
     calculating the degree of difference between the feature distribution of the clean user transaction behavior data stream and the historical feature distribution, and triggering an update of the anomaly detection model when the degree of difference exceeds a preset change threshold, wherein the update comprises recalculating the anomaly discrimination contribution of each feature dimension using newly collected user transaction behavior data, reselecting the first feature dimension and the second feature dimension, and reconstructing the quadrilateral anomaly decision domain, the high-dimensional confidence ellipsoid, and the dynamic decision threshold.
  8. A machine learning-based intelligent big data cleaning system implementing the method of any one of claims 1 to 7, comprising:
     an acquisition module, used for collecting a user transaction behavior data stream, performing preprocessing and feature engineering on it, and extracting multidimensional feature vectors to obtain a feature data set to be cleaned;
     a computing module, used for constructing a data feature space from the feature data set to be cleaned, and mapping the multidimensional feature vectors into the data feature space;
     a determining module, used for performing statistical analysis on the real-time anomaly scores of all feature vectors to determine an initial decision threshold;
     an establishing module, used for determining a first feature dimension and a second feature dimension in the data feature space to establish an anchor point set, and connecting the anchor points in the anchor point set to form a quadrilateral anomaly decision domain;
     a weighting module, used for constructing a high-dimensional confidence ellipsoid based on the covariance matrix and the mean vector of the feature vectors, and projecting the high-dimensional confidence ellipsoid onto the two-dimensional plane formed by the first feature dimension and the second feature dimension to obtain a two-dimensional confidence ellipse; and
     an updating module, used for comparing the real-time anomaly scores with the dynamic decision threshold to identify and clean unreal transaction data and obtain a clean data stream, continuously monitoring changes in the feature distribution of the clean data stream, and triggering an update of the anomaly detection model and the dynamic decision threshold when the distribution change exceeds a preset threshold.
  9. A computing device, comprising: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1 to 7.
  10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a program which, when executed by a processor, implements the method according to any one of claims 1 to 7.
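
The threshold logic of claims 4 to 6 can be sketched in pure Python as follows. Several points are assumptions rather than statements from the claims: the sketch relies on the fact that the orthogonal projection of the ellipsoid (x - mu)^T S^-1 (x - mu) <= k^2 onto two coordinate axes is the ellipse defined by the corresponding 2x2 sub-block of S with the same k, so only the two selected dimensions are needed; the overlap ratio is estimated on a grid instead of being computed exactly; and the claims do not give the weighting formula, so the convex combination in `dynamic_threshold` is purely hypothetical. All function names and default parameters (`q`, `k`, `grid`, `weight`) are illustrative.

```python
import statistics


def quantile_threshold(scores, q=0.95):
    """Initial decision threshold: the score at a preset quantile (claim 4)."""
    ordered = sorted(scores)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]


def quartile_rectangle(xs, ys):
    """Four anchor points at the lower/upper quartile intersections (claim 5)."""
    x_lo, _, x_hi = statistics.quantiles(xs, n=4)
    y_lo, _, y_hi = statistics.quantiles(ys, n=4)
    return [(x_lo, y_lo), (x_hi, y_lo), (x_hi, y_hi), (x_lo, y_hi)]


def projected_ellipse(xs, ys):
    """Mean and 2x2 sample covariance (sxx, sxy, syy) of the two selected dimensions."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    n = len(xs)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def inside_ellipse(px, py, mean, cov, k):
    """Mahalanobis test: (p - mean)^T cov^-1 (p - mean) <= k^2."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy  # assumed non-zero (non-degenerate covariance)
    dx, dy = px - mean[0], py - mean[1]
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return d2 <= k * k


def overlap_ratio(rect, mean, cov, k=2.0, grid=200):
    """Estimate (overlap area) / (rectangle area) by testing grid cell centers."""
    (x_lo, y_lo), _, (x_hi, y_hi), _ = rect
    hits = 0
    for i in range(grid):
        px = x_lo + (i + 0.5) * (x_hi - x_lo) / grid
        for j in range(grid):
            py = y_lo + (j + 0.5) * (y_hi - y_lo) / grid
            if inside_ellipse(px, py, mean, cov, k):
                hits += 1
    return hits / (grid * grid)


def dynamic_threshold(initial, factor, weight=0.2):
    """Hypothetical weighting: the claims only say the two values are weighted."""
    return (1.0 - weight) * initial + weight * factor
```

Under these assumptions, a record would be flagged as unreal when its anomaly score exceeds `dynamic_threshold(quantile_threshold(scores), overlap_ratio(rect, mean, cov))`, and the whole chain would be recomputed whenever the monitored feature distribution drifts past the preset change threshold.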

Description

Machine learning-based big data intelligent cleaning method and system

Technical Field

The invention relates to the technical field of data processing, and in particular to a machine learning-based intelligent big data cleaning method and system.

Background

In a big data environment, real-time cleaning of user transaction behavior data is an important step in guaranteeing the quality of subsequent analysis and applications. At present, machine learning-based anomaly detection methods such as the isolation forest algorithm are often used to identify atypical patterns in transaction data streams, because they require no prior labels. However, such methods face challenges in actual deployment. A common concern is that their anomaly decision step mostly depends on a preset or static threshold obtained from initial data statistics, whereas in practice the feature distribution of user transaction behavior is not constant and may gradually evolve with business dynamics. This is particularly evident in complex e-commerce transaction scenarios.

For example, during a flash sale promotion or a large holiday campaign on a platform, the normal fluctuation range of key features such as transaction frequency and average order value may temporarily widen, so that a previously set static threshold is no longer fully applicable. In that case the original model may misjudge some of the normal high-concurrency transactions generated by the promotion as anomalous, or, once the data returns to its normal state after the campaign ends, it may not be sensitive enough to novel and concealed fraudulent behavior. Either way, the accuracy and adaptability of the data cleaning result may be affected.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a machine learning-based intelligent big data cleaning method and system that can accurately locate unreal transaction data and reduce the false positive rate and the false negative rate.

To solve the above technical problem, the technical scheme of the invention is as follows.

In a first aspect, a machine learning-based intelligent big data cleaning method includes:
     Step 1, performing preprocessing and feature engineering on a user transaction behavior data stream, and extracting multidimensional feature vectors to obtain a feature data set to be cleaned;
     Step 2, constructing a data feature space based on the feature data set to be cleaned, and mapping the multidimensional feature vectors into the data feature space;
     Step 3, performing statistical analysis on the real-time anomaly scores of all feature vectors to determine an initial decision threshold;
     Step 4, determining a first feature dimension and a second feature dimension in the data feature space to establish an anchor point set, and connecting the anchor points in the anchor point set to form a quadrilateral anomaly decision domain;
     Step 5, constructing a high-dimensional confidence ellipsoid based on the covariance matrix and the mean vector of the feature vectors, projecting the high-dimensional confidence ellipsoid onto the two-dimensional plane formed by the first feature dimension and the second feature dimension to obtain a two-dimensional confidence ellipse, calculating the area proportion of the overlap between the quadrilateral anomaly decision domain and the two-dimensional confidence ellipse to obtain a threshold adjustment factor, and weighting the threshold adjustment factor and the initial decision threshold to generate a dynamic decision threshold;
     Step 6, comparing the real-time anomaly scores with the dynamic decision threshold to identify and clean unreal transaction data and obtain a clean data stream, continuously monitoring changes in the feature distribution of the clean data stream, and triggering an update of the anomaly detection model and the dynamic decision threshold when the distribution change exceeds a preset threshold.

In a second aspect, a machine learning-based intelligent big data cleaning system includes:
     an acquisition module, used for collecting a user transaction behavior data stream, performing preprocessing and feature engineering on it, and extracting multidimensional feature vectors to obtain a feature data set to be cleaned;
     a computing module, used for constructing a data feature space from the feature data set to be cleaned, and mapping the multidimensional feature vectors into the data feature space;
     a determining module, used for performing statistical analysis on the real-time anomaly scores of all feature vectors to determine an initial decision threshold;
     an establishing module, used for determining a first charac