CN-120780982-B - Data quality detection method and system based on time sequence data
Abstract
The invention relates to the technical field of data detection and discloses a data quality detection method and system based on time sequence data. First, data are acquired at fixed intervals within a preset time period, and the time difference between adjacent data points is compared with the set interval to judge whether the data are continuous and to find missing data promptly, ensuring the integrity and reliability of the whole acquisition process. Second, once the data are confirmed to be complete, a data mean value is calculated from the historical record, the absolute deviation of each data value from the mean is computed, and the data are divided into normal and abnormal values using a preset deviation threshold. If the number of abnormal values does not exceed the abnormal number threshold, each abnormal value is corrected with the mean of the data values before and after it, reducing the influence of the abnormality on the whole data set; otherwise the data are considered seriously abnormal and must be acquired again to ensure accuracy.
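The completeness check summarized in the abstract can be illustrated with a minimal Python sketch. The function names `count_missing` and `is_complete` are hypothetical, and the rule used to convert an oversized gap into a number of missing samples is an assumption of this sketch; the source only states that the time difference between adjacent data points is compared with the set interval.

```python
from datetime import datetime, timedelta


def count_missing(timestamps, interval):
    """Count samples assumed missing between adjacent timestamps.

    Assumption (not from the source): a gap of roughly k intervals hides
    k - 1 samples, and any gap larger than one interval counts as at
    least one missing sample.
    """
    missing = 0
    for prev, curr in zip(timestamps, timestamps[1:]):
        gap = curr - prev
        if gap > interval:
            # timedelta / timedelta yields a float ratio.
            missing += max(1, round(gap / interval) - 1)
    return missing


def is_complete(timestamps, interval):
    """The series is treated as complete when no samples are missing."""
    return count_missing(timestamps, interval) == 0
```

For example, with a one-minute interval and samples at minutes 0, 1, 2 and 4, the sketch reports one missing sample, so the series would be treated as incomplete and the later deviation steps would not run.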
Inventors
- WANG TAO
Assignees
- 南京海帆数据科技有限公司
Dates
- Publication Date
- 20260508
- Application Date
- 20250711
Claims (6)
- 1. A method for detecting data quality based on time series data, comprising: step one, setting a time period and a time interval, acquiring data at fixed time intervals within the time period to obtain a plurality of time series data, and acquiring the time difference between every two adjacent time series data; step two, comparing the time difference with the time interval, wherein if the time difference is equal to the time interval, no data are missing between the two adjacent time series data; step three, counting the number of missing data between adjacent time series data, wherein if the number is zero, the time series data in the time period are complete, with no omissions; step four, on the premise that the time series data are complete, acquiring the time series data values, acquiring a time series data mean value from the historical record, subtracting the mean value from each time series data value, and taking the absolute value of the result to obtain a time series data value deviation; step five, setting a time series data value deviation threshold and comparing each time series data value deviation with it, wherein if the deviation is less than or equal to the threshold, the deviation is within a reasonable range and the corresponding time series data value is marked as a normal time series data value; step six, obtaining the number of abnormal time series data values, setting an abnormal number threshold, and comparing the abnormal number with the abnormal number threshold, wherein if the abnormal number is less than or equal to the abnormal number threshold, judging that the time series data values in the time period are not seriously abnormal,
replacing each abnormal time series data value with the average of the two time series data values before and after it, and if the abnormal number is greater than the abnormal number threshold, judging that the time series data values in the time period are seriously abnormal and re-acquiring the time series data; obtaining the time interval between two adjacent abnormal time series data values and recording it as the abnormal time interval; when the abnormal number is less than or equal to the abnormal number threshold, comparing the abnormal time interval with the time interval and judging from the comparison whether the adjacent abnormal time series data values are continuous; identifying sets of repeated time series data values; and obtaining different countermeasures according to the comparison result.
- 2. The method of claim 1, wherein if the number of repeated time series data value sets is less than or equal to a threshold for the number of repeated time series data value sets, the number is within a normal range and the quality of the time series data is not affected; and if the number of repeated time series data value sets is greater than the threshold, the number exceeds the normal range, the quality of the time series data is affected, and the repetition ratio is further analyzed.
- 3. The method for detecting data quality based on time series data according to claim 2, wherein the specific process of further analyzing the repetition ratio is to acquire the number of repeated time series data values, obtain the repetition ratio by dividing the number of repeated time series data values by the total number of time series data values, set a repetition ratio threshold, compare the repetition ratio with the repetition ratio threshold, and obtain different responses according to the comparison result.
- 4. The method of claim 3, wherein if the repetition ratio is less than or equal to the repetition ratio threshold, the repeated time series data values are corrected using the mean value correction method; and if the repetition ratio is greater than the repetition ratio threshold, the proportion of repeated time series data values is high and the time series data are re-acquired.
- 5. A data quality detection system based on time series data for performing the method of any one of claims 1-4, comprising: a time difference acquisition module, used for setting a time period and a time interval, acquiring data at fixed time intervals within the time period to obtain a plurality of time sequence data, and acquiring the time difference between every two adjacent time sequence data; a first comparison module, used for comparing the time difference with the time interval, wherein if the time difference is equal to the time interval, no data are missing between the two adjacent time sequence data; a statistical analysis module, used for counting the number of missing data between adjacent time sequence data, wherein if the number is zero, the time sequence data in the time period are complete, and if the number is greater than zero, the time sequence data in the time period are incomplete; a deviation acquisition module, used for acquiring the time sequence data values on the premise that the time sequence data are complete, acquiring the time sequence data mean value from the historical record, and obtaining the time sequence data value deviation by subtracting the mean value from each time sequence data value and taking the absolute value; a comparison module, used for setting a time sequence data value deviation threshold and comparing each time sequence data value deviation with it, wherein if the deviation is less than or equal to the threshold, the deviation is within a reasonable range and the corresponding time sequence data value is marked as a normal time sequence data value; and an abnormality analysis module, used for
acquiring the number of abnormal time sequence data values, setting an abnormal number threshold, and comparing the abnormal number with the abnormal number threshold, wherein if the abnormal number is less than or equal to the abnormal number threshold, the time sequence data values in the time period are not seriously abnormal and each abnormal time sequence data value is replaced by the average of the two time sequence data values before and after it, and if the abnormal number is greater than the abnormal number threshold, the time sequence data values in the time period are seriously abnormal and the time sequence data are re-acquired.
- 6. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, implements the data quality detection method based on time series data according to any one of claims 1-4.
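Steps four to six of claim 1 (deviation from the historical mean, threshold classification, and neighbour-average replacement) can be sketched in Python as follows. This is an illustrative sketch only: the function names `find_abnormal` and `correct_or_reject` are hypothetical, returning `None` to signal re-acquisition is a design choice of the sketch, and the handling of values at the ends of the series (where only one neighbour exists) is an assumption, since the claims do not address it.

```python
def find_abnormal(values, historical_mean, deviation_threshold):
    """Return indices whose absolute deviation from the mean exceeds the threshold."""
    return [
        i
        for i, value in enumerate(values)
        if abs(value - historical_mean) > deviation_threshold
    ]


def correct_or_reject(values, historical_mean, deviation_threshold,
                      abnormal_count_threshold):
    """Replace abnormal values by the average of their neighbours.

    Returns a corrected copy of the series, or None when the number of
    abnormal values exceeds the threshold (the series should then be
    re-acquired by the caller).
    """
    abnormal = find_abnormal(values, historical_mean, deviation_threshold)
    if len(abnormal) > abnormal_count_threshold:
        return None  # seriously abnormal: re-acquire the data

    corrected = list(values)
    for i in abnormal:
        # Assumption: at a series boundary, use the single available
        # neighbour; neighbours are read from the original values.
        neighbours = [values[j] for j in (i - 1, i + 1) if 0 <= j < len(values)]
        if neighbours:
            corrected[i] = sum(neighbours) / len(neighbours)
    return corrected
```

With values `[10.0, 10.2, 50.0, 9.8, 10.1]`, a historical mean of 10.0, a deviation threshold of 1.0 and an abnormal number threshold of 1, only the third value is flagged and it is replaced by the average of 10.2 and 9.8. Lowering the abnormal number threshold to 0 makes the sketch return `None`. Adjacent abnormal values, which the claims treat separately through the abnormal time interval, are not handled here.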
Description
Data quality detection method and system based on time sequence data
Technical Field
The invention relates to the technical field of data detection, in particular to a data quality detection method and system based on time sequence data.
Background
During acquisition, time sequence data may be affected by factors such as equipment faults, sensor errors, network transmission delays and external environmental interference, which can cause data to be lost, erroneous or abnormally fluctuating. If these problems are not detected and corrected in time, the data analysis results are likely to deviate and may even lead to erroneous decisions. In applications such as industrial monitoring, financial transactions and weather forecasting, the timeliness and accuracy of data are particularly critical, and even a minor abnormality may delay the system's response or increase prediction errors, thereby affecting the stable operation of the whole process. Time sequence data record the continuous state of a system or device as it changes over time, and their quality directly determines the accuracy and reliability of subsequent analysis, prediction and decision making, so quality detection of time sequence data is very important.
Disclosure of Invention
In view of the foregoing problems of the prior art, an object of the present invention is to provide a method and a system for detecting data quality based on time sequence data, so as to detect the data quality of the time sequence data.
The method comprises the following steps. First, a time period and a time interval are set, data are acquired at fixed time intervals within the time period to obtain a plurality of time sequence data, and the time difference between every two adjacent time sequence data is acquired. The time difference is compared with the time interval: if the time difference is equal to the time interval, no data are missing between the two adjacent time sequence data; if the time difference is larger than the time interval, data are missing between the two adjacent time sequence data. The number of missing data between adjacent time sequence data is counted: if the number is zero, the time sequence data in the time period are complete, with no omissions; if the number is greater than zero, the time sequence data in the time period are incomplete. On the premise that the time sequence data are complete, the time sequence data values are obtained, a time sequence data mean value is obtained from the historical record, and the mean value is subtracted from each time sequence data value and the absolute value is taken to obtain the time sequence data value deviation. A time sequence data value deviation threshold is set and each time sequence data value deviation is compared with it: if the deviation is smaller than or equal to the threshold, the deviation is within a reasonable range and the corresponding time sequence data value is marked as a normal time sequence data value; if the deviation is larger than the threshold, the time sequence data value exceeds a
reasonable range and is marked as an abnormal time sequence data value. The number of abnormal time sequence data values is obtained, an abnormal number threshold is set, and the abnormal number is compared with the abnormal number threshold: if the abnormal number is smaller than or equal to the abnormal number threshold, the time sequence data values in the time period are not seriously abnormal, and each abnormal time sequence data value is replaced by the average of the time sequence data values before and after it; if the abnormal number is larger than the abnormal number threshold, the time sequence data values in the time period are seriously abnormal, and the time sequence data are re-acquired. In some embodiments, the time interval between two adjacent abnormal time sequence data values is acquired and recorded as the abnormal time interval; when the abnormal number is smaller than or equal to the abnormal number threshold, the abnormal time interval is compared with the time interval, and different countermeasures are obtained according to the comparison result. In some embodiments, if the abnormal time interval is equal to the time interval, the two adjacent abnormal time sequence data values are continuous, the abnormal time se