CN-121542876-B - Electric power marketing data cleaning and deduplication method and system
Abstract
The invention provides a method and a system for cleaning and deduplicating electric power marketing data. The method comprises: calculating the neighborhood density and the local attribute fluctuation degree of each data point; marking a data point as an initial noise point if its neighborhood density is lower than a threshold and its fluctuation degree is higher than a threshold, and otherwise classifying it as a core point or a boundary point according to its density; setting an asymmetric expansion neighborhood for each boundary point and calculating the sum of the distance attenuation function values from the initial noise points in that neighborhood to the boundary point; reclassifying the initial noise points if the sum is higher than a first threshold; reclassifying an initial noise point as a secondary boundary point if it lies in the overlapping neighborhood of boundary points whose noise association indices have a geometric mean lower than a second threshold; assigning the secondary boundary points to the cluster of the nearest core point; isolating the confirmed noise points; calculating stability within the clusters; correcting the distance from each data point to its cluster centroid according to the stability; selecting the data point with the minimum corrected distance as the main data record; and merging redundant records.
Inventors
- CHEN XUEMIN
- WANG LISAI
- MA JIAN
- WAN TIAN
- MENG CHANGYUAN
- YANG XIAOBO
- TAN CHEN
- LV LINJIE
- QI CHENGFEI
- LI HONGYU
- YAN XIONGPENG
- WANG YAOYU
- XIONG HONGZHANG
Assignees
- 国网冀北电力有限公司计量中心
- 朗新科技集团股份有限公司
Dates
- Publication Date: 2026-05-08
- Application Date: 2025-11-19
Claims (7)
- 1. A method for cleaning and deduplicating electric power marketing data, characterized by comprising the following steps: calculating, for each data point in the electric power marketing data, the neighborhood density and the local attribute fluctuation degree based on the periodic characteristics of power consumption; if the neighborhood density of a data point is lower than a density threshold and its local attribute fluctuation degree is higher than a fluctuation degree threshold, judging the data point to be an initial noise point, and otherwise judging it to be a core point or a boundary point according to its neighborhood density; setting an asymmetric expansion neighborhood for each boundary point, and calculating the sum of the distance attenuation function values from each initial noise point in the neighborhood to the boundary point as the noise association index of the boundary point; if an initial noise point is located in the overlapping neighborhood of a group of boundary points and the geometric mean of the noise association indices of the group is lower than a second preset threshold, reclassifying the initial noise point as a secondary boundary point and assigning it to the cluster containing the core point with the smallest weighted distance to the noise point; correcting the distance from each data point to its cluster centroid according to the stability, selecting the data point with the minimum corrected distance in the cluster as the main data record, and merging redundant data records; wherein reclassifying the initial noise point as a secondary boundary point and assigning it to the cluster containing the closest core point comprises: normalizing the spatial coordinates and the power consumption mean attribute of the data points, and calculating the weighted distance from the initial noise point to each core point on the normalized data, the weighted distance formula having as its terms the normalized Euclidean distance and the absolute difference of the normalized power consumption means, with k1 and k2 as coefficients; and selecting the core point with the smallest weighted distance and assigning the initial noise point to the cluster of that core point; wherein the stability of a boundary point is inversely proportional to its noise association index and is calculated by a formula from the noise association index of the boundary point; and wherein correcting the distance from each data point to the cluster centroid according to the stability comprises calculating the corrected distance by a formula from the original Euclidean distance from the data point to the cluster centroid and S, the data record stability of the data point.
- 2. The method of claim 1, wherein calculating the neighborhood density of the data points and the local attribute fluctuation degree based on the periodic characteristics of power consumption comprises: extracting 36 consecutive months of power consumption data for the data point; taking 12 months as one period, calculating the standard deviation of the power consumption of the same calendar month across the three consecutive years; and taking the arithmetic mean of the 12 monthly standard deviations as the local attribute fluctuation degree of the data point.
- 3. The method of claim 1, wherein setting an asymmetric expansion neighborhood for each boundary point and calculating the sum of the distance attenuation function values from each initial noise point in the neighborhood to the boundary point as the noise association index of the boundary point comprises: for a boundary point K, taking the boundary point as the center and defining as its asymmetric expansion neighborhood a region of radius epsilon extending in the direction in which the density is lower than at the boundary point; evaluating the distance attenuation function for each initial noise point in the neighborhood; and summing the distance attenuation function values of all initial noise points in the neighborhood to obtain the noise association index.
- 4. The method of claim 1, wherein judging a data point to be a core point or a boundary point according to its neighborhood density comprises: obtaining the neighborhood radius epsilon and the minimum neighborhood point threshold MinPts used for core point determination; for any data point that is not an initial noise point, calculating the number N of data points in its epsilon neighborhood; if N is greater than or equal to MinPts, judging the data point to be a core point; and if N is less than MinPts, judging the data point to be a boundary point.
- 5. A system for cleaning and deduplicating electric power marketing data, comprising the following modules: a judging module, used for calculating, for each data point in the electric power marketing data, the neighborhood density and the local attribute fluctuation degree based on the periodic characteristics of power consumption, judging a data point to be an initial noise point if its neighborhood density is lower than a density threshold and its local attribute fluctuation degree is higher than a fluctuation degree threshold, and otherwise judging it to be a core point or a boundary point according to its neighborhood density; a classification module, used for setting an asymmetric expansion neighborhood for each boundary point and calculating the sum of the distance attenuation function values from each initial noise point in the neighborhood to the boundary point as the noise association index of the boundary point; a computing module, used for reclassifying an initial noise point as a secondary boundary point, and assigning it to the cluster containing the core point with the smallest weighted distance to the noise point, if the initial noise point is located in the overlapping neighborhood of a group of boundary points and the geometric mean of the noise association indices of the group is lower than a second preset threshold; a merging module, used for correcting the distance from each data point to its cluster centroid according to the stability, selecting the data point with the minimum corrected distance in the cluster as the main data record, and merging redundant data records; wherein reclassifying the initial noise point as a secondary boundary point and assigning it to the cluster containing the closest core point comprises: normalizing the spatial coordinates and the power consumption mean attribute of the data points, and calculating the weighted distance from the initial noise point to each core point on the normalized data, the weighted distance formula having as its terms the normalized Euclidean distance and the absolute difference of the normalized power consumption means, with k1 and k2 as coefficients; and selecting the core point with the smallest weighted distance and assigning the initial noise point to the cluster of that core point; wherein the stability of a boundary point is inversely proportional to its noise association index and is calculated by a formula from the noise association index of the boundary point; and wherein correcting the distance from each data point to the cluster centroid according to the stability comprises calculating the corrected distance by a formula from the original Euclidean distance from the data point to the cluster centroid and S, the data record stability of the data point.
- 6. The system of claim 5, wherein calculating the neighborhood density of the data points and the local attribute fluctuation degree based on the periodic characteristics of power consumption comprises: extracting 36 consecutive months of power consumption data for the data point; taking 12 months as one period, calculating the standard deviation of the power consumption of the same calendar month across the three consecutive years; and taking the arithmetic mean of the 12 monthly standard deviations as the local attribute fluctuation degree of the data point.
- 7. The system of claim 5, wherein setting an asymmetric expansion neighborhood for each boundary point and calculating the sum of the distance attenuation function values from each initial noise point in the neighborhood to the boundary point as the noise association index of the boundary point comprises: for a boundary point K, taking the boundary point as the center and defining as its asymmetric expansion neighborhood a region of radius epsilon extending in the direction in which the density is lower than at the boundary point; evaluating the distance attenuation function for each initial noise point in the neighborhood; and summing the distance attenuation function values of all initial noise points in the neighborhood to obtain the noise association index.
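The two steps that the claims specify completely, the local attribute fluctuation degree (claims 2 and 6) and the noise/core/boundary judgment (claims 1 and 4), can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `local_attribute_fluctuation` and `classify_points` are hypothetical, neighborhood density is assumed to be the number of points within epsilon (counting the point itself), and the population standard deviation is assumed because the claims do not say which estimator is used.

```python
from math import dist
from statistics import mean, pstdev


def local_attribute_fluctuation(monthly_usage):
    """Mean of the 12 same-calendar-month standard deviations over 36 months."""
    if len(monthly_usage) != 36:
        raise ValueError("expected 36 consecutive monthly readings")
    per_month_std = []
    for month in range(12):
        # The same calendar month in each of the three consecutive years.
        same_month = [monthly_usage[month + 12 * year] for year in range(3)]
        # Assumption: population standard deviation (the claims do not specify).
        per_month_std.append(pstdev(same_month))
    return mean(per_month_std)


def classify_points(points, fluctuation, eps, min_pts,
                    density_threshold, fluctuation_threshold):
    """Label each point 'initial_noise', 'core', or 'boundary'."""
    labels = []
    for i, p in enumerate(points):
        # Assumption: density = number of points within eps, including p itself.
        n = sum(1 for q in points if dist(p, q) <= eps)
        if n < density_threshold and fluctuation[i] > fluctuation_threshold:
            # Sparse neighborhood AND unstable consumption pattern.
            labels.append("initial_noise")
        elif n >= min_pts:
            labels.append("core")
        else:
            labels.append("boundary")
    return labels
```

Under these assumptions, a point in a sparse region is marked as initial noise only when its consumption pattern is also unstable; a sparse but stable point falls through to the core/boundary test and becomes a boundary point.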
Description
Electric power marketing data cleaning and deduplication method and system
Technical Field
The application belongs to the field of data processing, and particularly relates to a method and a system for cleaning and deduplicating electric power marketing data.
Background
The power marketing business system accumulates a large amount of user electricity data during operation, and this data is the basis for load prediction, user portrait analysis, and marketing strategy formulation. Current cleaning methods for such data include statistics-based methods and clustering-based methods. Density-based clustering algorithms can find clusters of arbitrary shape and do not require the number of clusters to be specified in advance. However, when identifying noise points, traditional density algorithms usually rely only on the spatial neighborhood density of the data points and ignore the special business attributes of power marketing data. When processing power data with complex business logic and varied data characteristics, they easily misjudge data points that are normal but lie in a sparse area as noise, or fail to identify data points whose density is not low but whose user behavior characteristics are abnormal. In density clustering, boundary points are very close to noise points in spatial distribution because of their low neighborhood density, so it is difficult to distinguish whether low-density points at the edge of a cluster are reasonable extensions of the cluster or noise to be removed. In the data deduplication stage, existing methods usually ignore the differences in data quality and stability between data points at different positions after noise has been removed. In particular, the data records of boundary points that were adjacent to noise points should be regarded as less reliable than those of core points far from the noise region.
Therefore, the prior art lacks a deduplication mechanism that combines the noise association degree and the stability of data points. As a result, the deduplication result is easily affected by unreliable boundary points, and the selected main data record may not be the best representative, which degrades data quality.
Disclosure of Invention
The invention provides a method for cleaning and deduplicating electric power marketing data, which addresses the lack in the prior art of a deduplication mechanism that combines the noise association degree and the stability of data points. The method comprises the following steps: calculating, for each data point in the electric power marketing data, the neighborhood density and the local attribute fluctuation degree based on the periodic characteristics of power consumption; if the neighborhood density of a data point is lower than a density threshold and its local attribute fluctuation degree is higher than a fluctuation degree threshold, judging the data point to be an initial noise point, and otherwise judging it to be a core point or a boundary point according to its neighborhood density; setting an asymmetric expansion neighborhood for each boundary point, and calculating the sum of the distance attenuation function values from each initial noise point in the neighborhood to the boundary point as the noise association index of the boundary point; if an initial noise point is located in the overlapping neighborhood of a group of boundary points and the geometric mean of the noise association indices of the group is lower than a second preset threshold, reclassifying the initial noise point as a secondary boundary point and assigning it to the cluster containing the core point with the smallest weighted distance to the noise point; and correcting the distance from each data point to its cluster centroid according to the stability, selecting the data point with the minimum corrected distance in the cluster as the main data record, and merging redundant data records.
In addition, the invention also relates to a system for cleaning and deduplicating electric power marketing data, which comprises the following modules: a judging module, used for calculating, for each data point in the electric power marketing data, the neighborhood density and the local attribute fluctuation degree based on the periodic characteristics of power consumption, judging a data point to be an initial noise point if its neighborhood density is lower than a density threshold and its local attribute fluctuation degree is higher than a fluctuation degree threshold, and otherwise judging it to be a core point or a boundary point according to its neighborhood density; a classification module, used for setting an asymmetric expansion neighborhood for each boundary point, calculating