CN-115794786-B - Method for detecting change of data table in data cleaning process
Abstract
The invention discloses a method for detecting changes to a data table during data cleaning. A data table change space is summarized from the changes that various data conversion operations cause to a data table. The change space has two dimensions, data objects and change attributes: the data objects are the table, row, column and cell, and the change attributes are the quantity, sequence, relationship, value and type attributes. The changes between the data input table and the data output table of the data cleaning process are then compared on the basis of this change space. Because changes are compared per data object and per change attribute, the detection result is more detailed and comprehensive, and the method can be applied in many scenarios, such as inferring the semantics of data cleaning code and visualizing data table changes.
Inventors
- WANG YONGHENG
- FU SIWEI
- XIONG KAI
- WU YINGCAI
Assignees
- Zhejiang Lab (之江实验室)
Dates
- Publication Date
- 20260508
- Application Date
- 20220927
Claims (6)
- 1. A method of detecting a change in a data table during a data cleaning process, comprising the steps of: step S11, summarizing a data table change space according to the changes caused by various data conversion operations on a data table, wherein the data table change space comprises two dimensions, namely data objects and change attributes, the data objects comprise a table, a row, a column and a cell, and the change attributes comprise a quantity attribute, a sequence attribute, a relationship attribute, a value attribute and a type attribute; step S12, comparing the changes between the data input table and the data output table in the data cleaning process based on the data table change space, wherein step S12 comprises the following substeps: step S121, comparing the changes of the table, row and column data objects of the data input table and the data output table on the quantity attribute; step S122, comparing the changes of the row and column data objects of the data input table and the data output table on the sequence attribute; step S123, comparing the changes of the row, column and cell data objects of the data input table and the data output table on the relationship attribute; step S124, comparing the changes of the table, row, column and cell data objects of the data input table and the data output table on the value attribute; step S125, comparing the changes of the column data objects of the data input table and the data output table on the type attribute.
- 2. The method of claim 1, wherein the quantity attribute is used to describe a quantity change of the data object after performing a data conversion operation.
- 3. The method of claim 1, wherein the sequence attribute is used to describe a change in a position of a row or column of the data table.
- 4. The method of claim 1, wherein the relationship attribute is used to describe arithmetic relationships and set relationships between different data objects.
- 5. The method of claim 1, wherein the value attribute is used to describe whether a particular value exists in the data object.
- 6. The method of claim 1, wherein the type attribute is used to describe a change in a data type of the data object.
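As an illustrative sketch only, and not part of the claims, the change space of claim 1 could be enumerated as below. The names `DataObject`, `ChangeAttribute`, `CHANGE_SPACE` and `COMPARED_PAIRS` are hypothetical, and step S124 is taken here to be the value-attribute comparison, which is an assumption based on the order in which the attributes are listed.

```python
from enum import Enum
from itertools import product


class DataObject(Enum):
    TABLE = "table"
    ROW = "row"
    COLUMN = "column"
    CELL = "cell"


class ChangeAttribute(Enum):
    QUANTITY = "quantity"
    SEQUENCE = "sequence"
    RELATIONSHIP = "relationship"
    VALUE = "value"
    TYPE = "type"


# Full change space: every (data object, change attribute) pair.
CHANGE_SPACE = list(product(DataObject, ChangeAttribute))

# Data objects that each substep of S12 compares, keyed by attribute.
# The VALUE entry (S124) is an assumed reading of the claim.
COMPARED_PAIRS = {
    ChangeAttribute.QUANTITY: [DataObject.TABLE, DataObject.ROW, DataObject.COLUMN],  # S121
    ChangeAttribute.SEQUENCE: [DataObject.ROW, DataObject.COLUMN],  # S122
    ChangeAttribute.RELATIONSHIP: [DataObject.ROW, DataObject.COLUMN, DataObject.CELL],  # S123
    ChangeAttribute.VALUE: list(DataObject),  # S124
    ChangeAttribute.TYPE: [DataObject.COLUMN],  # S125
}

if __name__ == "__main__":
    print(len(CHANGE_SPACE))
    print(sum(len(objs) for objs in COMPARED_PAIRS.values()))
```

Under this reading, the full space has 4 x 5 = 20 pairs, while substeps S121 to S125 name a subset of them explicitly.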
Description
Method for detecting change of data table in data cleaning process
Technical Field
The invention belongs to the field of data comparison, and particularly relates to a method for detecting changes to a data table during data cleaning.
Background
Two-dimensional data tables are an effective means of organizing data and are widely used in communication, scientific research and data analysis. Because a raw table often contains "dirty" data, or its format or content does not meet the intended goal, data workers must clean the table. Data cleaning is the process of turning complex, messy data into the desired format through data conversion operations (e.g., filling in missing values or removing duplicate rows). During data cleaning, data workers often need to compare changes to the data table, either to confirm that a given data conversion operation was performed successfully or to decide which operation to perform next based on the current changes. However, because a data table contains a large number of rows and columns, and data conversion operations cause a wide variety of changes to it, it is difficult for data workers to compare the table before and after cleaning purely by hand.
Although much work has been done on comparing time-series data, graphic and image data, and the like, there are few techniques for comparing tabular data. Existing data comparison tools such as ExcelCompare, diffKit, daff and compare can determine and calculate the differences between two input data tables based on certain detection criteria (e.g., table size, cell content, or the number of unique rows and columns). In addition, visualization works such as VisDB, iHAT, visBricks and TACO use heat maps to visualize the differences between data tables.
However, these works only compare the differences between two data tables in a general setting: the dimensions they examine are limited, the difference indicators they report are not rich enough, and they hardly reflect the effect of data conversion operations on the table. They are therefore not suited to describing data table changes during data cleaning.
Disclosure of Invention
The invention aims to provide a method for detecting changes to a data table during data cleaning, addressing the shortcomings of the prior art.
This aim is achieved by the following technical scheme: a method for detecting changes to a data table during data cleaning, comprising the following steps:
Step S11, summarizing a data table change space according to the changes caused by various data conversion operations on a data table, wherein the data table change space comprises two dimensions, namely data objects and change attributes, the data objects comprise a table, a row, a column and a cell, and the change attributes comprise a quantity attribute, a sequence attribute, a relationship attribute, a value attribute and a type attribute;
Step S12, comparing the changes between the data input table and the data output table in the data cleaning process based on the data table change space.
Further, the quantity attribute is used to describe the change in the number of data objects after a data conversion operation is performed.
Further, the sequence attribute is used to describe the change in position of the rows and columns of the data table.
Further, the relationship attribute is used to describe arithmetic relationships and set relationships between different data objects.
Further, the value attribute is used to describe whether a particular value exists in the data object.
Further, the type attribute is used to describe the change in the data type of the data object.
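The attribute definitions above can be illustrated with a minimal sketch. It is not the patented implementation: the column-oriented `Table` layout, all function names, and the sample data are assumptions made for illustration, and the relationship attribute is left out for brevity.

```python
from typing import Any, Dict, List, Tuple

# Assumed layout: column name -> list of cell values, all columns equal length.
Table = Dict[str, List[Any]]


def row_count(table: Table) -> int:
    """Number of rows, taken from the first column (0 for an empty table)."""
    return len(next(iter(table.values()))) if table else 0


def quantity_change(before: Table, after: Table) -> Dict[str, int]:
    """Quantity attribute: signed change in row and column counts."""
    return {
        "rows": row_count(after) - row_count(before),
        "columns": len(after) - len(before),
    }


def column_sequence_change(before: Table, after: Table) -> Dict[str, Tuple[int, int]]:
    """Sequence attribute: shared columns whose position moved, as (old, new)."""
    old = {name: i for i, name in enumerate(before)}
    new = {name: i for i, name in enumerate(after)}
    return {n: (old[n], new[n]) for n in old if n in new and old[n] != new[n]}


def contains_value(table: Table, value: Any) -> bool:
    """Value attribute: whether a particular value occurs in any cell."""
    return any(cell == value for column in table.values() for cell in column)


def column_types(table: Table) -> Dict[str, Tuple[str, ...]]:
    """Sorted type names found in each column, ignoring missing (None) cells."""
    return {
        name: tuple(sorted({type(c).__name__ for c in column if c is not None}))
        for name, column in table.items()
    }


def type_change(before: Table, after: Table) -> Dict[str, Tuple[Tuple[str, ...], Tuple[str, ...]]]:
    """Type attribute: shared columns whose cell types differ, as (old, new)."""
    old, new = column_types(before), column_types(after)
    return {n: (old[n], new[n]) for n in old if n in new and old[n] != new[n]}


# Hypothetical cleaning step: drop a duplicate row, fill missing ages with 0,
# cast ages to integers, and move the "age" column to the front.
before = {"name": ["ann", "bob", "bob"], "age": ["31", None, None]}
after = {"age": [31, 0], "name": ["ann", "bob"]}

if __name__ == "__main__":
    print(quantity_change(before, after))
    print(column_sequence_change(before, after))
    print(contains_value(before, None), contains_value(after, None))
    print(type_change(before, after))
```

Each function reports one change attribute separately, so a single cleaning step that touches several attributes at once (as in the sample data) yields one finding per attribute rather than a single undifferentiated diff.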
Further, step S12 includes the following substeps:
Step S121, comparing the changes of the table, row and column data objects of the data input table and the data output table on the quantity attribute;
Step S122, comparing the changes of the row and column data objects of the data input table and the data output table on the sequence attribute;
Step S123, comparing the changes of the row, column and cell data objects of the data input table and the data output table on the relationship attribute;
Step S124, comparing the changes of the table, row, column and cell data objects of the data input table and the data output table on the value attribute;
Step S125, comparing the changes of the column data objects of the data input table and the data output table on the type attribute.
The beneficial effects of the method are as follows: changes to the data table during data cleaning are detected based on the table change space, which contains 4 data objects and 5 change attributes, for 20 comparison fields in total,