CN-122020082-A - Information system data quality assessment method based on process

Abstract

The invention belongs to the technical field of data information processing and relates to a process-based information system data quality assessment method comprising the following steps: (1) drawing an information system data flow chart; (2) constructing a DQ-Petri net; (3) constructing a data quality assessment model; (4) calculating the data quality loss caused by data transmission; and (5) assessing the data quality of the information system. The process-based assessment model opens the black box of each link, so that a manager can locate the source nodes of quality problems in real time and can visualize and quantitatively assess the data quality of every link in the data flow, which supports fine-grained management. By identifying the links where problems arise, the method helps replace the post-hoc cleaning of the prior art with a source control strategy, which reduces the generation of new quality problems and avoids the high cost of long-term repeated cleaning.

Inventors

  • LIU QI
  • Feng Gengzhong
  • XU KE

Assignees

  • Xi'an Jiaotong University (西安交通大学)

Dates

Publication Date
20260512
Application Date
20260112

Claims (10)

  1. A process-based information system data quality assessment method, comprising the following steps: step 1, drawing an information system data flow chart: sorting out the organization and personnel structure, business processes, and inputs and outputs related to the information system, and drawing the information system data flow chart; step 2, constructing a DQ-Petri net: constructing the DQ-Petri net on the basis of the Petri net according to the characteristics of data quality loss transfer in the information system, so as to describe the process of data quality loss transfer in the information system and evaluate the data quality loss caused by each data operation node; step 3, constructing a data quality assessment model: constructing the data quality assessment model by defining the data quality dimensions of the information system and calculating the data quality loss caused by data operations; step 4, calculating the data quality loss caused by data transmission: splitting the data flow of the information system into a plurality of sub-networks connected in series or in parallel, and calculating the accumulated data quality loss through the serial structure, the 1-n parallel structure, and the n-1 parallel structure of the decomposed sub-networks; and step 5, assessing the data quality of the information system: obtaining the total data quality loss of the information system by calculating the influence of the data quality loss caused by each data operation node on the data quality of the information system.
  2. The process-based information system data quality assessment method according to claim 1, wherein in step 1, the tools used for drawing the information system data flow chart include Microsoft Visio and Draw.
  3. The process-based information system data quality assessment method according to claim 1, wherein in step 2, the DQ-Petri net is a five-tuple consisting of: a set representing the data operation nodes in the information system; a finite non-empty set of transitions, used to describe the process of data quality loss transfer; a set of directed arcs representing the direction of data flow; the data origin; and a data quality loss function L.
  4. The process-based information system data quality assessment method according to claim 3, wherein in step 2, each quality loss transition consists of two parameters: the first parameter represents the data quality loss in a given dimension caused by a given data operation node, and the second parameter represents the ratio of the amount of data that one node transmits to another node per unit time to the amount of data that the receiving node receives per unit time.
  5. The process-based information system data quality assessment method according to claim 1, wherein in step 3, the information system data quality loss is given by formula (4), whose symbols denote the ideal relation, the true relation, and, for the relation concerned, its inaccuracy rate, incompleteness rate, and false recording rate.
  6. The process-based information system data quality assessment method according to claim 1, wherein in step 3, the data quality loss caused by the manual query operation is given by formulas (5), (6), and (7), wherein the formulas involve three groups of quantities: the inaccuracy rate, incompleteness rate, and false recording rate caused by the operation; the inaccuracy rate, incompleteness rate, and false recording rate of the data set that arrives at the operation per unit time, before the operation is applied; and the inaccuracy rate, incompleteness rate, and false recording rate of the new data set that is generated after the arriving data set passes through the operation.
  7. The process-based information system data quality assessment method according to claim 6, wherein in step 3, the data quality loss caused by a query operation node is given by formula (8), whose terms are: the k-th dimension quality loss caused by query operation i; the k-th dimension data quality loss of the query result set R generated by query operation i; and the k-th dimension quality loss of each relation involved in the query operation, where n is the number of relations involved in the query operation.
  8. The process-based information system data quality assessment method according to claim 1, wherein in step 4, the accumulated k-th dimension data quality loss from one data operation node up to another node is defined, and then: in the serial structure, the accumulated k-th dimension data quality loss from one data operation node to another is given by formula (9); in the 1-n parallel structure, the accumulated data quality loss caused by node 1 to the set F of downstream nodes is given by formula (10); and in the n-1 parallel structure, the accumulated data quality loss on each parallel branch is given by formula (11); wherein the formulas use the data transfer ratios per unit time between the respective pairs of nodes and the k-th dimension data quality losses caused by the respective nodes.
  9. The process-based information system data quality assessment method according to claim 1, wherein in step 5, for an information system with a network structure, the accumulated k-th dimension data quality loss in the information system caused by the i-th node is given by formula (12), whose terms are the k-th dimension data quality loss caused by the i-th node and the data transfer ratio per unit time between an upstream node j and its downstream node j+1 on a data stream of the i-th node.
  10. The process-based information system data quality assessment method according to claim 9, wherein in step 5, the total data quality loss is given by formula (13), whose terms include the number of records of the true relation and the number of records of the ideal relation.
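
The data transfer ratio per unit time described in claim 4 (the amount of data one node sends to another, divided by the total amount the receiving node receives) can be illustrated with a minimal Python sketch. The function name, the node names, and the volumes below are hypothetical and chosen only for illustration; the claims do not prescribe any implementation.

```python
from collections import defaultdict


def transfer_ratios(flows):
    """Return {(src, dst): share of dst's incoming data that comes from src}.

    `flows` maps (src, dst) to the amount of data sent per unit time.
    This mirrors the ratio described in claim 4; it is an illustrative
    sketch, not the patent's own implementation.
    """
    received = defaultdict(float)
    for (_, dst), amount in flows.items():
        received[dst] += amount  # total data arriving at dst per unit time
    return {
        (src, dst): amount / received[dst]
        for (src, dst), amount in flows.items()
        if received[dst] > 0  # skip arcs into nodes that receive nothing
    }


# Hypothetical volumes (records per hour); node names are assumptions.
flows = {
    ("entry", "audit"): 300.0,
    ("import", "audit"): 100.0,
    ("audit", "report"): 400.0,
}
ratios = transfer_ratios(flows)
print(ratios)
```

In this example the "audit" node receives 400 records per hour in total, so the ratios on its two incoming arcs sum to 1.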

Description

Information system data quality assessment method based on process

Technical Field

The invention belongs to the technical field of data information processing, and particularly relates to a process-based information system data quality assessment method.

Background

In the prior art, efforts to improve the data quality of an information system concentrate on the data cleaning and repair stage: after data has entered a database, existing dirty data is cleaned by technical means such as data mining, prediction, de-duplication, and error correction, so as to improve the accuracy, completeness, and other quality dimensions of the data currently in the database. Such methods are typically based on static data quality assessment at the database level, rely on sampling analysis, rule matching, or manual auditing to identify data problems, and carry out repairs after the data has been stored.

However, existing data processing methods have the following drawbacks:

High cost of post-processing. Data cleaning is performed only after data problems have occurred, so it cannot fundamentally prevent new quality problems from arising; over long-term operation the cleaning cost accumulates and the effect is limited.

Opaque process and one-sided assessment. The prior art lacks modeling and analysis of how data flows through an information system and cannot identify how data quality problems propagate and accumulate among the operation nodes of the information system, so the quality of each data operation link in the system remains a black box.

Blind resource allocation. Because the influence of each link on the final data quality cannot be quantified, managers lack a target when investing resources to improve data operation behavior and find it difficult to formulate a well-founded resource investment strategy, which leads to wasted resources or insufficient control.

Therefore, there is a need for a method that makes the process transparent, assesses it accurately, and reduces data processing cost through source control, so as to solve the above technical problems.

Disclosure of Invention

The invention provides the following technical solution: a process-based information system data quality assessment method comprising the following steps:

Step 1, drawing an information system data flow chart: sorting out the organization and personnel structure, business processes, and inputs and outputs related to the information system, and drawing the information system data flow chart.

Step 2, constructing a DQ-Petri net: constructing the DQ-Petri net on the basis of the Petri net according to the characteristics of data quality loss transfer in the information system, so as to describe the process of data quality loss transfer in the information system and evaluate the data quality loss caused by each data operation node.

Step 3, constructing a data quality assessment model: constructing the data quality assessment model by defining the data quality dimensions of the information system and calculating the data quality loss caused by data operations.

Step 4, calculating the data quality loss caused by data transmission: splitting the data flow of the information system into a plurality of sub-networks connected in series or in parallel, and calculating the accumulated data quality loss through the serial structure, the 1-n parallel structure, and the n-1 parallel structure of the decomposed sub-networks.

Step 5, assessing the data quality of the information system: obtaining the total data quality loss of the information system by calculating the influence of the data quality loss caused by each data operation node on the data quality of the information system.

Preferably, in step 1, the tools used for drawing the information system data flow chart include Microsoft Visio and Draw.
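
Step 4 relies on recognizing serial, 1-n parallel, and n-1 parallel structures in the data flow. As a rough illustration only, the following Python sketch labels each node of a small directed graph by its in-degree and out-degree. The degree-based rule, the function name, and the node names are assumptions made for this example; the text above does not specify how the decomposition is carried out.

```python
from collections import defaultdict


def classify_structures(arcs):
    """Label each node by the local structure of its arcs.

    Assumed rule (not taken from the patent): a node with more than one
    outgoing arc starts a "1-n" parallel structure, a node with more than
    one incoming arc ends an "n-1" parallel structure, and any other node
    is treated as part of a serial structure.
    """
    out_deg = defaultdict(int)
    in_deg = defaultdict(int)
    nodes = set()
    for src, dst in arcs:
        out_deg[src] += 1
        in_deg[dst] += 1
        nodes.update((src, dst))

    labels = {}
    for node in sorted(nodes):
        kinds = []
        if out_deg[node] > 1:
            kinds.append("1-n")  # one node feeds several downstream nodes
        if in_deg[node] > 1:
            kinds.append("n-1")  # several upstream nodes feed one node
        labels[node] = kinds or ["serial"]
    return labels


# Hypothetical data flow: A splits into B and C, which merge into D, then E.
arcs = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
labels = classify_structures(arcs)
print(labels)
```

A node can carry both labels when it merges several branches and also splits into several branches, which is why each label is a list.
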
Preferably, in step 2, the DQ-Petri net is a five-tuple consisting of: a set representing the data operation nodes in the information system; a finite non-empty set of transitions, used to describe the process of data quality loss transfer; a set of directed arcs representing the direction of data flow; the data origin; and a data quality loss function L.

More preferably, in step 2, each quality loss transition consists of two parameters: the first parameter represents the data quality loss in a given dimension caused by a given data operation node, and the second parameter represents the ratio of the amount of data that one node transmits to another node per unit time to the amount of data that the receiving node receives per unit time (referred to simply as the "data transfer ratio per unit time").

Preferably, in step 3, the information system data quality loss is given by formula (4), whose symbols denote the ideal relation, the true relation,