CN-122019524-A - Data quality evaluation and restoration method based on machine learning
Abstract
The invention discloses a machine learning based data quality evaluation and repair method and system. The method first preprocesses the data and constructs a structured generator network containing a parameterized adjacency matrix. In the training stage, the network outputs a preliminary reconstruction of the data; manifold constraint projection is used to find a logic correction target matrix satisfying the domain constraints; a potential consistency loop is constructed through gradient blocking; and a consistency loss is calculated to drive the network toward the compliance manifold. Meanwhile, an overall objective integrating the reconstruction error, logical consistency, and acyclicity constraint is constructed, and the parameters are optimized using an augmented Lagrangian method and a two-level loop strategy. Once a dual convergence criterion is satisfied, final inference is performed based on mask synthesis. By internalizing the logic rules into the generative capability and simultaneously mining a sparse causal structure, the invention achieves high-quality data repair with logical compliance, distributional consistency, and interpretability.
Inventors
- WANG XUEFEI
- LI NA
Assignees
- ANHUI QINGNANG TECHNOLOGY CO., LTD. (安徽青囊科技有限公司)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-02-03
Claims (10)
- 1. A machine learning based data quality assessment and repair method, the method comprising: preprocessing an original input matrix containing missing values and observed values to generate an initialized input matrix and a corresponding binary mask matrix; constructing a structured generator network comprising a parameterized adjacency matrix for characterizing causal relationships between variables; executing a model training loop, and performing the following steps in each iteration of the training loop: performing forward propagation on the initialized input matrix using the structured generator network and outputting a preliminary reconstructed data matrix; performing manifold constraint projection, namely, taking the preliminary reconstructed data matrix as an initial value and, with the parameters of the structured generator network held fixed, searching through iterative optimization for a logic correction target matrix satisfying the domain constraint set; constructing a potential consistency learning loop and calculating a potential consistency loss, wherein the potential consistency loss measures the consistency between the preliminary reconstructed data matrix and the logic correction target matrix; constructing an augmented Lagrangian overall training objective integrating a reconstruction error penalty, an overall logical consistency penalty, the potential consistency penalty, and an acyclicity constraint penalty term for the parameterized adjacency matrix; updating the weight parameters of the structured generator network and the parameterized adjacency matrix based on the overall training objective; and, when a preset convergence condition is met, performing final inference on the data to be processed based on the trained structured generator network and outputting the repaired complete data.
- 2. The machine learning based data quality assessment and restoration method according to claim 1, wherein the structured generator network, when performing forward propagation, first performs linear transformation and feature aggregation on the initialized input matrix using the parameterized adjacency matrix to simulate direct causal effects between variables; the parameterized adjacency matrix is a learnable parameter matrix, jointly optimized with the weight parameters of the structured generator network, for explicitly modeling the directed weighted connections between data feature dimensions.
- 3. The machine learning based data quality assessment and restoration method according to claim 1, wherein the manifold constraint projection step specifically comprises: constructing a constrained optimization problem for the inference stage, whose objective is to minimize the Euclidean distance to the preliminary reconstructed data matrix subject to the set of domain constraints; calculating the gradient of the logic loss with respect to the current data matrix, using the gradient to update only the values at the missing positions, and meanwhile performing a Hadamard product with the complement of the binary mask matrix to forcibly reset the values at the observed positions so that they remain the original observed data; and, after the gradient update, performing an analytic projection and truncation operation to ensure that the updated values fall within a predefined numerical range, thereby obtaining the logic correction target matrix.
- 4. The machine learning based data quality assessment and restoration method according to claim 1, wherein said constructing a potential consistency learning loop specifically comprises: performing a gradient blocking operation on the logic correction target matrix, setting it as a fixed constant tensor that does not participate in gradient back propagation; calculating the mean square error between the preliminary reconstructed data matrix directly output by the structured generator network and the gradient-blocked logic correction target matrix, and taking the mean square error as the potential consistency loss; and, by minimizing the potential consistency loss, driving the structured generator network to directly approach, in a single forward propagation, the manifold space satisfying the domain constraint set, thereby converting the non-differentiable logic constraint solving process into a differentiable supervised learning process.
- 5. The machine learning based data quality assessment and restoration method according to claim 1, wherein the construction of the augmented Lagrangian overall training objective, specifically using the augmented Lagrangian multiplier method, comprises: a task master loss term consisting of a weighted sum of the reconstruction error loss, the overall logical consistency loss, and the potential consistency loss; a sparse regularization term, namely the L1 norm of the parameterized adjacency matrix, used to induce a sparse causal graph structure; and an acyclicity constraint penalty term consisting of a Lagrangian multiplier term and a quadratic penalty term, used to convert the equality constraint of the directed acyclic graph into an unconstrained optimization objective; by minimizing the overall training objective, the method simultaneously searches for a solution with minimal data reconstruction error, minimal logic rule violation, and a graph structure satisfying the directed acyclic graph property.
- 6. The machine learning based data quality assessment and restoration method according to claim 5, wherein updating the weight parameters of the structured generator network and the parameterized adjacency matrix based on the overall training objective involves executing a two-level loop optimization strategy: in the inner loop, the Lagrangian multiplier and the quadratic penalty parameter are held fixed, and iterative updates are performed by an optimizer according to the gradient of the overall training objective with respect to the network parameters; in the outer loop, the decrease of the acyclicity constraint function value is evaluated, namely, if the decrease of the acyclicity constraint function value does not reach a preset threshold, the quadratic penalty parameter is increased to strengthen the penalty on the acyclicity constraint, and if the acyclicity constraint function value has decreased significantly, the quadratic penalty parameter is kept unchanged and the Lagrangian multiplier is updated.
- 7. The machine learning based data quality assessment and restoration method according to claim 1, wherein satisfying the preset convergence condition specifically comprises performing a dual-criterion convergence determination: the first criterion is structural constraint convergence, namely judging whether the acyclicity constraint function value calculated from the parameterized adjacency matrix is below a preset numerical tolerance threshold; the second criterion is task loss convergence, namely judging whether the relative change rate of the task master loss term on a validation data set is below a preset convergence threshold; if and only if both criteria are satisfied simultaneously is model training deemed complete, and the current parameterized adjacency matrix and structured generator network parameters are saved.
- 8. The machine learning based data quality assessment and restoration method according to claim 1, wherein, before the trained structured generator network performs final inference on the data to be processed, the method further comprises: performing a hard threshold truncation operation on the trained parameterized adjacency matrix, setting elements whose absolute value is smaller than a preset edge decision threshold to zero, and constructing a final binarized adjacency matrix; and outputting the binarized adjacency matrix as the mined causal structure among the feature variables.
- 9. The machine learning based data quality assessment and restoration method according to claim 1, wherein performing final inference on the data to be processed based on the trained structured generator network specifically comprises: switching the structured generator network to inference mode and freezing its parameters; initializing and filling the missing positions of the data to be processed, and performing forward propagation through the network to obtain a reconstructed query matrix; executing a data synthesis operation, namely fusing the original observed values with the filled values in the network-generated reconstructed query matrix using the mask matrix corresponding to the data to be processed, to ensure that the original observed records are not modified; and performing numerical range clipping and discretization on the fused data, and outputting the final compliant data imputation result.
- 10. A machine learning based data quality assessment and repair system, the system comprising: one or more processors; a memory for storing one or more programs; and the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1 to 9.
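The preprocessing step of claim 1 can be illustrated with a minimal numpy sketch. The specific initialization rule is not fixed by the claims; column-mean filling is an assumed example, and the function name `preprocess` is hypothetical.

```python
import numpy as np

def preprocess(x_raw):
    """Build an initialized input matrix and binary mask from raw data.

    Assumed convention: mask is 1 at observed entries, 0 at missing
    (NaN) entries; missing values are initialized with column means.
    """
    mask = (~np.isnan(x_raw)).astype(float)          # binary mask matrix
    col_means = np.nanmean(x_raw, axis=0)            # per-feature mean
    x_init = np.where(mask == 1, x_raw, col_means)   # fill missing entries
    return x_init, mask

x = np.array([[1.0, np.nan], [3.0, 4.0]])
x_init, m = preprocess(x)
```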
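One plausible reading of the forward propagation of claim 2 is that the input is aggregated along the weighted adjacency matrix before a small nonlinear decoder; the layer sizes and the tanh decoder below are assumptions, not specified by the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, h = 4, 8, 16                      # features, samples, hidden width
A = rng.normal(scale=0.1, size=(d, d))  # parameterized adjacency (learnable)
W1 = rng.normal(size=(d, h))
W2 = rng.normal(size=(h, d))

def forward(x_init, A, W1, W2):
    """Aggregate features along the weighted adjacency, then decode:
    X @ A simulates direct causal effects between variables."""
    agg = x_init @ A            # linear transformation + feature aggregation
    hid = np.tanh(agg @ W1)     # assumed nonlinear hidden layer
    return hid @ W2             # preliminary reconstructed data matrix

x_init = rng.normal(size=(n, d))
x_hat = forward(x_init, A, W1, W2)
```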
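The manifold constraint projection of claim 3 can be sketched as projected gradient descent over the missing entries only. The domain rule used here (feature 0 must not exceed feature 1) is purely an assumed example of a "field constraint"; `(1 - mask)` is the complement that selects missing positions, and observed entries are reset each step.

```python
import numpy as np

def project_to_constraints(x_hat, x_obs, mask, lo, hi, lr=0.1, steps=50):
    """Manifold-constraint projection sketch (generator weights frozen).

    Assumed logic rule for illustration: column 0 <= column 1, penalized
    by the hinge loss 0.5 * max(z0 - z1, 0)^2.
    """
    z = x_hat.copy()
    for _ in range(steps):
        viol = np.maximum(z[:, 0] - z[:, 1], 0.0)   # rule violation amount
        g = np.zeros_like(z)
        g[:, 0] = viol                               # d(loss)/dz0
        g[:, 1] = -viol                              # d(loss)/dz1
        z = z - lr * g * (1.0 - mask)                # update missing entries only
        z = np.where(mask == 1, x_obs, z)            # reset observed entries
        z = np.clip(z, lo, hi)                       # analytic truncation
    return z
```

A quick check: with an observed value 5.0 in column 0 and a missing column-1 entry initialized at 2.0, the projection pushes the missing value up toward 5.0 while leaving the observation untouched.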
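The gradient blocking and consistency loss of claim 4 would typically be `detach()`/`stop_gradient` in an autograd framework; this numpy sketch mimics the blocking by treating the projection target as a fixed array, so the analytic gradient flows only through the network output.

```python
import numpy as np

def latent_consistency_loss(x_hat, z_target):
    """MSE between the network output and the gradient-blocked
    projection target; returns the loss and its gradient w.r.t. x_hat
    (z_target is a constant, so it contributes no gradient)."""
    z_const = np.array(z_target, copy=True)          # gradient-blocked target
    loss = np.mean((x_hat - z_const) ** 2)           # potential consistency loss
    grad_x_hat = 2.0 * (x_hat - z_const) / x_hat.size
    return loss, grad_x_hat
```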
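The acyclicity penalty of claim 5 is not given in closed form by the claims; a common choice for such a constraint (an assumption here) is the NOTEARS-style function h(A) = tr(exp(A∘A)) − d, which is zero exactly when the weighted graph is a directed acyclic graph. The matrix exponential is approximated by a truncated power series to keep the sketch dependency-free.

```python
import numpy as np

def acyclicity(A, terms=20):
    """h(A) = tr(exp(A * A)) - d via a truncated exponential series."""
    d = A.shape[0]
    B = A * A                       # Hadamard square
    term = np.eye(d)
    trace = np.trace(term)          # k = 0 term of the series
    for k in range(1, terms):
        term = term @ B / k         # B^k / k!
        trace += np.trace(term)
    return trace - d

def augmented_lagrangian(task_loss, A, alpha, rho, lam_sparse):
    """Overall objective: task master loss + L1 sparsity on A
    + multiplier term + quadratic penalty on the acyclicity value."""
    h = acyclicity(A)
    return (task_loss + lam_sparse * np.abs(A).sum()
            + alpha * h + 0.5 * rho * h * h)
```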
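The outer loop of the two-level strategy in claim 6 follows the standard augmented Lagrangian schedule; the decrease ratio `gamma` and amplification factor `beta` below are assumed hyperparameters, not values from the patent.

```python
def outer_update(h_new, h_old, alpha, rho, gamma=0.25, beta=10.0):
    """Outer-loop rule (sketch): if h(A) did not drop to gamma * h_old,
    amplify the quadratic penalty rho; otherwise keep rho and apply the
    classic multiplier update alpha <- alpha + rho * h_new."""
    if h_new > gamma * h_old:
        rho = beta * rho             # strengthen the acyclicity penalty
    else:
        alpha = alpha + rho * h_new  # Lagrange multiplier update
    return alpha, rho
```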
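The dual-criterion convergence determination of claim 7 can be written as a single predicate; the tolerance values are assumed placeholders, since the patent only says "preset thresholds".

```python
def converged(h_val, loss_prev, loss_curr, h_tol=1e-8, rel_tol=1e-4):
    """Dual criteria: (1) structural convergence, h(A) below tolerance;
    (2) task-loss convergence, relative change of the validation task
    master loss below a threshold. Both must hold simultaneously."""
    structural_ok = h_val < h_tol
    rel_change = abs(loss_curr - loss_prev) / max(abs(loss_prev), 1e-12)
    return structural_ok and rel_change < rel_tol
```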
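The hard threshold truncation of claim 8 is a one-liner; the edge-decision threshold 0.3 is an assumed example value.

```python
import numpy as np

def binarize_adjacency(A, edge_threshold=0.3):
    """Zero out entries with magnitude below the edge-decision
    threshold and binarize the rest, yielding the mined causal graph."""
    return (np.abs(A) >= edge_threshold).astype(int)
```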
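The final inference synthesis of claim 9 fuses observations with network fills via the mask, then applies range clipping and discretization. The `int_cols` argument is a hypothetical way of marking discrete features; the patent does not specify how discretization targets are identified.

```python
import numpy as np

def impute(x_query, x_raw, mask, lo, hi, int_cols=()):
    """Mask synthesis: keep original observations where mask == 1,
    take network fills elsewhere, clip to the valid range, and round
    assumed integer-valued columns."""
    fused = np.where(mask == 1, x_raw, x_query)    # observations untouched
    fused = np.clip(fused, lo, hi)                 # numerical range clipping
    for c in int_cols:
        fused[:, c] = np.round(fused[:, c])        # discretization
    return fused
```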
Description
Data quality evaluation and restoration method based on machine learning

Technical Field

The invention relates to the technical field of data processing and artificial intelligence, in particular to a data quality assessment and repair method based on machine learning.

Background

In big data application scenarios such as industrial manufacturing, financial risk control, and healthcare, data integrity is the basis for guaranteeing the accuracy of subsequent modeling, analysis, and decision making. However, owing to factors such as sensor hardware faults, network packet loss, privacy protection policies, and manual entry omissions, missing values of varying degrees are common in raw data sets. Currently, the industry's means of handling missing data rely mainly on data imputation techniques, including statistics-based mean and median filling, and machine learning imputation methods based on multiple imputation by chained equations, generative adversarial networks, or variational autoencoders. These techniques typically train models on the existing observed data to predict and fill in missing values by capturing statistical correlations between data features, in an attempt to restore the integrity and usability of the data set. Although existing deep generative imputation models have made some progress in reducing data reconstruction error, they still have substantial limitations in processing high-dimensional data containing complex dependencies. The prior art mostly adopts a black-box fitting strategy, focusing only on making the filled values close to the observed data in probability distribution, while usually ignoring the inherent causal mechanisms between feature variables and the strict logical constraints that must be followed in specific application fields.
The omission of causal structure and business rules makes it very easy to generate invalid data that fit well statistically but actually violate physical laws or business logic; moreover, lacking explicit modeling of the data generation process, existing imputation algorithms can hardly provide an intuitive interpretation of the dependencies among variables, and thus cannot meet the strict requirements of high-reliability scenarios for logical compliance and interpretability of data repair results.

Disclosure of Invention

A first aspect of the invention provides a data quality assessment and repair method based on machine learning. The method is mainly used to fill missing values in multidimensional data while exploring the latent causal structure among data variables and ensuring that the repaired data conform to the logical constraints of the specific field. The method comprises preprocessing an original input matrix containing missing values and observed values to generate an initialized input matrix and a corresponding binary mask matrix. On this basis, a structured generator network is constructed, which contains a parameterized adjacency matrix for characterizing causal relationships between variables. During model training, the computer system performs a loop iteration that includes inner-layer optimization. In each iteration, forward propagation is performed on the initialized input matrix using the structured generator network, outputting a preliminary reconstructed data matrix. A manifold constraint projection step is then executed, which takes the preliminary reconstructed data matrix as an initial value and, with the parameters of the structured generator network held fixed, searches through iterative optimization for a logic correction target matrix satisfying the domain constraint set.
The logic correction target matrix numerically satisfies the predefined logic rules and is close to the preliminary reconstructed data matrix in Euclidean space. To internalize the logical constraints into the network's generative capability, a potential consistency learning loop is constructed and a potential consistency loss is calculated. This loss measures the consistency between the preliminary reconstructed data matrix and the logic correction target matrix. In this process, a gradient blocking operation is performed on the logic correction target matrix, which is set to a fixed constant tensor that does not participate in gradient back propagation. By minimizing the potential consistency loss, the structured generator network directly approaches, in a single forward propagation, the manifold space satisfying the domain constraint set, thereby converting the non-differentiable logic constraint solving process into a differentiable distance approximation problem. Further, to jointly optimize data reconstruction, logical compliance, and graph structure sparsity, an augmented Lagrangian overall tr