CN-122019521-A - Multi-source data intelligent management method for fact error correction


Abstract

The invention discloses a multi-source data intelligent management method for fact error correction, comprising the following steps: S1, intelligent conflict resolution of multi-source data; S2, dynamic cleaning of redundant data; S3, real-time data quality monitoring and early warning, including construction of a multi-dimensional data quality assessment index system covering accuracy, completeness, and timeliness. Steps S1, S2, and S3 form a collaborative closed loop: the cleaning result of step S2 is fed back to the weight parameters of the credibility assessment model of step S1, and the abnormal data information of step S3 is synchronized to step S2 to update the screening rules of the redundancy map. By quantifying the authority and data consistency of data sources, the technical scheme replaces manual screening of conflicting data with a machine learning model, avoiding the influence of subjective bias, achieving automatic credibility ranking of conflicting data, reducing manual intervention, and improving the objectivity and efficiency of selecting the fact verification benchmark.

Inventors

  • ZHANG YONGTING
  • SANG JIAWEI
  • WU HAITAO
  • WU XIANG

Assignees

  • Xuzhou Medical University (徐州医科大学)

Dates

Publication Date
2026-05-12
Application Date
2026-01-23

Claims (8)

  1. A multi-source data intelligent management method for fact error correction, characterized by comprising the following steps:
     S1, intelligent conflict resolution of multi-source data: accessing at least 2 types of heterogeneous data sources, constructing a multi-dimensional data credibility assessment model based on machine learning, performing credibility ranking on data with fact conflicts, and screening data of high authority and high consistency as the fact verification benchmark;
     S2, dynamic cleaning of redundant data: performing text clustering on the benchmark data screened in step S1 based on a semantic fingerprint algorithm to generate a data redundancy map, retaining the optimal data version according to the authority priority of the data sources and the integrity rules of the data fields, and deleting redundant copies; adopting an incremental cleaning mechanism whereby, when new data is accessed, only its semantic fingerprint is compared against those of the existing data, avoiding repeated processing of the full data set;
     S3, real-time data quality monitoring and early warning: constructing a multi-dimensional data quality assessment index system covering accuracy, integrity, and timeliness, automatically triggering an early warning when a real-time crawler detects abnormal data, and starting a standby data source switching mechanism to ensure that the fact verification module continuously obtains available data;
     wherein steps S1, S2, and S3 form a collaborative cleaning closed loop: the cleaning result of step S2 is fed back to the weight parameters of the credibility assessment model of step S1, and the abnormal data information of step S3 is synchronized to step S2 to update the screening rules of the redundancy map.
  2. The multi-source data intelligent management method for fact error correction according to claim 1, characterized in that in step S1, the multi-dimensional data credibility assessment model quantifies the authority index of each data source through a logistic regression algorithm and calculates the consistency index of each data source through a Transformer model; the authority index comprises the qualification grade of the institution to which the data source belongs, the data update frequency, and the verification accuracy of historical data; the consistency index is the degree of factual agreement between the target data and 3 or more other authoritative data sources, with an agreement of 85% or above judged as high consistency.
  3. The multi-source data intelligent management method for fact error correction according to claim 1, wherein the processing logic for fact conflicts in step S1 comprises: when the same fact differs across data sources of different authority, the credibility assessment model assigns authority weights to the respective data sources and calculates a comprehensive credibility score in combination with the consistency index; when the score difference is 0.3 or greater, the data from the higher-authority data source is preferentially selected as the verification benchmark.
  4. The multi-source data intelligent management method for fact error correction according to claim 1, wherein the semantic fingerprint algorithm in step S2 is implemented by: performing word segmentation on the target data and extracting core keywords; generating semantic vectors for the keywords based on a BERT model; compressing the semantic vectors into 64-bit or 128-bit fingerprint sequences through the SimHash algorithm; and calculating the Hamming distance between the fingerprint sequences of different data items, judging a distance of no more than 3 as highly similar data and incorporating such data into the redundancy map.
  5. The multi-source data intelligent management method for fact error correction according to claim 1, wherein the incremental cleaning mechanism in step S2 comprises the following specific steps: when new data is accessed, executing the semantic fingerprint algorithm and querying the existing data fingerprint database in a Redis cache for comparison; if a fingerprint sequence with a Hamming distance of no more than 3 exists, triggering the redundancy judgment and applying the authority and integrity screening to the data judged redundant; non-redundant data enters the fact checking module directly.
  6. The multi-source data intelligent management method for fact error correction according to claim 1, wherein the quantization criteria of the multi-dimensional data quality assessment index system in step S3 comprise: accuracy, the error rate of data fields against the authoritative reference is no higher than 5%; integrity, the loss rate of core key fields is no higher than 2%; timeliness, the update delay of dynamic data is no higher than 2 hours and the update delay of static data is no higher than 30 days.
  7. The multi-source data intelligent management method for fact error correction according to claim 1, wherein the standby data source switching mechanism in step S3 comprises: constructing a standby data source library in advance, with priority set as same-institution backup first and same-field authoritative substitute second; when the primary data source becomes unavailable, is tampered with, or times out without updating, the system automatically calls the standby data source of highest priority, with a switching response time of 10 seconds or less; after switching is completed, an early warning notice is sent to the administrator and the data source weights of the credibility assessment model are updated.
  8. The multi-source data intelligent management method for fact error correction according to claim 1, wherein abnormal data cases fed back in step S3 are collected quarterly and the multi-dimensional data credibility assessment model is updated through incremental training: weight coefficients are updated for the logistic regression model, and the parameters of the top network layers of the Transformer model are fine-tuned.
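Claims 4 and 5 describe generating 64-bit SimHash fingerprints from keyword features and judging two data items redundant when the Hamming distance between their fingerprints is no more than 3. The following is a minimal illustrative sketch, not the patented implementation: MD5 token hashing stands in for the BERT-derived semantic vectors of claim 4, and all function names are assumptions.

```python
import hashlib

def simhash64(tokens):
    """Compute a 64-bit SimHash fingerprint from a list of keyword tokens."""
    v = [0] * 64
    for tok in tokens:
        # Hash each token to 64 bits (MD5 truncated; the patent derives token
        # features from a BERT model, which is substituted here for brevity).
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest()[:16], 16)
        for i in range(64):
            v[i] += 1 if (h >> i) & 1 else -1
    fp = 0
    for i in range(64):
        if v[i] > 0:
            fp |= 1 << i
    return fp

def hamming(a, b):
    """Hamming distance between two fingerprint integers."""
    return bin(a ^ b).count("1")

def is_redundant(fp_new, fp_existing, threshold=3):
    # Claim 4: a Hamming distance of no more than 3 marks highly
    # similar data, which is incorporated into the redundancy map.
    return hamming(fp_new, fp_existing) <= threshold
```

In the incremental mechanism of claim 5, only `fp_new` would be computed on access; the existing fingerprints would be read from a cache (the patent names Redis) rather than recomputed over the full data set.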

Description

Multi-source data intelligent management method for fact error correction

Technical Field

The invention relates to the technical field of data governance and fact verification, and in particular to a multi-source data intelligent management method for fact error correction.

Background

With the popularization of big data technology, fact error correction systems are increasingly widely applied in fields such as academic paper verification and policy interpretation verification, and their core depends on cross-verification of multi-source data. However, existing multi-source data governance techniques have significant pain points:

Conflict resolution relies on manual work, which is inefficient and prone to bias. Entity alignment is performed only through a knowledge graph; the authority and data consistency of data sources are not quantified, so conflicting data must be screened manually, which is easily influenced by subjective judgment and is inefficient.

Redundancy recognition precision is insufficient and the processing flow is wasteful. Deduplication relies on pure semantic recognition without combining data source authority to screen the optimal version; traditional hash matching cannot recognize redundant data whose textual expression differs greatly but whose semantics are similar; and the full data set must be reprocessed repeatedly, consuming a large amount of computing power.

Data quality monitoring is incomplete and a fault-tolerance mechanism is lacking. No multi-dimensional assessment system covering accuracy, integrity, and timeliness has been constructed, and when the primary data source is abnormal and no standby data source switching mechanism exists, the fact checking system is interrupted.

No closed-loop optimization is formed, making it difficult to adapt to data changes. Existing methods mostly focus on single-link treatment; collaborative closed loops of conflict resolution, redundancy cleaning, and real-time monitoring are not realized, so the governance effect cannot be dynamically optimized based on data feedback.

Aiming at these defects, a multi-module collaborative multi-source data intelligent management method is needed to improve the accuracy of fact error correction and the stability of the system through quantitative assessment, precise semantic recognition, and a real-time fault-tolerance mechanism.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a multi-source data intelligent management method for fact error correction that solves one or more of the problems in the prior art. To solve these technical problems, the invention adopts a technical scheme comprising the following steps:

S1, intelligent conflict resolution of multi-source data: accessing at least 2 types of heterogeneous data sources, constructing a multi-dimensional data credibility assessment model based on machine learning, performing credibility ranking on data with fact conflicts, and screening data of high authority and high consistency as the fact verification benchmark.

S2, dynamic cleaning of redundant data: performing text clustering on the benchmark data screened in step S1 based on a semantic fingerprint algorithm to generate a data redundancy map; retaining the optimal data version according to the authority priority of the data sources and the integrity rules of the data fields, and deleting redundant copies; adopting an incremental cleaning mechanism whereby, when new data is accessed, only its semantic fingerprint is compared against those of the existing data, avoiding repeated processing of the full data set.

S3, real-time data quality monitoring and early warning: constructing a multi-dimensional data quality assessment index system covering accuracy, integrity, and timeliness; automatically triggering an early warning when a real-time crawler detects abnormal data, and starting a standby data source switching mechanism to ensure that the fact verification module continuously obtains available data.

Steps S1, S2, and S3 form a collaborative cleaning closed loop: the cleaning result of step S2 is fed back to the weight parameters of the credibility assessment model of step S1, and the abnormal data information of step S3 is synchronized to step S2 to update the screening rules of the redundancy map.

In some embodiments, in step S1, the multi-dimensional data credibility assessment model quantifies the authority index of each data source through a logistic regression algorithm and calculates the consistency index of each data source through a Transformer model, wherein the authority index includes the qualification grade of the institution to which the data source belongs, the data update frequency, and the verification accuracy of historical data, and the consistency index is the fact coincidence
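The conflict-resolution logic of step S1 can be sketched in a few lines. Note this is a hypothetical illustration: the composite scoring formula, the field names, and the 0.6/0.4 blend of authority and consistency are assumptions, while the 0.3 score-difference threshold comes from claim 3.

```python
def credibility_score(authority, consistency, alpha=0.6):
    """Composite credibility score: a weighted blend of the authority index
    and the consistency index (the blend ratio alpha is an assumption)."""
    return alpha * authority + (1 - alpha) * consistency

def select_benchmark(candidates, margin=0.3):
    """Rank conflicting records by composite score and pick the top one as
    the fact verification benchmark when it leads the runner-up by at
    least `margin` (the 0.3 threshold of claim 3); otherwise return None
    to signal that no clear benchmark exists."""
    ranked = sorted(
        candidates,
        key=lambda c: credibility_score(c["authority"], c["consistency"]),
        reverse=True,
    )
    if len(ranked) == 1:
        return ranked[0]
    best, second = ranked[0], ranked[1]
    gap = (credibility_score(best["authority"], best["consistency"])
           - credibility_score(second["authority"], second["consistency"]))
    return best if gap >= margin else None
```

In the patent's scheme, the authority index would come from the logistic regression model and the consistency index from the Transformer model; here both are taken as precomputed numbers in [0, 1].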