CN-121996644-A - Cross-database data consistency verification method, device, equipment and storage medium
Abstract
The invention relates to the field of data processing and discloses a cross-database data consistency verification method, device, equipment and storage medium. The method comprises: obtaining a verification task and extracting from it a task identifier, connection information for two data sources holding the same service index, a derivative value calculation rule, a static consistency threshold and a consistency judgment strategy; obtaining original data from both data sources based on the connection information and converting the original data into data vectors; calculating a derivative value for each data source according to the calculation rule and calculating the absolute difference between the two derivative values; if the absolute difference exceeds the static consistency threshold, combining the historical statistical information of the task, the attributes of the calculation rule, the data vectors and the two derivative values, and obtaining, through a pre-trained probabilistic evaluation model, a probability value for the risk of business logic inconsistency together with a root cause classification; generating a final result according to the probability value and the judgment strategy; and integrating the relevant information to generate a verification report.
Inventors
- LI BOWEN
- LIU SIYA
- ZHU YUNYUN
Assignees
- 上海乾臻信息科技有限公司
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-01-26
Claims (10)
- 1. A cross-database data consistency verification method, characterized by comprising the following steps: acquiring a verification task, and obtaining, based on the verification task, a task identifier, connection information of a first data source and a second data source corresponding to the same service index, a derivative value calculation rule, a static consistency threshold and a consistency judgment strategy; based on the connection information, acquiring from the first data source and the second data source, respectively, first original data and second original data required for calculating derivative values, and converting the first original data and the second original data into a first data vector and a second data vector; based on the derivative value calculation rule, calculating a first derivative value from the first original data and a second derivative value from the second original data, and calculating an absolute difference value between the first derivative value and the second derivative value; comparing the absolute difference value with the static consistency threshold and, if the absolute difference value exceeds the static consistency threshold, acquiring historical statistical information of the verification task based on the task identifier, extracting attributes of the derivative value calculation rule, and using a pre-trained probabilistic evaluation model to evaluate, based on the first derivative value, the second derivative value, the first data vector, the second data vector, the historical statistical information and the attributes of the derivative value calculation rule, a probability value representing the risk of inconsistency at the business logic level and a root cause classification; and generating a final judgment result based on the probability value and the consistency judgment strategy, and generating a verification report based on the probability value, the root cause classification and the final judgment result.
- 2. The cross-database data consistency verification method according to claim 1, wherein the acquiring a verification task, and obtaining, based on the verification task, a task identifier, connection information of a first data source and a second data source corresponding to the same service index, a derivative value calculation rule, a static consistency threshold and a consistency judgment strategy, comprises: acquiring and parsing the verification task to obtain the task identifier; based on the task identifier, retrieving and reading a corresponding structured task configuration file from a preset configuration management center; and parsing the structured task configuration file to obtain a plurality of fields, and extracting from the plurality of fields the connection information of the first data source and the second data source corresponding to the same service index, the derivative value calculation rule, the static consistency threshold and the consistency judgment strategy.
- 3. The cross-database data consistency verification method according to claim 1, wherein the acquiring, based on the connection information, the first original data and the second original data required for calculating derivative values from the first data source and the second data source, respectively, and converting the first original data and the second original data into the first data vector and the second data vector, comprises: generating, based on the connection information, query requests for the first data source and the second data source to acquire the first original data and the second original data required for calculating the derivative values; performing data cleaning on the first original data and the second original data to obtain first cleaned data and second cleaned data; and converting the first cleaned data and the second cleaned data into a first data vector and a second data vector of the same dimension, respectively, by using a feature engineering method.
- 4. The cross-database data consistency verification method according to claim 1, wherein the calculating, based on the derivative value calculation rule, a first derivative value from the first original data and a second derivative value from the second original data, and calculating an absolute difference value between the first derivative value and the second derivative value, comprises: parsing the derivative value calculation rule into a calculation logic tree, and compiling the calculation logic tree into task code that can run in a calculation engine; processing the first original data and the second original data with the calculation engine to obtain the first derivative value and the second derivative value; and calculating the absolute difference value between the first derivative value and the second derivative value.
- 5. The cross-database data consistency verification method according to claim 1, wherein the comparing the absolute difference value with the static consistency threshold and, if the absolute difference value exceeds the static consistency threshold, acquiring historical statistical information of the verification task based on the task identifier, extracting attributes of the derivative value calculation rule, and using a pre-trained probabilistic evaluation model to evaluate, based on the first derivative value, the second derivative value, the first data vector, the second data vector, the historical statistical information and the attributes of the derivative value calculation rule, a probability value representing the risk of inconsistency at the business logic level and a root cause classification, comprises: comparing the absolute difference value with the static consistency threshold and, if the absolute difference value exceeds the static consistency threshold, querying, from a preset historical knowledge base and based on the task identifier, historical statistical information of the verification task within a preset historical period, and extracting attributes of the derivative value calculation rule, wherein the historical statistical information comprises the average difference, the difference variance and historical inconsistency cases, and the attributes of the derivative value calculation rule comprise a rule complexity score and the types and number of operators involved; constructing a comprehensive vector based on the first derivative value, the second derivative value, the first data vector, the second data vector, the historical statistical information and the attributes of the derivative value calculation rule; and inputting the comprehensive vector into the pre-trained probabilistic evaluation model, and obtaining the probability value representing the risk of inconsistency at the business logic level and the root cause classification output by the pre-trained probabilistic evaluation model.
- 6. The cross-database data consistency verification method according to claim 5, wherein the inputting the comprehensive vector into the pre-trained probabilistic evaluation model, and obtaining the probability value representing the risk of inconsistency at the business logic level and the root cause classification output by the pre-trained probabilistic evaluation model, comprises: constructing and training the probabilistic evaluation model based on a lightweight gradient boosting tree algorithm; standardizing and normalizing the comprehensive vector to obtain a preprocessed vector; and inputting the preprocessed vector into the pre-trained probabilistic evaluation model, and obtaining the probability value representing the risk of inconsistency at the business logic level and the root cause classification output by the probabilistic evaluation model.
- 7. The cross-database data consistency verification method according to claim 1, wherein the generating a final judgment result based on the probability value and the consistency judgment strategy, and generating a verification report based on the probability value, the root cause classification and the final judgment result, comprises: matching the probability value against the risk level thresholds in the consistency judgment strategy, and determining that the final judgment result is one of consistent, inconsistent and pending; generating the verification report by combining the probability value, the root cause classification and the final judgment result, wherein the verification report comprises a difference summary, a risk level, an inferred root cause and a handling suggestion; and storing the verification report in a structured format, and triggering an alarm according to a preset alarm rule.
- 8. A cross-database data consistency verification device, comprising: an analysis module, used for acquiring a verification task, and obtaining, based on the verification task, a task identifier, connection information of a first data source and a second data source corresponding to the same service index, a derivative value calculation rule, a static consistency threshold and a consistency judgment strategy; a conversion module, used for acquiring, based on the connection information, first original data and second original data required for calculating derivative values from the first data source and the second data source, respectively, and converting the first original data and the second original data into a first data vector and a second data vector; a calculation module, used for calculating, based on the derivative value calculation rule, a first derivative value from the first original data and a second derivative value from the second original data, and calculating an absolute difference value between the first derivative value and the second derivative value; an evaluation module, used for comparing the absolute difference value with the static consistency threshold and, if the absolute difference value exceeds the static consistency threshold, acquiring historical statistical information of the verification task based on the task identifier, extracting attributes of the derivative value calculation rule, and using a pre-trained probabilistic evaluation model to evaluate, based on the first derivative value, the second derivative value, the first data vector, the second data vector, the historical statistical information and the attributes of the derivative value calculation rule, a probability value representing the risk of inconsistency at the business logic level and a root cause classification; and a generation module, used for generating a final judgment result based on the probability value and the consistency judgment strategy, and generating a verification report based on the probability value, the root cause classification and the final judgment result.
- 9. Cross-database data consistency verification equipment, comprising a memory and at least one processor, the memory having computer readable instructions stored therein; the at least one processor invoking the computer readable instructions in the memory to perform the steps of the cross-database data consistency verification method of any of claims 1-7.
- 10. A computer readable storage medium having computer readable instructions stored thereon which, when executed by a processor, implement the steps of the cross-database data consistency verification method of any of claims 1-7.
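As an informal illustration only, the flow recited in claims 1, 5 and 7 (static threshold gate, comprehensive vector, probability-based judgment) might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: every name (`HistoryStats`, `RuleAttributes`, `verify`, `stub_model`), the two-threshold form of the judgment strategy, and all numeric values are hypothetical. The model is replaced by a fixed stub, the preprocessing of claim 6 is omitted, and treating a difference within the static threshold as "consistent" is an assumption the claims do not spell out.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class HistoryStats:
    """Historical statistics of one verification task (fields follow claim 5)."""
    mean_diff: float
    diff_variance: float
    inconsistency_cases: int


@dataclass
class RuleAttributes:
    """Attributes of the derivative value calculation rule (fields follow claim 5)."""
    complexity_score: float
    operator_count: int


# Hypothetical model interface: feature vector -> (probability, root cause label).
Model = Callable[[Sequence[float]], Tuple[float, str]]


def build_comprehensive_vector(v1: float, v2: float,
                               vec1: Sequence[float], vec2: Sequence[float],
                               history: HistoryStats,
                               rule: RuleAttributes) -> List[float]:
    """Concatenate derivative values, data vectors, history and rule attributes."""
    return [v1, v2, *vec1, *vec2,
            history.mean_diff, history.diff_variance,
            float(history.inconsistency_cases),
            rule.complexity_score, float(rule.operator_count)]


def decide(probability: float, inconsistent_at: float, pending_at: float) -> str:
    """Match the probability against risk level thresholds (assumed two-level policy)."""
    if probability >= inconsistent_at:
        return "inconsistent"
    if probability >= pending_at:
        return "pending"
    return "consistent"


def verify(v1: float, v2: float,
           vec1: Sequence[float], vec2: Sequence[float],
           static_threshold: float, history: HistoryStats,
           rule: RuleAttributes, model: Model,
           policy: Tuple[float, float]) -> dict:
    """Gate on the static threshold, then evaluate risk only when it is exceeded."""
    abs_diff = abs(v1 - v2)
    if abs_diff <= static_threshold:
        # Assumption: within the threshold the model is not consulted.
        return {"abs_diff": abs_diff, "probability": None,
                "root_cause": None, "result": "consistent"}
    features = build_comprehensive_vector(v1, v2, vec1, vec2, history, rule)
    probability, root_cause = model(features)
    return {"abs_diff": abs_diff, "probability": probability,
            "root_cause": root_cause,
            "result": decide(probability, *policy)}


def stub_model(features: Sequence[float]) -> Tuple[float, str]:
    """Placeholder standing in for the pre-trained probabilistic evaluation model."""
    return 0.9, "calculation_rule_mismatch"


report = verify(120.0, 100.0, [1.0, 2.0], [1.0, 3.0],
                static_threshold=5.0,
                history=HistoryStats(2.0, 1.5, 3),
                rule=RuleAttributes(0.7, 4),
                model=stub_model,
                policy=(0.8, 0.5))
```

In this sketch the returned dictionary plays the role of a structured verification report; a fuller version would add the difference summary, risk level and handling suggestion listed in claim 7.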
Description
Cross-database data consistency verification method, device, equipment and storage medium
Technical Field
The present invention relates to the field of data processing, and in particular to a cross-database data consistency verification method, device, equipment and storage medium.
Background
As enterprise digital transformation deepens, it has become commonplace for data assets to be distributed across a variety of heterogeneous data storage systems (e.g., relational databases, NoSQL databases, data lakes and data warehouses). To ensure accuracy in decision support, business analysis and report generation, it is critical to keep key business indicators consistent across these scattered data sources. Such indicators are typically not direct mappings of raw data; rather, they are aggregated, correlated and calculated from multiple sources according to complex business logic (i.e., derivative value calculation rules). Examples include daily active users, monthly sales margin and inventory turnover rate.
At present, cross-database data consistency verification in industry relies mainly on the following traditional methods.
The most common is simple comparison against a fixed threshold: the same calculation logic is executed in the source database and in the target database to obtain derivative values, the absolute difference or relative ratio between them is calculated, and the result is compared with a preset, fixed static threshold. If the difference exceeds the threshold, the data are judged inconsistent.
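The fixed-threshold comparison described above can be sketched as follows. This is a minimal illustration, not code from the patent: the function name and all numbers are hypothetical, and only the absolute-difference variant is shown.

```python
def static_threshold_check(source_value: float, target_value: float,
                           threshold: float) -> bool:
    """Return True when the two derivative values are judged consistent."""
    return abs(source_value - target_value) <= threshold


# Hypothetical values: one fixed threshold applied to indicators of very different scale.
small_metric_ok = static_threshold_check(100.0, 104.0, threshold=5.0)
large_metric_ok = static_threshold_check(1_000_000.0, 1_000_006.0, threshold=5.0)
```

With these numbers the small indicator passes despite a 4% gap, while the large indicator is flagged for a gap of a few parts per million, which illustrates why a single static threshold is hard to tune.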
Although this method is simple to implement and fast to compute, it has obvious drawbacks: the threshold setting depends heavily on manual experience and lacks flexibility; acceptable normal business fluctuation cannot be distinguished from real data errors or logic defects; for complex rules, small differences can be amplified and reported as false positives; and real anomalies can be missed because they do not reach the fixed threshold.
A second method is periodic batch processing with manual review: scripts run the verification tasks on a schedule, generate a list of results exceeding the threshold, and hand the list to a data engineer or analyst for manual checking. This method shifts the burden of judgment onto people, is inefficient and slow to respond, and its labor cost rises rapidly as the number of verification tasks and the complexity of the rules grow, so it is difficult to scale.
A third method introduces business rules for filtering on top of the simple comparison, such as distinguishing workdays from holidays or ignoring data from specific sources. This improves flexibility to some extent, but the rules still have to be maintained manually and the judgment remains essentially a yes/no binary decision, so the level of inconsistency risk cannot be quantified and it is difficult to cope with changes in data distribution and newly emerging abnormal patterns.
Accordingly, there is a need for improvement and development in the art.
Disclosure of Invention
The invention provides a cross-database data consistency verification method, device, equipment and storage medium, which are used for performing risk assessment on cross-database derivative values.
The first aspect of the invention provides a cross-database data consistency verification method, which comprises: acquiring a verification task, and obtaining, based on the verification task, a task identifier, connection information of a first data source and a second data source corresponding to the same service index, a derivative value calculation rule, a static consistency threshold and a consistency judgment strategy; based on the connection information, acquiring from the first data source and the second data source, respectively, first original data and second original data required for calculating derivative values, and converting the first original data and the second original data into a first data vector and a second data vector; based on the derivative value calculation rule, calculating a first derivative value from the first original data and a second derivative value from the second original data, and calculating an absolute difference value between the first derivative value and the second derivative value; comparing the absolute difference value with the static consistency threshold and, if the absolute difference value exceeds the static consistency threshold, acquiring historical statistical information of the verification task based on the task identifier, extracting attributes of the derivative value calculation rule, and using a pre-trained probabilistic evaluation model to evaluate, based on the first derivative value, the second derivative value, the first data vector, the second data vector, the historical statistical information and the attributes of the derivative value calculation rule, a probability value representing the risk of inconsistency at the business logic level and a root cause classification; generating a final judgment result based on the probability value and the consistency judgment strategy, and generating final