CN-121996646-A - Method and system for processing dirty data of real-time data collection based on Flink CDC
Abstract
The application discloses a method for processing dirty data in real-time data collection based on Flink CDC. The method captures source-end data changes through Flink CDC, obtains and caches target-end metadata, and, before data is written, performs multidimensional pre-verification against that metadata (field length, type, non-null constraints, and the like) to identify dirty data in advance. Identified dirty data is packaged into a structured message and sent to a Kafka isolation topic; data that passes pre-verification is written to the target end, with any write exceptions captured and likewise sent to the isolation topic; an independent Consumer then analyzes and repairs the dirty data, forming an identification-isolation-analysis-repair processing closed loop. The application moves dirty-data interception forward from the write-failure point to the data-circulation stage, improves write success rate and target-end stability, reduces rollback and retry cost, provides precise field-level error diagnostics, achieves observability and closed-loop governance of data quality problems, and is suitable for big-data real-time synchronization scenarios.
Inventors
- PENG ZHUANG
Assignees
- 中电云计算技术有限公司
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-01-27
Claims (10)
- 1. A method for processing dirty data of real-time data collection based on Flink CDC, the method comprising: S1, monitoring and capturing data change events of a source database in real time by using a Flink CDC connector, and converting the data change events into a Flink DataStream; S2, acquiring and caching metadata information of a target-end data table; S3, before the data is written into the target-end database, pre-checking the data change events based on the metadata information, identifying dirty data that does not conform to the target-end table structure specification, packaging the identified dirty data together with its check-failure information into a structured message, and sending the structured message to a preset Kafka isolation topic; S4, performing the target-end write operation on the data that passes the pre-verification, capturing exceptions during the write process, packaging the data that failed to be written together with its exception information into a structured message, and sending the structured message to the Kafka isolation topic; S5, consuming the dirty-data messages in the Kafka isolation topic with an independent Kafka Consumer, and analyzing and repairing the dirty data to form an identification-isolation-analysis-repair dirty-data processing closed loop (a pipeline sketch follows the claims).
- 2. The method according to claim 1, wherein acquiring and caching the metadata information of the target-end data table in step S2 comprises: querying the metadata information of the target database when the Flink application starts or the cache is invalid, wherein the metadata information comprises field name, field type, field length, and non-null constraint; and caching the metadata information in application memory with a cache validity period, so that the cache is refreshed periodically.
- 3. The method according to claim 2, wherein the cache validity period balances database query pressure against the timeliness of perceiving table structure changes, and the periodic cache refresh is triggered when the cache reaches its validity period or when a check finds the cached metadata inconsistent with the actual table structure (a metadata-cache sketch follows the claims).
- 4. The method according to claim 1, wherein the pre-verification in step S3 comprises: (1) field length check: checking whether the length of a source-end field value to be written exceeds the maximum allowed length of the corresponding target-end field; (2) field type check: checking whether the data type of a source-end field to be written is compatible with the data type of the corresponding target-end field; (3) non-null constraint check: checking whether a target-end field carrying a non-null constraint has a NULL value in the source-end data; (4) field count check: checking whether the field count of the source-end data change event is consistent with the field count defined by the target table structure (a validator sketch follows the claims).
- 5. The method according to claim 1, wherein the structured message into which step S3 encapsulates the identified dirty data and its verification failure information comprises the original source data change record, a verification failure type identifier, the target-end field specification requirement, a description of the specific failure cause, and the source table name and target table name (a message-layout sketch follows the claims).
- 6. The method according to claim 1, wherein capturing exceptions in step S4 comprises: intercepting write exceptions in the write logic of the Sink Function through an exception capture mechanism; and parsing the exception information into structured data and packaging it, together with the corresponding source data change record, into a structured dirty-data message (a sink sketch follows the claims).
- 7. The method according to claim 1, wherein the exception information in step S4 comprises an exception type and an exception message, the exception types including network timeout, transaction conflict, and database constraint violation (a classifier sketch follows the claims).
- 8. The method according to claim 1, wherein analyzing and repairing the dirty data in step S5 specifically comprises: locating the source of the problem from the dirty-data message; repairing the source-end data or adjusting the target-end table structure according to the failure type and cause in the dirty-data message; and re-injecting the repaired data into the collection flow to achieve closed-loop governance of the dirty data (a consumer sketch follows the claims).
- 9. A system for processing dirty data of real-time data collection based on Flink CDC, wherein the system, when running, implements the steps of the method for processing dirty data of real-time data collection based on Flink CDC as claimed in any one of claims 1-8, the system comprising: a data capture module for monitoring and capturing data change events of the source database in real time through the Flink CDC connector and converting them into a Flink DataStream; a metadata management module for acquiring and caching metadata information of the target-end data table; a pre-verification module for pre-verifying the data change events based on the metadata information before the data is written into the target-end database, identifying dirty data that does not conform to the target-end table structure specification, packaging the identified dirty data together with its verification failure information into a structured message, and sending it to a preset Kafka isolation topic; a write exception capture module for performing the target-end write operation on the data that passes the pre-verification, capturing exceptions during the write process, packaging the data that failed to be written together with its exception information into a structured message, and sending it to the Kafka isolation topic; and a dirty data processing module for consuming the dirty-data messages in the Kafka isolation topic through an independent Kafka Consumer, and analyzing and repairing the dirty data to form an identification-isolation-analysis-repair dirty-data processing closed loop.
- 10. An electronic device, comprising a memory and a processor; the memory for storing a computer program; and the processor for executing the computer program to implement the steps of the method for processing dirty data of real-time data collection based on Flink CDC as claimed in any one of claims 1-8.
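The sketches below are editorial illustrations of the claimed steps, not part of the patent text; every concrete host, topic, table, and parameter name in them is an assumption. This first sketch shows step S1 of claim 1 using the publicly documented Flink CDC MySQL connector, which turns database change events into a Flink DataStream.

```java
// Minimal sketch of step S1: capture MySQL changes as a Flink DataStream.
// Connection parameters and names are illustrative, not from the patent.
import com.ververica.cdc.connectors.mysql.source.MySqlSource;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CdcPipelineSketch {
    public static void main(String[] args) throws Exception {
        MySqlSource<String> source = MySqlSource.<String>builder()
                .hostname("source-db-host")          // assumed host
                .port(3306)
                .databaseList("app_db")              // assumed database
                .tableList("app_db.orders")          // assumed table
                .username("flink")
                .password("***")
                .deserializer(new JsonDebeziumDeserializationSchema()) // events as JSON
                .build();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> changeEvents =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "MySQL CDC Source");

        // Downstream (S2-S5): pre-validate against cached target metadata,
        // route dirty records to the Kafka isolation topic, write the rest.
        changeEvents.print();
        env.execute("flink-cdc-dirty-data-sketch");
    }
}
```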
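Claims 2 and 3 describe the metadata cache with a validity period and refresh triggers. A minimal sketch using the standard JDBC DatabaseMetaData API follows; the TTL value, type layout, and connection handling are assumptions, and the ColumnMeta type is reused by the validator sketch below.

```java
// Sketch of claims 2-3: query target-table metadata via JDBC and cache it
// with a validity period. The 5-minute TTL is an assumed value.
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

class ColumnMeta {
    final String name; final int jdbcType; final int length; final boolean nullable;
    ColumnMeta(String name, int jdbcType, int length, boolean nullable) {
        this.name = name; this.jdbcType = jdbcType; this.length = length; this.nullable = nullable;
    }
}

class TargetMetadataCache {
    private static final long TTL_MS = 5 * 60_000;  // assumed cache validity period
    private List<ColumnMeta> cached;
    private long loadedAt;

    synchronized List<ColumnMeta> get(Connection conn, String table) throws SQLException {
        // Refresh when the cache is empty, invalidated, or past its validity period.
        if (cached == null || System.currentTimeMillis() - loadedAt > TTL_MS) {
            cached = load(conn, table);
            loadedAt = System.currentTimeMillis();
        }
        return cached;
    }

    /** Marks the cache stale, e.g. when a check finds it out of sync with the real table. */
    synchronized void invalidate() { cached = null; }

    private List<ColumnMeta> load(Connection conn, String table) throws SQLException {
        DatabaseMetaData md = conn.getMetaData();
        List<ColumnMeta> cols = new ArrayList<>();
        try (ResultSet rs = md.getColumns(null, null, table, null)) {
            while (rs.next()) {
                cols.add(new ColumnMeta(
                        rs.getString("COLUMN_NAME"),
                        rs.getInt("DATA_TYPE"),      // java.sql.Types code
                        rs.getInt("COLUMN_SIZE"),    // max length for char types
                        "YES".equals(rs.getString("IS_NULLABLE"))));
            }
        }
        return cols;
    }
}
```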
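Claim 4 enumerates four pre-checks. The sketch below implements them against the cached ColumnMeta list; representing a change event as a Map of field name to value, and the simplified type-compatibility rule, are assumptions for illustration.

```java
// Sketch of the four pre-checks in claim 4: (1) length, (2) type,
// (3) non-null constraint, (4) field count.
import java.sql.Types;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class PreValidator {
    /** Returns failure descriptions; an empty list means the record passes. */
    static List<String> validate(Map<String, Object> row, List<ColumnMeta> targetCols) {
        List<String> failures = new ArrayList<>();

        // (4) field count: source event vs. target table definition
        if (row.size() != targetCols.size()) {
            failures.add("field count " + row.size() + " != target " + targetCols.size());
        }
        for (ColumnMeta col : targetCols) {
            Object value = row.get(col.name);
            // (3) non-null constraint
            if (value == null) {
                if (!col.nullable) failures.add(col.name + ": NULL violates NOT NULL");
                continue;
            }
            // (1) field length for character columns
            if ((col.jdbcType == Types.VARCHAR || col.jdbcType == Types.CHAR)
                    && value.toString().length() > col.length) {
                failures.add(col.name + ": length " + value.toString().length()
                        + " exceeds max " + col.length);
            }
            // (2) type compatibility (simplified: numeric target needs numeric value)
            if (col.jdbcType == Types.INTEGER && !(value instanceof Number)) {
                failures.add(col.name + ": " + value.getClass().getSimpleName()
                        + " incompatible with INTEGER");
            }
        }
        return failures;
    }
}
```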
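Claim 5 fixes only the information content of the structured dirty-data message, not its layout. One possible shape, with illustrative field names:

```java
// Sketch of the structured dirty-data message of claim 5. Field names and
// example values are assumptions; the claim only fixes the information content.
class DirtyDataMessage {
    String sourceTable;     // source table name
    String targetTable;     // target table name
    String failureType;     // e.g. "LENGTH_EXCEEDED", "TYPE_MISMATCH",
                            //      "NOT_NULL_VIOLATION", "FIELD_COUNT_MISMATCH"
    String targetFieldSpec; // target-end field specification, e.g. "VARCHAR(64) NOT NULL"
    String failureReason;   // specific cause, e.g. "value length 80 exceeds max 64"
    String originalRecord;  // original source change record, e.g. the Debezium JSON
}
```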
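Claim 6 intercepts write exceptions inside the Sink Function. The sketch below wraps the write logic in a handler and forwards failures to the isolation topic; the JDBC write itself, the producer wiring, the topic name, and the hand-built JSON are assumptions (the record is assumed to already be JSON).

```java
// Sketch of claim 6: catch write exceptions in a SinkFunction and send the
// failed record plus structured exception info to the Kafka isolation topic.
import java.util.Properties;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

class CapturingJdbcSink extends RichSinkFunction<String> {
    private transient KafkaProducer<String, String> producer;

    @Override
    public void open(Configuration parameters) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");  // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        producer = new KafkaProducer<>(props);
    }

    @Override
    public void invoke(String record, Context context) {
        try {
            writeToTarget(record);                     // actual JDBC write, elided
        } catch (Exception e) {
            // Parse the exception into structured data and isolate it together
            // with the original record instead of failing the whole job.
            String message = "{\"exceptionType\":\"" + e.getClass().getSimpleName()
                    + "\",\"exceptionMessage\":\"" + e.getMessage()
                    + "\",\"originalRecord\":" + record + "}";
            producer.send(new ProducerRecord<>("dirty-data-isolation", message));
        }
    }

    private void writeToTarget(String record) throws Exception {
        // placeholder for the target-end write
    }

    @Override
    public void close() {
        if (producer != null) producer.close();
    }
}
```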
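Claim 7's exception types map naturally onto the standard java.sql exception hierarchy; the mapping and category names below are assumptions, not the patent's classification.

```java
// Sketch of the exception classification in claim 7, using standard
// java.sql exception subclasses; category names are assumed.
import java.sql.SQLException;
import java.sql.SQLIntegrityConstraintViolationException;
import java.sql.SQLTimeoutException;
import java.sql.SQLTransactionRollbackException;

class ExceptionClassifier {
    static String classify(SQLException e) {
        if (e instanceof SQLTimeoutException) return "NETWORK_TIMEOUT";
        if (e instanceof SQLTransactionRollbackException) return "TRANSACTION_CONFLICT";
        if (e instanceof SQLIntegrityConstraintViolationException) {
            return "CONSTRAINT_VIOLATION";
        }
        return "UNKNOWN";
    }
}
```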
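Finally, step S5 / claim 8: an independent consumer reads the isolation topic and routes records for analysis and repair. Broker address, group id, topic name, and the repair hook are assumptions.

```java
// Sketch of step S5: consume the isolation topic, analyze each dirty-data
// message, repair, and re-inject into the collection flow.
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DirtyDataConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");   // assumed broker
        props.put("group.id", "dirty-data-repair");     // assumed group id
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("dirty-data-isolation"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    // Locate the problem source from the structured message, then
                    // repair the data or flag the target table structure, and
                    // re-inject the repaired record into the collection flow.
                    handle(r.value());
                }
            }
        }
    }

    private static void handle(String dirtyMessage) {
        // placeholder: parse failureType, repair, re-inject
    }
}
```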
Description
Method and system for processing dirty data of real-time data collection based on Flink CDC

Technical Field

The application relates to the technical field of big-data real-time collection and processing, in particular to a method, a system, a computer-readable storage medium, and an electronic device for identifying, collecting, and processing dirty data in real-time data collection based on Flink CDC (Change Data Capture).

Background

The big-data era places higher demands on real-time data collection and processing capabilities. Apache Flink CDC, a core component of the Flink ecosystem, can efficiently capture real-time changes in data sources such as databases and synchronize them to downstream systems. In actual collection scenarios, however, source-end data quality defects or target-end table structure irregularities (such as field length overruns, data type incompatibilities, and missing non-null constraints) often lead to data write failures, producing so-called "dirty data".

In the prior art, dirty data is handled mainly through a write-exception capture mechanism, i.e., after-the-fact interception when illegal data triggers a database write error. This approach has significant drawbacks:

1. Delayed detection: the anomaly is discovered only after the data has flowed through the transmission link, and part of the data may even have been written successfully, markedly increasing the cost and complexity of rollback, retry, and repair.
2. Fuzzy error information: some generic database exceptions return only a broad write-failure message and cannot pinpoint which verification rule (length, type, constraint) was violated.
3. Easily masked exceptions: in a complex ETL flow, the underlying write exception may be caught, converted, or wrapped by upper-layer logic, losing the original dirty-data context and making root-cause tracing difficult.
4. Limited capture range: not every data irregularity triggers a clear, resolvable exception; some problems manifest as silent data corruption or logical inconsistency and cannot be identified through write failures alone.

These limitations make locating, analyzing, and repairing dirty data inefficient, seriously affecting data quality and processing timeliness; a front-loaded, fine-grained scheme for real-time dirty-data identification and processing is therefore needed.

Disclosure of Invention

To overcome the defects of the traditional scheme, the application provides a method and a system for processing dirty data of real-time data collection based on Flink CDC. The application introduces, into the Flink CDC real-time collection process, a dirty-data processing mechanism in which pre-verification is primary and exception capture is auxiliary: strict verification is performed before data is written to the target end, supplementary capture is performed when a write fails, and the dirty data together with its detailed information is isolated to Kafka, finally forming a complete dirty-data processing closed loop.
To achieve the above object, the application adopts the following technical strategies:

(1) Front-loaded pre-verification: strict multidimensional verification is performed before the data is written to the target end, advancing dirty-data identification from the traditional write-failure point to the data-circulation stage and fundamentally reducing exception-handling cost.

(2) Dynamic rule engine: a real-time quality rule base is built from target-end metadata, covering verification dimensions such as field length, data type, non-null constraint, missing fields, and name mapping, to precisely intercept non-conforming data.

(3) Metadata hot loading and caching: target-end table structure changes are synchronized dynamically through a metadata caching and timed-refresh mechanism, ensuring real-time verification while avoiding the performance bottleneck of frequent access to the metadata service.

(4) Structured tagging and dead-letter-queue isolation: the original dirty-data content, verification failure type, error details, and target-end specification requirement are packaged into a structured message and sent to the Kafka dead-letter topic, physically isolating dirty data from normal data and keeping problems traceable (see the side-output sketch below).

(5) Dual-layer fault tolerance: the front-loaded pre-verification intercepts most format-class anomalies, while the exception capture mechanism serves as a safety-net policy, capturing underlying write exceptions (such as network timeouts and transaction conflicts) and isolating them as well, so that no dirty data escapes the closed loop.
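One natural way to realize the "isolate dirty data from normal data" split of strategies (4) and (5) in Flink is a side output; this is a hedged sketch of one possible realization, not the patent's mandated implementation, and the tag name and pre-check stub are assumptions.

```java
// Sketch: route records failing the claim-4 pre-checks to a side output
// (feeding the Kafka isolation topic) while clean records continue to the sink.
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

class DirtyDataRouter {
    static final OutputTag<String> DIRTY = new OutputTag<String>("dirty-data") {};

    static SingleOutputStreamOperator<String> route(DataStream<String> events) {
        return events.process(new ProcessFunction<String, String>() {
            @Override
            public void processElement(String event, Context ctx, Collector<String> out) {
                if (failsPreCheck(event)) {
                    ctx.output(DIRTY, event);  // side output -> Kafka isolation topic
                } else {
                    out.collect(event);        // main output -> target-end sink
                }
            }
        });
    }

    private static boolean failsPreCheck(String event) {
        return false; // placeholder for the claim-4 pre-checks
    }
}
// Usage: route(changeEvents).getSideOutput(DirtyDataRouter.DIRTY) feeds the
// isolation topic; the main stream goes to the exception-capturing sink.
```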