CN-121979880-A - Real-time wide table construction method, system, equipment and medium based on Spark and Flink combination

CN121979880ACN 121979880 ACN121979880 ACN 121979880ACN-121979880-A

Abstract

The invention provides a real-time wide table construction method, a system, equipment and a medium based on Spark and Flink combination, which comprise the steps of S1, constructing a data link based on a Spark connector, completing quick generation of a data set snapshot, S2, constructing the data link based on a FlinkCDC connector, realizing incremental real-time update of data, S3, stopping a Spark data synchronization program, setting up the latest position real-time consumption incremental change data of the snapshot generated by a Flink CDC from S1, dividing a table to be associated into a master table and a plurality of slave tables, returning the related slave tables according to an association key when the master table changes, combining to form wide table data, writing the wide table data into a downstream wide table database in a upsert mode, and updating corresponding data in the downstream wide table directly according to the association key or an external key when the slave table changes. Through the mechanism, all changes of the association table can be fed back to the downstream wide table in real time, so that the dynamic construction of the wide table is completed.

Inventors

SHI HAIXIONG
ZHAO LEI
MA JUNQIANG
CHENG HONG

Assignees

上海大智慧财汇数据科技有限公司

Dates

Publication Date: 20260505
Application Date: 20251224

Claims (10)

1. A real-time wide table construction method based on Spark and Flink is characterized by comprising the following steps: Step S1, constructing a data link based on a Spark connector to realize full-scale synchronization of data and generate a snapshot, wherein the snapshot comprises complete data of all source tables at the time of synchronization completion; S2, constructing a data link based on the Flink CDC connector to realize incremental real-time updating of data; step S3, stopping the Spark data synchronization program, reading metadata of the snap, setting an initial consumption time stamp of the Flink CDC as a time point when the snap generation is completed, dividing all source tables into a master table and a plurality of slave tables, integrating and writing data of the master table and related slave tables into a downstream wide table database when the master table changes, and directly updating corresponding data in the downstream wide table when the slave table changes, so that dynamic construction of the wide table is completed.
2. The real-time wide table construction method based on Spark and Flink combination according to claim 1, wherein when the Flink CDC captures the incremental change of the master table, the latest data of all associated slave tables are searched back by taking the associated key of the master table as a clue, the master table data and the slave table data are spliced and combined into complete wide table data, and the complete wide table data are written into a downstream wide table database in a upsert mode.
3. The real-time wide table construction method based on Spark and Flink combination according to claim 2, wherein the number-based and time-triggered window operators are used to capture the changed master table data, and the asynchronous clients of the thread pool simulation relational database are used to realize parallel query of all slave table data.
4. The real-time wide table construction method based on Spark and Flink combination according to claim 1, wherein when the Flink CDC captures the change from the table increment, the associated key/foreign key is directly used as an index to locate the corresponding record in the downstream wide table, and only the content corresponding to the current slave table field is updated.
5. A real-time wide-table construction system based on Spark in combination with Flink, comprising: The method comprises the steps of M1, constructing a data link based on a Spark connector, realizing full synchronization of data, and generating a snapshot, wherein the snapshot comprises complete data of all source tables at the time of synchronization completion; The module M2 is used for constructing a data link based on the Flink CDC connector to realize incremental real-time updating of data; The module M3 is used for stopping the Spark data synchronization program, reading metadata of the snap, setting an initial consumption time stamp of the Flink CDC as a time point when the snap generation is completed, dividing all source tables into a master table and a plurality of slave tables, integrating and writing data of the master table and related slave tables into a downstream wide table database when the master table changes, and directly updating corresponding data in the downstream wide table when the slave table changes, so that dynamic construction of the wide table is completed.
6. The real-time wide table construction system based on Spark and Flink combination according to claim 5, wherein when the Flink CDC captures the incremental change of the master table, the latest data of all associated slave tables are searched back by taking the associated key of the master table as a clue, the master table data and the slave table data are spliced and combined into complete wide table data, and the complete wide table data is written into the downstream wide table database in upsert mode.
7. The Spark and flank combination based real-time wide table construction system of claim 6, wherein the number and time based triggering of window operators is utilized to capture the changed master table data and the asynchronous client of the thread pool simulation relational database is utilized to realize parallel query of all slave table data.
8. The real-time wide table construction system based on Spark and Flink combination according to claim 5, wherein when the Flink CDC captures the change from the table increment, the record corresponding to the downstream wide table is located by directly using the associated key/foreign key as the index, and only the content corresponding to the current slave table field is updated.
9. An electronic device comprising a memory and at least one processor, the memory having instructions stored therein; The at least one processor invoking the instructions in the memory to cause the electronic device to perform the steps of the Spark and Flink based real-time wide table construction method as recited in any of claims 1-4.
10. A computer readable storage medium having instructions stored thereon, which when executed by a processor, implement the steps of a real-time wide table construction method based on Spark in combination with Flink as claimed in any of claims 1-4.

Description

Real-time wide table construction method, system, equipment and medium based on Spark and Flink combination Technical Field The invention relates to the technical field of big data, in particular to a real-time wide table construction method, a system, equipment and a medium based on Spark and Flink combination. Background In the field of big data, in order to support multidimensional data analysis, statistics and quick query, a plurality of data tables are usually required to be integrated into an inverse-normal-form broad table in a correlation manner so as to meet analysis requirements in complex business scenes. However, the following problems are often faced in practical projects: if there is an association operation between large tables, many efficient join strategies are difficult to apply, and meanwhile, the finally generated wide table field may contain a nested structure formed by one-to-many relationships, and the main stream join mode generally lacks the native support for the complex structure. In addition, in some scenes with high real-time requirements, updating of the slave table data needs to be synchronized to a wide table in time, and no mature and stable solution is available for coping with the requirements. In the prior art, the broad table construction is dependent on materialized views or Flink double flow join. However, the materialized view is limited by more functions in different OLAP databases, and is difficult to flexibly adapt to the requirement of highly customized broad table construction, especially the performance is insufficient when processing one-to-many complex data structures, while the flank double-flow join has limitations on the table size and is difficult to effectively support complex data forms such as nested structures. In addition, in the process of generating the snap, the mode based on the Flink double-flow join has the problems that the time consumption of the snap generation is high, the online upgrading flow is long, the data freshness is reduced and the like due to the dependence on the state. Disclosure of Invention Aiming at the defects in the prior art, the invention aims to provide a real-time wide table construction method, a system, equipment and a medium based on Spark and Flink combination. The invention provides a real-time wide table construction method based on Spark and Flink combination, which comprises the following steps: Step S1, constructing a data link based on a Spark connector to realize full-scale synchronization of data and generate a snapshot, wherein the snapshot comprises complete data of all source tables at the time of synchronization completion; S2, constructing a data link based on the Flink CDC connector to realize incremental real-time updating of data; step S3, stopping the Spark data synchronization program, reading metadata of the snap, setting an initial consumption time stamp of the Flink CDC as a time point when the snap generation is completed, dividing all source tables into a master table and a plurality of slave tables, integrating and writing data of the master table and related slave tables into a downstream wide table database when the master table changes, and directly updating corresponding data in the downstream wide table when the slave table changes, so that dynamic construction of the wide table is completed. Preferably, when the flank CDC captures incremental changes of the master table, the association key of the master table is taken as a clue, the latest data of all associated slave tables are checked back, the master table data and the slave table data are spliced and combined into complete wide table data, and the complete wide table data are written into a downstream wide table database in a upsert mode. Preferably, the number-based and time-triggered capture of the window operator captures the changed master table data, and the asynchronous client of the thread pool simulation relational database is utilized to realize parallel query of all slave table data. Preferably, when the flank CDC captures a slave table increment change, the associated key/foreign key is directly used as an index to locate a corresponding record in the downstream wide table, and only the content corresponding to the current slave table field is updated. The invention provides a real-time wide table construction system based on Spark and Flink combination, which comprises: The method comprises the steps of M1, constructing a data link based on a Spark connector, realizing full synchronization of data, and generating a snapshot, wherein the snapshot comprises complete data of all source tables at the time of synchronization completion; The module M2 is used for constructing a data link based on the Flink CDC connector to realize incremental real-time updating of data; The module M3 is used for stopping the Spark data synchronization program, reading metadata of the snap, setting an initial consumption time stamp of the Flink