CN-122019667-A - Real-time data warehouse increment synchronization method based on a HUDI data lake


Abstract

The invention discloses a real-time data warehouse increment synchronization method based on a HUDI data lake, relating to the technical field of HUDI data lake data synchronization. The method comprises: acquiring the finest time partition granularity of the HUDI data lake, and collecting data warehouse increment sizes at the corresponding time partition granularity to obtain initial data warehouse increment data; preprocessing the initial data warehouse increment data and analyzing the increment sizes to obtain an optimal partition size; adjusting the time partition size of the HUDI data lake according to the optimal partition size, and acquiring an optimal bucket size to obtain optimal bucket information; and adjusting the bucket size of the HUDI data lake according to the optimal bucket information, and synchronizing data warehouse increments in real time.

Inventors

ZHOU LONGJIANG

Assignees

深圳大数信科技术有限公司

Dates

Publication Date: 20260512
Application Date: 20260127

Claims (10)

1. A real-time data warehouse increment synchronization method based on a HUDI data lake, characterized by comprising the following steps: acquiring the finest time partition granularity of the HUDI data lake, and collecting data warehouse increment sizes at the corresponding time partition granularity to obtain initial data warehouse increment data; preprocessing the initial data warehouse increment data and analyzing the increment sizes to obtain an optimal partition size; adjusting the time partition size of the HUDI data lake according to the optimal partition size, and acquiring an optimal bucket size to obtain optimal bucket information; and adjusting the bucket size of the HUDI data lake according to the optimal bucket information, and synchronizing data warehouse increments in real time.
2. The real-time data warehouse increment synchronization method based on a HUDI data lake of claim 1, wherein acquiring the finest time partition granularity of the HUDI data lake and collecting data warehouse increment sizes at the corresponding time partition granularity to obtain initial data warehouse increment data comprises the following sub-steps: denoting the HUDI data lake of the e-commerce business as a first data lake, acquiring the time partition granularities of the first data lake, taking the finest of them as the finest partition granularity, and denoting the step length and unit corresponding to the finest partition granularity as k1; and recording, as the original data increment, the actual physical storage volume that is generated in the real-time data warehouse increment synchronization link of the first data lake and that contains only valid data newly added or changed by the business in the synchronization link before entering the lake.
3. The real-time data warehouse increment synchronization method based on a HUDI data lake of claim 2, wherein acquiring the finest time partition granularity of the HUDI data lake and collecting data warehouse increment sizes at the corresponding time partition granularity to obtain initial data warehouse increment data further comprises the following sub-steps: continuously collecting the unit granularity data increment of the first data lake in each sampling period, arranging the collected values in time order, and recording them as original data warehouse increment data; and setting a first reference duration of t1 days, extracting from the original data warehouse increment data the part collected in the most recent t1 days, and recording that part as the initial data warehouse increment data.
4. The real-time data warehouse increment synchronization method based on a HUDI data lake of claim 3, wherein preprocessing the initial data warehouse increment data and analyzing the increment sizes to obtain the optimal partition size comprises the following sub-steps: arranging the initial data warehouse increment data in time order and denoting it as a first increment sequence, and denoting any unit granularity data increment in the first increment sequence as a first increment AD; for the initial data warehouse increment data, acquiring the first quartile AQ1, the third quartile AQ3 and the interquartile range IQ, removing abnormal values from the initial data warehouse increment data using the 3-sigma rule, calculating the mean and standard deviation of the remaining part, and denoting them as the corrected mean AP and the corrected standard deviation AB respectively; and letting TL = max[(AQ1 − 1.5 × IQ), (AP − 2 × AB)] and TU = min[(AQ3 + 1.5 × IQ), (AP + 2 × AB)], and denoting [TL, TU] as the global fluctuation range.
5. The real-time data warehouse increment synchronization method based on a HUDI data lake of claim 4, wherein preprocessing the initial data warehouse increment data and analyzing the increment sizes to obtain the optimal partition size further comprises the following sub-steps: calculating the mean CA and standard deviation CB of the adjacent reference increments, and denoting [CA − k3 × CB, CA + k3 × CB] as the local fluctuation range of the first increment, wherein k2 is a set number and k3 is a set proportionality coefficient; and marking the first increment as a normal value if it lies within the local fluctuation range, as an abnormal value if it lies within neither the local fluctuation range nor the global fluctuation range, and as a suspected abnormal value if it lies outside the local fluctuation range but within the global fluctuation range.
6. The real-time data warehouse increment synchronization method based on a HUDI data lake of claim 5, wherein preprocessing the initial data warehouse increment data and analyzing the increment sizes to obtain the optimal partition size further comprises the following sub-steps: taking the first increment as the center, acquiring the k4 unit granularity data increments closest to the first increment in the first increment sequence, and recording them as the neighborhood reference increments of the first increment, wherein k4 is a set number; calculating the mean DP of the neighborhood reference increments, and calculating |AD − DP| / DP × 100% as the relative deviation of the first increment; repeating the calculation for all normal values to obtain a normal deviation set, and calculating the relative deviation of every suspected abnormal value; calculating the mean EP and standard deviation EB of the normal deviation set, and recording EP + k5 × EB as the normal deviation threshold, wherein k5 is a set proportionality coefficient; for any suspected abnormal value, marking it as an abnormal value if its relative deviation is greater than the normal deviation threshold, and otherwise marking it as a normal value; and repeating the above to identify all abnormal values in the initial data warehouse increment data, and removing them to obtain valid increment data.
7. The real-time data warehouse increment synchronization method based on a HUDI data lake of claim 6, wherein preprocessing the initial data warehouse increment data and analyzing the increment sizes to obtain the optimal partition size further comprises the following sub-steps: calculating the standard deviation YB of the valid increment data, and setting the range of the partition size to [DS, DL]; setting the neighborhood radius to 0.8 × YB and the minimum number of points to n, performing density clustering on the valid increment data to obtain a plurality of clusters, calculating the proportion of the number of data points in each cluster to the total number of data points in the valid increment data, and recording it as the intra-cluster proportion; marking every cluster whose intra-cluster proportion is greater than k6 as a core cluster, and, if no cluster has an intra-cluster proportion greater than k6, marking the k7 clusters with the largest intra-cluster proportions as core clusters, wherein k6 is a set proportion and k7 is a set number; and calculating the median of each core cluster, calculating the average of the medians of all core clusters weighted by intra-cluster proportion, and recording it as the normal core increment VH.
8. The real-time data warehouse increment synchronization method based on a HUDI data lake of claim 7, wherein preprocessing the initial data warehouse increment data and analyzing the increment sizes to obtain the optimal partition size further comprises the following sub-steps: arranging the valid increment data in time order and denoting it as a second increment sequence; letting ES = ⌈DS/VH⌉ and EL = min(⌊DL/VH⌋, 30), and taking [ES, EL] as the value range of the merge multiple, the merge multiple being an integer; denoting any merge multiple as E0, calculating in order, from the start of the second increment sequence, the sum of each E0 consecutive unit granularity data increments, and discarding the result where fewer than E0 consecutive unit granularity data increments remain, to obtain the merged increment sequence corresponding to E0; recording VH × E0 as the corresponding candidate partition size SV, and repeating to obtain all merged increment sequences and candidate partition sizes; for the candidate partition size SV corresponding to E0, calculating the standard deviation EB of the merged increment sequence corresponding to E0, calculating (SV/DS) × (1 − EB/YB), and recording it as the merge gain index of SV; and repeating to obtain the merge gain index of every candidate partition size, and recording the candidate partition size with the largest merge gain index as the optimal partition size WS.
9. The real-time data warehouse increment synchronization method based on a HUDI data lake of claim 8, wherein adjusting the time partition size of the HUDI data lake according to the optimal partition size and acquiring the optimal bucket size to obtain the optimal bucket information comprises the following sub-steps: adjusting the size of the finest time partition of the HUDI data lake to the optimal partition size; setting the candidate set {128 MB, 256 MB}, calculating WS/128 and WS/256, and recording them as the reasonable bucket counts F1 and F2 corresponding to 128 MB and 256 MB respectively; and, if F1 and F2 both lie within [1, DL/256], taking the bucket size corresponding to the smaller reasonable bucket count as the optimal bucket size, and, if only one of F1 and F2 lies within [1, DL/256], taking the bucket size corresponding to that reasonable bucket count as the optimal bucket size.
10. The real-time data warehouse increment synchronization method based on a HUDI data lake of claim 9, wherein adjusting the bucket size of the HUDI data lake according to the optimal bucket information and synchronizing data warehouse increments in real time comprises the following sub-steps: adjusting the bucket size of the time partitions of the HUDI data lake to the optimal bucket size, setting an adjustment period T0, and periodically adjusting the time partition size and bucket size of the HUDI data lake according to T0; and capturing data warehouse increments in real time, creating the corresponding data table in HUDI according to the adjusted partition size and bucket size, and placing the newly added data at the corresponding position in the data table according to the rules, thereby completing real-time synchronization of the data warehouse increments.
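
The partition and bucket sizing of claims 8 and 9 can be illustrated with a minimal Python sketch. It is not the patented implementation. Several points are assumptions, since the claims leave them open: ES is taken as DS/VH rounded up (and at least 1), EL as the smaller of DL/VH rounded down and 30, the merged sums are taken over non-overlapping groups, and population standard deviations are used. The function names are hypothetical and all sizes are in MB.

```python
import math
import statistics


def optimal_partition_size(increments, vh, ds, dl, cap=30):
    """Return (WS, E0) with the largest merge gain index (claim 8).

    increments: valid unit-granularity increments in time order (MB)
    vh: normal core increment VH; [ds, dl]: allowed partition size range
    """
    yb = statistics.pstdev(increments)       # YB; population std dev is an assumption
    if yb == 0:
        raise ValueError("YB is zero, the gain index is undefined")
    es = max(1, math.ceil(ds / vh))          # assumed reading of ES
    el = min(math.floor(dl / vh), cap)       # assumed reading of EL
    best = None
    for e0 in range(es, el + 1):
        # Sum non-overlapping groups of e0 consecutive increments;
        # a trailing group with fewer than e0 values is discarded.
        merged = [sum(increments[i:i + e0])
                  for i in range(0, len(increments) - e0 + 1, e0)]
        if len(merged) < 2:
            continue                         # too few sums to measure spread
        eb = statistics.pstdev(merged)       # EB of the merged sequence
        sv = vh * e0                         # candidate partition size SV
        gain = (sv / ds) * (1 - eb / yb)     # merge gain index
        if best is None or gain > best[0]:
            best = (gain, sv, e0)
    if best is None:
        raise ValueError("no usable merge multiple in [ES, EL]")
    return best[1], best[2]


def optimal_bucket_size(ws, dl, candidates=(128, 256)):
    """Pick the bucket size per claim 9, or None if neither candidate fits."""
    upper = dl / 256
    valid = [(ws / c, c) for c in candidates if 1 <= ws / c <= upper]
    if not valid:
        return None                          # case not covered by the claim
    return min(valid)[1]                     # the smaller bucket count wins


ws, e0 = optimal_partition_size([50, 150] * 6, vh=100, ds=100, dl=400)
bucket = optimal_bucket_size(ws, dl=400)
```

In this toy series the increments alternate between 50 and 150, so merging an even number of them removes all fluctuation and the largest even multiple within range is selected.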

Description

Real-time data warehouse increment synchronization method based on a HUDI data lake

Technical Field

The invention relates to the technical field of HUDI data lake data synchronization, and in particular to a real-time data warehouse increment synchronization method based on a HUDI data lake.

Background

HUDI data lake data synchronization technology is a technical system that relies on core features of the HUDI data lake, such as incremental storage, partitioning and bucketing, and metadata management, to transmit source data efficiently and consistently to the HUDI lakehouse and update it dynamically. It covers the whole flow of data capture, cleaning and conversion, partitioned and bucketed writing according to HUDI specifications, incremental synchronization, and storage optimization. When synchronizing e-commerce data warehouse increment data in real time, the existing HUDI data lake data synchronization technology usually relies on manually set, fixed partition sizes and bucket sizes. Such fixed sizes cannot adapt to the dynamic fluctuation of data warehouse increments and are seriously mismatched with the underlying storage design of HDFS, which readily aggravates the core pain points of real-time synchronization and degrades underlying I/O performance.
Data warehouse increments fluctuate with business peaks and valleys, cycles, and scenarios. With fixed values, low-traffic periods leave partitions empty and bucket data sparse, wasting storage and metadata resources. Peak periods generate a large number of small files, so that HUDI compaction tasks and Flink synchronization compete for cluster resources and data landing latency rises sharply. Under burst increments the number of small files grows exponentially, causing synchronization task backlog, sudden throughput drops, and even restarts, which destroys synchronization stability. Fixed values also deviate easily from the mainstream HDFS block sizes of 128 MB/256 MB, causing cross-block storage, so that sequential query I/O becomes random I/O and scan efficiency drops considerably. Unreasonable bucket counts arise as well: either too many buckets, which bloats metadata and increases query latency, or oversized single buckets, which block queries. The hierarchical fit between partitions and buckets is also broken, producing half-full and empty buckets and further I/O loss. Therefore, when synchronizing e-commerce data warehouse increment data in real time, the existing HUDI data lake data synchronization technology cannot dynamically adjust the partition size and bucket size of the HUDI data lake according to historical data warehouse increments and then synchronize the increments.
Disclosure of Invention

The invention aims to solve, at least to some extent, one of the technical problems in the prior art. By acquiring the finest time partition granularity of the HUDI data lake and collecting data warehouse increment sizes at the corresponding time partition granularity, preprocessing the data and analyzing the increment sizes to obtain an optimal partition size, adjusting the time partition size of the HUDI data lake and acquiring an optimal bucket size to obtain optimal bucket information, and adjusting the bucket size of the HUDI data lake and synchronizing data warehouse increments in real time, the invention solves the problem that, when e-commerce data warehouse increment data is synchronized in real time, the existing HUDI data lake data synchronization technology cannot dynamically adjust the partition size and bucket size of the HUDI data lake according to historical data warehouse increments and then synchronize the increments.

To achieve the above purpose, the application provides a real-time data warehouse increment synchronization method based on a HUDI data lake, comprising the following steps: acquiring the finest time partition granularity of the HUDI data lake, and collecting data warehouse increment sizes at the corresponding time partition granularity to obtain initial data warehouse increment data; preprocessing the initial data warehouse increment data and analyzing the increment sizes to obtain an optimal partition size; adjusting the time partition size of the HUDI data lake according to the optimal partition size, and acquiring an optimal bucket size to obtain optimal bucket information; and adjusting the bucket size of the HUDI data lake according to the optimal bucket information, and synchronizing data warehouse increments in real time.
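
As an illustration of the preprocessing step named above, the global fluctuation range [TL, TU] defined in claim 4 can be computed as in the following minimal Python sketch. It is an assumption-laden example rather than the method's actual implementation: the quartile interpolation method, the use of population standard deviations, and the function name are all choices made here and are not specified in the source.

```python
import statistics


def global_fluctuation_range(increments):
    """Return (TL, TU) for a list of unit-granularity increments (claim 4)."""
    # Quartiles of the raw data; the interpolation method is an assumption.
    aq1, _, aq3 = statistics.quantiles(increments, n=4)
    iq = aq3 - aq1                            # interquartile range IQ

    # 3-sigma rule: drop points more than 3 standard deviations from the mean.
    mean = statistics.fmean(increments)
    std = statistics.pstdev(increments)
    kept = [x for x in increments if abs(x - mean) <= 3 * std]

    ap = statistics.fmean(kept)               # corrected mean AP
    ab = statistics.pstdev(kept)              # corrected standard deviation AB

    tl = max(aq1 - 1.5 * iq, ap - 2 * ab)
    tu = min(aq3 + 1.5 * iq, ap + 2 * ab)
    return tl, tu


# Nineteen ordinary increments plus one burst of 1000 (made-up numbers).
sample = [90, 100, 110] * 6 + [100, 1000]
tl, tu = global_fluctuation_range(sample)
```

Taking the larger of the two lower bounds and the smaller of the two upper bounds means the range is always the tighter of the quartile-based and the mean-based estimates, so the burst value in the sample falls well outside it.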
Further, obtaining HUDI the finest time partition granularity of the data lake, and collecting the increm