CN-116303790-B - Billion data synchronization method, device and storage medium based on distributed environment

Abstract

The invention relates to a billion-level data synchronization method, device and storage medium based on a distributed environment, applied in the technical field of big data synchronization. Based on distributed computers, a data synchronization task is split so that a plurality of computers synchronize the split subtasks simultaneously, which greatly improves the data synchronization rate; a user can freely adjust the number of working computers according to the amount of data to be synchronized. During data synchronization, an operation is performed according to the slicing algorithm specified in the scheduling method and the number of target tables in the target data source, so as to obtain the target table to which each piece of data in the original data source should be synchronized. Single-table data in the original data source is thereby split into multi-table data in the target data source, which effectively reduces the read-write pressure on each target table in the target data source.

Inventors

ZHU CHAOYANG
ZHANG SHENGFEI

Assignees

上海中通吉网络技术有限公司

Dates

Publication Date: 20260512
Application Date: 20230321

Claims (6)

1. A billion-level data synchronization method based on a distributed environment, the method comprising:
configuring a scheduling method of the distributed computers in a distributed task scheduling tool, and triggering the deployed distributed computers through the distributed task scheduling tool;
after the distributed computers are triggered, determining a management computer and a plurality of subtask computers according to the scheduling method of the distributed computers configured in the distributed task scheduling tool;
splitting, by the management computer, the data synchronization task into a plurality of subtasks according to the scheduling method, and distributing the split subtasks to the management computer and the plurality of subtask computers;
wherein splitting the data synchronization task into a plurality of subtasks and distributing the split subtasks comprises: after the management computer splits the data synchronization task into a plurality of subtasks according to the scheduling method, persisting the split subtasks into a task manager, wherein, before running the data synchronization, all the subtask computers and the management computer continuously query the task manager for tasks by timed polling until the tasks are acquired;
reading, by the management computer and the subtask computers, data from the original data source according to their own subtasks, and performing an operation according to the number of target tables and the slicing algorithm specified in the scheduling method, so as to obtain the target table of the target data source to which each piece of data in the original data source should be synchronized, until all original data has been synchronized to the target tables of the target data source;
after the management computer or a subtask computer acquires a subtask, splitting the subtask again according to the scheduling method to obtain second subtasks, and simultaneously executing the second subtasks split from that computer's subtask;
wherein performing the operation according to the number of target tables and the slicing algorithm specified in the scheduling method comprises: ordering all data tables in the target data source, so that each data table has a corresponding sequence number; and parsing, by the management computer or the subtask computer according to its own second subtask, the numerical value of each corresponding column of data from the original data source as a slicing value, and calculating the remainder of the slicing value divided by the number of data tables in the target data source, wherein the remainder is the sequence number of the data table in the target data source in which that column of data from the original data source should be stored.
2. The method of claim 1, wherein the scheduling method further comprises a data synchronization mode; the data synchronization mode comprises a full synchronization mode and a full-plus-incremental synchronization mode; the full synchronization mode comprises: when the management computer or the subtask computer starts to execute a subtask, taking the time point at which execution starts as a reference, and synchronizing all data present in the original data source at that time point into the target data source; the full-plus-incremental synchronization mode comprises: sensing, through a database plug-in, the incremental data generated during the full synchronization, and collecting the incremental data into a distributed queue; after the full synchronization is completed, all management computers or subtask computers that have completed the full synchronization automatically consume the incremental data in the distributed queue; after the incremental data is consumed, the incremental data is again split into a plurality of subtasks, and the management computer or the subtask computer synchronizes the incremental data into the target tables of the target data source according to the acquired subtasks.
3. The method of claim 2, wherein after the management computer splits the data synchronization task into a plurality of subtasks, the management computer synchronizes the split subtask results into a state manager, and each time the management computer or a subtask computer completes a second subtask, the management computer synchronizes the completed state into the state manager.
4. The method of claim 1, wherein after the management computer or the subtask computer acquires the original data from the original data source, the original data is directly packaged and passed to a user in the form of method parameters; the user processes the original data received as method parameters and returns the processed data to the management computer or the subtask computer, and the management computer or the subtask computer synchronizes the processed data to the target data source.
5. A billion-level data synchronization apparatus based on a distributed environment, the apparatus comprising:
a task triggering module, used for configuring a scheduling method of the distributed computers in a distributed task scheduling tool, and triggering the deployed distributed computers through the distributed task scheduling tool;
a computer classification module, used for determining a management computer and a plurality of subtask computers according to the scheduling method of the distributed computers configured in the distributed task scheduling tool after the distributed computers are triggered;
a task splitting module, used for splitting, by the management computer, the data synchronization task into a plurality of subtasks according to the scheduling method, and distributing the split subtasks to the management computer and the plurality of subtask computers;
wherein splitting the data synchronization task into a plurality of subtasks and distributing the split subtasks comprises: after the management computer splits the data synchronization task into a plurality of subtasks according to the scheduling method, persisting the split subtasks into a task manager, wherein, before running the data synchronization, all the subtask computers and the management computer continuously query the task manager for tasks by timed polling until the tasks are acquired;
a data synchronization module, used for reading, by the management computer and the subtask computers, data from the original data source according to their own subtasks, and performing an operation according to the number of target tables and the slicing algorithm specified in the scheduling method, so as to obtain the target table of the target data source to which each piece of data in the original data source should be synchronized, until all original data has been synchronized to the target tables of the target data source;
wherein, after the management computer or a subtask computer acquires a subtask, the subtask is split again according to the scheduling method to obtain second subtasks, and the second subtasks split from that computer's subtask are executed simultaneously;
wherein performing the operation according to the number of target tables and the slicing algorithm specified in the scheduling method comprises: ordering all data tables in the target data source, so that each data table has a corresponding sequence number; and parsing, by the management computer or the subtask computer according to its own second subtask, the numerical value of each corresponding column of data from the original data source as a slicing value, and calculating the remainder of the slicing value divided by the number of data tables in the target data source, wherein the remainder is the sequence number of the data table in the target data source in which that column of data from the original data source should be stored.
6. A storage medium storing a computer program which, when executed by a master, implements the steps of the method for synchronizing billions of data in a distributed environment as claimed in any one of claims 1-4.
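
The table-selection rule recited in claims 1 and 5 (take a slicing value, compute its residual value, i.e. the remainder, against the number of target tables, and use the result as the table's sequence number) can be illustrated with a minimal Python sketch. The function names, the use of a row dictionary, and the choice of sorting table names to assign sequence numbers are assumptions made for illustration only; the claims do not prescribe them.

```python
def target_table_index(shard_value: int, table_count: int) -> int:
    """Return the sequence number of the target table for one slicing value."""
    if table_count <= 0:
        raise ValueError("table_count must be positive")
    # Remainder of the slicing value divided by the number of target tables.
    return shard_value % table_count


def route_rows(rows, shard_column, target_tables):
    """Group rows by the target table selected with the remainder rule.

    Assumption: sorting the table names is used here as the fixed ordering
    that gives every target table a sequence number (0, 1, 2, ...).
    """
    ordered = sorted(target_tables)
    routed = {name: [] for name in ordered}
    for row in rows:
        index = target_table_index(int(row[shard_column]), len(ordered))
        routed[ordered[index]].append(row)
    return routed
```

For example, with three hypothetical target tables `order_0`, `order_1` and `order_2`, a row whose slicing value is 10 is routed to `order_1`, because 10 divided by 3 leaves a remainder of 1.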

Description

Billion data synchronization method, device and storage medium based on distributed environment

Technical Field

The invention relates to the technical field of big data synchronization, and in particular to a billion-level data synchronization method, device and storage medium based on a distributed environment.

Background

In the prior art, data synchronization is usually performed by a single computer. When hundred-million-level data must be synchronized, the synchronization rate is very slow, and a fixed single computer offers no scalability: the number of computers cannot be flexibly increased or decreased according to the data volume. In addition, the prior art does not support splitting single-table data in the original data source into a plurality of data tables in the target data source during synchronization, so the read-write pressure on the single target table is high.

Disclosure of Invention

In view of the above, the present invention aims to provide a billion-level data synchronization method, device and storage medium based on a distributed environment, so as to solve two problems of the prior art. First, when a single computer performs data synchronization and the amount of data to be synchronized is large, the synchronization rate is slow and the number of computers cannot be flexibly increased or decreased according to the data volume. Second, the existing data synchronization process does not support splitting single-table data in the original data source into a plurality of data tables in the target data source, so the read-write pressure on the single target table is high.
According to a first aspect of embodiments of the present invention, there is provided a billion-level data synchronization method based on a distributed environment, the method comprising: configuring a scheduling method of the distributed computers in a distributed task scheduling tool, and triggering the deployed distributed computers through the distributed task scheduling tool; after the distributed computers are triggered, determining a management computer and a plurality of subtask computers according to the scheduling method of the distributed computers configured in the distributed task scheduling tool; splitting, by the management computer, the data synchronization task into a plurality of subtasks according to the scheduling method, and distributing the split subtasks to the management computer and the plurality of subtask computers; and reading, by the management computer and the subtask computers, data from the original data source according to their own subtasks, and performing an operation according to the number of target tables and the slicing algorithm specified in the scheduling method, so as to obtain the target table of the target data source to which each piece of data in the original data source should be synchronized, until all original data has been synchronized to the target tables of the target data source.
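
The split-and-distribute step described above can be sketched as follows. This is only an illustration under stated assumptions: the source does not say how a task is divided, so splitting a contiguous primary-key range is an assumption, and `Subtask`, `split_task` and `InMemoryTaskManager` are hypothetical names standing in for the subtasks and the task manager that the computers poll.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Subtask:
    start_id: int  # inclusive lower bound of the key range (assumed layout)
    end_id: int    # exclusive upper bound of the key range


def split_task(min_id: int, max_id: int, parts: int):
    """Split the half-open range [min_id, max_id) into at most `parts` subtasks."""
    total = max_id - min_id
    if total <= 0 or parts <= 0:
        return []
    size = -(-total // parts)  # ceiling division
    return [Subtask(s, min(s + size, max_id)) for s in range(min_id, max_id, size)]


class InMemoryTaskManager:
    """Hypothetical stand-in for the task manager; a real one would be persistent."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = []

    def persist(self, subtasks):
        """Store the subtasks produced by the management computer."""
        with self._lock:
            self._pending.extend(subtasks)

    def poll(self):
        """Hand out one unclaimed subtask, or None if nothing is pending."""
        with self._lock:
            return self._pending.pop(0) if self._pending else None
```

In this sketch the management computer would call `split_task` and `persist` once, and every computer (the management computer included) would call `poll` on a timer until it receives a subtask. A computer could call `split_task` again on its own range to obtain second subtasks.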
Preferably, splitting, by the management computer, the data synchronization task into a plurality of subtasks according to the scheduling method, and distributing the split subtasks to the management computer and the plurality of subtask computers comprises: after the management computer splits the data synchronization task into a plurality of subtasks according to the scheduling method, persisting the split subtasks into a task manager, wherein, before running the data synchronization, all the subtask computers and the management computer continuously query the task manager for tasks by timed polling until the tasks are acquired.
Preferably, after the management computer or a subtask computer acquires a subtask, the subtask is split again according to the scheduling method to obtain second subtasks, and the second subtasks split from that computer's subtask are executed simultaneously.
Preferably, performing the operation according to the number of target tables and the slicing algorithm specified in the scheduling method, so as to obtain the target table of the target data source to which each piece of data in the original data source should be synchronized, comprises: ordering all data tables in the target data source, so that each data table has a corresponding sequence number; and parsing, by the management computer or the subtask computer according to its own second subtask, the numerical value of each corresponding column of data from the original data source as a slicing value, and calculates the residual val