CN-121996727-A - Distributed data processing method and device for insurance industry


Abstract

The invention discloses a distributed data processing method and device for the insurance industry. It addresses the performance bottlenecks and poor scalability of traditional centralized databases in the insurance industry under massive-data scenarios. By combining distributed database middleware with an MPP architecture, it achieves unified stream-batch processing of high-concurrency transaction processing and batch analysis tasks, and improves the processing efficiency and availability of the system.

Inventors

QIAO LAN
ZHANG NA
WANG YANLONG
YANG MENG
MA NAN
CHEN HE
LI DAIJIANG

Assignees

人保信息科技有限公司
中国人民保险集团股份有限公司

Dates

Publication Date: 20260508
Application Date: 20251212

Claims (10)

1. A distributed data processing method for the insurance industry, comprising: S1, constructing distributed database middleware based on a domestic centralized database, and sharding service data across a plurality of database nodes according to a preset rule through an intelligent routing mechanism, to realize vertical database splitting and load balancing; S2, capturing real-time changes to OLTP service data in the distributed database, converting the changed data into a standardized data format, and transmitting it to a message queue to form a heterogeneous data synchronization channel; S3, distributing the data stream in the message queue to an MPP-architecture database cluster, generating a distributed execution plan based on SQL parsing, and pushing the plan down to the corresponding nodes for parallel computation; and S4, monitoring database node status through a federated cluster management component, automatically switching to a standby node when the primary node fails, and feeding batch processing results back to the front-end application.
2. The method of claim 1, wherein constructing the distributed database middleware based on a domestic centralized database, and sharding service data across a plurality of database nodes according to a preset rule through an intelligent routing mechanism to realize vertical database splitting and load balancing, further comprises: S11, adding a SHARDINGKEY: shardingValue identifier to the request header of an application system, the PDFC-Sharding middleware parsing the identifier and determining the target sub-database cluster using a HASH algorithm; and S12, configuring the TARGETSERVERTYPE parameter in the JDBC connection string as master or slave, and combining it with the autoBalance parameter to realize automatic load balancing and failover between the master and slave nodes.
3. The method of claim 1, wherein capturing real-time changes to OLTP service data in the distributed database, converting the changed data into a standardized data format, and transmitting it to a message queue to form a heterogeneous data synchronization channel, further comprises: S21, capturing the underlying log of the openGauss database in real time using the exBase tool, and generating incremental data change records by parsing the DML operations in the binlog; and S22, serializing the incremental data in JSON format, and realizing asynchronous transmission of cross-system heterogeneous data through a Kafka message queue.
4. The method of claim 1, wherein distributing the data stream in the message queue to an MPP-architecture database cluster, generating a distributed execution plan based on SQL parsing, and pushing the plan down to the corresponding nodes for parallel computation, further comprises: S31, reading the JSON data stream in the message queue through Kafka Streams, and converting it into a columnar storage format that the MPP architecture can process; and S32, splitting the computation task into a plurality of subtasks based on the SQL parsing result, and pushing the subtasks down to the MPP nodes that hold the corresponding data partitions, according to the data distribution characteristics.
5. The method of claim 1, wherein monitoring database node status through the federated cluster management component, automatically switching to a standby node when the primary node fails, and feeding batch processing results back to the front-end application, further comprises: S41, using the CM cluster management component bundled with openGauss to monitor the CPU utilization, memory usage, and disk I/O throughput of each node in real time; and S42, when a primary node failure is detected, selecting an optimal node from the standby nodes for primary-standby switchover, according to the load balancing strategy configured by the autoBalance parameter.
6. The method of claim 1, further comprising: S5, performing data verification on the data stream transmitted to the MPP architecture, by comparing the hash values of the primary keys in the source database and in the target MPP node, and using a compensation mechanism to retransmit lost or abnormal data.
7. A distributed data processing apparatus for the insurance industry, comprising: a distributed database middleware construction module, configured to construct distributed database middleware based on a domestic centralized database, and to shard service data across a plurality of database nodes according to a preset rule through an intelligent routing mechanism, to realize vertical database splitting and load balancing; a real-time data capture and synchronization module, configured to capture real-time changes to OLTP service data in the distributed database, convert the changed data into a standardized data format, and transmit it to a message queue to form a heterogeneous data synchronization channel; a data stream distribution and parallel computation module, configured to distribute the data stream in the message queue to an MPP-architecture database cluster, generate a distributed execution plan based on SQL parsing, and push the plan down to the corresponding nodes for parallel computation; and a federated cluster monitoring and failover module, configured to monitor database node status through a federated cluster management component, automatically switch to a standby node when the primary node fails, and feed batch processing results back to the front-end application.
8. The apparatus of claim 7, wherein the distributed database middleware construction module is further configured to: add a 'SHARDINGKEY: shardingValue' identifier to the request header of an application system, the PDFC-Sharding middleware parsing the identifier and determining the target sub-database cluster using a HASH algorithm; and configure the 'TARGETSERVERTYPE' parameter in the JDBC connection string as master or slave, and combine it with the 'autoBalance' parameter to realize automatic load balancing and failover between the master and slave nodes.
9. The apparatus of claim 7, wherein the real-time data capture and synchronization module is further configured to: capture the underlying log of the openGauss database in real time using the exBase tool, and generate incremental data change records by parsing the DML operations in the binlog; and serialize the incremental data in JSON format, and realize asynchronous transmission of cross-system heterogeneous data through a Kafka message queue.
10. The apparatus of claim 7, wherein the data stream distribution and parallel computation module is further configured to: read the JSON data stream in the message queue through Kafka Streams, and convert it into a columnar storage format that the MPP architecture can process; and split the computation task into a plurality of subtasks based on the SQL parsing result, and push the subtasks down to the MPP nodes that hold the corresponding data partitions, according to the data distribution characteristics.
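
For illustration only, the hash-based routing of claim 2 and the hash-comparison verification of claim 6 might be sketched as follows. This is a minimal sketch under stated assumptions, not the PDFC-Sharding implementation: the function names, the choice of MD5 and SHA-256, the shard count, and the decision to hash the whole row (keyed by primary key) are all hypothetical and are not specified in the claims.

```python
import hashlib
import json


def parse_sharding_header(header_line: str) -> str:
    """Extract the sharding value from a 'SHARDINGKEY: <value>' header line."""
    name, _, value = header_line.partition(":")
    if name.strip() != "SHARDINGKEY":
        raise ValueError("not a SHARDINGKEY header")
    return value.strip()


def route_shard(sharding_value: str, shard_count: int) -> int:
    """Map a sharding value to a shard index using a stable hash.

    Assumption: MD5 modulo shard_count stands in for the unspecified
    HASH algorithm; Python's built-in hash() is avoided because it is
    randomized per process for strings.
    """
    digest = hashlib.md5(sharding_value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def row_hash(row: dict) -> str:
    """Hash a row deterministically (keys sorted before serialization)."""
    payload = json.dumps(row, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def rows_to_resend(source_rows, target_rows, pk="id"):
    """Return primary keys whose hash is missing or different on the target.

    These are the rows a compensation step would retransmit.
    """
    source = {r[pk]: row_hash(r) for r in source_rows}
    target = {r[pk]: row_hash(r) for r in target_rows}
    return sorted(k for k, h in source.items() if target.get(k) != h)


if __name__ == "__main__":
    value = parse_sharding_header("SHARDINGKEY: P2025001")
    print("shard index:", route_shard(value, 4))

    src = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}, {"id": 3, "amt": 30}]
    tgt = [{"id": 1, "amt": 10}, {"id": 2, "amt": 99}]
    print("resend:", rows_to_resend(src, tgt))
```

A stable digest is used for routing so that the same sharding value always reaches the same sub-database cluster across processes and restarts. A real deployment would also need a resharding strategy, which the claims do not describe.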

Description

Distributed data processing method and device for insurance industry

Technical Field

The invention relates to the technical field of computer software and information, and in particular to a distributed data processing method and device for the insurance industry.

Background

With the widespread adoption of the Internet, the continued growth of business in the insurance industry has brought a rapid increase in data volume and transaction volume. The original monolithic systems can no longer bear the load, and companies have one after another begun to split their business systems. Meanwhile, with digital transformation, business data is becoming ever more valuable. When upgrading their technology, companies use distributed technologies, such as distributed microservices and distributed databases, to solve the storage and performance bottlenecks of monolithic services, and use stream computing and big data technologies, such as Flink and Hadoop, to solve data processing and massive-data analysis problems. However, there is still no good solution to the data dispersion caused by splitting monolithic systems, or to making better use of data spread across multiple systems.

As Internet applications spread, data volume keeps growing. Once a database reaches a certain scale, its performance gradually degrades, and storage and I/O become prominent bottlenecks of a monolithic database. Oracle RAC clusters move Oracle from a single-machine mode to a multi-machine parallel mode, providing load balancing and failover across database nodes and ensuring high availability for applications.
However, compared with a single machine, the underlying technology of an RAC cluster is complex and its management is more complex still. Because the cluster shares resources, there is a risk of resource contention; in particular, scalability is poor under large-scale data and high concurrency, which leads to system and performance bottlenecks. In the insurance industry, an Oracle RAC database is usually used to support transactional and batch business, with system data processing implemented in stored procedures. As business data volume grows, single tables reach hundreds of millions of records and database storage reaches the terabyte level. Multi-table joins during data processing often overwhelm the database server. The most typical case is the merge operation (merging incremental data into the full data set): under the RAC architecture, a merge on a table with hundreds of millions of records is predictably slow, and it requires optimization work such as table partitioning and cache tuning, which costs time and effort. In addition, technology "chokepoint" incidents have occurred from time to time in recent years; achieving localization and independent control of technology is a major issue facing the insurance industry, and also an opportunity for a breakthrough. Therefore, a new architecture is urgently needed to replace the above architecture and meet actual business and technical needs.
Disclosure of Invention

The invention aims to provide a distributed data processing method for the insurance industry. It targets problems of insurance business such as long core transaction times, uneven load distribution, surging online traffic, surging data scale, high business complexity, and numerous batch processing tasks. Based on openGauss, an open-source domestic centralized database, it introduces PDFC-Sharding, a lightweight data middleware developed in-house by the PICC Group (人保集团), to build a distributed stream-batch data processing architecture consisting of database middleware, a centralized database, a data replication tool, stream processing, and MPP.

Another object of the present invention is to provide a distributed data processing apparatus for the insurance industry.

To achieve the above objects, an embodiment of a first aspect of the present invention provides a distributed data processing method for the insurance industry, including: S1, constructing distributed database middleware based on a domestic centralized database, and sharding service data across a plurality of database nodes according to a preset rule through an intelligent routing mechanism, to realize vertical database splitting and load balancing; S2, capturing real-time changes to OLTP service data in the distributed database, converting the changed data into a standardized data format, and transmitting it to a message queue to form a heterogeneous data synchronization channel; S3, distributing the data stream in the message queue to an MPP-architecture database cluster, generating a distributed execution plan based on SQL parsing, and pushing down to a correspondin