CN-121997373-A - Big data privacy protection analysis system based on federated learning
Abstract
The invention relates to the technical field of big data analysis and privacy protection, and discloses a big data privacy protection analysis system based on federated learning, comprising a distributed data node cluster, a federated coordination node, an encryption communication module, a privacy-enhanced federated learning module, and a result verification and output module. The distributed data node cluster comprises a plurality of heterogeneous data nodes; each data node is provided with a data preprocessing sub-module and a data format conversion sub-module and stores its local original big data. Through a five-layer privacy protection mechanism (de-identification during preprocessing, hybrid encrypted communication, homomorphic encryption of model parameters, differential-privacy perturbation, and desensitized output), the invention realizes full-pipeline privacy protection from data storage, preprocessing, and transmission through model training to result output, effectively avoids the risks of original data leakage and reverse inference of model parameters, and meets the requirements of the relevant laws and regulations on data privacy protection.
Inventors
- YUAN NING
- KANG JIANHUA
- LI JINGHUI
- ZHANG NING
- LU SHIQING
Assignees
- Tianjin Ren'ai College (天津仁爱学院)
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-01-28
Claims (10)
- 1. A big data privacy protection analysis system based on federated learning, characterized by comprising a distributed data node cluster, a federated coordination node, an encryption communication module, a privacy-enhanced federated learning module, and a result verification and output module; the distributed data node cluster comprises a plurality of heterogeneous data nodes, each data node is provided with a data preprocessing sub-module and a data format conversion sub-module, local original big data is stored locally, only encrypted local model parameters are output externally, and neither original data nor unencrypted model parameter information is revealed; the data preprocessing sub-module is used for performing de-identification, outlier rejection, and data standardization on the local original big data; the data format conversion sub-module is used for converting heterogeneous data, comprising structured data, semi-structured data, and unstructured data, into a unified tensor format; the federated coordination node is used for initializing global model parameters, receiving the encrypted model parameters uploaded by each data node, executing global model aggregation and update, and transmitting the updated global model parameters back to each data node; the encryption communication module adopts a hybrid encryption mechanism to realize bidirectional encrypted data transmission between the distributed data node cluster and the federated coordination node, the hybrid encryption mechanism combining an asymmetric encryption algorithm for key negotiation with a symmetric encryption algorithm for data transmission; the privacy-enhanced federated learning module is deployed at each data node and at the federated coordination node, is used for completing local model training based on local data, and introduces a differential privacy mechanism in the global model aggregation process to perturb the aggregation parameters, thereby realizing privacy protection during model training; the result verification and output module is used for verifying the validity of the analysis results of the global model and outputting a final analysis report in desensitized form.
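The round trip described in claim 1 (local training, upload of parameters, weighted aggregation with differential-privacy perturbation, redistribution) can be sketched in miniature. This is an illustrative sketch only: the encryption step is elided (parameters are handled in the clear here, where the claim requires homomorphic encryption), and the toy 1-D linear model, learning rate, and privacy budget are assumptions not taken from the patent.

```python
import math
import random

random.seed(0)  # deterministic for this illustration

def local_train(weights, data, lr=0.1):
    # one gradient-descent step on a toy 1-D linear model y = w * x
    # (stand-in for the local training sub-module at each data node)
    w = weights[0]
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return [w - lr * grad]

def laplace_noise(scale):
    # inverse-CDF sampling of a zero-mean Laplace distribution
    u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1 - 2 * abs(u))

def aggregate(node_params, node_sizes, epsilon, sensitivity=1.0):
    # weighted average of the local parameters, then Laplace perturbation
    # with scale sensitivity / epsilon (the differential privacy mechanism)
    total = sum(node_sizes)
    dim = len(node_params[0])
    avg = [sum(p[i] * n for p, n in zip(node_params, node_sizes)) / total
           for i in range(dim)]
    return [w + laplace_noise(sensitivity / epsilon) for w in avg]

# twenty federated rounds over two nodes whose data follows y = 2x
nodes = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
global_w = [0.0]
for _ in range(20):
    local = [local_train(global_w, data) for data in nodes]
    global_w = aggregate(local, [len(d) for d in nodes], epsilon=1e6)
```

With a very large privacy budget the noise is negligible and the global weight converges toward the true slope 2; shrinking epsilon trades accuracy for privacy, which is exactly the knob claim 5 later adjusts per sensitivity level.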
- 2. The big data privacy protection analysis system based on federated learning according to claim 1, wherein the de-identification processing of the data preprocessing sub-module comprises deleting personal sensitive identifiers and encrypting and desensitizing identifiers that must be retained; outlier rejection adopts the box-plot (IQR) method; and data standardization normalizes the data to the [0, 1] interval.
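The three preprocessing operations in claim 2 are standard techniques; a minimal stdlib sketch follows. The salted-hash pseudonymization and the quartile interpolation scheme are illustrative assumptions, since the claim only names the operations, not their implementations.

```python
import hashlib

def pseudonymize(identifier, salt="node-local-salt"):
    # replace a direct identifier with a salted hash (one possible form of the
    # "encryption and desensitization" of retained identifiers in claim 2)
    return hashlib.sha256((salt + identifier).encode()).hexdigest()[:16]

def iqr_filter(values, k=1.5):
    # box-plot method: drop points outside [Q1 - k*IQR, Q3 + k*IQR]
    s = sorted(values)
    def quartile(q):
        pos = q * (len(s) - 1)
        lo, hi = int(pos), min(int(pos) + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (pos - lo)
    q1, q3 = quartile(0.25), quartile(0.75)
    iqr = q3 - q1
    return [v for v in values if q1 - k * iqr <= v <= q3 + k * iqr]

def min_max_normalize(values):
    # scale to the [0, 1] interval required by claim 2
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Applied to a column such as `[1, 2, 3, 4, 100]`, the IQR filter rejects the outlier 100 before normalization, so a single extreme value cannot compress the rest of the data into a narrow band of [0, 1].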
- 3. The big data privacy protection analysis system based on federated learning according to claim 1, wherein the asymmetric encryption algorithm in the encryption communication module adopts the RSA-2048 algorithm and the symmetric encryption algorithm adopts the AES-256 algorithm, and the encryption communication module is further provided with a key dynamic update sub-module for periodically updating the symmetric encryption key based on a timestamp and node identity information.
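One common way to realize the key dynamic update sub-module of claim 3 is to derive each session key from a pre-shared master secret, the node identity, and a quantized timestamp with HMAC-SHA256, yielding 32 bytes suitable as an AES-256 key. The rotation interval, field layout, and master-secret distribution are assumptions; the claim's RSA-2048 key negotiation (not shown) would establish that master secret in practice.

```python
import hmac
import hashlib

def derive_session_key(master_secret: bytes, node_id: str, epoch_seconds: int,
                       rotation_interval: int = 3600) -> bytes:
    # quantize time into windows so that both endpoints derive the same key
    # within a window, and the key rotates automatically at the boundary
    window = epoch_seconds // rotation_interval
    msg = f"{node_id}|{window}".encode()
    # HMAC-SHA256 output is 32 bytes, i.e. a full AES-256 key
    return hmac.new(master_secret, msg, hashlib.sha256).digest()
```

Because the derivation is deterministic, no key material ever travels over the wire after the initial negotiation; compromise of one window's key does not reveal the keys of other windows unless the master secret itself leaks.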
- 4. The big data privacy protection analysis system based on federated learning according to claim 1, wherein the privacy-enhanced federated learning module comprises a local training sub-module, a parameter encryption sub-module, a global aggregation sub-module, and a mode adaptation sub-module; the local training sub-module adopts a gradient descent algorithm to complete local model training and generate local model parameters; the parameter encryption sub-module performs homomorphic encryption on the local model parameters and uploads them to the federated coordination node; the global aggregation sub-module adopts a weighted average algorithm to aggregate the encrypted local model parameters and adds Laplace noise through a differential privacy mechanism to complete parameter perturbation; and the mode adaptation sub-module is used for automatically selecting a horizontal federated learning mode, a vertical federated learning mode, or a federated transfer learning mode according to the data distribution characteristics across data nodes.
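Claim 4's mode adaptation sub-module chooses among horizontal, vertical, and transfer federated learning from the data distribution across nodes. The claim does not specify the decision rule; a common heuristic, used here as an assumption, compares the overlap of sample identifiers and of feature names between two nodes: shared features with disjoint samples indicate horizontal FL, shared samples with disjoint features indicate vertical FL, and little overlap in either dimension indicates federated transfer learning.

```python
def select_fl_mode(samples_a, features_a, samples_b, features_b,
                   overlap_threshold=0.5):
    # horizontal FL: same feature space, mostly disjoint samples
    # vertical FL:   same samples, different feature spaces
    # transfer FL:   little overlap in either dimension
    sample_overlap = (len(samples_a & samples_b)
                      / max(1, min(len(samples_a), len(samples_b))))
    feature_overlap = (len(features_a & features_b)
                       / max(1, min(len(features_a), len(features_b))))
    if feature_overlap >= overlap_threshold and sample_overlap < overlap_threshold:
        return "horizontal"
    if sample_overlap >= overlap_threshold and feature_overlap < overlap_threshold:
        return "vertical"
    return "transfer"
```

For example, two hospitals holding the same lab-test columns for different patients would be routed to horizontal FL, while a bank and a retailer holding different columns about the same customers would be routed to vertical FL.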
- 5. The big data privacy protection analysis system based on federated learning according to claim 4, wherein the privacy budget ε of the differential privacy mechanism is dynamically adjusted according to the sensitivity level of the data node: the higher the sensitivity level, the smaller the privacy budget ε and the stronger the noise perturbation, with ε ∈ [0.1, 0.5] when the sensitivity level is high and ε ∈ [1.0, 2.0] when the sensitivity level is low.
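The relationship claim 5 relies on is that of the Laplace mechanism: the noise scale is b = Δf / ε, so a smaller budget ε produces stronger perturbation. A small sketch of the schedule follows; the medium range and the query sensitivity Δf = 1 are assumptions, since the claim only fixes the high and low ranges.

```python
EPSILON_RANGES = {
    "high":   (0.1, 0.5),   # from claim 5: high sensitivity level
    "medium": (0.5, 1.0),   # assumed; the claim fixes only high and low
    "low":    (1.0, 2.0),   # from claim 5: low sensitivity level
}

def noise_scale(sensitivity_level: str, delta_f: float = 1.0) -> float:
    # Laplace mechanism: scale b = delta_f / epsilon; taking the lower end
    # of the range gives the most conservative (largest) noise for the level
    eps_lo, _ = EPSILON_RANGES[sensitivity_level]
    return delta_f / eps_lo
```

With Δf = 1, a high-sensitivity node is perturbed with scale 10 while a low-sensitivity node gets scale 1, matching the claim's statement that higher sensitivity implies stronger disturbance.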
- 6. The big data privacy protection analysis system based on federated learning according to claim 1, wherein the result verification and output module comprises a consistency verification sub-module and a desensitization output sub-module; the consistency verification sub-module verifies the validity of the global model by computing the mean square error (MSE) between each local model's analysis result and the global model's analysis result, and triggers global model retraining when the MSE exceeds a preset threshold; and the desensitization output sub-module adopts data generalization and anonymization to remove sensitive information from the analysis report.
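The consistency check of claim 6 reduces to computing one MSE per node and comparing it to the preset threshold, which claim 7 fixes at 0.05. A minimal sketch:

```python
def needs_retraining(local_results, global_results, threshold=0.05):
    # claim 6: MSE between each local model's analysis result and the
    # global model's result; claim 7 sets the preset threshold to 0.05
    for preds in local_results:
        mse = (sum((l - g) ** 2 for l, g in zip(preds, global_results))
               / len(global_results))
        if mse > threshold:
            return True   # trigger global model retraining
    return False
```

A single divergent node is enough to trigger retraining under this rule, which is the conservative reading of "when the MSE exceeds a preset threshold".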
- 7. The big data privacy protection analysis system based on federated learning according to claim 6, wherein the preset threshold is 0.05.
- 8. The big data privacy protection analysis system based on federated learning according to claim 1, further comprising an anomaly monitoring module configured to monitor the running state, data transmission traffic, and model training process of each data node in real time, and, when abnormal behavior is detected, to generate alarm information and cut off the communication connection of the abnormal node.
- 9. The big data privacy protection analysis system based on federated learning according to claim 8, wherein the abnormal behavior comprises CPU utilization exceeding 90% for 10 consecutive minutes, data transmission traffic spiking by more than 50%, or parameter update amplitudes exceeding ±50%.
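The three rules of claim 9 are directly checkable. In the sketch below the CPU sampling interval (one sample per minute, so ten samples cover the 10-minute window) and the traffic-spike baseline are assumptions the claim leaves open:

```python
def detect_anomalies(cpu_samples, baseline_traffic, current_traffic,
                     param_update):
    # rule checks corresponding to claim 9; returns the list of fired alarms
    alarms = []
    # CPU above 90% continuously for 10 minutes (assumed 1 sample/minute)
    if len(cpu_samples) >= 10 and all(c > 90.0 for c in cpu_samples[-10:]):
        alarms.append("cpu_sustained_high")
    # data transmission traffic more than 50% above the baseline
    if current_traffic > baseline_traffic * 1.5:
        alarms.append("traffic_spike")
    # parameter update amplitude beyond +/-50%
    if abs(param_update) > 0.5:
        alarms.append("param_update_out_of_range")
    return alarms
```

Per claim 8, any non-empty alarm list would generate alarm information and cut the anomalous node's communication connection; the parameter-amplitude rule in particular is a cheap defense against model-poisoning updates.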
- 10. The big data privacy protection analysis system based on federated learning according to claim 1, wherein the federated coordination node is further provided with a model optimization sub-module that dynamically adjusts the hyperparameters of the global model according to the local data distribution characteristics of each data node based on a Bayesian optimization algorithm.
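Claim 10's model optimization sub-module is a sequential model-based search over hyperparameters. A full Bayesian optimizer (Gaussian-process surrogate plus expected improvement) is beyond a short sketch, so the toy below is a deliberately crude stand-in: the "surrogate" predicts via the nearest observed point and the acquisition adds a distance-based uncertainty bonus, UCB-style. The objective, bounds, and budget are all illustrative assumptions.

```python
import random

def toy_smbo(objective, lo, hi, n_init=3, n_iter=10, seed=0):
    # sequential model-based optimization sketch (a crude stand-in for the
    # Bayesian optimization algorithm named in claim 10); maximizes objective
    rng = random.Random(seed)
    observed = [(x, objective(x))
                for x in (lo + rng.random() * (hi - lo) for _ in range(n_init))]
    for _ in range(n_iter):
        candidates = [lo + rng.random() * (hi - lo) for _ in range(50)]
        def acquisition(x):
            # nearest-neighbour "surrogate" prediction + exploration bonus
            nearest = min(observed, key=lambda o: abs(o[0] - x))
            return nearest[1] + abs(nearest[0] - x)
        x = max(candidates, key=acquisition)
        observed.append((x, objective(x)))
    return max(observed, key=lambda o: o[1])[0]
```

In the system's terms, `objective` would be the global model's validation score as a function of one hyperparameter (say, the learning rate), re-evaluated per federated round; the seeded RNG makes the search reproducible across runs.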
Description
Big data privacy protection analysis system based on federated learning

Technical Field

The invention relates to the technical field of big data analysis and privacy protection, and in particular to a big data privacy protection analysis system based on federated learning.

Background

With the rapid development of the digital economy, collaborative big data analysis has become a core supporting means in fields such as financial management and control, medical diagnosis, and government decision-making. However, aggregating original data across institutions carries an extremely high risk of privacy disclosure: once sensitive data such as personal identity information, health records, and transaction records are stored or transmitted in a centralized manner, they are easily stolen, tampered with, or abused, infringing the legal rights and interests of users and potentially violating the mandatory requirements of laws and regulations such as the Personal Information Protection Law, the Data Security Law, and the Cybersecurity Law.

In the prior art, big data privacy protection schemes fall mainly into two categories. The first comprises traditional privacy protection means, including encrypted data storage and static anonymization. Encrypted storage can only guarantee the security of data at rest; the risk of privacy leakage remains once the data is decrypted for analysis, and traditional anonymization (such as deleting identifiers like ID-card numbers and mobile-phone numbers) allows an attacker to restore the original data through association analysis (combining public data or information from other dimensions), so its protective effect is limited.

The second category comprises collaborative analysis schemes based on federated learning, whose core idea is that the data stays in place while the model moves: the participants collaboratively train a global model to mine the value of the data. However, existing federated learning systems still have a number of technical defects. First, communication security is insufficient: model parameter transmission mostly adopts a single encryption mechanism, or the transmission is not encrypted with high strength, so it is easily intercepted and cracked, and the original data characteristics can be leaked through reverse inference on the model parameters. Second, the privacy enhancement mechanism is imperfect: the model aggregation process lacks a dynamically adaptive privacy protection strategy, fixed privacy parameters cannot meet the protection requirements of data at different sensitivity levels, and the aggregation parameters are not effectively perturbed, leaving a reverse-inference risk. Third, adaptability to heterogeneous data is poor: it is difficult to support collaborative analysis of structured data (such as database tables), semi-structured data (such as XML and JSON files), and unstructured data (such as text and images), and the differences in data format lead to low model training efficiency and poor precision. Fourth, anomaly monitoring and fault-tolerance capabilities are lacking: there is no real-time monitoring of node running state, data transmission, or model training behavior, so abnormal conditions such as malicious node attacks and hardware faults cannot be identified in time, and no effective data completion or training compensation mechanism exists after an abnormal node leaves, which affects system stability. Fifth, the model optimization mechanism is rigid: the hyperparameters of the global model are fixed values and are not dynamically adjusted according to the data distribution characteristics of each node, so the model converges slowly and its analysis precision is insufficient.

Therefore, developing a federated learning system that has a multi-layer privacy protection mechanism, adapts to heterogeneous data, supports dynamic model optimization, and possesses both anomaly monitoring and fault-tolerance capabilities has become the key to solving the pain points of the prior art.

Disclosure of Invention

(I) Technical problems to be solved

Aiming at the defects of the prior art, the invention provides a big data privacy protection analysis system based on federated learning, which solves the problems set forth in the background art.

(II) Technical scheme

The aim of the invention is realized by the following technical scheme: the big data privacy protection analysis system based on federated learning comprises a distributed data node cluster, a federated coordination node, an encryption communication module, a privacy-enhanced federated learning module, a result verification and output module, and an anomaly monitoring module, wherein the modules work cooperatively t