
CN-121980606-A - Privacy protection method based on federated learning and differential privacy fusion

CN121980606A

Abstract

The invention belongs to the technical field of privacy protection and, in particular, relates to a privacy protection method based on the fusion of federated learning and differential privacy, comprising the following steps: step 1, determining a layered technical architecture; step 2, implementing the privacy protection algorithm; and step 3, constructing a privacy protection system based on the fusion of federated learning and differential privacy. The method fuses federated learning (FL) and differential privacy (DP); its core innovation is the deep integration of the advantages of the two technologies, achieving effective model training while protecting user data privacy.

Inventors

  • HAN WEI
  • ZHANG YU
  • SONG XUANPEI
  • CHANG HAIBO
  • HU TING
  • JU YAN

Assignees

  • 航天科工智能运筹与信息安全研究院(武汉)有限公司

Dates

Publication Date
20260505
Application Date
20251230

Claims (10)

  1. A privacy protection method based on the fusion of federated learning and differential privacy, characterized in that federated learning trains by keeping data on local devices and exchanging only model updates rather than raw data, which reduces the risk of direct data exposure; with federated learning alone, however, malicious participants can still infer sensitive information about the original training data by analyzing the model updates. To remedy this potential vulnerability, after each client computes its local model update and before that update is sent to the central server for aggregation, noise conforming to the differential privacy definition is strategically injected. The core principle of differential privacy is that, through a carefully designed randomization mechanism, the presence or absence of any single data record has negligible influence on the algorithm's final output, which provides a mathematically rigorous privacy guarantee. Specifically, the amount of added noise is precisely controlled by a key parameter, the privacy budget epsilon, which quantifies the upper bound on the risk of privacy leakage: the smaller the epsilon value, the larger the added noise and the stronger the privacy protection, though model accuracy may drop correspondingly.
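The clip-then-noise step of claim 1 can be sketched minimally in Python. The function names, parameter values, and the pure-Python vector handling are illustrative assumptions, not part of the claimed method; the noise calibration uses the standard Gaussian-mechanism formula sigma = C * sqrt(2 ln(1.25/delta)) / epsilon.

```python
import math
import random

def gaussian_sigma(epsilon: float, delta: float, clip_norm: float) -> float:
    """Standard Gaussian-mechanism scale: sigma = C * sqrt(2 ln(1.25/delta)) / epsilon."""
    return clip_norm * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

def privatize_update(update, epsilon, delta, clip_norm, rng):
    """Clip the local update to L2 norm <= clip_norm, then add calibrated noise."""
    norm = math.sqrt(sum(u * u for u in update))
    scale = min(1.0, clip_norm / max(norm, 1e-12))
    sigma = gaussian_sigma(epsilon, delta, clip_norm)
    return [u * scale + rng.gauss(0.0, sigma) for u in update]

rng = random.Random(0)
raw = [3.0, 4.0]                       # L2 norm 5.0, clipped down to 1.0
noisy = privatize_update(raw, epsilon=1.0, delta=1e-5, clip_norm=1.0, rng=rng)
# Smaller epsilon -> larger sigma -> stronger protection, as the claim states.
```

As the claim notes, lowering epsilon increases sigma and therefore the injected noise, trading model accuracy for privacy.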
  2. The privacy protection method based on the fusion of federated learning and differential privacy of claim 1, wherein the method comprises: step 1, determining a layered technical architecture; step 2, implementing the privacy protection algorithm; and step 3, constructing a privacy protection system based on the fusion of federated learning and differential privacy.
  3. The privacy protection method based on the fusion of federated learning and differential privacy of claim 2, wherein step 1, determining the layered technical architecture, comprises the following sub-steps: step 1.1, data collection stage: a blockchain records each party's data contribution, providing a verifiable privacy-protection and incentive mechanism while keeping raw data local and avoiding the risks of centralized storage; step 1.2, data processing stage: processing is based on hybrid encryption and secure computation, applying homomorphic encryption to support direct operation on encrypted data, and supporting secure multi-party computation via a secret-sharing protocol based on Beaver triples; step 1.3, data publishing stage: privacy-preserving synthetic data is generated with a differentially private generative adversarial network that integrates the Wasserstein distance and gradient normalization in the generator, after which synthetic data quality is assessed and topological data analysis verifies the geometric consistency of the data distribution; step 1.4, access control: a zero-knowledge proof system is used to construct an attribute-proof protocol based on zk-SNARKs, and Hyperledger Indy is integrated for decentralized identity management, so that, for example, a verifier can confirm the truth of the proposition "age >= 18" without learning the user's actual age.
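The Beaver-triple secret-sharing protocol named in step 1.2 can be illustrated with a two-party additive-sharing sketch. The small field size, the trusted-dealer triple generation, and all function names here are simplifying assumptions; production SMPC frameworks generate triples in a separate offline phase.

```python
import random

P = 2**31 - 1  # small prime field, for illustration only

def share(x):
    """Split x into two additive shares mod P; either share alone reveals nothing."""
    r = random.randrange(P)
    return r, (x - r) % P

def reconstruct(s0, s1):
    return (s0 + s1) % P

def beaver_multiply(x_sh, y_sh, triple):
    """Secure multiplication of shared x and y using a Beaver triple
    (a, b, c) with c = a*b, itself held in shared form."""
    (a0, a1), (b0, b1), (c0, c1) = triple
    # d = x - a and e = y - b are opened publicly; since a and b are
    # uniformly random, d and e leak nothing about x and y.
    d = reconstruct((x_sh[0] - a0) % P, (x_sh[1] - a1) % P)
    e = reconstruct((y_sh[0] - b0) % P, (y_sh[1] - b1) % P)
    # Shares of x*y = c + d*b + e*a + d*e (the d*e term added by one party only).
    z0 = (c0 + d * b0 + e * a0 + d * e) % P
    z1 = (c1 + d * b1 + e * a1) % P
    return z0, z1

# Trusted-dealer triple generation (offline phase in real deployments).
a, b = random.randrange(P), random.randrange(P)
triple = (share(a), share(b), share(a * b % P))

x_sh, y_sh = share(21), share(2)
z_sh = beaver_multiply(x_sh, y_sh, triple)
# reconstruct(*z_sh) == 42
```

The identity c + (x-a)b + (y-b)a + (x-a)(y-b) = xy is what lets each party finish the multiplication locally after the two cheap openings.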
  4. The privacy protection method based on the fusion of federated learning and differential privacy of claim 3, wherein in step 2 the privacy protection algorithm is implemented as follows. First, a central server initializes a global model and distributes it to the participating clients. Each client then computes a model update locally on its own private data set. Before the local update is sent back to the server, the client applies a differential privacy mechanism to inject noise conforming to the differential privacy definition into the local update vector; this noise addition is the key step guaranteeing that specific information in a client's raw data cannot be reliably inferred even if the server or another party obtains that single client's update. To prevent a malicious server from snooping on individual updates, the client additionally encrypts the noisy update under a secure aggregation protocol, so that the server can only decrypt the sum of all participants' updates and never obtains the content of any single client's update. After receiving all encrypted noisy updates, the server performs the secure aggregation decryption to obtain the aggregated global update, uses it to improve the global model, and sends the updated model back to the clients for the next round of training. This process repeats until the model converges or a preset number of rounds is reached, with the total privacy consumption of the entire training process strictly tracked and bounded.
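The round structure of claim 4 — local update, clip, add noise, aggregate, apply — can be sketched as below. The toy linear model, learning rate, and noise scale are illustrative assumptions, and the in-memory sum stands in for real secure aggregation, which would use cryptographic masking so the server never sees individual updates.

```python
import math
import random

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def local_update(weights, data, lr=0.1):
    """One gradient-descent step of mean squared error for y ~ w0 + w1*x."""
    n = len(data)
    g0 = sum(weights[0] + weights[1] * x - y for x, y in data) / n
    g1 = sum((weights[0] + weights[1] * x - y) * x for x, y in data) / n
    return [-lr * g0, -lr * g1]

def privatize(update, clip_c, sigma, rng):
    """Clip to L2 norm clip_c, then add Gaussian noise (claim 4's DP step)."""
    scale = min(1.0, clip_c / max(l2_norm(update), 1e-12))
    return [u * scale + rng.gauss(0.0, sigma) for u in update]

def fl_round(weights, client_data, clip_c=1.0, sigma=0.01, seed=0):
    """One federated round: local updates -> clip + noise -> sum -> average."""
    rng = random.Random(seed)
    noisy = [privatize(local_update(weights, d), clip_c, sigma, rng)
             for d in client_data]
    # Stand-in for secure aggregation: the server only ever sees this sum.
    total = [sum(col) for col in zip(*noisy)]
    return [w + t / len(client_data) for w, t in zip(weights, total)]

clients = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]  # both fit y = 2x
w = [0.0, 0.0]
for t in range(50):
    w = fl_round(w, clients, seed=t)
# w[1] drifts toward the true slope 2.0 despite clipping and noise
```

Despite the per-client clipping and noise, the averaged updates still move the global slope toward the shared optimum, which is the accuracy/privacy trade-off the claim describes.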
  5. The privacy protection method based on the fusion of federated learning and differential privacy of claim 4, wherein step 2 comprises: step 2.1, adaptive differential privacy; step 2.2, quantum-secure homomorphic encryption; and step 2.3, cross-modal privacy protection.
  6. The privacy protection method based on the fusion of federated learning and differential privacy of claim 5, wherein in step 2:
     Step 2.1, adaptive differential privacy. A reinforcement-learning-based dynamic privacy-budget allocation algorithm automatically adjusts the noise amount as query sensitivity changes. In traditional FL, DP protects privacy by adding noise to the model gradients uploaded by clients, but a fixed noise strength (fixed epsilon) causes two problems: early in training the model has not converged and the noise severely degrades accuracy, while late in training small noise no longer protects privacy sufficiently. In addition, client data is heterogeneous — the data distributions and sensitivities of different clients differ greatly — so uniform noise is unreasonable. The core idea of adaptive differential privacy is to dynamically adjust the noise strength and the privacy-budget allocation: (1) adjust by training phase, initially using a high epsilon (high noise tolerance) and gradually tightening later; (2) adjust by client contribution, allocating more budget to high-sensitivity or high-importance clients; and (3) adjust by model-parameter importance, adding less noise to key parameters. Dynamic privacy-budget scheduling automatically adjusts the epsilon value based on the model's convergence state via an exponential decay strategy: in each communication round of federated learning, the differential privacy noise size is adjusted according to the current training state and client data sensitivity. First, in dynamic privacy-budget allocation, the privacy budget decays exponentially with the training round:
         epsilon_t = epsilon_0 * exp(-k * t)
     where epsilon_0 is the initial budget, k is the decay coefficient, and t is the current round. Next, client sensitivity perception computes the gradient sensitivity of client i from its local data distribution:
         Delta_i = max_{D, D'} || g_i(D) - g_i(D') ||
     where D and D' are neighboring data sets differing in a single record. Then, for adaptive noise injection, the Gaussian noise standard deviation is
         sigma_i = Delta_i * sqrt(2 * ln(1.25 / delta)) / epsilon_i
     where delta is the failure probability and the client-specific privacy budget is
         epsilon_i = epsilon_t * Delta_ref / max(Delta_i, tau)
     where Delta_ref is the reference sensitivity and tau is a protection threshold. Finally, in the layered noise mechanism the model parameters are grouped layer by layer, and the noise of layer l is
         sigma_l = sigma * alpha_l * exp(-beta * l)
     where alpha_l is the layer-importance factor and beta is the attenuation coefficient.
     Step 2.2, quantum-secure homomorphic encryption. A lattice-, code-, or hash-based encryption scheme resists quantum computing attacks and supports addition and multiplication directly on ciphertexts. In federated learning, each client encrypts its gradient with post-quantum homomorphic encryption, and the server aggregates in the ciphertext domain without ever decrypting individual contributions. Fusing quantum security with homomorphism achieves quantum-secure gradient aggregation; gradient quantization reduces the homomorphic-encryption computation overhead, and privacy-protecting noise is added before client-side encryption, giving the double protection of encryption plus differential privacy. The quantum homomorphic encryption algorithm comprises the following steps: step 2.2.1, lattice-based CKKS encryption of the plaintext gradient g:
         c = (c_0, c_1) = (Delta * g + a * s + e, -a)
     where s is the private key, a is a random polynomial, e is an error term, and Delta is the scaling factor; step 2.2.2, ciphertext aggregation of the encrypted gradients c_k of clients k:
         C = sum_k c_k
     step 2.2.3, quantum-secure noise injection, in which each client adds DP noise n_k locally before encrypting:
         c_k = Enc(g_k + n_k)
     step 2.2.4, decryption and global model update:
         w_{t+1} = w_t - eta * Dec(C) / K
     where eta is the learning rate and correct decryption requires the accumulated error to remain below Delta / 2.
     Step 2.3, cross-modal privacy protection. When multi-modal data is processed in a federated learning environment, the core principle of cross-modal privacy protection is that hidden associations between data of different modalities can become a new channel of privacy leakage; for example, an attacker may infer related text features by analyzing image updates. The method therefore develops a modality-sensitivity assessment mechanism that dynamically quantifies the privacy requirement of each modality by analyzing its reconstruction risk in feature space. In addition, a dedicated noise-injection module at the fusion layer adjusts its noise parameters in real time according to the cross-modal association strength; this hierarchical protection fundamentally blocks the association-attack path of "inferring one modality from another" while preserving the functional integrity of the multi-modal model. Step 2.3.1, quantifying modality sensitivity: the privacy risk of modality m is
         R_m = E_{x ~ D_m} [ || x - x_hat ||^2 ]
     where x_hat is the reconstruction of the feature and D_m is the data distribution of modality m; step 2.3.2, modality-specific noise: the noise standard deviation of modality m is
         sigma_m = sigma_base * (1 + gamma * R_m)
     where gamma is a sensitivity amplification factor; step 2.3.3, cross-modal correlated noise: the fusion-layer noise is drawn from N(0, Sigma), whose covariance matrix Sigma is determined by the modal mutual information:
         Sigma_{ij} = exp( I(m_i; m_j) / T )
     where I(m_i; m_j) is the mutual information between modalities i and j and T is a temperature coefficient; step 2.3.4, feature decoupling loss: the shared features z_s are separated from the private features z_p via
         L_dec = lambda * I_hat(z_s; z_p)
     where I_hat is an estimate of the mutual information and lambda is a balance factor. The federated learning global objective function is
         L = sum_k L_k + lambda_1 * R_noise + lambda_2 * L_dec
     where R_noise is the noise regularization term and lambda_1, lambda_2 are trade-off coefficients. The privacy guarantees are proved as follows: (1) the adaptive mechanism satisfies (epsilon, delta)-differential privacy, with the privacy loss accumulated by the moments accountant:
         alpha(lambda) = log E[ exp(lambda * C) ]
     where C is the privacy-loss random variable; (2) cross-modal privacy guarantee: the fusion-layer output satisfies differential privacy with an effective budget of epsilon_m + epsilon_corr, where epsilon_corr is a correlation-leakage compensation term.
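The exponential budget decay and layered noise schedule described in step 2.1 can be checked numerically. The functional forms (exp(-k*t) decay across rounds, exp(-beta*l) attenuation across layers) follow the claim's description, but the parameter values and function names here are illustrative assumptions.

```python
import math

def epsilon_at_round(eps0: float, k: float, t: int) -> float:
    """Per-round privacy budget with exponential decay: eps_t = eps0 * exp(-k*t).
    Early rounds get a loose budget (less noise); later rounds are tightened."""
    return eps0 * math.exp(-k * t)

def layer_sigma(base_sigma: float, layer_index: int,
                importance: float, beta: float) -> float:
    """Layered noise: layers with a larger importance factor receive less
    noise, attenuated by exp(-beta * layer_index) with depth."""
    return base_sigma * math.exp(-beta * layer_index) / importance

schedule = [epsilon_at_round(1.0, 0.05, t) for t in range(100)]
# The schedule is strictly decreasing: later rounds get a tighter budget,
# hence (via sigma ~ 1/epsilon) larger noise.
```

Choosing the decay coefficient k sets how quickly the training transitions from accuracy-favoring early rounds to privacy-favoring late rounds.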
  7. The privacy protection method based on the fusion of federated learning and differential privacy of claim 6, wherein step 3 constructs a privacy protection system based on the fusion of federated learning and differential privacy, comprising the following steps:
     Step 3.1, federated learning process management: (1) client selection and scheduling — selecting the clients participating in the current training round from the available client pool according to a strategy; (2) global model distribution — securely distributing the current version of the global model parameters to the selected participating clients; (3) local training instructions — sending training instructions to the selected clients, including the local data set to use, the number of local training epochs, the local optimizer configuration, and the local batch size; (4) local model update collection — receiving the local model updates uploaded by participating clients, a process that requires a secure communication channel; (5) secure model aggregation — coordinating a centralized aggregation process or a decentralized secure aggregation protocol, ensuring that the server cannot snoop on any single client's update before aggregation.
     Step 3.2, differential privacy mechanism implementation: (1) local gradient/update clipping — setting a clipping threshold C and limiting the norm of the update vector to at most C, a key step in meeting the DP sensitivity requirement; (2) local noise injection — on each participating client, adding random noise satisfying differential privacy to the clipped model update, most commonly Gaussian or Laplacian noise, with the noise scale determined by the target privacy budget (epsilon, delta), the clipping threshold C, the client sampling rate q, and the total number of training rounds T; (3) privacy amplification — the system automatically benefits from the privacy amplification provided by federated learning's inherent client sampling, meaning the privacy guarantee actually achieved is stronger than the (epsilon, delta) obtained by applying the same noise over the full population, i.e. epsilon_effective < epsilon.
     Step 3.3, privacy budget management and tracking: (1) budget initialization and allocation — setting an initial total privacy budget (epsilon_total, delta) for the entire training task or for each participating client individually, with delta typically set to a very small value; (2) budget consumption calculation — after each training round, precisely calculating the privacy budget (epsilon_round, delta_round) consumed by that round according to the DP composition theorem adopted; (3) cumulative budget tracking — continuously accumulating each round's consumption and tracking in real time the cumulative consumption (epsilon_used, delta_used) and the remaining budget (epsilon_remaining = epsilon_total - epsilon_used, delta_remaining = delta); (4) budget exhaustion handling — when the cumulative consumed privacy budget reaches or exceeds the set total, the system must terminate training or take strict measures to prevent further privacy disclosure, and issues an explicit alarm; (5) adaptive budget policy — dynamically adjusting the noise scale or sampling rate of subsequent rounds according to the model's convergence, the remaining budget, and the client participation pattern, so as to make better use of the remaining budget.
     Step 3.4, secure model aggregation and communication: (1) encrypted communication channel — establishing and maintaining secure encrypted links (such as TLS/SSL) between the server and all clients, preventing model parameters and updates from being eavesdropped on or tampered with in transit; (2) secure aggregation protocol — implementing and executing a secure multi-party computation protocol that allows the server to compute the sum of all client updates while neither the server nor any single client learns the update content of any other single client; (3) post-aggregation processing — if DP is applied at the server side rather than at the local clients, adding DP-satisfying noise to the aggregated result after secure aggregation produces the total update.
     Step 3.5, monitoring and auditing: (1) training process monitoring — monitoring key indicators in real time, including the number of participating clients, client dropout rate, local training time, communication time, model performance, current noise scale, and current cumulative privacy budget consumption (epsilon_used); (2) privacy guarantee verification — recording all parameters involved in the DP computation, including the clipping threshold C, the noise distribution type and scale, the sampling rate q, and each round's budget consumption log, ensuring that actual execution matches the preset DP parameters and composition theorem for post-hoc audit verification; (3) model utility assessment — analyzing the influence of DP noise on final model performance and evaluating the privacy-utility trade-off; (4) audit logging — recording in detail system operations, client participation, budget consumption, critical events, and potential anomalies or errors; (5) attack surface monitoring — watching for signs of known attacks (such as membership inference and model inversion attacks) and evaluating the privacy leakage risk of the model in actual deployment.
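Step 3.3's budget tracking can be sketched as a small accountant class. This sketch uses basic sequential composition (epsilon and delta values simply add), a simplifying assumption — the claim leaves the composition theorem open, and tighter accounting such as the moments accountant would report smaller totals. The class and method names are illustrative.

```python
class PrivacyBudgetTracker:
    """Tracks cumulative privacy consumption under basic sequential
    composition: per-round epsilon and delta values add up."""

    def __init__(self, eps_total: float, delta_total: float):
        self.eps_total = eps_total
        self.delta_total = delta_total
        self.eps_used = 0.0
        self.delta_used = 0.0

    def spend(self, eps_round: float, delta_round: float) -> bool:
        """Record one round's consumption; refuse (return False) if the
        round would exhaust the total budget — the 'budget exhaustion
        handling' of step 3.3 (4)."""
        if (self.eps_used + eps_round > self.eps_total or
                self.delta_used + delta_round > self.delta_total):
            return False
        self.eps_used += eps_round
        self.delta_used += delta_round
        return True

    @property
    def eps_remaining(self) -> float:
        return self.eps_total - self.eps_used

tracker = PrivacyBudgetTracker(eps_total=1.0, delta_total=1e-5)
results = [tracker.spend(0.3, 1e-6) for _ in range(4)]
# Rounds 1-3 succeed; round 4 would exceed eps_total = 1.0 and is refused.
```

A real system would hang the alarm and training-termination logic of step 3.3 (4) off the `False` return path.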
  8. The privacy protection method based on the fusion of federated learning and differential privacy of claim 7, wherein the method belongs to the technical field of privacy protection.
  9. The privacy protection method based on the fusion of federated learning and differential privacy of claim 7, wherein the core innovation of the method is the deep fusion of the advantages of the two technologies, realizing effective model training while protecting user data privacy.
  10. The privacy protection method based on the fusion of federated learning and differential privacy of claim 7, wherein the method embeds the strict mathematical guarantee of differential privacy into the distributed training framework of federated learning and applies privacy protection locally at the data source, thereby achieving strong protection of individual data privacy during the transmission and aggregation of model updates.

Description

Privacy protection method based on federated learning and differential privacy fusion

Technical Field

The invention belongs to the technical field of privacy protection and particularly relates to a privacy protection method based on the fusion of federated learning and differential privacy.

Background

With the global wave of digitization, data has become a central element of production driving social development. IDC predicts that the total volume of global data will reach 175 ZB by 2025, containing large amounts of sensitive data such as personal health records, financial transaction information, and location trajectories. A fundamental contradiction has always existed, however, between unlocking the value of data and protecting privacy: according to IBM's 2022 Cost of a Data Breach report, the average cost of a single data breach worldwide reached $4.35 million, and $10.1 million in the healthcare industry. Against this backdrop, strict data protection regulations (e.g., the EU GDPR and China's Personal Information Protection Law) have been issued in many countries, requiring the core goal of making data "usable but invisible", and traditional privacy protection technologies face unprecedented challenges. Early privacy protection relied mainly on data anonymization techniques (k-anonymity, l-diversity), but a 2019 University of Cambridge study showed that 99.98% of the U.S. population could be uniquely identified from only 15 non-sensitive attributes. This motivated researchers to turn to the distributed learning paradigm: in 2016 Google proposed the federated learning (FL) framework, which keeps data on the local device and transmits only model parameter updates.
However, the native shortcomings of federated learning have gradually been revealed. (1) Privacy leakage risk: an MIT team proved in 2020 that original training images can be reconstructed through model gradient inversion attacks, while traditional differential privacy (DP) requires injecting large amounts of noise (epsilon > 5), reducing model accuracy by more than 30%. (2) Centralized-architecture bottleneck: reliance on a single parameter server presents a single point of failure and cannot effectively incentivize data contributors. (3) Heterogeneous data challenges: the accuracy of the conventional FedAvg algorithm drops by 40% on the non-independent, identically distributed (Non-IID) data common in the medical field. In the data computation stage, encryption technology has gone through three development phases. In the basic encryption phase, standard algorithms such as AES and RSA were adopted, but data can only be used after decryption, which cannot meet real-time computation requirements. Secure multi-party computation (SMPC), based on Yao's protocol and secret sharing (e.g., the SPDZ framework), allows computation to remain secure with multiple participants; however, a 2021 AWS test showed that the communication overhead of a joint query over millions of records exceeds 1 TB. Gentry's breakthrough work on fully homomorphic encryption (FHE) in 2009 enabled computation on ciphertexts, but tests with the Microsoft SEAL library showed that inference on a ResNet-50 model takes more than 2 hours (on an i9 processor). The existing technical routes are thus at an impasse: SMPC suits low-complexity computation but has high communication cost, while FHE supports complex computation but has low computational efficiency. A 2022 Stanford University study indicated that a hybrid encryption architecture may be the key to a breakthrough, but seamless protocol composition remains an unsolved problem.
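The "computation on ciphertexts" property discussed above can be demonstrated with a toy Paillier cryptosystem. Paillier is only additively homomorphic (not fully homomorphic like the FHE schemes above), which already suffices for summing encrypted gradients; the tiny primes and helper names here are illustrative assumptions and the key sizes are wildly insecure.

```python
import math
import random

def _lcm(a, b):
    return a * b // math.gcd(a, b)

def paillier_keygen(p=1789, q=1867):
    """Tiny-prime Paillier key generation, for illustration only."""
    n = p * q
    lam = _lcm(p - 1, q - 1)
    g = n + 1                                   # standard simplification g = n + 1
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu, n)

def encrypt(pk, m):
    n, g = pk
    while True:                                 # r must be a unit mod n
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return pow(g, m, n * n) * pow(r, n, n * n) % (n * n)

def decrypt(sk, c):
    lam, mu, n = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n

pk, sk = paillier_keygen()
c1, c2 = encrypt(pk, 15), encrypt(pk, 27)
# Multiplying ciphertexts adds the underlying plaintexts:
c_sum = c1 * c2 % (pk[0] ** 2)
# decrypt(sk, c_sum) == 42
```

An aggregator holding only the public key can multiply the clients' ciphertexts together and forward the product; only the key holder can decrypt the sum, never the individual contributions.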
Traditional data desensitization techniques (generalization, perturbation) can no longer address the privacy threats of the machine learning era. For generative privacy protection with differentially private synthetic data, Abadi proposed the DP-GAN framework in 2017, but it suffers from mode collapse, with an FID value exceeding 50 on the MNIST data set. For federated generative models, NVIDIA's Clara system achieved cross-institution medical image synthesis in 2021, but consumes more than 200 GB of GPU memory. The evaluation system is also lacking: existing work relies excessively on statistical indicators such as KL divergence and lacks verification of the data's topological structure (e.g., persistent homology). While zero-knowledge proof (ZKP) technology can achieve "minimal information disclosure", practical deployment faces serious challenges: zk-SNARKs still need more than 3 seconds (on a single CPU core) to prove even the simple proposition "age >= 18"; the size of circuit files written in the Circom language grows exponentially with logical complexity; and the breakthrough of Shor's algorithm against the RSA cryptosystem warns of latent security risks in existing ZKP systems. Federated learning + blockchain technology in a 2023 Nature pape