CN-121983144-A - Federated learning classification system and method for privacy protection of single-cell RNA sequencing data

Abstract

The invention discloses a federated learning classification system and method for privacy protection of single-cell RNA sequencing data, relating to the technical field of privacy-preserving data classification. The scRNA-seq data are standardized and their features are enhanced through contrastive learning and an autoencoder; a multi-model dynamic collaborative training module supports dynamic switching among models such as a graph convolutional network and a cross-modal Transformer according to the data characteristics; a causal inference intelligent evaluation module recommends an optimal model through a three-level evaluation system and a deep Q-network; a hierarchical compression communication optimization module adopts a three-level compression strategy to reduce traffic; and a secure multi-party computation privacy protection module integrates encryption and differential privacy mechanisms to ensure data security. The invention supports multi-model dynamic adaptation, improves classification accuracy and system generality, reduces communication overhead, adapts to heterogeneous clients, ensures data privacy, and achieves efficient and secure federated learning classification of single-cell RNA sequencing data.

Inventors

HAN QINGBIN
ZHANG XIAN
LI NING

Assignees

江苏田倍丰农业科技有限公司

Dates

Publication Date: 20260505
Application Date: 20251204

Claims (10)

1. A federated learning classification system for privacy protection of single-cell RNA sequencing data, characterized by comprising: an adaptive data preprocessing module, configured to perform counts-per-million (CPM) normalization on raw single-cell RNA sequencing data, apply a logarithmic transformation to the raw count values, and enhance the feature representation through a contrastive learning framework, in which random masks of 15% of the gene expression values serve as negative samples, a twin network is built from a six-layer Transformer encoder, and an NT-Xent loss function is used to strengthen cell-type-discriminative features; a multi-model dynamic collaborative training module, which supports parallel training of and adaptive switching among four models, namely a graph convolutional network whose graph structure is built from gene co-expression relationships, an extreme gradient boosting ensemble model whose tree depth is dynamically adjusted, a capsule attention network that comprises two capsule layers and adjusts the number of routing iterations to 4 according to the data sparsity, and a four-layer cross-modal Transformer encoder that fuses more than 100 known cell marker maps; and a causal inference evaluation module, configured to construct a three-layer evaluation system, in which a statistical significance layer performs a Wilcoxon test on the model F1-scores with Benjamini-Hochberg correction to control for multiple testing, a layer using do-calculus computes the causal effect value CATE of a model on a cell type and triggers a model isolation mechanism when the CATE is less than 0, and a resource efficiency layer generates a compute-accuracy Pareto curve through multi-objective optimization; the evaluation results are input into a deep Q-network, and automatic model recommendation is realized through an ε-greedy strategy.
2. The federated learning classification system for privacy protection of single-cell RNA sequencing data according to claim 1, further comprising: a hierarchical compression communication optimization module, which adopts a parameter sparsification technique, sparsifying the neural network parameters by L1 regularization in which an L1-norm term is added to the loss function, $L = L_0 + \lambda \lVert w \rVert_1$, where $L$ is the total loss function of the model, $L_0$ is the original loss function, $\lambda$ is the regularization coefficient, and $\lVert w \rVert_1$ is the L1 norm; each client uploads only the indices and corresponding values of its non-zero parameters, and the server generates a global consensus index by aggregating the index sets of all clients and broadcasts it to all clients so that the complete gradient can be restored; and a secure multi-party computation privacy protection module, which integrates three layers of privacy enhancement mechanisms, specifically: an input layer, which encrypts the input data using Paillier homomorphic encryption, the encryption being defined in terms of a modulus and the plaintext data, so that normalization of the raw data is computed in the ciphertext state; a training layer, in which, based on the garbled circuit technique of secure multi-party computation (MPC), each client encrypts its sparsified gradient and uploads it, and the server directly performs additive aggregation in the ciphertext state, the aggregation being defined over the average of all client gradients, the ciphertext of that average gradient, the local gradient computed by the i-th client, the ciphertext of the i-th client's local gradient, the total number M of clients participating in the aggregation, and the ciphertext-space modulus of the Paillier encryption, with verification by non-interactive zero-knowledge proof; an output layer, which adopts a differential-privacy-enhanced noise injection strategy, adding noise that follows a Laplace distribution, $\mathrm{Lap}(\Delta f / \varepsilon)$, to each non-zero parameter before gradient encryption, where $\Delta f$ is the sensitivity and $\varepsilon$ is the privacy budget; cell-type rarity is quantified from frequency statistics relative to the global average frequency, giving a rarity value in the range [0,1] in which a larger value indicates a rarer cell type, and the rarity is combined with the sparsity of gene expression to implement dynamic privacy budget allocation, an optimal balance among compute, privacy and accuracy being generated through Pareto optimization; and a decryption verification stage, in which the client decrypts the aggregation result with its private key and verifies consistency to guard against tampering by a malicious server.
3. The federated learning classification system for privacy protection of single-cell RNA sequencing data according to claim 1, wherein the multi-model dynamic collaborative training module integrates a cross-institution knowledge transfer mechanism: when the data volume of a client is less than 500 cells, federated transfer learning is automatically triggered to extract graph-topology features of the graph convolutional network from a source domain with sufficient data; the target-domain distribution is aligned by an adversarial domain adaptation (ADA) algorithm that includes a gradient reversal layer to reduce the domain discrepancy, the target domain being the client's own data set; and the aligned features are input into the local model of the target domain, which is fine-tuned under supervision with a small amount of target-domain data, so as to resolve the modeling bias caused by insufficient samples of rare cell types.
4. The federated learning classification system for privacy protection of single-cell RNA sequencing data according to claim 1, wherein the causal inference evaluation module constructs a cell type-model causal graph, analyzes the causal effect pathways of a model on cell types using a structural causal model (SCM), and automatically triggers a model correction mechanism when the model is found to indirectly cause classification bias by reducing noise in gene expression.
5. The federated learning classification system for privacy protection of single-cell RNA sequencing data according to claim 2, wherein the hierarchical compression communication optimization module introduces a neural network parameter predictor that predicts the parameter changes of the current round from the parameter update pattern of the previous 3 rounds; the residual between the actual parameters and the predicted values is transmitted, and the prediction error is dynamically corrected by Kalman filtering.
6. The federated learning classification system for privacy protection of single-cell RNA sequencing data according to claim 2, wherein the secure multi-party computation privacy protection module provides a model verification mechanism based on zero-knowledge proof (ZKP), in which the server verifies the legitimacy of client parameter updates through non-interactive zero-knowledge proof (NIZK) while ensuring that the parameter updates strictly conform to the federated learning training procedure.
7. The federated learning classification system for privacy protection of single-cell RNA sequencing data according to claim 1, wherein in the multi-model dynamic collaborative training module the cross-modal Transformer integrates a joint representation of gene expression and spatial location: when the client has spatial transcriptome data, local association information of cells within a radius of 50 μm is fused through an attention mechanism, wherein an adaptive graph convolutional network generates a neighborhood weight matrix between cells and the matrix is used as input to the Transformer attention layer to adjust the association strength weights of different cells, thereby linking graph network features with Transformer sequence modeling, enhancing the fusion of local cell interaction information, improving the accuracy of spatially dependent cell type classification, and meeting the need for joint analysis of spatial transcriptome and single-cell sequencing data.
8. A method of using the federated learning classification system for privacy protection of single-cell RNA sequencing data according to any one of claims 1-7, comprising: performing CPM normalization and log2 transformation on the scRNA-seq data, using 15% gene masks as negative samples within a contrastive learning framework, building a twin network from a six-layer Transformer encoder, computing the contrastive loss with the NT-Xent loss function, and strengthening cell-type-discriminative features; computing the Wasserstein distance between the local data distribution and the adapted distribution of each candidate model, together with the sparsity, selecting the cross-modal Transformer when the distance is greater than 0.7 and the sparsity is greater than 0.8, using the graph convolutional network with a GraphSAGE aggregator when the distance is less than 0.3 and the sparsity is less than 0.5, selecting DyTree or CAN by Bayesian optimization in the intermediate region, and executing 1 iteration with the Adam optimizer in each round of training; a causal inference evaluation step of computing the Wilcoxon test p-value of the model F1-scores with Benjamini-Hochberg correction to control for multiple testing, constructing a causal graph and computing the CATE by do-calculus, and isolating the model when the CATE is less than 0; a hierarchical compression communication step in which the client applies L0 regularization to the parameters to screen the Top-20% gradients, locality-sensitive hashing is applied to the Transformer attention weights, low-compute nodes delay aggregation by 3 rounds, and the server compensates by interpolating historical parameters; and a secure computation step in which the input layer performs ciphertext CPM computation by Paillier encryption, the training layer completes MPC aggregation through garbled circuits, the output layer adds differential privacy noise, and the optimal privacy budget is generated through Pareto optimization.
9. The federated learning classification method for privacy protection of single-cell RNA sequencing data according to claim 8, wherein in the self-supervised feature enhancement step, dynamic time alignment of cell differentiation trajectories is applied to time-series data, the trajectory alignment weights are adjusted according to differences between differentiation stages, and the similarity of cell features within the same stage is increased by weighting the contrastive loss function, so that the method is suitable for research scenarios that track the dynamic process of cell differentiation.
10. The federated learning classification method for privacy protection of single-cell RNA sequencing data according to claim 8, wherein in the hierarchical compression communication step, an LSTM parameter predictor learns the parameter update pattern of the previous 3 rounds, the residual data are transmitted and the prediction error is corrected by Kalman filtering, and the participation efficiency of low-compute clients is improved by reducing the number of communication rounds, thereby supporting deployment in federated learning scenarios.
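
Several client-side steps of claim 8 can be illustrated with a short Python sketch: CPM normalization followed by log2(CPM + 1), the threshold rule for choosing a model from the Wasserstein distance and the sparsity, Top-20% gradient screening, and Laplace noise injection. This is a minimal sketch under stated assumptions, not the claimed implementation. All function names and return labels are hypothetical; the Top-20% screening is shown as plain magnitude-based top-k selection rather than L0 regularization; and the Laplace scale sensitivity / epsilon is the standard Laplace mechanism, which may differ from the formula in the original filing. Encryption, MPC aggregation, and model training are omitted.

```python
import numpy as np


def cpm_log2(counts):
    """CPM-normalize a cells x genes count matrix, then apply log2(CPM + 1)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # assumption: leave all-zero cells at zero
    return np.log2(counts / totals * 1e6 + 1.0)


def select_model(distance, sparsity):
    """Threshold rule from claim 8; the returned labels are hypothetical."""
    if distance > 0.7 and sparsity > 0.8:
        return "cross_modal_transformer"
    if distance < 0.3 and sparsity < 0.5:
        return "gcn_graphsage"
    # Intermediate region: claim 8 picks DyTree or CAN by Bayesian
    # optimization, which is not implemented in this sketch.
    return "bayesian_search_dytree_or_can"


def top_k_sparsify(grad, keep_fraction=0.2):
    """Keep the largest-magnitude entries; return (indices, values).

    Assumption: magnitude-based top-k stands in for the L0 screening step.
    """
    flat = np.asarray(grad, dtype=float).ravel()
    k = max(1, int(round(keep_fraction * flat.size)))
    idx = np.sort(np.argpartition(np.abs(flat), -k)[-k:])
    return idx, flat[idx]


def add_laplace_noise(values, sensitivity, epsilon, rng=None):
    """Add Laplace noise with scale sensitivity / epsilon to each value."""
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values, dtype=float)
    return values + rng.laplace(0.0, sensitivity / epsilon, size=values.shape)


if __name__ == "__main__":
    x = cpm_log2([[1, 1, 2], [0, 0, 0]])
    sparsity = float(np.mean(x == 0))  # assumption: fraction of zero entries
    print(select_model(distance=0.8, sparsity=0.9))
    idx, vals = top_k_sparsify(np.array([0.1, -5.0, 0.2, 3.0, 0.0]))
    print(idx, add_laplace_noise(vals, sensitivity=1.0, epsilon=0.5))
```

Only the kept indices and their noised values would then be encrypted and uploaded, which matches the order stated in claim 2 (noise is added to each non-zero parameter before gradient encryption).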

Description

Federated learning classification system and method for privacy protection of single-cell RNA sequencing data

Technical Field

The invention relates to the technical field of privacy-preserving data classification, and in particular to a federated learning classification system and method for privacy protection of single-cell RNA sequencing data.

Background

With the rapid development of bioinformatics, single-cell RNA sequencing (scRNA-seq) has become a key means of studying cellular heterogeneity and disease mechanisms and of advancing precision medicine. The technology accurately characterizes gene expression profiles at the single-cell level and is widely applied in important scenarios such as immune cell analysis, cancer subtyping, and the study of stem cell developmental trajectories. Classifying and modeling large-scale, high-dimensional scRNA-seq data enables automatic identification of cell types and thereby promotes the deep integration of basic biological research and clinical application.

In practice, to improve the accuracy and generalization ability of classification models, research institutions often wish to integrate data from multiple experimental platforms, different tissue sources, or different institutions. However, scRNA-seq data contain individual gene expression characteristics, are highly sensitive biological privacy data, and are subject to strict laws and regulations such as the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA). Directly centralizing the raw data for modeling faces substantial privacy risks and compliance challenges, so a technical solution is needed that both protects data privacy and enables joint multi-party data analysis.
In recent years, federated learning, a collaborative modeling paradigm in which data never leave their local site, has become an important path for exploring multi-party joint analysis in the biomedical field. However, existing approaches have the following shortcomings when applied to scRNA-seq data classification: there is no unified federated learning framework for the many classification algorithms suited to single-cell sequencing, making it difficult to meet the algorithm selection requirements arising from heterogeneous data distributions across institutions; mechanisms for comparing algorithm performance and selecting among algorithms are lacking, so model training strategies and algorithm selection rely excessively on manual experience and cannot be optimized automatically; federated support for complex models such as the Transformer is insufficient, leading to large parameter counts, long training times, and high communication costs, which hinders deployment in resource-constrained scenarios; and dynamic management mechanisms for heterogeneous clients are lacking, making it difficult to adapt to changes in the number of clients, uneven data sources, and differences in computing resources, which readily causes unstable model performance or aggregation failure.

Disclosure of Invention

The invention provides a federated learning classification system and method for privacy protection of single-cell RNA sequencing data, so as to solve the above problems in the prior art.
To achieve the above purpose, the invention adopts the following technical scheme. The federated learning classification system for privacy protection of single-cell RNA sequencing data comprises: an adaptive data preprocessing module, configured to perform counts-per-million (CPM) normalization on raw single-cell RNA sequencing data, apply a log2(CPM+1) transformation, and enhance the feature representation through a contrastive learning framework, in which random masks of 15% of the gene expression values serve as negative samples, a twin network is built from a six-layer Transformer encoder, and the normalized temperature-scaled cross-entropy (NT-Xent) loss function is used to strengthen cell-type-discriminative features; data with more than 8000 genes are reduced in dimension by a clustering autoencoder whose encoder consists of fully connected layers of 2048, 1024 and 512 dimensions, the reconstruction error is kept within 4.2% by combining KL-divergence regularization, and retention of the cell cluster structure is verified visually by t-distributed stochastic neighbor embedding (t-SNE); a multi-model dynamic collaborative training module, which supports parallel training of and adaptive switching among four models, namely a graph convolutional network whose graph structure is built from gene co-expression relationships, dynamically adjusting extrem