CN-121706896-B - Multimodal federated learning model training method for clients with missing modalities


Abstract

The invention relates to the technical field of machine learning, and in particular to a multimodal federated learning model training method for clients with missing modalities. In the method, a server initializes a global model containing learnable compensation vectors. During local training, each client identifies the modality state of its data from a binary mask, introduces the compensation vector as a proxy feature for each missing modality, uses a self-masking interaction mechanism to draw semantic information from the other available modalities and generate reconstructed features, and dynamically adjusts the feature fusion ratio through adaptive gated residual fusion. The method further introduces a dual alignment constraint consisting of an instance-level contrastive loss and a distribution-level maximum mean discrepancy (MMD) loss, which suppresses the client drift caused by missing modalities and data heterogeneity at both the microscopic semantic level and the macroscopic statistical level. Because no complex data generation or reconstruction is required, the method significantly reduces the computation and communication overhead of edge devices, and improves the prediction accuracy, robustness and training stability of the global model in heterogeneous environments with missing modalities.
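The client-side pipeline summarized in the abstract (binary mask, compensation vector as proxy feature, self-masking interaction, gated residual fusion) can be illustrated with the minimal sketch below, written in plain Python for a single sample. Beyond what the text states, it assumes that the projection head of claim 3 is the identity, that the normalization after self-masking is a row-wise softmax, and that the gate is a scalar sigmoid over the concatenated original and reconstructed features with the residual form z + g·r. All function names and the `compensation` argument are hypothetical.

```python
import math


def split_modalities(mask):
    """Binary mask (1 = present, 0 = missing) -> (available, missing) index lists."""
    available = [i for i, bit in enumerate(mask) if bit == 1]
    missing = [i for i, bit in enumerate(mask) if bit == 0]
    return available, missing


def unified_features(real_feats, mask, compensation):
    """Real features for available modalities, compensation vectors for missing ones.

    real_feats[i] may be None when modality i is missing; compensation[i] is the
    hypothetical learnable proxy vector for modality i (copied, not aliased).
    """
    _, missing = split_modalities(mask)
    missing = set(missing)
    return [list(compensation[i]) if i in missing else list(real_feats[i])
            for i in range(len(mask))]


def cosine(u, v):
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def self_masked_weights(feats, tau=0.5):
    """Cosine relation matrix -> divide by tau -> diagonal := -inf -> row-wise softmax."""
    m = len(feats)
    if m < 2:
        raise ValueError("self-masking needs at least two modalities")
    weights = []
    for i in range(m):
        logits = [-math.inf if j == i else cosine(feats[i], feats[j]) / tau
                  for j in range(m)]
        top = max(logits)  # finite, since at least one off-diagonal entry exists
        exps = [math.exp(x - top) for x in logits]  # exp(-inf) == 0.0 on the diagonal
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def reconstruct(feats, weights):
    """Cross-modal reconstruction: r_i = sum_j A[i][j] * z_j."""
    dim = len(feats[0])
    return [[sum(w[j] * feats[j][d] for j in range(len(feats))) for d in range(dim)]
            for w in weights]


def gated_residual_fusion(feats, recon, gate_w, gate_b):
    """h_i = z_i + g_i * r_i, with scalar gate g_i = sigmoid(gate_w . [z_i; r_i] + gate_b)."""
    fused = []
    for z, r in zip(feats, recon):
        score = sum(w * x for w, x in zip(gate_w, z + r)) + gate_b
        g = 1.0 / (1.0 + math.exp(-score))
        fused.append([zi + g * ri for zi, ri in zip(z, r)])
    return fused
```

Because the diagonal logit is set to negative infinity before the softmax, each modality's reconstructed feature is built only from the other modalities. As a result, the proxy for a missing modality is enriched with information from the modalities that are actually present.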

Inventors

REN WEI
ZHOU HAIXIA

Assignees

Southwest University (西南大学)

Dates

Publication Date: 2026-05-05
Application Date: 2026-02-14

Claims (5)

1. A multimodal federated learning model training method for clients with missing modalities, characterized by comprising the following steps:
S1, initializing the system model: constructing a global multimodal model for clients with missing modalities, initializing the global model parameters at the server, and setting the number of multimodal federated learning communication rounds, the client sampling ratio and the loss function weight parameters according to the requirements of the target learning task;
S2, according to the client sampling ratio, selecting from all clients those that participate in training in the current round to construct a participating training subset, and broadcasting the current global model parameters to every client in the participating training subset;
S3, initializing a local model on each client in the participating training subset from the global model parameters, and having each client perform independent local training using its local multimodal dataset and the loss function weight parameters to obtain its updated local parameters;
S4, uploading the updated local parameters to the server, and aggregating them with a federated averaging algorithm weighted by client data volume to construct the global model parameters of the new round;
S5, repeating steps S2 to S4 with the new global model parameters until the preset number of multimodal federated learning communication rounds is reached, obtaining the trained global multimodal model;
wherein the global model parameters in step S1 comprise model structure parameters and learnable compensation vector parameters, the model structure parameters comprising feature extractor parameters, feature fusion module parameters and decoder parameters;
and step S3 comprises the following sub-steps:
S301, each client in the participating training subset receives the global model parameters to initialize its local model, and constructs an available modality set and a missing modality set from its local multimodal dataset;
S302, mapping each modality in the available modality set to a high-dimensional semantic feature space through the feature extractor to obtain real feature representations;
S303, performing semantic relation modeling on the unified feature set (the real feature representations of the available modalities together with the learnable compensation vectors that stand in for the missing modalities) to obtain a semantic relation matrix, generating, based on the semantic relation matrix, a cross-modal reconstructed feature set corresponding to the unified feature set through a self-masking interaction weighting algorithm, and performing residual fusion of the unified feature set and the cross-modal reconstructed feature set through an adaptive gating unit to obtain a fused feature set;
S304, performing, on the fused feature set, instance-level semantic alignment and distribution-level statistical alignment through a dual alignment regularization algorithm to obtain an instance-level contrastive loss value and a distribution-level maximum mean discrepancy value, and computing the local total loss value in combination with the loss function weight parameters;
S305, back-propagating the local total loss value and updating the local model parameters of each client with a stochastic gradient descent optimizer to obtain the updated local parameters of each client after independent local training.
2. The multimodal federated learning model training method for clients with missing modalities according to claim 1, wherein step S2 comprises: the server randomly samples clients from all registered clients according to a preset client sampling ratio to form the participating training subset of the current round, and sends the current global model parameters to each client in the participating training subset.
3. The multimodal federated learning model training method for clients with missing modalities according to claim 1, wherein step S303 further comprises the following sub-steps:
S303-1, mapping each modality feature to a common projection space through a projection head, and computing the cosine similarity of all modality pairs for the current sample to construct the semantic relation matrix;
S303-2, applying temperature scaling and a self-masking operation to the semantic relation matrix, in which the diagonal elements are set to negative infinity and the result is normalized, to generate an interaction weight matrix, and computing the cross-modal reconstructed feature of each modality;
S303-3, dynamically adjusting a gating coefficient through the gating unit according to the information reliability of the input features, and fusing the original features and the cross-modal reconstructed features in residual form to obtain the fused feature set.
4. The multimodal federated learning model training method for clients with missing modalities according to claim 1, wherein the instance-level semantic alignment in step S304 comprises using a contrastive learning loss to pull together the features of different modalities of the same sample in the semantic space and to push apart the features of different samples; and the distribution-level statistical alignment comprises minimizing, through a maximum mean discrepancy loss, the difference between the overall statistical distributions of the features of different modalities within a batch.
5. The multimodal federated learning model training method for clients with missing modalities according to claim 1, wherein the federated averaging algorithm based on client data volume in step S4 is expressed as:
$$w^{t+1} = \sum_{k \in S_t} \frac{n_k}{\sum_{j \in S_t} n_j}\, w_k^{t+1}$$
wherein $t$ denotes the current training round and $t+1$ the new training round; $w^{t+1}$ denotes the global model parameters of the new round; $k$ denotes a client; $S_t$ denotes the participating training subset; $j$ denotes the $j$-th client in the participating training subset; $n_k$ (likewise $n_j$) denotes the data volume of the client; and $w_k^{t+1}$ denotes the updated local parameters uploaded by the $k$-th client.
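The dual alignment regularization of claim 4 could look roughly like the following sketch for a batch containing two modalities. The specific choices are assumptions not fixed by the claims: a one-directional InfoNCE loss on cosine similarity for the instance-level term, a biased Gaussian-kernel estimate of squared MMD for the distribution-level term, and a local total loss of the form task loss + λ_con·contrastive + λ_mmd·MMD. The names `info_nce`, `mmd2` and `local_total_loss` are hypothetical.

```python
import math


def _cos(u, v):
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def info_nce(za, zb, tau=0.1):
    """Instance-level contrastive loss: za[i] and zb[i] (same sample, different
    modality) form the positive pair; zb[k] for k != i act as negatives."""
    n = len(za)
    total = 0.0
    for i in range(n):
        logits = [_cos(za[i], zb[k]) / tau for k in range(n)]
        top = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_denominator - logits[i]  # -log softmax of the positive pair
    return total / n


def gaussian_kernel(x, y, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between two batches of modality features."""
    k_xx = sum(gaussian_kernel(a, b, sigma) for a in xs for b in xs) / len(xs) ** 2
    k_yy = sum(gaussian_kernel(a, b, sigma) for a in ys for b in ys) / len(ys) ** 2
    k_xy = sum(gaussian_kernel(a, b, sigma) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2.0 * k_xy


def local_total_loss(task_loss, za, zb, lam_con, lam_mmd):
    """Assumed form: task loss plus weighted contrastive and weighted MMD terms."""
    return task_loss + lam_con * info_nce(za, zb) + lam_mmd * mmd2(za, zb)
```

The contrastive term works per sample (which pairs match), while the MMD term compares whole batches (whether the two modalities' feature clouds overlap). This mirrors the microscopic and macroscopic levels named in the abstract.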

Description

Multimodal federated learning model training method for clients with missing modalities

Technical Field

The invention relates to the technical field of machine learning, and in particular to a multimodal federated learning model training method for clients with missing modalities.

Background

With the rapid development of Internet of Things technology, intelligent terminals and wearable devices, multimodal data such as images, speech, text and physiological signals are widely collected and used in application scenarios such as medical and health monitoring, intelligent driving, affective computing and emergency event analysis. By jointly modeling information from different modalities, multimodal learning can fully exploit the complementary relations among modalities and thereby improve a model's feature representation capability and prediction accuracy on complex tasks; it has become one of the important research directions in artificial intelligence. However, in scenarios with strict privacy and compliance requirements, such as healthcare, finance and personal terminal devices, it is often impractical to collect multimodal data centrally for unified training. Federated learning is a distributed machine learning paradigm that achieves multi-party collaborative modeling by exchanging only model parameters or intermediate information, without sharing raw data, and thus effectively alleviates data privacy and compliance problems. On this basis, multimodal learning and federated learning have been combined into multimodal federated learning: each client independently trains a model on its local multimodal data, and the server aggregates the clients' model parameters to obtain a global model.
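On the server side, the round structure described here, with client sampling as in claim 2 and data-volume-weighted aggregation as in claim 5, might be sketched as follows. This is an illustrative sketch, not the patent's implementation: parameters are represented as flat lists of floats, the sample size is rounded and clamped to at least one client, and `sample_clients` and `fedavg` are hypothetical helper names.

```python
import random


def sample_clients(client_ids, ratio, rng):
    """Draw the participating subset S_t for one round, without replacement."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must lie in (0, 1]")
    m = max(1, round(ratio * len(client_ids)))  # at least one client per round
    return sorted(rng.sample(client_ids, m))


def fedavg(local_params, data_sizes):
    """w^{t+1} = sum_k n_k / sum_j n_j * w_k^{t+1}, summed over participating clients.

    local_params: client id -> updated local parameter vector (list of floats)
    data_sizes:   client id -> number of local samples n_k
    """
    total = sum(data_sizes[k] for k in local_params)
    if total <= 0:
        raise ValueError("participating clients must hold data")
    dim = len(next(iter(local_params.values())))
    new_global = [0.0] * dim
    for k, w in local_params.items():
        coeff = data_sizes[k] / total  # this client's share of the round's data
        for d in range(dim):
            new_global[d] += coeff * w[d]
    return new_global
```

In a full run, these two helpers would be called once per communication round. The server samples S_t, broadcasts the current global parameters, collects each client's locally trained parameters, and passes them to `fedavg` together with the clients' data volumes.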
Existing multimodal federated learning methods are typically built on the assumption that every client has a complete and consistent set of modal data. Under this assumption, each client concatenates, weights or otherwise fuses the features of different modalities locally, and the server aggregates the model parameters uploaded by the clients to achieve cross-client multimodal collaborative modeling. In real deployments, however, differences in terminal hardware configuration, inconsistent sensor deployment, different acquisition strategies, device failures and similar causes mean that clients often hold only some of the modalities, so client-level missing modalities are common. For example, in medical and health monitoring, some terminals may acquire only physiological signals without behavioral or image modalities, and in intelligent sensing or Internet of Things scenarios the sensor types configured on different devices also differ significantly. As a result, existing multimodal federated learning methods that assume complete modalities are difficult to apply directly, and model performance and training stability are seriously affected. For the problem of missing modalities on clients, the prior art mainly offers two kinds of solutions: fill-based and non-fill-based.

The fill-based schemes include two methods. The first is a simple filling strategy, such as zero filling or mean filling, in which the feature vectors of missing modalities are replaced locally at the client with fixed values so that the model input dimensions stay consistent. Although simple to implement, this approach ignores the semantic correlations between modalities and easily introduces irrelevant or distorted feature information into the model, degrading feature representation quality and training effectiveness.
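For comparison, the simple fill baselines described above reduce to a few lines. This sketch makes two assumptions. A sample is a dict mapping modality names to feature vectors, with None marking a missing modality. The per-modality means used by mean filling are available from samples that do contain the modality. The function names are hypothetical.

```python
def zero_fill(sample, dim):
    """Replace every missing modality vector with zeros of length dim."""
    return {m: (list(v) if v is not None else [0.0] * dim) for m, v in sample.items()}


def modality_means(samples, modality):
    """Per-dimension mean of one modality over the samples that contain it."""
    present = [s[modality] for s in samples if s.get(modality) is not None]
    if not present:
        raise ValueError(f"no sample contains modality {modality!r}")
    dim = len(present[0])
    return [sum(v[d] for v in present) / len(present) for d in range(dim)]


def mean_fill(sample, means):
    """Replace every missing modality vector with a precomputed mean vector."""
    return {m: (list(v) if v is not None else list(means[m])) for m, v in sample.items()}
```

Neither function looks at the modalities that are present when filling in a missing one. This is the lack of cross-modal semantic correlation that the text criticizes.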
The second is a generative reconstruction filling strategy. It builds a generative model, such as a conditional generative network, a variational autoencoder or another cross-modal modeling approach, to reconstruct the features or data of missing modalities, attempting to infer the missing modality content from the information in the available modalities. Although this approach can partly recover inter-modal associations, it usually relies on generative models with a large number of parameters. This entails high computational complexity and communication cost and makes efficient deployment difficult on resource-constrained edge devices and in federated learning; the approach may also suffer from unstable training and from generated features that deviate from the true distribution. The other category is the non-fill scheme, which uses graph structures, prototype alignment or other methods in an attempt to bypass explicit reconstruction, compensating for the missing modalities by establishing associations in the repre