
CN-121998039-A - Federated multi-modal preference learning and alignment method under modality-missing conditions

CN 121998039 A

Abstract

The invention discloses a federated multi-modal preference learning and alignment method under modality-missing conditions, addressing the unstable alignment caused by missing image-text modalities, preference noise, and cross-domain conflict in multi-client privacy-preserving settings. The method comprises: identifying the modality state at each client and constructing conditional preference pairs, wherein complete image-text samples are scored by cross-modal consistency to screen better and worse responses, a visual-attribute lexicon penalty suppresses unsupported speculation when the image is missing, and entity-existence verification penalizes hallucination when the text is missing; combining sample-level gating based on prediction uncertainty with client-level gating in a two-stage scheme to reduce noisy updates; and, on the server side, clustering low-dimensional preference signatures to form multiple preference experts, which are distributed and fused according to similarity routing. Under modality-missing scenarios, the method improves the consistency and robustness of multi-modal generation and reduces negative transfer and hallucination.

Inventors

  • LI CHUNPEI
  • Mou Nengqiong
  • LI XIANXIAN
  • YE JIAYONG
  • Cheng Guijuan

Assignees

  • 广西师范大学 (Guangxi Normal University)

Dates

Publication Date
2026-05-08
Application Date
2026-02-13

Claims (8)

  1. A federated multi-modal preference learning and alignment method under modality-missing conditions, applied to a system of local clients and a server, the method comprising the following steps:
Step one, server-side initialization and configuration distribution: before federated training starts, the server performs system initialization and distributes to each client the global configuration and public resources required for consistent alignment training, the distributed content comprising:
1.1. the server distributes a model parameter set for federated alignment training to the clients; the global model adopts a shared-base-plus-expert structure, the local model of a client is formed by combining the distributed shared base with an updatable expert model, and the expert model is a tunable parameter module attached to the shared base model;
1.2. a consistent random projection matrix configuration is distributed: to support the subsequent generation and privacy-preserving compression of low-dimensional preference signatures, the server generates a unified random seed at the system initialization stage; each client generates a random matrix from this seed and obtains a fixed random orthogonal projection matrix R through QR decomposition; the parameters of R remain unchanged across federated training rounds, so that preference signatures generated in different rounds lie in the same low-dimensional representation space, supporting the server's similarity computation and clustering;
1.3. the server distributes a unified rule configuration for modality-missing judgment, namely the conditions under which null, corrupted, or below-quality-threshold inputs are mapped to "missing", together with the visual-attribute lexicon for the text-only scenario, so that clients follow a consistent criterion when performing modality-mask judgment and conditional preference pair construction;
1.4. training and gating hyperparameters are distributed: the server distributes to the clients the hyperparameters and threshold ranges, including the preference-optimization temperature coefficient, the sample-level gating temperature parameter, the entity-existence threshold, the penalty coefficient, the position ratio for sampling the worse response, and the value range of the candidate response count N, ensuring that local training and gated updates behave consistently and controllably across clients;
Step two, multi-modal input state identification and modality mask generation: let the local multi-modal sample of a client comprise image-modality data and text-modality data, with subscript i denoting the i-th sample and indexing the corresponding modality mask variable; the client scans each sample with a modality detection operator to generate the modality mask variable identifying the current sample's modality state: both image and text modalities present; text only, with the image modality missing; or image only, with the text modality missing;
Step three, conditional preference pair construction based on the modality mask: the client uses the current local model to generate a candidate response set for each sample, where the candidate response count N is chosen within the configured range; according to the mask, different automated screening strategies are applied to construct a preference data pair consisting of a preferred response and a worse response;
Step four, two-stage gradient gating based on prediction entropy: to prevent low-quality gradients from modality-missing samples from contaminating the model, two-stage dynamic weighting is implemented: 4.1. sample-level gating; 4.2. client-level gating, in which each client computes a reliability score and, during server aggregation, each client is assigned an aggregation weight determined jointly by the product of the client's sample size and its reliability score;
Step five, client-side representation based on low-dimensional preference signatures: after each round of local alignment training finishes, a preference signature is computed from the projection of the updated local parameter gradients; to protect client privacy and reduce computational load, the high-dimensional vector is projected into a low-dimensional space by the fixed random orthogonal matrix R, which is generated by the client via QR decomposition from the server-issued random seed and is distributed and fixed at the federated system initialization stage; the projected vector is normalized to obtain the preference signature vector of round t; the signature dimension is selected as one of 128, 256, or 512 according to the number of clients and the communication bandwidth, a larger dimension giving finer clustering at higher communication cost;
Step six, signature-based clustering and multi-expert aggregation: the server receives from each client the model parameter update amount, the preference signature vector, the reliability score, and the sample count; the server clusters the preference signature vectors, grouping clients with similar preferences into the same cluster, performs weighted aggregation of the client updates within each cluster according to the clients' reliability scores, and generates the corresponding global preference expert models;
Step seven, expert distribution and local routing fusion: the server maintains a mapping table recording the preference-cluster center corresponding to each expert, namely the mean of all preference signature vectors in that cluster; based on signature similarity, the server selects p experts to send to each corresponding client; the client receives the expert models and updates its local expert models; the client fuses the expert models according to gating coefficients to obtain a multi-expert integrated model, and processes inputs with the fused expert ensemble together with the frozen base to generate the corresponding candidate set.
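The signature construction of steps 1.2 and five, a shared random seed expanded into a fixed orthogonal projection by QR decomposition and then applied to the local update and normalized, can be sketched as follows. The function names, dimensions, and NumPy-based realization are illustrative assumptions, not the patent's concrete embodiment.

```python
import numpy as np

def make_projection(seed: int, d_high: int, d_low: int) -> np.ndarray:
    """Fixed random orthogonal projection shared via a server seed (step 1.2).

    Every client regenerates the same matrix from the seed, so signatures
    from different rounds and clients live in one low-dimensional space.
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((d_high, d_low))
    q, _ = np.linalg.qr(a)   # reduced QR: q has orthonormal columns
    return q                 # shape (d_high, d_low)

def preference_signature(update: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Project a high-dimensional parameter update and L2-normalize (step five)."""
    z = update @ proj
    return z / (np.linalg.norm(z) + 1e-12)
```

Because every client derives the identical matrix from the server's seed, the resulting unit-norm signatures are directly comparable by cosine similarity on the server side.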
  2. The federated multi-modal preference learning and alignment method under modality-missing conditions of claim 1, wherein the modality detection operator in step two is implemented in any one or a combination of the following ways and supports the unified mapping of null, corrupted, or below-quality-threshold inputs to "missing", including judging whether a data structure field is null; and the client stores, for each sample, a triplet of the image data, the text data, and the modality mask, and uses the modality mask as the conditional variable for all subsequent operators.
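A minimal sketch of the modality detection operator of claim 2, assuming a simple quality score stands in for the patent's unified null/corruption/low-quality rule; the names and thresholds are hypothetical:

```python
from typing import Optional

IT, T_ONLY, I_ONLY = "IT", "T-only", "I-only"

def modality_mask(image: Optional[bytes], text: Optional[str],
                  image_quality: float = 1.0, q_threshold: float = 0.2) -> str:
    """Map null / corrupted / low-quality inputs to 'missing' (claim 2).

    `image_quality` and `q_threshold` are illustrative stand-ins for the
    patent's unified quality rule: any of null, corruption, or a score
    below the threshold marks the modality as missing.
    """
    has_image = image is not None and len(image) > 0 and image_quality >= q_threshold
    has_text = text is not None and len(text.strip()) > 0
    if has_image and has_text:
        return IT
    if has_text:
        return T_ONLY
    if has_image:
        return I_ONLY
    raise ValueError("sample has no usable modality")
```

The returned mask then conditions every downstream operator, matching the triplet storage described in the claim.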
  3. The federated multi-modal preference learning and alignment method under modality-missing conditions of claim 1, wherein step three, applying different automated screening strategies according to the mask to construct the preference data pair of a preferred response and a worse response, comprises the following specific steps:
3.1 for IT data, i.e. when image and text are both complete, cross-modal fine-grained alignment is performed: the CLIP model is used to compute a visual-semantic consistency score between each candidate answer and the image; the candidate responses are sorted in descending order of score; the highest-scoring response is selected as the preferred response, and the worse response is taken from a configured position in the sorted sequence, forcing the local model to distinguish high-quality descriptions from hallucinations;
3.2 for T-only data, i.e. when only text is available, a visual-attribute density penalty is applied: a preset visual-attribute lexicon is loaded, containing visual descriptor words that cannot be supported in the absence of the visual modality; for each candidate response a set of penalized visual words is constructed, and a scoring function that subtracts the visual-attribute penalty is computed; the response with the highest score is selected as the preferred response and a lower-scoring response as the worse response, obtained by the same method as in step 3.1;
3.3 for I-only data, i.e. when only an image is available, closed-loop verification of entity existence is performed: the procedure is the same as the IT scenario but introduces an entity penalty term; entity nouns are extracted from each candidate response to obtain an entity set; for each entity, a prompt template is constructed and the probability that the entity exists in the image is computed; if that probability is below the existence threshold, a deduction mechanism is triggered, the entity is judged to be a hallucinated entity, and a fixed penalty is applied to it in the scoring function, while no penalty is applied when the probability meets the threshold; the final scoring function subtracts from the base image-text similarity score the number of hallucinated entities multiplied by the penalty coefficient, whose value range is [0.1, 2.0]; an indicator function judges whether an object mentioned in the answer is a hallucinated object, taking the value 1 (triggering the deduction) when the object is judged hallucinated and 0 (no deduction) when the object is judged to actually exist in the image; this forces the local model to adopt a conservative "describe only what is seen" strategy that suppresses hallucination when no user instruction is provided; the response with the highest final score is selected as the preferred response and a lower-scoring response as the worse response, obtained by the same method as in step 3.1;
the constructed preference pairs are optimized with a conditional preference loss, wherein pi(·) is the conditional probability of the trainable policy, ref denotes the frozen reference policy, beta is the temperature coefficient, and sigma(·) is the sigmoid function.
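The conditional preference loss at the end of claim 3 matches the standard DPO form: a sigmoid over the beta-scaled margin between the policy/reference log-ratios of the preferred and worse responses. A minimal sketch under that assumption, with beta = 0.1 as an illustrative temperature:

```python
import math

def dpo_loss(logp_w_policy: float, logp_l_policy: float,
             logp_w_ref: float, logp_l_ref: float, beta: float = 0.1) -> float:
    """Conditional preference loss of the DPO form described in claim 3:

        -log sigma( beta * [ (log pi(y_w) - log pi_ref(y_w))
                            - (log pi(y_l) - log pi_ref(y_l)) ] )

    Inputs are summed log-probabilities of the preferred (w) and worse (l)
    responses under the trainable policy and the frozen reference policy.
    """
    margin = beta * ((logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When policy and reference agree the margin is zero and the loss equals log 2; raising the policy's likelihood of the preferred response relative to the reference lowers the loss.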
  4. The federated multi-modal preference learning and alignment method under modality-missing conditions of claim 1, wherein the two-stage dynamic weighting in step four comprises the following steps:
4.1. sample-level gating: if a sample causes the local model high answer uncertainty owing to a missing modality, its contribution to the local model update should be small; the uncertainty of the local model on the current sample is therefore quantified first by computing the token-level prediction entropy of the model's response; a nonlinear weight function is constructed from the token-level entropy and used to weight the preference loss, and the weighted loss is back-propagated to update the local expert ensemble model; parameter updates from high-uncertainty samples are thereby automatically suppressed, preventing erroneous preference data from contaminating the model parameters;
4.2. client-level gating, which reduces the influence in global aggregation of clients with poor-quality data: after a round of local training, each client computes a reliability score as the mean of the weights of all its training samples; if a client's data are all high-quality, low-uncertainty samples, the score is approximately 1, and if its data are all noisy, modality-missing samples, the score is approximately 0; during server aggregation, the weight of each client is computed from its sample size multiplied by its reliability score, automatically reducing the contribution of low-quality data nodes.
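The two-stage gating of claim 4 can be sketched end to end; the exponential weight exp(-H/tau) is one assumed instance of the "nonlinear weight function", since the claim fixes only its monotone, temperature-controlled shape:

```python
import math

def token_entropy(token_probs) -> float:
    """Mean token-level prediction entropy of a response (step 4.1).

    `token_probs` is a list of per-token probability distributions,
    each a list of probabilities summing to 1.
    """
    ents = [-sum(p * math.log(p) for p in dist if p > 0) for dist in token_probs]
    return sum(ents) / len(ents)

def sample_weight(entropy: float, tau: float = 1.0) -> float:
    """Nonlinear gate: high-uncertainty samples get weight near 0."""
    return math.exp(-entropy / tau)

def client_reliability(sample_weights) -> float:
    """Client-level reliability: mean of all sample weights (step 4.2)."""
    return sum(sample_weights) / len(sample_weights)

def aggregation_weights(sizes, reliabilities):
    """Server-side aggregation weight, proportional to n_k * r_k."""
    raw = [n * r for n, r in zip(sizes, reliabilities)]
    total = sum(raw)
    return [w / total for w in raw]
```

A fully confident prediction has zero entropy and full weight; a client whose samples are all low-uncertainty approaches reliability 1, matching the claim's limiting cases.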
  5. The federated multi-modal preference learning and alignment method under modality-missing conditions of claim 1, wherein the specific process by which the server clusters the preference signature vectors in step six is: the server determines the cluster count with an adaptive clustering algorithm according to the distribution characteristics of the preference signature vectors uploaded by the clients; before each round of aggregation, the server computes the silhouette coefficient under different cluster counts and selects the optimal value as the number of expert models for the current round, striking a balance between personalized adaptation and global consistency.
  6. The federated multi-modal preference learning and alignment method under modality-missing conditions of claim 1, wherein the specific process in step seven by which the server screens experts for a client by preference-signature distance is: the server computes the cosine similarity between the client's latest uploaded signature and each expert cluster center, sorts the experts from high to low similarity, and sends the top-ranked experts to client k.
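The similarity routing of claim 6 reduces to a cosine top-p lookup; a sketch, where the p = 2 default and the return format are assumptions:

```python
import numpy as np

def route_experts(signature, centers, p=2):
    """Rank expert-cluster centers by cosine similarity to the client's
    latest signature and return the indices of the top-p experts (claim 6)."""
    sig = signature / np.linalg.norm(signature)
    c = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    sims = c @ sig                  # cosine similarity to each center
    order = np.argsort(-sims)       # descending similarity
    return order[:p].tolist(), sims[order[:p]].tolist()
```

The server would look up the returned indices in its expert mapping table and ship those parameter sets to the client.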
  7. The federated multi-modal preference learning and alignment method under modality-missing conditions of claim 1, wherein in step seven, for the gating coefficients, client k initializes a real-valued vector of length P with small random values and applies the Softmax operation to it to obtain the gating coefficients, ensuring that all expert weights are positive and sum to 1.
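The gating of claim 7 is a plain Softmax over a learnable vector of length P, which by construction yields positive weights summing to 1; a sketch, with parameter averaging as an assumed interpretation of "fusing expert models":

```python
import numpy as np

def gating_coefficients(a):
    """Softmax over a client's real-valued gate vector (claim 7):
    the result is positive and sums to 1."""
    a = np.asarray(a, dtype=float)
    e = np.exp(a - a.max())   # max-shift for numerical stability
    return e / e.sum()

def fuse_experts(expert_params, gates):
    """Gated convex combination of the P received expert parameter tensors."""
    return sum(g * w for g, w in zip(gates, expert_params))
```

With a uniform gate vector the fusion reduces to a plain average of the received experts, and any learned asymmetry in the gate vector tilts the ensemble toward the better-matching experts.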
  8. The federated multi-modal preference learning and alignment method under modality-missing conditions of claim 1, wherein the global model and reference policy distribution in step 1.1 is: the server distributes to the clients a model parameter set for federated alignment training, and the global model adopts a shared base plus expert model structure, wherein:
1.1.1. shared base distribution: the server distributes a shared base model comprising a visual encoder and a language model backbone; every client shares the same base throughout federated training; to reduce communication overhead, the server distributes the base only once during the initialization phase, and in subsequent federation rounds the base parameters theta are kept frozen and do not participate in federated aggregation updates;
1.1.2. initial multi-expert model distribution: the server initializes a set of expert models on top of the shared base model and transmits the shared base model and the corresponding expert models to the clients; the expert count may be 1, i.e. single-expert initialization, and when it is greater than 1, the server uses the same initialization for every expert; the experts together with the shared base model form the initial multi-expert model set; in the model distribution stage, this initialization transmits the Top-p experts to each client, which the client connects seamlessly to the local routing fusion mechanism;
1.1.3. composition and per-round update rule of the reference policy: the reference policy has the same structure as the current policy, but is kept frozen during the round's local training and is used only in the conditional preference optimization loss; in round 1 of federated training, the client uses as the reference policy the policy starting-point model formed by the shared base model parameters distributed by the server at initialization together with the matching initial expert parameter set; in rounds t > 1, the expert model parameters distributed by the server in the previous round are combined with the locally stored, frozen shared base model theta to form the round's training starting-point policy model, which is copied to produce the reference policy model; before the round's local training begins, the parameters of the reference policy model are frozen so that they do not participate in parameter updates during the round and are used only to compute the reference terms in the preference optimization.
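The per-round reference-policy rule of claim 8 (1.1.3), freeze the base, copy the round's starting experts as an immutable reference, and train only the policy experts, can be sketched with plain Python containers standing in for real parameter tensors:

```python
import copy

class RoundState:
    """Round-t starting point and frozen reference policy (claim 8, 1.1.3).

    `base` is frozen for the whole federation; `experts` holds the expert
    parameters received from the server for this round. The reference
    policy is a deep copy of the round's starting point and is never
    updated locally; it only supplies the reference terms of the loss.
    """
    def __init__(self, base: dict, experts: dict):
        self.base = base                           # frozen shared backbone
        self.policy_experts = experts              # trainable this round
        self.ref_experts = copy.deepcopy(experts)  # frozen reference copy

    def local_step(self, grads: dict, lr: float = 0.01):
        # only the policy experts are updated; base and reference stay fixed
        for k, g in grads.items():
            self.policy_experts[k] -= lr * g
```

After local training, only `policy_experts` is uploaded; at the next round the freshly received experts become both the new starting point and, via the copy, the new frozen reference.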

Description

Federated multi-modal preference learning and alignment method under modality-missing conditions
Technical Field
The invention relates to the field of artificial intelligence and distributed learning, in particular to a training control method based on parameter/statistic exchange between clients and a server, and more particularly to a federated multi-modal preference learning and alignment method under modality-missing conditions.
Background
With the progress of multi-modal large language models (MLLMs) and visual language models (VLMs) on tasks such as visual question answering, image description, and cross-modal retrieval, multi-modal generation systems are gradually entering application scenarios with higher reliability requirements, such as healthcare, finance, government affairs, and industry. However, when actually deployed, existing multi-modal generation and alignment techniques often face two types of engineering challenges at once: (1) missing or incomplete modality input (text only, image only, etc.), and (2) generated hallucinations (output inconsistent with the available modality evidence or lacking factual support). In federated learning (FL) scenarios under privacy and data-compliance constraints, these problems compound, making stable multi-institution collaborative alignment difficult with the prior art.
Prior art one: multi-modal alignment methods based on preference optimization, and their shortcomings. To alleviate the hallucination problem of multi-modal models, some studies have attempted to introduce the preference optimization paradigm (e.g., DPO-like methods) into multi-modal alignment procedures.
For example, Fu et al. propose hallucination-targeted preference optimization (HDPO), pointing out that directly applying DPO to multi-modal models for hallucination mitigation suffers from unstable improvements and inconsistent gains, and attempt to improve it through preference optimization that focuses more closely on hallucination. As another example, Wu et al. propose entity-centric multi-modal preference optimization (EMPO), pointing out that existing preference alignment methods tend to focus on "catering to preferences" while ignoring explicit alignment of the image-text modalities, leading to over-reliance on language priors and induced hallucination. Although the above approaches propose improvements at the levels of multi-modal preference data construction and preference target design, they have at least three key shortcomings from the perspective of engineering deployment and federated scenarios:
1. Implicit assumptions of modality completeness lead to insufficient preference evidence in modality-missing scenarios: the above preference construction and preference targets assume that sufficient multi-modal evidence is available (e.g., image and text both present, or complete instructions); when the actual input lacks a modality, such as text-only or image-only input, the basis for distinguishing better and worse responses within the candidate set is weakened or even invalid, introducing preference-label noise and biasing the update direction of the objective function on the model parameters.
2. The systems lack explicit modeling of modality-missing uncertainty and a training gating mechanism: when modality information is insufficient, multi-modal models often exhibit higher uncertainty; if parameters are updated with uniform strength, unreliable samples are easily introduced into training and preference noise is amplified; prior work focuses more on preference construction or alignment targets and lacks an adaptive, uncertainty-driven gated update mechanism for modality-missing samples.
3. The methods are mainly discussed under centralized training or single-domain settings: related methods usually rely on centralized data or a unified data distribution for preference alignment; for federated environments with multi-institution data isolation, heterogeneous distributions, and inconsistent modality combinations, their preference construction and optimization flows are difficult to apply directly or to converge stably [3].
Prior art two: federated multi-modal learning frameworks, and their shortcomings. Under privacy-protection requirements, federated learning realizes multiparty collaboration through the paradigm of client-side local training and server-side aggregate updates. For distributed training on multi-modal data, existing federated multi-modal learning work has proposed personalized aggregation and cross-modal alignment strategies. For example, Zhang et al. propose FedEPA, which employs personalized aggregation weights and an unsupervised modality-alignment strategy in multi-modal federated learning, emphasizing performance improvement under multi-modal classification tasks and limited labeling c