CN-122024009-A - Test-time adaptation method for sign language translation based on a signer-aware mixture-of-experts network
Abstract
The invention discloses a test-time adaptation method for sign language translation based on a signer-aware mixture-of-experts network, aiming to solve the poor real-time domain-shift adaptation, insufficient personalization, and weak stability of existing sign language translation technology in real-world scenarios. A pluggable signer-aware mixture-of-experts (SAME) module is embedded in the encoder of a pre-trained sign language translation model, and online adaptation is realized through an unsupervised adaptation loss; the SAME module adopts a low-rank expert design and a signer-aware gating mechanism, and avoids expert collapse through a distribution-regularized initialization loss. The invention requires no labeled target-domain data, performs adaptive learning on a new domain at test time, improves the performance of the pre-trained model, and helps the model adapt to dynamic, cross-signer environment changes.
Inventors
- ZHANG JINSHAN
- YANG LUJIA
- JIN TAO
Assignees
- Zhejiang University (浙江大学)
- Innovation and Management Center, School of Software of Zhejiang University (Ningbo) (浙江大学软件学院(宁波)创新与管理中心)
Dates
- Publication Date
- 20260512
- Application Date
- 20260202
Claims (10)
- 1. A test-time adaptation method for sign language translation based on a signer-aware mixture-of-experts network, characterized by comprising the following steps: initializing a sign language translation model, namely adopting a pre-trained Transformer encoder-decoder architecture as the backbone network, freezing the backbone parameters, inserting a signer-aware mixture-of-experts (SAME) module after each encoder layer, and performing initialization training; receiving unlabeled target-domain samples, extracting visual features, keypoint features, and signer identity features from the sign language video, and preprocessing the signer identity features into embedding vectors; feeding the extracted features into the backbone encoders, taking the output of each encoder layer and passing it into the SAME module, where expert selection and feature adaptation are realized through a signer-aware gating network; performing online test-time updates of only the SAME module parameters, using an unsupervised online adaptation loss formed as a weighted sum of entropy minimization, minimum class confusion, and pseudo-label supervision; and restoring the model to its initial state in preparation for adaptation in the next test.
- 2. The method according to claim 1, wherein the sign language translation model comprises a video encoder, a keypoint encoder, a fusion encoder and a text decoder, and a pluggable signer-aware mixture-of-experts module is inserted after each Transformer block of the video encoder and the keypoint encoder.
- 3. The method according to claim 1, wherein the signer-aware mixture-of-experts module comprises a plurality of expert branches, one straight-through branch, and one signer-aware gating network, and each expert is realized based on LoRA. For an encoder-layer input feature $H \in \mathbb{R}^{T \times d}$, where $T$ is the length of the input feature and $d$ is the feature dimension, the output of the $k$-th expert is: $E_k(H) = \alpha \, \sigma(H A_k) B_k$, where $\sigma$ is the activation function, $\alpha$ is the scaling coefficient, $A_k \in \mathbb{R}^{d \times r}$ and $B_k \in \mathbb{R}^{r \times d}$ are the trainable LoRA parameters of the expert, and $r \ll d$ is the decomposition rank of the low-rank expert.
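A minimal NumPy sketch of the claim-3 low-rank expert; ReLU stands in for the unspecified activation, and all parameter names are illustrative:

```python
import numpy as np

def lora_expert(H, A, B, alpha=1.0):
    """Low-rank expert in the LoRA style: alpha * sigma(H @ A) @ B,
    with H of shape (T, d), A of shape (d, r), B of shape (r, d),
    and r << d, so the output keeps the input shape (T, d)."""
    return alpha * np.maximum(H @ A, 0.0) @ B
```

Because each expert only holds the two small factors A and B, many experts can be inserted per encoder layer at modest parameter cost, which is the point of the low-rank design.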
- 4. The method according to claim 3, wherein the gating network adopts a simple perceptron structure with Softmax as the final output activation; it takes as input the concatenation $[h; s]$ of the encoder feature $h$ and the signer feature $s$ and outputs a weight distribution over each expert and the straight-through branch. The gating weights are computed as: $g = \mathrm{Softmax}(\mathrm{TopK}(z, k))$ with routing scores $z = W_g [h; s] + b_g$, where $g$ is the gating function output, $z$ is the routing score, and $W_g$, $b_g$ are the parameters of the gating network. Different experts are selected for each frame of the video to learn fine-grained inter-frame variation, and the $\mathrm{TopK}$ function sets all scores outside the top $k$ to minus infinity, ensuring that only $k$ experts are activated.
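The Top-k routing of claim 4 can be sketched as follows, assuming a single linear layer (`W`, `b`) over 1-D features and treating the straight-through branch as one more gated slot:

```python
import numpy as np

def topk_softmax_gate(h, s, W, b, k=2):
    """Signer-aware gate: routing scores z = W @ [h; s] + b over the
    expert/straight-through slots; scores outside the top k are set
    to -inf before Softmax, so only k slots get nonzero weight."""
    x = np.concatenate([h, s])              # encoder feature + signer feature
    z = W @ x + b                           # routing scores
    masked = np.full_like(z, -np.inf)
    top = np.argsort(z)[-k:]                # indices of the k largest scores
    masked[top] = z[top]
    e = np.exp(masked - masked[top].max())  # stable softmax; exp(-inf) = 0
    return e / e.sum()
```

Applied per frame, this yields a sparse mixture in which exactly k branches contribute to each frame's adapted feature.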
- 5. The method according to claim 1, wherein the sign language video is preprocessed before feature extraction, including frame normalization, deblurring, and cropping of redundant regions; feature extraction then yields RGB features as the visual features, skeleton joint features as the keypoint features, and the signer ID and face features as the signer identity features.
- 6. The method according to claim 1, wherein the initialization training comprises: combining a standard cross-entropy-supervised sign language translation loss with a designed distribution regularization loss to initialize the signer-aware mixture-of-experts module. With expert outputs $E \in \mathbb{R}^{B \times M \times T \times D}$, where $B$ is the batch size, $M$ is the number of experts, $T$ is the sequence length, and $D$ is the expert output hidden dimension, the regularization loss is computed as: $\mathcal{L}_{\mathrm{reg}} = \frac{1}{B} \sum_{b=1}^{B} \| \hat{E}_b \hat{E}_b^{\top} - I \|_F^2$, where $\hat{E}_b \in \mathbb{R}^{M \times (T \cdot D)}$ is the flattened expert output for sample $b$, $I$ is the identity matrix, and $\| \cdot \|_F$ is the Frobenius norm. An early-stopping strategy is adopted during training, and the SAME module initialization parameters are saved.
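The claim-6 regularizer can be sketched as below; L2-normalizing the flattened rows is an assumption (the claim says only "flattened"), added so the diagonal of the Gram matrix matches the identity target:

```python
import numpy as np

def diversity_loss(E):
    """Distribution regularization for expert outputs E of shape
    (B, M, T, D): flatten each sample's experts to rows of length T*D,
    L2-normalize the rows (assumption), and penalize the Frobenius
    distance of the M x M Gram matrix from the identity, so experts
    stay decorrelated and do not collapse onto one another."""
    B, M, T, D = E.shape
    total = 0.0
    for b in range(B):
        Eb = E[b].reshape(M, T * D)
        Eb = Eb / (np.linalg.norm(Eb, axis=1, keepdims=True) + 1e-8)
        G = Eb @ Eb.T
        total += np.sum((G - np.eye(M)) ** 2)
    return total / B
```

Mutually orthogonal expert outputs drive the penalty to zero, while identical experts are maximally penalized, which is how the initialization avoids expert collapse.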
- 7. The method according to claim 1, wherein the entropy minimization loss specifically reduces output uncertainty, driving the model to produce more confident predictions; with $p_i$ the class prediction distribution of the $i$-th frame, $C$ the number of classes, and $T$ the sequence length: $\mathcal{L}_{\mathrm{ent}} = -\frac{1}{T} \sum_{i=1}^{T} \sum_{c=1}^{C} p_i(c) \log p_i(c)$.
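The claim-7 entropy term is the mean Shannon entropy of the per-frame prediction distributions; a direct NumPy rendering:

```python
import numpy as np

def entropy_loss(P, eps=1e-8):
    """Entropy minimization loss over per-frame class distributions
    P of shape (T, C): the mean Shannon entropy across the T frames.
    `eps` guards the log against exact zeros."""
    return float(-(P * np.log(P + eps)).sum(axis=1).mean())
```

The loss is near zero for one-hot (fully confident) predictions and maximal, log C, for uniform ones, so minimizing it sharpens the model's outputs on the unlabeled target domain.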
- 8. The method according to claim 1, wherein the minimum class confusion loss specifically regularizes the class distribution, avoiding excessive bias of the model toward a few classes and mitigating the over-confidence induced by the entropy minimization loss; with $P \in \mathbb{R}^{T \times C}$ the prediction matrix for a sample and $P_{\cdot c}$ its $c$-th column, the class confusion matrix is $Q = P^{\top} P$, which after row normalization gives $\tilde{Q}$, and the loss is: $\mathcal{L}_{\mathrm{mcc}} = \frac{1}{C} \sum_{c=1}^{C} \sum_{c' \neq c} \tilde{Q}_{c c'}$.
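Since the claim's formula image is not reproduced in the text, the following is one plausible instantiation of a minimum-class-confusion loss consistent with the stated symbols (a per-sample prediction matrix P and its columns):

```python
import numpy as np

def class_confusion_loss(P):
    """One plausible minimum-class-confusion loss (assumption, not the
    patent's exact formula): build the C x C confusion matrix
    Q = P^T P from the (T, C) prediction matrix, row-normalize it,
    and penalize the average off-diagonal (cross-class) mass."""
    Q = P.T @ P
    Q = Q / (Q.sum(axis=1, keepdims=True) + 1e-8)
    C = Q.shape[0]
    off_diag = Q.sum() - np.trace(Q)
    return float(off_diag / C)
```

Predictions that commit cleanly to distinct classes produce a near-diagonal Q and hence near-zero loss, while predictions smeared across classes are penalized.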
- 9. The method according to claim 1, wherein the pseudo-label supervision loss specifically uses the model itself to generate pseudo labels that provide a stable supervision signal; with $\hat{y}_i$ the pseudo label, $\mathrm{CE}$ the cross-entropy loss, $p_i$ the class prediction distribution of the $i$-th frame, $C$ the number of classes, and $T$ the sequence length: $\mathcal{L}_{\mathrm{pl}} = \frac{1}{T} \sum_{i=1}^{T} \mathrm{CE}(\hat{y}_i, p_i)$.
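A sketch of the claim-9 term, assuming (as is standard for self-generated pseudo labels) that each frame's label is the argmax of its own prediction distribution:

```python
import numpy as np

def pseudo_label_loss(P, eps=1e-8):
    """Pseudo-label supervision: take the argmax of each frame's
    prediction distribution P[i] as its pseudo label, then compute the
    mean cross-entropy of the predictions against those labels."""
    labels = P.argmax(axis=1)                    # self-generated pseudo labels
    T = P.shape[0]
    return float(-np.log(P[np.arange(T), labels] + eps).mean())
```

Unlike the entropy and class-confusion terms, this gives the online update an explicit classification target, which stabilizes adaptation.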
- 10. A test-time adaptation device for sign language translation based on a signer-aware mixture-of-experts network, comprising a memory and one or more processors, wherein executable code is stored in the memory, and the processors execute the executable code to implement the test-time adaptation method for sign language translation according to any one of claims 1-9.
Description
Test-time adaptation method for sign language translation based on a signer-aware mixture-of-experts network

Technical Field

The invention relates to the intersection of sign language translation and test-time adaptation, and in particular to a sign language translation model based on a signer-aware mixture-of-experts network that adapts to dynamic cross-signer environments.

Background

Sign language translation technology serves as a key bridge connecting the hearing-impaired community and the hearing society; its central goal is to automatically convert a sign language video sequence into spoken-language text. With the development of deep learning, modern sign language translation models achieve good translation performance in controlled settings, but real-world deployment inevitably faces domain shift caused by individual differences among signers, which can significantly degrade translation quality and leave the models insufficient for practical use. Existing solutions to domain shift in sign language translation suffer mainly from the following defects:

1. Traditional supervised fine-tuning based on annotated target-domain data is limited by the high acquisition cost and long turnaround of sign language annotation, and cannot achieve dynamic adaptation in real-time deployment scenarios;

2. Unsupervised domain adaptation is inefficient: existing techniques (such as methods based on batch normalization) rely on batch statistics or lengthy offline adaptation, cannot meet the real-time requirements of streaming input, and easily introduce training instability;

3. Existing test-time adaptation (TTA) methods focus mainly on static tasks such as image classification; they lack designs tailored to sequence generation tasks such as sign language translation, cannot effectively decouple the multi-level domain shifts in sign language translation (such as shallow visual feature shifts and deep semantic feature shifts), and do not consider personalized adaptation to the individual characteristics of signers;

4. Mixture-of-experts (MoE) techniques are unexplored here: the MoE architecture has been used for efficient parameter expansion in natural language processing and computer vision, but has not been applied to test-time adaptation for sign language translation, and existing MoE variants lack designs targeting the dynamics of domain shift and signer specificity, leading to problems such as expert collapse and poor adaptation stability.

In summary, the prior art cannot simultaneously achieve real-time performance, personalization, and stability of test-time adaptation for sign language translation; a lightweight, pluggable scheme that rapidly adapts to dynamic domain shift is needed.

Disclosure of Invention

The invention aims to overcome the above defects of the prior art and provides a test-time adaptation method for sign language translation based on a signer-aware mixture-of-experts network.
The aim of the invention is realized by the following technical scheme: the test-time adaptation method for sign language translation based on a signer-aware mixture-of-experts network comprises the following steps: initializing a sign language translation model, namely adopting a pre-trained Transformer encoder-decoder architecture as the backbone network, freezing the backbone parameters, inserting a signer-aware mixture-of-experts (SAME) module after each encoder layer, and performing initialization training; receiving unlabeled target-domain samples and extracting visual features, keypoint features, and signer identity features from the sign language videos; feeding the extracted features into the backbone encoders, taking the output of each encoder layer and passing it into the SAME module, where expert selection and feature adaptation are realized through a signer-aware gating network; performing online test-time updates of only the SAME module parameters, using an unsupervised adaptation loss formed as a weighted sum of entropy minimization, minimum class confusion, and pseudo-label supervision; and restoring the model to its initial state in preparation for adaptation in the next test. Further, the sign language translation model comprises a video encoder, a key point encoder, a fusion encoder and a text decoder, wherein a pluggable sign language person perception mixed e