
CN-122004759-A - VR motion sickness detection method based on Mamba-Transformer fusion


Abstract

The invention discloses a VR motion sickness detection method based on Mamba-Transformer fusion. The method comprises: acquiring multimodal time-series data in a virtual reality environment and preprocessing it to obtain raw eye-movement features and raw head-movement features, wherein the multimodal time-series data comprise eye-movement data reflecting high-frequency visual attention and head-movement data reflecting low-frequency spatial posture; and inputting the raw eye-movement features and the raw head-movement features into a pre-trained VR motion sickness detection network to perform motion sickness detection, so as to obtain a motion sickness detection result. The invention constructs a dual-path temporal modeling framework with physiological adaptability, overcomes the limitation of a single model in processing multi-scale physiological signals, and achieves high-precision, personalized prediction of motion sickness severity in virtual reality environments.

Inventors

  • Xie Hongyang
  • Qiao Mengyu

Assignees

  • North China University of Technology (北方工业大学)

Dates

Publication Date
2026-05-12
Application Date
2026-01-27

Claims (10)

  1. A VR motion sickness detection method based on Mamba-Transformer fusion, comprising: acquiring multimodal time-series data in a virtual reality environment, and preprocessing the multimodal time-series data to obtain raw eye-movement features and raw head-movement features, wherein the multimodal time-series data comprise eye-movement data reflecting high-frequency visual attention and head-movement data reflecting low-frequency spatial posture; and inputting the raw eye-movement features and the raw head-movement features into a pre-trained VR motion sickness detection network to perform motion sickness detection, so as to obtain a motion sickness detection result.
  2. The method of claim 1, wherein the eye-movement data comprise the eyelid opening and closing degree of the left and right eyes, the three-dimensional coordinates of the gaze direction, the three-dimensional coordinates of the gaze origin, and the pupil diameter; and the head-movement data comprise the three-dimensional coordinates of the head position and the head rotation angle.
  3. The method of claim 1, wherein the preprocessing specifically comprises: resampling the multimodal time-series data at a unified set frequency to unify the time granularity, and normalizing the time-unified data by mapping it to the [0, 1] interval, so as to obtain the preprocessed multimodal time-series data (a preprocessing sketch follows the claims).
  4. The method of claim 1, wherein the VR motion sickness detection network comprises: a multi-view embedding module for aligning the raw eye-movement features and the raw head-movement features based on a cross-attention mechanism to obtain an embedded representation containing modal conflict information; a dual-path temporal modeling network for capturing preliminary global features and preliminary local features based on the embedded representation containing modal conflict information; a path fusion sub-network for dynamically computing the weights of the global features and the local features and performing a weighted fusion of the global features and the local features based on the obtained weights to obtain weighted fusion features; and a motion sickness prediction module for predicting a motion sickness score based on the weighted fusion features.
  5. The method according to claim 4, wherein aligning the raw eye-movement features and the raw head-movement features based on a cross-attention mechanism to obtain an embedded representation containing modal conflict information specifically comprises: mapping either one of the raw eye-movement features and the raw head-movement features to a query vector, and mapping the other modality to a key vector and a value vector; computing the cross-attention weight vector as $\alpha = \mathrm{softmax}\!\left(QK^{\mathsf{T}}/\sqrt{d_k}\right)$, where $\alpha$ denotes the cross-attention weight vector, $\mathrm{softmax}(\cdot)$ denotes the activation function, $Q$ denotes the query vector, $K$ denotes the key vector, the superscript $\mathsf{T}$ denotes the matrix transpose, and $d_k$ denotes the dimension of the key vector; computing a fused embedded representation based on the cross-attention weight vector as $Z = \alpha V$, where $Z$ is the fused embedded representation and $V$ is the value vector; extracting token embeddings of the raw eye-movement features and the raw head-movement features; and computing the embedded representation containing modal conflict information based on the fused embedded representation as $E = Z + T$, where $E$ denotes the embedded representation containing modal conflict information and $T$ denotes the token embedding of the raw eye-movement features or the raw head-movement features (see the cross-attention sketch after the claims).
  6. The method of claim 4, wherein the dual-path temporal modeling network comprises a global path for capturing the preliminary global features and a local path for capturing the preliminary local features; capturing the preliminary global features specifically comprises: inputting the embedded representation containing modal conflict information into a Mamba module, and performing a recursive state update with the state space model (SSM) in the Mamba module to obtain a global temporal feature; and inputting the global temporal feature into a KAN mixture-of-experts module to obtain the preliminary global feature, expressed as $F_g = \sum_{i=1}^{N} w_i \, E_i(h)$, where $F_g$ denotes the preliminary global feature, $i$ denotes the expert-network index in the KAN mixture-of-experts module, $N$ denotes the number of expert networks, $w_i$ denotes the activation weight of the $i$-th expert network, $E_i(\cdot)$ denotes the $i$-th expert network, which applies a nonlinear transformation to the input features by means of spline functions, and $h$ denotes the global temporal feature (see the global-path sketch after the claims).
  7. The method of claim 6, wherein capturing the preliminary local features specifically comprises: applying position encoding to the raw eye-movement features and the raw head-movement features respectively; separating the position-encoded features into a low-frequency trend term and a high-frequency residual term; slicing the low-frequency trend term and the high-frequency residual term respectively to obtain overlapping time slices; applying a local window attention mechanism to the overlapping time slices to obtain a trend-branch output and a residual-branch output; and adding the trend-branch output and the residual-branch output and average-pooling over the channel dimension to obtain the preliminary local features (see the local-path sketch after the claims).
  8. The method of claim 4, wherein the preliminary global features comprise a preliminary eye-movement global feature and a preliminary head-movement global feature; for the preliminary global features, the global features are generated by: applying linear transformations to the preliminary eye-movement global feature and the preliminary head-movement global feature respectively to obtain an eye-movement modality saliency score and a head-movement modality saliency score; concatenating the eye-movement modality saliency score and the head-movement modality saliency score and applying a softmax normalization to generate an attention weight vector satisfying a probability distribution; dynamically scaling the raw eye-movement features and the raw head-movement features by element-wise multiplication with the attention weight vector; and adding the dynamically scaled features element by element to obtain the global features; the local features are generated from the preliminary local features in the same way as the global features (see the fusion sketch after the claims).
  9. The method according to claim 4, wherein dynamically computing the weights of the global features by the path fusion sub-network specifically comprises: adding the raw eye-movement features and the raw head-movement features element by element to obtain an input sequence; mapping the input sequence into a unified hidden-space representation through a linear projection layer to obtain dimension-aligned features; compressing the dimension-aligned features by adaptive average pooling; and generating the normalized weights of the global features and of the local features from the compressed features through a fully connected layer and a softmax activation function (see the fusion sketch after the claims).
  10. The method of claim 1, wherein the loss function employed in training the VR motion sickness detection network is $\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \lambda \mathcal{L}_{\mathrm{reg}}$, where $\mathcal{L}$ denotes the total loss function, $\mathcal{L}_{\mathrm{MSE}}$ denotes the mean-square-error loss function, $\lambda$ is the balance coefficient, and $\mathcal{L}_{\mathrm{reg}}$ is the regularization loss term (see the loss sketch after the claims).
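
As a concrete illustration of the preprocessing in claim 3, the following is a minimal sketch assuming linear interpolation for resampling and a 60 Hz target rate (the patent specifies only "a unified set frequency"); the function name and signature are hypothetical.

```python
import numpy as np
from scipy.interpolate import interp1d

def preprocess_channel(signal, src_times, target_hz=60.0):
    """Resample one raw channel to a unified rate, then min-max
    normalize it to the [0, 1] interval (claim 3). target_hz and
    linear interpolation are assumptions."""
    t_new = np.arange(src_times[0], src_times[-1], 1.0 / target_hz)
    resampled = interp1d(src_times, signal)(t_new)
    lo, hi = resampled.min(), resampled.max()
    return (resampled - lo) / (hi - lo + 1e-8)  # guard constant channels
```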
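Claim 5's cross-attention alignment is standard scaled dot-product attention across modalities. A minimal PyTorch sketch, assuming the projections wq/wk/wv are learnable nn.Linear layers and that the token embedding is combined by residual addition (the claim states the combination but not its exact form):

```python
import torch
import torch.nn.functional as F

def cross_modal_embed(query_feats, kv_feats, token_embed, wq, wk, wv):
    """One modality is mapped to queries, the other to keys/values
    (claim 5); the fused output plus the token embedding yields the
    embedding carrying modal-conflict information."""
    q, k, v = wq(query_feats), wk(kv_feats), wv(kv_feats)
    d_k = k.shape[-1]
    alpha = F.softmax(q @ k.transpose(-2, -1) / d_k ** 0.5, dim=-1)
    z = alpha @ v            # fused embedded representation Z = alpha V
    return z + token_embed   # E = Z + T (residual form assumed)
```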
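For claim 6's global path, the sketch below stands in a diagonal linear state-space recursion for the Mamba block and a gated sum of cubic-feature experts for the KAN mixture of experts; both simplifications, and all sizes, are assumptions (a production version would use the mamba-ssm package and true spline-based KAN layers).

```python
import torch
import torch.nn as nn

class GlobalPath(nn.Module):
    """Claim 6 sketch: recursive SSM state update, then a KAN-style
    mixture of experts F_g = sum_i w_i * E_i(h)."""
    def __init__(self, d_model, n_experts=4):
        super().__init__()
        self.A = nn.Parameter(torch.full((d_model,), 0.9))  # state decay
        self.B = nn.Parameter(torch.ones(d_model))          # input gain
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            [nn.Linear(3 * d_model, d_model) for _ in range(n_experts)])

    def forward(self, x):                  # x: (batch, time, d_model)
        h = torch.zeros_like(x[:, 0])
        for t in range(x.shape[1]):        # recursive state update
            h = self.A * h + self.B * x[:, t]
        basis = torch.cat([h, h ** 2, h ** 3], dim=-1)  # spline stand-in
        w = torch.softmax(self.gate(h), dim=-1)         # expert weights
        return sum(w[:, i:i + 1] * e(basis)             # weighted experts
                   for i, e in enumerate(self.experts))
```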
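Claim 7's local path decomposes the sequence and applies windowed attention per overlapping slice. A sketch with assumed window, stride, and moving-average kernel sizes; position encoding is omitted for brevity, and d_model must be divisible by the assumed head count:

```python
import torch
import torch.nn as nn

class LocalPath(nn.Module):
    """Claim 7 sketch: moving-average trend + residual decomposition,
    overlapping time slices, local window attention per branch."""
    def __init__(self, d_model, win=16, stride=8, kernel=9):
        super().__init__()
        self.win, self.stride = win, stride
        self.ma = nn.AvgPool1d(kernel, stride=1, padding=kernel // 2)
        self.attn = nn.MultiheadAttention(d_model, num_heads=4,
                                          batch_first=True)

    def branch(self, x):                        # x: (B, T, D)
        s = x.unfold(1, self.win, self.stride)  # (B, S, D, win)
        s = s.permute(0, 1, 3, 2)               # (B, S, win, D)
        b, n, w, d = s.shape
        flat = s.reshape(b * n, w, d)
        out, _ = self.attn(flat, flat, flat)    # local window attention
        return out.reshape(b, n, w, d).mean(dim=(1, 2))  # pool to (B, D)

    def forward(self, x):
        trend = self.ma(x.transpose(1, 2)).transpose(1, 2)  # low-freq
        resid = x - trend                                   # high-freq
        return self.branch(trend) + self.branch(resid)
```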
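Claims 8 and 9 describe two gating computations: modality saliency weights that rescale the raw features, and path weights that blend the global and local branches. A sketch with assumed layer sizes:

```python
import torch
import torch.nn as nn

class PathFusion(nn.Module):
    """Sketch of claims 8-9: modality saliency gating and dynamic
    global/local path weighting."""
    def __init__(self, d_model):
        super().__init__()
        self.score_eye = nn.Linear(d_model, 1)   # saliency heads
        self.score_head = nn.Linear(d_model, 1)
        self.proj = nn.Linear(d_model, d_model)  # unified hidden space
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.fc = nn.Linear(d_model, 2)          # global/local logits

    def modality_gate(self, g_eye, g_head, x_eye, x_head):
        # Claim 8: scores -> softmax weights -> element-wise scaling -> sum.
        s = torch.cat([self.score_eye(g_eye), self.score_head(g_head)], -1)
        a = torch.softmax(s, dim=-1)                       # (B, 2)
        return (a[:, 0:1].unsqueeze(1) * x_eye
                + a[:, 1:2].unsqueeze(1) * x_head)         # (B, T, D)

    def path_weights(self, x_eye, x_head):
        # Claim 9: sum -> linear projection -> adaptive pooling -> softmax.
        seq = self.proj(x_eye + x_head)                    # (B, T, D)
        z = self.pool(seq.transpose(1, 2)).squeeze(-1)     # (B, D)
        return torch.softmax(self.fc(z), dim=-1)           # (B, 2)
```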
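Claim 10's objective is MSE plus a balanced regularization term. A sketch in which the regularizer is taken as an L2 penalty with λ = 1e-4; both choices are assumptions, since the patent fixes only the form L = L_MSE + λ·L_reg:

```python
import torch

def total_loss(pred, target, model, lam=1e-4):
    """Claim 10: total loss = MSE + lam * regularization term."""
    mse = torch.mean((pred - target) ** 2)
    reg = sum(p.pow(2).sum() for p in model.parameters())  # assumed L2
    return mse + lam * reg
```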

Description

VR motion sickness detection method based on Mamba-Transformer fusion

Technical Field

The invention relates to the technical field of virtual reality and human-computer interaction, and in particular to a VR motion sickness detection method based on Mamba-Transformer fusion.

Background

Motion sickness is a major side effect of prolonged user exposure to a virtual environment (VE); symptoms include nausea, disorientation, and eye fatigue. With the popularity of VR technology in entertainment, medicine, and training, effective prediction and mitigation of motion sickness is becoming critical. The prior art has shifted from earlier subjective questionnaire evaluations (e.g., the SSQ proposed by Kennedy et al.) to objective automatic prediction based on physiological signals. Although eye-movement and head-movement data have been introduced, significant shortcomings remain in both the technical means and the modeling depth.

First, the depth of modality interaction is insufficient, and differences in physiological properties are ignored. Existing studies, while beginning to integrate multimodal data, use overly simplistic fusion mechanisms. For example, the MS-STTN model proposed by Jeong and Han, although it employs a Transformer to extract spatiotemporal features of eye movement and head movement, still suffers from reliance on synchronous data acquisition and excessive label spacing. In addition, existing schemes simply concatenate the modal data at an early stage and neglect the differing physical properties of the eye-movement signal (high-frequency, fine-grained, reflecting instantaneous disturbances) and the head-movement signal (low-frequency, spatially oriented, reflecting long-range intent) in their temporal characteristics, which makes it difficult for the model to capture the core cause of motion sickness, the visual-vestibular conflict.

Second, multi-scale temporal modeling capability is lacking, and it is difficult to balance global and local features. The development of motion sickness involves both long-term cumulative discomfort and localized mutations caused by sudden visual disturbances. Existing CNN-LSTM architectures (e.g., PhysioDNN or HMDPrediction) and basic Transformer architectures (e.g., Informer) have difficulty capturing features at both scales. For example, DeepLSTM models tend to lose long-range dependencies when processing long sequences because of the limitations of their recursive structure, whereas traditional global attention mechanisms suffer from high computational complexity and insufficient sensitivity to fine local fluctuations.

Third, adaptive regulation mechanisms for individual differences are lacking. The perception of motion sickness shows extremely strong individual variability. Although some schemes, such as MAC, introduce a dual-attention mechanism for fusion, the architecture is mostly static and cannot dynamically adjust the prediction emphasis according to the user's real-time physiological feedback. Keshavarz et al., although using machine learning to quantify motion sickness, lack long-term dynamic modeling, resulting in models with weak generalization and prediction robustness across users and scenes.
In view of the limitations of existing schemes in modal fusion depth, multi-scale modeling, and individual adaptivity, how to design a prediction model that can adapt to the specific physical properties of eye-movement and head-movement signals and adaptively balance global trends against local disturbances has long been a technical problem to be solved in the field of VR health monitoring.

Disclosure of the Invention

In view of the above defects in the prior art, the VR motion sickness detection method based on Mamba-Transformer fusion solves the following problems of existing motion sickness detection techniques: a single model is limited in processing multi-scale physiological signals; the modal fusion depth is insufficient; it is difficult to account for the dual inducing mechanisms of motion sickness, long-term accumulation and instantaneous mutation; and individual differences among users lead to weak model generalization.

To achieve the object of the invention, the technical solution adopted by the invention is as follows. The VR motion sickness detection method based on Mamba-Transformer fusion comprises the following steps: acquiring multimodal time-series data in a virtual reality environment, and preprocessing the multimodal time-series data to obtain raw eye-movement features and raw head-movement features, wherein the multimodal time-series data comprise eye-movement data reflecting high-frequency visual attention and head-movement data reflecting low-frequency spatial posture; and inputting the raw eye-movement features and the raw head-movement features into a pre-trained VR motion sickness detection network to detect motion sickness, so as to obtain a motion sickness detection result (wired together in the sketch below).
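
Tying the pieces together, the following hypothetical top-level module wires the sketches given after the claims into the claim-4 pipeline; the plain addition standing in for the claim-5 cross-attention embedding, and all dimensions, are assumptions:

```python
import torch
import torch.nn as nn

class MSDetector(nn.Module):
    """Hypothetical wiring of claim 4: dual-path temporal modeling,
    dynamic path fusion, and a regression head for the score."""
    def __init__(self, d_model=64):
        super().__init__()
        self.global_path = GlobalPath(d_model)   # sketches after claims
        self.local_path = LocalPath(d_model)
        self.fusion = PathFusion(d_model)
        self.head = nn.Linear(d_model, 1)        # score regressor

    def forward(self, x_eye, x_head):            # (B, T, D) each
        emb = x_eye + x_head                     # stand-in for claim 5
        g = self.global_path(emb)                # (B, D) global feature
        loc = self.local_path(emb)               # (B, D) local feature
        w = self.fusion.path_weights(x_eye, x_head)      # (B, 2)
        fused = w[:, 0:1] * g + w[:, 1:2] * loc  # weighted fusion
        return self.head(fused).squeeze(-1)      # motion sickness score

# Example with assumed shapes:
# scores = MSDetector()(torch.rand(8, 256, 64), torch.rand(8, 256, 64))
```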