CN-121979387-A - AR system cognitive security interaction method based on multi-modal large language model reasoning and bidirectional personalization

CN121979387A

Abstract

The invention discloses an AR system cognitive security interaction method based on multi-modal large language model reasoning and bidirectional personalization. A cross-attention mechanism guided by inertial measurement unit signals explicitly models and removes motion artifacts from the electroencephalogram signal. Compared with existing methods that use raw EEG signals directly or apply only simple filtering, the personalized attention perception is strongly resistant to interference, attention recognition is more accurate, the user's moments of distraction can be precisely captured, and the false alarm rate is reduced. In addition, personalized preference configuration information is generated through human-machine alignment and embedded into the reasoning process of the multi-modal large language model, so that the user's subjective preferences regarding risk are read and understood, alarm fatigue is alleviated, and user trust is established. The invention realizes a truly intelligent closed loop: a complete intelligent architecture of "perception (physiological denoising) - cognition (LLM reasoning combined with preference) - action (adaptive interaction)" that adapts to the physiological characteristics and psychological preferences of different users.

Inventors

  • Pei Yunqiang
  • Wang Guoqing
  • Li Tianyu
  • Xue Peng
  • Wang Peng
  • Yang Yang

Assignees

  • University of Electronic Science and Technology of China (电子科技大学)

Dates

Publication Date
2026-05-05
Application Date
2026-01-20

Claims (3)

  1. An AR system cognitive security interaction method based on multi-modal large model reasoning and bidirectional personalization, characterized by comprising the following steps:
(1) Building and training a personalized, motion-robust attention perception model. The inertial measurement unit (IMU) signal is used as a reference noise source, and motion artifacts are precisely stripped from the electroencephalogram (EEG) signal through a cross-attention mechanism, so that a user-specific, motion-interference-resistant personalized attention perception model is trained that judges in real time whether the user is in a focused state or a cognitively distracted state.
(2) Constructing a human-machine-aligned personalized scene cognitive reasoning layer based on the multi-modal large language model. The user's subjective risk preferences are captured in advance through a calibration stage, and personalized preference configuration information is generated. This information is embedded as a core knowledge base into the prompt of the multi-modal large language model; during cognitive reasoning over egocentric images captured by the AR system's real-time camera, the embedded preference configuration is forcibly invoked for comparison, so that the model understands the user's unique view of danger and is guided to infer the alarm necessity, i.e., whether to raise an alarm. A human-machine-aligned personalized scene cognitive reasoning layer is thereby obtained.
(3) Performing adaptive personalized intervention. While the AR system is in use, environmental risk assessment is triggered when the personalized attention perception model detects that the user is cognitively distracted; the human-machine-aligned personalized scene cognitive reasoning layer then performs cognitive reasoning on the egocentric images captured by the AR system's real-time camera and produces the alarm necessity, i.e., whether to alarm. If an alarm is needed, a low-interference, high-salience personalized alarm is presented in the AR interface according to the user's preset interaction habits.
  2. The AR system cognitive security interaction method based on multi-modal large model reasoning and bidirectional personalization according to claim 1, wherein in step (1) the IMU signal is used as a reference noise source, motion artifacts are precisely stripped from the EEG signal through a cross-attention mechanism, a user-specific, motion-interference-resistant personalized attention perception model is trained, and whether the user is in the focused state or the cognitively distracted state is judged in real time, as follows:
1.1) Synchronous acquisition of multi-modal heterogeneous signals. The AR system synchronously collects two signal streams on millisecond-level timestamps: the main signal is the user's EEG signal, and the reference signal is IMU data containing 9 feature components, namely triaxial acceleration, triaxial angular velocity, and triaxial Euler angles.
1.2) Data preprocessing and time-frequency feature construction. Preprocessing and feature engineering are carried out on the two collected signal streams to construct the feature sequences for model input:
Signal decomposition and filtering: band-pass filtering retains the effective 0.5-45 Hz band of the user's EEG signal, which is then decomposed into the power features of six frequency bands, i.e., the energy values of the Delta, Theta, Alpha, low Beta, high Beta, and Gamma bands.
Sliding-window feature extraction and normalization: the six-band power features and the smoothed 9-component IMU data are each segmented with a sliding window of length 10 seconds and step 1 second, and the signals in each window are normalized, so that one six-band EEG feature sequence and one 9-component IMU feature sequence are obtained every second.
Downsampling and class balancing: the extracted six-band EEG feature sequences and 9-component IMU feature sequences are downsampled to 1 Hz, and a class weighting strategy is introduced into the loss function during the training stage of the personalized attention perception model. The downsampled six-band sequences form the EEG features X_eeg; the downsampled 9-component IMU sequences form the IMU features X_imu.
1.3) Cross-attention-based motion artifact stripping.
Encoding: X_eeg and X_imu are each encoded by a bidirectional long short-term memory (BiLSTM) network to capture temporal dynamics, yielding H_eeg and H_imu.
Role assignment: the query Q is the encoded EEG features H_eeg; the key K and the value V are both the encoded IMU features H_imu.
Attention weight calculation: the attention weights are computed from the scaled dot product of the query and the key: A = softmax(Q K^T / sqrt(d_k)).
Motion artifact synthesis: based on the computed similarity, the attention weights are applied to the value terms, i.e., the encoded IMU features, to synthesize an estimated motion artifact: Artifact = A V.
Differential denoising: the motion artifact component is subtracted from the encoded EEG features to obtain clean EEG features that preserve cognitive information but remove motion disturbance: H_clean = H_eeg - Artifact.
1.4) Personalized attention state classification. The purified clean EEG features H_clean are input into a fully connected neural network classifier, which outputs the user's current cognitive distraction probability; when this probability is greater than a set threshold the user is judged to be in the cognitively distracted state, and otherwise in the focused state.
  3. The AR system cognitive security interaction method based on multi-modal large model reasoning and bidirectional personalization according to claim 1, wherein in step (2) the user's subjective risk preferences are captured in advance through a calibration stage, personalized preference configuration information is generated and embedded as a core knowledge base into the prompt of the multi-modal large language model, and during cognitive reasoning on egocentric images captured by the AR system's real-time camera the embedded preference configuration is forcibly invoked for comparison, so that the model understands the user's unique view of danger and is guided to infer the alarm necessity, i.e., whether to alarm, thereby obtaining the human-machine-aligned personalized scene cognitive reasoning layer:
2.1) Constructing personalized preference configuration information to achieve human-machine cognitive alignment. When the AR system is used for the first time, or during periodic recalibration, the user's subjective preferences are captured through an interactive flow and structured data are generated: the AR system displays a series of risk types to the user; the AR system first displays the generic large model's objective evaluation of each scene; the user subjectively corrects the generic AI's evaluation based on personal experience; the AR system records the differences between the generic AI evaluation and the user's subjective corrections, and generates the personalized preference configuration information.
2.2) Constructing a chain of thought containing preference constraints. The egocentric image captured by the camera is input into the multi-modal large language model, and a prompt containing instructions is constructed to guide the model to reason step by step:
Role setting: the model is instructed that "you are not a general observer; you are a personalized security assistant dedicated to this user".
Context injection: the personalized preference configuration information generated in step 2.1) is embedded into the prompt as a core knowledge base.
Visual perception: the multi-modal large language model first identifies the objects, distances, and dynamic relations in the image.
Preference matching and reasoning: the model is forced to invoke the embedded preference configuration information for comparison.
Decision output: the alarm necessity, i.e., whether to alarm, is generated.
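The three steps of claim 1 form a perception-cognition-action closed loop. A minimal Python sketch of that control flow follows; all component names (detect_distraction, assess_scene, render_alert) and the dummy outputs are illustrative stand-ins, not the patent's implementation.

```python
# Minimal sketch of the claimed perception -> cognition -> action loop.
# All component names and return values are illustrative, not the patent's API.

def detect_distraction(eeg_window, imu_window, threshold=0.5):
    """Stand-in for the personalized attention perception model (step 1).
    Returns True when the distraction probability exceeds the threshold."""
    distraction_prob = 0.8  # placeholder instead of the cross-attention model
    return distraction_prob > threshold

def assess_scene(egocentric_image, preference_config):
    """Stand-in for the MLLM reasoning layer (step 2): decides whether an
    alarm is necessary given the user's personalized risk preferences."""
    # A real system would prompt a multimodal LLM here.
    return {"alarm": True,
            "reason": "fast-moving object the user flagged as high risk"}

def render_alert(decision, interaction_habits):
    """Stand-in for the adaptive personalized intervention (step 3)."""
    if decision["alarm"]:
        return f"[{interaction_habits['style']}] {decision['reason']}"
    return None

def loop_once(eeg, imu, image, prefs, habits):
    """One loop iteration: risk assessment is only triggered on distraction."""
    if not detect_distraction(eeg, imu):
        return None  # user is focused; no intervention
    return render_alert(assess_scene(image, prefs), habits)

alert = loop_once(eeg=[], imu=[], image=None,
                  prefs={"crossing": "high"}, habits={"style": "peripheral glow"})
print(alert)
```

Note the gating design: the expensive MLLM call only happens after the cheap physiological model signals distraction, matching step (3) of the claim.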
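Step 1.2 of claim 2 (six-band power features over 10 s windows with a 1 s step, normalized per window) can be sketched as follows. This is a simplified illustration: the sampling rate, the FFT-based band-power estimate, and the exact band edges are assumptions, since the claim does not specify them.

```python
import numpy as np

FS = 256          # assumed EEG sampling rate in Hz (not specified in the claim)
BANDS = {         # the six bands named in the claim; Hz ranges are conventional
    "delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13),
    "low_beta": (13, 20), "high_beta": (20, 30), "gamma": (30, 45),
}

def band_powers(window: np.ndarray, fs: int = FS) -> np.ndarray:
    """Energy in each of the six bands for one window of raw EEG samples."""
    freqs = np.fft.rfftfreq(len(window), d=1.0 / fs)
    power = np.abs(np.fft.rfft(window)) ** 2
    return np.array([power[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in BANDS.values()])

def sliding_features(eeg: np.ndarray, fs: int = FS,
                     win_s: int = 10, step_s: int = 1) -> np.ndarray:
    """10 s windows, 1 s step, per-window z-normalized band powers,
    i.e. one 6-dimensional feature vector per second (1 Hz)."""
    feats = []
    for start in range(0, len(eeg) - win_s * fs + 1, step_s * fs):
        f = band_powers(eeg[start:start + win_s * fs], fs)
        feats.append((f - f.mean()) / (f.std() + 1e-9))  # per-window normalization
    return np.array(feats)

rng = np.random.default_rng(1)
eeg = rng.standard_normal(60 * FS)   # 60 s of simulated raw EEG
X = sliding_features(eeg)
print(X.shape)                       # one feature vector per second after the first window
```

For 60 s of signal this yields 51 vectors: one per 1 s step once the first full 10 s window is available.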
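Steps 1.3 and 1.4 of claim 2 are standard cross attention with the EEG stream as query and the IMU stream as key/value. The sketch below substitutes random linear projections for the BiLSTM encoders and a random linear layer for the trained classifier, so only the attention arithmetic (A = softmax(Q K^T / sqrt(d)), Artifact = A V, H_clean = H_eeg - Artifact) is faithful to the claim; all weights and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# T seconds of 1 Hz features; 6 EEG band powers, 9 IMU components.
T, d_eeg, d_imu, d = 20, 6, 9, 16

X_eeg = rng.standard_normal((T, d_eeg))   # downsampled six-band EEG features
X_imu = rng.standard_normal((T, d_imu))   # downsampled 9-component IMU features

# The claim encodes each stream with a BiLSTM; as a stand-in, project both
# into a shared d-dimensional space with random linear maps.
W_eeg = rng.standard_normal((d_eeg, d)) / np.sqrt(d_eeg)
W_imu = rng.standard_normal((d_imu, d)) / np.sqrt(d_imu)
H_eeg = X_eeg @ W_eeg                     # "encoded" EEG -> query Q
H_imu = X_imu @ W_imu                     # "encoded" IMU -> key K and value V

# Cross attention: Q from EEG, K and V from IMU (step 1.3).
A = softmax(H_eeg @ H_imu.T / np.sqrt(d))  # (T, T) attention weights
artifact = A @ H_imu                        # estimated motion artifact, A V
H_clean = H_eeg - artifact                  # differential denoising

# Step 1.4: stand-in fully connected classifier with a sigmoid output.
w_cls = rng.standard_normal(d) / np.sqrt(d)
p_distracted = 1.0 / (1.0 + np.exp(-(H_clean.mean(axis=0) @ w_cls)))
state = "distracted" if p_distracted > 0.5 else "focused"
print(H_clean.shape, state)
```

Each attention row sums to 1, so the artifact estimate for every EEG time step is a convex combination of the encoded IMU features, which is what lets the subtraction remove motion-correlated components.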
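Step 2.2 of claim 3 amounts to assembling a prompt with four parts: role setting, injected preference configuration, a visual-perception instruction, and a forced preference-matching step ending in a binary decision. A sketch of that prompt builder follows; the preference entries and the exact wording are illustrative, not the patent's text.

```python
# Sketch of the claim's chain-of-thought prompt with preference constraints.
# The preference entries and wording are illustrative examples.

preference_config = {
    "electric scooters": "user rates HIGH risk (generic model said Moderate)",
    "stationary bicycles": "user rates LOW risk (generic model said Moderate)",
}

def build_prompt(preferences: dict) -> str:
    pref_lines = "\n".join(f"- {k}: {v}" for k, v in preferences.items())
    return (
        # Role setting (step 2.2): not a general observer.
        "You are not a general observer; you are a personalized security "
        "assistant dedicated to this user.\n\n"
        # Context injection: preference configuration as a core knowledge base.
        "USER RISK PREFERENCES (must be consulted before any decision):\n"
        f"{pref_lines}\n\n"
        # Visual perception, preference matching, decision output.
        "Given the attached egocentric camera image:\n"
        "1. Identify objects, their distances, and dynamic relations.\n"
        "2. Compare each identified risk against the preferences above.\n"
        "3. Output the alarm necessity as JSON: {\"alarm\": true|false}.\n"
    )

prompt = build_prompt(preference_config)
print(prompt)
```

Embedding the recorded deltas between the generic model's evaluation and the user's corrections (step 2.1) as explicit prompt context is what the claim calls human-machine alignment: the MLLM is forced to reconcile its generic risk estimate with the user's stated one before deciding.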

Description

AR system cognitive security interaction method based on multi-modal large language model reasoning and bidirectional personalization

Technical Field

The invention belongs to the technical field of mobile augmented reality cognitive interaction, and in particular relates to an AR (Augmented Reality) system cognitive security interaction method based on multi-modal large language model reasoning and bidirectional personalization.

Background

As mobile augmented reality (Mobile AR) devices (such as Microsoft HoloLens and Apple Vision Pro) gradually blend into daily life, it has become normal for users to wear AR devices in dynamic scenarios such as walking and climbing. However, the virtual information presented by an AR device often preempts the user's visual attention resources, producing "inattentional blindness": although the user's line of sight is directed at the real environment, the risk information in that environment is never processed at the cognitive level, and safety accidents are easily caused.

To address this problem, existing AR system security interaction has evolved from "passive defense" to "general intelligent assistance". The main technical schemes include:

(1) Passive defense based on explicit behavioral triggers. Early AR systems relied primarily on explicit behavioral features of the user, such as line-of-sight (gaze) dwell time or head-pose changes. For example, an AR system may initiate an environmental scan when it detects that the user has gazed at virtual content beyond a certain threshold, such as 1 second. Such schemes are simple in logic but typically lag behind events and cannot predict the user's cognitive state.

(2) Auxiliary systems based on general physiological computing and computer vision. More advanced schemes, such as AttentionAR, introduce multi-modal sensing techniques. The typical flow is:
Attention state monitoring: physiological signals are acquired through wearable devices such as EEG (electroencephalogram) headbands and IMU (inertial measurement unit) sensors, and a general machine learning model such as an SVM (support vector machine) or a generic BiLSTM judges whether the user is in a state of "focus on the external environment" or "cognitive distraction".
Environmental risk identification: once cognitive distraction is detected, the AR system invokes a generic object detection model, such as YOLO, or a visual language model (VLM/MLLM), to analyze the environmental image.
Alarm feedback: a prompt is sent to the user based on preset objective rules, such as "an obstacle is identified, therefore alarm".
This type of cognitive security interaction scheme, which attempts to build a safety net by combining physiological perception (user awareness) and scene understanding (scene awareness), is the dominant direction of the current technology.

Although the prior art introduces multi-modal perception, in the face of complex mobile scenarios and highly personalized user needs there remain serious problems of "model commonality versus individual specificity mismatch" and "objective algorithm versus subjective cognition misalignment", expressed as follows:

1. Physiological computing level: inability to cope with the dual challenges of individual heterogeneity and motion artifacts.
Generalization failure due to individual differences: human electroencephalogram (EEG) signals exhibit extremely high inter-subject variability. Different users may show distinct patterns of neural activation while performing the same cognitive task. The prior art typically employs a "one-size-fits-all" generic model, i.e., a model trained on group data, to serve all users. Experiments show that the accuracy of such a universal model drops sharply for users whose physiological characteristics deviate from the "average person", so their moments of distraction cannot be accurately captured.
Signal pollution by movement noise: in dynamic scenarios such as walking or climbing stairs, body movement generates strong electromyographic (EMG) noise and motion artifacts, and these noises tend to overlap in the frequency spectrum with the EEG bands that reflect attention (such as the Theta and Alpha waves). Although simple filtering (such as band-pass filtering) is used in the prior art, the lack of a deep denoising mechanism that exploits IMU motion data makes it difficult to separate the "cognitive signal" from the "motion noise", resulting in a high false alarm rate.

2. Cognitive interaction level: a semantic gap between objective risk assessment and the subjective psychological threshold.
Risk perception misalignment: risk is a subjectively constructed concept, not a mere physical fact. The prior art, including generic multi-modal large models, typically defines risk classes (Low/Moderate/High) based on objective criteria such as object class and distance. Ho