CN-121982749-A - Video learner attention intelligent prediction method based on decoupling space-time state space model
Abstract
The invention discloses an intelligent method for predicting video learners' attention based on a decoupled spatiotemporal state space model. The method comprises: obtaining a learning-content video and uniformly sampling a number of frames; extracting initial features for each frame; obtaining spatially enhanced features through a spatial module; constructing a position-level temporal sequence by aligning features across frames at each spatial position; obtaining temporally enhanced features through a temporal module; concatenating the spatial and temporal enhanced features and fusing them by linear projection to obtain spatiotemporal enhanced features; decoding these into a frame-by-frame attention saliency map; intelligently generating the learner's video attention regions; and training the model parameters with a composite objective. The invention enables advanced perception and intelligent prediction of video learners' attention regions, and the important regions of a teaching video can be adjusted according to the prediction results, thereby improving learning outcomes. In short, compared with the prior art, the method decouples spatial feature modeling from temporal feature modeling, and offers good prediction performance, a high degree of intelligence, and strong applicability to educational scenarios.
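For concreteness, the decoupled pipeline summarized in the abstract (steps 2 through 6) can be sketched end to end in NumPy. This is a toy stand-in, not the patented implementation: the gated linear recurrence below is an illustrative substitute for a real selective state space scan, and all shapes, seeds, and projection weights are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def ssm_scan(x):
    """Toy stand-in for a 1-D selective state space scan: a gated linear
    recurrence h_t = a_t * h_{t-1} + (1 - a_t) * x_t, where the decay a_t
    is input-dependent (the 'selective' part).  x: (L, C) -> (L, C)."""
    a = 1.0 / (1.0 + np.exp(-x))            # per-step, per-channel decay in (0, 1)
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a[t] * h + (1.0 - a[t]) * x[t]
        out[t] = h
    return out

def spatial_module(f):
    """Four-direction cross scan over one (H, W, C) frame feature map."""
    H, W, C = f.shape
    row = f.reshape(H * W, C)                       # row-major order
    col = f.transpose(1, 0, 2).reshape(H * W, C)    # column-major order
    outs = [ssm_scan(row), ssm_scan(row[::-1])[::-1],
            ssm_scan(col), ssm_scan(col[::-1])[::-1]]
    # map the two column-major outputs back to row-major before merging
    for k in (2, 3):
        outs[k] = outs[k].reshape(W, H, C).transpose(1, 0, 2).reshape(H * W, C)
    return (sum(outs) / 4.0).reshape(H, W, C)

T, H, W, C = 4, 6, 6, 8                             # assumed toy sizes
proj = rng.standard_normal((2 * C, C)) * 0.1        # shared fusion projection

frames = rng.standard_normal((T, H, W, C))          # initial per-frame features (step 2)
spatial = np.stack([spatial_module(f) for f in frames])   # step 3: intra-frame context
temporal = np.zeros_like(spatial)
for i in range(H):                                  # steps 4-5: per-position sequences
    for j in range(W):
        seq = spatial[:, i, j]                      # (T, C) series at one grid position
        fwd = ssm_scan(seq)
        bwd = ssm_scan(seq[::-1])[::-1]             # reverse scan, re-aligned in time
        temporal[:, i, j] = np.concatenate([fwd, bwd], axis=-1) @ proj
fused = np.concatenate([spatial, temporal], axis=-1) @ proj   # step 6: concat + fuse
print(fused.shape)
```

A production model would replace `ssm_scan` with learned Mamba-style parameters and feed `fused` to the decoder of step 7; the sketch only shows how the spatial and temporal scans remain separate until the final fusion.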
Inventors
- Hong Daocheng
- Zheng Jiaqi
- Song Tairui
- Peng Shihao
Assignees
- East China Normal University (华东师范大学)
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2026-02-25
Claims (8)
- 1. An intelligent prediction method for video learner attention based on a decoupled spatiotemporal state space model, characterized by comprising the following modules: a frame-level feature extraction module, for performing patch embedding and encoding on each sampled frame and extracting the initial features of each frame; a spatial state space module, for performing two-dimensional selective scanning on the initial features to aggregate intra-frame global context and obtain corresponding spatially enhanced features; a temporal state space module, for performing bidirectional one-dimensional selective state space modeling on the spatially enhanced features to obtain corresponding temporally enhanced features; and a decoding prediction module, for decoding the spatiotemporal enhanced features into a frame-by-frame attention saliency map, wherein the spatiotemporal enhanced features are obtained by concatenating the spatially and temporally enhanced features and fusing them by linear projection. The intelligent prediction method for video learner attention specifically comprises the following steps: Step 1, acquiring a learning-content video sequence and representing it as a multi-frame image sequence arranged in temporal order; Step 2, for each frame of the image sequence obtained in Step 1, outputting the initial features of that frame through a shared frame-level feature extraction module; Step 3, inputting the per-frame initial features obtained in Step 2 into the spatial state space module and performing global feature aggregation on each frame based on two-dimensional selective scanning, to obtain spatially enhanced features containing intra-frame global context; Step 4, aligning the spatially enhanced features obtained in Step 3 across frames by spatial position and, for each spatial position, taking the features at that position in every frame to form a sequence, thereby obtaining a temporal sequence for each spatial position; Step 5, inputting the temporal sequence of each spatial position obtained in Step 4 into the temporal state space module to obtain temporally enhanced features; Step 6, concatenating the spatially enhanced features from Step 3 with the temporally enhanced features from Step 5 and fusing them by linear projection, to obtain spatiotemporal enhanced features; Step 7, inputting the spatiotemporal enhanced features from Step 6 into the decoding prediction module to generate a frame-by-frame attention saliency map; and Step 8, generating a Gaussian heatmap for each frame based on the frame-by-frame attention saliency map from Step 7, and overlaying the heatmap on the corresponding original video frame to obtain a visualization of the learner's attention regions.
- 2. The intelligent prediction method for video learner attention according to claim 1, wherein Step 1 specifically comprises: obtaining a learning-content video sequence; uniformly sampling the video to obtain frame images, wherein the time interval between adjacent sampled frames is a preset frame interval; and representing the video as a multi-frame image sequence arranged in temporal order.
- 3. The intelligent prediction method according to claim 1, wherein the frame-level feature extraction module in Step 2 performs patch embedding and linear projection on each frame image using a two-dimensional convolution layer, wherein the kernel size and stride of the convolution layer are both equal to the patch size, so that non-overlapping image patches are mapped to corresponding token features.
- 4. The intelligent prediction method for video learner attention according to claim 1, wherein the internal modeling process of the spatial state space module in Step 3 comprises: 3-1, cross-scanning the input feature map along four different scan directions to form a plurality of sequences, the four directions being top-to-bottom, bottom-to-top, left-to-right, and right-to-left; 3-2, performing selective state space modeling on each sequence separately, with feature transformation during modeling using a residual connection structure combined with normalization and a feed-forward network; 3-3, restoring the sequence outputs to a two-dimensional feature map by cross-merging, and obtaining the spatially enhanced features through a projection layer.
- 5. The intelligent prediction method according to claim 1, wherein aligning spatial positions across frames in Step 4 specifically comprises: taking, from the spatially enhanced features of each frame, the channel vector at the same two-dimensional grid position, and arranging these vectors in temporal order to form the temporal sequence.
- 6. The intelligent prediction method for video learner attention according to claim 1, wherein the internal modeling process of the temporal state space module in Step 5 comprises: 5-1, performing a forward selective state space scan on the temporal sequence to obtain a forward output sequence; 5-2, performing a reverse selective state space scan on the reversed temporal sequence to obtain a reverse output sequence; 5-3, reversing the reverse output sequence so that it is aligned with the forward output sequence along the time dimension, concatenating the two sequences, and fusing them by linear projection to obtain the temporally enhanced features.
- 7. The intelligent prediction method according to claim 1, wherein the decoding prediction module in Step 7 comprises multi-scale feature fusion and a progressive upsampling process, to generate a frame-by-frame attention saliency map matching the resolution of the input frames.
- 8. The intelligent prediction method according to claim 1, wherein determining the learner's video attention regions in Step 8 comprises: normalizing the attention saliency map of each frame to obtain a saliency matrix; applying Gaussian filtering to the saliency matrix to obtain a Gaussian heatmap; applying pseudo-color mapping to the Gaussian heatmap to obtain a color heatmap; and linearly superimposing the color heatmap on the corresponding original frame with a preset transparency coefficient to obtain a visualized frame, i.e., the learner's video attention regions.
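The visualization pipeline of the last claim (normalize, Gaussian filter, pseudo-color, alpha blend) can be sketched in plain NumPy. This is a minimal sketch: the toy colormap and the `sigma` and `alpha` values are illustrative assumptions, not parameters stated in the patent.

```python
import numpy as np

def gaussian_blur(sal, sigma=2.0):
    """Separable Gaussian filtering of a 2-D saliency matrix."""
    r = int(3 * sigma)
    x = np.arange(-r, r + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()
    pad = np.pad(sal, r, mode="edge")
    tmp = np.apply_along_axis(lambda row: np.convolve(row, k, mode="valid"), 1, pad)
    return np.apply_along_axis(lambda col: np.convolve(col, k, mode="valid"), 0, tmp)

def pseudo_color(heat):
    """Toy blue-to-red colormap standing in for a real jet/turbo LUT."""
    return np.stack([heat, 4.0 * heat * (1.0 - heat), 1.0 - heat], axis=-1)

def overlay(frame, saliency, alpha=0.4, sigma=2.0):
    """Normalize -> Gaussian heatmap -> pseudo-color -> alpha blend."""
    s = saliency - saliency.min()
    s = s / (s.max() + 1e-8)                 # normalized saliency matrix
    heat = gaussian_blur(s, sigma)
    heat = heat / (heat.max() + 1e-8)        # Gaussian heatmap in [0, 1]
    color = pseudo_color(heat)               # (H, W, 3) color heatmap
    return (1.0 - alpha) * frame + alpha * color   # preset transparency blend

rng = np.random.default_rng(1)
frame = rng.uniform(size=(32, 32, 3))               # stand-in original video frame
sal = np.zeros((32, 32)); sal[12:20, 12:20] = 1.0   # stand-in saliency map
vis = overlay(frame, sal)
print(vis.shape)
```

The blend keeps every channel in [0, 1], so `vis` can be written out directly as an image frame of the visualized attention region.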
Description
Video learner attention intelligent prediction method based on a decoupled spatiotemporal state space model

Technical Field
The invention relates to the fields of artificial intelligence and big data, and in particular to an intelligent method for predicting video learners' attention based on a decoupled spatiotemporal state space model.

Background
With the popularization of online education and digital learning, learning content is increasingly recorded and delivered in video form. To objectively evaluate the learning process and support teaching feedback, it is important to automatically extract learners' attention-related information from learning-scene videos, and in particular to produce the learner's attention distribution frame by frame and over spatial regions. In the prior art, one class of solutions relies on dedicated equipment such as eye trackers to acquire gaze points or gaze trajectories; these offer high precision but also high cost and a high deployment threshold. Another class analyzes video captured by ordinary cameras; it is easy to deploy, but is easily affected by factors such as illumination changes, occlusion, head pose changes, resolution, and shooting angle, and its output is typically limited to coarse-grained states such as attentive/distracted, making it difficult to provide interpretable frame-by-frame spatial attention regions. To obtain fine-grained, interpretable results, related research often adopts the video saliency/attention prediction approach, outputting frame-by-frame saliency maps through spatiotemporal feature modeling of the video sequence.
However, learning-content videos in learning scenes exhibit cross-frame motion and pose changes, so accurate modeling generally requires cross-frame feature alignment and temporal information propagation; this alignment step is challenging and introduces additional complexity. Existing methods adopt a "joint spatiotemporal modeling" approach, i.e., intra-frame spatial relationships and inter-frame temporal relationships are processed simultaneously within a single operator or module. Such joint modeling requires that spatial relationship modeling and cross-time correlation/alignment be accomplished in the same computation, and it tends to incur significant computational and storage overhead at higher video resolutions or longer frame counts. Related studies have also indicated that separating spatial from temporal information is more efficient than joint spatiotemporal attention at higher spatial resolutions or on longer videos. Educational practice therefore urgently needs a novel method for intelligently predicting learners' video attention: a spatiotemporal representation learning scheme that decouples spatial modeling from temporal modeling, so that the model can separately learn intra-frame spatial context and cross-frame temporal dependencies, reducing the implementation complexity and resource cost of aligning both dimensions within a single operator, and improving stability and deployability on long learning-scene video sequences.
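The efficiency argument above can be made concrete with a back-of-envelope count of pairwise interactions. The clip and grid sizes below are assumed for illustration and are not taken from the patent.

```python
# Rough interaction counts for a clip of T frames with an H x W token grid
# (all sizes assumed for illustration).
T, H, W = 16, 14, 14
N = H * W                                  # tokens per frame

joint_pairs = (T * N) ** 2                 # joint space-time attention: all-pairs
decoupled_pairs = T * N**2 + N * T**2      # per-frame spatial + per-position temporal
ssm_steps = T * 4 * N + N * 2 * T          # linear-time scans: 4 spatial + 2 temporal dirs

print(joint_pairs, decoupled_pairs, ssm_steps)
```

Even at this modest size the joint all-pairs term dominates the decoupled count by more than an order of magnitude, and linear-time state space scans shrink the cost further still, which is the overhead the decoupled design is meant to avoid.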
Disclosure of Invention
The invention aims to provide an intelligent method for predicting video learners' attention based on a decoupled spatiotemporal state space model, addressing the problems of existing video attention prediction techniques for learning scenes: joint spatiotemporal modeling operators must process intra-frame spatial relationships and inter-frame temporal relationships simultaneously; cross-dimensional alignment is complex; computational and storage costs are high for long sequences or high resolutions; and the stability and deployability of the output are insufficient. By decoupling intra-frame spatial context modeling from inter-frame temporal dependency modeling, the model can separately learn global semantics in the spatial dimension and long-range propagation relationships in the temporal dimension, reducing the implementation complexity and resource cost caused by aligning both dimensions in a single operator. Meanwhile, the method outputs a frame-by-frame, interpretable attention saliency map, from which the learner's attention regions are determined, providing a fine-grained basis for online teaching feedback, learning process evaluation, and personalized guidance. In short, compared with the prior art, the invention provides a novel intelligent solution for predicting learners' video attention regions. The purpose of the invention is achieved as follows: a video learner attention intelligent prediction method based on a decoupled spatiotemporal state space model, the decoupl