CN-121997250-A - Visual intelligent interactive education model system
Abstract
The invention discloses a visual intelligent interactive education model system that combines edge computing with cloud collaboration to achieve real-time acquisition, fusion, and intelligent analysis of multi-modal classroom data. The system builds a collaborative edge-cloud environment for multi-modal data acquisition and processing, synchronously preprocesses classroom text, speech, video, images, three-dimensional gestures, interaction events, and the like, and feeds the preprocessed data into a cross-modal transformer network. Under a multi-head cross-attention mechanism, the modalities are fused bidirectionally to generate a holographic representation in a unified semantic space, and a self-supervised contrastive learning strategy completes semantic alignment and discrimination enhancement of the multi-modal features. Based on the holographic representation, the system performs emotion recognition, behavior analysis, and interaction-intention prediction, supports completion of missing modalities, and uses a lightweight model at the edge for low-latency processing while the cloud performs high-precision inference and optimization.
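The abstract's central mechanism is bidirectional fusion of per-modality features under multi-head cross-attention, yielding one representation in a shared semantic space. Below is a minimal, hypothetical PyTorch sketch of that idea for two modalities; the patent does not disclose layer sizes, head counts, or the exact fusion decoder, so all module and variable names here are illustrative only.

```python
# Minimal sketch (not the patented implementation): bidirectional
# multi-head cross-attention fusion of two modality feature sequences
# into one "holographic" vector in a shared semantic space.
import torch
import torch.nn as nn

class BidirectionalCrossModalFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # modality A attends to modality B, and vice versa
        self.attn_a_to_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_b_to_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_b = nn.LayerNorm(dim)
        # stand-in "fusion decoder": an MLP over pooled, concatenated features
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, feats_a, feats_b):
        # feats_a: (batch, len_a, dim), e.g. video frame features
        # feats_b: (batch, len_b, dim), e.g. speech/text token features
        a_enriched, _ = self.attn_a_to_b(feats_a, feats_b, feats_b)
        b_enriched, _ = self.attn_b_to_a(feats_b, feats_a, feats_a)
        a_enriched = self.norm_a(feats_a + a_enriched)  # residual + norm
        b_enriched = self.norm_b(feats_b + b_enriched)
        pooled = torch.cat([a_enriched.mean(dim=1), b_enriched.mean(dim=1)], dim=-1)
        return self.fuse(pooled)  # (batch, dim) unified representation

if __name__ == "__main__":
    fusion = BidirectionalCrossModalFusion()
    video = torch.randn(4, 32, 256)    # 32 frame features per clip
    audio = torch.randn(4, 50, 256)    # 50 audio-frame features per clip
    print(fusion(video, audio).shape)  # torch.Size([4, 256])
```

In this sketch the unified vector would feed downstream heads for emotion recognition, behavior analysis, and interaction-intention prediction, matching the roles the abstract assigns to the holographic representation.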
Inventors
- YANG LI
- REN CHUNHUA
- SUN XIAOLI
- MIAO HUI
- CHE XUE
Assignees
- 长春电子科技学院
Dates
- Publication Date
- 20260508
- Application Date
- 20251225
Claims (7)
- 1. A visual intelligent interactive education model system, characterized by comprising the following modules: a multi-modal data acquisition and preprocessing module, used for synchronously acquiring multi-modal data in a classroom teaching scene and denoising the data; an edge lightweight processing module, deployed on local classroom computing equipment and used for real-time feature extraction and compression coding of the preprocessed multi-modal data, thereby reducing network transmission delay and bandwidth occupation; a cross-modal feature fusion and representation generation module, comprising a cross-modal transformer network and a multi-head cross-attention mechanism and used for bidirectional information interaction and fusion of different modal features to generate holographic representation vectors in a unified semantic space; a self-supervised contrastive learning optimization module, used for maximizing the similarity of different modal representations of the same event and minimizing the similarity of representations of different events by constructing multi-modal positive and negative sample pairs, so as to optimize the feature encoders and fusion network parameters and achieve semantic alignment and discrimination enhancement; an intelligent classroom analysis module, used for performing classroom emotion recognition, behavior analysis, and interaction-intention prediction based on the holographic representation vector and generating corresponding analysis results; a visualization and teaching management interface module, used for displaying the classroom analysis results, emotion change trends, and interaction predictions in a structured visual form, interacting with the cloud teaching management system, and supporting teaching playback, behavior statistics, and teaching optimization; and a cloud-edge collaboration and data synchronization module, used for bidirectional synchronization of multi-modal data, model parameters, and analysis results between the edge and the cloud, comprising clock synchronization, timestamping, and a buffer scheduling mechanism to ensure the real-time performance and consistency of cloud-edge collaboration.
- 2. The visual intelligent interactive education model system according to claim 1, wherein the cross-modal transformer network comprises a multi-modal encoder, a multi-head cross-attention module, and a fusion decoder, wherein the multi-modal encoder independently extracts the feature representation of each modality, the multi-head cross-attention module computes correlation weights among modalities and realizes bidirectional fusion, and the fusion decoder generates a globally unified semantic representation vector.
- 3. The visual intelligent interactive education model system according to claim 1, wherein the self-supervised contrastive learning optimization module optimizes the parameters of each modal feature encoder and of the cross-modal fusion module based on a contrastive loss function, so as to enhance the semantic consistency and discrimination capability of the multi-modal representation (see the contrastive-loss sketch following the claims).
- 4. The visual intelligent interactive education model system according to claim 1, wherein the intelligent classroom analysis module is configured to comprehensively compute the classroom emotion state based on the holographic representation vector, the computation including extracting emotion-related information from the different modal features and fusing it to generate an overall classroom emotion state label.
- 5. The visual intelligent interactive education model system as set forth in claim 1, wherein the cloud-edge collaboration and data synchronization module maintains a clock synchronization error of less than 1 millisecond and comprises a multi-modal data buffer queue and a time-alignment scheduler for temporal alignment and missing-data completion of the different modal data streams (see the stream-alignment sketch following the claims).
- 6. The visual intelligent interactive education model system as set forth in claim 1, wherein, when data of a modality is missing at the inference stage, the fusion decoder predicts the feature vector of that modality or directly generates its content, so as to ensure the integrity of the multi-modal representation (see the modality-completion sketch following the claims).
- 7. The visual intelligent interactive education model system as set forth in claim 1, wherein the edge lightweight processing module and the cloud cross-modal transformer network adopt a hierarchical parameter update mechanism, and when network conditions are abnormal, the edge continues to output analysis results based on cached data and the local inference model until the network is restored.
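Claims 1 and 3 describe self-supervised contrastive optimization that pulls together different modality representations of the same classroom event and pushes apart representations of different events. The claims do not specify the loss form, temperature, or batch construction; the following is a hedged, hypothetical InfoNCE-style sketch of such an objective.

```python
# Hypothetical InfoNCE-style contrastive loss over paired modality embeddings.
# Row i of emb_a and row i of emb_b come from the same classroom event
# (positive pair); all other rows in the batch act as negatives.
import torch
import torch.nn.functional as F

def multimodal_contrastive_loss(emb_a, emb_b, temperature=0.07):
    emb_a = F.normalize(emb_a, dim=-1)        # (batch, dim)
    emb_b = F.normalize(emb_b, dim=-1)        # (batch, dim)
    logits = emb_a @ emb_b.t() / temperature  # pairwise cosine similarities
    targets = torch.arange(emb_a.size(0), device=emb_a.device)
    # symmetric objective: align A -> B and B -> A
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    a = torch.randn(8, 256, requires_grad=True)  # e.g. video-branch embeddings
    b = torch.randn(8, 256, requires_grad=True)  # e.g. audio/text-branch embeddings
    loss = multimodal_contrastive_loss(a, b)
    loss.backward()
    print(float(loss))
```

Minimizing this loss maximizes the similarity of positive (same-event) pairs and minimizes that of negative (different-event) pairs, which is the alignment and discrimination-enhancement behavior claim 3 attributes to the optimization module.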
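Claim 6 states that, when a modality is missing at inference time, the fusion decoder predicts that modality's feature vector so the multi-modal representation stays complete. One hypothetical way to realize this is a small regression head that maps the embeddings of the modalities that are present onto an estimate of the absent one; the sketch below assumes fixed-size per-modality embeddings and is not the patented decoder.

```python
# Hypothetical missing-modality completion: estimate the embedding of an
# absent modality from the modalities that were actually captured.
import torch
import torch.nn as nn

class ModalityImputer(nn.Module):
    def __init__(self, dim=256, n_modalities=3):
        super().__init__()
        # one predictor per modality, fed the mean of the available embeddings
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(n_modalities)
        ])

    def forward(self, embeddings, present_mask):
        # embeddings:   (batch, n_modalities, dim); missing slots may hold zeros
        # present_mask: (batch, n_modalities) bool, True where data was captured
        mask = present_mask.unsqueeze(-1).float()
        context = (embeddings * mask).sum(1) / mask.sum(1).clamp(min=1.0)
        completed = embeddings.clone()
        for m, head in enumerate(self.heads):
            predicted = head(context)
            missing = ~present_mask[:, m]
            completed[missing, m] = predicted[missing]  # fill only absent slots
        return completed

if __name__ == "__main__":
    imputer = ModalityImputer()
    emb = torch.randn(4, 3, 256)
    mask = torch.tensor([[True, True, False]] * 4)  # third modality missing
    print(imputer(emb, mask).shape)  # torch.Size([4, 3, 256])
```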
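Claim 5 requires a multi-modal buffer queue and a time-alignment scheduler that groups samples from different streams by timestamp (under sub-millisecond clock synchronization) and flags gaps for the completion step. A simplified, hypothetical alignment sketch follows; the window size, queue policy, and field names are illustrative assumptions, not the patented scheduler.

```python
# Hypothetical timestamp-based alignment of buffered multi-modal samples.
# Samples whose timestamps fall in the same window form one aligned record;
# modalities with no sample in that window are reported as missing so the
# completion step (claim 6) can fill them in.
from collections import defaultdict

def align_streams(buffers, window_ms=40):
    """buffers: {modality_name: [(timestamp_ms, payload), ...]} per stream."""
    grouped = defaultdict(dict)
    for modality, samples in buffers.items():
        for ts, payload in samples:
            grouped[ts // window_ms][modality] = payload
    aligned = []
    for window in sorted(grouped):
        record = grouped[window]
        aligned.append({
            "t_start_ms": window * window_ms,
            "data": record,
            "missing": sorted(set(buffers) - set(record)),
        })
    return aligned

if __name__ == "__main__":
    buffers = {
        "video":   [(0, "frame0"), (40, "frame1"), (80, "frame2")],
        "audio":   [(1, "chunk0"), (41, "chunk1")],
        "gesture": [(2, "pose0")],
    }
    for rec in align_streams(buffers):
        print(rec["t_start_ms"], sorted(rec["data"]), rec["missing"])
```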
Description
Visual intelligent interactive education model system

Technical Field

The invention belongs to the technical field of interactive education models, and particularly relates to a visual intelligent interactive education model system.

Background

At present, multi-modal educational data analysis techniques are widely used in smart classrooms, online education, hybrid teaching, and other scenarios. The traditional method models multi-modal classroom data with a graph convolutional neural network, which uses a graph structure to capture local correlations among different modal data and realizes feature fusion to a certain extent. However, the method has the following defects: its capability to model temporal sequences and long-range dependencies is limited, so it is difficult to fully capture global semantic associations across modalities and across time periods in a class; when data of certain modalities is absent or of reduced quality, the overall performance of the graph convolutional model drops markedly, and there is no effective mechanism for modality completion and inference; real-time performance is limited, so real-time processing and feedback of large-scale classroom data is delayed, which affects the teaching interaction experience; and when facing complex scenes in which videos, speech, texts, three-dimensional gestures, and other data types are accessed simultaneously, the traditional graph convolutional scheme lacks flexibility in adjusting the model structure and optimizing the feature fusion strategy. As shown by the multi-modal student classroom behavior analysis system and method of CN201711469436.5, the existing multi-modal classroom behavior analysis system can analyze student classroom behavior through video and audio feature extraction, but it extracts the features of each modality independently and then performs statistical analysis; the degree of fusion is low, its adaptability to missing modalities and real-time requirements is limited, and it offers no advantage when deep cross-modal interaction is needed or complex dynamic teaching scenes must be handled. Therefore, a novel visual intelligent education model system is needed that achieves efficient cross-modal fusion, adaptively completes missing modalities, and supports real-time interactive analysis, so as to improve classroom interaction quality and analysis precision.

Disclosure of Invention

(I) Technical problem to be solved

The invention aims to overcome the defects of the conventional multi-modal classroom data analysis method based on the graph convolutional neural network in global semantic modeling, missing-modality handling, real-time performance, scalability, and other aspects, and provides a visual intelligent interactive education model system that operates collaboratively at the edge and in the cloud and possesses cross-modal bidirectional fusion and self-supervised alignment capabilities, thereby improving the accuracy and real-time performance of classroom emotion recognition, behavior analysis, and interaction prediction.
(II) Technical scheme

The invention provides a visual intelligent interactive education model system, characterized by comprising: a multi-modal data acquisition and preprocessing module, used for synchronously acquiring multi-modal data in a classroom teaching scene and denoising the data; an edge lightweight processing module, deployed on local classroom computing equipment and used for real-time feature extraction and compression coding of the preprocessed multi-modal data, thereby reducing network transmission delay and bandwidth occupation; a cross-modal feature fusion and representation generation module, comprising a cross-modal transformer network and a multi-head cross-attention mechanism and used for bidirectional information interaction and fusion of different modal features to generate holographic representation vectors in a unified semantic space; a self-supervised contrastive learning optimization module, used for maximizing the similarity of different modal representations of the same event and minimizing the similarity of representations of different events by constructing multi-modal positive and negative sample pairs, so as to optimize the feature encoders and fusion network parameters and achieve semantic alignment and discrimination enhancement; an intelligent classroom analysis module, used for performing classroom emotion recognition, behavior analysis, and interaction-intention prediction based on the holographic representation vector and generating corresponding analysis results; a visualization and teaching management interface module, used for displaying the classroom analysis results, emotion change trends, and interaction predictions in a structured visual form, interacting with the cloud teaching management system, and supporting teaching playback, behavior statistics, and teaching optimization; and a cloud-edge collaboration and data synchronization module, used for bidirectional synchronization of multi-modal data, model parameters, and analysis results between the edge and the cloud, comprising clock synchronization, timestamping, and a buffer scheduling mechanism to ensure the real-time performance and consistency of cloud-edge collaboration.
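The edge lightweight processing module described above (and in claim 1) extracts features locally and compression-codes them before upload to reduce bandwidth and transmission delay. The patent does not name the coding scheme; as one hedged illustration, the sketch below shows 8-bit linear quantization of float feature vectors, an assumed but common choice for roughly 4x smaller edge-to-cloud payloads.

```python
# Hypothetical compression coding for edge-to-cloud upload: 8-bit linear
# quantization of float32 feature vectors (about 4x smaller payload).
import numpy as np

def quantize(features):
    lo, hi = float(features.min()), float(features.max())
    scale = (hi - lo) / 255.0 or 1.0          # guard against constant input
    q = np.round((features - lo) / scale).astype(np.uint8)
    return q, lo, scale

def dequantize(q, lo, scale):
    return q.astype(np.float32) * scale + lo

if __name__ == "__main__":
    feats = np.random.randn(32, 256).astype(np.float32)  # per-frame features
    q, lo, scale = quantize(feats)
    restored = dequantize(q, lo, scale)
    print(q.nbytes, feats.nbytes)                 # 8192 vs 32768 bytes
    print(float(np.abs(restored - feats).max()))  # small reconstruction error
```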