CN-121997110-A - Automatic analysis method and device for classroom session mode

CN 121997110 A

Abstract

The invention relates to the technical field of classroom session mode analysis and discloses an automatic analysis method and device for classroom session modes. The method first extracts the audio of a classroom video, transcribes it into text, and divides both into session fragments to obtain sub-audio files and sub-text data; it then combines the session subject of each sub-audio file with the session level and session type of each sub-text data to generate macro features, converts the sentences of each sub-text data into semantic features, and fuses these with the macro features to obtain fused feature vectors; finally, it identifies the fused feature vectors with a preset Gaussian mixture model and outputs the classroom session mode. The invention achieves deep fusion of macroscopic behavioral features and semantic content features, solves the problem of insufficient fusion of multi-source heterogeneous features, accurately mines the deep structure and latent patterns of classroom interaction, efficiently identifies complex and representative session modes, provides strong technical support for improving teaching quality, and overcomes the application limitations of the prior art.

Inventors

  • FENG XIAOYING
  • HUI NING

Assignees

  • Beijing Normal University (北京师范大学)

Dates

Publication Date
2026-05-08
Application Date
2025-12-01

Claims (11)

  1. A method for automated analysis of classroom session modes, the method comprising: extracting an audio file from a classroom video and transcribing the audio file into text data; dividing the audio file and the text data by session fragment to generate a plurality of sub-audio files and a plurality of sub-text data; combining the session subject of each sub-audio file with the session level and session type of each sub-text data to generate macro features; converting each sentence of each sub-text data into semantic features, fusing the semantic features and macro features of the same session fragment to generate a plurality of fused feature vectors, and inputting the fused feature vectors into a preset target Gaussian mixture model; and identifying each fused feature vector through the target Gaussian mixture model to obtain the classroom session mode of the classroom video.
  2. The method of claim 1, wherein extracting the audio file of the classroom video and transcribing the audio file into text data comprises: acquiring a classroom video; performing audio separation on the classroom video with a multimedia video coding tool to obtain the audio file of the classroom video; and transcribing the audio file into text data.
  3. The method of claim 1, wherein generating macro features by combining the session subject of each of the sub-audio files with the session level and session type of each of the sub-text data comprises: distinguishing the session subjects in the sub-audio files; determining the session level of each sub-text data based on the concept level and practice level of its session content; determining the session type of each sub-text data according to the teaching, questioning, feedback, instructing, sharing, and supplementing behaviors in its session process; and generating macro features from all the session subjects, session levels, and session types.
  4. The method of claim 1, wherein converting each sentence of each sub-text data into semantic features, fusing the semantic features and macro features of the same session fragment to generate a plurality of fused feature vectors, and inputting the fused feature vectors into a preset target Gaussian mixture model comprises: inputting each sub-text data into a preset sentence embedding model; converting each sentence of each sub-text data into a high-dimensional vector through the sentence embedding model to generate semantic features; fusing the semantic features and macro features of the same session fragment based on a multi-head attention mechanism to generate a plurality of fused feature vectors; and inputting all the fused feature vectors into a preset target Gaussian mixture model.
  5. The method of claim 4, wherein before inputting all the fused feature vectors into the preset target Gaussian mixture model, the method further comprises: initializing the model parameters of a preset initial Gaussian mixture model to generate an updated Gaussian mixture model; iteratively solving the updated Gaussian mixture model with an expectation-maximization algorithm and determining the optimal model parameters from the solution; and updating the updated Gaussian mixture model with the optimal model parameters to obtain the target Gaussian mixture model.
  6. The method of claim 1, wherein identifying each fused feature vector through the target Gaussian mixture model to obtain the classroom session mode of the classroom video comprises: calculating the posterior probability of each fused feature vector using the optimal model parameters of the target Gaussian mixture model; assigning each fused feature vector to the session mode with the maximum posterior probability; and clustering all the fused feature vectors to obtain the classroom session mode of the classroom video.
  7. The method of claim 6, further comprising: calculating the cosine similarity between the session fragment corresponding to each fused feature vector and its cluster center; and determining, based on the cosine similarity, the session fragments with the highest similarity as representative fragments and outputting the complete text data of the representative fragments.
  8. An automated analysis device for classroom session modes, the device comprising: an extraction module for extracting an audio file from a classroom video and transcribing the audio file into text data; a dividing module for dividing the audio file and the text data by session fragment to generate a plurality of sub-audio files and a plurality of sub-text data; a combining module for combining the session subject of each sub-audio file with the session level and session type of each sub-text data to generate macro features; an input module for converting each sentence of each sub-text data into semantic features, fusing the semantic features and macro features of the same session fragment to generate a plurality of fused feature vectors, and inputting them into a preset target Gaussian mixture model; and an identification module for identifying each fused feature vector through the target Gaussian mixture model to obtain the classroom session mode of the classroom video.
  9. An electronic device, comprising a memory and a processor communicatively connected to each other, the memory storing computer instructions, the processor executing the computer instructions to perform the automated classroom session mode analysis method of any one of claims 1-7.
  10. A computer-readable storage medium storing computer instructions for causing a computer to perform the automated classroom session mode analysis method of any one of claims 1-7.
  11. A computer program product comprising computer instructions for causing a computer to perform the automated classroom session mode analysis method of any one of claims 1-7.
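Claims 5 and 6 describe fitting the Gaussian mixture model with an expectation-maximization iteration and then assigning each fused feature vector to the component with maximum posterior probability. The patent publishes no reference code; the following is a minimal sketch under simplifying assumptions (scalar stand-ins for the fused feature vectors, two components, deterministic min/max initialization) to illustrate the EM loop and the posterior-based assignment.

```python
import math

def em_gmm_1d(data, k=2, iters=50):
    """Fit a 1-D Gaussian mixture with k components via expectation maximization.

    Illustrative stand-in for claim 5: initialize parameters, alternate
    E and M steps, and return the optimized weights, means, and variances.
    """
    n = len(data)
    # Deterministic initialization: means spread over the data range,
    # equal weights, global variance for every component.
    lo, hi = min(data), max(data)
    means = [lo + (hi - lo) * j / (k - 1) for j in range(k)]
    mean_all = sum(data) / n
    gvar = sum((x - mean_all) ** 2 for x in data) / n
    variances = [gvar] * k
    weights = [1.0 / k] * k

    def gauss(x, mu, var):
        return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for x in data:
            p = [w * gauss(x, m, v) for w, m, v in zip(weights, means, variances)]
            s = sum(p)
            resp.append([pi / s for pi in p])
        # M-step: re-estimate weights, means, and variances from responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj,
                1e-6,  # floor to avoid a degenerate (zero-variance) component
            )
    return weights, means, variances

def classify(x, weights, means, variances):
    """Claim 6: assign x to the component with maximum posterior probability."""
    post = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(v)
            for w, m, v in zip(weights, means, variances)]
    return max(range(len(post)), key=post.__getitem__)

# Two well-separated clusters of toy scalar "fused features".
data = [0.0, 0.1, -0.1, 0.05, 5.0, 5.1, 4.9, 5.05]
weights, means, variances = em_gmm_1d(data, k=2)
labels = [classify(x, weights, means, variances) for x in data]
```

In the patented method the inputs are high-dimensional fused feature vectors rather than scalars, so the Gaussians would carry covariance matrices, but the E/M alternation and the maximum-posterior assignment follow the same shape.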

Description

Automatic analysis method and device for classroom session mode

Technical Field

The invention relates to the technical field of classroom session mode analysis, and in particular to an automatic analysis method and device for classroom session modes.

Background

With the continuous evolution of intelligent education technology, classroom interaction, as a core carrier of the teaching process, demands efficient and objective analysis as a key link in improving teaching quality and accurately evaluating teaching effects. Classroom interaction covers multidimensional information such as verbal communication, behavioral feedback, and the exchange of ideas between teachers and students; deep analysis of this interaction data can provide a scientific basis for optimizing teaching strategies, implementing personalized teaching, and quantifying teaching effects, making it an important support for the digital transformation of education and for precise teaching. However, traditional classroom interaction analysis relies mainly on manual observation and coding, which is labor-intensive and time-consuming, is easily influenced by the subjective experience and cognitive biases of analysts, and can hardly guarantee the objectivity and consistency of results, so it cannot meet the demands of large-scale, dynamic modern classroom scenarios.

Automated classroom analysis technology is therefore now widely adopted. Some existing schemes offer only descriptive statistics from a behavioral perspective without deep analysis at the semantic level of the dialogue; some merely classify the depth of textual dialogue and lack systematic mining of interaction logic; and some depend on traditional turn-taking recognition frameworks such as IRF, which can only achieve basic turn-structure division. These methods cannot achieve deep fusion of multi-source heterogeneous features such as macroscopic behavioral features and semantic content features in classroom sessions, struggle to mine the deep structures and latent patterns behind interaction, and consequently cannot accurately identify complex and representative classroom session modes or provide sufficient technical support for improving teaching quality.

Disclosure of Invention

The invention provides an automatic analysis method and device for classroom session modes to address the problems that the prior art cannot achieve deep fusion of multi-source heterogeneous features such as macroscopic behavioral features and semantic content features in a classroom session, struggles to mine the deep structure and latent patterns behind interaction, and therefore cannot accurately identify complex and representative classroom session modes or provide sufficient technical support for improving teaching quality.

In a first aspect, the invention provides a method for automatically analyzing classroom session modes, the method comprising: extracting an audio file from a classroom video and transcribing the audio file into text data; dividing the audio file and the text data by session fragment to generate a plurality of sub-audio files and a plurality of sub-text data; combining the session subject of each sub-audio file with the session level and session type of each sub-text data to generate macro features; converting each sentence of each sub-text data into semantic features, fusing the semantic features and macro features of the same session fragment to generate a plurality of fused feature vectors, and inputting the fused feature vectors into a preset target Gaussian mixture model; and identifying each fused feature vector through the target Gaussian mixture model to obtain the classroom session mode of the classroom video. The method first extracts the audio of the classroom video, transcribes it into text, and divides both into session fragments to obtain sub-audio and sub-text; it then fuses the sub-audio session subjects with the sub-text session levels and types to generate macro features, converts the sub-text sentences into semantic features, and fuses these with the macro features to obtain fused feature vectors; finally, it identifies the fused feature vectors through a preset Gaussian mixture model and outputs the classroom session mode. The invention achieves deep fusion of macroscopic behavioral features and semantic content features, solves the problem of insufficient fusion of multi-source heterogeneous features, accurately mines the deep structure and latent patterns of classroom interaction, efficiently identifies complex and representative session modes, and provides powerful technical support for improving teaching quality.
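As a concrete illustration of the feature construction described above, the sketch below encodes the three macro attributes (session subject, session level, session type) as one-hot vectors and concatenates them with a toy sentence embedding. This is not the patented implementation: the vocabularies are hypothetical stand-ins drawn from claim 3, the 4-dimensional "embedding" stands in for the output of the sentence embedding model, and plain concatenation stands in for the multi-head attention fusion.

```python
# Hypothetical vocabularies, named after the categories in claim 3.
SPEAKERS = ["teacher", "student"]          # session subjects
LEVELS = ["concept", "practice"]           # session levels
TYPES = ["teach", "question", "feedback",  # session types
         "instruct", "share", "supplement"]

def one_hot(value, vocabulary):
    """Encode a categorical macro attribute as a one-hot vector."""
    vec = [0.0] * len(vocabulary)
    vec[vocabulary.index(value)] = 1.0
    return vec

def macro_features(speaker, level, session_type):
    """Combine subject, level, and type into one macro feature vector."""
    return (one_hot(speaker, SPEAKERS)
            + one_hot(level, LEVELS)
            + one_hot(session_type, TYPES))

def fuse(semantic_vec, macro_vec):
    """Stand-in for the attention-based fusion: concatenate the two views."""
    return semantic_vec + macro_vec

# A toy 4-dimensional vector stands in for a real sentence embedding.
semantic = [0.2, -0.1, 0.7, 0.05]
fused = fuse(semantic, macro_features("teacher", "concept", "question"))
```

The resulting 14-dimensional vector is the kind of per-fragment representation that would then be fed to the Gaussian mixture model; in the patented method the fusion step is learned via multi-head attention rather than fixed concatenation.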