CN-121999412-A - Classroom video segmentation and catalog creation method based on multi-modal information
Abstract
The invention relates to the technical field of intelligent education video processing and discloses a classroom video segmentation and catalog creation method based on multi-modal information. The method synchronously collects video picture, audio and on-screen text signals; after time-domain alignment, it extracts per-frame picture change, voice energy profile and text keyword flash features, fuses them into a mixed feature, and detects abrupt changes in the classroom rhythm from the evolution trajectory of the mixed features so as to perform an initial segmentation. A content fingerprint is then computed for each video paragraph, a paragraph association network graph is constructed, paragraph groups that are closely related semantically are identified and recombined into coherent knowledge units, and a structured video catalog is finally generated from the time marks of the knowledge units. By collaboratively analyzing multi-modal features, the scheme accurately locates transition boundaries in the teaching content and uses the semantic network to reorganize related content scattered across different periods, thereby realizing automatic construction of a video catalog that conforms to cognitive logic.
Inventors
- XIANG TAO
- ZHANG DI
- SHI XINGJIE
Assignees
- 无锡铭思会学科技有限公司
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2026-01-22
Claims (10)
- 1. A classroom video segmentation and catalog creation method based on multi-modal information, characterized by comprising the following steps: collecting visual picture signals, voice audio signals and screen text signals from a classroom video in parallel, and performing time-domain alignment on the three signals to form a synchronous multi-modal signal stream; framing the synchronous multi-modal signal stream, extracting, for each time frame, a picture change feature, a voice energy profile feature and a text keyword flash feature, and combining them into a mixed feature of the current frame; detecting abrupt boundary points of the classroom rhythm according to the evolution trajectory of the mixed features over the time frames, and segmenting the classroom video into a plurality of original video paragraphs at the abrupt boundary points; computing a paragraph fingerprint for each original video paragraph, constructing a paragraph association network graph according to the cosine distances among the paragraph fingerprints of all original video paragraphs, and identifying dense subgraphs in the paragraph association network graph; mapping the dense subgraphs back to the time domain, and reorganizing and splicing the corresponding original video paragraphs to generate semantically coherent knowledge units; and marking division points on the original classroom video using the start and end times of the knowledge units, and generating a hierarchical video catalog draft from the division points.
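The time-domain alignment step of claim 1 can be illustrated by a minimal sketch, not the claimed implementation: it assumes each modality arrives as a timestamped scalar stream and resamples the streams onto a shared frame clock by linear interpolation; all function and parameter names are illustrative.

```python
import numpy as np

def align_streams(video_ts, video_vals, audio_ts, audio_vals, frame_rate=5.0):
    """Resample two modality streams (timestamps + scalar values) onto one
    shared frame clock, yielding a synchronous multi-modal signal stream.
    The frame rate and two-modality setup are simplifying assumptions."""
    t_end = min(video_ts[-1], audio_ts[-1])          # only the overlapping span
    clock = np.arange(0.0, t_end, 1.0 / frame_rate)  # common frame times
    video = np.interp(clock, video_ts, video_vals)   # linear interpolation
    audio = np.interp(clock, audio_ts, audio_vals)
    return clock, np.stack([video, audio], axis=1)   # one row per time frame
```

In practice the screen-text stream would be aligned the same way, with OCR events held constant between timestamps rather than interpolated.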
- 2. The classroom video segmentation and catalog creation method based on multi-modal information according to claim 1, wherein detecting abrupt boundary points of the classroom rhythm according to the evolution trajectory of the mixed features comprises: sliding an analysis window along the time axis, and computing the principal components of the mixed features of all time frames within the current analysis window; tracking the rotation angle of the principal component between consecutive analysis windows, and, when the rotation angle exceeds a preset angle mutation threshold, recording the center point of the current analysis window as a singular point to be verified; expanding a verification interval before and after the singular point to be verified, and respectively computing the mixed-feature mean vectors of the front part and the rear part of the verification interval; and computing the Mahalanobis distance between the two mean vectors, and judging the singular point to be verified as a valid classroom rhythm abrupt boundary point when the Mahalanobis distance exceeds a preset distance confirmation threshold.
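The two-stage detection of claim 2 can be sketched roughly as follows. All window sizes and thresholds are illustrative assumptions; the principal component is estimated by eigendecomposition of the window covariance, and the Mahalanobis test here pools the within-part covariances of the verification interval (one reasonable reading of the claim, not the only one).

```python
import numpy as np

def detect_rhythm_boundaries(feats, win=40, step=20,
                             angle_thresh=0.5, verify=25, dist_thresh=3.0):
    """feats: (T, D) per-frame mixed features. Returns verified abrupt
    boundary points (frame indices). Parameters are illustrative."""
    def first_pc(block):
        centered = block - block.mean(axis=0)
        _, vecs = np.linalg.eigh(np.cov(centered, rowvar=False))
        return vecs[:, -1]                       # largest-eigenvalue direction

    boundaries, prev_pc = [], None
    for start in range(0, len(feats) - win + 1, step):
        pc = first_pc(feats[start:start + win])
        if prev_pc is not None:
            # sign-invariant rotation angle between consecutive components
            angle = np.arccos(np.clip(abs(pc @ prev_pc), 0.0, 1.0))
            if angle > angle_thresh:             # singular point to verify
                c = start + win // 2
                front = feats[max(0, c - verify):c]
                rear = feats[c:min(len(feats), c + verify)]
                # Mahalanobis distance between the two mean vectors
                cov = (np.cov(front, rowvar=False) + np.cov(rear, rowvar=False)) / 2
                cov += 1e-6 * np.eye(cov.shape[0])  # regularize for inversion
                diff = front.mean(axis=0) - rear.mean(axis=0)
                if np.sqrt(diff @ np.linalg.solve(cov, diff)) > dist_thresh:
                    boundaries.append(c)         # confirmed rhythm boundary
        prev_pc = pc
    return boundaries
```

Note how the second stage rejects angle triggers that are not backed by a genuine shift in the feature means, which is the point of the verification interval.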
- 3. The classroom video segmentation and catalog creation method based on multi-modal information according to claim 1, wherein segmenting the classroom video into a plurality of original video paragraphs at the abrupt boundary points comprises: arranging all valid classroom rhythm abrupt boundary points in time order, and cutting the video stream between each pair of adjacent boundary points into an original video paragraph; checking the duration of each original video paragraph, and, if the duration is shorter than a preset minimum paragraph threshold, forcibly merging the paragraph with the subsequent original video paragraph; and, if the duration of a merged paragraph exceeds a preset maximum paragraph threshold, performing a secondary split within the merged paragraph at the lowest valley point of the voice energy profile feature, so that the duration of every resulting original video paragraph falls within the preset range.
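The merge-then-split policy of claim 3 can be sketched as below; boundary indices, thresholds and the valley-search margin are illustrative assumptions.

```python
def enforce_duration_limits(boundaries, energy, min_len=30, max_len=300):
    """boundaries: sorted frame indices including 0 and the video end.
    energy: per-frame voice-energy profile (indexable sequence).
    Short paragraphs are merged forward; over-long merged paragraphs are
    re-split at the lowest energy valley in their interior."""
    segs = list(zip(boundaries[:-1], boundaries[1:]))  # (start, end) pairs
    merged, i = [], 0
    while i < len(segs):
        s, e = segs[i]
        # merge forward while shorter than the minimum duration
        while e - s < min_len and i + 1 < len(segs):
            i += 1
            e = segs[i][1]
        merged.append((s, e))
        i += 1
    final = []
    for s, e in merged:
        while e - s > max_len:
            # split at the lowest energy valley, keeping min_len margins
            interior = energy[s + min_len : e - min_len]
            cut = s + min_len + min(range(len(interior)), key=interior.__getitem__)
            final.append((s, cut))
            s = cut
        final.append((s, e))
    return final
```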
- 4. The classroom video segmentation and catalog creation method based on multi-modal information according to claim 1, wherein computing a paragraph fingerprint for each original video paragraph comprises: forming the paragraph fingerprint on the basis of statistical histograms of the mixed features within the original video paragraph; concatenating the picture change features of all time frames in the paragraph into a feature matrix, performing a wavelet transform on the feature matrix along the time dimension, and extracting the low-frequency wavelet coefficients as a picture steady-state feature; counting the proportion of frames in the paragraph whose voice energy profile feature exceeds an activity threshold as a voice activity feature; collecting all flashed text keywords in the paragraph into a keyword set, and converting the keyword set into a weighted average vector based on pre-trained word vectors as a text semantic feature; and normalizing and concatenating the picture steady-state feature, the voice activity feature and the text semantic feature to form the paragraph fingerprint representing the original video paragraph.
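A minimal sketch of the three fingerprint components in claim 4. A single-level Haar averaging step stands in for the unspecified wavelet transform, the keyword weights are taken as uniform, and all names and thresholds are assumptions.

```python
import numpy as np

def paragraph_fingerprint(frame_feats, energy, keyword_vecs, act_thresh=0.5):
    """frame_feats: (T, D) picture-change features; energy: (T,) voice-energy
    profile; keyword_vecs: pre-trained word vectors of the paragraph's
    flashed keywords. Returns the concatenated, block-normalized fingerprint."""
    # picture steady-state feature: low-frequency (pairwise-average) band
    T = len(frame_feats) - len(frame_feats) % 2
    pairs = frame_feats[:T].reshape(T // 2, 2, -1)
    low_freq = pairs.mean(axis=1).mean(axis=0)      # pooled approximation band
    # voice activity feature: fraction of frames above the activity threshold
    activity = np.array([np.mean(energy > act_thresh)])
    # text semantic feature: (uniformly weighted) average keyword embedding
    text = np.mean(keyword_vecs, axis=0) if keyword_vecs else np.zeros_like(low_freq)
    parts = []
    for p in (low_freq, activity, text):
        norm = np.linalg.norm(p)
        parts.append(p / norm if norm > 0 else p)   # normalize each block
    return np.concatenate(parts)                    # cascaded fingerprint
```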
- 5. The classroom video segmentation and catalog creation method based on multi-modal information according to claim 1, wherein constructing the paragraph association network graph according to the cosine distances among the paragraph fingerprints of all original video paragraphs comprises: taking each original video paragraph as a network node, and computing the cosine similarity between the paragraph fingerprints of any two network nodes; setting a similarity link threshold, and, when the cosine similarity between two network nodes exceeds the threshold, establishing an undirected edge between them whose weight equals the cosine similarity; and traversing all pairs of network nodes to complete the edges, thereby forming a weighted paragraph association network graph that reflects the proximity of the original video paragraphs in the feature space.
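The graph construction of claim 5 is straightforward; a sketch with an adjacency-dict representation (an implementation choice, not mandated by the claim) follows.

```python
import numpy as np

def build_association_graph(fingerprints, link_thresh=0.7):
    """Nodes are paragraph indices; an undirected edge weighted by cosine
    similarity links any pair above the threshold. Returns an adjacency
    dict {node: {neighbor: weight}}. The threshold value is illustrative."""
    def cos_sim(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    n = len(fingerprints)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):                 # traverse all node pairs
            w = cos_sim(fingerprints[i], fingerprints[j])
            if w > link_thresh:
                graph[i][j] = w                   # undirected: store both ways
                graph[j][i] = w
    return graph
```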
- 6. The classroom video segmentation and catalog creation method based on multi-modal information according to claim 5, wherein identifying dense subgraphs in the paragraph association network graph comprises: running a graph clustering algorithm on the weighted paragraph association network graph to aggregate tightly connected network nodes with high edge weights into communities; computing, for each community, the ratio of its average internal edge weight to the average edge weight between communities, and marking communities whose ratio exceeds a community aggregation threshold as candidate dense subgraphs; and, if the network nodes of a candidate dense subgraph are not temporally consecutive, splitting the candidate dense subgraph in time order into several time-continuous fragments, each of which is a final dense subgraph.
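The ratio filter and temporal split of claim 6 can be sketched as below, taking the community partition as given. Node ids are assumed to double as the time order of the paragraphs, and the ratio threshold is illustrative.

```python
def dense_subgraphs(graph, communities, ratio_thresh=1.5):
    """graph: weighted adjacency dict; communities: list of node sets.
    Keeps communities whose intra/inter average-edge-weight ratio exceeds
    the threshold, then splits each into maximal runs of consecutive
    (time-ordered) node ids."""
    def avg(ws):
        return sum(ws) / len(ws) if ws else 0.0
    results = []
    for comm in communities:
        intra = [w for u in comm for v, w in graph[u].items() if v in comm and u < v]
        inter = [w for u in comm for v, w in graph[u].items() if v not in comm]
        if inter and avg(intra) / avg(inter) <= ratio_thresh:
            continue                       # not dense relative to surroundings
        nodes, run = sorted(comm), []
        for n in nodes:
            if run and n != run[-1] + 1:   # gap in time order: start a new run
                results.append(run)
                run = []
            run.append(n)
        results.append(run)
    return results
```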
- 7. The classroom video segmentation and catalog creation method based on multi-modal information according to claim 6, wherein mapping the dense subgraphs back to the time domain and reorganizing and splicing the corresponding original video paragraphs to generate semantically coherent knowledge units comprises: arranging the network nodes contained in each dense subgraph in order on the time axis according to the timestamps of the original video paragraphs they represent; joining the arranged original video paragraphs end to end into a long video paragraph sequence; performing a stability test on the mixed features at the start and end positions of the long video paragraph sequence, fine-tuning the start point earlier until the features become stable if the features at the start position are unstable, and fine-tuning the end point later until the features become stable if the features at the end position are unstable, thereby determining the precise start and end times of the knowledge unit; and defining each long video paragraph sequence with precise start and end times as a knowledge unit.
- 8. The classroom video segmentation and catalog creation method based on multi-modal information according to claim 1, wherein generating a hierarchical video catalog draft from the division points comprises: taking each knowledge unit as a leaf node, and generating its title from the screen text keywords that occur most frequently within the knowledge unit; analyzing the temporal adjacency and paragraph fingerprint similarity between knowledge units, merging several knowledge units that are consecutive in time and whose paragraph fingerprint similarity exceeds a merging threshold into a parent node, and generating a summarizing title for the parent node; and performing the merging operation recursively until no new parent node can be formed, thereby building a multi-level directory tree, namely the hierarchical video catalog draft, from the bottom up.
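The bottom-up merging of claim 8 can be sketched as below. Title generation is omitted; taking the parent fingerprint as the mean of its children and the cosine similarity measure are assumptions, and the merge threshold is illustrative.

```python
import numpy as np

def build_catalog_tree(leaf_fps, merge_thresh=0.8):
    """leaf_fps: per-knowledge-unit fingerprints in time order. Repeatedly
    merges time-adjacent nodes whose cosine similarity exceeds the threshold
    under a shared parent until no merge occurs; returns nested lists of
    leaf indices (the directory tree without titles)."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    nodes = [(i, np.asarray(fp, float)) for i, fp in enumerate(leaf_fps)]
    changed = True
    while changed:
        changed, merged, i = False, [], 0
        while i < len(nodes):
            # greedily absorb the run of similar time-adjacent neighbors
            ids, fp = [nodes[i][0]], nodes[i][1]
            while i + 1 < len(nodes) and cos(fp, nodes[i + 1][1]) > merge_thresh:
                ids.append(nodes[i + 1][0])
                fp = (fp + nodes[i + 1][1]) / 2   # parent fingerprint
                i += 1
                changed = True
            merged.append((ids if len(ids) > 1 else ids[0], fp))
            i += 1
        nodes = merged
    return [n[0] for n in nodes]
```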
- 9. The classroom video segmentation and catalog creation method based on multi-modal information according to claim 2, wherein sliding an analysis window along the time axis and computing the principal components of the mixed features of all time frames within the current analysis window comprises: setting an analysis window of fixed time length, and sliding the analysis window along the time axis with a fixed step; for each analysis window position, extracting the mixed features of all time frames falling within the window, and arranging them in time order into a feature matrix; centering the feature matrix by subtracting the mean of each feature dimension, to obtain a centered feature matrix; computing the covariance matrix of the centered feature matrix, and performing an eigenvalue decomposition on the covariance matrix; and selecting the first several eigenvectors with the largest eigenvalues as the principal components, which represent the main direction of evolution of the mixed features within the current analysis window.
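The per-window computation of claim 9 in isolation (centering, covariance, eigendecomposition, top-k selection) can be sketched as:

```python
import numpy as np

def window_principal_components(window_feats, k=2):
    """window_feats: (n, D) feature matrix of one analysis window.
    Returns the k largest eigenvalues and their eigenvectors (principal
    components) of the window covariance. k is an illustrative choice."""
    centered = window_feats - window_feats.mean(axis=0)   # centering step
    cov = centered.T @ centered / (len(window_feats) - 1) # covariance matrix
    vals, vecs = np.linalg.eigh(cov)                      # ascending eigenvalues
    order = np.argsort(vals)[::-1][:k]                    # largest first
    return vals[order], vecs[:, order]                    # top-k components
```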
- 10. The classroom video segmentation and catalog creation method based on multi-modal information according to claim 6, wherein running a graph clustering algorithm on the weighted paragraph association network graph to aggregate tightly connected network nodes with high edge weights into communities comprises: initializing the paragraph association network graph by treating each network node as an independent community; computing the modularity gain produced by merging each pair of adjacent communities, the modularity gain measuring the quality of the community division; finding and executing the community merge that yields the largest modularity gain, and merging the corresponding communities; repeating the gain computation and merge execution until the modularity of the whole network no longer increases; and taking the community division obtained at that point as final, each community comprising a group of network nodes that are tightly connected in the network with high edge weights.
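Claim 10 describes greedy modularity maximization (in the spirit of the Clauset-Newman-Moore scheme). A deliberately naive sketch that recomputes modularity for each candidate merge, rather than using incremental gains, follows; it is quadratic per step and meant only to mirror the claimed loop.

```python
def greedy_modularity_communities(graph):
    """graph: weighted adjacency dict {u: {v: w}}. Starts from singleton
    communities and repeatedly executes the adjacent-community merge with
    the largest positive modularity gain until none remains."""
    m = sum(w for u in graph for w in graph[u].values()) / 2  # total weight
    deg = {u: sum(graph[u].values()) for u in graph}
    comms = [{u} for u in graph]

    def modularity(partition):
        q = 0.0
        for comm in partition:
            internal = sum(w for u in comm for v, w in graph[u].items() if v in comm)
            degree = sum(deg[u] for u in comm)
            q += internal / (2 * m) - (degree / (2 * m)) ** 2
        return q

    while True:
        base, best, pair = modularity(comms), 0.0, None
        for i in range(len(comms)):
            for j in range(i + 1, len(comms)):
                # only adjacent communities can improve modularity
                if not any(v in comms[j] for u in comms[i] for v in graph[u]):
                    continue
                trial = [c for k, c in enumerate(comms) if k not in (i, j)]
                trial.append(comms[i] | comms[j])
                gain = modularity(trial) - base
                if gain > best:
                    best, pair = gain, (i, j)
        if pair is None:
            return comms                  # modularity no longer increases
        i, j = pair
        comms = [c for k, c in enumerate(comms) if k not in (i, j)] + [comms[i] | comms[j]]
```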
Description
Classroom video segmentation and catalog creation method based on multi-modal information
Technical Field
The invention relates to the technical field of intelligent education video processing, and in particular to a classroom video segmentation and catalog creation method based on multi-modal information.
Background
At present, structured processing of classroom video relies primarily on the analysis of single-modality information. Common practices include visual scene-cut detection, audio-based recognition of silence segments or energy mutation points, and simple keyword matching with timestamping of screen text. These techniques typically operate independently, performing a preliminary slicing of the video stream based only on threshold variations within their own signal dimension. Such schemes based on single-modality threshold discrimination have drawbacks. A change of teaching rhythm in the classroom is the combined result of multiple elements such as the teacher's speech, blackboard writing, courseware presentation and student attention, and the start and end of a knowledge point are difficult to locate accurately from a single sharp change in picture or sound alone. A brief pause in the audio may be a thinking gap rather than a chapter transition, and a slide turn during a continuous lecture does not necessarily mark the end of a knowledge unit. Meanwhile, video paragraphs obtained by purely chronological cutting often fragment subject content that is distributed discontinuously in time. A teacher may review, deepen or supplement the explanation of a core concept many times during a lesson, so that related teaching content is scattered across different periods, and existing linear segmentation methods cannot effectively aggregate and reorganize paragraphs that are semantically related but temporally dispersed.
A technology is therefore needed that comprehensively exploits the multi-source information of the classroom, accurately perceives substantive turns in the teaching rhythm, and intelligently identifies and reorganizes teaching content that is temporally discrete but semantically highly related, so as to form a knowledge unit system conforming to cognitive logic and provide a foundation for generating a high-quality video catalog.
Disclosure of Invention
The invention aims to provide a classroom video segmentation and catalog creation method based on multi-modal information so as to solve the problems described in the background. To achieve the above object, the invention provides a classroom video segmentation and catalog creation method based on multi-modal information, the method comprising: collecting visual picture signals, voice audio signals and screen text signals from a classroom video in parallel, and performing time-domain alignment on the three signals to form a synchronous multi-modal signal stream; framing the synchronous multi-modal signal stream, extracting, for each time frame, a picture change feature, a voice energy profile feature and a text keyword flash feature, and combining them into a mixed feature of the current frame; detecting abrupt boundary points of the classroom rhythm according to the evolution trajectory of the mixed features over the time frames, and segmenting the classroom video into a plurality of original video paragraphs at the abrupt boundary points; computing a paragraph fingerprint for each original video paragraph, constructing a paragraph association network graph according to the cosine distances among the paragraph fingerprints of all original video paragraphs, and identifying dense subgraphs in the paragraph association network graph; mapping the dense subgraphs back to the time domain, and reorganizing and splicing the corresponding original video paragraphs to generate semantically coherent knowledge units; and marking division points on the original classroom video using the start and end times of the knowledge units, and generating a hierarchical video catalog draft from the division points. Preferably, detecting abrupt boundary points of the classroom rhythm according to the evolution trajectory of the mixed features comprises: sliding an analysis window along the time axis, and computing the principal components of the mixed features of all time frames within the current analysis window; tracking the rotation angle of the principal component between consecutive analysis windows, and, when the rotation angle exceeds a preset angle mutation threshold, recording the center point of the current analysis window as a singular point to be verified; expanding a verification interval before and after the singular point to be verified, and respectively computing the mixed-feature