
CN-122018691-A - Spoken English scenario simulation teaching method and system based on somatosensory interaction

CN 122018691 A

Abstract

The invention discloses a spoken English scenario simulation teaching method and system based on somatosensory interaction, belonging to the technical field of spoken English scenario simulation teaching. The method acquires a learner's head rotation, limb gestures, face orientation, and hand actions in a teaching scenario to form an initial interaction feature sequence, and synchronously captures the learner's spoken English output to construct a voice feature matrix. These features are integrated through time-sequence alignment and a multi-channel attention mechanism to generate a somatosensory-voice joint interaction feature map. Based on this map, the system matches simulation task nodes with non-player character (NPC) scripts and dynamically adjusts dialogue rhythm and context complexity. It then combines the learner's completion status and interaction performance to generate a capability assessment report, and recommends customized training tasks targeting the low-response indexes identified in the assessment results.

Inventors

  • YAN XUE

Assignees

  • Kewen College of Jiangsu Normal University (江苏师范大学科文学院)

Dates

Publication Date
2026-05-12
Application Date
2026-02-02

Claims (8)

  1. A spoken English scenario simulation teaching method based on somatosensory interaction, characterized by comprising the following steps: acquiring action behavior data of a learner in a teaching scene, the action behavior data comprising head rotation, limb gestures, face orientation, and hand actions, and forming an initial interaction feature sequence; acquiring the learner's spoken English output, performing semantic tag classification, fluency evaluation, and intonation recognition on the voice content based on a voice semantic analysis model, and constructing a voice feature matrix; performing time-sequence alignment and feature fusion on the interaction feature sequence and the voice feature matrix to form a somatosensory-voice joint interaction feature map; based on the somatosensory-voice joint interaction feature map, selecting a corresponding simulation task node and NPC role script, and dynamically adjusting dialogue rhythm, context difficulty, and role feedback strategy; based on the learner's current task completion state and the somatosensory-voice joint interaction feature map, calculating task completion degree, language interaction response time, and interaction fidelity, and generating a capability assessment report; and automatically recommending customized training tasks containing set scenes, vocabulary topics, and behavioral interaction requirements according to the low-response indexes marked in the capability assessment report, forming a personalized cyclic training path for the learner.
  2. The spoken English scenario simulation teaching method based on somatosensory interaction according to claim 1, wherein the process of forming the somatosensory-voice joint interaction feature map comprises the following steps: performing time-sequence alignment of the initial interaction feature sequence and the voice feature matrix based on a unified time stamp; calculating an optimal matching path between the aligned feature sequences by a dynamic time warping algorithm, and establishing a response mapping relation between voice and actions; performing weighted fusion of the aligned features through a multi-channel attention mechanism, and extracting the joint features that have a key effect on task feedback in the joint behavior expression; and constructing a graph structure from the fused joint features, taking nodes as units, to generate the somatosensory-voice joint interaction feature map.
  3. The spoken English scenario simulation teaching method based on somatosensory interaction according to claim 1, wherein the process of selecting the corresponding simulation task node and NPC role script based on the somatosensory-voice joint interaction feature map comprises the following steps: extracting a key node set with high semantic relevance and behavior continuity from the somatosensory-voice joint interaction feature map as an expression subgraph of the current learner behavior state; performing vector similarity matching between the node feature vectors in the expression subgraph and the task templates in a preset task knowledge graph, and determining the best-fit simulation task node; searching a corresponding non-player character (NPC) behavior script library based on the context semantic tags of the simulation task node, and selecting a role response script with situational adaptability; and performing dynamic parameter replacement on the language expression mode and interaction triggering conditions of the role response script so that it conforms to the current learner state, realizing personalized task generation.
  4. The spoken English scenario simulation teaching method based on somatosensory interaction according to claim 3, wherein the process of dynamically adjusting the dialogue rhythm, context difficulty, and role feedback strategy comprises the following steps: calculating the learner's current cognitive load level based on the voice fluency index of the nodes in the somatosensory-voice joint interaction feature map and the limb response delay time; selecting a matching context complexity level from a preset context parameter library according to the cognitive load level, and correspondingly adjusting the background task instruction length, number of keywords, and grammar structure level; and dynamically adjusting the NPC's language output speed, pause intervals, and guidance prompt mode in combination with the response time and semantic deviation rate in the learner's historical interaction features.
  5. The spoken English scenario simulation teaching method based on somatosensory interaction according to claim 1, wherein the process of calculating the task completion degree, language interaction response time, and interaction fidelity comprises the following steps: extracting the key behavior path corresponding to the current simulation task node in the somatosensory-voice joint interaction feature map as the task behavior track; calculating the task completion degree according to the semantic tag matching rate and action execution integrity in the task behavior track; analyzing the average time interval between the learner's voice output and the NPC's voice input to obtain the language interaction response time; and comprehensively calculating the interaction fidelity score using the time-sequence synchrony of the action and voice features and the intonation-emotion consistency index.
  6. The spoken English scenario simulation teaching method based on somatosensory interaction according to claim 1, wherein the process of obtaining the learner action behavior data comprises: obtaining the learner's head pitch, yaw, and roll angles, skeletal joint point position coordinates, gesture morphology parameters, and face orientation vectors through a depth camera, an infrared recognition device, and a three-dimensional gesture estimation algorithm, and organizing them into the initial interaction feature sequence according to time stamps.
  7. The spoken English scenario simulation teaching method based on somatosensory interaction according to claim 1, wherein the voice semantic analysis model is constructed on a deep learning structure: semantic intention recognition, grammar structure analysis, and keyword extraction are performed on the speech transcription content by a bidirectional long short-term memory conditional random field (BiLSTM-CRF) model or a Transformer model, generating a voice feature matrix comprising semantic tags, fluency scores, and intonation features.
  8. A spoken English scenario simulation teaching system based on somatosensory interaction, used for realizing the spoken English scenario simulation teaching method based on somatosensory interaction according to any one of claims 1-7, characterized by comprising: a somatosensory behavior acquisition module, used for acquiring action behavior data of a learner in a teaching scene, including head rotation, limb gestures, face orientation, and hand actions, to form an initial interaction feature sequence; a voice feature analysis module, used for acquiring the learner's spoken English output, performing semantic tag classification, fluency assessment, and intonation recognition on the voice content based on the voice semantic analysis model, and constructing a voice feature matrix; a multi-modal feature fusion module, used for performing time-sequence alignment and feature fusion on the interaction feature sequence and the voice feature matrix to form a somatosensory-voice joint interaction feature map; an interaction generation module, used for selecting corresponding simulation task nodes and NPC role scripts based on the somatosensory-voice joint interaction feature map, and dynamically adjusting dialogue rhythm, context difficulty, and role feedback strategies; a performance analysis module, used for calculating the task completion degree, language interaction response time, and interaction fidelity based on the learner's current task completion state and the somatosensory-voice joint interaction feature map, and generating a capability assessment report; and a training recommendation module, used for automatically recommending customized training tasks comprising set scenes, vocabulary topics, and behavioral interaction requirements according to the low-response indexes marked in the capability assessment report, forming a personalized cyclic training path for the learner.
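The alignment-and-fusion pipeline of claim 2 (dynamic time warping over a unified time base, then attention-style weighted fusion of aligned feature pairs) can be illustrated with a minimal Python sketch. This is not the patented implementation: the toy 1-D feature streams, the variance-based stand-in for learned attention weights, and the function names `dtw_path` and `fuse_features` are all assumptions for illustration only.

```python
import numpy as np

def dtw_path(a, b):
    """Dynamic time warping: optimal alignment path between two 1-D feature sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from (n, m) to recover the optimal matching path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def fuse_features(motion, voice, path):
    """Weighted fusion of aligned motion/voice feature pairs.

    Channel weights come from a softmax over per-channel variance, a crude
    stand-in for the learned multi-channel attention of claim 2."""
    pairs = np.array([[motion[i], voice[j]] for i, j in path])
    var = pairs.var(axis=0)
    w = np.exp(var) / np.exp(var).sum()
    return pairs @ w  # one fused value per aligned step

# Toy streams: a 4-step motion feature aligned against a 5-step voice feature.
motion = np.array([0.1, 0.5, 0.9, 0.4])
voice = np.array([0.2, 0.6, 1.0, 0.8, 0.3])
path = dtw_path(motion, voice)
fused = fuse_features(motion, voice, path)
```

In a real system each time step would carry a full pose/speech feature vector rather than a scalar, and the fused vectors would become the node attributes of the graph structure that claim 2 builds.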
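The vector-similarity matching of claim 3, where node feature vectors from the learner's expression subgraph are matched against task templates in a preset task knowledge graph, can be sketched with cosine similarity. The template vectors, scenario names, and the helper `best_task_node` are hypothetical; the patent does not specify the similarity measure.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def best_task_node(subgraph_vec, task_templates):
    """Return the task whose template vector best matches the learner's
    behavior-state subgraph vector (claim 3's vector-similarity matching)."""
    return max(task_templates, key=lambda tid: cosine(subgraph_vec, task_templates[tid]))

# Hypothetical task knowledge graph: one template vector per scenario,
# using the scenario types named in the description.
templates = {
    "airport_inquiry":   np.array([0.9, 0.1, 0.2]),
    "mall_shopping":     np.array([0.2, 0.8, 0.3]),
    "hospital_register": np.array([0.1, 0.3, 0.9]),
}
learner_state = np.array([0.85, 0.2, 0.15])
print(best_task_node(learner_state, templates))  # → airport_inquiry
```

The selected task node's context semantic tags would then index into the NPC behavior script library, as claim 3 describes.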
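Claim 4's difficulty adjustment, which estimates cognitive load from voice fluency and limb response delay and then picks a context complexity level from a preset parameter library, can be sketched as a simple lookup. All weights, thresholds, and parameter values here are illustrative assumptions, not values from the patent.

```python
def cognitive_load(fluency, limb_delay_s, max_delay_s=3.0):
    """Estimate cognitive load in [0, 1]: low fluency and slow limb
    responses both push the load up. The 50/50 weighting is illustrative."""
    delay = min(limb_delay_s / max_delay_s, 1.0)
    return 0.5 * (1.0 - fluency) + 0.5 * delay

def pick_context_level(load):
    """Map load to a context-complexity level from a preset parameter bank:
    higher load -> shorter task instructions, fewer keywords, simpler grammar."""
    if load < 0.33:
        return {"level": "advanced", "instruction_len": 40, "keywords": 8}
    if load < 0.66:
        return {"level": "intermediate", "instruction_len": 25, "keywords": 5}
    return {"level": "basic", "instruction_len": 12, "keywords": 3}

# A fluent learner responding quickly gets the advanced context.
print(pick_context_level(cognitive_load(fluency=0.9, limb_delay_s=0.5))["level"])
```

The same load signal could also drive the NPC-side adjustments in claim 4 (speech rate, pause interval, prompt mode), which are omitted here for brevity.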
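Finally, the assessment of claim 5 feeding the recommendation step of claim 1 can be sketched as one report-building function: task completion from tag matching and action integrity, response time from the average voice gap, fidelity from synchrony and intonation-emotion consistency, with indexes below a threshold marked "low" to drive the customized training recommendation. The weights and the 0.6 threshold are assumptions.

```python
def assessment_report(tag_match_rate, action_integrity, response_gaps_s,
                      sync_score, emotion_consistency, low_threshold=0.6):
    """Compute the three claim-5 indexes and mark the low-response indexes
    that trigger claim 1's training recommendation. Weights are illustrative."""
    completion = 0.5 * tag_match_rate + 0.5 * action_integrity
    response_time = sum(response_gaps_s) / len(response_gaps_s)
    fidelity = 0.5 * sync_score + 0.5 * emotion_consistency
    report = {
        "completion": completion,          # semantic match + action integrity
        "response_time_s": response_time,  # mean learner-to-NPC voice gap
        "fidelity": fidelity,              # synchrony + emotion consistency
    }
    report["low_indexes"] = [k for k in ("completion", "fidelity")
                             if report[k] < low_threshold]
    return report

# A learner who completes the task well but interacts stiffly: only the
# fidelity index is flagged, so fidelity-focused training would be recommended.
r = assessment_report(tag_match_rate=0.8, action_integrity=0.9,
                      response_gaps_s=[1.2, 0.8, 1.0],
                      sync_score=0.4, emotion_consistency=0.5)
print(r["low_indexes"])  # → ['fidelity']
```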

Description

Spoken English scenario simulation teaching method and system based on somatosensory interaction

Technical Field

The invention relates to the technical field of language scene simulation teaching, in particular to a spoken English scene simulation teaching method and system based on somatosensory interaction.

Background

At present, spoken English teaching widely adopts means such as video teaching, scene dialogue, and AI speech recognition scoring. These are effective for standard pronunciation training, but they generally suffer from a lack of immersive interaction, low student participation, and an absence of realistic scenes for spoken output. Especially at the primary and middle school stages, students' English expression depends on rote memorization; their ability to understand and respond to the environment is weak, and they struggle in real communication scenes. The prior art attempts to introduce Virtual Reality (VR), Augmented Reality (AR), or speech recognition to assist English teaching, but the following problems remain: (1) the equipment is expensive and complex to operate, and is not suitable for wide popularization; (2) speech recognition has low fault tolerance for dialects, speech speed, and intonation variations, resulting in evaluation distortion; (3) most systems lack guidance on non-linguistic factors (such as body language and eye contact) and cannot simulate a real dialogue atmosphere; (4) the lack of a feedback mechanism for the learner in the simulation scenario results in discontinuities in the learning path and fragmented training.
Especially in 'task-driven' spoken language training scenes, such as asking questions at an airport, shopping in a mall, or registering at a hospital, traditional technology struggles to establish a dynamic feedback mechanism from the learner's limb interaction behavior, language output state, and the multidimensional interaction model built into the system. As a result, the closed-loop process of 'dialogue rhythm guidance - language output training - interactive feedback assessment' cannot be effectively realized, and learners' practical language ability is difficult to improve.

Disclosure of Invention

The invention aims to provide a spoken English scene simulation teaching method and system based on somatosensory interaction that remedy the defects described in the background. To achieve this purpose, the invention provides the following technical scheme. The spoken English scene simulation teaching method based on somatosensory interaction comprises the following steps: acquiring action behavior data of a learner in a teaching scene, the action behavior data comprising head rotation, limb gestures, face orientation, and hand actions, and forming an initial interaction feature sequence; acquiring the learner's spoken English output, performing semantic tag classification, fluency evaluation, and intonation recognition on the voice content based on a voice semantic analysis model, and constructing a voice feature matrix; performing time-sequence alignment and feature fusion on the interaction feature sequence and the voice feature matrix to form a somatosensory-voice joint interaction feature map; based on the somatosensory-voice joint interaction feature map, selecting a corresponding simulation task node and NPC role script, and dynamically adjusting dialogue rhythm, context difficulty, and role feedback strategy; based on the learner's current task completion state and the somatosensory-voice joint interaction feature map, calculating task completion degree, language interaction response time, and interaction fidelity, and generating a capability assessment report; and automatically recommending customized training tasks containing set scenes, vocabulary topics, and behavioral interaction requirements according to the low-response indexes marked in the capability assessment report, forming a personalized cyclic training path for the learner. The feature map is formed by aligning the initial interaction feature sequence with the voice feature matrix on a unified time stamp, calculating an optimal matching path between the aligned feature sequences with a dynamic time warping algorithm, establishing a response mapping relation between voice and actions, performing weighted fusion of the aligned features through a multi-channel attention mechanism, extracting the joint features with a key effect on task feedback in the joint behavior expression, and constructing a graph structure from the fused joint features, taking nodes as units, to generate the somatosensory-voice joint interaction feature map. Preferably, based on the somatosensory-voice joint interaction feature map, the process of selecting the corresponding simulation task node and NPC role script comprises the following steps: extracting a key node set with high semantic relevance and behavior continuity from the somatosensory-voice joint interaction feature map as an expression subgraph of the current learner behavior state.