CN-122024544-A - AI teaching-aid robot supporting multi-path media stream synchronization and interaction
Abstract
The invention relates to the technical field of intelligent interaction robots, and in particular to an AI teaching-aid robot supporting multi-path media stream synchronization and interaction. The robot comprises a robot body in which an AI teaching-aid interaction system is arranged. The AI teaching-aid interaction system comprises a streaming media synchronization module that adopts a multi-modal timestamp alignment algorithm to deeply fuse video frame motion characteristics based on the optical flow method with audio time delay estimation based on phase cross-correlation.
Inventors
- JIANG CHAO
- LIU JIANPING
- QU DONGMING
- WANG MINGMEI
Assignees
- Hebei Huafa Education Technology Co., Ltd. (河北华发教育科技股份有限公司)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-04-02
Claims (10)
- 1. An AI teaching-aid robot supporting multi-path media stream synchronization and interaction, comprising a robot body (1), characterized in that an AI teaching-aid interaction system is arranged in the robot body (1), and the AI teaching-aid interaction system comprises the following modules: the streaming media synchronization module is used for adopting a multi-modal timestamp alignment algorithm to deeply fuse video frame motion characteristics based on the optical flow method with audio time delay estimation based on phase cross-correlation, and for synchronously locking the speaker's identity and voice content at the acquisition source; the audio-visual joint tracking module is used for introducing an ant colony algorithm to dynamically optimize the beamforming direction of the microphone array, and for fusing a visual Kalman filter to predict the target motion trajectory, driving the robot's audio-visual sensing functions to perform joint active tracking; the priority learning scheduling module is used for adopting the Q-learning algorithm from reinforcement learning to dynamically schedule the processing priorities of the various data streams, obtaining an optimal scheduling strategy through trial-and-error learning; the interaction hot spot perception allocation module is used for integrating a visual attention mechanism to analyze the visual focus in the scene, perceiving the current interaction hot spot as the state input of Q-learning, and guiding the scheduling algorithm to tilt resources toward the real interaction subject; and the control module is used for receiving the audio-visual data, physical state data and resource data of each module, constructing a complete context, making interaction decisions, and converting the interaction decisions into specific control instructions so as to obtain stable resource support.
- 2. The AI teaching-aid robot supporting multi-path media stream synchronization and interaction of claim 1, wherein the streaming media synchronization module comprises a multi-modal timestamp alignment unit and an audio-visual identity synchronization locking unit; the multi-modal timestamp alignment unit is used for deeply fusing video frame motion characteristics based on the optical flow method with an audio time delay estimation algorithm based on phase cross-correlation at the source of audio and video acquisition, dynamically calibrating timestamps from different hardware by analyzing the time difference between lip micro-movement and sound wave arrival, removing the timestamp disorder of the multi-path streams, and aligning each frame image in time with its corresponding voice fragment; the audio-visual identity synchronization locking unit is used for binding the synchronized audio features and video features based on the accurately synchronized data provided by the multi-modal timestamp alignment unit, and associating the speaker's identity with the voice content through joint feature locking.
- 3. The AI teaching-aid robot of claim 2, wherein the multi-modal timestamp alignment unit performs the following steps (a sketch follows the claims): synchronously acquiring the video frame sequence output by the robot camera and the audio stream picked up by the microphone array, and respectively extracting the optical flow motion feature vector of the mouth region in each frame image and the frequency-domain phase information of each audio fragment; inputting the extracted optical flow motion characteristics and the audio time delay estimate computed by the phase cross-correlation algorithm into a deep fusion network, and constructing a dynamic time offset model by analyzing the difference between the lip movement onset moment and the sound wave arrival time; and calibrating the hardware clock drift of the camera and the microphone in real time with the dynamic time offset model, dynamically compensating the timestamps of the video frames and audio packets, and eliminating the timestamp disorder of the multi-path streams at the acquisition source.
- 4. The AI teaching-aid robot of claim 2, wherein the audio-visual identity synchronization locking unit performs the following steps (a sketch follows the claims): receiving the synchronized audio and video streams output by the multi-modal timestamp alignment unit, and extracting voiceprint feature vectors from the synchronized audio clips and facial/mouth-shape structural features from the corresponding video frames; constructing a cross-modal joint feature embedding space, and carrying out similarity measurement and association matching between voiceprint features and mouth-shape dynamic features to form an audio-visual identity binding mapping; and locking the identity of the current speaker based on the binding mapping, and associating the collected voice content with the corresponding identity in real time, so that the speaker and the voice content correspond accurately.
- 5. The AI teaching-aid robot supporting multi-path media stream synchronization and interaction of claim 2, wherein the audio-visual joint tracking module comprises a beamforming dynamic optimization unit and a vision-motion prediction driving unit; the beamforming dynamic optimization unit is used for introducing an ant colony algorithm, in view of a speaker's random movement, to dynamically optimize the beamforming direction of the microphone array, quickly finding the optimal path to the sound source in the sound field environment and dynamically adjusting the pickup direction; the vision-motion prediction driving unit is used for fusing a visual Kalman filter to predict the target's motion trajectory and driving the robot chassis to rotate smoothly in advance.
- 6. The AI teaching-aid robot of claim 5, wherein the beamforming dynamic optimization unit performs the following steps (a sketch follows the claims): initializing the beamforming parameters of the microphone array, treating the beam directions in all orientations as search paths in the ant colony algorithm, and taking the signal-to-noise ratio of the picked-up signal as the pheromone concentration evaluation index; simulating the movement of the ant colony in the sound field environment, and making the beamforming direction converge quickly to the optimal direction of the current sound source through iterative search with pheromone updating and path selection probability computation; and dynamically adjusting the array weighting coefficients according to the optimal beam direction output by the ant colony algorithm, updating the pickup beam direction in real time in environments with multiple people walking and noise interference, and continuously capturing high signal-to-noise-ratio speech.
- 7. The AI teaching-aid robot of claim 5, wherein the vision-motion prediction driving unit performs the following steps (a sketch follows the claims): receiving the sound source region locked by the beamforming dynamic optimization unit, synchronously acquiring the image sequence of the target speaker captured by the vision sensor, and extracting the target's position information in the image coordinate system; inputting the target position sequence into a visual Kalman filter, and, in combination with a preset target motion model, estimating and predicting the position and speed at the next moment to generate smooth motion trajectory prediction data; and computing in advance the required rotation angle and angular velocity of the robot chassis according to the motion trajectory prediction data, and generating a smooth following instruction to drive the robot to perform joint active tracking.
- 8. The AI teaching aid robot of claim 5, wherein the priority learning scheduling module performs the following steps (a sketch follows the claims): defining the multiple media stream types that exist concurrently in the system as the set of scheduling objects, setting the initial priority queues corresponding to each stream type, and initializing the state space and action space of the Q-learning algorithm; monitoring the data volume, processing delay and system resource occupancy of each media stream in real time, taking the current resource state as the input state of the Q-learning algorithm, and selecting a priority adjustment action according to an epsilon-greedy strategy; and after the priority adjustment action is executed, computing a reward value from the change in processing delay and the frame loss condition, and gradually converging, through iterative updates of the Q-table that optimize the scheduling strategy, to the optimal priority allocation scheme that minimizes resource contention.
- 9. The AI teaching aid robot of claim 8, wherein the interaction hot spot perception allocation module performs the following steps (a sketch follows the claims): acquiring classroom scene images captured by the cameras, extracting salient regions in the scene through a visual attention mechanism, and analyzing the visual focus distribution and human gaze directions in each region; fusing the visual focus distribution characteristics with the speaking state in each region, identifying the interaction hot spot region and the corresponding hot spot person in the current scene, and quantifying the heat weight of each interaction subject; and transmitting the heat weight of each interaction subject as a state input to the priority learning scheduling module, guiding the Q-learning algorithm to tilt computing and transmission resources toward the real interaction subject with the highest heat.
- 10. The AI teaching aid robot of claim 9, wherein the control module performs the following steps: collecting the output audio-visual binding data, physical state data and resource allocation data to construct a classroom interaction context containing multi-dimensional information; inputting the constructed interaction context into an interaction decision engine, and generating the interaction feedback content to be executed and the corresponding robot action sequence according to preset teaching interaction rules; and requesting from the priority learning scheduling module the stable resource guarantee required to execute the action sequence, converting the interaction feedback content into specific voice broadcast instructions and chassis movement instructions for issuance and execution, and realizing turn-taking interactive feedback among multiple users.
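The audio time delay estimation named in claim 3 is phase cross-correlation, whose standard instance is GCC-PHAT. Below is a minimal NumPy sketch assuming two mono signals at a common sample rate; the function name, the `max_tau` search window and the sample rate in the usage comment are illustrative, and the fusion with lip-motion onset into the claim's dynamic time offset model is not shown.

```python
import numpy as np

def gcc_phat_delay(sig, ref, fs, max_tau=0.02):
    """Estimate the delay of `sig` relative to `ref` in seconds via
    phase cross-correlation (GCC-PHAT)."""
    n = sig.size + ref.size                      # zero-pad to avoid circular wrap-around
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12                       # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(R, n=n)
    max_shift = int(fs * max_tau)                # restrict search to plausible delays
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))  # centre lag 0
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs

# e.g. tau = gcc_phat_delay(mic_audio, camera_ref_audio, fs=16000)
```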
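Claim 4's similarity measurement in a cross-modal joint embedding space can be sketched as cosine matching between one voiceprint embedding and per-face lip-motion embeddings. The networks that project both modalities into the shared space are assumed to exist already; the function name and acceptance threshold are illustrative, not from the patent.

```python
import numpy as np

def bind_speaker_identity(voice_emb, lip_embs, threshold=0.6):
    """Match a voiceprint embedding against per-face lip-motion embeddings
    already projected into a shared space; return the best-matching face id,
    or None when no similarity clears the threshold."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    scores = {face_id: cos(voice_emb, emb) for face_id, emb in lip_embs.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```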
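For claim 6, a minimal ant-colony loop over a discretised set of beam angles, using measured signal-to-noise ratio as the pheromone deposit, looks as follows. `snr_of` stands in for steering the array and measuring pickup quality; the ant count, iteration count and evaporation rate are illustrative assumptions.

```python
import numpy as np

def aco_beam_search(snr_of, angles, n_ants=20, n_iters=30, evaporation=0.3, seed=0):
    """Pick the beam angle with the best measured SNR: each ant samples an
    angle in proportion to pheromone; pheromone evaporates every iteration
    and is reinforced by the SNR the ant observed there."""
    rng = np.random.default_rng(seed)
    pheromone = np.ones(len(angles))
    best_angle, best_snr = angles[0], -np.inf
    for _ in range(n_iters):
        probs = pheromone / pheromone.sum()          # path-selection probability
        picks = rng.choice(len(angles), size=n_ants, p=probs)
        pheromone *= (1.0 - evaporation)             # pheromone evaporation
        for i in picks:
            snr = snr_of(angles[i])                  # steer array, measure pickup SNR
            pheromone[i] += max(snr, 0.0)            # deposit proportional to SNR
            if snr > best_snr:
                best_snr, best_angle = snr, angles[i]
    return best_angle, best_snr

# e.g. best, snr = aco_beam_search(measure_snr, np.linspace(-90, 90, 37))
```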
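Claim 7's visual Kalman filter, under the common constant-velocity assumption over image coordinates, can be sketched as below; the frame interval `dt` and the process and measurement noise magnitudes are illustrative.

```python
import numpy as np

class ConstantVelocityKF:
    """2-D constant-velocity Kalman filter: state = [x, y, vx, vy] in image
    coordinates; predict() yields the look-ahead position the chassis
    controller can turn toward before the target actually moves."""
    def __init__(self, dt=1/30, q=1.0, r=5.0):
        self.x = np.zeros(4)
        self.P = np.eye(4) * 100.0                   # large initial uncertainty
        self.F = np.eye(4); self.F[0, 2] = self.F[1, 3] = dt
        self.H = np.zeros((2, 4)); self.H[0, 0] = self.H[1, 1] = 1.0
        self.Q = np.eye(4) * q                       # process noise
        self.R = np.eye(2) * r                       # measurement noise

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]                            # predicted (x, y)

    def update(self, z):
        y = np.asarray(z, float) - self.H @ self.x   # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)     # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
```

Calling predict() before each update() gives the look-ahead position from which the chassis rotation angle and angular velocity can be computed in advance, as the claim describes.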
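Claim 8 describes tabular Q-learning with an epsilon-greedy policy and a reward built from the delay change and frame loss. A minimal sketch follows, assuming the resource state has already been discretised into integer indices; the learning rate, discount factor, exploration rate and reward weights are illustrative.

```python
import numpy as np

class PrioritySchedulerQL:
    """Tabular Q-learning over discretised resource states; actions are
    priority adjustments for the media-stream queues."""
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
        self.Q = np.zeros((n_states, n_actions))
        self.alpha, self.gamma, self.eps = alpha, gamma, eps
        self.rng = np.random.default_rng(seed)

    def act(self, s):
        # epsilon-greedy: explore with probability eps, otherwise exploit
        if self.rng.random() < self.eps:
            return int(self.rng.integers(self.Q.shape[1]))
        return int(np.argmax(self.Q[s]))

    def learn(self, s, a, reward, s_next):
        # standard Q-table update toward the bootstrapped target
        td_target = reward + self.gamma * self.Q[s_next].max()
        self.Q[s, a] += self.alpha * (td_target - self.Q[s, a])

def reward(delay_before_ms, delay_after_ms, frames_lost):
    # positive when latency improves, penalised per lost frame (illustrative weights)
    return (delay_before_ms - delay_after_ms) - 10.0 * frames_lost
```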
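Claim 9's heat weights fuse per-region visual saliency, gaze direction and speaking state. A minimal sketch of such a fusion follows; the 0.4/0.3/0.3 weights are an assumption for illustration, not the patent's values.

```python
import numpy as np

def hotspot_weights(saliency, gaze_hits, speaking):
    """Fuse per-region saliency scores, gaze-hit counts and speaking flags
    into normalised heat weights; the highest-weight region is the
    interaction hot spot fed to the Q-learning scheduler as state."""
    s = np.asarray(saliency, float)
    g = np.asarray(gaze_hits, float)
    k = np.asarray(speaking, float)
    s = s / (s.max() + 1e-12)                        # normalise each cue to [0, 1]
    g = g / (g.max() + 1e-12)
    heat = 0.4 * s + 0.3 * g + 0.3 * k               # illustrative fusion weights
    return heat / heat.sum()                         # weights summing to 1
```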
Description
AI teaching-aid robot supporting multi-path media stream synchronization and interaction
Technical Field
The invention relates to the technical field of intelligent interaction robots, and in particular to an AI teaching-aid robot supporting multi-path media stream synchronization and interaction.
Background
With the progress of science and technology, the traditional teaching mode faces higher requirements for efficient, interactive and individualized education services. AI teaching-aid robots use artificial intelligence techniques such as deep learning and natural language processing to simulate a teacher's teaching behavior and provide intelligent tutoring for students. For example, CN121572277A discloses a multi-mode interaction control method, system and robot for a multifunctional teaching-aid robot, which can complete instruction input and information feedback through other interaction channels when one perception channel is affected by environmental noise, illumination changes, occlusion by persons or other factors, thereby forming a multi-modal coordination and redundancy mechanism in actual teaching scenes and improving the continuity and stability of the human-machine interaction process.
In the prior art, when the acquisition timestamps of the multi-path audio and video streams are disturbed in a classroom scene, the robot has difficulty accurately aligning the speaker's identity with the voice content, and abnormal interaction feedback occurs. On top of this, a fixed beam direction makes it difficult to track a moving speaker in real time, so voice pickup becomes intermittent and the robot's gaze separates from the speaker.
Disclosure of Invention
In order to overcome the defects in the background art, the invention provides an AI teaching-aid robot supporting multi-path media stream synchronization and interaction, which can effectively solve the problems described above.
The invention provides an AI teaching-aid robot supporting multi-path media stream synchronization and interaction, which comprises a robot body, wherein an AI teaching-aid interaction system is arranged in the robot body and comprises the following modules:
The streaming media synchronization module adopts a multi-modal timestamp alignment algorithm to deeply fuse video frame motion characteristics based on the optical flow method with audio time delay estimation based on phase cross-correlation, synchronously locks the speaker's identity and voice content at the acquisition source, achieves microsecond-level synchronization at the audio and video acquisition source, and eliminates the audio-visual misalignment caused by timestamp disorder.
The audio-visual joint tracking module, on the basis of accurate audio-visual synchronization, introduces an ant colony algorithm to dynamically optimize the beamforming direction of the microphone array and fuses a visual Kalman filter to predict the target motion trajectory, driving the robot's audio-visual sensing functions to perform joint active tracking and ensuring continuous voice pickup and smooth gaze following.
The priority learning scheduling module addresses resource contention when multiple media streams are concurrent by adopting the Q-learning algorithm from reinforcement learning to dynamically schedule the processing priorities of the data streams, obtaining an optimal scheduling strategy through trial-and-error learning and thereby minimizing resource competition through autonomous learning.
The interaction hot spot perception allocation module integrates a visual attention mechanism to analyze the visual focus in the scene, perceives the current interaction hot spot as the state input of Q-learning, and guides the scheduling algorithm to tilt resources toward the real interaction subject, guaranteeing low-delay response and frame-loss-free interaction in multi-user natural turn-taking scenes.
The control module receives the audio-visual data, physical state data and resource data of each module, constructs a complete context, makes interaction decisions and converts them into specific control instructions so as to obtain stable resource support, ensuring that the robot's feedback is timely, natural and coherent in complex multi-user turn-taking interactions.
Preferably, the streaming media synchronization module comprises a multi-modal timestamp alignment unit and an audio-visual identity synchronization locking unit. The multi-modal timestamp alignment unit deeply fuses video frame motion characteristics based on the optical flow method with an audio time delay estimation algorithm based on phase cross-correlation at the source of audio and video acquisition, dynamically calibrates timestamps from different hardware by analyzing the time difference between lip micro-movement and sound wave arrival, removes the timestamp disorder of the multi-path streams, and aligns each frame image in time with its corresponding voice fragment.