CN-122018768-A - Multi-mode interaction method and system for intelligent display terminal
Abstract
The invention discloses a multi-modal interaction method and system for an intelligent display terminal, relating to the technical field of intelligent display terminal interaction. A sensor array formed by a touch sensor, a voice collector, an image sensor and an eye tracker collects touch position data, voice instruction signals, continuous hand-action video sequences and user sight-line coordinate sequences in real time. The method performs targeted conversion and extraction on each kind of data, determines the interaction scene type and the corresponding modality priority rule by combining the current application state of the terminal with a scene classification model, synchronously inputs the processed data and the modality priority rule into a multi-modal fusion intention recognition model to generate a final interaction intention instruction, and controls the terminal to execute the operation and update the application state. The method improves the accuracy of interaction intention recognition and the continuity of interaction, and is suitable for interaction scenarios of intelligent display terminals.
Inventors
- LI JUNFENG
- XIONG YINBING
Assignees
- 甘肃工业职业技术大学
- 武汉超擎数智科技有限公司
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-04-14
Claims (10)
- 1. A multi-modal interaction method for an intelligent display terminal, characterized by comprising the following steps: collecting multi-dimensional user input data in real time through a sensor array arranged on the intelligent display terminal, wherein the sensor array comprises a touch sensor, a voice collector, an image sensor and an eye tracker, and the multi-dimensional user input data comprises touch position data, a voice instruction signal, a continuous hand-action video sequence and a user sight-line coordinate sequence; converting the touch position data into a screen coordinate set, converting the voice instruction signal into voice text information, extracting a dynamic gesture track from the continuous hand-action video sequence, and identifying a gazing area and a gazing residence time from the user sight-line coordinate sequence; acquiring current application state information of the intelligent display terminal, determining the interaction scene type of the current user based on the current application state information and a preset scene classification model, and querying the modality priority rule corresponding to the interaction scene type; synchronously inputting the screen coordinate set, the voice text information, the dynamic gesture track, the gazing area, the gazing residence time and the determined modality priority rule into a pre-trained multi-modal fusion intention recognition model to generate a final interaction intention instruction of the current user; and controlling the intelligent display terminal to execute the corresponding operation according to the final interaction intention instruction, and updating the current application state information.
- 2. The method of claim 1, wherein extracting the dynamic gesture track from the continuous hand-action video sequence comprises: detecting hand key points in each frame of the continuous hand-action video sequence, and locating the pixel position coordinates of a plurality of hand joint points in each frame; performing time-sequence tracking of the pixel position coordinates of the same hand joint point across consecutive frames to form a moving path for each hand joint point; calculating a motion vector sequence of the whole hand in three-dimensional space based on the moving paths of all hand joint points, wherein the motion vector sequence comprises a translational motion component and a rotational motion component; identifying, from the motion vector sequence, a coherent action segment that conforms to preset starting-point, ending-point and path characteristics, wherein the coherent action segment forms the dynamic gesture track; and appending timestamp information and a confidence score to the dynamic gesture track.
- 3. The method of claim 2, wherein identifying the gazing area and the gazing residence time from the user sight-line coordinate sequence comprises: denoising and smoothing the user sight-line coordinate sequence to eliminate coordinate jumps caused by physiological tremor and blinking during eye movement; calculating the movement amplitude of the processed sight-line coordinates within a preset time window, determining that the user's sight line has entered a gazing state when the movement amplitude is smaller than a standstill judgment threshold, and recording the gazing start time; calculating the average of all sight-line coordinates in the gazing state to obtain the gazing center point coordinates; mapping the gazing center point coordinates to the screen coordinate system of the intelligent display terminal, determining, in combination with the layout of interface elements on the screen, the interface function area in which the gazing center point coordinates fall, and marking that interface function area as the gazing area; and counting continuously from entry into the gazing state until the sight-line movement amplitude exceeds the standstill judgment threshold, recording the end time, and calculating the duration from the start time to the end time to obtain the gazing residence time.
- 4. The multi-modal interaction method of the intelligent display terminal according to claim 3, wherein determining the interaction scene type of the current user based on the current application state information and a preset scene classification model, and querying the modality priority rule corresponding to the interaction scene type, comprises: the current application state information comprising the identifier of the application program running in the foreground, the current operating interface layout of the application program, and the set of operation instructions allowed by the application program; extracting a feature vector from the current application state information, wherein the feature vector comprises the application program identification code, the current operating interface layout code, active window position information, the distribution of interaction hot areas of current interface elements, and the system background noise level; inputting the feature vector into the scene classification model, the scene classification model outputting a class probability distribution over user interaction scenes, and selecting the class with the highest probability as the interaction scene type, wherein the interaction scene types comprise a conference presentation scene, an audio-visual entertainment scene, a document editing scene, a system settings scene and a casual browsing scene; and querying a modality priority rule base pre-bound to the interaction scene type to acquire the corresponding modality priority rule, wherein the modality priority rule defines the confidence weight coefficients and activation conditions corresponding to different input modalities under the corresponding interaction scene type.
- 5. The multi-modal interaction method of the intelligent display terminal according to claim 4, wherein the modality priority rule defining the confidence weight coefficients and activation conditions corresponding to different input modalities under the corresponding interaction scene type comprises: for a conference presentation scene, the modality priority rule assigning the highest confidence weight coefficients to the voice instruction signal modality and the dynamic gesture track modality, setting the confidence weight coefficient of the touch position data modality to the lowest, and setting the activation condition of the eye-tracking data to auxiliary selection only when a preset wake-up gesture is detected, the eye-tracking data being the gazing area and gazing residence time obtained by recognition processing of the user sight-line coordinate sequence acquired by the eye tracker; for a document editing scene, the modality priority rule assigning the highest confidence weight coefficients to the touch position data modality and the eye-tracking data modality, and setting the activation condition of the voice instruction signal modality to recognition only when a voice command is issued within a specific functional area of the document editing application; for an audio-visual entertainment scene, the modality priority rule assigning the highest confidence weight coefficients to the dynamic gesture track modality and the voice instruction signal modality, and setting the activation condition of the touch position data modality to acceptance only within a specific control-bar area of the screen; and transmitting the confidence weight coefficients and activation conditions queried from the modality priority rule base to the multi-modal fusion intention recognition model.
- 6. The method of claim 5, wherein synchronously inputting the screen coordinate set, the voice text information, the dynamic gesture track, the gazing area, the gazing residence time and the determined modality priority rule into the pre-trained multi-modal fusion intention recognition model to generate the final interaction intention instruction of the current user comprises: filtering the screen coordinate set, the voice text information, the dynamic gesture track, the gazing area and the gazing residence time according to the activation conditions in the modality priority rule, and shielding input data that do not meet the activation conditions; for the input data meeting the activation conditions, assigning a corresponding fusion weight to the input data of each modality according to the confidence weight coefficients in the modality priority rule; inputting the multi-modal input data assigned the fusion weights into a feature coding layer of the multi-modal fusion intention recognition model, the feature coding layer encoding the input data of each modality into feature vectors of a uniform dimension; performing weighted concatenation of the uniform-dimension feature vectors from the different modalities according to their corresponding fusion weights to form a multi-modal fusion feature vector; and inputting the multi-modal fusion feature vector into an intention decision layer of the multi-modal fusion intention recognition model, the intention decision layer outputting an intention recognition result for the current moment, the intention recognition result being mapped into the final interaction intention instruction executable by the intelligent display terminal.
- 7. The multi-modal interaction method of the intelligent display terminal according to claim 6, wherein assigning, for the input data meeting the activation conditions, a corresponding fusion weight to the input data of each modality according to the confidence weight coefficients in the modality priority rule comprises: reading a preset reference confidence weight coefficient for each input modality from the modality priority rule; acquiring an environmental interference evaluation value of the current environment, the environmental interference evaluation value being calculated based on the real-time ambient noise level and illumination intensity acquired by the sensor array; dynamically adjusting the reference confidence weight coefficients according to the environmental interference evaluation value, lowering the reference confidence weight coefficient of each input modality affected by the current environmental interference to form adjusted real-time confidence weight coefficients; and normalizing the real-time confidence weight coefficients so that the sum of the weight coefficients of all activated modalities is a constant value, the normalized coefficients being used as the fusion weights for feature fusion.
- 8. The multi-modal interaction method of the intelligent display terminal according to claim 7, wherein inputting the multi-modal input data assigned the fusion weights into the feature coding layer of the multi-modal fusion intention recognition model, the feature coding layer encoding the input data of each modality into feature vectors of a uniform dimension, comprises: processing the screen coordinate set of the touch position data modality with a position sequence encoder, extracting the spatial distribution pattern of the touch points and click time interval features, and outputting a first feature vector; processing the voice text information of the voice instruction signal modality with a natural language semantic encoder, extracting the semantic vector and instruction keywords of the text, and outputting a second feature vector; processing the dynamic gesture track of the dynamic gesture track modality with a motion sequence encoder, extracting the kinematic features and track shape features of the gesture, and outputting a third feature vector; processing the gazing area and the gazing residence time of the eye-tracking data modality with a visual attention encoder, extracting the semantic importance of the gazing area and interest-level features based on residence time, and outputting a fourth feature vector; wherein the first, second, third and fourth feature vectors have the same dimension.
- 9. The multi-modal interaction method of the intelligent display terminal according to claim 8, further comprising, after generating the final interaction intention instruction, performing online optimization of the multi-modal fusion intention recognition model: after the final interaction intention instruction has been executed, obtaining user feedback information, the feedback information comprising whether the user performs a cancel operation or repeats an operation with the same intention through another input modality within a preset time; if the feedback information indicates a cancel operation or a repeated operation, judging the intention recognition result to be inaccurate, and marking the multi-modal input data, the assigned fusion weights, the current application state information and the intention recognition result as a training sample to be corrected; periodically collecting a batch of training samples to be corrected, comparing them with the correct intention instruction labels, and calculating a model prediction loss; and fine-tuning the parameters of the multi-modal fusion intention recognition model by back-propagation using the model prediction loss, updating the weight parameters of the feature coding layer and the intention decision layer, thereby achieving online self-optimization of the model.
- 10. A multi-modal interaction system of an intelligent display terminal, comprising a memory, a processor and a computer program stored in the memory and runnable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the multi-modal interaction method of an intelligent display terminal as claimed in any one of claims 1 to 9.
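The sketches below are editorial illustrations, in Python, of one plausible way the processing recited in claims 1 through 9 could be realized. They are not the patented implementation; all function names, parameter names and numeric values are assumptions introduced for clarity. The first sketch shows only the control flow of claim 1, with every processing stage passed in as a callable so that no concrete sensor driver, speech recognizer or model is implied.

```python
# A minimal orchestration sketch of the interaction loop of claim 1 (assumption:
# each stage is supplied by the caller; nothing here is the patent's actual API).
def interaction_step(read_sensors, convert, classify_scene, fuse_weights,
                     recognize_intent, execute, app_state):
    raw = read_sensors()                                        # step 1: real-time acquisition
    inputs = {m: convert[m](data) for m, data in raw.items()}   # step 2: targeted conversion/extraction
    scene, rule = classify_scene(app_state)                     # step 3: scene type + priority rule
    weights = fuse_weights(rule, inputs, app_state)             # activation filtering + weighting
    intent = recognize_intent(inputs, weights)                  # step 4: multi-modal fusion model
    new_state = execute(intent, app_state)                      # step 5: execute and update state
    return intent, new_state
```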
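The following sketch illustrates the gesture-track extraction of claim 2 under simple assumptions: `detect_hand_keypoints` is a hypothetical stand-in for any per-frame hand keypoint detector, the rotation component is approximated from a single wrist-to-fingertip direction, and the coherent segment is taken as the span of frames whose speed exceeds a threshold.

```python
import numpy as np

def extract_gesture_track(frames, timestamps, detect_hand_keypoints, motion_threshold=2.0):
    # Per-frame keypoint detection: (num_joints, 2) pixel coordinates per frame.
    joints = np.stack([detect_hand_keypoints(f) for f in frames])   # (T, J, 2)

    # Whole-hand motion vectors: translation = centroid displacement per frame;
    # rotation approximated from the wrist-to-index direction (assumed joint indices 0 and 8).
    centroids = joints.mean(axis=1)                                  # (T, 2)
    translation = np.diff(centroids, axis=0)                         # (T-1, 2)
    direction = joints[:, 8] - joints[:, 0]
    angles = np.unwrap(np.arctan2(direction[:, 1], direction[:, 0]))
    rotation = np.diff(angles)                                       # (T-1,)

    # Coherent action segment: frames whose translational speed exceeds the threshold.
    speed = np.linalg.norm(translation, axis=1)
    moving = speed > motion_threshold
    if not moving.any():
        return None
    start = int(np.argmax(moving))
    end = len(moving) - int(np.argmax(moving[::-1]))

    # Attach timestamps and a crude confidence score (fraction of moving frames in the segment).
    return {
        "trajectory": centroids[start:end + 1],
        "translation": translation[start:end],
        "rotation": rotation[start:end],
        "t_start": timestamps[start],
        "t_end": timestamps[end],
        "confidence": float(moving[start:end].mean()),
    }
```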
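A minimal sketch of the fixation detection of claim 3, assuming a moving-average smoother, a per-sample movement-amplitude test against the standstill threshold, and axis-aligned boxes for the interface function areas; thresholds and region names are illustrative only.

```python
import numpy as np

def detect_fixation(gaze_xy, timestamps, still_threshold=30.0, window=5):
    """gaze_xy: (N, 2) raw sight-line coordinates in screen pixels."""
    # Denoise / smooth: moving average suppresses tremor- and blink-induced jumps.
    kernel = np.ones(window) / window
    smooth = np.column_stack([np.convolve(gaze_xy[:, k], kernel, mode="same") for k in range(2)])

    # Movement amplitude between successive samples; below threshold => gazing state.
    amplitude = np.linalg.norm(np.diff(smooth, axis=0), axis=1)
    fixating = amplitude < still_threshold
    if not fixating.any():
        return None
    start = int(np.argmax(fixating))
    end = start
    while end + 1 < len(fixating) and fixating[end + 1]:
        end += 1

    center = smooth[start:end + 2].mean(axis=0)        # gazing center point coordinates
    dwell = timestamps[end + 1] - timestamps[start]    # gazing residence time
    return {"center": center, "dwell_time": dwell}

def map_to_region(center, regions):
    # `regions` maps an interface function area name to its (x0, y0, x1, y1) screen box.
    for name, (x0, y0, x1, y1) in regions.items():
        if x0 <= center[0] <= x1 and y0 <= center[1] <= y1:
            return name
    return None
```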
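For claims 4 and 5, the sketch below assumes the scene classification model is any classifier exposing a scikit-learn-style `predict_proba`, and shows a toy rule base pairing each scene type with confidence weight coefficients and activation conditions; every number and condition name is invented for illustration.

```python
MODALITY_PRIORITY_RULES = {
    "conference_presentation": {
        "weights": {"voice": 0.40, "gesture": 0.40, "touch": 0.05, "gaze": 0.15},
        "activation": {"gaze": "wake_gesture_detected"},
    },
    "document_editing": {
        "weights": {"touch": 0.40, "gaze": 0.35, "voice": 0.15, "gesture": 0.10},
        "activation": {"voice": "voice_enabled_area_focused"},
    },
    "audio_visual_entertainment": {
        "weights": {"gesture": 0.40, "voice": 0.40, "touch": 0.10, "gaze": 0.10},
        "activation": {"touch": "touch_in_control_bar"},
    },
}

def classify_scene(app_state, scene_model, classes):
    # Feature vector of claim 4: app id code, layout code, active-window position,
    # interaction hot-area distribution, background noise level.
    features = [
        app_state["app_id_code"],
        app_state["layout_code"],
        *app_state["active_window_xywh"],
        *app_state["hot_area_histogram"],
        app_state["noise_level"],
    ]
    probs = scene_model.predict_proba([features])[0]
    scene = classes[int(probs.argmax())]            # class with the highest probability
    return scene, MODALITY_PRIORITY_RULES.get(scene)
```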
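The weighting step of claims 6 and 7 can be pictured as below: modalities failing their activation condition are shielded, the reference confidence coefficients of the remaining modalities are derated by an environmental-interference score, and the result is renormalized so the active weights sum to a constant. The linear derating formula is an assumption, not necessarily the patented adjustment.

```python
def fuse_weights(base_weights, activation, conditions_met, interference, total=1.0):
    """base_weights: {modality: reference confidence coefficient}
    activation: {modality: condition name} from the modality priority rule
    conditions_met: {condition name: bool} evaluated from current input/state
    interference: {modality: value in [0, 1]}, 0 = clean environment, 1 = heavily disturbed."""
    active = {}
    for modality, w in base_weights.items():
        cond = activation.get(modality)
        if cond is not None and not conditions_met.get(cond, False):
            continue                                  # shield inputs failing the activation condition
        active[modality] = w * (1.0 - interference.get(modality, 0.0))

    norm = sum(active.values())
    if norm == 0:
        return {}
    return {m: total * w / norm for m, w in active.items()}   # normalized fusion weights
```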
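For the model structure of claims 6 and 8, the sketch below uses small PyTorch modules as stand-ins for the position-sequence, natural-language-semantic, motion-sequence and visual-attention encoders; the shared dimension, layer choices and zero-vector masking of shielded modalities are assumptions made for the example.

```python
import torch
import torch.nn as nn

class FusionIntentModel(nn.Module):
    def __init__(self, in_dims, d=128, num_intents=20):
        super().__init__()
        self.d = d
        # One encoder per modality, all projecting to the same dimension d (claim 8).
        self.encoders = nn.ModuleDict(
            {m: nn.Sequential(nn.Linear(n, d), nn.ReLU(), nn.Linear(d, d))
             for m, n in in_dims.items()})
        # Intention decision layer over the weighted concatenation (claim 6).
        self.decision = nn.Linear(d * len(in_dims), num_intents)

    def forward(self, inputs, fusion_weights):
        batch = next(iter(inputs.values())).shape[0]
        parts = []
        for m, enc in self.encoders.items():
            if m in inputs and fusion_weights.get(m, 0.0) > 0.0:
                parts.append(fusion_weights[m] * enc(inputs[m]))   # weighted modality feature
            else:
                parts.append(torch.zeros(batch, self.d))           # shielded modality contributes zeros
        fused = torch.cat(parts, dim=-1)                           # multi-modal fusion feature vector
        return self.decision(fused)                                # intent logits for the current moment
```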
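Finally, a sketch of the online self-optimization loop of claim 9 under the assumption that a cancel followed by a re-issued intent through another modality supplies the corrected label; the buffer size, optimizer, loss and one-sample batching are illustrative choices, and `FusionIntentModel` refers to the sketch above.

```python
import torch
import torch.nn.functional as F

class OnlineCorrector:
    def __init__(self, model, lr=1e-4, batch_size=32):
        self.model = model
        self.opt = torch.optim.Adam(model.parameters(), lr=lr)
        self.batch_size = batch_size
        self.buffer = []                     # training samples to be corrected

    def record_feedback(self, inputs, weights, feedback):
        # feedback["repeated_intent"]: intent id the user re-issued through another
        # modality after cancelling; treated here as the corrected label.
        corrected = feedback.get("repeated_intent")
        if feedback.get("cancelled") and corrected is not None:
            self.buffer.append((inputs, weights, corrected))

    def maybe_finetune(self):
        if len(self.buffer) < self.batch_size:
            return
        self.opt.zero_grad()
        losses = []
        for inputs, weights, target in self.buffer:      # each sample is a batch of one
            logits = self.model(inputs, weights)
            losses.append(F.cross_entropy(logits, torch.tensor([target])))
        loss = torch.stack(losses).mean()                # model prediction loss
        loss.backward()                                  # back-propagation fine-tuning
        self.opt.step()                                  # update encoder and decision-layer weights
        self.buffer.clear()
```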
Description
Multi-mode interaction method and system for intelligent display terminal
Technical Field
The invention belongs to the technical field of intelligent display terminal interaction, and particularly relates to a multi-modal interaction method and system for an intelligent display terminal.
Background
Current multi-modal interaction technology for intelligent display terminals mostly adopts a single input modality or only a few input modalities; after user input data are collected through the corresponding sensors, simple fusion processing is applied directly to identify the user's interaction intention. A common technical scheme uses a touch sensor to collect touch data and a voice collector to collect voice signals; some schemes add an image sensor to capture hand actions. The data are simply superimposed and fed into a recognition model, an interaction instruction is generated, and the terminal is controlled to execute the corresponding operation. In the prior art, the acquired multi-dimensional input data lack targeted conversion and extraction processing; only basic acquisition and simple integration are carried out, so the accuracy and degree of structuring of the input data are insufficient, which affects the accuracy of subsequent interaction intention recognition. Meanwhile, when identifying the interaction intention, the prior art neither distinguishes scenes based on the current application state of the terminal nor sets corresponding modality priority rules, and simply fuses the various input data, so the intention recognition result matches the user's actual needs poorly. In addition, after the terminal is controlled to execute an operation, the prior art does not synchronously update the terminal's current application state information, so the subsequent interaction process cannot adapt based on the latest state, leading to interaction breaks and incoherent responses.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems existing in the prior art. To this end, the invention provides a multi-modal interaction method for an intelligent display terminal, which comprises the following steps: collecting multi-dimensional user input data in real time through a sensor array arranged on the intelligent display terminal, wherein the sensor array comprises a touch sensor, a voice collector, an image sensor and an eye tracker, and the multi-dimensional user input data comprises touch position data, a voice instruction signal, a continuous hand-action video sequence and a user sight-line coordinate sequence; converting the touch position data into a screen coordinate set, converting the voice instruction signal into voice text information, extracting a dynamic gesture track from the continuous hand-action video sequence, and identifying a gazing area and a gazing residence time from the user sight-line coordinate sequence; acquiring current application state information of the intelligent display terminal, determining the interaction scene type of the current user based on the current application state information and a preset scene classification model, and querying the modality priority rule corresponding to the interaction scene type; synchronously inputting the screen coordinate set, the voice text information, the dynamic gesture track, the gazing area, the gazing residence time and the determined modality priority rule into a pre-trained multi-modal fusion intention recognition model to generate a final interaction intention instruction of the current user; and controlling the intelligent display terminal to execute the corresponding operation according to the final interaction intention instruction, and updating the current application state information. Further, extracting the dynamic gesture track from the continuous hand-action video sequence includes: detecting hand key points in each frame of the continuous hand-action video sequence, and locating the pixel position coordinates of a plurality of hand joint points in each frame; performing time-sequence tracking of the pixel position coordinates of the same hand joint point across consecutive frames to form a moving path for each hand joint point; calculating a motion vector sequence of the whole hand in three-dimensional space based on the moving paths of all hand joint points, wherein the motion vector sequence comprises a translational motion component and a rotational motion component; identifying, from the motion vector sequence, a coherent action segment that conforms to preset starting-point, ending-point and path characteristics, wherein the coherent action segment forms the dynamic gesture track; and appending timestamp information and a confidence score to the dynamic gesture track. Further, identifying a gaze region and a gaze res