CN-121980388-A - Multi-modal fusion-based robot intention recognition and understanding method

CN 121980388 A

Abstract

The invention relates to the technical field of human-machine interaction, and in particular discloses a multi-modal fusion-based robot intention recognition and understanding method comprising the following steps: S1, acquiring multi-modal input data; S2, performing hierarchical feature extraction on the multi-modal input data; S3, performing adaptive gated fusion on the extracted feature vectors; S4, performing hierarchical semantic reasoning on the fused features and outputting an intention probability distribution; S5, outputting the intention category based on the intention probability distribution and activating the corresponding robot execution module according to the intention category. The method addresses the problems of single-modal dependence and coarse semantic understanding through multi-modal fusion and hierarchical reasoning; lightweight models for voice and visual feature extraction reduce computational complexity, making the method suitable for edge devices; adaptive gated fusion improves feature complementarity; hierarchical semantic reasoning deepens understanding from fine to coarse granularity; and intention recognition accuracy is markedly improved over single-modal approaches.

Inventors

  • HUANG WANZHONG
  • YANG ZONGFENG
  • WANG XIANSHI
  • HUANG CHAOWEI
  • LUO MINGYANG
  • ZHOU ZIHONG

Assignees

  • 昆仑之数(成都)科技有限公司

Dates

Publication Date
2026-05-05
Application Date
2026-01-28

Claims (10)

  1. A multi-modal fusion-based robot intention recognition and understanding method, characterized by comprising the following steps: S1, acquiring multi-modal input data; S2, performing hierarchical feature extraction on the multi-modal input data; S3, performing adaptive gated fusion on the extracted feature vectors; S4, performing hierarchical semantic reasoning on the fused features and outputting an intention probability distribution; S5, outputting the intention category based on the intention probability distribution, and activating the corresponding robot execution module according to the intention category.
  2. The multi-modal fusion-based robot intention recognition and understanding method according to claim 1, wherein in step S1 the multi-modal input data includes text data, voice data, and visual image data.
  3. The multi-modal fusion-based robot intention recognition and understanding method according to claim 2, wherein in step S2 hierarchical feature extraction of the voice data includes: converting the voice data into text data by automatic speech recognition. Hierarchical feature extraction of the text data includes: standardizing the text data, including removing noise and unifying formats; and encoding the standardized text with a BERT-6L encoder to generate a text feature vector. Hierarchical feature extraction of the visual image data includes: preprocessing the visual image data, including resizing and normalizing the image; and extracting visual features with a MobileViT-XS model to generate a visual feature vector.
  4. The multi-modal fusion-based robot intention recognition and understanding method according to claim 3, wherein in step S3 performing adaptive gated fusion on the extracted feature vectors includes: aligning the text feature vector and the visual feature vector into a unified feature space by the linear transformations $h_t' = W_t h_t + b_t$ and $h_v' = W_v h_v + b_v$, where $h_t$ is the text feature vector, $h_t'$ is the aligned text feature vector, $W_t$ is a weight matrix, $b_t$ is a bias vector, $h_v$ is the visual feature vector, $h_v'$ is the aligned visual feature vector, $W_v$ is a weight matrix, and $b_v$ is a bias vector; and inputting the aligned feature vectors into a Transformer encoder to model the interaction relations between the modalities.
  5. The multi-modal fusion-based robot intention recognition and understanding method according to claim 4, wherein in step S3 the weight calculation of the adaptive gated fusion adopts the dynamic weight formula $g = \sigma(W_g [h_t'; h_v'] + b_g)$, where $g$ is the fusion weight, $\sigma$ is the sigmoid function, $W_g$ is a weight matrix, $b_g$ is a bias vector, and $[h_t'; h_v']$ is the concatenation of the aligned text feature vector and the aligned visual feature vector; the weight adjusts the contribution of each modality to the fusion.
  6. The multi-modal fusion-based robot intention recognition and understanding method according to claim 5, wherein in step S4 hierarchical semantic reasoning is performed on the fused features and an intention probability distribution is output, including: S41, word-level reasoning: extracting keyword features with an attention mechanism to generate a word-level feature vector; calculating the intention probability distribution from the word-level feature vector as $P_w = \mathrm{softmax}(W_w f_w + b_w)$, where $f_w$ is the word-level feature vector, $W_w$ and $b_w$ are the classification weight and bias, and $P_w$ is the word-level intention probability distribution; checking whether the maximum probability of $P_w$ exceeds the confidence threshold $\theta$: if so, outputting the current distribution as a simple intention and proceeding to step S5, otherwise proceeding to step S42 for phrase-level reasoning; S42, phrase-level reasoning: performing sequence modeling on the word-level features with a gated recurrent unit (GRU) network to generate a phrase-level feature vector; calculating the intention probability distribution as $P_p = \mathrm{softmax}(W_p f_p + b_p)$, where $f_p$ is the phrase-level feature vector, $W_p$ and $b_p$ are the classification weight and bias, and $P_p$ is the phrase-level intention probability distribution; checking whether the maximum probability of $P_p$ exceeds the confidence threshold $\theta$: if so, outputting the current distribution as a medium intention and proceeding to step S5, otherwise proceeding to step S43 for sentence-level reasoning; S43, sentence-level reasoning: performing context modeling on the phrase-level features with a bidirectional GRU network to generate a sentence-level feature vector; calculating the intention probability distribution as $P_s = \mathrm{softmax}(W_s f_s + b_s)$, where $f_s$ is the sentence-level feature vector, $W_s$ and $b_s$ are the classification weight and bias, and $P_s$ is the sentence-level intention probability distribution; checking whether the maximum probability of $P_s$ exceeds the confidence threshold $\theta$: if so, outputting the current distribution as a complex intention and jumping to step S5, otherwise proceeding to step S44 for intention-level reasoning; S44, intention-level reasoning: performing deep semantic understanding on the sentence-level features with a fully connected layer to generate an intention-level feature vector; calculating the intention probability distribution as $P_i = \mathrm{softmax}(W_i f_i + b_i)$, where $f_i$ is the intention-level feature vector, $W_i$ and $b_i$ are the classification weight and bias, and $P_i$ is the intention-level probability distribution, output as the final intention.
  7. The multi-modal fusion-based robot intention recognition and understanding method according to claim 6, wherein in step S41 the attention mechanism specifically includes: calculating attention weights with a multi-head attention mechanism as $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(QK^\top/\sqrt{d_k}\right)V$, where $Q$ is the query vector, derived from the fused features, $K$ is the key vector, derived from the lexical embeddings, and $\sqrt{d_k}$ is a dimensional scaling factor; the number of attention heads is 8.
  8. The multi-modal fusion-based robot intention recognition and understanding method according to claim 7, wherein in step S42 the gated recurrent unit network specifically includes: the input of the GRU network is the word-level feature sequence, and sequence dependence is modeled by an update gate and a reset gate, with update gate formula $z_t = \sigma(W_z x_t + U_z h_{t-1})$ and reset gate formula $r_t = \sigma(W_r x_t + U_r h_{t-1})$, where $h_{t-1}$ is the hidden state at the previous moment, $x_t$ is the current input, and $W_z$, $U_z$, $W_r$, $U_r$ are parameters (a sketch of this cell follows the claims).
  9. The multi-modal fusion-based robot intention recognition and understanding method according to claim 8, wherein in step S4 the confidence threshold $\theta$ adopts an adaptive adjustment mechanism, specifically comprising: the confidence threshold $\theta$ is dynamically updated according to the historical recognition accuracy as $\theta_t = \theta_{t-1} + \eta\,(a_t - \theta_{t-1})$, where $\theta_t$ is the current confidence threshold, $\theta_{t-1}$ is the threshold at the previous moment, $\eta$ is the learning rate, and $a_t$ is the recent recognition accuracy (a sketch of this update follows the claims).
  10. The multi-modal fusion-based robot intention recognition and understanding method according to claim 9, wherein in step S4 the calculation of the intention probability distribution introduces multi-task joint training, specifically comprising: the loss function is $L = \lambda_1 L_{\text{int}} + \lambda_2 L_{\text{aux}}$, where $L_{\text{int}}$ is the intention classification loss, using cross-entropy loss, $L_{\text{aux}}$ is the auxiliary task loss, and $\lambda_1$ and $\lambda_2$ are weight coefficients (a sketch of this loss follows the claims).
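
The following sketches illustrate claims 8-10 in Python; they are readings of the claim formulas, not implementations fixed by the patent. First, a minimal GRU cell matching the claim-8 gate formulas $z_t = \sigma(W_z x_t + U_z h_{t-1})$ and $r_t = \sigma(W_r x_t + U_r h_{t-1})$, assuming PyTorch; the candidate-state and interpolation steps are the standard GRU completion, added here because the claim spells out only the two gates.

```python
import torch
import torch.nn as nn


class GRUCellSketch(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.Wz = nn.Linear(input_dim, hidden_dim, bias=False)   # W_z
        self.Uz = nn.Linear(hidden_dim, hidden_dim, bias=False)  # U_z
        self.Wr = nn.Linear(input_dim, hidden_dim, bias=False)   # W_r
        self.Ur = nn.Linear(hidden_dim, hidden_dim, bias=False)  # U_r
        self.Wh = nn.Linear(input_dim, hidden_dim, bias=False)   # candidate-state params
        self.Uh = nn.Linear(hidden_dim, hidden_dim, bias=False)  # (standard GRU, assumed)

    def forward(self, x_t: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        z = torch.sigmoid(self.Wz(x_t) + self.Uz(h_prev))        # update gate z_t
        r = torch.sigmoid(self.Wr(x_t) + self.Ur(h_prev))        # reset gate r_t
        h_cand = torch.tanh(self.Wh(x_t) + self.Uh(r * h_prev))  # candidate state
        return (1 - z) * h_prev + z * h_cand                     # next hidden state
```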
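Second, the claim-9 adaptive confidence threshold. The exact update rule is not legible in the source text; the sketch assumes an exponential-moving-average step that pulls the threshold toward the recent recognition accuracy.

```python
def update_threshold(theta_prev: float, recent_accuracy: float,
                     learning_rate: float = 0.05) -> float:
    """theta_t = theta_{t-1} + eta * (a_t - theta_{t-1}) -- assumed EMA form."""
    return theta_prev + learning_rate * (recent_accuracy - theta_prev)


# Usage: after each evaluation window, fold the measured accuracy back in.
# theta = update_threshold(theta, accuracy)
```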
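Third, the claim-10 multi-task joint loss $L = \lambda_1 L_{\text{int}} + \lambda_2 L_{\text{aux}}$, with cross-entropy for the intention term as the claim states. The choice of auxiliary task (here a second classification head) and the weight values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def joint_loss(intent_logits: torch.Tensor, intent_labels: torch.Tensor,
               aux_logits: torch.Tensor, aux_labels: torch.Tensor,
               lam1: float = 1.0, lam2: float = 0.3) -> torch.Tensor:
    l_int = F.cross_entropy(intent_logits, intent_labels)  # intention classification loss
    l_aux = F.cross_entropy(aux_logits, aux_labels)        # auxiliary task loss
    return lam1 * l_int + lam2 * l_aux
```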

Description

Technical Field

The invention relates to the technical field of human-machine interaction, and in particular to a multi-modal fusion-based robot intention recognition and understanding method.

Background

Human-robot interaction technology, an important interdisciplinary field of artificial intelligence and robotics, has in recent years been widely applied in scenarios such as service robots, smart homes, and industrial automation. Intention recognition and understanding are the core links of human-machine interaction and directly determine the accuracy and naturalness of the robot's responses to user instructions. The prior art mainly relies on single-modal or multi-modal fusion methods such as speech recognition and visual analysis, attempting to infer interaction intention by analyzing the user's speech, gestures, expressions, and other information. Current mainstream human-robot interaction intention recognition has the following technical defects:

  1. Strong single-modal dependence: existing systems mostly adopt a single modality such as speech recognition or visual recognition, are poorly robust in complex environments, and their recognition accuracy is severely degraded by noise interference.
  2. Coarse intention understanding: they lack deep semantic understanding of the user's true intention and cannot effectively distinguish operation instructions, interaction requests, and dialogue intentions.
  3. A contradiction between real-time performance and accuracy: traditional schemes usually sacrifice response speed in pursuit of high precision and struggle to meet real-time interaction requirements.

Disclosure of the Invention

The invention provides a multi-modal fusion-based robot intention recognition and understanding method, aiming to solve the prior-art problems of strong single-modal dependence, coarse intention understanding, and the contradiction between real-time performance and accuracy. The technical scheme adopted by the invention is as follows. A multi-modal fusion-based robot intention recognition and understanding method comprises the following steps: S1, acquiring multi-modal input data; S2, performing hierarchical feature extraction on the multi-modal input data; S3, performing adaptive gated fusion on the extracted feature vectors; S4, performing hierarchical semantic reasoning on the fused features and outputting an intention probability distribution; S5, outputting the intention category based on the intention probability distribution, and activating the corresponding robot execution module according to the intention category. Preferably, in step S1, the multi-modal input data includes text data, voice data, and visual image data.
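
By way of illustration, the following is a minimal sketch of the S1-S5 pipeline in Python, assuming PyTorch. All names here (MultimodalInput, recognize_intent, the extractor/fusion/reasoner modules, and the intent-to-action table) are illustrative assumptions rather than terms from the patent.

```python
from dataclasses import dataclass
from typing import Callable, Dict

import torch


@dataclass
class MultimodalInput:
    """S1: acquired multi-modal input data."""
    text: str            # text input, or an ASR transcript of the voice input
    image: torch.Tensor  # preprocessed visual image, e.g. shape (3, 256, 256)


def recognize_intent(x: MultimodalInput,
                     extractor: Callable, fusion: Callable, reasoner: Callable,
                     intent_actions: Dict[int, Callable[[], None]]) -> int:
    h_text, h_vis = extractor(x.text, x.image)  # S2: hierarchical feature extraction
    fused = fusion(h_text, h_vis)               # S3: adaptive gated fusion
    probs = reasoner(fused)                     # S4: hierarchical semantic reasoning
    intent = int(probs.argmax())                # S5: intent category from the distribution
    intent_actions[intent]()                    # activate the matching execution module
    return intent
```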
Preferably, in step S2, hierarchical feature extraction of the voice data includes: converting the voice data into text data by automatic speech recognition. Hierarchical feature extraction of the text data includes: standardizing the text data, including removing noise and unifying formats; and encoding the standardized text with a BERT-6L encoder to generate a text feature vector. Hierarchical feature extraction of the visual image data includes: preprocessing the visual image data, including resizing and normalizing the image; and extracting visual features with a MobileViT-XS model to generate a visual feature vector.

Preferably, in step S3, performing adaptive gated fusion on the extracted feature vectors includes: aligning the text feature vector and the visual feature vector into a unified feature space by the linear transformations $h_t' = W_t h_t + b_t$ and $h_v' = W_v h_v + b_v$, where $h_t$ is the text feature vector, $h_t'$ is the aligned text feature vector, $W_t$ is a weight matrix, $b_t$ is a bias vector, $h_v$ is the visual feature vector, $h_v'$ is the aligned visual feature vector, $W_v$ is a weight matrix, and $b_v$ is a bias vector; and inputting the aligned feature vectors into a Transformer encoder to model the interaction relations between the modalities.

Preferably, in step S3, the weight calculation of the adaptive gated fusion adopts the dynamic weight formula $g = \sigma(W_g [h_t'; h_v'] + b_g)$, where $g$ is the fusion weight, $\sigma$ is the sigmoid function, $W_g$ is a weight matrix, $b_g$ is a bias vector, and $[h_t'; h_v']$ is the concatenation of the aligned text feature vector and the aligned visual feature vector; the weight adjusts the contribution of each modality to the fusion.

Preferably, in step S4, hierarchical semantic reasoning is performed on the fused features and an intention probability distribution is output.
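
As a concrete reading of steps S2-S3, the sketch below pairs a 6-layer text encoder with a MobileViT-XS visual encoder, aligns both into a shared space, and fuses them with a sigmoid gate followed by a small Transformer encoder. The checkpoint names (bert-base-uncased truncated to 6 layers as a stand-in for "BERT-6L", timm's mobilevit_xs), the 256-dimensional shared space, and the residual combination are assumptions for illustration, not details fixed by the patent.

```python
import timm
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class GatedFusion(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        self.bert.encoder.layer = self.bert.encoder.layer[:6]    # "BERT-6L" approximation
        self.vit = timm.create_model("mobilevit_xs", pretrained=True, num_classes=0)
        self.align_t = nn.Linear(768, dim)                       # h_t' = W_t h_t + b_t
        self.align_v = nn.Linear(self.vit.num_features, dim)     # h_v' = W_v h_v + b_v
        self.gate = nn.Linear(2 * dim, 1)                        # g = sigma(W_g [h_t'; h_v'] + b_g)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text: str, image: torch.Tensor) -> torch.Tensor:
        tokens = self.tokenizer(text, return_tensors="pt")
        h_t = self.bert(**tokens).last_hidden_state[:, 0]        # [CLS] text feature
        h_v = self.vit(image.unsqueeze(0))                       # pooled visual feature
        ht, hv = self.align_t(h_t), self.align_v(h_v)            # unified feature space
        g = torch.sigmoid(self.gate(torch.cat([ht, hv], dim=-1)))  # fusion weight
        fused = g * ht + (1 - g) * hv                            # gated modality combination
        pair = torch.stack([ht, hv], dim=1)                      # two-token modality pair
        return fused + self.encoder(pair).mean(dim=1)            # add cross-modal interaction
```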
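The hierarchical reasoning of step S4 (claim 6) can likewise be sketched as an early-exit cascade: each level classifies with $\mathrm{softmax}(Wf + b)$ and returns as soon as the maximum probability clears the confidence threshold $\theta$. The sequence shape, the mean-pooling steps, the layer sizes, and processing one utterance at a time are illustrative assumptions.

```python
import torch
import torch.nn as nn


class HierarchicalReasoner(nn.Module):
    def __init__(self, dim: int = 256, n_intents: int = 10, theta: float = 0.8):
        super().__init__()
        self.theta = theta  # confidence threshold; could be updated per claim 9
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)  # claim 7: 8 heads
        self.gru = nn.GRU(dim, dim, batch_first=True)                          # S42: phrase level
        self.bigru = nn.GRU(dim, dim // 2, bidirectional=True, batch_first=True)  # S43: sentence level
        self.fc = nn.Linear(dim, dim)                                          # S44: intention level
        self.heads = nn.ModuleList([nn.Linear(dim, n_intents) for _ in range(4)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: fused feature sequence of shape (1, seq_len, dim)
        f, _ = self.attn(x, x, x)                            # S41: keyword attention
        p = torch.softmax(self.heads[0](f.mean(1)), dim=-1)  # P_w = softmax(W_w f_w + b_w)
        if p.max() > self.theta:
            return p                                         # simple intention, exit to S5
        f, _ = self.gru(f)                                   # S42: sequence modeling
        p = torch.softmax(self.heads[1](f.mean(1)), dim=-1)  # P_p
        if p.max() > self.theta:
            return p                                         # medium intention
        f, _ = self.bigru(f)                                 # S43: context modeling
        p = torch.softmax(self.heads[2](f.mean(1)), dim=-1)  # P_s
        if p.max() > self.theta:
            return p                                         # complex intention
        h = torch.relu(self.fc(f.mean(1)))                   # S44: deep semantics
        return torch.softmax(self.heads[3](h), dim=-1)       # P_i, final intention output
```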