CN-121997245-A - Self-interacting robot control method and apparatus, computer device, and storage medium
Abstract
The embodiments of the present application provide a self-interacting robot control method and apparatus, a computer device, and a storage medium. The method comprises: acquiring voice interaction audio of a target object, an object expression image frame sequence, an object posture image frame sequence, and a scene image frame sequence; when indication information exists in the voice interaction audio, determining the object center coordinates of each candidate object and the hand action direction of the target object so as to determine the target item being pointed to; acquiring a tone vector and a semantic vector; encoding the tone vector and the object expression image frame sequence to obtain emotion features, and encoding the semantic vector and the object posture image frame sequence to obtain intention features; fusing the emotion features and the intention features to obtain a target fusion feature; classifying the target fusion feature to obtain an emotion classification result and an intention classification result; generating a target interaction sequence based on the position coordinates of the target item, the emotion classification result, and the intention classification result; and controlling the self-interacting robot to perform action interaction and voice feedback based on the target interaction sequence.
Inventors
- CAI CHANG
- GUO QINGDA
- LIU LINGBO
Assignees
- Peng Cheng Laboratory (鹏城实验室)
Dates
- Publication Date: 2026-05-08
- Application Date: 2025-12-19
Claims (10)
- 1. A self-interacting robot control method, the method comprising: acquiring voice interaction audio of a target object, an object expression image frame sequence, an object posture image frame sequence, and a scene image frame sequence; when the voice interaction audio contains indication information, determining the object center coordinates of each candidate object contained in the scene image frame sequence, determining the hand action direction of the target object according to the object posture image frame sequence, and determining the target item pointed to by the target object according to the angle differences between the hand action direction and the object center coordinates; acquiring a tone vector and a semantic vector corresponding to the voice interaction audio, encoding the tone vector and the object expression image frame sequence to obtain corresponding emotion features, and encoding the semantic vector and the object posture image frame sequence to obtain corresponding intention features; fusing the emotion features and the intention features to obtain a corresponding target fusion feature; classifying according to the target fusion feature to obtain an emotion classification result and an intention classification result, and generating a corresponding target interaction sequence based on the position coordinates of the target item, the emotion classification result, and the intention classification result; and controlling the self-interacting robot to perform action interaction and voice feedback based on the target interaction sequence.
- 2. The method of claim 1, wherein the hand includes a finger base and a finger tip, and determining the target item pointed to by the target object according to the angle differences between the hand action direction and the object center coordinates comprises: acquiring a first coordinate of the finger base and a second coordinate of the finger tip of the target object in the object posture image frame sequence, and determining a hand action direction vector based on the difference between the first coordinate and the second coordinate; calculating an object pointing vector for each candidate object based on the difference between the first coordinate and the object center coordinate of that candidate object; calculating, based on the hand action direction vector and each object pointing vector, the angle difference between the hand action direction of the target object and each object center coordinate; and determining, based on the magnitude relation among the angle differences corresponding to the candidate objects, the candidate object with the smallest angle difference as the target item pointed to by the target object (see the pointing-resolution sketch following this claims list).
- 3. The method of claim 1, wherein encoding the tone vector and the object expression image frame sequence to obtain the corresponding emotion features, and encoding the semantic vector and the object posture image frame sequence to obtain the corresponding intention features, comprises: encoding the tone vector and the object expression image frame sequence through a self-attention computing module of a preset target model to obtain the corresponding emotion features; and encoding the semantic vector and the object posture image frame sequence through a spatio-temporal recurrent network module of the target model to obtain the corresponding intention features (see the encoder sketch following this claims list).
- 4. The self-interacting robot control method of claim 1, wherein fusing the emotion features and the intention features to obtain the corresponding target fusion feature comprises: determining, through an attention fusion module, emotion-assist features associated with the intention features, and taking the intention features as query vectors, the emotion features as key vectors, and the emotion-assist features as value vectors; performing cross-modal attention calculation based on the query vectors, the key vectors, and the value vectors to obtain a corresponding initial fusion vector; and performing weighted fusion on the emotion features and the initial fusion vector to obtain the corresponding target fusion feature (claims 4 and 5 are illustrated in the fusion sketch following this claims list).
- 5. The self-interacting robot control method of claim 4, wherein performing weighted fusion on the emotion features and the initial fusion vector to obtain the corresponding target fusion feature comprises: outputting a first gating weight according to the emotion features and the initial fusion vector through a gating fusion unit; acquiring a reference value, and calculating a second gating weight based on the difference between the reference value and the first gating weight; adjusting the initial fusion vector through the first gating weight to obtain a first target vector; adjusting the emotion features through the second gating weight to obtain a second target vector; and fusing the first target vector and the second target vector to obtain the target fusion feature.
- 6. The method of claim 1, wherein classifying according to the target fusion feature to obtain the emotion classification result and the intention classification result comprises: classifying the target fusion feature through an intention classification head of a preset target model to obtain the intention classification result; and classifying the target fusion feature through an emotion classification head of the target model to obtain the emotion classification result.
- 7. The self-interacting robot control method of claim 6, further comprising: outputting a target confidence corresponding to the target fusion feature through a confidence estimation head of the preset target model; acquiring a preset confidence threshold, and triggering a clarification instruction when the target confidence is smaller than the confidence threshold; based on the clarification instruction, acquiring a historical interaction context associated with the voice interaction audio, and generating a corresponding target interaction sequence based on the historical interaction context, the position coordinates of the target item, the emotion classification result, and the intention classification result; and controlling the self-interacting robot to perform action interaction and voice feedback based on the target interaction sequence (claims 6 and 7 are illustrated in the decision-heads sketch following this claims list).
- 8. A self-interacting robot control device, the device comprising: an acquisition module, used for acquiring voice interaction audio of a target object, an object expression image frame sequence, an object posture image frame sequence, and a scene image frame sequence; a determining module, used for determining, when the voice interaction audio contains indication information, the object center coordinates of each candidate object contained in the scene image frame sequence, determining the hand action direction of the target object according to the object posture image frame sequence, and determining the target item pointed to by the target object according to the angle differences between the hand action direction and the object center coordinates; an encoding module, used for acquiring a tone vector and a semantic vector corresponding to the voice interaction audio, encoding the tone vector and the object expression image frame sequence to obtain corresponding emotion features, and encoding the semantic vector and the object posture image frame sequence to obtain corresponding intention features; a fusion module, used for fusing the emotion features and the intention features to obtain a corresponding target fusion feature; a classification module, used for classifying according to the target fusion feature to obtain an emotion classification result and an intention classification result, and generating a corresponding target interaction sequence based on the position coordinates of the target item, the emotion classification result, and the intention classification result; and a control module, used for controlling the self-interacting robot to perform action interaction and voice feedback based on the target interaction sequence.
- 9. A computer device, comprising a memory storing a computer program and a processor that implements the self-interacting robot control method of any one of claims 1 to 7 when executing the computer program.
- 10. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the self-interacting robot control method of any one of claims 1 to 7.
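
The pointing resolution in claim 2 reduces to comparing the hand action direction vector against a pointing vector toward each candidate center and keeping the smallest angle. The following is a minimal sketch of that computation; the function name, the use of 2D NumPy coordinates, and the numerical epsilon are illustrative assumptions, not identifiers from the patent.

```python
# Pointing-resolution sketch (claim 2). Coordinates are assumed to be 2D
# positions extracted from the object posture image frame sequence.
import numpy as np

def pick_pointed_item(finger_base: np.ndarray,
                      finger_tip: np.ndarray,
                      candidate_centers: list[np.ndarray]) -> int:
    """Return the index of the candidate whose center best matches the
    hand's pointing direction (smallest angle difference)."""
    # Hand action direction vector: finger tip relative to finger base.
    direction = finger_tip - finger_base
    angles = []
    for center in candidate_centers:
        # Object pointing vector: candidate center relative to finger base.
        to_item = center - finger_base
        # Angle between the two vectors via the cosine formula.
        cos_theta = np.dot(direction, to_item) / (
            np.linalg.norm(direction) * np.linalg.norm(to_item) + 1e-8)
        angles.append(np.arccos(np.clip(cos_theta, -1.0, 1.0)))
    # The candidate with the smallest angle difference is the target item.
    return int(np.argmin(angles))
```

For example, with the finger base at (0, 0), the finger tip at (1, 0), and candidate centers at (5, 0) and (0, 5), the function returns 0: the item straight ahead of the pointing direction.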
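Claim 3 names a self-attention computing module for the tone vector plus expression frames and a spatio-temporal recurrent network module for the semantic vector plus posture frames, without fixing architectures. The encoder sketch below is one plausible PyTorch reading; the Transformer and GRU choices, the token layout, the pooling, and all dimensions are assumptions, and per-frame inputs are assumed to be precomputed feature vectors of size `dim`.

```python
# Encoder sketch (claim 3), under the assumptions stated above.
import torch
import torch.nn as nn

class EmotionEncoder(nn.Module):
    """Encodes the tone vector together with expression frame features."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tone_vec: torch.Tensor, expr_frames: torch.Tensor) -> torch.Tensor:
        # Prepend the tone vector as an extra token ahead of the per-frame
        # expression features, then mean-pool the self-attention output.
        tokens = torch.cat([tone_vec.unsqueeze(1), expr_frames], dim=1)
        return self.encoder(tokens).mean(dim=1)

class IntentionEncoder(nn.Module):
    """Encodes the semantic vector together with posture frame features."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.mix = nn.Linear(2 * dim, dim)

    def forward(self, semantic_vec: torch.Tensor, pose_frames: torch.Tensor) -> torch.Tensor:
        # Run the posture sequence through the recurrent network, then
        # combine its final hidden state with the semantic vector.
        _, h = self.rnn(pose_frames)
        return self.mix(torch.cat([h[-1], semantic_vec], dim=-1))
```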
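Claims 4 and 5 together describe a cross-modal attention step followed by gated weighted fusion. A hedged fusion sketch follows; the patent does not say how the emotion-assist features are derived (a linear projection of the emotion features is assumed), the reference value is taken to be 1 so the second gating weight is simply 1 minus the first, and inputs are assumed to have shape (batch, tokens, dim).

```python
# Fusion sketch (claims 4-5), under the assumptions stated above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedCrossModalFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.to_assist = nn.Linear(dim, dim)  # assumed derivation of emotion-assist features
        self.gate = nn.Linear(2 * dim, dim)   # gating fusion unit (claim 5)

    def forward(self, intention: torch.Tensor, emotion: torch.Tensor) -> torch.Tensor:
        # Cross-modal attention (claim 4): intention features as queries,
        # emotion features as keys, emotion-assist features as values.
        q, k, v = intention, emotion, self.to_assist(emotion)
        attn = F.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        initial_fusion = attn @ v
        # Gated weighted fusion (claim 5): the first gating weight comes from
        # the emotion features and the initial fusion vector; the second is
        # the reference value (assumed to be 1) minus the first.
        g1 = torch.sigmoid(self.gate(torch.cat([emotion, initial_fusion], dim=-1)))
        first_target = g1 * initial_fusion      # adjusted initial fusion vector
        second_target = (1.0 - g1) * emotion    # adjusted emotion features
        return first_target + second_target     # target fusion feature
```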
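Claims 6 and 7 fix three heads over the target fusion feature and a rule that a clarification instruction fires when confidence drops below a preset threshold. In the decision-heads sketch below, the linear heads, the sigmoid confidence, and the pooled (batch, dim) input are assumptions.

```python
# Decision-heads sketch (claims 6-7), under the assumptions stated above.
import torch
import torch.nn as nn

class DecisionHeads(nn.Module):
    def __init__(self, dim: int, n_intents: int, n_emotions: int):
        super().__init__()
        self.intent_head = nn.Linear(dim, n_intents)    # intention classification head
        self.emotion_head = nn.Linear(dim, n_emotions)  # emotion classification head
        self.confidence_head = nn.Linear(dim, 1)        # confidence estimation head

    def forward(self, fused: torch.Tensor, threshold: float = 0.5):
        intent_logits = self.intent_head(fused)
        emotion_logits = self.emotion_head(fused)
        confidence = torch.sigmoid(self.confidence_head(fused)).squeeze(-1)
        # Below-threshold confidence triggers the clarification path of
        # claim 7, where the caller gathers historical interaction context
        # before generating the target interaction sequence.
        needs_clarification = confidence < threshold
        return intent_logits, emotion_logits, confidence, needs_clarification
```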
Description
Self-interacting robot control method and apparatus, computer device, and storage medium

Technical Field

The present application relates to the field of robot control technologies, and in particular to a self-interacting robot control method and apparatus, a computer device, and a storage medium.

Background

A self-interacting robot integrates a mechanical body with intelligent algorithms; it can execute tasks in a real environment through its physical entities (such as a mechanical arm or a mobile chassis) and can recognize a user's intention from its interaction with the user. In the related art, interaction between users and the self-interacting robot depends mainly on speech recognition and semantic understanding technologies. Specifically, the user issues an instruction to the self-interacting robot by voice; the self-interacting robot converts the voice into text through a speech recognition module, analyzes the user's intention through natural language processing, and finally provides feedback by voice or by executing an action. However, in this process the user's instruction intention is understood mainly through a single voice modality, which easily produces ambiguity in complex operation scenarios, so the understanding accuracy and interaction naturalness of the self-interacting robot are low.

Disclosure of Invention

The present application provides a self-interacting robot control method and apparatus, a computer device, and a storage medium, which can improve the understanding accuracy and interaction naturalness of the self-interacting robot during interaction.
To achieve the above object, a first aspect of the embodiments of the present application provides a self-interacting robot control method, the method comprising: acquiring voice interaction audio of a target object, an object expression image frame sequence, an object posture image frame sequence, and a scene image frame sequence; when the voice interaction audio contains indication information, determining the object center coordinates of each candidate object contained in the scene image frame sequence, determining the hand action direction of the target object according to the object posture image frame sequence, and determining the target item pointed to by the target object according to the angle differences between the hand action direction and the object center coordinates; acquiring a tone vector and a semantic vector corresponding to the voice interaction audio, encoding the tone vector and the object expression image frame sequence to obtain corresponding emotion features, and encoding the semantic vector and the object posture image frame sequence to obtain corresponding intention features; fusing the emotion features and the intention features to obtain a corresponding target fusion feature; classifying according to the target fusion feature to obtain an emotion classification result and an intention classification result, and generating a corresponding target interaction sequence based on the position coordinates of the target item, the emotion classification result, and the intention classification result; and controlling the self-interacting robot to perform action interaction and voice feedback based on the target interaction sequence.

Accordingly, a second aspect of the embodiments of the present application provides a self-interacting robot control device, the device comprising: an acquisition module, used for acquiring voice interaction audio of a target object, an object expression image frame sequence, an object posture image frame sequence, and a scene image frame sequence; a determining module, used for determining, when the voice interaction audio contains indication information, the object center coordinates of each candidate object contained in the scene image frame sequence, determining the hand action direction of the target object according to the object posture image frame sequence, and determining the target item pointed to by the target object according to the angle differences between the hand action direction and the object center coordinates; an encoding module, used for acquiring a tone vector and a semantic vector corresponding to the voice interaction audio, encoding the tone vector and the object expression image frame sequence to obtain corresponding emotion features, and encoding the semantic vector and the object posture image frame sequence to obtain corresponding intention features; a fusion module, used for fusing the emotion features and the intention features to obtain a corresponding target fusion feature; a classification module, used for classifying according to the target fusion feature to obtain an emotion classification result and an intention classification result, and generating a corresponding target interaction sequence based on the position coordinates of the target item, the emotion classification result, and the intention classification result; and a control module, used for controlling the self-interacting robot to perform action interaction and voice feedback based on the target interaction sequence.
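
Read together, the two aspects describe a single perception-to-action loop. The skeleton below strings the steps in the claimed order; `model`, `robot`, and every method on them are hypothetical stand-ins for the patent's components, and `pick_pointed_item` refers to the pointing-resolution sketch above.

```python
# End-to-end skeleton of the control loop, under the assumptions stated above.
def control_step(audio, expression_frames, pose_frames, scene_frames, model, robot):
    # Tone and semantic vectors from the voice interaction audio.
    tone_vec, semantic_vec = model.extract_audio_vectors(audio)

    # Resolve the pointed-at item only when the utterance carries indication
    # information such as "that one" or "over there".
    target_item = None
    if model.has_indication(audio):
        centers = model.detect_candidate_centers(scene_frames)
        base, tip = model.locate_finger(pose_frames)
        target_item = centers[pick_pointed_item(base, tip, centers)]

    # Modality-specific encoding (claim 3), fusion (claims 4-5), and
    # classification with confidence estimation (claims 6-7).
    emotion_feat = model.encode_emotion(tone_vec, expression_frames)
    intention_feat = model.encode_intention(semantic_vec, pose_frames)
    fused = model.fuse(emotion_feat, intention_feat)
    emotion_cls, intention_cls, confidence, clarify = model.classify(fused)

    # Plan the target interaction sequence, folding in historical interaction
    # context when the clarification instruction was triggered.
    context = model.history() if clarify else None
    sequence = model.plan(target_item, emotion_cls, intention_cls, context)

    # Drive action interaction and voice feedback.
    robot.execute(sequence)
```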