
CN-121989229-A - Robot control method and device based on audio driving, and robot

CN121989229A

Abstract

The invention provides a robot control method and device based on audio driving, and a robot, relating to the technical field of robot motion control. The method comprises the steps of: obtaining audio data and extracting audio features of the audio data; inputting the audio features to a pre-trained audio motion alignment module for audio motion alignment processing and outputting an audio style latent variable aligned by a motion prior; obtaining a motion content latent variable corresponding to the current task semantics; inputting the audio style latent variable and the motion content latent variable into a pre-trained diffusion student policy network to generate joint action instructions for the robot; and sending the joint action instructions to a joint driver of the robot to drive the robot to execute actions synchronized with the audio data. Through end-to-end direct mapping, the embodiment of the invention avoids multi-stage error accumulation and realizes high-fidelity, low-delay synchronous control of robot actions with audio data.

Inventors

  • Chi Cheng
  • Li Zhe
  • Zhu Boan
  • Wei Yangyang
  • Wang Pengwei
  • Wang Zhongyuan
  • Zhang Shanghang

Assignees

  • Beijing Academy of Artificial Intelligence (北京智源人工智能研究院)

Dates

Publication Date
2026-05-08
Application Date
2025-12-29

Claims (10)

  1. A robot control method based on audio driving, comprising: acquiring audio data and extracting audio features of the audio data; inputting the audio features to a pre-trained audio motion alignment module for audio motion alignment processing, and outputting an audio style latent variable aligned by a motion prior; acquiring a motion content latent variable corresponding to current task semantics; inputting the audio style latent variable and the motion content latent variable into a pre-trained diffusion student policy network to generate a joint action instruction of the robot; and sending the joint action instruction to a joint driver of the robot so as to drive the robot to execute an action synchronized with the audio data.
  2. The audio-driving-based robot control method according to claim 1, wherein the training step of the audio motion alignment module comprises: acquiring paired first audio data and first human motion sequence data; encoding the first human motion sequence data as a first motion latent variable and encoding the first audio data as a first audio latent variable; constructing a neural network containing a temporal attention mechanism as an audio motion alignment network, taking the first audio latent variable as a query and the first motion latent variable as a key and a value, and calculating attention weights; and training the audio motion alignment network with an information noise contrastive estimation (InfoNCE) loss function to obtain the trained audio motion alignment module, wherein the InfoNCE loss function is configured to pull the paired first audio latent variable and first motion latent variable closer in the feature space and push unpaired samples apart.
  3. The audio-driving-based robot control method according to claim 2, wherein the audio motion alignment network is an adapter network based on a Transformer architecture, the attention of the audio motion alignment network is computed as scaled dot-product attention based on the attention weights, and the InfoNCE loss function L is specifically:
     L = -(1/N) Σ_{i=1}^{N} log( exp(sim(z_i^a, z_i^m)/τ) / Σ_{j=1}^{N} exp(sim(z_i^a, z_j^m)/τ) )
     wherein z_i^a is the first audio latent variable of the i-th sample, z_i^m is the first motion latent variable corresponding to that first audio latent variable, z_j^m (j ≠ i) are the first motion latent variables of the other samples within the batch, sim(·,·) is the cosine similarity function, N is the batch size, and τ is a temperature hyperparameter.
  4. The audio-driving-based robot control method according to claim 1, wherein the diffusion student policy network is obtained by distillation from an incremental mixture-of-experts teacher policy network, and the training step of the incremental mixture-of-experts teacher policy network comprises: in a physical simulation environment, constructing an incremental mixture-of-experts teacher policy network comprising a gating network and at least three expert networks; defining a group of nested condition subspaces for the at least three expert networks, wherein the input condition subspace of the i-th expert network contains the input condition subspace of the (i-1)-th expert network and additionally adds at least one condition dimension, the condition dimensions comprising one or more of a robot body state, an aligned audio style latent variable, a reference motion trajectory, and simulation privileged information; configuring the gating network to output a set of normalized weights corresponding to each expert network according to the current robot state input; and performing reinforcement learning training on the incremental mixture-of-experts teacher policy network with a proximal policy optimization algorithm to obtain the trained incremental mixture-of-experts teacher policy network, wherein each expert network independently computes an action output according to its corresponding input condition subspace, and the final action output of the incremental mixture-of-experts teacher policy network is obtained by weighting and fusing the action output increments of the expert networks according to the normalized weights.
  5. The audio-driving-based robot control method according to claim 4, wherein the weighted fusion is implemented by a residual fusion mechanism, specifically: let the original action output of the i-th expert network be a_i; set the action output of the 0th expert network to a_0 = 0; compute the action output increment of the i-th expert network as Δa_i = a_i - a_{i-1}; the final action output a of the incremental mixture-of-experts teacher policy network is computed as:
     a = Σ_{i=1}^{K} w_i · Δa_i
     wherein a is the final action output of the incremental mixture-of-experts teacher policy network, w_i is the normalized weight assigned by the gating network to the i-th expert network, and K is the total number of expert networks.
  6. The audio-driving-based robot control method according to claim 5, wherein the training step of the diffusion student policy network comprises: constructing a denoising network with a multi-layer perceptron as backbone, wherein the denoising network is configured to receive a noised action input, diffusion time-step information, the motion content latent variable, and the audio style latent variable; injecting the motion content latent variable into each layer of the multi-layer perceptron through adaptive layer normalization; modulating the output of at least one intermediate layer of the multi-layer perceptron with the audio style latent variable through a learnable linear transformation and scaling coefficient in an additive injection manner; collecting state data of the robot in the simulation environment with a DAgger-style paradigm, and querying the trained incremental mixture-of-experts teacher policy network to obtain corresponding expert action labels; and adding noise to the expert action labels in a forward diffusion process, and training the denoising network to predict the added noise or the original clean action, with the optimization objective of minimizing the mean square error between the predicted value and the true value, to obtain the trained diffusion student policy network.
  7. The audio-driving-based robot control method according to claim 6, wherein the additive injection manner is specifically: for the l-th layer of the multi-layer perceptron, the output h_l is computed as:
     h_l = f_l(h_{l-1}, z_c) + α · W · z_s
     wherein h_l is the output of the l-th layer of the multi-layer perceptron, f_l denotes the l-th layer network, h_{l-1} is the output of the previous layer, z_c is the motion content latent variable, z_s is the audio style latent variable, W is a learnable projection matrix, and α is a learnable scalar coefficient controlling the intensity of style injection.
  8. The audio-driving-based robot control method according to any one of claims 1 to 7, wherein the extracting the audio features of the audio data comprises: when the audio data is a speech signal, extracting the audio features of the speech signal with a temporal convolutional network; and when the audio data is a music signal, extracting the audio features of the music signal with a pre-trained audio encoder.
  9. A robot control device based on audio driving, comprising: an extraction module, configured to acquire audio data and extract audio features of the audio data; an audio motion alignment module, configured to input the audio features to a pre-trained audio motion alignment module for audio motion alignment processing and output an audio style latent variable aligned by a motion prior; an acquisition module, configured to acquire a motion content latent variable corresponding to current task semantics; a generation module, configured to input the audio style latent variable and the motion content latent variable into a pre-trained diffusion student policy network to generate a joint action instruction of the robot; and a control module, configured to send the joint action instruction to a joint driver of the robot so as to drive the robot to execute an action synchronized with the audio data.
  10. A robot, comprising: one or more processors; a memory storing a computer program; and a robot body including a plurality of joint drivers controllable by the one or more processors; wherein the computer program, when executed by the one or more processors, causes the robot to implement the audio-driving-based robot control method of any one of claims 1 to 8.
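The contrastive alignment objective described in claims 2 and 3 can be sketched in plain Python. This is a minimal illustration, not the patent's implementation: the function names and the toy latent vectors are hypothetical, and the latents are assumed to be nonzero fixed-length vectors.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length, nonzero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce_loss(audio_latents, motion_latents, tau=0.07):
    """InfoNCE over a batch of N (audio, motion) latent pairs: pull the paired
    (audio_i, motion_i) together and push audio_i away from motion_j (j != i)."""
    n = len(audio_latents)
    loss = 0.0
    for i in range(n):
        sims = [cosine(audio_latents[i], m) / tau for m in motion_latents]
        log_denom = math.log(sum(math.exp(s) for s in sims))
        loss += -(sims[i] - log_denom)   # -log softmax of the positive pair
    return loss / n
```

With correctly paired latents the loss is near zero; mispairing the same batch drives it up, which is the "pull paired, push unpaired" behavior the claim describes.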
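The residual fusion of expert outputs in claims 4 and 5 can likewise be sketched in a few lines. This is an illustrative reading of the claim, assuming a virtual 0th expert with a_0 = 0 and increments Δa_i = a_i - a_{i-1}; the function and variable names are not from the patent:

```python
def fuse_expert_actions(expert_actions, gate_weights):
    """Residual fusion sketch: each expert i (conditioned on a nested subspace)
    contributes its increment over expert i-1, and the gating network's
    normalized weights w_i blend the increments: a = sum_i w_i * (a_i - a_{i-1})."""
    assert abs(sum(gate_weights) - 1.0) < 1e-9, "gate weights must be normalized"
    dim = len(expert_actions[0])
    prev = [0.0] * dim                       # virtual 0th expert: a_0 = 0
    fused = [0.0] * dim
    for a_i, w_i in zip(expert_actions, gate_weights):
        delta = [x - p for x, p in zip(a_i, prev)]        # Δa_i = a_i - a_{i-1}
        fused = [f + w_i * d for f, d in zip(fused, delta)]
        prev = list(a_i)
    return fused
```

With a single expert and full weight the fused action reduces to that expert's output, matching the base case of the incremental scheme.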

Description

Robot control method and device based on audio driving, and robot

Technical Field

The present invention relates to the technical field of robot motion control, and in particular to a robot control method and apparatus based on audio driving, and a robot.

Background

In the prior art, humanoid robots performing dance or accompaniment actions often rely on a multi-stage serial technical path. This path generally first converts an audio signal into an explicit human motion sequence (such as a sequence of joint-point coordinates) through an audio-driven motion generation model; the generated human motion sequence must then be retargeted to the specific morphology and kinematic constraints of the target robot to obtain a reference motion trajectory adapted to the robot; finally, a low-level physical controller tracks the reference trajectory to generate motor control instructions. This procedure is widely adopted in academia and industry to achieve music- or voice-driven robotic performances. However, the split "generate-retarget-track" architecture described above has several inherent drawbacks. First, the module of each stage is optimized independently, so motion generation errors, retargeting approximation errors, and control tracking errors accumulate through the multi-stage pipeline, and the robot actions finally performed are markedly degraded in expressive fidelity and physical feasibility. Second, the serial execution of multiple modules introduces considerable computation and communication delays, making it difficult to meet stringent real-time requirements for audio-action synchronization. Furthermore, the correlation between high-level information in the audio signal, such as semantics and rhythm, and the low-level joint driving instructions is loose, so the actions can hardly fit the fine-grained characteristics of the audio accurately.
In addition, existing schemes are typically designed for specific audio types (e.g., music only or speech only) and lack a unified, generalized framework that can flexibly cope with diverse audio-driving scenarios.

Disclosure of Invention

The invention provides a robot control method and device based on audio driving, and a robot, which avoid multi-stage error accumulation through end-to-end direct mapping and realize high-fidelity, low-delay synchronous control of robot actions with audio data. In a first aspect, the present invention provides a robot control method based on audio driving, including the steps of: acquiring audio data and extracting audio features of the audio data; inputting the audio features to a pre-trained audio motion alignment module for audio motion alignment processing, and outputting an audio style latent variable aligned by a motion prior; acquiring a motion content latent variable corresponding to current task semantics; inputting the audio style latent variable and the motion content latent variable into a pre-trained diffusion student policy network to generate a joint action instruction of the robot; and sending the joint action instruction to a joint driver of the robot so as to drive the robot to execute an action synchronized with the audio data.
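The five steps of the first-aspect method above can be composed as a single control step. The sketch below shows only the claimed data flow (audio → style latent + content latent → diffusion policy → joint drivers); every component is a hypothetical stub standing in for a trained network, and all names and the toy arithmetic inside the stubs are illustrative:

```python
# Illustrative end-to-end inference flow; each stub would be a trained model.

def extract_audio_features(audio_frames):
    # Step 1: per-frame audio features (stub: frame energy).
    return [sum(x * x for x in frame) for frame in audio_frames]

def audio_motion_align(features):
    # Step 2: map features to a motion-prior-aligned style latent (stub).
    n = len(features)
    return [sum(features) / n, max(features) - min(features)]

def content_latent_for_task(task):
    # Step 3: motion content latent for the current task semantics (stub lookup).
    return {"dance": [1.0, 0.0], "wave": [0.0, 1.0]}[task]

def diffusion_student_policy(style_latent, content_latent):
    # Step 4: denoise into a joint action instruction (stub: concatenation).
    return style_latent + content_latent

def send_to_joint_drivers(command):
    # Step 5: one target value per joint driver (stub: identity).
    return list(command)

def control_step(audio_frames, task):
    feats = extract_audio_features(audio_frames)
    style = audio_motion_align(feats)
    content = content_latent_for_task(task)
    command = diffusion_student_policy(style, content)
    return send_to_joint_drivers(command)
```

The point of the end-to-end design is that `control_step` is the whole pipeline: no explicit human motion sequence or retargeting stage sits between the audio and the joint command.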
Preferably, according to the robot control method based on audio driving provided by the present invention, the training step of the audio motion alignment module includes: acquiring paired first audio data and first human motion sequence data; encoding the first human motion sequence data as a first motion latent variable and encoding the first audio data as a first audio latent variable; constructing a neural network containing a temporal attention mechanism as an audio motion alignment network, taking the first audio latent variable as a query and the first motion latent variable as a key and a value, and calculating attention weights; and training the audio motion alignment network with an information noise contrastive estimation (InfoNCE) loss function to obtain the trained audio motion alignment module, wherein the InfoNCE loss function is configured to pull the paired first audio latent variable and first motion latent variable closer in the feature space and push unpaired samples apart. Preferably, according to the robot control method based on audio driving provided by the present invention, the audio motion alignment network is an adapter network based on a Transformer architecture, the attention of the audio motion alignment network is computed as scaled dot-product attention based on the attention weights, and the InfoNCE loss function L is specifically:
L = -(1/N) Σ_{i=1}^{N} log( exp(sim(z_i^a, z_i^m)/τ) / Σ_{j=1}^{N} exp(sim(z_i^a, z_j^m)/τ) )
wherein z_i^a is the first audio latent variable of the i-th sample, z_i^m is the first motion latent variable corresponding to that first audio latent variable, and z_j^m (j ≠ i) are the first motion latent variables of the other samples within the batch.