CN-122018332-A - Control method and system for universal foot robot

Abstract

The invention provides a control method and a control system for a universal foot robot, and relates to the technical field of robot control. The control method tokenizes the time-series observation data of multi-source sensors in a unified way and adds modal, time and sensor configuration embeddings, thereby constructing a unified input representation compatible with heterogeneous sensors and different sampling frequencies. An occlusion mask is introduced so that an occlusion temporal attention encoder can cope with data occlusion and frame loss. Robot morphology description information and the world state latent variable are fused in a cross-morphology fusion module, so that a single policy network (comprising the occlusion temporal attention encoder, the cross-morphology fusion module and a general policy head) adapts to foot robot morphologies with different kinematic topologies and dynamic parameters. Finally, joint-level control instructions are output to drive the robot, forming a complete control scheme for a universal foot robot and thereby effectively solving the core problems that the prior art depends on specific sensors and morphologies and copes poorly with the loss of perception data.

Inventors

WANG BORAN
LUO QINGYUAN
Wu Wangjicheng

Assignees

哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院)

Dates

Publication Date: 2026-05-12
Application Date: 2026-04-10

Claims (10)

1. A control method for a universal foot robot, comprising: acquiring multi-source sensor time-series observation data of the foot robot in a current control period and a historical control period; performing unified tokenization on the multi-source sensor time-series observation data to obtain a modal token sequence, and adding a modal embedding, a time embedding and a sensor configuration embedding to each modal token in the modal token sequence to obtain an embedded modal token sequence; generating an occlusion mask according to a preset occlusion strategy in a training stage or a quality indication in an inference stage; inputting the embedded modal token sequence and the occlusion mask into an occlusion temporal attention encoder, and outputting a world state latent variable; and acquiring morphology description information of the foot robot, inputting the morphology description information and the world state latent variable into a cross-morphology fusion module to obtain a morphology-conditioned latent variable, inputting the morphology-conditioned latent variable into a general policy head, and outputting a joint-level control instruction to control the motion of the foot robot.
2. The control method for a universal foot robot according to claim 1, wherein performing unified tokenization on the multi-source sensor time-series observation data to obtain a modal token sequence comprises: extracting, with an encoder corresponding to each modality, features from the time-series observation data of that modality in the multi-source sensor time-series observation data to obtain initial features of each modality; linearly mapping the initial features of each modality to a vector space of uniform dimension to obtain a modal token of each modality at the current moment; and arranging the modal tokens of different modalities at different moments in time order and modality order to form the modal token sequence.
3. The control method for a universal foot robot according to claim 1, wherein the occlusion mask comprises at least one of a random modal mask, a random time-slice mask and a structured mask, wherein the structured mask is generated according to a real-scene occlusion distribution and comprises at least one of a lateral field-of-view occlusion, a local area occlusion, a distance-segment occlusion and a continuous frame-loss occlusion.
4. The control method for a universal foot robot according to claim 1, wherein inputting the embedded modal token sequence and the occlusion mask into the occlusion temporal attention encoder and outputting the world state latent variable comprises: acquiring length information of a current sliding window, and determining a causal mask according to the length information; in each attention layer of the occlusion temporal attention encoder, calculating an attention score with the embedded modal token sequence as the source of queries, keys and values, and superimposing the causal mask and the occlusion mask on the attention score; normalizing the attention score after the masks are superimposed to obtain attention weights, and computing a weighted sum of the values according to the attention weights to obtain the attention output of the current layer; and, after stacking a plurality of attention layers, extracting the feature representation corresponding to the current moment from the attention output of the last layer and aggregating it through a pooling operation to obtain the world state latent variable.
5. The control method for a universal foot robot according to claim 1, wherein the cross-morphology fusion module comprises a morphology encoder and a feature fusion device, and wherein inputting the morphology description information and the world state latent variable into the cross-morphology fusion module to obtain the morphology-conditioned latent variable comprises: constructing the morphology encoder based on a graph neural network or a morphology Transformer encoder; constructing a morphology graph from the kinematic topology and dynamic parameters of the foot robot, wherein nodes of the morphology graph represent joints or links, edges of the morphology graph represent connection relations, and node attributes comprise at least one of link length, mass, moment of inertia, joint type and joint limit; inputting the morphology graph into the graph neural network or the morphology Transformer encoder to obtain a morphology token sequence; and fusing the morphology token sequence with the world state latent variable through the feature fusion device to obtain the morphology-conditioned latent variable.
6. The control method for a universal foot robot according to claim 5, wherein fusing the morphology token sequence with the world state latent variable through the feature fusion device to obtain the morphology-conditioned latent variable comprises: injecting the morphology token sequence into the world state latent variable based on feature-wise linear modulation or a cross-attention mechanism to obtain the morphology-conditioned latent variable.
7. The control method for a universal foot robot according to claim 1, further comprising: determining, by a reliability estimator, the reliability of each modality at the current moment according to the quality indication and a cross-modal reconstruction residual; and dynamically adjusting the occlusion mask according to the reliability, including: when the reliability of any modality is lower than a preset threshold, replacing the corresponding modal token with a learnable mask token, and reducing or masking the weight of that modality in the attention calculation.
8. The control method for a universal foot robot according to claim 1, wherein the occlusion temporal attention encoder and the cross-morphology fusion module are obtained by three-stage training comprising self-supervised occlusion modeling pre-training, multi-task control learning and cross-morphology expert-student distillation; the self-supervised occlusion modeling pre-training comprises: acquiring, as historical training data, time-series observation data collected in advance in a multi-morphology, multi-sensor simulation data pool; applying random occlusion and structured occlusion to the historical training data to generate an occluded modal token sequence; and inputting the occluded modal token sequence into an initialized occlusion temporal attention encoder, taking the original observation data of the occluded part as the supervision signal, optimizing an occlusion reconstruction loss, a dynamics consistency loss and a cross-modal alignment loss, and training to obtain a pre-trained occlusion temporal attention encoder; the multi-task control learning comprises: connecting the pre-trained occlusion temporal attention encoder, with its parameters, to an initialized cross-morphology fusion module and an initialized general policy head to form a preliminary policy network; in a multi-task reinforcement learning environment, generating experience data through online interaction between the preliminary policy network and the environment; inputting the morphology description information of the current foot robot morphology into the cross-morphology fusion module of the preliminary policy network to generate temporary morphology tokens, fusing the temporary morphology tokens with a temporary world state latent variable to obtain a temporary morphology-conditioned latent variable, and inputting it into the general policy head of the preliminary policy network to output an action; constructing a reinforcement learning control loss according to the reward signal fed back by the environment; and back-propagating the gradient of the reinforcement learning control loss to the occlusion temporal attention encoder, the cross-morphology fusion module and the general policy head to jointly optimize their parameters, while continuously applying modal randomization perturbation during training so that motion control policies adapted to robots of different morphologies are learned, thereby obtaining the occlusion temporal attention encoder, the cross-morphology fusion module and the general policy head after preliminary joint training; and the cross-morphology expert-student distillation comprises: acquiring a plurality of expert policy networks, and using the action distributions generated by each expert policy network in a plurality of states as soft labels; forming a unified student policy network from the occlusion temporal attention encoder, the cross-morphology fusion module and the general policy head after preliminary joint training, and optimizing a distillation loss with the soft labels as the supervision signal, so that the action distribution output by the unified student policy network matches the action distribution of each expert policy network; and, during distillation, jointly optimizing the occlusion reconstruction loss, and training to obtain the occlusion temporal attention encoder, the cross-morphology fusion module and the general policy head.
9. The control method for a universal foot robot according to claim 1, wherein the occlusion temporal attention encoder further outputs an uncertainty, and the control method further comprises: adaptively adjusting a desired speed instruction or a safety constraint weight according to the uncertainty.
10. A control system for a universal foot robot, comprising: an acquisition unit, configured to acquire multi-source sensor time-series observation data of the foot robot in a current control period and a historical control period; an embedding unit, configured to perform unified tokenization on the multi-source sensor time-series observation data to obtain a modal token sequence, and to add a modal embedding, a time embedding and a sensor configuration embedding to each modal token in the modal token sequence to obtain an embedded modal token sequence; a processing unit, configured to generate an occlusion mask according to a preset occlusion strategy in a training stage or a quality indication in an inference stage, input the embedded modal token sequence and the occlusion mask into an occlusion temporal attention encoder, and output a world state latent variable; and a control unit, configured to acquire morphology description information of the foot robot, input the morphology description information and the world state latent variable into a cross-morphology fusion module to obtain a morphology-conditioned latent variable, input the morphology-conditioned latent variable into a general policy head, and output a joint-level control instruction to control the motion of the foot robot.
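
As a non-limiting illustration of the attention step recited in claim 4, the superposition of a causal mask and an occlusion mask on the attention scores can be sketched in plain Python. The sketch makes several assumptions that are not in the application: a single attention layer and a single head, identity projections for queries, keys and values, mean pooling over the non-occluded tokens of the current moment, and the hypothetical function names `masked_attention` and `world_state_latent`.

```python
import math

NEG_INF = float("-inf")


def masked_attention(tokens, times, occluded):
    """Single-head self-attention with causal and occlusion masks superimposed.

    tokens   : list of equal-length float vectors (used as queries, keys, values)
    times    : time-step index of each token inside the sliding window
    occluded : True where a token is masked and must not be attended to
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q, t_q in zip(tokens, times):
        scores = []
        for k, t_k, occ in zip(tokens, times, occluded):
            score = scale * sum(a * b for a, b in zip(q, k))
            if t_k > t_q or occ:  # causal mask OR occlusion mask
                score = NEG_INF
            scores.append(score)
        top = max(scores)
        if top == NEG_INF:  # every key is masked: emit a zero vector
            outputs.append([0.0] * dim)
            continue
        # Softmax normalisation; masked positions get weight exp(-inf) = 0.
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)]
        )
    return outputs


def world_state_latent(outputs, times, occluded):
    """Mean-pool the outputs of the latest time step (assumed pooling choice)."""
    current = max(times)
    rows = [o for o, t, occ in zip(outputs, times, occluded)
            if t == current and not occ]
    if not rows:  # all current tokens occluded: fall back to pooling them all
        rows = [o for o, t in zip(outputs, times) if t == current]
    dim = len(rows[0])
    return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]


# Two modalities over two time steps; the last token is occluded.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
times = [0, 0, 1, 1]
occluded = [False, False, False, True]
latent = world_state_latent(masked_attention(tokens, times, occluded), times, occluded)
```

Setting a masked score to negative infinity before normalization gives that position an attention weight of exactly zero, so an occluded or future token cannot influence the output regardless of its content.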

Description

Control method and system for universal foot robot

Technical Field

The invention relates to the technical field of robot control, and in particular to a control method and a control system for a universal foot robot.

Background

Owing to their excellent terrain adaptability, foot robots have broad application prospects in fields such as industrial inspection, logistics and transportation, disaster relief and field exploration. In recent years, with the rapid development of deep reinforcement learning and simulation technology, learning-based motion control of foot robots has made significant breakthroughs in tasks such as speed tracking, disturbance recovery and traversal of complex terrain.

However, existing foot robot control methods still face technical bottlenecks in engineering deployment. For example, existing methods place high demands on the integrity and synchronization of sensor data, and generally assume that the sensor of every modality is always stably available. In practical applications, sensor data is often missing or distorted because of environmental factors (such as occlusion, strong light, reflection, fog and dust) or hardware problems (such as frame loss, delay and disconnection). Traditional policy networks lack effective modeling of such partially observable states; when the input data is incomplete, control performance drops sharply and the robot may even lose stability and fall. In addition, current multi-source sensor fusion methods mostly rely on simple concatenation or weighted fusion, which makes it difficult to fully exploit the temporal dependencies and complementarity among different modalities, and difficult to handle the data asynchrony caused by differences in sensor sampling frequencies.

Disclosure of Invention

The present invention addresses one or more of the above-mentioned problems of the related art.
In order to solve the above problems, the present invention provides a control method and a control system for a universal foot robot. In a first aspect, the present invention provides a control method for a universal foot robot, including: acquiring multi-source sensor time-series observation data of the foot robot in a current control period and a historical control period; performing unified tokenization on the multi-source sensor time-series observation data to obtain a modal token sequence, and adding a modal embedding, a time embedding and a sensor configuration embedding to each modal token in the modal token sequence to obtain an embedded modal token sequence; generating an occlusion mask according to a preset occlusion strategy in a training stage or a quality indication in an inference stage; inputting the embedded modal token sequence and the occlusion mask into an occlusion temporal attention encoder, and outputting a world state latent variable; and acquiring morphology description information of the foot robot, inputting the morphology description information and the world state latent variable into a cross-morphology fusion module to obtain a morphology-conditioned latent variable, inputting the morphology-conditioned latent variable into a general policy head, and outputting a joint-level control instruction to control the motion of the foot robot.
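
By way of illustration only, the tokenization and embedding step of the first aspect can be sketched as follows. The sketch assumes that per-modality features have already been extracted, uses randomly initialized vectors as stand-ins for learned projections and embeddings, and indexes the time embedding by control step; the names `build_token_sequence`, `D_MODEL`, `imu_type_a` and `camera_type_b` are hypothetical and do not come from the application.

```python
import random

D_MODEL = 4  # assumed unified token dimension (hypothetical value)


def rand_vec(n, rng):
    """Random stand-in for a learned parameter vector."""
    return [rng.uniform(-0.1, 0.1) for _ in range(n)]


def build_token_sequence(observations, feature_dims, sensor_config, window, seed=0):
    """Map per-modality features to embedded tokens ordered by time, then modality.

    observations : {(time_step, modality): feature vector}; a missing key means
                   the sensor produced no sample at that step (lower sampling rate)
    feature_dims : {modality: feature length produced by that modality's encoder}
    sensor_config: {modality: configuration label, e.g. a sensor model or mount}
    """
    rng = random.Random(seed)
    modalities = sorted(feature_dims)
    # One linear map per modality into the shared D_MODEL space.
    proj = {m: [rand_vec(feature_dims[m], rng) for _ in range(D_MODEL)]
            for m in modalities}
    modal_emb = {m: rand_vec(D_MODEL, rng) for m in modalities}
    time_emb = [rand_vec(D_MODEL, rng) for _ in range(window)]
    config_emb = {c: rand_vec(D_MODEL, rng)
                  for c in sorted(set(sensor_config.values()))}

    tokens, index = [], []
    for t in range(window):
        for m in modalities:
            feature = observations.get((t, m))
            if feature is None:
                continue  # asynchronous sensor: no token for this slot
            base = [sum(w * x for w, x in zip(row, feature)) for row in proj[m]]
            cfg = config_emb[sensor_config[m]]
            # Token = projected feature + modal + time + sensor-config embeddings.
            token = [b + e_m + e_t + e_c
                     for b, e_m, e_t, e_c in zip(base, modal_emb[m], time_emb[t], cfg)]
            tokens.append(token)
            index.append((t, m))
    return tokens, index


# Hypothetical example: an IMU sampled every step, a camera every second step.
feature_dims = {"imu": 3, "camera": 5}
sensor_config = {"imu": "imu_type_a", "camera": "camera_type_b"}
observations = {(t, "imu"): [0.1 * t, 0.0, 9.8] for t in range(3)}
observations.update({(t, "camera"): [1.0] * 5 for t in (0, 2)})
tokens, index = build_token_sequence(observations, feature_dims, sensor_config, window=3)
```

Because every token carries its own time and modality information, sensors with different sampling frequencies simply contribute fewer tokens to the window, and no resampling to a common rate is needed in this sketch.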
Optionally, performing unified tokenization on the multi-source sensor time-series observation data to obtain a modal token sequence includes: extracting, with an encoder corresponding to each modality, features from the time-series observation data of that modality in the multi-source sensor time-series observation data to obtain initial features of each modality; linearly mapping the initial features of each modality to a vector space of uniform dimension to obtain a modal token of each modality at the current moment; and arranging the modal tokens of different modalities at different moments in time order and modality order to form the modal token sequence.

Optionally, the occlusion mask comprises at least one of a random modal mask, a random time-slice mask and a structured mask, wherein the structured mask is generated according to a real-scene occlusion distribution and comprises at least one of a lateral field-of-view occlusion, a local area occlusion, a distance-segment occlusion and a continuous frame-loss occlusion.

Optionally, inputting the embedded modal token sequence and the occlusion mask into the occlusion temporal attention encoder and outputting the world state latent variable includes: acquiring length information of a current sliding window, and determining a causal mask according to the length information; in each attention layer of the occlusion temporal attention encoder, calculating an attention score with the embedded modal token sequence as the source of queries, keys and values, and sup