CN-121973248-A - Motion control method and device based on vision and motion generation model

CN121973248A

Abstract

The invention relates to the technical field of robot control and discloses a motion control method and device based on a vision and motion generation model. The method comprises: acquiring image information of an object to be operated by a dual-arm robot, and performing scene perception analysis on the image information to obtain a multimodal scene representation vector; acquiring physical parameters of the dual-arm robot, and dynamically modulating a base motion generation model according to the physical parameters to obtain a motion generation model matched to the robot; acquiring a task instruction and the dual-arm state, and inputting the multimodal scene representation vector, the task instruction, and the dual-arm state into the motion generation model to generate a dual-arm joint-angle control sequence; and controlling the dual-arm robot to execute the operation task according to that sequence. The invention can thus perceive three-dimensional scenes accurately, enable cross-platform reuse of the base model, reduce the cost and time of model adaptation, realize end-to-end mapping from vision to motion control, and improve the automation and intelligence of motion control.

Inventors

  • LI DONGSHENG
  • LI TIANGANG

Assignees

  • 深圳和润达科技有限公司

Dates

Publication Date
2026-05-05
Application Date
2026-04-07

Claims (10)

  1. A motion control method based on a vision and motion generation model, the method comprising: acquiring image information of an object to be operated by a dual-arm robot, and performing scene perception analysis on the image information to obtain a multimodal scene representation vector corresponding to the object to be operated; acquiring physical parameters of the dual-arm robot, and dynamically modulating a preset base motion generation model according to the physical parameters to obtain a motion generation model matched to the dual-arm robot, wherein the physical parameters comprise at least one of arm length parameters, mass distribution parameters, joint limit parameters, and joint transmission ratio parameters; acquiring a task instruction for the object to be operated and the dual-arm state of the dual-arm robot, and inputting the multimodal scene representation vector, the task instruction, and the dual-arm state into the motion generation model to generate a dual-arm joint-angle control sequence; and controlling the dual-arm robot to execute the operation task corresponding to the task instruction according to the dual-arm joint-angle control sequence.
  2. The motion control method based on a vision and motion generation model according to claim 1, wherein the motion generation model comprises a feature fusion layer, a hybrid encoder-decoder, and an output layer, the hybrid encoder-decoder comprising an encoder, a motion primitive library, and a decoder; and wherein inputting the multimodal scene representation vector, the task instruction, and the dual-arm state into the motion generation model to generate a dual-arm joint-angle control sequence comprises: concatenating the multimodal scene representation vector, the task instruction, and the dual-arm state through the feature fusion layer to obtain initial context features; extracting, through a multi-layer spatio-temporal convolutional network in the encoder, spatio-temporal features of dual-arm cooperation from the initial context features, wherein the spatio-temporal features comprise the spatial relations and temporal evolution rules among the dual-arm joints and semantic information related to the operation task corresponding to the task instruction; generating a motion instruction matched to the operation task according to the motion primitives stored in the motion primitive library and the spatio-temporal features; generating, by the decoder, an initial joint-angle sequence according to the spatio-temporal features and the motion instruction, the initial joint-angle sequence comprising joint-angle control instructions for a plurality of future time steps; and smoothing the initial joint-angle sequence through the output layer to obtain the dual-arm joint-angle control sequence.
  3. The motion control method based on a vision and motion generation model according to claim 2, wherein the multi-layer spatio-temporal convolutional network comprises a vision network and a spatio-temporal graph convolutional network; and wherein extracting the spatio-temporal features of dual-arm cooperation from the initial context features through the multi-layer spatio-temporal convolutional network in the encoder comprises: capturing global image semantic information in the initial context features through the vision network to obtain global visual features, inputting historical joint trajectory information corresponding to the dual-arm state into the spatio-temporal graph convolutional network to obtain spatio-temporal joint features, and concatenating the global visual features and the spatio-temporal joint features to obtain initial fusion features; performing cross-modal spatio-temporal attention computation on the initial fusion features to obtain a set of cross-modal attention weights, and weighting the initial fusion features according to those weights to obtain cross-modal fusion features, wherein the set of cross-modal attention weights comprises a spatial attention weight matrix, a temporal attention weight matrix, and association weights between the global visual features and the spatio-temporal joint features; determining at least one functional partition of the robot's two arms according to the task instruction, and determining, for each functional partition, local spatial-relationship features of the joint points within that partition from the cross-modal fusion features; and inputting each local spatial-relationship feature into a preset multi-layer cascade structure to perform multi-layer spatio-temporal convolution and obtain the spatio-temporal features of dual-arm cooperation, wherein the multi-layer cascade structure alternately cascades a plurality of multi-configuration partitioned adaptive graph convolutional network layers and temporal one-dimensional convolution layers.
  4. The motion control method based on a vision and motion generation model according to any one of claims 1-3, characterized in that the method further comprises: constructing a base motion generation model network architecture and a training data set, feeding the training data set into the network architecture, and obtaining the model-predicted motion of the network architecture; computing, based on a behavioral cloning algorithm, the mean squared error loss between the model-predicted motion and the human demonstration motion in the training data set; discriminating between the model-predicted motion and the human demonstration motion with a discriminator to obtain a discrimination result; constructing a joint optimization loss function from the mean squared error loss and the discrimination result, and performing imitation-learning pre-training of the network architecture based on the joint optimization loss function to obtain a pre-trained model; in a preset simulation environment, taking the pre-trained model as the initial motion policy, constructing a weighted reward function, and iteratively updating the initial motion policy based on a proximal policy optimization algorithm and the weighted reward function to obtain a fine-tuned model; and validating the fine-tuned model on a preset validation data set to obtain a validation result, and determining the fine-tuned model to be the base motion generation model when the validation result indicates that the fine-tuned model meets preset model requirements, wherein the model requirements comprise task execution precision requirements, motion coordination requirements, and physical constraint requirements.
  5. The motion control method based on a vision and motion generation model according to any one of claims 1-3, characterized in that the image information comprises binocular synchronized RGB image information, and performing scene perception analysis on the image information to obtain the multimodal scene representation vector corresponding to the object to be operated comprises: performing instance segmentation on the binocular synchronized RGB image information to obtain 2D masks of the objects relevant to the dual-arm robot, performing stereo matching on the binocular synchronized RGB image information to obtain a dense depth map, and performing depth completion on the dense depth map according to the 2D masks to obtain a three-dimensional scene point cloud, wherein the objects comprise the robot's two arms, the object to be operated, and obstacles; identifying object information from the three-dimensional scene point cloud, determining scene graph nodes from the object information, constructing an undirected scene graph based on the scene graph nodes and the spatial distances between them, and determining the task level and the dual-arm cooperation relationship from the undirected scene graph; extracting RGB appearance features from the binocular synchronized RGB image information, and computing a first weight coefficient for the RGB appearance features and a second weight coefficient for the three-dimensional scene point cloud according to the task level and the dual-arm cooperation relationship; and fusing the RGB appearance features and the three-dimensional scene point cloud according to the first and second weight coefficients to obtain a fused feature vector, and normalizing the fused feature vector to obtain the multimodal scene representation vector corresponding to the object to be operated, wherein the multimodal scene representation vector encodes the pose of the object to be operated, obstacle information, scene context, the task level, and the dual-arm cooperation relationship.
  6. The motion control method based on a vision and motion generation model according to any one of claims 1-3, characterized in that dynamically modulating the preset base motion generation model according to the physical parameters to obtain the motion generation model matched to the dual-arm robot comprises: encoding the physical parameters to generate a condition vector; determining the standard spatio-temporal features of the base motion generation model, and generating scaling and offset factors for the standard spatio-temporal features from the condition vector; performing a layer normalization operation on the standard spatio-temporal features, and dynamically modulating them according to the normalization result, the scaling factor, and the offset factor to obtain modulated spatio-temporal features; and connecting the modulated spatio-temporal features to the standard spatio-temporal features through a residual connection to obtain fused spatio-temporal features, and fine-tuning the base motion generation model corresponding to the fused spatio-temporal features via gradient updates within a model-agnostic meta-learning framework to obtain the motion generation model matched to the dual-arm robot.
  7. The motion control method based on a vision and motion generation model according to claim 2 or 3, characterized in that the motion primitive library comprises a plurality of basic motion primitives, and generating a motion instruction matched to the operation task according to the motion primitives stored in the motion primitive library and the spatio-temporal features comprises: computing, based on the semantic information related to the operation task corresponding to the task instruction, the similarity between the spatio-temporal features and each basic motion primitive, and selecting the basic motion primitives whose similarity exceeds a preset similarity threshold as control motion primitives to obtain a control motion primitive set; constructing, based on the spatial relations between the dual-arm joints and the temporal evolution rules, temporal ordering constraints for the control motion primitive set, and ordering the set according to those constraints to obtain a motion primitive sequence such that the acceleration fluctuation of the corresponding dual-arm joint motion is minimized; and mapping the motion primitive sequence to joint-angle variations, and generating the motion instruction matched to the operation task from the joint-angle variations and the constraint parameters of the dual-arm robot, wherein the constraint parameters comprise joint limit parameters and motion speed parameters, and the motion instruction comprises dual-arm cooperation sequential logic, joint motion amplitude, and motion execution duration.
  8. A motion control apparatus based on a vision and motion generation model, the apparatus comprising: an acquisition module, configured to acquire image information of an object to be operated by a dual-arm robot and to perform scene perception analysis on the image information to obtain a multimodal scene representation vector corresponding to the object to be operated, the acquisition module being further configured to acquire physical parameters of the dual-arm robot, as well as a task instruction for the object to be operated and the dual-arm state of the robot; a modulation module, configured to dynamically modulate a preset base motion generation model according to the physical parameters to obtain a motion generation model matched to the dual-arm robot, wherein the physical parameters comprise at least one of arm length parameters, mass distribution parameters, joint limit parameters, and joint transmission ratio parameters; a generation module, configured to input the multimodal scene representation vector, the task instruction, and the dual-arm state into the motion generation model to generate a dual-arm joint-angle control sequence; and a control module, configured to control the dual-arm robot to execute the operation task corresponding to the task instruction according to the dual-arm joint-angle control sequence.
  9. A motion control apparatus based on a vision and motion generation model, the apparatus comprising: a memory storing executable program code; and a processor coupled to the memory, wherein the processor invokes the executable program code stored in the memory to perform the motion control method based on a vision and motion generation model according to any one of claims 1-7.
  10. A computer storage medium storing computer instructions which, when invoked, perform the motion control method based on a vision and motion generation model according to any one of claims 1-7.
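The sketches below illustrate individual claims; every function and class name is a hypothetical placeholder, and each implementation choice beyond the claim text is an assumption. Claim 2's output layer smooths the decoder's raw joint-angle sequence before it is issued to the robot; a minimal sketch, assuming a moving-average filter (the claim specifies only smoothing filtering):

```python
import numpy as np

def smooth_joint_sequence(raw_angles: np.ndarray, window: int = 5) -> np.ndarray:
    """Low-pass filter a (T, J) sequence of joint angles over T time steps."""
    kernel = np.ones(window) / window
    # Edge-pad in time so the smoothed sequence keeps its original length.
    padded = np.pad(raw_angles, ((window // 2, window // 2), (0, 0)), mode="edge")
    # Filter each joint channel independently to suppress high-frequency jitter.
    return np.stack(
        [np.convolve(padded[:, j], kernel, mode="valid") for j in range(raw_angles.shape[1])],
        axis=1,
    )

# Example: smooth a 50-step sequence for a 14-DoF dual-arm robot.
smoothed = smooth_joint_sequence(np.random.randn(50, 14))
```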
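Claim 3 computes cross-modal attention between the global visual features and the spatio-temporal joint features, then uses the resulting association weights to re-weight the fused features. A sketch assuming scaled dot-product attention (the claim does not fix the attention form):

```python
import torch
import torch.nn.functional as F

def cross_modal_fuse(visual_feat: torch.Tensor, joint_feat: torch.Tensor) -> torch.Tensor:
    """visual_feat: (T, D) global visual features; joint_feat: (T, D) joint features."""
    # Association weights between the two modalities (scaled dot-product).
    attn = F.softmax(visual_feat @ joint_feat.T / visual_feat.shape[-1] ** 0.5, dim=-1)
    # Re-weight the joint stream by its relevance to the visual stream, then fuse.
    return torch.cat([visual_feat, attn @ joint_feat], dim=-1)  # (T, 2D)
```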
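Claim 4 pre-trains the base model with a joint objective: a behavioral-cloning mean-squared-error term plus a discriminator-based term. A sketch assuming a generator-side binary cross-entropy adversarial loss and an assumed weighting coefficient:

```python
import torch
import torch.nn.functional as F

def joint_pretrain_loss(pred_motion: torch.Tensor,
                        demo_motion: torch.Tensor,
                        discriminator: torch.nn.Module,
                        adv_weight: float = 0.1) -> torch.Tensor:
    """Behavioral-cloning MSE plus an adversarial realism term (claim 4)."""
    bc_loss = F.mse_loss(pred_motion, demo_motion)
    # The discriminator is assumed to output the probability that its input
    # is human demonstration motion; push predictions toward "real" (label 1).
    realism = discriminator(pred_motion)
    adv_loss = F.binary_cross_entropy(realism, torch.ones_like(realism))
    return bc_loss + adv_weight * adv_loss
```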
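Claim 5 fuses RGB appearance features with point-cloud features using task-dependent weight coefficients and normalizes the result. A sketch with the coefficients passed in directly and an assumed L2 normalization:

```python
import numpy as np

def fuse_scene_features(rgb_feat: np.ndarray, cloud_feat: np.ndarray,
                        w_rgb: float, w_cloud: float) -> np.ndarray:
    """Weighted concatenation of appearance and geometry features (claim 5)."""
    fused = np.concatenate([w_rgb * rgb_feat, w_cloud * cloud_feat])
    return fused / (np.linalg.norm(fused) + 1e-8)  # normalization step
```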
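Claim 6's dynamic modulation resembles feature-wise conditioning: the physical parameters are encoded into a condition vector that yields per-channel scale and offset factors applied to layer-normalized features, with a residual connection back to the unmodulated features. A sketch of the modulation step only (the meta-learning fine-tuning is omitted); layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class PhysicalParamModulation(nn.Module):
    """Condition the base model's features on robot physical parameters (claim 6)."""

    def __init__(self, n_params: int, feat_dim: int):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(n_params, 64), nn.ReLU())
        self.to_scale = nn.Linear(64, feat_dim)   # scaling factors
        self.to_offset = nn.Linear(64, feat_dim)  # offset factors
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, features: torch.Tensor, physical_params: torch.Tensor) -> torch.Tensor:
        cond = self.encode(physical_params)  # condition vector
        scale, offset = self.to_scale(cond), self.to_offset(cond)
        modulated = (1 + scale) * self.norm(features) + offset
        return features + modulated          # residual connection
```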
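Claim 7 screens the primitive library by similarity against the spatio-temporal features. A sketch assuming cosine similarity and an arbitrary threshold (the claim requires only a similarity measure and a preset threshold):

```python
import numpy as np

def select_primitives(st_feature: np.ndarray,
                      primitive_lib: dict[str, np.ndarray],
                      threshold: float = 0.7) -> list[str]:
    """Keep primitives whose similarity to the features exceeds the threshold."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return [name for name, vec in primitive_lib.items()
            if cosine(st_feature, vec) > threshold]
```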

Description

Motion control method and device based on vision and motion generation model

Technical Field

The invention relates to the technical field of robot control, and in particular to a motion control method and device based on a vision and motion generation model.

Background

Dual-arm motion generation for humanoid robots is a core research topic in artificial intelligence and robot imitation learning. Its aim is for the robot to autonomously generate accurate, coordinated, and safe dual-arm motion by observing human demonstrations, so as to complete complex tasks such as grasping and placing. Existing schemes face several bottlenecks in practice. At the perception layer, purely visual imitation learning mostly relies on monocular vision; lacking depth information, it cannot accurately reconstruct the three-dimensional scene, and the coupling constraints between dual-arm coordination and whole-body balance are not jointly modeled. The robot's understanding of the operation space is therefore ambiguous, which degrades end-effector precision and obstacle avoidance. At the motion generation layer, existing methods depend on accurate models and environment parameters and struggle to produce natural, human-like coordinated motion; mainstream Transformer models can extract spatio-temporal features but do not characterize the coupled relation between joint topology and temporal motion, so long sequences are prone to coordination drift, and the inter-arm coupling and self-collision problems are not handled effectively. Various improvements have been attempted: some schemes improve visual perception accuracy by attaching markers to the human body, but the extra sensing devices increase system complexity and cost; others combine Transformers with conventional graph convolutional networks to optimize feature extraction, but without a task-adaptive network structure they cannot dynamically model the dual-arm cooperation relationship. A technical scheme that accurately perceives three-dimensional scenes and improves the naturalness, coordination, and accuracy of the robot's dual-arm motion is therefore needed.

Disclosure of Invention

The invention provides a motion control method and device based on a vision and motion generation model, which can accurately perceive a three-dimensional scene and improve the naturalness, coordination, and accuracy of a robot's dual-arm motion.
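Read end to end, the first aspect below is a perceive-adapt-generate-execute loop. A minimal sketch; every name here is a hypothetical placeholder for the patent's functional modules, not an identifier from the source:

```python
def control_step(robot, perceive, modulate, base_model, image, task_instruction):
    # Scene perception analysis -> multimodal scene representation vector.
    scene_vec = perceive(image)
    # Dynamic modulation of the base model with the robot's physical
    # parameters (arm lengths, mass distribution, joint limits, gear ratios).
    model = modulate(base_model, robot.physical_params)
    # Generate the dual-arm joint-angle control sequence and execute it.
    angle_seq = model(scene_vec, task_instruction, robot.arm_state)
    robot.execute(angle_seq)
```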
To solve the above technical problem, a first aspect of the invention discloses a motion control method based on a vision and motion generation model, the method comprising: acquiring image information of an object to be operated by a dual-arm robot, and performing scene perception analysis on the image information to obtain a multimodal scene representation vector corresponding to the object to be operated; acquiring physical parameters of the dual-arm robot, and dynamically modulating a preset base motion generation model according to the physical parameters to obtain a motion generation model matched to the dual-arm robot, wherein the physical parameters comprise at least one of arm length parameters, mass distribution parameters, joint limit parameters, and joint transmission ratio parameters; acquiring a task instruction for the object to be operated and the dual-arm state of the dual-arm robot, and inputting the multimodal scene representation vector, the task instruction, and the dual-arm state into the motion generation model to generate a dual-arm joint-angle control sequence; and controlling the dual-arm robot to execute the operation task corresponding to the task instruction according to the dual-arm joint-angle control sequence.

As an optional implementation, in the first aspect of the invention, the motion generation model includes a feature fusion layer, a hybrid encoder-decoder, and an output layer, where the hybrid encoder-decoder includes an encoder, a motion primitive library, and a decoder; inputting the multimodal scene representation vector, the task instruction, and the dual-arm state into the motion generation model to generate a dual-arm joint-angle control sequence comprises: concatenating the multimodal scene representation vector, the task instruction, and the dual-arm state through the feature fusion layer to obtain initial context features; extracting, through a multi-layer spatio-temporal convolutional network in the encoder, spatio-temporal features of dual-arm cooperation from the initial context features, wherein the spatio-temporal features comprise the spatial relations