
CN-122008227-A - Multi-stage curriculum-assisted robot action learning method

CN122008227A

Abstract

A multi-stage curriculum-assisted robot action learning method belongs to the technical field of robot motion control. The invention solves the problems that existing methods still find it difficult to achieve accurate control of the jump distance and landing point, have poor real-time performance, and generalize poorly across multiple jump targets. Staged reward shaping is combined with a whole-body coordination guidance mechanism so that the humanoid robot forms stable angular-momentum management and body-coordination patterns in each key phase, improving jumping performance and landing-point precision and markedly enhancing posture recovery and disturbance resistance in high-impact scenarios. Two core technologies, torque control-domain optimization and friction identification and compensation, are provided so that the training policy is subjected, already in the simulation stage, to dynamic constraints consistent with real hardware. By constructing a continuous and differentiable torque-angular-velocity feasible-region model and identifying friction parameters based on an improved Stribeck model, the real-machine feasibility of the policy is greatly improved. The method can be applied to robot motion control.

Inventors

  • SHAO XIANGYU
  • ZHU DONG
  • LI BINGCHEN
  • SUN GUANGHUI
  • YAO WEIRAN

Assignees

  • 哈尔滨工业大学 (Harbin Institute of Technology)
  • 成都川哈工机器人及智能装备产业技术研究院有限公司 (Chengdu Chuanhagong Robot and Intelligent Equipment Industry Technology Research Institute Co., Ltd.)

Dates

Publication Date
2026-05-12
Application Date
2026-03-20

Claims (10)

  1. A multi-stage curriculum-assisted robot action learning method, the method comprising the following steps. Step one: construct the input state vectors of the Actor network and the Critic network in an Actor-Critic architecture, wherein the input state vector of the Actor network consists of observations obtained through proprioceptive sensing, and the input state vector of the Critic network is privileged information. Step two: train the Actor network and the Critic network through three curriculum stages, namely a mastery stage, an expansion stage, and a deployment stage. Training begins in the mastery stage; the condition for switching from the mastery stage to the expansion stage is that the average survival time of the humanoid robot in the environment reaches a set threshold. The condition for switching from the expansion stage to the deployment stage is that the terrain curriculum and the command curriculum of the expansion stage have stably converged. After the deployment stage is executed, training returns to the expansion stage and continues; once training has converged, the trained Actor network is obtained. Step three: take the actual state vector of the humanoid robot as the input of the trained Actor network, output actions through the trained Actor network, and convert the output actions into control commands for the humanoid robot. (A minimal sketch of this stage-switching logic appears after the claims.)
  2. The multi-stage curriculum-assisted robot action learning method of claim 1, wherein the input vector of the Actor network comprises: the control command of the current step, consisting of the desired position of the left foot, the desired position of the right foot, and the desired landing direction; the future reference motion of the current step, consisting of the reference actions at the 3rd, 12th, and 21st future steps, each reference action including a reference position and a reference velocity for every joint; and the historical motion and feedback information of the current step, consisting of the current observation together with 14 steps of historical observations (each comprising the Euler angles and angular velocity of the floating base and the actual position and actual velocity of every joint) and a 15-step history of actions, an action being the desired joint positions of the corresponding step. The input vector of the Critic network comprises the privileged information of the current step and the privileged information of the historical steps. (See the observation-assembly sketch after the claims.)
  3. The multi-stage curriculum-assisted robot action learning method of claim 2, wherein the total loss function used in the Actor-Critic training process is L_total = L_policy + λ_v · L_value + λ_s · L_sym, where L_policy is the policy loss, L_value is the value-function loss, L_sym is the symmetry loss, and λ_v and λ_s are hyperparameters. The symmetry loss is L_sym = E[ ‖ π(M_o(o)) − M_a(π(o)) ‖₂² ], where M_a(·) is the function mapping actions to their mirrored counterparts, M_o(·) is the function mapping the input vector o to its mirrored counterpart, π denotes the policy of the Actor network, ‖·‖₂ is the L2 norm, and E[·] denotes the expectation. (See the symmetry-loss sketch after the claims.)
  4. The multi-stage curriculum-assisted robot action learning method of claim 3, wherein the environment termination conditions of the mastery stage and the expansion stage are: condition (1), the time-step upper limit set for the environment is reached; condition (2), the actual height of the base falls below a set threshold; condition (3), the foot-end trajectory tracking error exceeds a threshold, the error being the absolute difference between the current foot height and the reference foot height of the motion, corrected for the undulation height of the actual ground relative to the horizon; condition (4), after landing, the deviation of the actual position of the humanoid robot base from the desired position, or the deviation of the actual orientation from the desired orientation, exceeds a threshold, where the actual position is the vector of the base's actual positions along the horizontal axes, the desired position is the corresponding vector of desired positions, and the orientation deviation is measured between the actual yaw angle and the desired yaw angle of the base. When condition (1), (2), (3), or (4) is satisfied, the current environment episode is terminated. (See the termination-check sketch after the claims.)
  5. The multi-stage curriculum-assisted robot action learning method of claim 4, wherein the training terrain of the mastery stage is flat ground; the training terrain of the expansion stage comprises random uniform terrain and discrete-obstacle terrain, and terrain upgrade and downgrade conditions are set for the expansion stage. The terrain upgrade condition is that the survival time of the humanoid robot on the current terrain exceeds 15 seconds; the terrain downgrade condition is that the survival time of the humanoid robot on the current terrain is below 4 seconds, or condition (2) is reached, or condition (3) is reached. (See the terrain-curriculum sketch after the claims.)
  6. The multi-stage curriculum-assisted robot action learning method of claim 5, wherein the reward function used in the Actor-Critic training process is the sum of a style-imitation reward, a task-completion reward, a guidance reward, and a smoothness reward. The style-imitation reward depends on: the vector of actual joint positions and the vector of reference joint positions; the actual hand positions and the reference hand positions of the humanoid robot; the vector of actual foot positions along the z-axis and the vector of reference foot positions along the z-axis; the actual and reference z-axis positions of the humanoid robot base; a foot-contact reward; and a foot-velocity reward. The foot-contact and foot-velocity rewards are gated by phase indicator variables and computed from the normalized ground reaction forces of the left and right feet and from the left-foot and right-foot velocities.
  7. The multi-stage curriculum-assisted robot action learning method of claim 6, wherein the task-completion reward depends on: the vector of the actual x-axis and y-axis positions of the base and the vector of the reference x-axis position and desired y-axis position of the base; the vector of the actual yaw, pitch, and roll angles of the base and the vector of the desired yaw, pitch, and roll angles of the base; a foot-tracking reward; and an upright-posture reward. The foot-tracking reward is computed from the distance of the left foot to its target landing point and the distance of the right foot to its target landing point, with a scaling factor and per-foot indicator values. The upright-posture reward is computed from the actual x-axis positions of the base and of the head of the humanoid robot, together with an intermediate variable formed from the actual head position vector, the actual base position vector, and the vertically upward unit vector of the world coordinate system. (See the reward sketch after the claims.)
  8. The multi-stage curriculum-assisted robot action learning method of claim 7, wherein the guidance reward comprises a term depending on the actual z-axis velocity of the base, a torso-guidance reward, and a centroid-guidance reward. The torso-guidance reward is computed from a control input, the neutral torso angle, an input scaling factor, the torso tilt angle measured in real time, the lower and upper bounds of the torso tilt angle, an adaptive parameter, intermediate variables, a threshold of the torso tilt angle, and the inverse hyperbolic tangent function. The centroid-guidance reward is computed from indicator functions and intermediate variables, each indicator taking the value 1 when its condition holds and 0 otherwise; it depends on the displacement of the centroid relative to the support-polygon center along the relevant axis, which is defined from the desired axial position of the base, the axial position of the support-polygon center, and the axial position of the centroid, together with a sign function, an activation function, and a direction-dependent normalization coefficient whose value equals the distance from the support-polygon center to the support-polygon boundary along the corresponding axis.
  9. The multi-stage curriculum-assisted robot action learning method of claim 8, wherein the smoothness reward depends on: the vector of joint actions at the current step, the vectors of joint actions at the two preceding steps, the torque of each joint, the actual velocity of each joint, and the actual acceleration of each joint. (See the smoothness sketch after the claims.)
  10. The multi-stage curriculum-assisted robot action learning method of claim 9, wherein, during real-machine testing, the friction parameters of each joint of the humanoid robot are identified and the humanoid robot dynamics equation is compensated based on the identified friction parameters, specifically as follows. Step 1: construct an objective function for friction-parameter identification using an error-minimization criterion, where the vector of friction parameters to be identified consists of the estimated Coulomb friction coefficient, the estimated maximum static friction coefficient, the estimated Stribeck velocity threshold, and the estimated viscous friction coefficient, and the friction-torque error is the difference between the actual friction torque of each joint and the identified friction torque. The identified friction torque follows an improved Stribeck model. In the classical Stribeck friction model the friction torque is τ_f(ω) = [ τ_c + (τ_s − τ_c) · e^(−(|ω|/ω_s)^δ) ] · sgn(ω) + b·ω, where τ_c is the Coulomb friction coefficient, τ_s is the maximum static friction coefficient, e is the base of the natural logarithm, ω is the actual angular velocity of the joint, |·| denotes the absolute value, ω_s is the Stribeck velocity threshold, b is the viscous friction coefficient, sgn(·) is the sign function, and δ is the empirical attenuation constant of the Stribeck curve; the improved model additionally sets an angular-velocity threshold for each joint. The friction torque of each humanoid robot joint can be calculated on this basis. Joint angular-velocity and torque data are collected, and the friction-compensated humanoid robot dynamics equation is established as M(q)·q̈ + C(q, q̇)·q̇ + G(q) + τ_f = τ, where τ is the joint output torque, τ_f is the compensated friction torque, q, q̇, and q̈ are the actual position, velocity, and acceleration vectors of the humanoid robot joints, M(q) is the inertia matrix of the humanoid robot, C(q, q̇) is the matrix of velocity terms related to centrifugal and Coriolis forces, and G(q) is the gravity term. The final actuator torque feasible-region model and the compensated humanoid robot dynamics model are then returned to the expansion stage for retraining until the expansion-stage training converges, yielding the final trained model. The final torque feasible-region model of the actuator is obtained as follows. Step S1: construct a motor torque-angular-velocity capability model for each joint actuator of the humanoid robot, synthesizing the peak torque-angular-velocity boundary of the motor into an initial feasible region that reflects real hardware performance. Step S2: correct the model with real-machine test data: collect torque data at different angular velocities and loads, fit the torque-capability attenuation caused by magnetic saturation, current limiting, and friction, and smoothly degrade the initial feasible region into a motor torque-angular-velocity boundary curve that matches the capability of the real actuator, thereby obtaining the final torque feasible-region model. (See the friction-identification and torque-feasible-region sketches after the claims.)
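The following sketches illustrate the mechanisms recited in the claims; they are editorial Python examples, not the patent's implementation. First, the three-stage curriculum switching of claim 1. The class name, the survival-time threshold value, and the convergence flags are assumptions; only the switching conditions come from the claim.

    # Minimal sketch of the three-stage curriculum switching in claim 1.
    # The threshold value and convergence flags are assumptions.
    class CurriculumScheduler:
        MASTERY, EXPANSION, DEPLOYMENT = "mastery", "expansion", "deployment"

        def __init__(self, survival_threshold_s=8.0):  # assumed threshold
            self.stage = self.MASTERY
            self.survival_threshold_s = survival_threshold_s

        def update(self, mean_survival_time_s, terrain_converged,
                   command_converged, deployment_done):
            if self.stage == self.MASTERY:
                # Switch to expansion once average survival time reaches the threshold.
                if mean_survival_time_s >= self.survival_threshold_s:
                    self.stage = self.EXPANSION
            elif self.stage == self.EXPANSION:
                # Switch to deployment once terrain and command curricula converge.
                if terrain_converged and command_converged:
                    self.stage = self.DEPLOYMENT
            elif self.stage == self.DEPLOYMENT and deployment_done:
                # Return to expansion after deployment and keep training.
                self.stage = self.EXPANSION
            return self.stage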
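Claim 2's Actor input can be assembled by concatenating the command, the future reference motion, and the history buffers. A minimal numpy sketch; field names and dimensions are assumptions, while the future-step offsets (3, 12, 21) and the history lengths follow the claim.

    import numpy as np

    def actor_input(command, reference_motion, obs_history, action_history, t):
        # Control command: desired left/right foot positions and landing direction.
        c = np.concatenate([command["p_left"], command["p_right"],
                            [command["landing_yaw"]]])
        # Future reference actions at steps t+3, t+12, t+21
        # (reference position and velocity of every joint).
        r = np.concatenate([reference_motion[t + k].ravel() for k in (3, 12, 21)])
        # Current observation plus the 14-step history, and the last 15 actions.
        h = np.concatenate([np.asarray(obs_history[-15:]).ravel(),
                            np.asarray(action_history[-15:]).ravel()])
        return np.concatenate([c, r, h])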
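Claim 3's total loss adds a mirror-symmetry term to the usual policy and value losses. A numpy sketch; the weights and the mirror matrices M_obs and M_act are assumptions, the structure follows the claim.

    import numpy as np

    def symmetry_loss(policy, obs, M_obs, M_act):
        # || pi(M_o(o)) - M_a(pi(o)) ||_2^2, averaged over the batch.
        mirrored = policy(obs @ M_obs.T)     # action for mirrored observation
        reflected = policy(obs) @ M_act.T    # mirrored action for original observation
        return np.mean(np.sum((mirrored - reflected) ** 2, axis=-1))

    def total_loss(L_policy, L_value, L_sym, w_value=0.5, w_sym=1.0):
        # w_value and w_sym stand in for the claim's hyperparameters.
        return L_policy + w_value * L_value + w_sym * L_sym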
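Claim 4's four termination conditions reduce to boolean checks. The threshold values below are placeholders; only the structure of the four conditions comes from the claim.

    import numpy as np

    def should_terminate(step, max_steps, base_height, foot_track_err,
                         base_pos_xy, desired_pos_xy, yaw, desired_yaw, landed,
                         min_base_height=0.4, foot_err_max=0.1,
                         pos_err_max=0.3, yaw_err_max=0.5):
        cond1 = step >= max_steps                      # (1) time-step upper limit
        cond2 = base_height < min_base_height          # (2) base height too low
        cond3 = foot_track_err > foot_err_max          # (3) foot-tracking error
        cond4 = landed and (                           # (4) landing deviation
            np.linalg.norm(base_pos_xy - desired_pos_xy) > pos_err_max
            or abs(yaw - desired_yaw) > yaw_err_max)
        return cond1 or cond2 or cond3 or cond4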
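Claim 5's terrain curriculum promotes or demotes terrain difficulty by survival time. The 15 s and 4 s thresholds come from the claim; the level bookkeeping is an assumption.

    def update_terrain_level(level, survival_time_s, cond2, cond3, max_level=10):
        if survival_time_s > 15.0:                   # terrain upgrade condition
            return min(level + 1, max_level)
        if survival_time_s < 4.0 or cond2 or cond3:  # terrain downgrade condition
            return max(level - 1, 0)
        return level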
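Two task-completion terms from claim 7, sketched in numpy. The claim names the quantities each term depends on, but the published formulas were lost in extraction; the exponential kernel and the dot-product alignment used below are common choices and are assumptions.

    import numpy as np

    def foot_tracking_reward(d_left, d_right, in_left, in_right, k=5.0):
        # Distance of each foot to its target landing point, scaled by an
        # assumed factor k and gated by per-foot indicator values (claim 7).
        return in_left * np.exp(-k * d_left) + in_right * np.exp(-k * d_right)

    def upright_posture_reward(p_head, p_base):
        # Alignment of the base-to-head axis with the world's vertically
        # upward unit vector (claim 7); 1.0 when perfectly upright.
        axis = p_head - p_base
        axis = axis / (np.linalg.norm(axis) + 1e-8)
        return float(axis @ np.array([0.0, 0.0, 1.0]))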
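Claim 9's smoothness reward penalizes action differences, torques, velocities, and accelerations. A sketch with assumed weights; the penalized quantities follow the claim.

    import numpy as np

    def smoothness_reward(a_t, a_prev, a_prev2, tau, qd, qdd,
                          w1=1e-2, w2=1e-2, w3=1e-5, w4=1e-5, w5=1e-7):
        return -(w1 * np.sum((a_t - a_prev) ** 2)                  # action rate
                 + w2 * np.sum((a_t - 2 * a_prev + a_prev2) ** 2)  # action smoothness
                 + w3 * np.sum(tau ** 2)                           # joint torques
                 + w4 * np.sum(qd ** 2)                            # joint velocities
                 + w5 * np.sum(qdd ** 2))                          # joint accelerations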
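Claim 10's friction identification fits the classical Stribeck model to measured joint data. A sketch using scipy's least-squares solver as the error-minimizing criterion; the attenuation constant value, the initial guess, and the solver choice are assumptions.

    import numpy as np
    from scipy.optimize import least_squares

    DELTA = 2.0  # assumed value of the Stribeck attenuation constant

    def stribeck_torque(params, w):
        # tau_f = [tau_c + (tau_s - tau_c) * exp(-(|w|/w_s)**DELTA)] * sgn(w) + b*w
        tau_c, tau_s, w_s, b = params
        return ((tau_c + (tau_s - tau_c) * np.exp(-(np.abs(w) / w_s) ** DELTA))
                * np.sign(w) + b * w)

    def identify_friction(w_meas, tau_f_meas):
        # Minimize the friction-torque error over (tau_c, tau_s, w_s, b).
        residual = lambda p: stribeck_torque(p, w_meas) - tau_f_meas
        x0 = np.array([1.0, 2.0, 0.1, 0.5])  # assumed initial guess
        return least_squares(residual, x0, bounds=(1e-6, np.inf)).x

The identified torque then enters the compensated dynamics of claim 10, M(q)·q̈ + C(q, q̇)·q̇ + G(q) + τ_f = τ, as a feedforward term.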
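Finally, a sketch of the torque-angular-velocity feasible region of claim 10, steps S1-S2. The claim asks for a boundary that holds peak torque at low speed and degrades smoothly at high speed; the sigmoid roll-off and all numeric values below are assumptions standing in for the curve fitted from real-machine data.

    import numpy as np

    def torque_limit(w, tau_peak=120.0, w_corner=10.0, w_max=25.0, steepness=10.0):
        # Smooth, differentiable boundary: ~tau_peak below w_corner, rolling
        # off toward zero near w_max (magnetic saturation, current limiting).
        x = (np.abs(w) - w_corner) / max(w_max - w_corner, 1e-6)
        return tau_peak / (1.0 + np.exp(steepness * (x - 0.5)))

    def clamp_torque(tau_cmd, w):
        # Project a commanded torque into the feasible region at velocity w.
        lim = torque_limit(w)
        return np.clip(tau_cmd, -lim, lim)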

Description

Multi-stage curriculum-assisted robot action learning method

Technical Field

The invention belongs to the technical field of robot motion control, and particularly relates to a multi-stage curriculum-assisted robot action learning method.

Background

Traversing complex terrain with a humanoid robot in a flexible and robust manner is a very challenging research direction. Current research focuses mainly on enhancing the walking and running capabilities of humanoid robots on unstructured terrain, while jumping plays an irreplaceable role in crossing large vertical or horizontal obstacles (such as ravines, steps, or rocks). However, jump-related research remains relatively limited and presents challenges: a humanoid robot needs to generate sufficient ground reaction forces during the take-off phase, handle underactuated dynamics during the flight phase, and quickly recover stability after being subjected to high impact forces during the landing phase. In addition, if the robot needs to land at a designated position, its angular momentum must be accurately regulated so that high-precision landing-point control can be achieved.

Model-based control and optimization methods have achieved significant results in the field of humanoid robot jumping. For example, Qi et al., by combining centroid trajectory optimization with angular momentum control, enabled a humanoid robot to complete a 50 cm vertical jump, and the Boston Dynamics Atlas humanoid robot can even complete a backflip. Such jumps typically employ a hierarchical optimization framework: the jump trajectory is first planned offline based on whole-body dynamics and a contact model, and simplified system dynamics are then used for online control. However, this approach has inherent limitations: each new jump target requires re-solving the trajectory, and, owing to the high computational complexity of the optimization, trajectory planning can usually only be performed offline, making real-time adaptation difficult to achieve.

In recent years, data-driven reinforcement learning methods have made significant progress in complex humanoid robot control tasks. For short-distance continuous small jumps, one class of methods exhibits strong dynamic performance with a simple reward design, but finds it difficult to achieve accurate control of the jump distance and landing point. Another paradigm imitates high-fidelity reference motions so that the humanoid robot can replicate long-distance, large-amplitude jumps. However, this type of approach has difficulty with autonomous task execution because it is highly dependent on target-specific reference actions, which not only reduces task generalization capability but also requires large amounts of expensive motion-capture data to obtain a diverse set of reference jumps. In generalized motion schemes based on a single reference motion, curriculum learning has shown good results on bipedal robots (e.g., Cassie). However, in the humanoid robot setting, the upper body has a large mass and a high moment of inertia, so angular momentum effects are significant and a stronger whole-body coordination ability is required to realize stable and powerful jumping.
When the jump strategy is transferred to a real humanoid robot, the joint actuators need to complete high-speed motion and output large torques in a very short time, which further amplifies the dynamics gap between simulation and reality, so that conventional methods struggle to achieve stable Sim2Real (simulation-to-reality) transfer. In summary, existing methods still find it difficult to achieve accurate control of the jump distance and landing point, suffer from poor real-time performance, and generalize poorly across multiple jump targets. A control strategy that achieves accurate control of the humanoid robot's jump distance and landing point, generalizes across multiple jump targets, and preserves both the real-time performance of the control process and the whole-body coordination of the humanoid robot is therefore an urgent problem to be solved.

Disclosure of Invention

The invention aims to solve the problems that existing methods still find it difficult to achieve accurate control of the jump distance and landing point, have poor real-time performance, and generalize poorly across multiple jump targets, and provides a multi-stage curriculum-assisted robot action learning method. The technical scheme adopted by the invention to solve these technical problems is a multi-stage curriculum-assisted robot action learning method comprising the following steps: firstly, respectively constructing the input state vectors of the Actor network and the Critic network in an Actor-Critic architecture, wherein the input state vector of the Actor network consists of observations obtained through proprioceptive sensing, and the input state vector of the Critic network is privileged information.