CN-121973222-A - Point-to-point trajectory optimization method for a fixed-base robotic arm based on the proximal policy optimization algorithm
Abstract
The point-to-point trajectory optimization method for a fixed-base robotic arm based on the proximal policy optimization (PPO) algorithm addresses the high-dimensional continuous action space that arises when deep reinforcement learning is applied directly to the end-position arrival task of a fixed-base robotic arm, and belongs to the fields of robot operation control and autonomous robotic-arm motion optimization. The method comprises the steps of: establishing a kinematics model of the robotic arm and defining constraint conditions; modeling the end point-to-point arrival task as a sequential decision process; constructing a state vector as the input of a policy network, and taking the end displacement increment direction vector as the action output of the policy network; mapping the end displacement increment direction vector to joint increments through differential inverse kinematics, and performing constraint processing according to the constraint conditions to generate executable joint control commands; and training the policy network with the proximal policy optimization algorithm, obtaining single-step rewards through a multi-objective reward function, and evaluating the policy network's action output according to the single-step rewards until convergence.
Inventors
- SUN CHENGXIN
- ZHONG YUJIE
- ZHANG JIANQIAO
- LIU BOYI
- ZHANG NING
Assignees
- 哈尔滨工业大学 (Harbin Institute of Technology)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-02-10
Claims (10)
- 1. A point-to-point trajectory optimization method for a fixed-base robotic arm based on the proximal policy optimization algorithm, characterized by comprising the following steps: establishing a kinematics model of the robotic arm, and defining constraint conditions including joint limits, joint increment limits and a singular risk threshold; modeling the end point-to-point arrival task as a sequential decision process, and constructing a state vector comprising at least the end position error, the joint state, the target point position and the previous action; taking the state vector as the input of a policy network, and taking the end displacement increment direction vector as the action output of the policy network; mapping the end displacement increment direction vector to a joint increment through differential inverse kinematics, and performing constraint processing according to the constraint conditions to generate an executable joint control command; training the policy network with the proximal policy optimization algorithm, obtaining single-step rewards through a multi-objective reward function, evaluating the policy network's action output according to the single-step rewards, and iteratively optimizing the policy in combination with a curriculum learning mechanism until convergence.
- 2. The method for optimizing a point-to-point trajectory of a fixed-base robotic arm based on a proximal policy optimization algorithm of claim 1, wherein mapping small incremental displacements in the end-effector Cartesian space to joint increments by differential inverse kinematics comprises: computing the desired end displacement increment at the current time as $\Delta p_t = \alpha_t\, d_t$, where $d_t$ is the end displacement increment direction vector at the current time, $\alpha_t$ is the end increment step-size scale, $p_{goal}$ is the position of the target point, and $p_t$ is the current end position; the lower limit of the step size is decreased when approaching the target, to reduce overshoot and oscillation; and solving the differential inverse kinematics by the damped least-squares method, mapping the end displacement increment to the joint increment $\Delta q_t$: $\Delta q_t = J^\top (J J^\top + \lambda^2 I)^{-1} \Delta p_t$, where $J$ is the Jacobian matrix at the current configuration and $\lambda$ is a damping factor; $\lambda$ is increased when the minimum singular value of the Jacobian falls below the singular risk threshold, to suppress divergence of the joint increment.
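The damped least-squares mapping and the target-adaptive step size described in claim 2 can be sketched as follows. This is a minimal illustrative sketch, not the patented implementation: the damping values, the singular-risk threshold and the linear step-size schedule are all assumptions.

```python
import numpy as np

def dls_joint_increment(J, dp, lam=0.05, sigma_min_thresh=0.05, lam_boost=0.5):
    """Map a desired end displacement dp to a joint increment dq via the
    damped least-squares inverse: dq = J^T (J J^T + lam^2 I)^{-1} dp.
    The damping factor is raised when the smallest singular value of J
    falls below the singular-risk threshold, suppressing divergence."""
    sigma_min = np.linalg.svd(J, compute_uv=False).min()
    if sigma_min < sigma_min_thresh:
        lam = lam + lam_boost          # raise damping near a singularity
    m = J.shape[0]
    return J.T @ np.linalg.solve(J @ J.T + lam**2 * np.eye(m), dp)

def step_size(p_goal, p_cur, alpha_max=0.02, alpha_min=0.002, decay_dist=0.05):
    """Scale the unit direction vector output by the policy: the step
    shrinks toward alpha_min near the target to reduce overshoot."""
    d = np.linalg.norm(p_goal - p_cur)
    return alpha_min + (alpha_max - alpha_min) * min(d / decay_dist, 1.0)
```

For a well-conditioned Jacobian and a small damping factor, the mapping reduces to the ordinary pseudoinverse solution; the damping only becomes significant near singular configurations.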
- 3. The fixed-base robotic arm point-to-point trajectory optimization method based on a proximal policy optimization algorithm of claim 2, wherein the multi-objective reward function comprises: a convergence and arrival bonus term, for encouraging the end position to approach the target point; an efficiency and energy-consumption penalty term, for reducing the number of motion steps and suppressing the magnitude of joint variation; a controllability and smoothness penalty term, for constraining the single-step motion amplitude and the rate of motion change; and a safety and executability penalty term, for suppressing joint-limit risk, joint increment clipping and limit-touching events.
- 4. The method for optimizing point-to-point trajectories of a fixed-base robotic arm based on a proximal policy optimization algorithm of claim 3, wherein the convergence and arrival bonus term comprises a progress term $r_{prog}$, an absolute distance term $r_{dist}$ and a terminal arrival reward $r_{goal}$: the progress term is the decrease of the distance error between two adjacent steps, $r_{prog} = e_{t-1} - e_t$; the absolute distance term penalizes the current error $e_t$ scaled by a position scale factor $k_p$; the terminal arrival reward is granted when the end position error falls within $e_{max}$, where $e_{max}$ represents the maximum allowed error between the current end position and the target point position; the efficiency and energy-consumption penalty term comprises a time term $r_{time}$ and an energy term $r_{energy}$; the controllability and smoothness penalty term comprises an action amplitude term $r_{act}$ and a smoothness term $r_{smooth}$; the safety and executability penalty term comprises a limit risk term $r_{limit}$, a touch penalty term $r_{touch}$ and an action distortion term $r_{distort}$; the limit risk term is a continuous function inversely proportional to the limit margin, or a piecewise-quadratic continuous function; one indicator marks a joint velocity change that triggers the clipping constraint, and another indicator marks the joint position or joint motion state touching a mechanical structure limit; the action distortion term compares the commanded end displacement with the end displacement actually executed after dynamics, constraints and clipping; each term of the multi-objective reward function is weighted and then summed to give the single-step reward.
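The weighted-sum structure of the multi-objective reward in claims 3-4 can be sketched as below. The term structure follows the claims, but every weight and every functional form (linear distance penalty, quadratic limit-risk margin, indicator penalties) is an illustrative assumption; the patent fixes only which terms exist.

```python
import numpy as np

def single_step_reward(e_prev, e_cur, dq, dq_prev, dp_cmd, dp_exec,
                       limit_margin, clip_triggered, limit_touched,
                       e_goal=1e-3):
    """Single-step reward: weighted sum of convergence/arrival, efficiency,
    controllability/smoothness and safety/executability terms."""
    w_prog, w_dist, w_goal = 10.0, 1.0, 50.0               # arrival terms
    w_time, w_energy = 0.05, 0.1                           # efficiency terms
    w_act, w_smooth = 0.1, 0.1                             # smoothness terms
    w_limit, w_clip, w_touch, w_distort = 1.0, 0.5, 1.0, 0.5  # safety terms

    r = w_prog * (e_prev - e_cur)                      # progress: error decrease
    r -= w_dist * e_cur                                # absolute distance term
    if e_cur <= e_goal:
        r += w_goal                                    # terminal arrival bonus
    r -= w_time                                        # per-step time penalty
    r -= w_energy * float(np.sum(np.square(dq)))       # joint-change energy
    r -= w_act * float(np.linalg.norm(dq))             # single-step amplitude
    r -= w_smooth * float(np.linalg.norm(dq - dq_prev))        # motion change rate
    r -= w_limit * max(0.0, 0.1 - limit_margin) ** 2   # piecewise-quadratic limit risk
    r -= w_clip * float(clip_triggered)                # increment-clipping event
    r -= w_touch * float(limit_touched)                # mechanical-limit touch
    r -= w_distort * float(np.linalg.norm(dp_cmd - dp_exec))   # commanded vs executed
    return r
```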
- 5. The method for optimizing the point-to-point trajectory of a fixed-base robotic arm based on the proximal policy optimization algorithm according to claim 1, wherein candidate target point positions are obtained by a random sampling strategy within the workspace; for each candidate target point position, several iterations of the differential inverse solution are performed, and if the error threshold is reached within the given number of iterations, the candidate target point position is taken into the training samples as a target point position.
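The feasibility filter of claim 5 (accept a sampled target only if iterative differential IK converges to it within a budget) can be sketched on a toy arm. The 2-DOF planar arm, its link lengths, the damping value and the iteration budget are all assumptions introduced purely for illustration.

```python
import numpy as np

L1, L2 = 0.5, 0.4  # assumed link lengths of a planar 2-DOF arm

def fk(q):
    """Forward kinematics of the planar 2-link arm."""
    return np.array([L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1]),
                     L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])])

def jac(q):
    """Jacobian of the planar 2-link arm."""
    return np.array([
        [-L1 * np.sin(q[0]) - L2 * np.sin(q[0] + q[1]), -L2 * np.sin(q[0] + q[1])],
        [ L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1]),  L2 * np.cos(q[0] + q[1])],
    ])

def is_reachable(p_cand, q0, max_iters=200, tol=1e-4, lam=0.01):
    """Accept a candidate target only if damped-least-squares IK iterations
    reach it within the iteration budget; otherwise reject the sample."""
    q = np.array(q0, dtype=float)
    for _ in range(max_iters):
        err = p_cand - fk(q)
        if np.linalg.norm(err) < tol:
            return True
        J = jac(q)
        q += J.T @ np.linalg.solve(J @ J.T + lam**2 * np.eye(2), err)
    return False
```

Points outside the annular workspace never converge and are filtered out, which keeps unreachable targets out of the training set.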
- 6. The fixed-base robotic arm point-to-point trajectory optimization method based on the proximal policy optimization algorithm of claim 2, wherein performing constraint processing according to the constraint conditions comprises: clipping the joint increment $\Delta q_t$ so that it does not exceed the single-step joint increment upper limit; limiting and truncating the updated joint angle to ensure that it remains within the allowable joint angle range; and, when singular risk is detected, scaling down the desired end displacement increment at the current time while synchronously increasing the damping factor.
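The first two constraint-processing steps of claim 6 (increment clipping, then joint-angle clamping) can be sketched as:

```python
import numpy as np

def apply_constraints(q, dq, q_min, q_max, dq_max):
    """Clip the joint increment to the per-step bound, then clamp the
    updated joint angles to the allowable range; return the new angles
    and the increment actually executed after clamping."""
    dq = np.clip(dq, -dq_max, dq_max)       # single-step increment limit
    q_new = np.clip(q + dq, q_min, q_max)   # joint angle limits
    return q_new, q_new - q
```

Returning the executed increment alongside the clamped angles lets the reward function penalize the gap between the commanded and the executed motion (the action distortion term of claim 4).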
- 7. The fixed-base robotic arm point-to-point trajectory optimization method based on the proximal policy optimization algorithm according to claim 1, wherein training the policy network with the proximal policy optimization algorithm comprises the following steps: S1, initializing the policy network and the value network, and setting the PPO hyperparameters; S2, interactively sampling from the environment with the current policy, collecting trajectory sequences, and caching the collected trajectory sequences together with the action log-probabilities under the old policy, wherein each trajectory sequence comprises the state vector $s_t$ at the current time, the state vector $s_{t+1}$ at the next time, the end displacement increment direction vector $d_t$ at the current time, the single-step reward $r_t$ at the current time, and the termination flag; S3, calculating advantage estimates and return targets based on the collected trajectory sequences and the current predictions of the value network; S4, updating the policy network parameters using the PPO clipped objective function, according to the new-to-old policy ratio and the calculated advantage estimates; S5, updating the value network parameters by taking the calculated return targets as supervision signals and minimizing the error between the value network output and the return targets; S6, repeating steps S2 to S5 until the policy converges.
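Step S3 of claim 7 (advantage estimates and return targets from the collected trajectories and the value network's predictions) is commonly implemented with generalized advantage estimation (GAE); the claim does not name the estimator, so GAE and its gamma/lambda values are assumptions here.

```python
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one collected trajectory:
    advantages A_t and return targets R_t = A_t + V(s_t).
    `values` has length T+1 (bootstrap value for the final state)."""
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        nonterm = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * values[t + 1] * nonterm - values[t]
        gae = delta + gamma * lam * nonterm * gae
        adv[t] = gae
    returns = adv + np.asarray(values[:T])
    return adv, returns
```

The return targets feed step S5 as the supervision signal for the value network, while the advantages feed the clipped objective in step S4.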
- 8. The fixed-base robotic arm point-to-point trajectory optimization method based on the proximal policy optimization algorithm of claim 7, wherein the PPO clipped objective function is: $L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$, where $\theta$ denotes the policy network parameters, $\mathbb{E}_t$ denotes the expectation, $\mathrm{clip}(\cdot)$ denotes the clipping function, $r_t(\theta)$ is the new-to-old policy ratio, $\epsilon$ is the clipping threshold, and $\hat{A}_t$ is the advantage estimate; the new-to-old policy ratio is $r_t(\theta) = \exp\left(\log \pi_\theta(a_t \mid s_t) - \log \pi_{\theta_{old}}(a_t \mid s_t)\right)$, where $\log \pi_\theta(a_t \mid s_t)$ is the log-probability of the action under the new policy and $\log \pi_{\theta_{old}}(a_t \mid s_t)$ is the log-probability of the action under the old policy.
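The clipped surrogate objective of claim 8 translates almost directly into code (the default clipping threshold of 0.2 is an assumption):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    """PPO clipped surrogate objective (to be maximized):
    L = E[min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)],
    with ratio r_t = exp(logp_new - logp_old)."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return float(np.mean(np.minimum(unclipped, clipped)))
```

With identical old and new policies the ratio is 1 and the objective reduces to the mean advantage; large ratios with positive advantages are capped at 1+eps, which is what bounds the policy update in step S4.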
- 9. The fixed-base robotic arm point-to-point trajectory optimization method based on a proximal policy optimization algorithm of claim 7, wherein the curriculum learning mechanism comprises at least one of the following ways of raising difficulty: gradually increasing the distance between the initial end position and the target point; gradually tightening the safety margin of the joint limits; and introducing external disturbances or noise during the motion.
- 10. The method for optimizing a point-to-point trajectory of a fixed-base robotic arm based on a proximal policy optimization algorithm according to claim 7, wherein after interactively sampling from the environment with the current policy, step S2 further comprises determining whether a termination condition is satisfied, the termination condition comprising at least one of: when the end position error falls within the success threshold, judging the task successful, terminating the current episode, and adding a one-off positive reward to the single-step reward; when the accumulated sampling steps of the current episode reach the preset maximum number of steps, or the accumulated execution time exceeds the preset upper time limit, judging a timeout and ending the current episode; and, when a violation of a system safety constraint is detected, immediately terminating the current episode and appending a penalty term to the single-step reward, the safety constraint comprising at least one of: the joint angle exceeds the allowable mechanical limit range; the joint increment exceeds the set joint increment limit; the robotic arm enters a singular configuration, i.e., the minimum singular value of the Jacobian matrix falls below the singular risk threshold.
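The three termination branches of claim 10 (success, timeout, safety violation) can be sketched as a single check run after each environment step. The specific thresholds below are illustrative assumptions.

```python
import numpy as np

def check_termination(e, step, t_elapsed, q, dq, sigma_min,
                      e_goal=1e-3, max_steps=500, t_max=10.0,
                      q_min=None, q_max=None, dq_max=None,
                      sigma_thresh=0.05):
    """Return (done, tag): 'success' ends the episode with a one-off bonus,
    'timeout' ends it on step/time budget, 'violation' ends it with a
    penalty when a safety constraint is breached."""
    if e <= e_goal:
        return True, 'success'
    if step >= max_steps or t_elapsed > t_max:
        return True, 'timeout'
    violated = (
        (q_min is not None and bool(np.any((q < q_min) | (q > q_max)))) or
        (dq_max is not None and bool(np.any(np.abs(dq) > dq_max))) or
        sigma_min < sigma_thresh            # singular configuration
    )
    if violated:
        return True, 'violation'
    return False, 'running'
```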
Description
Point-to-point trajectory optimization method for a fixed-base robotic arm based on the proximal policy optimization algorithm

Technical Field

The invention relates to a point-to-point trajectory optimization method for a fixed-base robotic arm based on the proximal policy optimization algorithm, and belongs to the fields of robot operation control and autonomous robotic-arm motion optimization.

Background

In extraterrestrial robotic operation tasks, fixed-base robotic arms are widely used for work such as end-effector positioning, contact and manipulation. Such tasks typically require the end effector to reach a designated target point quickly and safely within a limited workspace, while satisfying constraints such as the joint angle range, joint velocity/increment limits, and singular-configuration avoidance throughout the motion. Compared with an idealized environment, extraterrestrial scenarios involve uncertain environmental parameters and operating conditions, so the relative target position, contact timing and constraint margins are subject to stronger uncertainty, and the arrival process places higher demands on online adaptation and robustness. Existing end trajectory planning methods (such as linear interpolation, cubic spline interpolation and time-optimal trajectory planning) mostly follow an offline trajectory-generation paradigm and generally rely on a relatively accurate and stable model. Under the above circumstances, conventional methods lack the ability to select motions and adjust step sizes from real-time feedback at the decision level; frequent re-planning or conservative trajectory parameters are often required, which can lead to reduced arrival efficiency, increased end position error, unsmooth control commands, degraded operability and increased motion clipping when the arm approaches a singular configuration.
Reinforcement learning provides a policy-centered online decision-making approach: the mapping from states to actions is learned from interactive experience, so the robotic arm can output continuous control commands in real time from information such as the relative end-to-target position error, the joint state and the constraint margins, maintaining stable arrival performance under uncertainty and changing constraints. However, directly applying deep reinforcement learning to the end-position arrival task of a fixed-base robotic arm still faces challenges such as a high-dimensional continuous action space, sparse and delayed rewards, and strong physical constraints. Especially where extraterrestrial operation imposes high requirements on safety and controllability, a technical scheme combining reinforcement learning policy optimization with kinematic constraint handling is needed, so that the end-position commands output by the policy can be executed stably under joint limit and singularity constraints, achieving high-precision, smooth and efficient point-to-point arrival of the end effector.

Disclosure of Invention

Aiming at the problems of high-dimensional continuous action space, sparse and delayed rewards and strong physical constraints that arise when deep reinforcement learning is applied directly to the end-position arrival task of a fixed-base robotic arm, the invention provides a point-to-point trajectory optimization method for a fixed-base robotic arm based on the proximal policy optimization algorithm.
The invention discloses a point-to-point trajectory optimization method for a fixed-base robotic arm based on the proximal policy optimization algorithm, comprising the following steps: establishing a kinematics model of the robotic arm, and defining constraint conditions including joint limits, joint increment limits and a singular risk threshold; modeling the end point-to-point arrival task as a sequential decision process, and constructing a state vector comprising at least the end position error, the joint state, the target point position and the previous action; taking the state vector as the input of a policy network, and taking the end displacement increment direction vector as the action output of the policy network; mapping the end displacement increment direction vector to a joint increment through differential inverse kinematics, and performing constraint processing according to the constraint conditions to generate an executable joint control command; training the policy network with the proximal policy optimization algorithm, obtaining single-step rewards through a multi-objective reward function, evaluating the policy network's action output according to the single-step rewards, and iteratively optimizing the policy in combination with a curriculum learning mechanism until convergence. Preferably, the method for mappi