
CN-121589831-B - Multi-degree-of-freedom mechanical arm operation reinforcement learning method from imitation to optimization

CN121589831B

Abstract

The invention discloses a from-imitation-to-optimization reinforcement learning method for multi-degree-of-freedom mechanical arm operation. Robot state information and visual images related to the mechanical arm operation process are input into a trained reinforcement learning network, which outputs the angle changes of the mechanical arm joints; the mechanical arm is then controlled through these joint angle changes. The reinforcement learning network comprises a multi-modal encoder and an action decoder, where the multi-modal encoder comprises a visual encoder, a state encoder and a bidirectional cross-attention module. An initial policy is constructed from offline teaching data by imitation learning; the network then interacts online with the environment, and the initial policy is fine-tuned by value-based reinforcement learning to obtain the trained reinforcement learning network. By combining imitation learning and reinforcement learning, the invention improves the robot's success rate on multi-stage operation tasks given the same number of teaching samples, and realizes operation control of the robot in complex task environments.

Inventors

  • WANG Yajun
  • REN Yi
  • CHANG Hannan

Assignees

  • South China University of Technology (华南理工大学)

Dates

Publication Date
2026-05-08
Application Date
2026-01-29

Claims (8)

  1. A multi-degree-of-freedom mechanical arm operation reinforcement learning method from imitation to optimization, characterized in that robot state information and visual images related to the mechanical arm operation process are input into a trained reinforcement learning network, the joint angle changes of the mechanical arm are output, and the mechanical arm is controlled through these joint angle changes; the reinforcement learning network comprises a multi-modal encoder and an action decoder, wherein the multi-modal encoder comprises a visual encoder, a state encoder and a bidirectional cross-attention module: the visual encoder extracts local spatial features from the visual images related to the mechanical arm operation process, the state encoder compresses the robot state information into a low-dimensional representation, the bidirectional cross-attention module fuses the visual features output by the visual encoder with the robot state features output by the state encoder to obtain a state representation vector after multi-modal fusion, and the action decoder receives the state representation vector and outputs the joint angle changes of the mechanical arm; the method constructs an initial policy from offline teaching data by imitation learning, then interacts online with the environment and fine-tunes the initial policy by value-based reinforcement learning to obtain the trained reinforcement learning network; the value-based reinforcement learning adopts a coarse-to-fine architecture: in the coarse stage, the action space output by the reinforcement learning network is discretized into a fixed number of intervals and a policy network extracts a rough target action region from the global image and the robot state; in the fine stage, the local region selected in the coarse stage is further discretized into a fixed number of intervals, a fine action index is output, and the action value is then output through the decoder; the value-based reinforcement learning also adopts a Dueling DQN structure, whose deep network decomposes the value function into a state value network and an action advantage network, the state value network estimating the overall value of the current robot state and reflecting the potential return of that state under the current policy, and the action advantage network estimating the relative advantage of each action over the average level of that robot state.
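Claim 1 leaves both the Dueling DQN aggregation and the coarse-to-fine discretization as prose; no formulas appear in the text. The sketch below (plain Python, with invented ranges and bin counts) only illustrates the usual form of each idea: the mean-subtracted dueling combination Q(s,a) = V(s) + A(s,a) - mean(A), and re-discretizing the interval chosen in the coarse stage into fine bins.

```python
def dueling_q(state_value, advantages):
    # Q(s,a) = V(s) + (A(s,a) - mean_a' A(s,a')): subtracting the mean
    # advantage keeps the V/A decomposition identifiable.
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]

def coarse_to_fine_centers(low, high, n_coarse, coarse_idx, n_fine):
    # Coarse stage: split [low, high] into n_coarse equal intervals and
    # keep interval `coarse_idx`; fine stage: return the n_fine bin
    # centers of that interval, over which the fine action index ranges.
    coarse_width = (high - low) / n_coarse
    c_low = low + coarse_idx * coarse_width
    fine_width = coarse_width / n_fine
    return [c_low + (i + 0.5) * fine_width for i in range(n_fine)]
```

Subtracting the mean advantage (rather than, say, the max) matches the standard Dueling DQN aggregation; the two-level discretization lets a small number of bins per stage cover a fine effective resolution.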
  2. The multi-degree-of-freedom mechanical arm operation reinforcement learning method from imitation to optimization according to claim 1, wherein value grades are assigned to the offline teaching data, and during reinforcement learning the high-quality offline teaching data in the replay pool are preferentially replayed according to their value grades, so as to optimize the action regression error.
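Claim 2 does not state how the value grades steer replay; one simple reading, sketched here, is grade-proportional sampling from the replay pool (the `grade` field name is an assumption):

```python
import random

def sample_replay(pool, batch_size, rng=None):
    # Each teaching transition carries a value grade; higher-grade
    # (higher-quality) demonstrations are drawn proportionally more
    # often, so updates lean on the good teaching data first.
    rng = rng or random.Random()
    grades = [t["grade"] for t in pool]
    return rng.choices(pool, weights=grades, k=batch_size)
```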
  3. The multi-degree-of-freedom mechanical arm operation reinforcement learning method from imitation to optimization according to claim 1, wherein during imitation learning a loss function is used to optimize the action regression error; during the value-based reinforcement learning, offline updates are performed based on a temporal-difference loss; and during training of the reinforcement learning network, the value of the expert action is kept greater than that of the other actions, thereby constraining the update direction of the reinforcement learning process.
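Claim 3 only says that the expert action's value is kept above that of the other actions; the exact loss has dropped out of the translated text. A common formulation with that effect is the large-margin loss from DQfD, sketched here with an assumed margin value:

```python
def expert_margin_loss(q_values, expert_idx, margin=0.5):
    # Large-margin loss (DQfD style): every non-expert action's value is
    # raised by `margin` before taking the max, so the loss is positive
    # whenever any other action comes within `margin` of the expert
    # action's value, pushing the expert action's value above the rest.
    augmented = [q + (0.0 if i == expert_idx else margin)
                 for i, q in enumerate(q_values)]
    return max(augmented) - q_values[expert_idx]
```

The loss is zero once the expert action dominates the others by at least the margin, so it constrains the update direction without fixing the absolute value scale.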
  4. The multi-degree-of-freedom mechanical arm operation reinforcement learning method from imitation to optimization according to claim 1, wherein the action decoder is configured to map the state representation vector through a linear layer into a value space, and through repeated linear mappings into a hierarchy of value spaces, to output for each layer a value distribution over a set of discrete value intervals, to perform position regression on the value distribution, and then to output the joint angle changes of the mechanical arm through continuous decoding.
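Claim 4's description of the decoder is partly garbled in translation. One common way to realize "position regression on a value distribution" followed by continuous decoding is a softmax expectation over bin centers (soft-argmax); the sketch below assumes a single joint head and a ±0.1 rad range, both invented for illustration:

```python
import math

def decode_joint_delta(logits, low=-0.1, high=0.1):
    # A joint head scores a set of discrete value intervals; softmax
    # turns the scores into a distribution, and the continuous angle
    # change is recovered as the probability-weighted bin center
    # (a soft-argmax style position regression).
    n = len(logits)
    centers = [low + (i + 0.5) * (high - low) / n for i in range(n)]
    m = max(logits)                      # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return sum(e / total * c for e, c in zip(exps, centers))
```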
  5. The multi-degree-of-freedom mechanical arm operation reinforcement learning method from imitation to optimization according to any one of claims 1-4, wherein the visual encoder adopts a convolutional neural network and the state encoder is a linear encoder.
  6. A multi-degree-of-freedom mechanical arm operation reinforcement learning device from imitation to optimization, used for implementing the method of any one of claims 1-5, the device comprising: a model training unit, which constructs an initial policy from offline teaching data by imitation learning, then interacts online with the environment and fine-tunes the initial policy by value-based reinforcement learning to obtain a trained reinforcement learning network; and an action output unit, which inputs the robot state information and visual images related to the mechanical arm operation process into the trained reinforcement learning network, outputs the joint angle changes of the mechanical arm, and controls the mechanical arm based on these joint angle changes; wherein the reinforcement learning network comprises a multi-modal encoder and an action decoder, the multi-modal encoder comprises a visual encoder, a state encoder and a bidirectional cross-attention module, the visual encoder extracts local spatial features from the visual images related to the mechanical arm operation process, the state encoder compresses the robot state information into a low-dimensional representation, the bidirectional cross-attention module fuses the visual features output by the visual encoder with the robot state features output by the state encoder to obtain a state representation vector after multi-modal fusion, and the action decoder receives the state representation vector and outputs the joint angle changes of the mechanical arm.
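The bidirectional cross-attention fusion named in claims 1 and 6 is not spelled out further. The minimal sketch below shows only the data flow (vision attends to state, state attends to vision, the pooled results are concatenated); the learned query/key/value projections and multi-head structure of a real implementation are omitted:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(queries, keys_values):
    # Scaled dot-product attention: each query token attends over the
    # other modality's tokens and returns their weighted average.
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        w = softmax(scores)
        out.append([sum(wi * k[j] for wi, k in zip(w, keys_values))
                    for j in range(d)])
    return out

def fuse(visual_tokens, state_tokens):
    # Bidirectional cross attention: vision attends to state, state
    # attends to vision; the pooled results are concatenated into one
    # fused state-representation vector for the action decoder.
    v2s = attend(visual_tokens, state_tokens)
    s2v = attend(state_tokens, visual_tokens)
    d = len(visual_tokens[0])
    pool = lambda toks: [sum(t[j] for t in toks) / len(toks) for j in range(d)]
    return pool(v2s) + pool(s2v)
```

With both directions present, each modality can weight the other's tokens by relevance before pooling, rather than relying on a fixed concatenation of features.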
  7. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method of any one of claims 1-5 when executing the computer program.
  8. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1-5.

Description

Multi-degree-of-freedom mechanical arm operation reinforcement learning method from imitation to optimization

Technical Field

The invention belongs to the field of robotics, and particularly relates to a multi-degree-of-freedom mechanical arm operation reinforcement learning method from imitation to optimization.

Background

Existing mechanical arm manipulation techniques include template-based visual manipulation, end-to-end manipulation based on imitation learning, and end-to-end manipulation based on reinforcement learning. Conventional mechanical arm operating systems typically rely on manual pre-programming, i.e., a specific operational flow and control logic must be designed by hand for each particular task. This approach not only has high development cost and poor flexibility, but also faces challenges such as modeling difficulty and poor generalization when handling highly complex or variable tasks.

With the development of deep learning and computer vision, more and more mechanical arm systems integrate RGB-D (color and depth image) vision sensors to perceive the surrounding environment. By feeding the acquired visual information into a deep neural network, such a system can automatically extract key features and predict the next operation pose of the mechanical arm, thereby realizing closed-loop control. While completing an operation task, the mechanical arm sequentially reaches a number of predefined key pose points (waypoints) according to visual feedback, so as to complete operation behaviors such as grasping, carrying and assembling.

To improve the automation level and accuracy of grasping tasks, Shanghai Jiao Tong University proposed the GraspNet framework in 2020 (FANG H S, WANG C, GOU M, et al. GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping. Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 13-19, 2020 [C]). GraspNet is a large-scale neural network model trained on millions of real grasp examples, and can directly predict executable 6D grasp poses (position and orientation) from RGB-D images without relying on manually set templates or grasping strategies. The method greatly advances the grasping and operating capability of mechanical arms in real industrial environments, and is particularly suitable for flexible operation on unknown objects, complex backgrounds or unstructured scenes. GraspNet provides an end-to-end grasp detection mechanism and is a key step in the transition of mechanical arms from traditional template-driven control to perception-driven, autonomous intelligence. However, the method only realizes end-to-end grasping; complex mechanical arm operation tasks still require manual programming and template definition.

With the continuous progress of deep learning algorithms and computing hardware, mechanical arm operation technology is gradually shifting from the traditional modularized control mode to an end-to-end learning control paradigm. Compared with traditional methods that rely on manually designed sensing, planning and control modules, the end-to-end approach directly maps raw sensory input (such as RGB images) to low-level control signals (such as joint angles), which simplifies the system design flow and improves the overall intelligence level and task generalization capability. In 2023, Stanford University proposed the Action Chunking with Transformers (ACT) framework (ZHAO T Z, KUMAR V, LEVINE S, et al. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware [J/OL], 2023), which, based on the idea of imitation learning, successfully predicts mechanical arm joint angles end to end from RGB visual input by introducing a Transformer architecture to model sequences of demonstration data. The core idea of ACT is to decompose human or teaching operations into multiple sub-actions (action chunks) and to use the Transformer's sequence modeling to learn the temporal relations among action units, thereby improving the understanding and execution efficiency of complex operation tasks. The method shows strong generalization in a number of standard mechanical arm operation tasks (such as object grasping, insertion, and door opening and closing). On this basis, the introduction of the diffusion policy (Diffusion Policy) (CHI C, XU Z, FENG S, et al. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion [J/OL], 2023) has further driven the development of end-to-end robot manipulation techniques. By modeling the trajectory distribution in a continuous action space, the diffusion policy can efficiently sample high-quality action sequences from teaching trajectories, and has natural diversity a