CN-122008197-A - Deep reinforcement learning-based multi-robot collaborative trajectory generation method and device
Abstract
The application provides a multi-robot collaborative trajectory generation method and device based on deep reinforcement learning. The method comprises: acquiring state data of each robot in a target area at the current time step; processing the state data with a cross-robot attention mechanism in a pre-trained deep reinforcement learning policy network to obtain structured action parameters for each robot's robotic arm; and determining the local trajectory segment for the next time step based on the structured action parameters of the robotic arms and configured constraint conditions. By combining deep reinforcement learning with a structured action space, the method achieves real-time, collision-free, highly smooth collaborative trajectory generation for multiple robots.
Inventors
- LI QIANLONG
- OUYANG GUANG
- WANG LEI
- YI FANGXU
- WANG CHENG
- HE YUNFEI
- YIN SHANHUI
Assignees
- Anhui Kaiyang Technology Co., Ltd. (安徽开阳科技有限公司)
- Chery Automobile Co., Ltd. (奇瑞汽车股份有限公司)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-01-26
Claims (10)
- 1. A multi-robot collaborative trajectory generation method based on deep reinforcement learning, the method comprising: acquiring state data of each robot in a target area at the current time step; processing the state data with a cross-robot attention mechanism in a pre-trained deep reinforcement learning policy network to obtain structured action parameters for each robot's robotic arm; and determining the local trajectory segment for the next time step based on the structured action parameters of the robotic arms and configured constraint conditions.
- 2. The method of claim 1, wherein processing the state data with a cross-robot attention mechanism in a pre-trained deep reinforcement learning policy network to obtain structured action parameters for each robot's robotic arm comprises: normalizing the state data and converting it into high-dimensional feature vectors, and inputting the high-dimensional feature vectors into the pre-trained deep reinforcement learning policy network, wherein the state data comprises the body state of each robotic arm, environment information, and the cooperative state among the robots; the policy network processes the high-dimensional feature vectors through the cross-robot attention mechanism, computes interaction weights among the robots' states, aggregates them into global features containing global cooperative information, and maps the global features to the structured action parameters of each robotic arm through the policy network's output layer.
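The attention step in claim 2 (pairwise interaction weights, aggregation into global features, and a linear output head) can be sketched as a single attention head in NumPy. This is a minimal illustration, not the patent's implementation: the weight matrices, dimensions, and the `cross_robot_attention` helper are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_robot_attention(states, Wq, Wk, Wv, Wout):
    """Single attention head over per-robot state features.

    states: (n_robots, d) normalized high-dimensional state features.
    Returns an (n_robots, d_action) array of structured action parameters.
    """
    Q, K, V = states @ Wq, states @ Wk, states @ Wv
    d_k = Q.shape[-1]
    # Interaction weights between every pair of robot states.
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    # Aggregate into global features carrying cooperative information.
    global_feat = weights @ V
    # Output layer maps global features to structured action parameters.
    return global_feat @ Wout

rng = np.random.default_rng(0)
n, d, d_act = 3, 8, 6          # 3 robots, 8-dim features, 6 action params
states = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Wout = rng.normal(size=(d, d_act))
actions = cross_robot_attention(states, Wq, Wk, Wv, Wout)
print(actions.shape)  # (3, 6)
```

Each robot's action parameters thus depend on every other robot's state through the attention weights, which is what lets a single forward pass produce coordinated outputs.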
- 3. The method of claim 1, wherein the local trajectory segment is represented by a polynomial of a preset order, and the constraint conditions comprise boundary constraint conditions and physical constraint conditions of the corresponding robotic arm; determining the local trajectory segment for the next time step based on the structured action parameters of the robotic arms and the configured constraint conditions comprises: taking the structured action parameters as the coefficient vector of the preset-order polynomial, and constructing a polynomial trajectory function of each robotic-arm joint with respect to the time variable; generating, based on the boundary constraint conditions, an initial local trajectory segment satisfying high-order continuity by solving the running-state continuity equations of the polynomial trajectory function at the connection point; and, if the initial local trajectory segment satisfies the physical constraint conditions, determining the initial local trajectory segment as the local trajectory segment for the next time step.
- 4. The method of claim 3, wherein generating an initial local trajectory segment satisfying high-order continuity by solving the running-state continuity equations of the polynomial trajectory function at the connection point based on the boundary constraint conditions comprises: constructing a system of continuity equations for the running state of the polynomial trajectory function at the connection point, wherein the running state comprises four orders of derivatives: position, velocity, acceleration, and jerk; solving the system of continuity equations to obtain polynomial coefficients satisfying the boundary constraint conditions; and substituting the polynomial coefficients into the polynomial trajectory function to generate an initial local trajectory segment satisfying continuity over the time domain.
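The continuity system in claim 4 can be illustrated with a concrete case: matching position, velocity, acceleration, and jerk at both ends of a segment requires eight conditions, which a 7th-order polynomial satisfies exactly. This is a sketch under that assumed order; the patent's "preset order" and solver are not specified, and `solve_boundary_polynomial` is a hypothetical helper.

```python
import numpy as np
from math import factorial

def solve_boundary_polynomial(start, end, T):
    """Coefficients c_0..c_7 of q(t) = sum_k c_k t^k matching position,
    velocity, acceleration, and jerk at t=0 and t=T.

    start, end: length-4 arrays [pos, vel, acc, jerk] at each boundary.
    """
    n = 8                                    # 8 coefficients, 8 conditions
    A = np.zeros((n, n))
    b = np.concatenate([start, end])
    for d in range(4):                       # derivative order 0..3
        # d-th derivative at t=0: only the t^d term survives.
        A[d, d] = factorial(d)
        # d-th derivative at t=T.
        for k in range(d, n):
            A[4 + d, k] = factorial(k) // factorial(k - d) * T ** (k - d)
    return np.linalg.solve(A, b)             # linear continuity system

coeffs = solve_boundary_polynomial(
    start=np.array([0.0, 0.0, 0.0, 0.0]),    # rest at the connection point
    end=np.array([1.0, 0.0, 0.0, 0.0]),      # rest at the segment end
    T=2.0,
)
# Evaluate the position at t=T to check the boundary condition.
t = 2.0
pos_T = sum(c * t**k for k, c in enumerate(coeffs))
print(round(pos_T, 6))  # 1.0
```

Because jerk itself is constrained continuous at the connection point, chaining such segments avoids the jerk discontinuities the Background section identifies in prior methods.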
- 5. The method of claim 1, further comprising: sampling the local trajectory segment at fixed time intervals to obtain trajectory positions at a plurality of time points; converting the position states corresponding to the trajectory points at the plurality of time points into corresponding track-point commands; and converting the track-point commands into motor torque signals through a controller to drive each robotic-arm joint.
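The fixed-interval sampling step in claim 5 can be sketched as follows. The `sample_trajectory` helper, the sampling interval, and the `(time, position)` pair format are illustrative assumptions; the conversion to torque signals happens downstream in the controller and is not modeled here.

```python
import numpy as np

def sample_trajectory(coeffs, T, dt):
    """Sample a polynomial joint trajectory q(t) = sum_k c_k t^k at fixed
    intervals dt over [0, T], returning (time, position) pairs that a
    controller could turn into per-joint track-point commands.
    """
    times = np.arange(0.0, T + 1e-9, dt)     # include the endpoint t = T
    powers = times[:, None] ** np.arange(len(coeffs))[None, :]
    positions = powers @ np.asarray(coeffs)
    return [(float(t), float(p)) for t, p in zip(times, positions)]

# Toy segment q(t) = 1 + 2t over 1 second, sampled every 0.25 s.
points = sample_trajectory([1.0, 2.0], T=1.0, dt=0.25)
print(points[-1])  # (1.0, 3.0)
```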
- 6. The method of claim 1, further comprising: processing the state data corresponding to the local trajectory segment based on a composite reward function, computing a composite reward value, and storing it in an experience replay buffer; and, once local trajectory segments for K different time steps have been determined, randomly sampling experience data from the experience replay buffer and updating the deep reinforcement learning policy network offline, wherein the experience data comprises a plurality of experience tuples and K is an integer greater than 5.
- 7. The method of claim 6, wherein the composite reward function comprises a task-efficiency reward, a safety penalty, a smoothness penalty, and a cooperative-coupling reward; processing the state data corresponding to the local trajectory segment based on the composite reward function and computing the composite reward value comprises: processing the state data corresponding to the local trajectory segment in parallel, and separately computing the four sub-rewards: the task-efficiency reward, the safety penalty, the smoothness penalty, and the cooperative-coupling reward; weighting and summing the sub-rewards based on preset weight coefficients to obtain a composite reward value that comprehensively evaluates the quality of the trajectory segment; packaging the composite reward value, the corresponding state data, the action data, and the next-state data into an experience tuple; and storing the experience tuple in the experience replay buffer for batch training and optimization updates of the policy network.
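The weighted-sum reward and experience-tuple packaging of claims 6 and 7 can be sketched in a few lines. The sub-reward names, example weights, and the `Experience` tuple layout are hypothetical; the patent specifies only the four components and the weighted sum.

```python
import random
from collections import deque, namedtuple

# One experience tuple as described in claim 7.
Experience = namedtuple("Experience", "state action reward next_state")

def composite_reward(sub_rewards, weights):
    """Weighted sum of the four sub-rewards; penalties enter as
    negative values. Both dicts are keyed by sub-reward name."""
    return sum(weights[k] * sub_rewards[k] for k in weights)

buffer = deque(maxlen=10_000)   # experience replay buffer

r = composite_reward(
    {"efficiency": 1.0, "safety": -0.2, "smoothness": -0.1, "coupling": 0.5},
    {"efficiency": 0.4, "safety": 0.3, "smoothness": 0.2, "coupling": 0.1},
)
buffer.append(Experience(state=[0.0], action=[0.1], reward=r, next_state=[0.1]))
batch = random.sample(buffer, k=1)  # random sampling for offline updates
print(round(r, 3))  # 0.37
```

Bounding the buffer with `maxlen` keeps only recent experience, and random sampling decorrelates the batches used for the offline policy-network update.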
- 8. A multi-robot collaborative trajectory generation device based on deep reinforcement learning, the device comprising: an acquisition unit configured to acquire state data of each robot in a target area at the current time step; a processing unit configured to process the state data with a cross-robot attention mechanism in a pre-trained deep reinforcement learning policy network to obtain structured action parameters for each robot's robotic arm; and a determination unit configured to determine the local trajectory segment for the next time step based on the structured action parameters of the robotic arms and configured constraint conditions.
- 9. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus; the memory is configured to store a computer program; and the processor is configured to implement the method of any one of claims 1-7 when executing the program stored in the memory.
- 10. A computer-readable storage medium having stored therein a computer program which, when executed by a processor, implements the method of any one of claims 1-7.
Description
Deep reinforcement learning-based multi-robot collaborative trajectory generation method and device
Technical Field
The application relates to the technical field of multi-robot collaborative trajectory planning, and in particular to a multi-robot collaborative trajectory generation method and device based on deep reinforcement learning.
Background
In fields such as industrial manufacturing, logistics warehousing, and complex-environment exploration, collaborative operation of multi-robot systems (MRS) has become a key technology for raising the level of automation. In the prior art, multi-robot collaborative trajectory planning mainly adopts centralized optimization methods (such as model predictive control), geometric interpolation methods (such as B-spline curves), or deep reinforcement learning methods. Centralized optimization methods solve for the optimal control quantities through global modeling, but their high computational complexity makes it difficult to meet the real-time requirements of dynamic environments. Geometric interpolation methods attend to trajectory feasibility and safety but lack guarantees of high-order continuity. Deep reinforcement learning methods treat each robot as an agent and use a policy network to map states to low-dimensional control commands, but their action spaces are defined simplistically and cannot effectively handle the high-order smoothness and high-dimensional coupling constraints of cooperative motion. Existing methods therefore face challenges in industrial application: jerk discontinuities in cooperation, the tension between real-time performance and robustness, and insufficient handling of complex constraints.
Disclosure of Invention
The embodiments of the application aim to provide a multi-robot collaborative trajectory generation method and device based on deep reinforcement learning, which achieve real-time, collision-free, highly smooth collaborative trajectory generation for multiple robots by combining deep reinforcement learning with a structured action space. In a first aspect, a multi-robot collaborative trajectory generation method based on deep reinforcement learning is provided. The method may include: acquiring state data of each robot in a target area at the current time step; processing the state data with a cross-robot attention mechanism in a pre-trained deep reinforcement learning policy network to obtain structured action parameters for each robot's robotic arm; and determining the local trajectory segment for the next time step based on the structured action parameters of the robotic arms and configured constraint conditions. In one possible implementation, processing the state data with a cross-robot attention mechanism in a pre-trained deep reinforcement learning policy network to obtain structured action parameters for each robot's robotic arm includes: normalizing the state data and converting it into high-dimensional feature vectors, and inputting the high-dimensional feature vectors into the pre-trained policy network, wherein the state data comprises the body state of each robotic arm, environment information, and the cooperative state among the robots; the policy network processes the high-dimensional feature vectors through the cross-robot attention mechanism, computes interaction weights among the robots' states, aggregates them into global features containing global cooperative information, and maps the global features to the structured action parameters of each robotic arm through the policy network's output layer. In one possible implementation, the local trajectory segment is represented by a polynomial of a preset order, and the constraint conditions comprise boundary constraint conditions and physical constraint conditions of the corresponding robotic arm; determining the local trajectory segment for the next time step based on the structured action parameters of the robotic arms and the configured constraint conditions includes: taking the structured action parameters as the coefficient vector of the preset-order polynomial, and constructing a polynomial trajectory function of each robotic-arm joint with respect to the time variable; generating, based on the boundary constraint conditions, an initial local trajectory segment satisfying high-order continuity by solving the running-state continuity equations of the polynomial trajectory function at the connection point; and, if the initial local trajectory segment satisfies the physical constraint conditions, determining the initial local trajectory segment as the local trajectory segment for the next time step. In one possible implementation, generating an initial local trajectory segment satisfying high-order continuity by solving the running-state continuity equations of a polynomial trajectory function at a connection point based on the boun