CN-122021358-A - Robot navigation method and system based on deep reinforcement learning and proxy simulation
Abstract
The invention discloses a robot navigation method and system based on deep reinforcement learning and proxy simulation. The method defines the state, observation, and action spaces; constructs a proxy simulation environment based on a simplified kinematic model, in which obstacles are described geometrically, ideal lidar ranging values are computed, simulated network-communication uncertainty is introduced on top of superimposed sensor spatio-temporal noise, and state transitions are completed under actuator dynamics constraints; builds a policy network based on the TD3 algorithm and pre-trains it in the proxy environment with a multi-constraint reward function and a mixed target-generation strategy; constructs a ROS2 Gazebo high-fidelity dynamics simulation environment, loads the pre-trained policy into it for dynamics fine-tuning, and, after convergence, deploys the policy to the robot to execute online navigation tasks. Two-stage training improves sample efficiency and policy robustness, while differentiated QoS strategies and a multithreaded executor ensure ROS2 system stability.
Inventors
- Wang Xuerao
- Tao Xingyu
- Wu Duanyang
- Liu Wenzhang
- Xu Lele
- Na Yuhong
Assignees
- Anhui University (安徽大学)
Dates
- Publication Date
- 20260512
- Application Date
- 20260414
Claims (10)
- 1. A robot navigation method based on deep reinforcement learning and proxy simulation, characterized by comprising the following steps: S1, defining the state space, observation space, and action space of robot navigation, wherein the state space comprises the robot's position, attitude angle, linear velocity, and angular velocity; the observation space comprises lidar ranging information and the robot's pose relative to the target point; and the action space comprises velocity control commands; S2, constructing, from the parameters of the state, observation, and action spaces, a proxy simulation environment based on a simplified kinematic model: obstacles are described geometrically; ideal lidar ranging values are computed by ray casting as raw observation data; sensor spatio-temporal noise is superimposed on the raw observations to produce noisy observations; random delay and random packet loss simulating network-communication uncertainty are applied to the noisy observations to produce the final observations; and the robot's state transition is completed according to the velocity control command and preset actuator dynamics constraints; S3, constructing a deep reinforcement learning network based on the TD3 algorithm, comprising a policy network and a critic (evaluation) network, wherein the policy network takes the final observations as input and outputs normalized velocity control commands; S4, pre-training, in the proxy simulation environment of step S2, the network of step S3 with a multi-constraint reward function and a mixed target-generation strategy to obtain pre-trained policy parameters; S5, constructing a ROS2 Gazebo high-fidelity dynamics simulation environment comprising a rigid-body dynamics model of the robot and sensor plug-ins, loading the pre-trained policy parameters into it, fine-tuning the policy network parameters under realistic dynamics and communication conditions, and, after fine-tuning converges, deploying the policy network on the robot to execute online navigation tasks.
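The ray-casting step of claim 1 (S2) can be sketched as follows. This is a minimal illustration, not the patent's implementation: obstacles are assumed to be circles `(cx, cy, radius)`, and each beam returns the distance to the nearest intersection or the maximum range on a miss; the beam count and ranges are placeholder values.

```python
import math

def ray_cast_lidar(x, y, yaw, obstacles, n_beams=36, max_range=10.0):
    """Ideal 2-D lidar: cast n_beams evenly spaced rays from (x, y) and
    return the distance to the nearest circular obstacle hit by each ray.
    `obstacles` is a list of (cx, cy, radius) tuples (assumed geometry)."""
    ranges = []
    for i in range(n_beams):
        angle = yaw + 2.0 * math.pi * i / n_beams
        dx, dy = math.cos(angle), math.sin(angle)
        best = max_range
        for cx, cy, r in obstacles:
            # Solve |p + t*d - c|^2 = r^2 for the smallest positive t.
            ox, oy = x - cx, y - cy
            b = ox * dx + oy * dy
            c = ox * ox + oy * oy - r * r
            disc = b * b - c
            if disc < 0.0:
                continue  # this ray misses the circle entirely
            t = -b - math.sqrt(disc)  # nearest root; negative if behind
            if 0.0 < t < best:
                best = t
        ranges.append(best)
    return ranges
```

In the patent's pipeline these ideal ranges are the "raw observation data" onto which sensor noise, delay, and packet loss are subsequently layered.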
- 2. The robot navigation method based on deep reinforcement learning and proxy simulation of claim 1, wherein in step S2 the robot state transition combines the velocity control command with the preset actuator dynamics constraints as follows: from the linear and angular velocity commands, together with a preset maximum acceleration and an actuator response time constant, the robot's actual linear and angular velocities are updated, and its new pose is computed from those actual velocities; the state transition is completed by the following formulas: v_t = v_{t-1} + clip((Δt/τ)(v_t^d − v_{t-1}), −a_max·Δt, a_max·Δt); ω_t = ω_{t-1} + (Δt/τ)(ω_t^d − ω_{t-1}); wherein v_t and ω_t denote the actual linear and angular velocities of the robot at time step t; v_{t-1} and ω_{t-1} denote the actual linear and angular velocities at time step t−1; v_t^d and ω_t^d denote the desired linear and angular velocity commands at time step t; τ denotes the response time constant of the robot chassis actuators; a_max denotes the maximum linear acceleration allowed by the robot chassis; Δt denotes the time step of the simulation environment; and clip(·) denotes the truncation function.
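The actuator-constrained state transition of claim 2 can be sketched as a first-order lag toward the commanded velocities with the per-step linear-velocity change clamped by the maximum acceleration. The numeric values of `tau`, `a_max`, and `dt` below are assumptions for illustration, not values from the patent.

```python
import math

def clip(v, lo, hi):
    """Truncate v to the interval [lo, hi]."""
    return max(lo, min(hi, v))

def actuator_step(v_prev, w_prev, v_cmd, w_cmd, tau=0.2, a_max=1.0, dt=0.05):
    """First-order actuator response toward the commanded velocities;
    the linear-velocity change is additionally clamped by a_max."""
    dv = clip((dt / tau) * (v_cmd - v_prev), -a_max * dt, a_max * dt)
    dw = (dt / tau) * (w_cmd - w_prev)
    return v_prev + dv, w_prev + dw

def pose_step(x, y, yaw, v, w, dt=0.05):
    """Unicycle pose update from the actual velocities (simplified kinematics)."""
    return (x + v * math.cos(yaw) * dt,
            y + v * math.sin(yaw) * dt,
            yaw + w * dt)
```

For example, commanding 1.0 m/s from rest with the values above yields a first-step velocity limited to `a_max * dt = 0.05` m/s rather than the unclamped first-order step of 0.25 m/s.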
- 3. The robot navigation method based on deep reinforcement learning and proxy simulation of claim 1, wherein in step S3 the target-value calculation of the TD3 algorithm introduces a target-policy smoothing mechanism, expressed as: a′ = clip(π_{θ′}(s′) + ε, −1, 1); ε ~ clip(N(0, σ), −c, c); wherein a′ denotes the target action; θ′ denotes the parameters of the target policy network; s′ denotes the next-time-step state, comprising the lidar observation and the robot's own state; ε denotes the target-policy smoothing noise, the target policy being the expected reference that outputs the action for the next state; N(0, σ) denotes a normal distribution with mean 0 and standard deviation σ; clip(·) denotes the truncation function; and c denotes the noise boundary constant.
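Target-policy smoothing, as used in claim 3, can be sketched in a few lines. This follows the standard TD3 formulation; `sigma`, `c`, and the action bounds below are common TD3 defaults, not values taken from the patent.

```python
import random

def smoothed_target_action(policy, s_next, sigma=0.2, c=0.5,
                           a_low=-1.0, a_high=1.0):
    """TD3 target-policy smoothing: add clipped Gaussian noise to the
    target policy's action, then clip to the valid action range."""
    def clip(v, lo, hi):
        return max(lo, min(hi, v))
    eps = clip(random.gauss(0.0, sigma), -c, c)  # noise bounded by +/- c
    return clip(policy(s_next) + eps, a_low, a_high)
```

In TD3 this smoothed action feeds the target critics, regularizing the value estimate so the policy cannot exploit narrow peaks in the Q-function.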
- 4. The robot navigation method based on deep reinforcement learning and proxy simulation of claim 1, wherein in step S4 the multi-constraint reward function r_t comprises a target-approach reward, a coupled forward reward, a rotation penalty, a hierarchical obstacle-avoidance penalty, and a single-step time penalty: r_t = r_goal + r_move + r_rot + r_obs + r_step; target-approach reward: r_goal = k_1(d_{t-1} − d_t); coupled forward reward: r_move = k_2·v_t·cos(ω_t); rotation penalty: r_rot = −k_3·|ω_t| if v_t < v_th, else 0; hierarchical obstacle-avoidance penalty: r_obs = −R_col if d_min < d_col, −k_4(d_safe − d_min) if d_col ≤ d_min < d_safe, else 0; single-step time penalty: r_step = −c_step; wherein k_1 denotes the scaling factor of the target-approach reward; d_{t-1} denotes the Euclidean distance between the robot's position and the target point at the previous time step t−1; d_t denotes that distance at the current time step t; k_2 denotes the weight coefficient of the coupled forward reward; v_t denotes the robot's linear velocity at time step t; ω_t denotes the robot's angular velocity at time step t; k_3 denotes the rotation penalty coefficient; v_th denotes the linear-velocity threshold; R_col denotes the collision termination penalty; d_min denotes the nearest obstacle distance; d_col denotes the physical collision threshold; k_4 denotes the obstacle-avoidance potential-field coefficient; d_safe denotes the safe-distance threshold; and c_step denotes the fixed single-step penalty constant.
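The multi-constraint reward of claim 4 can be sketched as below. The exact functional forms did not survive translation, so this is one plausible reading consistent with the claim's symbol definitions; every coefficient value and the shape of each term (in particular the coupled forward term) is an assumption.

```python
import math

def navigation_reward(d_prev, d_now, v, w, d_obs,
                      k_goal=2.5, k_move=1.0, k_rot=0.1, v_th=0.1,
                      r_col=-100.0, d_col=0.2, k_obs=5.0, d_safe=0.6,
                      c_step=0.01):
    """One plausible reading of the claim-4 multi-constraint reward;
    all coefficients and term shapes are illustrative assumptions."""
    r = k_goal * (d_prev - d_now)      # target-approach reward (progress)
    r += k_move * v * math.cos(w)      # coupled forward reward
    if v < v_th:                       # rotation penalty: spinning in place
        r -= k_rot * abs(w)
    if d_obs < d_col:                  # collision: terminal penalty
        r += r_col
    elif d_obs < d_safe:               # graded potential-field repulsion
        r -= k_obs * (d_safe - d_obs)
    r -= c_step                        # fixed single-step time penalty
    return r
```

The layered structure matters more than the constants: progress and forward motion are rewarded, while the rotation term fires only below the linear-velocity threshold, which is what discourages the spin-in-place local optimum described in the Background.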
- 5. The robot navigation method based on deep reinforcement learning and proxy simulation of claim 1, wherein in step S2 constructing the proxy simulation environment based on a simplified kinematic model further comprises, in a fraction of the simulation steps, applying random displacements to the positions of static obstacles or random disturbances to the movement velocities of dynamic obstacles.
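The obstacle randomization of claim 5 amounts to light domain randomization. A minimal sketch, in which the perturbation probability and jitter magnitudes are assumed values not specified in the patent:

```python
import random

def randomize_obstacles(static_obs, dyn_obs, p=0.3,
                        pos_jitter=0.2, vel_jitter=0.1):
    """With probability p per simulation step, jitter static obstacle
    positions (x, y, r) and dynamic obstacle velocities (x, y, vx, vy)."""
    if random.random() >= p:
        return static_obs, dyn_obs  # most steps: leave obstacles alone
    static_obs = [(x + random.uniform(-pos_jitter, pos_jitter),
                   y + random.uniform(-pos_jitter, pos_jitter), r)
                  for x, y, r in static_obs]
    dyn_obs = [(x, y,
                vx + random.uniform(-vel_jitter, vel_jitter),
                vy + random.uniform(-vel_jitter, vel_jitter))
               for x, y, vx, vy in dyn_obs]
    return static_obs, dyn_obs
```

Perturbing only a fraction of steps keeps the environment mostly stationary while still preventing the policy from overfitting to a fixed obstacle layout.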
- 6. The robot navigation method based on deep reinforcement learning and proxy simulation of claim 1, wherein in step S2, when the final observation data is generated, the method further comprises handling abnormal values in the final observation data: when packet loss occurs, the last valid observation is used in place of the lost data, and when a lidar ranging value exceeds the sensor's effective range or falls below its minimum detection distance, the abnormal value is replaced by a preset boundary value.
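The outlier handling of claim 6 can be sketched as a small stateful filter. The range limits and loss probability below are assumed values; the patent specifies only the two rules (reuse the last valid scan on packet loss, clamp out-of-range readings to preset boundaries).

```python
import random

class ObservationFilter:
    """Simulated lossy channel plus outlier handling: a lost scan is
    replaced by the last valid one, and each range is clamped to the
    sensor's [r_min, r_max] window."""
    def __init__(self, r_min=0.12, r_max=10.0, loss_prob=0.05):
        self.r_min, self.r_max = r_min, r_max
        self.loss_prob = loss_prob
        self.last_valid = None

    def step(self, scan):
        if random.random() < self.loss_prob and self.last_valid is not None:
            return self.last_valid  # packet lost: reuse last valid scan
        clean = [min(max(r, self.r_min), self.r_max) for r in scan]
        self.last_valid = clean
        return clean
```

Clamping rather than discarding keeps the observation vector a fixed length, which the policy network's input layer requires.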
- 7. The robot navigation method based on deep reinforcement learning and proxy simulation of claim 1, wherein in step S5 the ROS2 Gazebo high-fidelity simulation environment configures differentiated communication-quality (QoS) strategies for different data types: a Best Effort strategy is adopted for the lidar sensor data stream, i.e., no reception acknowledgement and no retransmission are required, while a Reliable (retransmission) strategy is adopted for the velocity-command stream, i.e., reception acknowledgement and timeout retransmission are forcibly enabled.
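In ROS2 the differentiated QoS of claim 7 maps directly onto `rclpy` QoS profiles. A configuration sketch (queue depths and topic names are assumptions; running it requires a ROS2 installation):

```python
from rclpy.qos import QoSProfile, QoSReliabilityPolicy

# Lidar stream: Best Effort - drop stale scans rather than block on retries.
scan_qos = QoSProfile(depth=5, reliability=QoSReliabilityPolicy.BEST_EFFORT)

# Velocity commands: Reliable - acknowledged, retransmitted on timeout.
cmd_qos = QoSProfile(depth=10, reliability=QoSReliabilityPolicy.RELIABLE)

# Typical wiring inside a Node subclass (message types/topics assumed):
#   self.create_subscription(LaserScan, '/scan', self.on_scan, scan_qos)
#   self.cmd_pub = self.create_publisher(Twist, '/cmd_vel', cmd_qos)
```

The rationale: a lost lidar scan is superseded by the next one within tens of milliseconds, so retransmission only adds latency, whereas a lost velocity command can leave the robot executing a stale action.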
- 8. A robot navigation system based on deep reinforcement learning and proxy simulation, comprising: a space definition module for defining the state space, observation space, and action space of robot navigation, wherein the state space comprises the robot's position, attitude angle, linear velocity, and angular velocity; a proxy simulation module for constructing, from the parameters of the state, observation, and action spaces, a proxy simulation environment based on a simplified kinematic model, describing obstacles geometrically, computing ideal lidar ranging values by ray casting as raw observation data, superimposing sensor spatio-temporal noise to generate noisy observations, introducing random delay and random packet loss simulating network-communication uncertainty to generate the final observations, and completing the robot's state transition according to the velocity control command and preset actuator dynamics constraints; a TD3 network module for constructing a deep reinforcement learning network based on the TD3 algorithm, comprising a policy network and a critic (evaluation) network, wherein the policy network takes the final observations as input and outputs normalized velocity control commands; a pre-training module for pre-training the deep reinforcement learning network in the proxy simulation environment with a multi-constraint reward function and a mixed target-generation strategy to obtain pre-trained policy parameters; and a fine-tuning deployment module for constructing a ROS2 Gazebo high-fidelity dynamics simulation environment comprising a rigid-body dynamics model of the robot and sensor plug-ins, loading the pre-trained policy parameters into it, fine-tuning the policy network parameters under realistic dynamics and communication conditions, and, after fine-tuning converges, deploying the policy network on the robot to execute online navigation tasks.
- 9. An electronic device, comprising a memory, a processor and a computer program stored on the memory and operable on the processor, wherein the processor implements a robot navigation method based on deep reinforcement learning and proxy simulation as claimed in any one of claims 1 to 7 when executing the computer program.
- 10. A computer-readable storage medium having stored thereon a computer program for causing a computer to execute a robot navigation method based on deep reinforcement learning and proxy simulation according to any one of claims 1 to 7.
Description
Robot navigation method and system based on deep reinforcement learning and proxy simulation

Technical Field

The invention relates to the technical field of autonomous navigation and intelligent control of mobile robots, and in particular to a robot navigation method and system based on deep reinforcement learning and proxy simulation.

Background

With the wide application of mobile robots in fields such as warehouse logistics, indoor delivery, and security inspection, their autonomous navigation capability in complex environments has become a key factor affecting system performance and deployment reliability. Traditional mobile-robot navigation methods are generally based on accurate environment mapping, path planning, and model-driven control strategies, and depend heavily on the accuracy of robot dynamics modeling, sensor stability, and communication reliability. Their robustness and adaptability drop significantly when dynamic obstacles, sensor noise, or communication uncertainty are present in the environment. In recent years, end-to-end navigation methods based on deep reinforcement learning have gradually emerged; they map sensor observations directly to control commands through a deep neural network, reducing to some extent the dependence on accurate models and hand-crafted rules.
However, existing navigation methods based on deep reinforcement learning still face the following technical problems in ROS2 distributed simulation and deployment environments. In Gazebo and similar physics simulation environments, rigid-body dynamics, collision detection, sensor simulation, and middleware communication overhead must all be handled simultaneously, which limits simulation speed; since deep reinforcement learning algorithms generally need a large number of interaction samples, training periods are long, resource consumption is high, and sample efficiency is low. When the reward function and exploration strategy are poorly designed, the agent easily converges to locally optimal behaviors such as spinning in place or remaining stationary for long periods; these avoid collisions but navigate inefficiently and fail to complete the intended task. Existing simulation training environments tend to simplify or ignore sensor noise, network-communication uncertainty, and actuator dynamics constraints, so the performance of a trained policy degrades markedly when it is migrated to a high-fidelity simulation environment or a real robot platform, leaving a large gap between simulation and reality. Finally, prior work is still built on the ROS1 streaming node structure; when migrated directly to ROS2's DDS middleware, multithreaded executors, and asynchronous communication mechanisms, system-stability problems easily arise, such as mismatches between the synchronous reinforcement-learning interface and ROS2's asynchronous callbacks, blocked service calls, and node crashes during long training runs.
Disclosure of Invention

The invention aims to provide a robot navigation method and system based on deep reinforcement learning and proxy simulation, to solve the problems of existing navigation technology in ROS2 systems: low training efficiency, a tendency to fall into local optima, a large gap between simulation and reality, and insufficient operational stability. To this end, the technical scheme provided by the invention is a robot navigation method based on deep reinforcement learning and proxy simulation, comprising the following steps: S1, defining the state space, observation space, and action space of robot navigation, wherein the state space comprises the robot's position, attitude angle, linear velocity, and angular velocity; the observation space comprises lidar ranging information and the robot's pose relative to the target point; and the action space comprises velocity control commands; S2, constructing, from the parameters of the state, observation, and action spaces, a proxy simulation environment based on a simplified kinematic model: obstacles are described geometrically; ideal lidar ranging values are computed by ray casting as raw observation data; sensor spatio-temporal noise is superimposed on the raw observations to produce noisy observations; random delay and random packet loss simulating network-communication uncertainty are applied to produce the final observations; and the robot's state transition is completed according to the velocity control command and preset actuator dynamics constraints; S3, constructing a deep re