CN-117268385-B - Indoor robot visual navigation method based on optimal strategy playback

CN117268385B

Abstract

The invention discloses an indoor robot visual navigation method based on optimal strategy playback. Training is divided into an on-policy algorithm and an off-policy algorithm: catastrophic forgetting is reduced by training the current navigation target on-policy and reviewing old navigation targets off-policy. The method comprises a single feature extraction network and a policy network. The input to the policy network is the output of the state feature extraction network, i.e. the high-level feature map of the current image. The whole network outputs two functions, the policy function π(a|s) and the value function Q(s,a). The training process is divided into two phases, an on-policy phase and an off-policy phase. In the on-policy phase, the agent learns the policy for the new target through interaction with the environment; in the off-policy phase, the agent replays the best experience of the old targets stored in the memory to review the learned policies and prevent catastrophic forgetting. By learning old navigation targets from the experience stored in the memory, the method reduces forgetting to the greatest extent.
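The two-phase scheme above can be sketched as a minimal "optimal strategy playback" memory. This is a hypothetical illustration, not the patented implementation: the class and field names are my assumptions. For each previously learned navigation target, only the highest-return trajectory is retained, and the off-policy phase replays these stored trajectories to review old targets.

```python
from collections import namedtuple

# One stored trajectory: the target it navigated to, its step records,
# and the total reward it achieved (names are illustrative).
Trajectory = namedtuple("Trajectory", ["target", "steps", "total_reward"])

class OptimalReplayMemory:
    """Keep only the best (highest-return) trajectory per old navigation target."""

    def __init__(self):
        self.best = {}  # target name -> best Trajectory seen so far

    def add(self, traj):
        # Replace the stored trajectory only if the new one earned more reward.
        cur = self.best.get(traj.target)
        if cur is None or traj.total_reward > cur.total_reward:
            self.best[traj.target] = traj

    def replay_batch(self):
        # Best experience of all old targets, replayed in the off-policy phase.
        return list(self.best.values())
```

Because only the best trajectory per target survives, the memory size stays bounded by the number of old navigation targets, matching the idea of reviewing each old target's optimal strategy rather than its full history.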

Inventors

  • ZHANG YANNING
  • LI XINTING
  • ZHANG SHIZHOU
  • LU YUE
  • DANG JIERUI
  • RAN LINGYAN
  • WANG PENG

Assignees

  • Northwestern Polytechnical University (西北工业大学)

Dates

Publication Date
2026-05-08
Application Date
2023-07-17

Claims (4)

  1. An indoor robot visual navigation method based on optimal strategy playback, characterized by comprising the following steps:

     Step 1, preparing a virtual environment: a navigation task in the virtual environment comprises a scene S, an initial point p and a navigation target o, and the goal of the agent is to find the navigation target o from the initial position in the 3D environment within a given number of steps. At each step, the robot receives a picture from the current camera in the scene and either 1) selects an action and moves accordingly, or 2) selects the Done action to terminate the task; the set of all steps from the start to the end of the task is referred to as a trajectory. The task is considered successful if the following four conditions are met simultaneously: 1) the robot chooses the Done action; 2) the robot is not more than 1 meter from the given target object; 3) the target object is within the field of view of the robot; 4) the robot does not exceed the maximum step limit. A plurality of navigation targets is arranged, 1 million trajectories are trained for each navigation target, and the training order is the order of the navigation targets.

     Step 2, robot visual navigation.

     Step 2-1, on-policy learning: in the on-policy stage the goal is to train the agent to learn the optimal policy for the current target. When learning the policy of a new navigation target, the agent interacts directly with the environment, and the policy gradient is given by:

     \nabla_\theta J(\theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big( Q(s_t, a_t) - V(s_t) \big)   (1)

     where \theta are the parameters of the neural network, \pi_\theta is the current policy, Q(s_t, a_t) is the estimate of the action-value function, V(s_t) is the estimate of the value function, k is the number of actions, T is the total number of steps of the trajectory, a_t is the action selected by the agent at time t, and s_t is the state of the agent at time t.

     Step 2-2, off-policy learning: in the off-policy stage, experience is collected when the number of training trajectories of a navigation target reaches its maximum; a trajectory stored in the Memory is expressed as \tau = \{(s_t, a_t, r_t, \pi_t, Q_t)\}_{t=1}^{T}, where T is the total number of steps of the trajectory and m is the Memory size, and the Memory is expressed as:

     M = \{\tau_1, \tau_2, \ldots, \tau_m\}   (2)

     where r_t is the reward obtained by the agent at time t, \pi_t is the policy of the agent at time t, and Q_t is the Q-function of the agent at time t. From the point of view of preserving policies in the Memory, the policy preserved in the Memory is:

     \pi^{*} = \arg\max_{\pi} R(\tau_{\pi})   (3)

     where \pi^{*} denotes the optimized policy, i.e. for each old navigation target only the experience of the highest-return policy is preserved.

     Step 3, after the policy learning stage is finished, the off-policy stage is performed multiple times, using the Retrace estimator Q^{\mathrm{ret}} and reducing the variance caused by off-policy distribution shift with bias-corrected truncated importance sampling, where Q^{\mathrm{ret}} is given by:

     Q^{\mathrm{ret}}(s_t, a_t) = r_t + \gamma \bar{\rho}_{t+1} \big[ Q^{\mathrm{ret}}(s_{t+1}, a_{t+1}) - Q(s_{t+1}, a_{t+1}) \big] + \gamma V(s_{t+1})   (4)

     The optimization objective of the off-policy stage is:

     \nabla_\theta J(\theta) = \bar{\rho}_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \big[ Q^{\mathrm{ret}}(s_t, a_t) - V(s_t) \big]   (5)

     where \bar{\rho}_t = \min\big(c, \pi(a_t \mid s_t) / \mu(a_t \mid s_t)\big) is the truncated importance weight, c is a constant, \mu is the behavior policy, and \gamma is the discount factor. On the basis of this optimization objective, KL and L2 losses are further used as constraints to reduce the difference between the target policy and the behavior policy.
  2. The method for visual navigation of an indoor robot based on optimal strategy playback of claim 1, wherein the virtual environment is an AI2-THOR virtual environment.
  3. The method for visual navigation of an indoor robot based on optimal strategy playback of claim 1, wherein the robot receives a picture from the current camera in the scene, and the picture size is 300 × 300 × 3.
  4. The method for navigating an indoor robot based on optimal strategy playback of claim 1, wherein there are 10 navigation targets, namely an alarm clock, a book, a bowl, a coffee machine, a kettle, a plate, a pan, a toaster, a pot and a refrigerator.
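The Retrace estimator and the truncated importance weight used in Step 3 of claim 1 can be sketched in pure Python. This is a hedged illustration under my own assumptions (function names, list-based inputs, and zero bootstrap values past the terminal step are not from the patent); it computes the recursion of formula (4) backward over one trajectory.

```python
def truncated_is_weight(pi_prob, mu_prob, c=1.0):
    # Truncated importance weight: rho_bar = min(c, pi(a|s) / mu(a|s)),
    # where pi is the target policy and mu the behavior policy.
    return min(c, pi_prob / mu_prob)

def retrace_targets(rewards, q_vals, v_vals, pi_probs, mu_probs,
                    gamma=0.99, c=1.0):
    """Backward recursion of the Retrace target for one stored trajectory:
    Q_ret(t) = r_t + gamma * V(t+1)
             + gamma * rho_bar(t+1) * (Q_ret(t+1) - Q(t+1)).
    Beyond the terminal step, all bootstrap quantities are taken as 0."""
    T = len(rewards)
    q_ret = [0.0] * T
    next_ret = next_q = next_v = next_rho = 0.0  # values at step t+1
    for t in reversed(range(T)):
        q_ret[t] = (rewards[t]
                    + gamma * next_v
                    + gamma * next_rho * (next_ret - next_q))
        # Shift the t-th quantities into the "t+1" slots for step t-1.
        next_ret, next_q, next_v = q_ret[t], q_vals[t], v_vals[t]
        next_rho = truncated_is_weight(pi_probs[t], mu_probs[t], c)
    return q_ret
```

Truncating the weight at c bounds the variance of the off-policy correction, which is why the claim pairs Retrace with bias-corrected truncated importance sampling rather than plain importance sampling.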

Description

Indoor robot visual navigation method based on optimal strategy playback

Technical Field

The invention belongs to the technical field of navigation, and particularly relates to an indoor robot visual navigation method.

Background

Visual navigation is a fundamental problem for robots and artificial intelligence; the goal-driven visual navigation task instructs the robot to search for a given object in a visual scene. The object may be described by visual or semantic information, and the input to the robot is the current visual picture. In recent years, visual navigation has attracted increasing research interest in artificial intelligence and computer vision, and indoor robot visual navigation has been applied in fields such as automatic home service, warehouse and logistics management, and the hotel and travel industry.

The conventional approach is map-based visual navigation. These methods explicitly decompose the navigation task into a set of subtasks, namely mapping, localization, planning and motion control. While these approaches have been quite successful for many years, the modular design has fundamental limitations that prevent their widespread adoption. One important limitation is that sensor noise accumulates and propagates from the mapper to the controller, resulting in error accumulation that makes these algorithms less robust in complex environments. More importantly, they require extensive case- and scenario-specific manual engineering, which makes them difficult to integrate with other downstream artificial intelligence tasks.

Owing to the success of learning-based methods in recent years, in particular reinforcement-learning-based methods on related tasks, their application to indoor visual navigation has proliferated. These learning methods typically take the visual input and a task-specific navigation target as inputs, and output the best action for the agent to take at each timestep to achieve the target. Unlike traditional methods, learning-based methods infer solutions directly from the current inputs in an end-to-end manner. They therefore require little engineering and can serve as the basis for new artificial-intelligence-driven visual navigation tasks.

However, when learning multiple navigation targets, current learning-based visual navigation generally adopts a training scheme in which one navigation target is randomly selected as the task target at the beginning of each training episode. This allows the network model to learn multiple navigation targets simultaneously, but it is inflexible: whenever a new target must be learned, the new target and all previously learned targets must be put together and re-learned from scratch. This process not only wastes a great deal of resources but also limits the applicability of learning-based navigation algorithms, placing high demands on the stability and plasticity of the model. Existing methods such as EWC are not ideal in solving this problem.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides an indoor robot visual navigation method based on optimal strategy playback, which is divided into an on-policy algorithm and an off-policy algorithm; catastrophic forgetting is reduced by training the current navigation target on-policy and reviewing old navigation targets off-policy. The method includes a single feature extraction network and a policy network. The input to the policy network is the output of the state feature extraction network, i.e. the high-level feature map of the current image. The whole network outputs two functions, the policy function π(a|s) and the value function Q(s,a). The training process is divided into an on-policy phase and an off-policy phase. In the on-policy phase, the agent learns the policy for the new target through interaction with the environment; in the off-policy phase, the agent replays the best experience of the old targets stored in the memory to review the learned policies and prevent catastrophic forgetting. By learning old navigation targets from the experience stored in the memory, the method reduces forgetting to the greatest extent.

The technical scheme adopted by the invention for solving the technical problem comprises the following steps:

Step 1, preparing a virtual environment: a navigation task in the virtual environment comprises a scene S, an initial point p and a navigation target o; the goal of the agent is to find the navigation target o from the initial position in the 3D environment within a given number of steps. At each step, the robot receives a picture from the current camera in the scene and either 1) selects an action and moves accordingly, or 2) selects the Done action to terminate the task.
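The success criteria for a navigation episode, as laid out in Step 1 of claim 1, can be sketched as a simple predicate. The helper below is a hypothetical illustration (the patent gives no code, and the parameter names are my assumptions); it checks that all four conditions hold simultaneously.

```python
def navigation_success(chose_done, distance_m, target_in_view,
                       steps_taken, max_steps):
    """All four conditions from Step 1 must hold at the same time:
    1) the robot chose the Done action,
    2) the robot is within 1 meter of the target object,
    3) the target object is within the robot's field of view,
    4) the robot did not exceed the maximum step limit."""
    return (chose_done
            and distance_m <= 1.0
            and target_in_view
            and steps_taken <= max_steps)
```

Requiring the explicit Done choice (rather than only proximity) forces the agent to learn when to stop, which is what makes condition 1 non-redundant with conditions 2 and 3.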