CN-122015880-A - Implicit instruction visual language navigation method and system based on multimodal large model
Abstract
The invention provides an implicit instruction visual language navigation method and system based on a multimodal large model, relating to the technical fields of computer vision, artificial intelligence, and robotics. The method generates a semantic map and an occupancy map from visual RGB image and depth image data; parses an implicit instruction input by the user and, combining the local scene map with the robot's current observation state, generates a semantic navigation reasoning token; predicts the next navigation action by fusing multi-source information through a cross-modal attention mechanism; detects obstacles in real time and avoids collision and deadlock in navigation through a collision penalty loss term; and finally adopts a hierarchical joint optimization strategy, combined with an implicit instruction dataset, to optimize the robot navigation training process through two stages of imitation learning and reinforcement learning. By defining special tokens that uniformly encode the scene semantic map and the real-time observation state, the method achieves efficient reasoning and navigation decisions over implicit semantics.
Inventors
- HAN ZHI
- WANG XUDONG
- LI GAN
- LIU LIANQING
Assignees
- Shenyang Institute of Automation, Chinese Academy of Sciences (中国科学院沈阳自动化研究所)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-04-16
Claims (9)
- 1. An implicit instruction visual language navigation method based on a multimodal large model, characterized by comprising the following steps: generating a semantic map and an occupancy map based on visual RGB image and depth image data, and dynamically recording object semantic information and passable areas in the scene; parsing an implicit instruction input by the user with a pre-trained multimodal large model, generating a semantic navigation reasoning token by combining the local scene map and the robot's current observation state, and predicting the next navigation action by fusing multi-source information through a cross-modal attention mechanism; the specific method for parsing the implicit instruction and generating the semantic navigation reasoning token is as follows: defining two special tokens, <IMAGE> and <MAP>, to represent the scene RGB observation and the local semantic map, and encoding the scene RGB observation and the semantic map with two pre-trained visual encoders to obtain the <IMAGE> and <MAP> tokens; expanding the vocabulary of the large language model by adding a token <ACT> that represents the output of the reasoned navigation action, and designing a three-part chain-of-thought prompt template to guide the large language model through instruction reasoning, the template comprising the token <INSTRU>, the tokens <IMAGE> and <MAP>, and the token <ACT>, where <INSTRU> guides the large language model to understand and reason about the user's implicit intent, and <IMAGE> and <MAP> guide it to perform spatial reasoning over the current scene's RGB observation and semantic map and to determine the feasible navigation direction; the user's implicit instruction, the current local scene map, and the robot's observation state are merged into the three-part chain-of-thought prompt template and fed to the large language model for instruction reasoning; when the model generates a text response containing the token <ACT>, the hidden-layer embeddings at the <ACT> position are extracted from several decoder depths and projected through a two-layer linear transformation to form the reasoning token (a minimal sketch of the reasoning-token projection follows the claims); detecting obstacles in real time based on the depth map and the occupancy map, and avoiding collision and deadlock during navigation through a collision penalty loss term; and adopting a hierarchical joint optimization strategy, combined with an implicit instruction dataset, to optimize the robot navigation training process through two stages of imitation learning and reinforcement learning.
- 2. The implicit instruction visual language navigation method based on a multimodal large model according to claim 1, wherein the specific method for generating the semantic map and the occupancy map based on visual RGB image and depth image data is as follows: based on the RGB image and depth map observed by the robot, scene object semantic information is extracted with a pre-trained 3D semantic segmentation network, producing a semantically segmented 3D point cloud that records the position, category, and spatial relations of objects in the scene; through inverse pinhole projection combined with the robot's pose, the semantically segmented 3D point cloud is projected onto a 2D plane, generating an occupancy map that identifies passable and impassable regions in the scene, while simultaneously generating a semantic map recording the semantic positions of objects (a map-building sketch follows the claims).
- 3. The multimodal large model-based implicit instruction visual language navigation method of claim 2, wherein the method further performs cumulative time-series updates of the semantic map and the occupancy map and stores them as a scene memory structure; and, based on the current robot pose, crops a robot-centered local scene map from the global scene map for subsequent reasoning and navigation decisions (a scene-memory sketch follows the claims).
- 4. The implicit instruction visual language navigation method based on a multimodal large model according to claim 3, wherein the specific method for predicting the next navigation action by fusing multi-source information through a cross-modal attention mechanism is as follows: a map encoder extracts the robot's spatial position information from the occupancy map to generate a map token; a depth encoder extracts spatial distance information from the depth map to generate a depth token; an action output head module fuses the reasoning token, map token, and depth token to predict the next navigation action, the module comprising two GRU networks with a cross-modal attention mechanism; the first GRU outputs a hidden state of the robot from the previous navigation action, and the second GRU, combining that hidden state with the attention mechanism, computes a weighted representation of the reasoning token, map token, and depth token, and outputs a hidden state fusing the multimodal information for predicting the next navigation action (an action-head sketch follows the claims).
- 5. The implicit instruction visual language navigation method based on a multimodal large model according to claim 4, wherein the specific method for real-time obstacle detection based on the depth map and the occupancy map, and for avoiding collision and deadlock in navigation through a collision penalty loss term, is as follows: obstacles on the navigation path are detected in real time using the depth map and the occupancy map; a collision indication function is defined to judge whether the robot is about to collide, its value determined by a depth threshold hyperparameter: when the Euclidean distance from the robot to the nearest obstacle in the environment's obstacle set does not exceed the threshold, a collision risk is judged to exist and the indicator is set to 1, otherwise it is set to 0; a collision penalty loss term is designed for obstacle avoidance learning during the robot training stage; when a collision risk is detected, the robot is forced to select and execute one of the steering actions, guiding it around the obstacle (a collision-indicator sketch follows the claims).
- 6. The implicit instruction visual language navigation method based on a multimodal large model according to claim 5, wherein the specific method for optimizing the robot navigation training process through the two stages of imitation learning and reinforcement learning, using a hierarchical joint optimization strategy combined with an implicit instruction dataset, is as follows: in the first training stage, the robot learns basic navigation actions with the DAgger imitation learning algorithm, trained on expert-labeled corrective actions to master basic navigation skills; the first-stage imitation learning objective is $\mathcal{L}_{1} = \mathcal{L}_{\mathrm{DAgger}} + \mathcal{L}_{\mathrm{col}}$, where $\mathcal{L}_{1}$ is the first-stage imitation learning objective function, $\mathcal{L}_{\mathrm{DAgger}}$ is the loss for learning basic navigation actions through the DAgger imitation learning algorithm, and $\mathcal{L}_{\mathrm{col}}$ is the collision penalty loss; the second training stage builds on the navigation capability obtained in the first stage to further learn semantics-aware reasoning navigation, optimizing an end-to-end trajectory-level objective under the reinforcement learning paradigm: $J(\theta) = \mathbb{E}_{a_t \sim \pi_{\theta}(\cdot \mid s_t)}\big[\sum_{t=1}^{T} r_t\big]$, where $J(\theta)$ is the second-stage reinforcement learning objective function, $r_t$ is the combined reward at step $t$, $\pi_{\theta}$ is the policy network with parameters $\theta$, $T$ is the total number of steps the robot takes to actually complete the navigation task, $a_t$ is the navigation action at step $t$, and $s_t$ is the environment state observed by the robot at step $t$ (a training-objective sketch follows the claims).
- 7. The multimodal large model-based implicit instruction visual language navigation method of claim 6, wherein the combined reward $r_t$ at step $t$ jointly accounts for navigation completion, semantic correctness, and trajectory efficiency, and is a weighted fusion of a trajectory alignment reward, a destination semantic association reward, and a step-number efficiency reward (a reward-fusion sketch follows the claims).
- 8. An implicit instruction visual language navigation system based on a multimodal large model, implemented on the basis of the implicit instruction visual language navigation method based on a multimodal large model as described in claim 1, characterized by comprising a scene semantic mapping module, an implicit instruction reasoning and navigation action prediction module, an obstacle avoidance module, and a hierarchical learning module; the scene semantic mapping module generates the semantic map and the occupancy map based on visual RGB image and depth image data, and dynamically records object semantic information and passable areas in the scene; the implicit instruction reasoning and navigation action prediction module parses the implicit instruction input by the user with the pre-trained multimodal large model, generates the semantic navigation reasoning token by combining the local scene map and the robot's current observation state, and predicts the next navigation action by fusing multi-source information through the cross-modal attention mechanism; the obstacle avoidance module detects obstacles in real time based on the depth map and the occupancy map, avoiding collision and deadlock in navigation through the collision penalty loss term; and the hierarchical learning module adopts the hierarchical joint optimization strategy, combined with the implicit instruction dataset, to optimize the robot navigation training process through the two stages of imitation learning and reinforcement learning.
- 9. A computer program product for performing the multimodal large model-based implicit instruction visual language navigation method of any of claims 1-7, comprising a computer program or instructions that, when executed by a processor, implement the multimodal large model-based implicit instruction visual language navigation method.
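The reasoning-token construction in claim 1 can be illustrated with a short sketch: hidden states are read out at the <ACT> token position from several decoder depths and projected through a two-layer linear transformation. This is a minimal sketch assuming a PyTorch decoder-style LLM; the dimensions, the number of tapped layers, and the module name `ReasoningTokenHead` are illustrative assumptions, not the patent's exact design.

```python
import torch
import torch.nn as nn

class ReasoningTokenHead(nn.Module):
    """Two-layer linear projection over multi-depth <ACT> embeddings."""
    def __init__(self, llm_dim=4096, n_layers=4, token_dim=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(llm_dim * n_layers, 1024),
            nn.ReLU(),
            nn.Linear(1024, token_dim),
        )

    def forward(self, hidden_states, act_pos):
        # hidden_states: one (batch, seq_len, llm_dim) tensor per tapped
        # decoder layer; act_pos: sequence index of the generated <ACT> token
        act_embeds = [h[:, act_pos, :] for h in hidden_states]
        return self.proj(torch.cat(act_embeds, dim=-1))  # (batch, token_dim)

# dummy tensors standing in for four decoder layers of an LLM response
states = [torch.randn(1, 32, 4096) for _ in range(4)]
reasoning_token = ReasoningTokenHead()(states, act_pos=17)
print(reasoning_token.shape)  # torch.Size([1, 512])
```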
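Claim 2's map construction follows standard RGB-D geometry. Below is a minimal sketch assuming a pinhole camera model and a fixed grid resolution: the depth map is back-projected to a 3D point cloud, which is then flattened onto a 2D grid, with cells above a height threshold marked occupied. The intrinsics, grid size, and thresholds are illustrative assumptions; per-point class labels from the segmentation network would be splatted onto a parallel grid the same way to form the semantic map.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map in meters to camera-frame 3D points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

def points_to_occupancy(points, cell=0.05, size=200, obstacle_height=0.2):
    """Flatten points onto a size x size grid: -1 unobserved, 0 free, 1 occupied."""
    grid = np.full((size, size), -1, dtype=np.int8)
    ix = (points[:, 0] / cell + size / 2).astype(int)   # lateral axis
    iz = (points[:, 2] / cell).astype(int)              # forward axis
    keep = (ix >= 0) & (ix < size) & (iz >= 0) & (iz < size)
    # camera y points down, so height above the camera plane is -y
    occupied = keep & (-points[:, 1] > obstacle_height)
    grid[iz[keep], ix[keep]] = 0          # observed free
    grid[iz[occupied], ix[occupied]] = 1  # obstacle overrides free
    return grid

depth = np.random.uniform(0.5, 5.0, size=(480, 640))
pts = depth_to_points(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
occupancy = points_to_occupancy(pts)
```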
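Claim 3's scene memory reduces to two small operations, sketched below under the same grid conventions as the previous example (both function names are hypothetical): a cumulative time-series update that keeps the strongest evidence per cell, and an egocentric crop around the robot's pose.

```python
import numpy as np

def integrate(global_map, new_obs):
    """Cumulative update: -1 unobserved < 0 free < 1 occupied, so the
    strongest evidence seen so far is kept per cell."""
    return np.maximum(global_map, new_obs)

def crop_local_map(global_map, robot_rc, half=32):
    """Return a (2*half, 2*half) robot-centred crop of the global map,
    padding with -1 (unobserved) at the borders."""
    r, c = robot_rc
    padded = np.pad(global_map, half, mode="constant", constant_values=-1)
    return padded[r:r + 2 * half, c:c + 2 * half]

global_occ = np.full((200, 200), -1, dtype=np.int8)
global_occ = integrate(global_occ, np.zeros((200, 200), dtype=np.int8))
local = crop_local_map(global_occ, robot_rc=(100, 100))
print(local.shape)  # (64, 64)
```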
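The action output head of claim 4 pairs two GRUs with cross-modal attention. The sketch below assumes scaled dot-product multi-head attention, a four-action discrete space, and equal token dimensions; these choices and all module names are illustrative, not the patent's exact architecture.

```python
import torch
import torch.nn as nn

class ActionHead(nn.Module):
    def __init__(self, d=512, n_actions=4):  # e.g. forward/left/right/stop
        super().__init__()
        self.act_embed = nn.Embedding(n_actions, d)
        self.gru1 = nn.GRUCell(d, d)  # tracks state from the previous action
        self.attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
        self.gru2 = nn.GRUCell(d, d)  # folds in the attended multimodal context
        self.policy = nn.Linear(d, n_actions)

    def forward(self, prev_action, h1, h2, reasoning_tok, map_tok, depth_tok):
        h1 = self.gru1(self.act_embed(prev_action), h1)
        # cross-modal attention: the robot state queries the three tokens
        tokens = torch.stack([reasoning_tok, map_tok, depth_tok], dim=1)
        fused, _ = self.attn(h1.unsqueeze(1), tokens, tokens)
        h2 = self.gru2(fused.squeeze(1), h2)
        return self.policy(h2), h1, h2  # next-action logits + updated states

b, d = 1, 512
head = ActionHead()
logits, h1, h2 = head(torch.tensor([0]), torch.zeros(b, d), torch.zeros(b, d),
                      torch.randn(b, d), torch.randn(b, d), torch.randn(b, d))
print(logits.shape)  # torch.Size([1, 4])
```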
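Claim 5's collision indicator is a thresholded nearest-obstacle test. A minimal sketch, with the threshold value and penalty weight as illustrative hyperparameters:

```python
import numpy as np

def collision_indicator(robot_xy, obstacles_xy, d_th=0.3):
    """1 if the Euclidean distance to the nearest obstacle does not exceed
    the depth threshold hyperparameter d_th, else 0."""
    return int(np.linalg.norm(obstacles_xy - robot_xy, axis=1).min() <= d_th)

def collision_penalty(robot_xy, obstacles_xy, d_th=0.3, weight=1.0):
    """Penalty loss term added during training when a collision risk is
    detected; the training loop would additionally restrict the robot to
    steering actions in that case."""
    return weight * collision_indicator(robot_xy, obstacles_xy, d_th)

obstacles = np.array([[1.0, 0.0], [0.2, 0.1]])
print(collision_indicator(np.array([0.0, 0.0]), obstacles))  # 1 (~0.22 m away)
```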
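The two-stage objectives of claim 6 can be sketched as follows: stage one sums a DAgger cross-entropy imitation loss with the collision penalty, and stage two maximises the expected trajectory return. The plain REINFORCE gradient estimator and the loss weight `lam` are assumptions, since the patent does not name a specific RL algorithm or weighting.

```python
import torch
import torch.nn.functional as F

def stage1_loss(logits, expert_actions, collision_flags, lam=0.5):
    """L_1 = L_DAgger + L_col over a batch of DAgger-relabelled states."""
    l_dagger = F.cross_entropy(logits, expert_actions)
    l_col = lam * collision_flags.float().mean()
    return l_dagger + l_col

def stage2_loss(log_probs, rewards):
    """Monte-Carlo estimate of -J(theta) = -E[sum_t r_t] for one episode,
    using reward-to-go returns."""
    returns = torch.flip(torch.cumsum(torch.flip(rewards, [0]), 0), [0])
    return -(log_probs * returns).sum()

# dummy episode: T = 5 steps over 4 discrete actions
logits = torch.randn(5, 4, requires_grad=True)
print(stage1_loss(logits, torch.tensor([0, 1, 2, 3, 0]),
                  torch.tensor([0, 0, 1, 0, 0])))
dist = torch.distributions.Categorical(logits=logits)
actions = dist.sample()
print(stage2_loss(dist.log_prob(actions), torch.rand(5)))
```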
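Finally, the combined step reward of claim 7 is a weighted sum; the component values and weights below are purely illustrative, since the patent only names the three reward terms and their weighted fusion.

```python
def combined_reward(r_align, r_semantic, r_steps,
                    w_align=0.4, w_semantic=0.4, w_steps=0.2):
    """r_t = weighted fusion of trajectory alignment, destination semantic
    association, and step-number efficiency rewards."""
    return w_align * r_align + w_semantic * r_semantic + w_steps * r_steps

print(combined_reward(r_align=0.8, r_semantic=1.0, r_steps=-0.1))  # 0.7
```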
Description
Implicit instruction visual language navigation method and system based on multimodal large model

Technical Field

The invention relates to the technical fields of computer vision, artificial intelligence, and robotics, and in particular to an implicit instruction visual language navigation method and system based on a multimodal large model.

Background

Amid the rapid development of modern artificial intelligence and robotics, Vision-and-Language Navigation (VLN) has received a great deal of attention as a research hotspot in multimodal fusion. It aims to guide a robot through autonomous navigation tasks by parsing the user's language instructions in combination with visual perception and environment modeling. However, conventional visual language navigation methods mainly perform path planning from explicit, step-by-step instructions (such as "move forward 5 m, then turn right 90 degrees") and face the following technical bottlenecks in practice:

1. Insufficient implicit instruction understanding. In real human-computer interaction, users tend to issue vague, implicit instructions (such as "the weather is too hot, please take me to get something to drink"), which demand strong semantic reasoning and intent resolution from the navigation system. Conventional approaches cannot infer concrete navigation goals and paths from such implicit needs.
2. Lack of scene semantic association modeling. Existing multimodal large language models have strong reasoning ability but lack the capacity to model the specific semantic associations of a navigation scene, making it difficult to guide a robot to an unknown destination on prior knowledge alone.
3. Lack of robustness in complex scenes. Existing navigation systems struggle to complete navigation tasks stably under dynamic environments, occlusion, noise, and similar complex conditions, and easily fall into collision or deadlock states.
4. Reliance on instantaneous perception. Most methods lack a modeling and memory mechanism for persistent scenes, which makes repeated navigation tasks inefficient and prevents navigation performance from improving as scene familiarity grows.

How to design an efficient navigation method that combines a multimodal large model, understands implicit instructions, and adapts to persistent complex scenes has therefore become a research challenge in the field.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides an implicit instruction visual language navigation method and system based on a multimodal large model.
In order to solve the above technical problems, the invention adopts the following technical scheme. In one aspect, the invention provides an implicit instruction visual language navigation method based on a multimodal large model, comprising: generating a semantic map and an occupancy map based on visual RGB image and depth image data, and dynamically recording object semantic information and passable areas in the scene; parsing an implicit instruction input by the user with a pre-trained multimodal large model, generating a semantic navigation reasoning token by combining the local scene map and the robot's current observation state, and predicting the next navigation action by fusing multi-source information through a cross-modal attention mechanism; detecting obstacles in real time based on the depth map and the occupancy map, and avoiding collision and deadlock in navigation through a collision penalty loss term; and adopting a hierarchical joint optimization strategy, combined with an implicit instruction dataset, to optimize the robot navigation training process through two stages of imitation learning and reinforcement learning.

Further, the specific method for generating the semantic map and the occupancy map based on visual RGB image and depth image data is as follows: based on the RGB image and depth map observed by the robot, scene object semantic information is extracted with a pre-trained 3D semantic segmentation network, producing a semantically segmented 3D point cloud that records the position, category, and spatial relations of objects in the scene; through inverse pinhole projection combined with the robot's pose, the semantically segmented 3D point cloud is projected onto a 2D plane, generating an occupancy map that identifies passable and impassable regions in the scene, while simultaneously generating a semantic map recording the semantic positions of objects.

Further, the method also performs cumulative time-series updates of the semantic map and the occupancy map and stores them as a scene memory structure.