CN-121994227-A - Visual language navigation method and device, embodied agent system and electronic equipment

Abstract

The invention provides a visual language navigation method and device, an embodied agent system and electronic equipment. The method comprises: obtaining a historical action entropy sequence of an agent to be navigated, a current environment visual sequence observed by the agent to be navigated, and a language instruction; calling a pre-trained visual language navigation model in which an adaptive reasoning module is embedded, the adaptive reasoning module being used for determining a navigation reasoning strategy based on the historical action entropy sequence, the current environment visual sequence and the language instruction; inputting the historical action entropy sequence, the current environment visual sequence and the language instruction into the visual language navigation model; obtaining the navigation reasoning strategy at the current moment based on the adaptive reasoning module; and obtaining, based on the navigation reasoning strategy, the navigation action of the agent to be navigated at the current moment output by the visual language navigation model. Navigation accuracy and robustness are improved under limited computing and communication resources.

Inventors

CAO TING
LIU YUNXIN

Assignees

Tsinghua University (清华大学)

Dates

Publication Date: 20260508
Application Date: 20251226

Claims (10)

1. A visual language navigation method, the method comprising: acquiring a historical action entropy sequence of an agent to be navigated, a current environment visual sequence observed by the agent to be navigated, and a language instruction, wherein the historical action entropy sequence is the action entropy sequence of the agent to be navigated over a previous time period; invoking a pre-trained visual language navigation model, wherein an adaptive reasoning module is embedded in the visual language navigation model and is used for determining a navigation reasoning strategy based on the historical action entropy sequence, the current environment visual sequence and the language instruction, and the navigation reasoning strategy comprises a first reasoning strategy of performing navigation reasoning at the current moment and a second reasoning strategy of not performing navigation reasoning at the current moment; and inputting the historical action entropy sequence, the current environment visual sequence and the language instruction into the visual language navigation model, obtaining the navigation reasoning strategy at the current moment based on the adaptive reasoning module, and obtaining, based on the navigation reasoning strategy, the navigation action of the agent to be navigated at the current moment output by the visual language navigation model.
2. The visual language navigation method according to claim 1, wherein the obtaining, based on the navigation reasoning strategy, the navigation action of the agent to be navigated at the current moment output by the visual language navigation model comprises: based on the navigation reasoning strategy, obtaining the navigation action of the agent to be navigated at the current moment output by the visual language navigation model, and/or the reasoning process from which the navigation action is derived.
3. The visual language navigation method according to claim 1 or 2, wherein the obtaining the navigation reasoning strategy at the current moment based on the adaptive reasoning module, and the obtaining, based on the navigation reasoning strategy, the navigation action of the agent to be navigated at the current moment output by the visual language navigation model, comprise: combining the historical action entropy sequence, the current environment visual sequence and the language instruction, and obtaining the action entropy of the agent to be navigated at the current moment based on the adaptive reasoning module; obtaining the navigation reasoning strategy at the current moment based on the action entropy at the current moment; in the case that the navigation reasoning strategy is the first reasoning strategy, obtaining semantic guidance information corresponding to the first reasoning strategy based on the first reasoning strategy, wherein the semantic guidance information is used for guiding the navigation action of the agent to be navigated; and adjusting the action strategy of the agent to be navigated based on the semantic guidance information, and obtaining, based on the adjusted action strategy, the navigation action of the agent to be navigated at the current moment output by the visual language navigation model.
4. The visual language navigation method according to claim 3, wherein the obtaining semantic guidance information corresponding to the first reasoning strategy based on the first reasoning strategy comprises: obtaining a reasoning mode corresponding to the first reasoning strategy based on the first reasoning strategy, wherein the reasoning mode comprises any one or more of a scene description reasoning mode, a path summary reasoning mode and a trajectory error correction reasoning mode; and obtaining the semantic guidance information produced by reasoning in the reasoning mode.
5. The visual language navigation method according to claim 1, wherein the obtaining the navigation reasoning strategy at the current moment based on the adaptive reasoning module, and the obtaining, based on the navigation reasoning strategy, the navigation action of the agent to be navigated at the current moment output by the visual language navigation model, comprise: combining the historical action entropy sequence, the current environment visual sequence and the language instruction, and obtaining the navigation reasoning strategy at the current moment based on the adaptive reasoning module; and in the case that the navigation reasoning strategy is the second reasoning strategy, performing no reasoning on the navigation action of the agent to be navigated at the current moment.
6. The visual language navigation method according to claim 1, wherein the adaptive reasoning module is trained by: in an initial training stage, using a heuristic strategy to give the adaptive reasoning module preliminary guidance on whether to trigger reasoning during exploration; and during training, gradually reducing the guidance weight of the heuristic strategy and increasing the weight of reinforcement learning, and optimizing the parameters of the adaptive reasoning module through a reinforcement learning algorithm, so that the adaptive reasoning module autonomously learns to trigger reasoning when the uncertainty meets a preset condition, so as to improve the reasoning success rate, and to skip reasoning when the uncertainty does not meet the preset condition, so as to improve reasoning efficiency, wherein the uncertainty is determined from the action entropy.
7. The visual language navigation method according to claim 6, wherein the optimizing the parameters of the adaptive reasoning module through a reinforcement learning algorithm comprises: determining a reward function, wherein the reward function comprises a task completion reward term and a penalty term for invalid reasoning behavior; and optimizing the parameters of the adaptive reasoning module through the reinforcement learning algorithm in combination with the reward function, so as to balance reasoning cost against navigation accuracy.
8. A visual language navigation device, the device comprising: an acquisition module, used for acquiring a historical action entropy sequence of an agent to be navigated, a current environment visual sequence observed by the agent to be navigated, and a language instruction, wherein the historical action entropy sequence is the action entropy sequence of the agent to be navigated over a previous time period; a calling module, used for calling a pre-trained visual language navigation model, wherein an adaptive reasoning module is embedded in the visual language navigation model and is used for determining a navigation reasoning strategy based on the historical action entropy sequence, the current environment visual sequence and the language instruction, and the navigation reasoning strategy comprises a first reasoning strategy of performing navigation reasoning at the current moment and a second reasoning strategy of not performing navigation reasoning at the current moment; and a navigation module, used for inputting the historical action entropy sequence, the current environment visual sequence and the language instruction into the visual language navigation model, obtaining the navigation reasoning strategy at the current moment based on the adaptive reasoning module, and obtaining, based on the navigation reasoning strategy, the navigation action of the agent to be navigated at the current moment output by the visual language navigation model.
9. An embodied agent system, the embodied agent system comprising: a visual sensor, used for acquiring a current environment visual sequence; a processor, used for receiving the current environment visual sequence sent by the visual sensor, obtaining the navigation action of the agent to be navigated at the current moment by implementing the visual language navigation method of any one of claims 1 to 7, forming a navigation action instruction based on the navigation action, and issuing the navigation action instruction to the agent; and the agent, used for receiving the navigation action instruction issued by the processor and executing the navigation action instruction.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the visual language navigation method of any one of claims 1 to 7 when executing the computer program.
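
The training scheme of claims 6 and 7 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the linear decay schedule, the blending of a heuristic trigger probability with a learned one, the function names and the numeric values of the reward and penalty are all assumptions introduced only for illustration. The claims state only that the heuristic guidance weight is gradually reduced, that the reinforcement learning weight is increased, and that the reward contains a task completion term and a penalty term for invalid reasoning.

```python
def heuristic_weight(step: int, total_steps: int) -> float:
    """Guidance weight of the heuristic strategy at a given training step.

    Assumption: a linear decay from 1.0 to 0.0; the source only says the
    weight is reduced gradually.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return max(0.0, 1.0 - step / total_steps)


def trigger_probability(step: int, total_steps: int,
                        heuristic_prob: float, policy_prob: float) -> float:
    """Blend the heuristic and learned probabilities of triggering reasoning.

    Early in training the heuristic dominates; later the learned policy does.
    The convex combination is a hypothetical way to realize that shift.
    """
    w = heuristic_weight(step, total_steps)
    return w * heuristic_prob + (1.0 - w) * policy_prob


def reward(task_completed: bool, invalid_reasoning_calls: int,
           success_reward: float = 1.0, penalty: float = 0.1) -> float:
    """Task completion reward minus a penalty per invalid reasoning call.

    The values 1.0 and 0.1 are placeholders, not taken from the source.
    """
    completion = success_reward if task_completed else 0.0
    return completion - penalty * invalid_reasoning_calls
```

Under these assumptions, an episode that succeeds after two unnecessary reasoning calls scores lower than one that succeeds with none, which is how the penalty term pushes the module toward skipping reasoning when uncertainty is low.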

Description

Visual language navigation method and device, embodied agent system and electronic equipment
Technical Field
The present invention relates to the field of visual language navigation technologies, and in particular to a visual language navigation method and device, an embodied agent system and electronic equipment.
Background
In the field of visual language navigation (Visual Language Navigation, VLN), conventional solutions typically guide agents (e.g., robots, virtual agents) to perform navigation tasks in a physical or virtual environment by semantically matching natural language instructions to visual scenes. However, in complex, long-horizon, cross-scene navigation tasks, the spatial coordinate system of existing schemes generally deviates systematically from the spatial dimensions described by the instruction. This causes temporal and spatial misalignment with the user instruction, so the agent cannot be guided to perform the navigation task well.
Disclosure of Invention
The invention provides a visual language navigation method and device, an embodied agent system and electronic equipment, which can improve navigation accuracy and robustness under limited computing and communication resources and effectively avoid temporal and spatial misalignment with user instructions.
The invention provides a visual language navigation method, comprising: obtaining a historical action entropy sequence of an agent to be navigated, a current environment visual sequence observed by the agent to be navigated, and a language instruction, wherein the historical action entropy sequence is the action entropy sequence of the agent to be navigated over a previous time period, and the language instruction is a navigation instruction issued by a user to the agent to be navigated; calling a pre-trained visual language navigation model, wherein an adaptive reasoning module is embedded in the visual language navigation model and is used for determining a navigation reasoning strategy based on the historical action entropy sequence, the current environment visual sequence and the language instruction, and the navigation reasoning strategy comprises a first reasoning strategy of performing navigation reasoning at the current moment and a second reasoning strategy of not performing navigation reasoning at the current moment; and inputting the historical action entropy sequence, the current environment visual sequence and the language instruction into the visual language navigation model, obtaining the navigation reasoning strategy at the current moment based on the adaptive reasoning module, and obtaining, based on the navigation reasoning strategy, the navigation action of the agent to be navigated at the current moment output by the visual language navigation model. According to the visual language navigation method provided by the invention, based on the navigation reasoning strategy, the navigation action of the agent to be navigated at the current moment output by the visual language navigation model is obtained, and/or the reasoning process from which the navigation action is derived is obtained.
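The choice between the first and second reasoning strategies can be sketched in Python as follows. This is a simplified stand-in, not the adaptive reasoning module itself: the source describes a learned module that also uses the visual sequence and the language instruction, whereas the sketch uses only entropy values and a fixed threshold. The definition of action entropy as Shannon entropy over a discrete action distribution, the averaging window, the threshold value and all names are assumptions made for illustration.

```python
import math
from typing import Sequence

FIRST_STRATEGY = "reason"   # perform navigation reasoning at the current moment
SECOND_STRATEGY = "skip"    # do not perform navigation reasoning


def action_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete action distribution.

    Assumption: the source does not define action entropy in this passage.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def choose_strategy(history: Sequence[float], current: float,
                    threshold: float = 0.8, window: int = 3) -> str:
    """Pick a reasoning strategy from recent and current action entropy.

    Hypothetical rule: average the last `window` historical entropies with
    the current one, and trigger reasoning when the mean reaches `threshold`.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = list(history[-window:]) + [current]
    mean_entropy = sum(recent) / len(recent)
    return FIRST_STRATEGY if mean_entropy >= threshold else SECOND_STRATEGY
```

With these placeholder settings, a run of near-uniform action distributions (high entropy) selects the first strategy, while a run of confident, low-entropy steps selects the second and avoids the cost of reasoning.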
According to the visual language navigation method provided by the invention, obtaining the navigation reasoning strategy at the current moment based on the adaptive reasoning module, and obtaining, based on the navigation reasoning strategy, the navigation action of the agent to be navigated at the current moment output by the visual language navigation model, comprise: combining the historical action entropy sequence, the current environment visual sequence and the language instruction, and obtaining the action entropy of the agent to be navigated at the current moment based on the adaptive reasoning module; obtaining the navigation reasoning strategy at the current moment based on the action entropy at the current moment; in the case that the navigation reasoning strategy is the first reasoning strategy, obtaining semantic guidance information corresponding to the first reasoning strategy based on the first reasoning strategy, wherein the semantic guidance information is used for guiding the navigation action of the agent to be navigated; and adjusting the action strategy of the agent to be navigated based on the semantic guidance information, and obtaining, based on the adjusted action strategy, the navigation action of the agent to be navigated at the current moment output by the visual language navigation model. According to the visual language navigation method provided by the invention, obtaining the semantic guidance information corresponding to the first reasoning strategy based on the first reasoning strategy comprises: obtaining a reasoning mode corresponding to the first reasoning strategy, wherein the reasoning mode comprises any one or more of a scene description reasoning mode, a path summary reasoning mode and a trajectory error correction reasoning m