CN-121981259-A - Multi-agent collaborative visual language navigation method and device based on large model
Abstract
The application relates to the technical field of artificial intelligence and discloses a multi-agent collaborative visual language navigation method and device based on a large model. The method comprises: obtaining a navigation task instruction, visual observation information and navigation history information of a current time step; performing reasoning through a first agent model, in combination with relevant navigation knowledge in a navigation knowledge base, to obtain a reasoning result of the current time step; judging whether path correction needs to be triggered according to the reasoning result and the navigation task instruction; when path correction needs to be triggered, performing reflection and backtracking through a second agent model to generate a corrected action; executing a navigation action according to the reasoning result or the corrected action, generating a navigation task execution result and updating the navigation history information; and summarizing new navigation knowledge through a third agent model according to the navigation task execution result and updating it into the navigation knowledge base. The application can improve the reliability and scene adaptation capability of visual language navigation.
Inventors
- HE YING
- LIN JIE
- HE BIAO
- YUAN ZHILU
Assignees
- Shenzhen University (深圳大学)
Dates
- Publication Date
- 20260505
- Application Date
- 20251231
Claims (10)
- 1. A multi-agent collaborative visual language navigation method based on a large model, characterized by comprising the following steps: acquiring a navigation task instruction, visual observation information and navigation history information of a current time step; according to a current navigation state representation formed from the navigation task instruction, the visual observation information and the navigation history information, performing reasoning through a first agent model in combination with relevant navigation knowledge in a navigation knowledge base to obtain a reasoning result of the current time step, wherein the reasoning result comprises a suggested action; judging whether path correction needs to be triggered according to the reasoning result and the navigation task instruction; when it is judged that path correction needs to be triggered, performing reflection and backtracking through a second agent model according to the navigation task instruction, the visual observation information, the navigation history information and the reasoning result, and generating a corrected action; executing a navigation action according to the reasoning result or the corrected action, generating a navigation task execution result, and updating the navigation history information; and summarizing new navigation knowledge through a third agent model according to the navigation task execution result, and updating the new navigation knowledge into the navigation knowledge base.
- 2. The large model-based multi-agent collaborative visual language navigation method according to claim 1, wherein the obtaining of the navigation task instruction, visual observation information and navigation history information of the current time step comprises: acquiring the navigation task instruction expressed in natural language; performing visual semantic understanding and text description generation on the visual observation information to generate corresponding natural language visual description information; and fusing the natural language visual description information and the navigation task instruction, in combination with the navigation history information organized in a natural language format, to form a multimodal navigation state representation in text form for the current time step.
- 3. The large model-based multi-agent collaborative visual language navigation method according to claim 2, wherein the performing of visual semantic understanding and text description generation on the visual observation information to generate corresponding natural language visual description information comprises: performing an overall scene description of the visual observation information using a visual understanding model to generate a scene description text; identifying specific objects in the visual observation information using an object detection model, and generating an object description text containing object categories and position information; and fusing the scene description text and the object description text to generate the natural language visual description information.
- 4. The large model-based multi-agent collaborative visual language navigation method according to claim 1, wherein obtaining the reasoning result of the current time step through the first agent model, in combination with the navigation knowledge in the navigation knowledge base, according to the current navigation state representation formed from the navigation task instruction, the visual observation information and the navigation history information comprises: generating the current navigation state representation based on the navigation task instruction, the visual observation information and the navigation history information; retrieving, with the current navigation state representation as the query basis, at least one navigation knowledge segment satisfying a relevance condition from the navigation knowledge base; constructing reasoning prompt information comprising the navigation task instruction, the current navigation state representation and the at least one navigation knowledge segment; inputting the reasoning prompt information into a large language model serving as the first agent model for reasoning to obtain a generated reasoning text; and parsing the reasoning text to obtain the suggested action and the corresponding reasoning basis.
- 5. The large model-based multi-agent collaborative visual language navigation method according to claim 4, wherein retrieving, with the current navigation state representation as the query basis, at least one navigation knowledge segment satisfying a relevance condition from the navigation knowledge base comprises: converting the current navigation state representation into a vector representation; calculating the similarity between the vector representation and knowledge vectors pre-stored in the navigation knowledge base; and ranking the knowledge vectors by similarity, and selecting, as the navigation knowledge segment, the knowledge content corresponding to at least one knowledge vector whose similarity is higher than a preset similarity threshold.
- 6. The large model-based multi-agent collaborative visual language navigation method according to claim 1, wherein judging whether path correction needs to be triggered according to the reasoning result and the navigation task instruction comprises: judging whether the suggested action in the reasoning result is a preset navigation anomaly marker; if not, further analyzing the logical consistency between the suggested action and the executed path reconstructed from the navigation history information; and when the suggested action is detected to be the preset navigation anomaly marker, or the logical consistency is lower than a preset tolerance, determining that path correction needs to be triggered.
- 7. The large model-based multi-agent collaborative visual language navigation method according to claim 1, wherein, when it is judged that path correction needs to be triggered, performing reflection and backtracking through the second agent model according to the navigation task instruction, the visual observation information, the navigation history information and the reasoning result to generate the corrected action comprises: analyzing, through the second agent model, according to the navigation task instruction, the visual observation information, the navigation history information and the reasoning result, to identify the erroneous decision point that caused the current navigation state to mismatch the task requirement; taking the erroneous decision point as the backtracking starting point and, in combination with the correct decision sequence preceding the erroneous decision point in the navigation history information, performing backtracking reasoning through the second agent model to generate at least one alternative decision after the erroneous decision point; and determining, from the at least one alternative decision, the corrected action applicable to the current navigation state.
- 8. The large model-based multi-agent collaborative visual language navigation method according to claim 1, wherein summarizing new navigation knowledge through the third agent model and updating it into the navigation knowledge base according to the navigation task execution result comprises: after the navigation task reaches a termination condition, acquiring the complete navigation execution history information and the real navigation reference trajectory information; comparing the navigation execution history information with the real navigation reference trajectory information and, in combination with the navigation task instruction, analyzing and summarizing through the third agent model to generate new navigation knowledge in text form; and performing structuring processing on the new navigation knowledge, and storing the new navigation knowledge into the navigation knowledge base.
- 9. The large model-based multi-agent collaborative visual language navigation method according to claim 8, wherein analyzing and summarizing through the third agent model to generate new navigation knowledge in text form comprises: extracting, through the third agent model, reusable success experience, failure lessons, or scenario-specific navigation strategies from a successful or failed navigation execution history as the new navigation knowledge; and wherein performing structuring processing on the new navigation knowledge and storing it into the navigation knowledge base comprises: converting the new navigation knowledge into a preset tuple format, wherein the tuple format comprises at least landmark information and associated action information; and converting the new navigation knowledge in tuple format into a vector representation and storing it into the navigation knowledge base based on vector retrieval.
- 10. A multi-agent collaborative visual language navigation device based on a large model, comprising: an acquisition module, configured to acquire a navigation task instruction, visual observation information and navigation history information of a current time step; a reasoning module, configured to perform reasoning through a first agent model, according to a current navigation state representation formed from the navigation task instruction, the visual observation information and the navigation history information, in combination with relevant navigation knowledge in a navigation knowledge base, to obtain a reasoning result of the current time step, wherein the reasoning result comprises a suggested action; a judgment module, configured to judge whether path correction needs to be triggered according to the reasoning result and the navigation task instruction; a correction module, configured to perform reflection and backtracking through a second agent model according to the navigation task instruction, the visual observation information, the navigation history information and the reasoning result when it is judged that path correction needs to be triggered, and to generate a corrected action; an update module, configured to execute a navigation action according to the reasoning result or the corrected action, generate a navigation task execution result, and update the navigation history information; and a summarization module, configured to summarize new navigation knowledge through a third agent model according to the navigation task execution result and update the new navigation knowledge into the navigation knowledge base.
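The retrieval mechanism of claims 4 and 5 (embedding the navigation state, scoring pre-stored knowledge vectors by similarity, and keeping fragments above a preset threshold) can be sketched as follows. This is an illustrative stand-in only: `embed` here is a toy character-frequency vector where a real system would use a learned text encoder, and the names `retrieve` and `cosine` are hypothetical, not from the application.

```python
import math

def embed(text):
    # Toy embedding: 26-dimensional character-frequency vector. Purely
    # illustrative; a real system would use a learned text encoder.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    # Cosine similarity between two vectors (0.0 if either is empty).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(state_text, knowledge_base, threshold=0.5):
    # Rank stored knowledge fragments by similarity to the current
    # navigation state and keep those above the preset threshold (claim 5).
    query = embed(state_text)
    scored = sorted(
        ((cosine(query, embed(k)), k) for k in knowledge_base),
        reverse=True,
    )
    return [k for score, k in scored if score > threshold]
```

The threshold-and-rank step mirrors the claim's "similarity higher than a preset similarity threshold" condition; in practice the knowledge vectors would be pre-computed once at storage time rather than re-embedded on every query.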
Description
Multi-agent collaborative visual language navigation method and device based on large model

Technical Field

The application relates to the technical field of artificial intelligence, and in particular to a multi-agent collaborative visual language navigation method and device based on a large model.

Background

Visual language navigation is an important application direction in the field of artificial intelligence, widely used in scenarios such as automatic driving and the autonomous movement of mobile robots. Its core aim is to enable an agent to autonomously plan and execute navigation actions based on natural language navigation instructions, real-time visual observation information and navigation history data, so as to complete a preset navigation task. In the prior art, multi-source information is integrated through a single model to make navigation decisions and generate actions. However, this scheme has drawbacks: if a decision deviation occurs during navigation, or the path deviates from the task requirement, it is difficult to adjust quickly and return to the correct navigation trajectory; and because no effective knowledge accumulation and reuse mechanism is formed during navigation, the agent's adaptability and navigation robustness are limited when facing different scenarios. In summary, the existing visual language navigation technology is deficient in navigation deviation correction and scene adaptability, and can hardly meet the requirements for efficient and reliable navigation in complex scenarios. The foregoing description is provided for general background information and does not necessarily constitute prior art.
Disclosure of Invention

Embodiments of the application provide a multi-agent collaborative visual language navigation method and device based on a large model, which can improve the reliability and scene adaptation capability of visual language navigation and meet the requirements for efficient and reliable navigation in complex scenarios.

In a first aspect, an embodiment of the present application provides a multi-agent collaborative visual language navigation method based on a large model, comprising: acquiring a navigation task instruction, visual observation information and navigation history information of a current time step; according to a current navigation state representation formed from the navigation task instruction, the visual observation information and the navigation history information, performing reasoning through a first agent model in combination with relevant navigation knowledge in a navigation knowledge base to obtain a reasoning result of the current time step, wherein the reasoning result comprises a suggested action; judging whether path correction needs to be triggered according to the reasoning result and the navigation task instruction; when it is judged that path correction needs to be triggered, performing reflection and backtracking through a second agent model according to the navigation task instruction, the visual observation information, the navigation history information and the reasoning result, and generating a corrected action; executing a navigation action according to the reasoning result or the corrected action, generating a navigation task execution result, and updating the navigation history information; and summarizing new navigation knowledge through a third agent model according to the navigation task execution result, and updating the new navigation knowledge into the navigation knowledge base.
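The three-agent control flow of the first aspect can be sketched as a single decision step: the first agent proposes an action, a check decides whether correction is needed, and the second agent reflects and backtracks only when it is. All agent calls below are stubbed with fixed returns; the function names (`reason`, `needs_correction`, `reflect_and_correct`) and the anomaly marker value are hypothetical illustrations, not details from the application.

```python
ABNORMAL = "NAV_ABNORMAL"  # assumed value for the preset anomaly marker

def reason(instruction, observation, history, knowledge):
    # First agent (stub): propose an action from the current state
    # representation plus retrieved knowledge, with a reasoning basis.
    return {"action": "move_forward", "basis": "hallway matches instruction"}

def needs_correction(result, instruction):
    # Trigger correction when the suggested action is the anomaly marker;
    # a full system would also check logical consistency with the
    # executed path reconstructed from the history (claim 6).
    return result["action"] == ABNORMAL

def reflect_and_correct(instruction, observation, history, result):
    # Second agent (stub): locate the erroneous decision point, backtrack,
    # and return a corrected action applicable to the current state.
    return {"action": "turn_back"}

def navigation_step(instruction, observation, history, knowledge_base):
    # One time step of the loop: reason, optionally correct, then execute
    # the chosen action and update the navigation history.
    result = reason(instruction, observation, history, knowledge_base)
    if needs_correction(result, instruction):
        result = reflect_and_correct(instruction, observation, history, result)
    history.append(result["action"])
    return result["action"]
```

The third agent sits outside this per-step loop: it runs after the episode terminates, summarizing the accumulated history into new knowledge for the knowledge base.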
In a second aspect, an embodiment of the present application provides a multi-agent collaborative visual language navigation device based on a large model, comprising: an acquisition module, configured to acquire a navigation task instruction, visual observation information and navigation history information of a current time step; a reasoning module, configured to perform reasoning through a first agent model, according to a current navigation state representation formed from the navigation task instruction, the visual observation information and the navigation history information, in combination with relevant navigation knowledge in a navigation knowledge base, to obtain a reasoning result of the current time step, wherein the reasoning result comprises a suggested action; a judgment module, configured to judge whether path correction needs to be triggered according to the reasoning result and the navigation task instruction; a correction module, configured to perform reflection and backtracking through a second agent model according to the navigation task instruction, the visual observation information, the navigation history information and the reasoning result when it is judged that path correction needs to be triggered, and to generate a corrected action; an update module, configured to execute a navigation action according to the reasoning result or the corrected action, generate a navigation task execution result, and update the navigation history information; and a summarization module, configured to summarize new navigation knowledge through a third agent model according to the navigation task execution result and update the new navigation knowledge into the navigation knowledge base.
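The knowledge-structuring step performed by the third agent (claims 8 and 9), where a summarized lesson is converted into a tuple of landmark and associated action and then vectorized for storage, might look like the following sketch. The "landmark -> action" text convention, the stored record layout, and the function names are assumptions for illustration, not details from the application.

```python
def structure_knowledge(lesson_text):
    # Parse a summarized lesson into the preset (landmark, action) tuple
    # format of claim 9. The "landmark -> action" convention is assumed.
    landmark, _, action = lesson_text.partition("->")
    return (landmark.strip(), action.strip())

def store(knowledge_base, lesson_text, embed):
    # Vectorize the structured tuple with the supplied embedding function
    # and append it to the knowledge base for later vector retrieval.
    entry = structure_knowledge(lesson_text)
    knowledge_base.append(
        {"tuple": entry, "vector": embed(" ".join(entry))}
    )
    return entry
```

Storing both the structured tuple and its vector keeps the knowledge human-readable for prompt construction while still supporting the similarity-based retrieval of claim 5.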