CN-121977556-A - Unmanned aerial vehicle visual language navigation method based on multi-mode historical information fusion
Abstract
The application provides an unmanned aerial vehicle visual language navigation method based on multi-mode historical information fusion. The method comprises: obtaining an image frame sequence collected by the unmanned aerial vehicle before the current moment, a track point sequence of the unmanned aerial vehicle flight, a task language instruction, and an observation value of the unmanned aerial vehicle at the current moment; processing the image frame sequence by visual Slot aggregation to obtain a visual memory feature sequence; processing the track point sequence of the unmanned aerial vehicle flight to obtain track embedding features; processing the task language instruction with a BERT language encoder to obtain language semantic features; processing the language semantic features, the visual memory feature sequence, the track embedding features and the observation value at the current moment to obtain fusion features; and processing the fusion features with a large language model to obtain navigation decision information. By fusing historical images, track points and the task language instruction, the embodiments remarkably improve the context understanding capability of the unmanned aerial vehicle and the robustness of its navigation decisions.
Inventors
- Wang Li
- Jiang Wen
- Gong Ruixuan
- Chen Haolin
- Xu Tao
- Xu Bin
Assignees
- Beijing Institute of Technology
Dates
- Publication Date
- 20260505
- Application Date
- 20260112
Claims (10)
- 1. An unmanned aerial vehicle visual language navigation method based on multi-mode historical information fusion, characterized by comprising the following steps: acquiring an image frame sequence acquired by an unmanned aerial vehicle before the current moment, a track point sequence of the unmanned aerial vehicle flight, a task language instruction and an observation value of the unmanned aerial vehicle at the current moment; processing the image frame sequence by visual Slot aggregation to obtain a visual memory feature sequence; processing the track point sequence of the unmanned aerial vehicle flight to obtain track embedding features; processing the task language instruction with a BERT language encoder to obtain language semantic features; processing the language semantic features, the visual memory feature sequence, the track embedding features and the observation value at the current moment to obtain fusion features; and processing the fusion features with a large language model to obtain navigation decision information.
- 2. The method of claim 1, wherein processing the image frame sequence by visual Slot aggregation to obtain the visual memory feature sequence comprises: processing the image frame sequence with a SigLIP visual encoder to obtain a historical visual feature sequence; performing temporal aggregation on pre-learned Slot variables and the historical visual feature sequence with a Slot Attention mechanism to obtain an aggregation variable; and updating the Slot variables with the aggregation variable to obtain the visual memory feature sequence.
- 3. The method of claim 2, wherein performing temporal aggregation on the pre-learned Slot variables and the historical visual feature sequence with the Slot Attention mechanism to obtain the aggregation variable comprises: obtaining the pre-learned Slot variable S = {s_1, s_2, …, s_K}, wherein s_k is the k-th component of the Slot variable S and K is the number of components of the Slot variable, k = 1, 2, …, K; for the n-th historical visual feature f_n and the k-th component s_k of the Slot variable, calculating the attention weight α_{n,k}: α_{n,k} = softmax_n(f_n^T W s_k), wherein W is a learnable bilinear projection matrix; aggregating the historical visual feature sequence with the attention weights to obtain the k-th component u_k of the aggregation variable: u_k = Σ_{n=1}^{N} α_{n,k} f_n, wherein N is the total number of image frames before the current moment; and the aggregation variable is U = {u_1, u_2, …, u_K}.
- 4. The method of claim 3, wherein updating the Slot variables with the aggregation variable to obtain the visual memory feature sequence comprises: combining the k-th component u_k of the aggregation variable and the k-th component s_k of the Slot variable to obtain the k-th visual memory feature m_k: m_k = LN(W_m [u_k ; s_k]), wherein W_m is a learnable linear projection matrix and LN(·) is a normalization function; and the visual memory feature sequence is M = {m_1, m_2, …, m_K}.
- 5. The method of claim 4, wherein processing the track point sequence of the unmanned aerial vehicle flight to obtain the track embedding features comprises: the track point sequence of the unmanned aerial vehicle flight is P = {p_1, p_2, …, p_N}, wherein the n-th track point p_n = (x_n, y_n, z_n, θ_n), (x_n, y_n, z_n) are the three-dimensional coordinates of the n-th track point and θ_n is the flight angle at the n-th track point; stacking the historical track point sequence of the unmanned aerial vehicle flight into a matrix P ∈ R^{N×4} in time order; performing linear projection on the matrix P to obtain initial features E_0: E_0 = P W_e, wherein W_e is a learnable projection matrix; and processing the initial features E_0 with a plurality of sequentially cascaded Transformer encoders to obtain the track embedding features E_traj.
- 6. The method of claim 5, wherein processing the language semantic features, the visual memory feature sequence, the track embedding features and the observation value at the current moment to obtain the fusion features comprises: combining the language semantic features L, the visual memory feature sequence M, the track embedding features E and the observation value o_t at the current time t to obtain combined data, wherein the observation value o_t at the current time t comprises the three-dimensional position and attitude of the unmanned aerial vehicle and an RGB image acquired by the unmanned aerial vehicle; obtaining a visual segment V and a track segment T from the combined data; taking the language semantic features L as a query, performing cross-modal attention pre-alignment on the visual segment V and the track segment T respectively to obtain a visual result R_V and a track result R_T: R_V = softmax((L W_Q)(V W_K)^T / √d)(V W_V), R_T = softmax((L W'_Q)(T W'_K)^T / √d)(T W'_V), wherein W_Q, W_K, W_V, W'_Q, W'_K and W'_V are all learnable projection matrices and d is a dimension; and splicing the language semantic features L, the visual result R_V, the track result R_T and the observation value o_t at the current moment to obtain the fusion features F: F = [L ; R_V ; R_T ; o_t].
- 7. The method of claim 1, wherein the navigation decision information comprises M navigation points: A = {a_1, a_2, …, a_M}, wherein a_m is the m-th navigation point, m = 1, 2, …, M.
- 8. An unmanned aerial vehicle visual language navigation device based on multi-mode historical information fusion, characterized by comprising: an acquisition unit for acquiring an image frame sequence acquired by an unmanned aerial vehicle before the current moment, a track point sequence of the unmanned aerial vehicle flight, a task language instruction and an observation value of the unmanned aerial vehicle at the current moment; a first processing unit for processing the image frame sequence by visual Slot aggregation to obtain a visual memory feature sequence; a second processing unit for processing the track point sequence of the unmanned aerial vehicle flight to obtain track embedding features; a third processing unit for processing the task language instruction with a BERT language encoder to obtain language semantic features; a fusion unit for processing the language semantic features, the visual memory feature sequence, the track embedding features and the observation value at the current moment to obtain fusion features; and a navigation unit for processing the fusion features with a large language model to obtain navigation decision information.
- 9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method according to any of claims 1-7 when executing the computer program.
- 10. A computer readable storage medium storing computer instructions which, when executed by a processor, implement the method of any one of claims 1-7.
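The visual Slot aggregation recited in claims 2 to 4 can be summarized with a short sketch. The PyTorch code below is a minimal, illustrative reading of the reconstructed formulas; the module name VisualSlotMemory, the slot count, the feature dimension and the linear-plus-LayerNorm update are assumptions, not the patentee's implementation.

```python
# Illustrative sketch of the visual Slot aggregation in claims 2-4 (assumed
# shapes and names): pre-learned slot variables attend over N historical frame
# features, the weighted sums are combined with the slots, and the result is
# normalized into a visual memory feature sequence.
import torch
import torch.nn as nn

class VisualSlotMemory(nn.Module):  # hypothetical module name
    def __init__(self, num_slots: int = 8, dim: int = 768):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, dim))              # pre-learned slot variables s_k
        self.bilinear = nn.Parameter(torch.randn(dim, dim) / dim ** 0.5)    # learnable bilinear matrix W
        self.update = nn.Linear(2 * dim, dim)                               # linear projection of [u_k ; s_k]
        self.norm = nn.LayerNorm(dim)                                       # normalization function LN(.)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (N, dim) historical visual features f_1..f_N (e.g. from a SigLIP encoder)
        scores = frame_feats @ self.bilinear @ self.slots.t()   # (N, K): f_n^T W s_k
        attn = scores.softmax(dim=0)                            # alpha_{n,k}, normalized over the N frames
        aggregated = attn.t() @ frame_feats                     # (K, dim): u_k = sum_n alpha_{n,k} f_n
        memory = self.norm(self.update(torch.cat([aggregated, self.slots], dim=-1)))
        return memory                                           # (K, dim): visual memory features m_1..m_K
```

For example, VisualSlotMemory()(torch.randn(30, 768)) would compress 30 historical frame features into 8 memory features under these assumed defaults.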
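The trajectory embedding of claim 5 (stacking the (x, y, z, θ) track points in time order, projecting them linearly, and passing them through cascaded Transformer encoders) might look as follows; the TrajectoryEncoder name, layer count and dimensions are assumptions.

```python
# Illustrative sketch of the track embedding in claim 5 (assumed hyper-parameters).
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):  # hypothetical module name
    def __init__(self, dim: int = 256, num_layers: int = 2, num_heads: int = 4):
        super().__init__()
        self.proj = nn.Linear(4, dim)  # linear projection of (x, y, z, theta) -> initial features E_0
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)  # cascaded Transformer encoders

    def forward(self, track_points: torch.Tensor) -> torch.Tensor:
        # track_points: (N, 4) historical track points stacked in time order
        e0 = self.proj(track_points).unsqueeze(0)   # (1, N, dim) initial features E_0
        return self.encoder(e0).squeeze(0)          # (N, dim) track embedding features E_traj
```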
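The cross-modal pre-alignment and splicing of claim 6 corresponds to scaled dot-product attention with the language features as queries. The sketch below is an assumed reading of the reconstructed formulas; the CrossModalFusion name, the token-level concatenation of the observation, and all projection shapes are illustrative.

```python
# Illustrative sketch of the cross-modal fusion in claim 6 (assumed shapes).
import math
import torch
import torch.nn as nn

def cross_attend(query, keys_values, w_q, w_k, w_v):
    # softmax((Q W_Q)(K W_K)^T / sqrt(d)) (K W_V), as reconstructed in claim 6
    q, k, v = w_q(query), w_k(keys_values), w_v(keys_values)
    attn = torch.softmax(q @ k.t() / math.sqrt(q.shape[-1]), dim=-1)
    return attn @ v

class CrossModalFusion(nn.Module):  # hypothetical module name
    def __init__(self, dim: int = 768):
        super().__init__()
        self.vis_qkv = nn.ModuleList([nn.Linear(dim, dim) for _ in range(3)])  # W_Q, W_K, W_V
        self.trk_qkv = nn.ModuleList([nn.Linear(dim, dim) for _ in range(3)])  # W'_Q, W'_K, W'_V

    def forward(self, lang, visual_seg, track_seg, obs):
        # lang: (L, dim), visual_seg: (K, dim), track_seg: (N, dim), obs: (O, dim) tokens
        vis_result = cross_attend(lang, visual_seg, *self.vis_qkv)   # visual result R_V
        trk_result = cross_attend(lang, track_seg, *self.trk_qkv)    # track result R_T
        # splice language features, visual result, track result and observation into F
        return torch.cat([lang, vis_result, trk_result, obs], dim=0)
```

The fusion feature returned here is a token sequence that a large language model could then consume to produce the navigation points of claim 7.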
Description
Unmanned aerial vehicle visual language navigation method based on multi-mode historical information fusion

Technical Field

The application relates to the technical field of unmanned aerial vehicles, and in particular to an unmanned aerial vehicle visual language navigation method based on multi-mode historical information fusion.

Background

With the rapid development of artificial intelligence, natural language processing and computer vision technology, unmanned aerial vehicle systems with Vision-and-Language Navigation (VLN) capability are gradually becoming an important platform for intelligent perception and task execution, and are widely applied in high-risk, semi-structured environments such as post-disaster search and rescue, inspection operation and maintenance, and security patrol. Existing unmanned aerial vehicle visual language navigation methods mainly rely on images and language instructions input in a single round to make step-by-step decisions, that is, the next action is inferred from the current image and the language instruction. However, this "single-step perception, single-step prediction" approach has the following significant limitations:

1. The accumulated cues in historical visual information are ignored. The image sequence acquired by the unmanned aerial vehicle during navigation contains rich environmental change cues and semantic context information, yet existing models usually use only the current frame, so historical visual cues cannot be effectively mined and fused, and context understanding is insufficient.

2. The spatio-temporal dependencies of the historical motion trajectory are not modeled. The navigation trajectory not only reflects route selection but also contains dynamic characteristics of user intention evolution and environment feedback. Existing methods insufficiently model the spatial position relations, temporal order and motion directions among historical track points, making it difficult to plan paths stably in dynamic environments.

3. Multi-modal fusion is coarse-grained and reasoning lacks a global view. Current VLN systems are mainly based on the image and language modalities, fuse historical and current information in a simplistic way, and lack an effective cross-modal context modeling mechanism, so unified understanding and reasoning across user instructions, environment perception and motion paths are difficult to achieve.

Therefore, in real complex environments, existing unmanned aerial vehicle navigation methods suffer from poor path continuity, slow instruction response and susceptibility to environmental disturbance, which seriously affect the navigation success rate and the robustness of task execution.
Disclosure of Invention

In view of the above, the present application provides an unmanned aerial vehicle visual language navigation method based on multi-mode historical information fusion, so as to solve the above technical problems.

In a first aspect, the present application provides an unmanned aerial vehicle visual language navigation method based on multi-mode historical information fusion, including: acquiring an image frame sequence acquired by an unmanned aerial vehicle before the current moment, a track point sequence of the unmanned aerial vehicle flight, a task language instruction and an observation value of the unmanned aerial vehicle at the current moment; processing the image frame sequence by visual Slot aggregation to obtain a visual memory feature sequence; processing the track point sequence of the unmanned aerial vehicle flight to obtain track embedding features; processing the task language instruction with a BERT language encoder to obtain language semantic features; processing the language semantic features, the visual memory feature sequence, the track embedding features and the observation value at the current moment to obtain fusion features; and processing the fusion features with a large language model to obtain navigation decision information.

In one possible implementation, processing the image frame sequence by visual Slot aggregation to obtain the visual memory feature sequence includes: processing the image frame sequence with a SigLIP visual encoder to obtain a historical visual feature sequence; performing temporal aggregation on pre-learned Slot variables and the historical visual feature sequence with a Slot Attention mechanism to obtain an aggregation variable; and updating the Slot variables with the aggregation variable to obtain the visual memory feature sequence.

In one possible implementation, the method uses a Slot Attention mechanism to perform temporal aggregation on a Slot variable and a historical visual feature seq