
CN-122015850-A - Unmanned aerial vehicle autonomous navigation method based on a large model and related equipment

CN122015850A

Abstract

Target training data corresponding to an unmanned aerial vehicle (UAV) task scene and a first-person-view top-down image sequence are input into an initial vision-language model for supervised fine-tuning to obtain a candidate vision-language model; a current natural language navigation instruction is received from a user, and a UAV autonomous navigation prediction result comprising a multi-step current action sequence is generated by the target vision-language model according to the current natural language navigation instruction, current UAV state information, and a visual memory set. The application can improve the flight stability and task execution efficiency of the UAV in complex urban environments, significantly improve the task success rate, and can be widely applied in the technical field of UAV navigation.
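The closed-loop inference summarized in the abstract can be sketched roughly as follows. All names here are illustrative assumptions (the model call is stubbed, the action space and length threshold are invented for demonstration); this is not the patent's implementation:

```python
# Sketch of the inference loop: the deployed vision-language model produces a
# multi-step action sequence from a discrete action space; execution continues
# until a "stop" action appears in a predicted sequence.

ACTION_SPACE = {"forward", "left", "right", "ascend", "descend", "stop"}
MAX_STEPS_PER_CALL = 4  # hypothetical preset length threshold

def predict_action_sequence(instruction, uav_state, visual_memory):
    """Stub for the target vision-language model. A real system would
    condition on the instruction, UAV state, and visual memory set;
    here a fixed plan ending in "stop" is returned for demonstration."""
    return ["forward", "forward", "left", "stop"]

def navigate(instruction, uav_state, visual_memory):
    trajectory = []
    while True:
        actions = predict_action_sequence(instruction, uav_state, visual_memory)
        # the prediction must respect the length threshold and action space
        assert len(actions) <= MAX_STEPS_PER_CALL
        assert all(a in ACTION_SPACE for a in actions)
        for action in actions:
            trajectory.append(action)  # execute_flight_control(action) on a real UAV
            if action == "stop":
                return trajectory
        # on a real UAV, state and the visual memory set would be
        # updated here before the next model call
```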

Inventors

  • Zhong Zanyang
  • Zhong Renxin
  • Cai Hengxing

Assignees

  • Sun Yat-sen University (中山大学)

Dates

Publication Date
2026-05-12
Application Date
2026-01-29

Claims (10)

  1. An unmanned aerial vehicle autonomous navigation method based on a large model, characterized by comprising the following steps: acquiring multi-modal data from an unmanned aerial vehicle task scene, wherein the multi-modal data comprises original training data and a first-person-view top-down image sequence corresponding to the unmanned aerial vehicle, and the original training data comprises a natural language navigation instruction, unmanned aerial vehicle state information, and an unmanned aerial vehicle motion sequence; performing data preprocessing on the original training data to obtain target training data; inputting the target training data and the first-person-view top-down image sequence into an initial vision-language model for supervised fine-tuning training to obtain a candidate vision-language model; performing, based on a composite reward mechanism, reinforcement learning training on the candidate vision-language model to obtain a target vision-language model, and deploying the verified target vision-language model to a target unmanned aerial vehicle control terminal; and receiving a current natural language navigation instruction input by a user, and generating, by the target vision-language model in the target unmanned aerial vehicle control terminal, an unmanned aerial vehicle autonomous navigation prediction result comprising a multi-step current action sequence according to the current natural language navigation instruction, current unmanned aerial vehicle state information, and a visual memory set, wherein the visual memory set is constructed from a historical first-person-view top-down image sequence based on a progressive interval sampling strategy.
  2. The method according to claim 1, wherein after receiving the current natural language navigation instruction input by the user and generating, by the target vision-language model in the target unmanned aerial vehicle control terminal, the unmanned aerial vehicle autonomous navigation prediction result comprising the multi-step current action sequence according to the current natural language navigation instruction, the current unmanned aerial vehicle state information, and the visual memory set, the method further comprises: controlling a current unmanned aerial vehicle to sequentially execute corresponding flight control operations according to the current action sequence in the unmanned aerial vehicle autonomous navigation prediction result, wherein the length of the current action sequence meets a preset length threshold, and all actions in the current action sequence belong to a predefined discrete action space; dynamically updating the current unmanned aerial vehicle state information and current environment perception information corresponding to the current unmanned aerial vehicle during execution of the flight control operations, wherein the current environment perception information comprises a first-person-view top-down image generated during execution of the flight control operations, the first-person-view top-down image is used to update the historical first-person-view top-down image sequence, and the updated historical first-person-view top-down image sequence is used to update the visual memory set; and if no stop instruction exists in the current action sequence, returning to the step of generating, by the target vision-language model in the target unmanned aerial vehicle control terminal, an unmanned aerial vehicle autonomous navigation prediction result comprising a multi-step current action sequence according to the current natural language navigation instruction, the current unmanned aerial vehicle state information, and the visual memory set, until a stop instruction exists in the current action sequence, so as to complete the unmanned aerial vehicle autonomous navigation process.
  3. The method of claim 1, further comprising the step of constructing the visual memory set based on the progressive interval sampling strategy, the constructing comprising: when the current natural language navigation instruction input by the user is received, acquiring a first-person-view top-down image generated during the flight of the current unmanned aerial vehicle, and adding the first-person-view top-down image to the historical first-person-view top-down image sequence in time order; and, based on the progressive interval sampling strategy, calculating a time offset sequence corresponding to the historical first-person-view top-down image sequence, and constructing the visual memory set at the current time step according to the time offset sequence, wherein the time offset sequence consists of the offsets, relative to the current time step, of target first-person-view top-down images in the historical first-person-view top-down image sequence, and the target first-person-view top-down images are determined based on a preset historical view sampling rule.
  4. The method of claim 1, wherein performing data preprocessing on the original training data to obtain the target training data comprises: performing data normalization on the natural language navigation instruction in the original training data to obtain target instruction text data; constructing, based on a reference example trajectory, label data comprising the unmanned aerial vehicle state information and the unmanned aerial vehicle motion sequence according to the unmanned aerial vehicle state information and the unmanned aerial vehicle motion sequence in the original training data, wherein the label data is used for supervised alignment between the action sequence output by the model and the reference trajectory during supervised fine-tuning training, and the label data is also used as input for the sub-goal state alignment reward and the stop-consistency reward included in the composite reward mechanism during reinforcement learning training; and determining the target training data according to the target instruction text data and the label data.
  5. The method of claim 1, wherein inputting the target training data and the first-person-view top-down image sequence into the initial vision-language model for supervised fine-tuning training to obtain the candidate vision-language model comprises: inputting the target training data and the first-person-view top-down image sequence into the initial vision-language model; and training the initial vision-language model on the target training data and the first-person-view top-down image sequence in an autoregressive sequence generation manner, and optimizing the model parameters of the initial vision-language model using a cross-entropy loss function as the training objective function, to obtain the candidate vision-language model.
  6. The method of claim 1, wherein performing reinforcement learning training on the candidate vision-language model based on the composite reward mechanism to obtain the target vision-language model, and deploying the verified target vision-language model to the target unmanned aerial vehicle control terminal, comprises: constructing a multi-objective composite reward function, wherein the multi-objective composite reward function comprises a sub-goal state alignment reward, a stop-consistency reward, and an output format reward, and the sub-goal state alignment reward comprises a distance-reduction reward and a heading-angle reward; at each time step, linearly combining the sub-reward terms in the multi-objective composite reward function according to preset weights to obtain a composite reward value; and training the candidate vision-language model according to the composite reward value, and optimizing the model policy of the candidate vision-language model using a group relative policy optimization method, to obtain the target vision-language model.
  7. An unmanned aerial vehicle autonomous navigation device based on a large model, characterized in that the device comprises the following modules: a multi-modal data acquisition module, configured to acquire multi-modal data from an unmanned aerial vehicle task scene, wherein the multi-modal data comprises original training data and a first-person-view top-down image sequence corresponding to an unmanned aerial vehicle, and the original training data comprises a natural language navigation instruction, unmanned aerial vehicle state information, and an unmanned aerial vehicle motion sequence; a data preprocessing module, configured to perform data preprocessing on the original training data to obtain target training data; a supervised fine-tuning training module, configured to input the target training data and the first-person-view top-down image sequence into an initial vision-language model for supervised fine-tuning training to obtain a candidate vision-language model; a reinforcement learning training module, configured to perform reinforcement learning training on the candidate vision-language model based on a composite reward mechanism to obtain a target vision-language model, and to deploy the verified target vision-language model to a target unmanned aerial vehicle control terminal; and a navigation model inference module, configured to receive a current natural language navigation instruction input by a user, and to generate, by the target vision-language model in the target unmanned aerial vehicle control terminal, an unmanned aerial vehicle autonomous navigation prediction result comprising a multi-step current action sequence according to the current natural language navigation instruction, current unmanned aerial vehicle state information, and a visual memory set, wherein the visual memory set is constructed from a historical first-person-view top-down image sequence based on a progressive interval sampling strategy.
  8. An electronic device, comprising a memory storing a computer program and a processor, characterized in that the processor implements the method of any one of claims 1 to 6 when executing the computer program.
  9. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 6.
  10. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 6.
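The multi-objective composite reward of claim 6 can be illustrated with a minimal sketch. The individual reward definitions and the weights below are assumptions chosen for demonstration, not values disclosed in the patent:

```python
# Sketch of the composite reward: sub-goal state alignment (distance-reduction
# plus heading-angle), stop-consistency, and output-format sub-rewards are
# linearly combined with preset weights at each time step.
import math

WEIGHTS = {"distance": 0.4, "heading": 0.2, "stop": 0.2, "format": 0.2}  # assumed

def distance_reduction_reward(prev_dist, curr_dist):
    # positive when the UAV moved closer to the sub-goal state
    return prev_dist - curr_dist

def heading_angle_reward(heading, goal_bearing):
    # 1 when facing the sub-goal, -1 when facing directly away
    return math.cos(heading - goal_bearing)

def stop_consistency_reward(predicted_stop, at_goal):
    # reward stopping exactly when the goal is reached
    return 1.0 if predicted_stop == at_goal else -1.0

def output_format_reward(actions, action_space):
    # reward a non-empty sequence drawn from the discrete action space
    return 1.0 if actions and all(a in action_space for a in actions) else -1.0

def composite_reward(prev_dist, curr_dist, heading, goal_bearing,
                     predicted_stop, at_goal, actions, action_space):
    terms = {
        "distance": distance_reduction_reward(prev_dist, curr_dist),
        "heading": heading_angle_reward(heading, goal_bearing),
        "stop": stop_consistency_reward(predicted_stop, at_goal),
        "format": output_format_reward(actions, action_space),
    }
    # linear combination of the sub-reward terms with preset weights
    return sum(WEIGHTS[k] * v for k, v in terms.items())
```

In a group relative policy optimization setup, such scalar rewards would be computed for a group of sampled action sequences and normalized within the group to form the advantage signal.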

Description

Unmanned aerial vehicle autonomous navigation method based on a large model and related equipment

Technical Field

The application relates to the technical field of unmanned aerial vehicle navigation, and in particular to an unmanned aerial vehicle autonomous navigation method based on a large model and related equipment.

Background

Currently, the application of unmanned aerial vehicles (UAVs) continues to expand in fields such as urban inspection, emergency rescue, public safety, and infrastructure operation and maintenance, and their operating environment has gradually evolved from open, regular scenes to structurally complex, dynamically changing urban spaces. In such environments, UAVs need to continuously make path decisions under limited perception conditions, and their autonomous navigation and intelligent decision-making capabilities have become important technological foundations affecting system safety, efficiency, and large-scale deployment. Traditional UAV navigation mainly relies on manual remote control or preset route planning based on the Global Positioning System (GPS); it is highly dependent on operators and struggles to flexibly cope with environmental changes and high-level semantic task demands in complex urban environments, and, as application scenarios continue to diversify, relying solely on coordinates or predefined paths can no longer meet actual demands. Natural-language-based human-computer interaction and vision-language navigation (VLN) technology offer an approach to these problems: they can lower the operation threshold and improve UAV adaptability in complex scenes, and the related technology is gradually evolving in data-driven and model-driven directions.
However, existing UAV VLN methods based on a vision-language model (VLM) still have obvious shortcomings: decisions are mostly single-step action predictions lacking holistic planning capability, so problems such as path jitter easily occur; historical observations and temporal information are under-utilized, limiting spatial memory capability; and algorithmic generalization and adaptability are poor, making it difficult to cope with unknown or dynamically changing environments and to carry the technology from theory to real UAV application scenarios. In summary, the technical problems in the related art remain to be addressed.

Disclosure of Invention

Embodiments of the present application aim to solve, at least to some extent, one of the technical problems in the related art. Therefore, the main purpose of the embodiments of the application is to provide an unmanned aerial vehicle autonomous navigation method based on a large model and related equipment, which can improve the flight stability and task execution efficiency of the UAV in complex urban environments and significantly improve the task success rate and robustness, so that the UAV maintains efficient, stable, and safe navigation capability in dynamic or unknown environments, reducing the need for manual intervention and the associated risks.
In order to achieve the above object, an aspect of the embodiments of the present application provides an unmanned aerial vehicle autonomous navigation method based on a large model, the method comprising the following steps: acquiring multi-modal data from an unmanned aerial vehicle task scene, wherein the multi-modal data comprises original training data and a first-person-view top-down image sequence corresponding to the unmanned aerial vehicle, and the original training data comprises a natural language navigation instruction, unmanned aerial vehicle state information, and an unmanned aerial vehicle motion sequence; performing data preprocessing on the original training data to obtain target training data; inputting the target training data and the first-person-view top-down image sequence into an initial vision-language model for supervised fine-tuning training to obtain a candidate vision-language model; performing, based on a composite reward mechanism, reinforcement learning training on the candidate vision-language model to obtain a target vision-language model, and deploying the verified target vision-language model to a target unmanned aerial vehicle control terminal; and receiving a current natural language navigation instruction input by a user, and generating, by the target vision-language model in the target unmanned aerial vehicle control terminal, an unmanned aerial vehicle autonomous navigation prediction result comprising a multi-step current action sequence according to the current natural language navigation instruction, current unmanned aerial vehicle state information, and a visual memory set.