CN-121979236-A - Unmanned aerial vehicle control method based on natural language and eye movement data fusion
Abstract
The application provides an unmanned aerial vehicle (UAV) control method based on the fusion of natural language and eye movement data. The method comprises: acquiring a synchronized eye image sequence and voice data of a user; processing the eye image sequence to obtain an eye movement data sequence, wherein each eye movement datum comprises a gaze point coordinate and a sight line vector; processing the voice data with a large language model to generate a natural language instruction; processing the eye movement data sequence with an encoder to obtain an eye movement feature sequence; processing the natural language instruction with the large language model to obtain natural language features; fusing the eye movement feature sequence and the natural language features with a bidirectional attention mechanism to obtain a cross-modal fusion feature; processing the cross-modal fusion feature with a self-attention mechanism and the large language model to obtain a target position of the UAV; and generating a flight trajectory and corresponding flight control instructions according to the starting position and the target position of the UAV. The method can improve the flexibility and efficiency of UAV operation.
Inventors
- Wang Li
- Jiang Wen
- Wang Yao
- Yang Dianye
- Fan Wei
- Xu Bin
Assignees
- Beijing Institute of Technology (北京理工大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-09
Claims (9)
- 1. An unmanned aerial vehicle control method based on the fusion of natural language and eye movement data, characterized by comprising the following steps: acquiring a synchronized eye image sequence and voice data of a user; processing the eye image sequence to obtain an eye movement data sequence, wherein each eye movement datum comprises a gaze point coordinate and a sight line vector; processing the voice data with a large language model to generate a natural language instruction; processing the eye movement data sequence with an encoder to obtain an eye movement feature sequence; processing the natural language instruction with the large language model to obtain natural language features; fusing the eye movement feature sequence and the natural language features with a bidirectional attention mechanism to obtain a cross-modal fusion feature; processing the cross-modal fusion feature with a self-attention mechanism and the large language model to obtain a target position of the unmanned aerial vehicle; and generating a flight trajectory and corresponding flight control instructions according to the starting position and the target position of the unmanned aerial vehicle.
- 2. The method of claim 1, wherein processing the eye movement data sequence with an encoder to obtain the eye movement feature sequence comprises: expressing the eye movement data sequence as X = {x_1, x_2, ..., x_n}, where x_i = (p_i, g_i), p_i is the two-dimensional coordinate of the gaze point in the i-th eye movement datum, g_i is the sight line vector in the i-th eye movement datum, and n is the number of eye movement data; encoding the eye movement data sequence X with the encoder to obtain a first eye movement feature sequence E = {e_1, e_2, ..., e_n}, where e_i is the i-th first eye movement feature; computing a position encoding for the i-th first eye movement feature, PE(i, 2j) = sin(ω_j · i) and PE(i, 2j+1) = cos(ω_j · i), where PE(i, j) is the position encoding of the i-th first eye movement feature in the j-th dimension, ω_j = 1 / 10000^(2j/d) is the frequency parameter in the position encoding, d is the dimension of the first eye movement features, and i is the time index of the i-th first eye movement feature; the i-th second eye movement feature is then e'_i = e_i + PE_i, and the second eye movement feature sequence is E' = {e'_1, e'_2, ..., e'_n}.
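The encoding step of claim 2 can be illustrated with a minimal NumPy sketch. The linear projection standing in for the encoder, the 5-D eye-movement layout (gaze x/y plus a 3-D sight vector), and the dimensions are illustrative assumptions; only the sinusoidal position-encoding form follows the claim.

```python
import numpy as np

def sinusoidal_position_encoding(n: int, d: int) -> np.ndarray:
    """One d-dimensional sinusoidal position-encoding vector per
    time step i = 0..n-1 (even dims sine, odd dims cosine)."""
    pe = np.zeros((n, d))
    positions = np.arange(n)[:, None]                  # time index i
    freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))  # frequency parameters
    pe[:, 0::2] = np.sin(positions * freqs)
    pe[:, 1::2] = np.cos(positions * freqs)
    return pe

def encode_eye_movements(eye_data: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Toy linear 'encoder' (an assumption): project each 5-D eye-movement
    sample (gaze u, gaze v, sight-vector x/y/z) to a d-dim first feature,
    then add position encodings to obtain the second feature sequence."""
    first_features = eye_data @ w                      # first eye movement features
    pe = sinusoidal_position_encoding(*first_features.shape)
    return first_features + pe                         # second eye movement features

rng = np.random.default_rng(0)
eye_data = rng.normal(size=(8, 5))   # n = 8 eye movement data
w = rng.normal(size=(5, 16))         # feature dimension d = 16
features = encode_eye_movements(eye_data, w)
print(features.shape)  # (8, 16)
```

Adding the position encoding, rather than concatenating it, keeps the feature dimension fixed while still making the temporal order of fixations visible to the downstream attention layers.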
- 3. The method of claim 2, wherein processing the natural language instruction using the large language model to obtain the natural language features comprises: processing the voice instruction with a speech recognition engine to obtain text; processing the text with the large language model to obtain natural language features L = {l_1, l_2, ..., l_m}, where l_k is the semantic embedding vector of the k-th token, its dimension is d, and m is the length of the natural language features L. The large language model, based on a multi-layer Transformer architecture, captures context dependencies among language units through a self-attention mechanism.
- 4. The method of claim 3, wherein fusing the eye movement feature sequence and the natural language features using the bidirectional attention mechanism to obtain the cross-modal fusion feature comprises: computing a first attention weight α from the second eye movement feature sequence to the natural language features and a second attention weight β from the natural language features to the second eye movement feature sequence, where σ is the activation function that normalizes each set of attention scores; and fusing the natural language features L and the second eye movement feature sequence E' with the first attention weight α and the second attention weight β to obtain the cross-modal fusion feature F, where f_t is the t-th component of F, t = 1, 2, ..., d_F, and d_F is the dimension of the cross-modal fusion feature F.
- 5. The method of claim 4, wherein processing the cross-modal fusion feature using the self-attention mechanism and the large language model to obtain the target position of the drone comprises: processing the cross-modal fusion feature F with the self-attention mechanism to obtain a global perception characterization sequence G = softmax(Q K^T / √d_k) V, where the query vector Q = F W_Q, the key vector K = F W_K, the value vector V = F W_V, W_Q, W_K and W_V are learnable matrices, d_k is the dimension of the key vector, and softmax is the activation function; and processing the characterization sequence G with the large language model to obtain the target position of the drone.
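The self-attention step of claim 5 is the standard scaled dot-product form named by its Q/K/V projections. A minimal NumPy sketch, with random weights standing in for the learned matrices:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(fused: np.ndarray, w_q: np.ndarray,
                   w_k: np.ndarray, w_v: np.ndarray) -> np.ndarray:
    """Scaled dot-product self-attention over the cross-modal fusion
    features: G = softmax(Q K^T / sqrt(d_k)) V."""
    q, k, v = fused @ w_q, fused @ w_k, fused @ w_v
    d_k = k.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d_k), axis=-1)
    return weights @ v  # global perception characterization sequence

rng = np.random.default_rng(2)
fused = rng.normal(size=(13, 16))    # cross-modal fusion features
w_q, w_k, w_v = (rng.normal(size=(16, 16)) for _ in range(3))
g = self_attention(fused, w_q, w_k, w_v)
print(g.shape)  # (13, 16)
```

Dividing by √d_k keeps the logits in a range where the softmax stays well-conditioned, so every position of the fused sequence can contribute to the global characterization.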
- 6. The method of claim 5, wherein generating the flight trajectory and the corresponding flight control instructions based on the starting position and the target position of the drone comprises: generating N track points {P_1, P_2, ..., P_N} from the starting position P_s and the target position P_t of the drone by linear interpolation, P_n = P_s + (n/N)(P_t − P_s), n = 1, 2, ..., N, or by a Bezier curve, wherein each track point comprises spatial coordinates, expected speed and direction information; fitting the N track points with a cubic Bezier curve to generate a continuous flight path; and converting the n-th track point P_n into a corresponding flight control instruction u_n, wherein the flight control instructions comprise a control command, a position adjustment, an attitude angle change and a speed.
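The trajectory step of claim 6 can be sketched in NumPy. The linear interpolation is the form named in the claim; the four Bezier control points and the sample count are illustrative assumptions (the claim fits the interpolated track points, without specifying how control points are chosen).

```python
import numpy as np

def linear_waypoints(start, target, n: int) -> np.ndarray:
    """N track points by linear interpolation:
    P_k = start + (k / n) * (target - start), k = 1..n."""
    start, target = np.asarray(start, float), np.asarray(target, float)
    ts = np.arange(1, n + 1) / n
    return start + ts[:, None] * (target - start)

def cubic_bezier(p0, p1, p2, p3, num: int = 50) -> np.ndarray:
    """Sample a cubic Bezier curve through four control points,
    giving a smooth, continuous flight path."""
    p = [np.asarray(x, float) for x in (p0, p1, p2, p3)]
    t = np.linspace(0.0, 1.0, num)[:, None]
    return ((1 - t) ** 3 * p[0] + 3 * (1 - t) ** 2 * t * p[1]
            + 3 * (1 - t) * t ** 2 * p[2] + t ** 3 * p[3])

wps = linear_waypoints([0, 0, 10], [30, 40, 10], n=4)
print(wps[-1])  # final waypoint equals the target: [30. 40. 10.]
```

A cubic Bezier curve starts at its first control point and ends at its last, so fitting the interpolated waypoints this way preserves the start and target positions while smoothing the path in between.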
- 7. An unmanned aerial vehicle control device based on the fusion of natural language and eye movement data, characterized by comprising: an acquisition unit for acquiring a synchronized eye image sequence and voice data of a user; a first processing unit for processing the eye image sequence to obtain an eye movement data sequence, wherein each eye movement datum comprises a gaze point coordinate and a sight line vector; a second processing unit for processing the eye movement data sequence with an encoder to obtain an eye movement feature sequence, and for processing a natural language instruction with a large language model to obtain natural language features; a fusion unit for fusing the eye movement feature sequence and the natural language features with a bidirectional attention mechanism to obtain a cross-modal fusion feature; a third processing unit for processing the cross-modal fusion feature with a self-attention mechanism and the large language model to obtain a target position of the unmanned aerial vehicle; and a control unit for generating a flight trajectory and corresponding flight control instructions according to the starting position and the target position of the unmanned aerial vehicle.
- 8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method according to any of claims 1-6 when executing the computer program.
- 9. A computer readable storage medium storing computer instructions which, when executed by a processor, implement the method of any one of claims 1-6.
Description
Unmanned aerial vehicle control method based on natural language and eye movement data fusion

Technical Field

The application relates to the technical field of unmanned aerial vehicle control, and in particular to an unmanned aerial vehicle control method based on the fusion of natural language and eye movement data.

Background

With the continuous development of unmanned aerial vehicle (UAV) technology, UAVs are widely applied in many scenarios, such as inspection and maintenance, intelligent security, emergency rescue, interactive entertainment, and human-machine cooperation. Current UAV control methods rely mainly on conventional control terminals, such as joysticks, remote controllers and graphical user interfaces (GUIs), for task allocation and route management. However, in complex environments this mode of communication gradually reveals important problems and struggles to meet the need for effective, natural and intelligent communication, mainly in the following respects:

- Low communication efficiency: conventional methods require the user to operate buttons, remote controllers, etc. step by step. In multitasking or highly dynamic scenarios it is difficult to quickly convey complex intentions, which limits the responsiveness and flexibility of real-time management, especially in sensitive scenarios such as post-disaster search and rescue.
- Insufficient environmental understanding: conventional control methods provide only operation signals and do not reflect the user's attention or semantic intent toward the main targets of the scene. They cannot perceive the user's cognitive state (e.g., attention distribution), so it is difficult to implement intelligent assisted control based on the user's monitoring behavior.
- High communication load: in a complex three-dimensional space, the user must handle target recognition, route planning and equipment operation simultaneously.
Conventional interfaces and controllers impose a heavy workload, which hinders continuous operation by non-professional users or in high-load scenarios and seriously degrades the interaction experience and working efficiency.

Disclosure of the Invention

In view of the above, the application provides a UAV control method based on the fusion of natural language and eye movement data, which can add an autonomous intelligent operation capability to a UAV and solve the technical problems of the low operational flexibility and efficiency of manual remote controllers.

In a first aspect, an embodiment of the present application provides a method for controlling a UAV based on the fusion of natural language and eye movement data, comprising: acquiring a synchronized eye image sequence and voice data of a user; processing the eye image sequence to obtain an eye movement data sequence, wherein each eye movement datum comprises a gaze point coordinate and a sight line vector; processing the voice data with a large language model to generate a natural language instruction; processing the eye movement data sequence with an encoder to obtain an eye movement feature sequence; processing the natural language instruction with the large language model to obtain natural language features; fusing the eye movement feature sequence and the natural language features with a bidirectional attention mechanism to obtain a cross-modal fusion feature; processing the cross-modal fusion feature with a self-attention mechanism and the large language model to obtain a target position of the UAV; and generating a flight trajectory and corresponding flight control instructions according to the starting position and the target position of the UAV.
In one possible implementation, processing the eye movement data sequence with an encoder to obtain the eye movement feature sequence includes: expressing the eye movement data sequence as X = {x_1, x_2, ..., x_n}, where x_i = (p_i, g_i), p_i is the two-dimensional coordinate of the gaze point in the i-th eye movement datum, g_i is the sight line vector in the i-th eye movement datum, and n is the number of eye movement data; encoding the eye movement data sequence X with the encoder to obtain a first eye movement feature sequence E = {e_1, e_2, ..., e_n}, where e_i is the i-th first eye movement feature; computing a position encoding for the i-th first eye movement feature, PE(i, 2j) = sin(ω_j · i) and PE(i, 2j+1) = cos(ω_j · i), where PE(i, j) is the position encoding of the i-th first eye movement feature in the j-th dimension, ω_j = 1 / 10000^(2j/d) is the frequency parameter in the position encoding, d is the dimension of the first eye movement features, and i is the time index of the i-th first eye movement feature; the i-th second eye movement feature is then e'_i = e_i + PE_i, and the second eye movement feature sequence is E' = {e'_1, e'_2, ..., e'_n}. In one possible implementation, processing the natural language instruction using the large language model to obtain the natural language features includes: Processin