CN-121353962-B - Unmanned aerial vehicle scene understanding method, system, equipment and storage medium
Abstract
The embodiment of the invention provides an unmanned aerial vehicle scene understanding method, system, equipment and storage medium, and relates to the technical field of unmanned aerial vehicle scene understanding. The method comprises: performing semantic alignment on video unit data, an unmanned aerial vehicle instruction and an unmanned aerial vehicle task text to obtain a unified characterization in which the three are aligned; acquiring historical thinking of unmanned aerial vehicle scene understanding, and splicing the unified characterization with the historical thinking to obtain a spliced vector; selecting a reasoning thinking corresponding to the spliced vector from preset thinking; performing reinforcement learning training on the reasoning thinking to obtain a final thinking chain; and determining the unmanned aerial vehicle scene according to the final thinking chain. The invention can reduce the delay of unmanned aerial vehicle scene recognition and reduce computational waste.
Inventors
- YU HAIYANG
- LI RUIKAI
- JIANG HAN
- CUI ZHIYONG
- XU LIANG
- REN YILONG
Assignees
- Hangzhou Innovation Institute of Beihang University (北京航空航天大学杭州创新研究院)
- Beihang University (北京航空航天大学)
Dates
- Publication Date: 2026-05-08
- Application Date: 2025-12-17
Claims (10)
- 1. An unmanned aerial vehicle scene understanding method, comprising: acquiring an unmanned aerial vehicle instruction and an unmanned aerial vehicle task text, and performing sparse sampling and encoding processing on video data collected by an unmanned aerial vehicle to obtain video unit data; performing semantic alignment on the video unit data, the unmanned aerial vehicle instruction and the unmanned aerial vehicle task text to obtain a unified characterization in which the video unit data, the unmanned aerial vehicle instruction and the unmanned aerial vehicle task text are aligned; acquiring historical thinking of unmanned aerial vehicle scene understanding, and splicing the unified characterization with the historical thinking to obtain a spliced vector; and selecting a reasoning thinking corresponding to the spliced vector from preset thinking, performing reinforcement learning training on the reasoning thinking to obtain a final thinking chain, and determining an unmanned aerial vehicle scene according to the final thinking chain.
- 2. The unmanned aerial vehicle scene understanding method according to claim 1, wherein the performing sparse sampling and encoding processing on the video data collected by the unmanned aerial vehicle to obtain video unit data comprises: acquiring a trigger time of the unmanned aerial vehicle instruction through a sparse sampler, and taking the video frame data corresponding to the trigger time in the video data as target frame data; taking the target frame data, together with the previous frame data and the next frame data of the target frame data, as a dense frame set; if no trigger time exists, performing sparse sampling on the video data through the sparse sampler according to a preset video frame interval to obtain a sparse frame set; converting and convolving the sparse frame set or the dense frame set to obtain sub-block unit data; and processing the sub-block unit data with a depthwise separable convolution layer and a local feature extractor to obtain the video unit data.
- 3. The unmanned aerial vehicle scene understanding method of claim 1, wherein the step of semantically aligning the video unit data, the unmanned aerial vehicle instruction and the unmanned aerial vehicle task text to obtain the unified characterization in which the video unit data, the unmanned aerial vehicle instruction and the unmanned aerial vehicle task text are aligned comprises: constructing an embedding table through a two-way text coding mechanism, discretizing the unmanned aerial vehicle instruction into discrete symbols, and inputting the discrete symbols into the embedding table to obtain a low-dimensional vector instruction; converting the unmanned aerial vehicle task text into a high-dimensional vector through an embedding layer of a cross-modal aligner, and projecting the high-dimensional vector into a low-dimensional vector text; aligning the low-dimensional vector instruction and the low-dimensional vector text through the two-way text coding mechanism to obtain text unit data; and constructing a low-rank projection matrix through a low-rank bilinear pooling module of the cross-modal aligner, and performing semantic alignment on the video unit data and the text unit data through the low-rank projection matrix to obtain the unified characterization.
- 4. The unmanned aerial vehicle scene understanding method of claim 3, wherein the step of semantically aligning the video unit data and the text unit data through the low-rank projection matrix to obtain the unified characterization comprises: reducing the dimensions of the video unit data and the text unit data to 64, respectively, through three low-rank projection matrices; calculating attention over the projected data to obtain a low-rank cross-modal characterization of the video unit data and the text unit data, wherein A represents the attention matrix between the video unit data and the text unit data, E represents the text unit data, and the size of A is determined by the number of video unit data and the number of text unit data; and performing average pooling and projection on the low-rank cross-modal characterization through a projection matrix to obtain the unified characterization.
- 5. The unmanned aerial vehicle scene understanding method of claim 1, wherein the step of selecting a reasoning thinking corresponding to the spliced vector from the preset thinking comprises: selecting between long thinking and short thinking according to the spliced vector, wherein the selection criterion compares a function of the spliced vector and the historical thinking with a hyperparameter; if the criterion is satisfied, the long thinking is selected, otherwise the short thinking is selected; the long thinking and the short thinking are each given by a corresponding preset formulation (an illustrative sketch of this selection and of the alternate updating of claim 6 follows the claims).
- 6. The unmanned aerial vehicle scene understanding method of claim 5, wherein the step of performing reinforcement learning training on the reasoning thinking to obtain the final thinking chain comprises: alternately updating the long thinking or the short thinking through a policy network, the policy network comprising a long chain and a short chain, one of the long chain and the short chain being updated at a time while the other of the long chain and the short chain is frozen; and adjusting the process of alternate updating through a reward function to obtain the final thinking chain.
- 7. The unmanned aerial vehicle scene understanding method of claim 6, wherein the step of determining the unmanned aerial vehicle scene according to the final thinking chain comprises: stopping the alternate updating when the long chain or the short chain outputs target data; and determining the final thinking chain according to the target data, and extracting the target data through a regular-expression strategy to obtain the unmanned aerial vehicle scene.
- 8. An unmanned aerial vehicle scene understanding system, comprising an electronic device and an unmanned aerial vehicle, wherein the electronic device is in communication connection with the unmanned aerial vehicle, the unmanned aerial vehicle is used for collecting video data, and the electronic device is used for: acquiring an unmanned aerial vehicle instruction and an unmanned aerial vehicle task text, and performing sparse sampling and encoding processing on the video data collected by the unmanned aerial vehicle to obtain video unit data; performing semantic alignment on the video unit data, the unmanned aerial vehicle instruction and the unmanned aerial vehicle task text to obtain a unified characterization in which the video unit data, the unmanned aerial vehicle instruction and the unmanned aerial vehicle task text are aligned; acquiring historical thinking of unmanned aerial vehicle scene understanding, and splicing the unified characterization with the historical thinking to obtain a spliced vector; and selecting a reasoning thinking corresponding to the spliced vector from preset thinking, performing reinforcement learning training on the reasoning thinking to obtain a final thinking chain, and determining an unmanned aerial vehicle scene according to the final thinking chain.
- 9. An electronic device comprising a processor and a memory, the memory storing machine-executable instructions executable by the processor to implement the unmanned aerial vehicle scene understanding method of any one of claims 1-7.
- 10. A computer-readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing the unmanned aerial vehicle scene understanding method of any one of claims 1-7.
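Claims 5 to 7 describe a gate that picks a long or short reasoning chain from the spliced vector and then trains the two chains by alternate updates under a reward function. The exact formulas for the gate and for the two chains are not reproduced in the text above, so the Python sketch below is only an assumption-laden illustration: the gate is taken to be a cosine-similarity threshold between the spliced vector and the historical thinking, the reinforcement-learning training is reduced to a toy accept-if-reward-improves update, and the names (`select_chain`, `alternating_update`, `tau`) are hypothetical rather than taken from the patent.

```python
import numpy as np

def select_chain(spliced: np.ndarray, history: np.ndarray, tau: float = 0.5) -> str:
    """Gate between long and short thinking (claim 5).

    Assumption for illustration: cosine similarity between the spliced vector
    and the historical thinking, compared against a hyperparameter tau.
    """
    sim = float(spliced @ history /
                (np.linalg.norm(spliced) * np.linalg.norm(history) + 1e-8))
    # If the current situation looks unlike past experience, reason longer.
    return "long" if sim < tau else "short"

def alternating_update(theta_long: np.ndarray,
                       theta_short: np.ndarray,
                       reward_fn,
                       steps: int = 10,
                       lr: float = 0.01,
                       seed: int = 0):
    """Alternately update one chain while the other is frozen (claim 6).

    Toy stand-in for the reinforcement-learning training: perturb the active
    chain's parameters and keep the perturbation only if the reward improves.
    """
    rng = np.random.default_rng(seed)
    params = {"long": theta_long.copy(), "short": theta_short.copy()}
    for step in range(steps):
        active = "long" if step % 2 == 0 else "short"   # the other chain stays frozen
        trial = params[active] + lr * rng.normal(size=params[active].shape)
        if reward_fn(trial) >= reward_fn(params[active]):
            params[active] = trial
    return params["long"], params["short"]

# Toy usage: the reward peaks when the parameters approach an arbitrary target.
target = np.ones(4)
reward = lambda p: -float(np.sum((p - target) ** 2))
long_p, short_p = alternating_update(np.zeros(4), np.zeros(4), reward)
print(select_chain(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # -> "long"
```

Per claim 7, the alternation would stop once either chain emits target data, from which the unmanned aerial vehicle scene is extracted; the sketch simply stops after a fixed number of steps instead.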
Description
Unmanned aerial vehicle scene understanding method, system, equipment and storage medium

Technical Field
The invention relates to the technical field of unmanned aerial vehicle scene understanding, and in particular to an unmanned aerial vehicle scene understanding method, system, equipment and storage medium.

Background
Unmanned aerial vehicle scene understanding means that an unmanned aerial vehicle can understand and interpret its three-dimensional environment the way a person would, so as to make safe and reasonable flight decisions. Existing unmanned aerial vehicle scene understanding methods generally rely on purely visual reasoning or reasoning over pure text instructions. However, both purely visual and purely textual reasoning suffer from high latency and cannot respond in real time; they also cannot weigh the importance of different pieces of scene information, which easily leads to computational waste.

Disclosure of Invention
Accordingly, the present invention is directed to an unmanned aerial vehicle scene understanding method, system, equipment and storage medium that solve the above problems in the prior art. To achieve this object, the technical scheme adopted by the embodiments of the invention is as follows.

In a first aspect, an embodiment of the present invention provides an unmanned aerial vehicle scene understanding method, including: acquiring an unmanned aerial vehicle instruction and an unmanned aerial vehicle task text, and performing sparse sampling and encoding processing on video data collected by an unmanned aerial vehicle to obtain video unit data; performing semantic alignment on the video unit data, the unmanned aerial vehicle instruction and the unmanned aerial vehicle task text to obtain a unified characterization in which the three are aligned; acquiring historical thinking of unmanned aerial vehicle scene understanding, and splicing the unified characterization with the historical thinking to obtain a spliced vector; and selecting a reasoning thinking corresponding to the spliced vector from preset thinking, performing reinforcement learning training on the reasoning thinking to obtain a final thinking chain, and determining the unmanned aerial vehicle scene according to the final thinking chain.

In an optional embodiment, the sparse sampling and encoding processing of the video data collected by the unmanned aerial vehicle to obtain video unit data includes: acquiring a trigger time of the unmanned aerial vehicle instruction through a sparse sampler, and taking the video frame data corresponding to the trigger time in the video data as target frame data; taking the target frame data, together with its previous frame data and next frame data, as a dense frame set; if no trigger time exists, performing sparse sampling on the video data through the sparse sampler according to a preset video frame interval to obtain a sparse frame set; converting and convolving the sparse frame set or the dense frame set to obtain sub-block unit data; and processing the sub-block unit data with a depthwise separable convolution layer and a local feature extractor to obtain the video unit data.
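The sparse-sampling step just described can be pictured with a short sketch. The following Python is a minimal illustration under stated assumptions, not the patented implementation: the frame rate parameter and the default interval of 8 frames are choices made for the example, and the helper name `build_frame_set` is hypothetical; the one-frame window on either side of the trigger frame follows the description above.

```python
from typing import List, Optional, Sequence, TypeVar

Frame = TypeVar("Frame")

def build_frame_set(frames: Sequence[Frame],
                    fps: float,
                    trigger_time: Optional[float] = None,
                    interval: int = 8) -> List[Frame]:
    """Return a dense frame set around the instruction trigger time, or a
    sparse frame set sampled at a fixed interval when no trigger time exists.

    Hypothetical helper; the interval of 8 frames is an assumption for
    illustration only.
    """
    if trigger_time is not None:
        # Dense set: the frame at the trigger time plus its previous and next frames.
        idx = min(int(round(trigger_time * fps)), len(frames) - 1)
        lo, hi = max(idx - 1, 0), min(idx + 1, len(frames) - 1)
        return list(frames[lo:hi + 1])
    # Sparse set: uniform sampling at the preset video frame interval.
    return list(frames[::interval])

# Example: 100 synthetic "frames" (here just indices) at 25 fps.
frames = list(range(100))
print(build_frame_set(frames, fps=25.0, trigger_time=1.2))  # dense set: [29, 30, 31]
print(build_frame_set(frames, fps=25.0))                    # sparse set: every 8th frame
```

In the method above, the resulting dense or sparse frame set would then be converted, convolved, and passed through the depthwise separable convolution layer and local feature extractor to produce the video unit data; the sketch stops at frame selection.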
In an optional embodiment, the step of semantically aligning the video unit data, the unmanned aerial vehicle instruction and the unmanned aerial vehicle task text to obtain the unified characterization in which the three are aligned includes: constructing an embedding table through a two-way text coding mechanism, discretizing the unmanned aerial vehicle instruction into discrete symbols, and inputting the discrete symbols into the embedding table to obtain a low-dimensional vector instruction; converting the unmanned aerial vehicle task text into a high-dimensional vector through an embedding layer of a cross-modal aligner, and projecting the high-dimensional vector into a low-dimensional vector text; aligning the low-dimensional vector instruction and the low-dimensional vector text through the two-way text coding mechanism to obtain text unit data; and constructing a low-rank projection matrix through a low-rank bilinear pooling module of the cross-modal aligner, and performing semantic alignment on the video unit data and the text unit data through the low-rank projection matrix to obtain the unified characterization.

In an optional embodiment, the step of semantically aligning the video unit data and the text unit data through the low-rank projection matrix to obtain the unified characterization includes: reducing the dimensions of the video unit data and the text unit data to 64, respectively, through three low-rank projection matrices; and calculating attention over the projected data to obtain a low-rank cross-modal characterization of the video unit data and the text unit data, which is then average-pooled and projected to yield the unified characterization.
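A compact NumPy sketch may help make the data flow of the low-rank bilinear pooling step concrete. The exact formulas are not reproduced in the text above, so the softmax normalization, the scaling by the square root of the rank, and the matrix names `W_v`, `W_e`, `W_p` are assumptions for illustration; only the rank of 64, the attention matrix A, and the average pooling followed by projection come from the description.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def unified_representation(V: np.ndarray, E: np.ndarray,
                           r: int = 64, d_out: int = 256) -> np.ndarray:
    """Low-rank cross-modal alignment sketch (assumed form, not the patent's formulas).

    V: (N_v, d_v) video unit data; E: (N_e, d_e) text unit data.
    Three low-rank projections reduce both modalities to r = 64 dimensions,
    an attention matrix A relates video units to text units, and the attended
    features are average-pooled and projected to a single unified vector.
    """
    d_v, d_e = V.shape[1], E.shape[1]
    W_v = rng.normal(scale=d_v ** -0.5, size=(d_v, r))   # video projection
    W_e = rng.normal(scale=d_e ** -0.5, size=(d_e, r))   # text projection
    W_p = rng.normal(scale=r ** -0.5, size=(r, d_out))   # output projection

    Vr, Er = V @ W_v, E @ W_e                       # (N_v, r), (N_e, r)
    A = softmax(Vr @ Er.T / np.sqrt(r), axis=-1)    # (N_v, N_e) attention matrix
    Z = A @ Er                                      # (N_v, r) low-rank cross-modal repr.
    return Z.mean(axis=0) @ W_p                     # average pooling + projection

V = rng.normal(size=(10, 512))   # 10 video units
E = rng.normal(size=(6, 300))    # 6 text units (aligned instruction + task text)
print(unified_representation(V, E).shape)          # -> (256,)
```

Producing the text unit data E from the embedding table of discrete instruction symbols and the projected task text, as in the preceding embodiment, is omitted here; random matrices stand in for both modalities.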