
CN-121999392-A - Unmanned aerial vehicle vision processing method capable of updating azimuth scene graph in real time and vision processor

CN 121999392 A

Abstract

The invention relates to an unmanned aerial vehicle (UAV) vision processing method and vision processor that update an azimuth scene graph in real time, belonging to the technical field of UAV vision processing. The method performs multi-scale feature extraction and DETR detection on a visual image captured by the UAV to obtain the category and bounding box of each target in the image, and detects the relationships between targets with a relation prediction network to form a scene graph. Taking the bounding box of each target in the visual image as a key point, corresponding points are found in a satellite remote sensing image of known azimuth through feature extraction and feature matching to form matching point pairs; a homography matrix is calculated from these pairs, and the azimuth angle at which the UAV captured the visual image is estimated. The azimuth angle is combined with the scene graph to obtain a calibrated azimuth scene graph, and within each update period of the DETR and the relation prediction network, a lightweight network updates the azimuth scene graph in real time. The invention improves the usability and real-time performance of the UAV vision processor.

Inventors

  • XUE RUI
  • DENG ZIPENG
  • LUO XIAOYAN

Assignees

  • Beihang University (北京航空航天大学)

Dates

Publication Date
2026-05-08
Application Date
2026-01-07

Claims (10)

  1. An unmanned aerial vehicle vision processing method for updating an azimuth scene graph in real time, characterized by comprising the following steps: Step S1, performing multi-scale feature extraction and DETR detection on a visual image captured by an unmanned aerial vehicle to obtain the category and bounding box of each target in the visual image, and detecting the relationships between the targets with a relation prediction network to form a scene graph; Step S2, taking the bounding box of each target in the visual image as a key point, finding corresponding points in a satellite remote sensing image of known azimuth through feature extraction and feature matching to form matching point pairs, calculating a homography matrix, and estimating the azimuth angle at which the unmanned aerial vehicle captured the visual image; Step S3, combining the azimuth angle with the scene graph to obtain a calibrated azimuth scene graph; and Step S4, within each update period of the DETR and the relation prediction network, updating the azimuth scene graph in real time with a lightweight network, wherein during updating the lightweight network updates the azimuth scene graph at each moment of the current update period according to the calibrated azimuth scene graphs obtained in the last two periods.
  2. The unmanned aerial vehicle vision processing method for updating an azimuth scene graph in real time according to claim 1, wherein Step S2 comprises: Step S201, performing feature extraction on each bounding box in the visual image and on the satellite remote sensing image with a feature extraction algorithm, to obtain two sets of feature point descriptors; Step S202, matching and screening the two descriptor sets to obtain several groups of reliable matching point pairs; Step S203, calculating a homography matrix from the screened matching point pairs, and eliminating erroneous points from the matching point pairs according to their reprojection errors under the homography matrix, to obtain an accurate homography matrix; and Step S204, extracting the azimuth-related information from the homography matrix and estimating the azimuth angle of the unmanned aerial vehicle (an illustrative sketch of steps S201 to S204 follows the claims).
  3. The unmanned aerial vehicle vision processing method for updating an azimuth scene graph in real time according to claim 2, wherein in Step S201, when the average texture intensity of the visual image captured by the unmanned aerial vehicle meets the texture threshold for binary-descriptor extraction, a binary-descriptor feature extraction algorithm is adopted; otherwise, a floating-point-descriptor feature extraction algorithm is adopted; feature extraction is then performed on each bounding box in the visual image and on the satellite remote sensing image to obtain the two sets of feature point descriptors (a sketch of this selection rule also follows the claims).
  4. The unmanned aerial vehicle vision processing method for updating an azimuth scene graph in real time according to claim 3, wherein when the AKAZE feature extraction algorithm, a binary-descriptor feature extraction algorithm, is used to extract feature point descriptors, the extraction comprises: 1) applying the Perona-Malik diffusion equation to the image to construct a multi-scale image pyramid; 2) computing second derivatives for each scale image in the pyramid, and computing the scale-normalized Hessian determinant at each pixel from these second derivatives; 3) searching each scale space for local maxima of the Hessian determinant as feature points; and 4) for each feature point, generating a binary vector through local-area selection, intensity-value computation, comparison, and combination, as the binary descriptor of that feature point.
  5. The unmanned aerial vehicle vision processing method for updating an azimuth scene graph in real time according to claim 3, wherein when the SIFT feature extraction algorithm, a floating-point-descriptor feature extraction algorithm, is used to extract feature point descriptors, the extraction comprises: 1) forming the scale space of the image from a series of Gaussian-blurred images constructed at different scales, where the scale is the degree of blur; 2) constructing difference-of-Gaussian (DoG) images and detecting extreme points to obtain key points; 3) localizing the key points and determining their principal orientations, constructing a local coordinate system centered on each key point and oriented by its principal orientation, and dividing the image into sub-regions; 4) accumulating gradient histograms over all orientations in each sub-region to generate a real-valued vector descriptor; and 5) normalizing the real-valued vector descriptors to obtain the floating-point descriptors.
  6. The unmanned aerial vehicle vision processing method for updating an azimuth scene graph in real time according to claim 1, wherein Step S4 comprises: Step S401, obtaining the calibrated azimuth scene graphs produced in the two most recent update periods of the DETR and the relation prediction network, and projecting the two azimuth scene graphs to a unified dimension; Step S402, extracting the latest object-level information from the output of the DETR detector and performing feature projection to form object-level features and object-pair features; Step S403, concatenating the temporal information and the object-pair information projected to the unified dimension, and generating a history code through a multi-layer perceptron network; and Step S404, forming the input vector of the lightweight network from the comprehensive relation feature of the DETR detector, the history code, and the projection of the azimuth scene graph output by the lightweight network at the previous moment, so as to update the azimuth scene graph at the next moment, wherein in the initial input vector, the projection of the temporally closest calibrated azimuth scene graph is used in place of the projection of the azimuth scene graph output by the previous lightweight network.
  7. The unmanned aerial vehicle vision processing method for updating an azimuth scene graph in real time according to claim 6, wherein the lightweight network updates the azimuth scene graph with a bidirectional-gating and residual-update structure, the update process comprising: 1) combining the comprehensive relation feature of the DETR detector, the history code, and the projection of the azimuth scene graph output by the previous lightweight network into an input vector; 2) computing, within the bidirectional gating, a current gate that acquires current information, a history gate that retains historical information, and a fusion gate that generates candidate information; 3) fusing the current information, the historical information, and the candidate information according to the gating proportions to obtain an updated hidden state; 4) feeding the hidden state simultaneously into two lightweight multi-layer perceptrons, one mapping to the new scene graph and the other mapping to the new azimuth angle; and 5) fusing the new scene graph and the azimuth angle to obtain the updated azimuth scene graph (a sketch of this gated update follows the claims).
  8. The unmanned aerial vehicle vision processing method for updating an azimuth scene graph in real time according to claim 7, wherein the fusion yields the hidden representation $h$ as $h = g_c \odot c + g_h \odot c_h + g_f \odot \tilde{c}$, where $c$, $c_h$, and $\tilde{c}$ are the current information, the historical information, and the candidate information output by the current gate, the history gate, and the fusion gate, respectively, each gate being computed from the comprehensive relation feature of the DETR detector and the history code; $\odot$ denotes element-wise multiplication; and $\sigma$ is the activation function used to compute the gates.
  9. The unmanned aerial vehicle vision processing method for updating an azimuth scene graph in real time according to claim 8, wherein the two lightweight multi-layer perceptrons are three-layer MLP network channels; the relation graph updated in real time is $G_t = \mathrm{Sigmoid}(\mathrm{MLP}_{sg}(h))$, where $\mathrm{MLP}_{sg}$ is the three-layer MLP network for the scene graph update and $\mathrm{Sigmoid}$ is the Sigmoid function; and the azimuth angle updated in real time is $\theta_t = \theta_{t-1} + \mathrm{MLP}_{\theta}(h)$, where $\mathrm{MLP}_{\theta}$ is the three-layer MLP network for the azimuth update and $\theta_{t-1}$ is the azimuth angle of the previous period (a sketch of these two heads follows the claims).
  10. A vision processor implementing the unmanned aerial vehicle vision processing method for updating an azimuth scene graph in real time according to any one of claims 1 to 9, comprising a scene graph generation module, an azimuth angle calculation module, an azimuth scene graph calibration module, and an azimuth scene graph updating module, wherein: the scene graph generation module is used for performing multi-scale feature extraction and DETR detection on a visual image captured by the unmanned aerial vehicle to obtain the category and bounding box of each target in the visual image; the azimuth angle calculation module is used for taking the bounding box of each target in the visual image as a key point, finding corresponding points in a satellite remote sensing image of known azimuth through feature extraction and feature matching to form matching point pairs, calculating a homography matrix, and estimating the azimuth angle of the unmanned aerial vehicle; the azimuth scene graph calibration module is used for obtaining a calibrated azimuth scene graph by associating the azimuth angle with the scene graph; and the azimuth scene graph updating module is used for updating the azimuth scene graph in real time with a lightweight network, wherein during updating the lightweight network updates the azimuth scene graph at each moment of the current update period according to the calibrated azimuth scene graphs obtained in the last two periods.
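
The sketches below are illustrative only and are not part of the claims. This first one shows one possible reading of the descriptor selection of claim 3, assuming Python with OpenCV and NumPy; the texture measure (mean Sobel gradient magnitude) and the threshold value are assumptions that the claims do not fix.

```python
# Illustrative sketch of claim 3's binary-vs-floating-point descriptor choice.
# The texture measure and TEXTURE_THRESHOLD are assumptions, not from the patent.
import cv2
import numpy as np

TEXTURE_THRESHOLD = 25.0  # assumed threshold for binary-descriptor extraction

def mean_texture_intensity(gray: np.ndarray) -> float:
    """Average texture intensity, taken here as the mean gradient magnitude."""
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    return float(np.mean(cv2.magnitude(gx, gy)))

def extract_descriptors(image_bgr: np.ndarray):
    """Use AKAZE (binary) for well-textured images, SIFT (float) otherwise."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    if mean_texture_intensity(gray) >= TEXTURE_THRESHOLD:
        extractor = cv2.AKAZE_create()  # binary descriptors (claim 4)
    else:
        extractor = cv2.SIFT_create()   # floating-point descriptors (claim 5)
    keypoints, descriptors = extractor.detectAndCompute(gray, None)
    return keypoints, descriptors
```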
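Continuing under the same assumptions, this sketch illustrates steps S201 to S204 of claim 2: descriptor matching with a ratio test, RANSAC homography estimation with reprojection-error outlier rejection, and reading an azimuth angle out of the homography. The ratio-test constant, the RANSAC threshold, and the use of atan2 on the homography's rotational block are illustrative assumptions.

```python
# Sketch of claim 2 (S201-S204): matching, homography, azimuth estimation.
import cv2
import numpy as np

def estimate_azimuth(desc_uav, kp_uav, desc_sat, kp_sat):
    # S202: match the two descriptor sets; keep reliable pairs (Lowe ratio test).
    norm = cv2.NORM_HAMMING if desc_uav.dtype == np.uint8 else cv2.NORM_L2
    matcher = cv2.BFMatcher(norm)
    good = []
    for pair in matcher.knnMatch(desc_uav, desc_sat, k=2):
        if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
            good.append(pair[0])
    if len(good) < 4:
        raise ValueError("not enough reliable matches for a homography")

    src = np.float32([kp_uav[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_sat[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

    # S203: RANSAC homography; points with large reprojection error are
    # rejected, leaving an estimate refined on the inlier set.
    H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC,
                                        ransacReprojThreshold=3.0)

    # S204: extract the in-plane rotation from the homography; since the
    # satellite reference has a known azimuth, this angle calibrates the
    # azimuth at which the UAV image was captured.
    azimuth_deg = float(np.degrees(np.arctan2(H[1, 0], H[0, 0])))
    return H, azimuth_deg
```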
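Next, a minimal PyTorch sketch of the bidirectional-gating and residual-update structure of claims 7 and 8. All layer dimensions, the sigmoid gate activation, and the tanh candidate activation are assumptions; the claims specify only the three gates and the element-wise gated fusion.

```python
# Sketch of claims 7-8: three gates over an input vector built from the
# relation feature, the history code, and the previous state, fused into
# an updated hidden state. Dimensions and activations are assumptions.
import torch
import torch.nn as nn

class GatedSceneGraphUpdater(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        in_dim = feat_dim + 2 * hidden_dim  # [relation feat; history code; prev state]
        self.gate_cur = nn.Linear(in_dim, hidden_dim)    # current gate
        self.gate_hist = nn.Linear(in_dim, hidden_dim)   # history gate
        self.gate_fuse = nn.Linear(in_dim, hidden_dim)   # fusion gate
        self.cand = nn.Linear(in_dim, hidden_dim)        # candidate information
        self.proj_cur = nn.Linear(feat_dim, hidden_dim)  # current information
        self.proj_hist = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, relation_feat, history_code, prev_state):
        x = torch.cat([relation_feat, history_code, prev_state], dim=-1)
        g_c = torch.sigmoid(self.gate_cur(x))
        g_h = torch.sigmoid(self.gate_hist(x))
        g_f = torch.sigmoid(self.gate_fuse(x))
        cur = self.proj_cur(relation_feat)   # current information
        hist = self.proj_hist(history_code)  # historical information
        cand = torch.tanh(self.cand(x))      # candidate information
        # Claim 8's fusion rule: element-wise gated sum of the three terms.
        return g_c * cur + g_h * hist + g_f * cand
```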
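Finally, a sketch of the two three-layer MLP heads of claim 9, again in PyTorch. The hidden sizes, the sigmoid on the scene-graph head, and the additive residual update of the azimuth angle are assumptions consistent with the claim wording.

```python
# Sketch of claim 9: two three-layer MLP heads over the hidden state,
# one producing the updated relation graph, one the updated azimuth.
import torch
import torch.nn as nn

def three_layer_mlp(in_dim: int, hidden: int, out_dim: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

class AzimuthSceneGraphHeads(nn.Module):
    def __init__(self, hidden_dim: int, graph_dim: int):
        super().__init__()
        self.graph_head = three_layer_mlp(hidden_dim, hidden_dim, graph_dim)
        self.angle_head = three_layer_mlp(hidden_dim, hidden_dim, 1)

    def forward(self, hidden_state, prev_azimuth):
        graph = torch.sigmoid(self.graph_head(hidden_state))    # relation scores in [0, 1]
        azimuth = prev_azimuth + self.angle_head(hidden_state)  # residual angle update
        return graph, azimuth
```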

Description

Unmanned aerial vehicle vision processing method capable of updating azimuth scene graph in real time and vision processor

Technical Field

The invention relates to the technical field of unmanned aerial vehicle vision processing, and in particular to an unmanned aerial vehicle vision processing method and vision processor that update an azimuth scene graph in real time.

Background

A scene graph is a structured data representation intended to describe a visual scene semantically, clearly, and comprehensively. In a scene graph, the nodes represent the objects detected in the scene and the edges represent the interrelationships between those objects. The goal of scene graph generation (SGG) is to automatically detect objects in an input image or video, predict the semantic relationships between them, and ultimately construct the corresponding scene graph. This technology is key to high-level scene understanding and drives computer vision from simple object recognition toward deeper relational reasoning and cognitive intelligence. Because of its powerful scene description capability, scene graph generation has become core support for a variety of downstream applications, such as visual question answering (VQA), image caption generation, and robotic environment perception and interaction.

Applied to an unmanned aerial vehicle vision system, scene graph generation can provide a depth of scene understanding far beyond that of traditional object detection. A structured scene graph tells the unmanned aerial vehicle not only what it sees but also how the seen objects relate to one another, providing rich semantic information for autonomous navigation, path planning, target tracking, and intelligent decision making. However, the particularities of the unmanned aerial vehicle platform pose unique challenges for scene graph generation, such as the distinctive aerial top-down and oblique viewing angles, missing azimuth information, high real-time requirements, and limited onboard computing resources.

Currently, scene graph generation research for unmanned aerial vehicle applications is still at an early stage, and the prior art has the following main limitations. First, existing mainstream scene graph generation (SGG) models mostly pursue high accuracy on standard datasets and generally adopt large, complex network structures, such as models based on large backbone networks (e.g., ResNet) and attention mechanisms (e.g., the Transformer). These models are computationally intensive and slow at inference. For example, the DETR (DEtection TRansformer) model commonly used in current research, despite its breakthroughs in object detection, has difficulty meeting the high-frame-rate real-time update requirements of unmanned aerial vehicle flight; deployed directly on an unmanned aerial vehicle, it would cause serious delays in scene understanding and fail to reflect dynamic changes in the environment in time. Second, existing scene graph generation methods focus mainly on general semantic relationships such as "on", "under", and "wearing", and research specifically on encoding the orientation information required for unmanned aerial vehicle navigation is very limited.
Conventional scene graphs cannot answer the basic question of which direction the target lies in, which makes it difficult for an unmanned aerial vehicle to use them for efficient path planning and behavioral decision making. The prior art therefore lacks an effective solution that simultaneously achieves high real-time performance, accommodates the unique viewing angles of unmanned aerial vehicles, and integrates critical azimuth information into the scene graph.

Disclosure of the Invention

In view of the above analysis, the invention aims to disclose an unmanned aerial vehicle vision processing method and vision processor that update an azimuth scene graph in real time, in which azimuth information is added to the scene graph so that an unmanned aerial vehicle agent can understand the positional information of the scene more deeply, and higher-frequency real-time updating is realized to improve the usability and real-time performance of the unmanned aerial vehicle vision processor.

The invention discloses an unmanned aerial vehicle vision processing method for updating an azimuth scene graph in real time, comprising the following steps: Step S1, performing multi-scale feature extraction and DETR detection on a visual image captured by an unmanned aerial vehicle to obtain the category and bounding box of each target in the visual image, and detecting the relationships between the targets with a relation prediction network to form a scene graph (a data-flow sketch of this step follows below); Step S2, taking the bounding box of each target in the visual image as a key point, finding corresponding points in a satellite remote sensing image of known azimuth through feature extraction and feature matching to form matching point pairs, calculating a homography matrix, and estimating the azimuth angle at which the unmanned aerial vehicle captured the visual image.
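
As a rough illustration of Step S1's data flow, the sketch below assumes a Python environment where `detector` and `relation_net` are hypothetical trained modules standing in for the DETR detector and the relation prediction network; only the flow from detections to pairwise relation prediction to an assembled scene graph mirrors the method described here, and the relation threshold is an assumption.

```python
# Hypothetical Step S1 sketch: detection, pairwise relation prediction,
# scene graph assembly. `detector` and `relation_net` are placeholder modules.
import itertools
import torch

def build_scene_graph(image_tensor, detector, relation_net, rel_threshold=0.5):
    # Hypothetical DETR wrapper returning boxes (N,4), labels (N,), features (N,D).
    boxes, labels, feats = detector(image_tensor)
    nodes = [{"label": int(l), "box": b.tolist()} for l, b in zip(labels, boxes)]
    edges = []
    # Score every ordered object pair with the relation prediction network.
    for i, j in itertools.permutations(range(len(nodes)), 2):
        pair_feat = torch.cat([feats[i], feats[j]], dim=-1)
        rel_scores = relation_net(pair_feat)  # per-predicate probabilities (assumed)
        score, predicate = rel_scores.max(dim=-1)
        if score.item() >= rel_threshold:
            edges.append((i, int(predicate), j))  # (subject, relation, object)
    return {"nodes": nodes, "edges": edges}
```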