
EP-4738274-A1 - AERIAL IMAGE-BASED OBJECT DETECTION METHOD, SYSTEM AND TRAINING METHOD


Abstract

Aerial image-based object detection method using a neural network model comprising an input size, the method comprising the following steps: - taking a 2D image (1), - dividing the taken 2D image (1) into patches (2) of size equal to the input size of the neural network model, - saving the coordinates of the same reference point for each of the patches (2), - resizing the taken 2D image (1) to the input size of the neural network model, - saving the coordinate of a reference point of the resized taken 2D image (1) and the scale of the resize, - stacking into a batch (3) the patches (2) and the resize of the taken 2D image (1), - passing the batch (3) to the neural network model to determine object detections, - transforming the local detections to a reunified image (4) by using the saved patch coordinates.
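The patch/resize/batch pipeline in the abstract can be sketched in NumPy as follows. This is an illustrative sketch only: the function and variable names, the non-overlapping tiling and the nearest-neighbour resize are assumptions for clarity, not the patent's implementation.

```python
import numpy as np

def make_batch(image: np.ndarray, input_size: int):
    """Split a 2D image into fixed-size patches plus one resized full view.
    The upper-left coordinate of each patch and the resize scale are saved
    so that local detections can later be mapped back to a reunified image."""
    h, w = image.shape[:2]
    patches, coords = [], []
    # Non-overlapping tiling for simplicity; the claims also allow overlap.
    for y in range(0, h - input_size + 1, input_size):
        for x in range(0, w - input_size + 1, input_size):
            patches.append(image[y:y + input_size, x:x + input_size])
            coords.append((x, y))  # saved reference point: upper-left corner
    # Resize the whole image to the network input size (nearest neighbour).
    ys = np.arange(input_size) * h // input_size
    xs = np.arange(input_size) * w // input_size
    resized = image[np.ix_(ys, xs)]
    scale = (w / input_size, h / input_size)  # saved resize scale
    batch = np.stack(patches + [resized])     # stacked batch for the model
    return batch, coords, scale
```

With the saved `coords`, a detection at `(px, py)` inside patch `i` maps back to `(coords[i][0] + px, coords[i][1] + py)` in the reunified image; detections on the resized view are scaled by `scale`.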

Inventors

  • BELENGUER ALMECIJA, Salvador
  • Sánchez de Rivera Córdoba, Diego

Assignees

  • Airbus Defence and Space, S.A.U.

Dates

Publication Date
2026-05-06
Application Date
2024-10-30

Claims (11)

  1. Aerial image-based object detection method comprising the use of a neural network model on a Graphics Processing Unit (GPU), the neural network model comprising an input size, characterized in that the method comprises the following steps: - taking a 2D image (1) by an image capturing device, - dividing in the Graphics Processing Unit (GPU) the taken 2D image (1) into patches (2) of size equal to the input size of the neural network model, - saving in the Graphics Processing Unit (GPU) the coordinates of the same reference point for each of the patches (2), - resizing the taken 2D image (1) to the input size of the neural network model in the Graphics Processing Unit (GPU), - saving in the Graphics Processing Unit (GPU) the coordinate of a reference point of the resized taken 2D image (1) and the scale of the resize, - stacking into a batch (3) the patches (2) and the resize of the taken 2D image (1) in the Graphics Processing Unit (GPU), - passing the batch (3) to the neural network model to determine object detections in each stacked patch (2) and in the resized 2D image (1), - transforming the local detections to a reunified image (4) by using the saved patch coordinates so that said reunified image is formed.
  2. Aerial image-based object detection method, according to claim 1, wherein the reference point of the saved patch coordinate is the upper-left corner of the patches (2).
  3. Aerial image-based object detection method, according to any preceding claim, wherein the 2D image (1) is a high-resolution image.
  4. Aerial image-based object detection method, according to any preceding claim, wherein the patches (2) are overlapping patches (2).
  5. Aerial image-based object detection method, according to any preceding claim, wherein detected objects are represented in the images by a bounding box.
  6. Aerial image-based object detection method, according to claim 5, wherein it comprises the additional step of suppressing or merging bounding boxes in the images.
  7. Aerial image-based object detection method, according to any preceding claim, wherein when resizing the taken 2D image (1) to the input size of the neural network model it also comprises the step of adding constant padding in order not to deform the original image (1).
  8. Aerial image-based object detection method, according to any preceding claim, wherein the patches (2) are square or rectangular.
  9. Training method of a neural network model of the aerial image-based detection method according to any preceding claim on a Graphics Processing Unit (GPU), wherein from a training data set comprising 2D images the training method comprises the step of performing training data augmentation by: - using labels in the form of bounding boxes or instance segmentation masks to generate binary mask images (6) of the 2D images, - cropping regions (7) around objects in the generated binary mask images (6), said cropped regions (7) being of size equal to the input size of the neural network model, - feeding the images (8) corresponding to the cropped regions (7) to the neural network model.
  10. A computer-readable storage medium comprising instructions which, when executed in a Graphics Processing Unit (GPU), causes the Graphics Processing Unit (GPU) to carry out the method of any of claims 1 to 8.
  11. Aerial image-based object detection system comprising a neural network model running on a Graphics Processing Unit (GPU), the system characterized in that it comprises: - an image capturing device configured for taking a 2D image (1), - the Graphics Processing Unit (GPU) being configured for: - receiving the taken 2D image (1) from the image capturing device, - dividing the taken 2D image (1) into patches (2) of size equal to the input size of the neural network model, - saving the coordinate of the same reference point for each of the patches (2), - resizing the taken image (1) to the input size of the neural network model, - saving the coordinate of a reference point of the resized taken image (1) and the scale of the resize, - stacking into a batch (3) the patches (2) and the resize of the taken image (1) in the Graphics Processing Unit (GPU), - passing the batch (3) to the neural network model to determine object detections in each stacked patch (2) and in the resized image (1), - transforming the local detections to a reunified image (4) by using the saved patch coordinates.
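The training-data augmentation of claim 9 (label-derived binary masks, crops of network-input size around objects) can be sketched as below. All names are illustrative assumptions; bounding-box labels are used here, though the claim equally allows instance segmentation masks.

```python
import numpy as np

def crop_around_objects(image: np.ndarray, boxes, input_size: int):
    """Sketch of the claim-9 augmentation: labels generate a binary mask
    image, and regions of size equal to the network input size are cropped
    around each object before being fed to the neural network model."""
    h, w = image.shape[:2]
    mask = np.zeros((h, w), dtype=np.uint8)
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = 1  # binary mask image generated from the labels
    crops = []
    for x0, y0, x1, y1 in boxes:
        cx, cy = (x0 + x1) // 2, (y0 + y1) // 2
        # Clamp so the input_size crop stays fully inside the image.
        x = min(max(cx - input_size // 2, 0), w - input_size)
        y = min(max(cy - input_size // 2, 0), h - input_size)
        crops.append((image[y:y + input_size, x:x + input_size],
                      mask[y:y + input_size, x:x + input_size]))
    return crops
```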

Description

FIELD OF THE INVENTION

The present invention relates to the field of image-based detection and uses data-driven Artificial Intelligence (AI), more specifically Deep Learning (DL).

BACKGROUND OF THE INVENTION

Currently, different domains like autonomous drones, robotics and self-driving cars benefit from Deep Learning (DL) for distinct intended functions such as Situation Awareness (SA), Guidance, Navigation and Control (GNC) and/or Intelligence, Surveillance, and Reconnaissance (ISR). Compared to classical methods in Computer Vision (CV), Deep Learning (DL) excels at providing understanding of the environment and intelligent, complex pattern recognition. Apart from image classification, object detection and segmentation in 2D images have been among the most extensively studied problems since the rise of neural networks, with AlexNet winning the Large Scale Visual Recognition Challenge (LSVRC) in 2012.

With the advent of the Deep Learning (DL) era, the current state of the art in object detection is mostly dominated by two architectural categories, namely Convolutional Neural Networks (CNNs) and visual transformers, or a mixture of both. In the case of Convolutional Neural Networks (CNNs), most architectures for object detection include a backbone/encoder for extracting features, typically variants of networks used for classification; a neck for multi-scale feature fusion to detect at multiple scales; and a detection head to decode the features into object detections and category scores. Furthermore, the task can be accomplished in a single pass with One-Stage Detectors (OSDs) or in a dual proposal-refinement pass with Two-Stage Detectors (TSDs). While the first group is more real-time compliant, the second group tends to provide better performance at the expense of more computational cost.
In addition to this, the detection can be anchor-based, inferring bounding box deviations from a predefined grid, which is very sensitive to grid selection and benefits from having multiple grids, or performed in a more generalizable anchor-free fashion by predicting the bounding box directly. In object detection, a bounding box is used to describe the spatial location of an object. The bounding box is square or rectangular and is determined by the x and y coordinates of the upper-left corner of the rectangle and those of the lower-right corner. Another commonly used bounding box representation is the (x,y) coordinates of the bounding box center together with the width and height of the box.

Many datasets have emerged as a way to fairly benchmark a wide variety of neural networks on different tasks. For image classification and/or object detection, popular datasets include ImageNet, on which object detector backbones/encoders are trained, Pascal VOC12, MS COCO, KITTI, etc. These datasets mostly include low-resolution images (640 × 480) with considerably large objects and pixel coverage, 60% of the image size on average. Because of this, while a pretrained model might detect successfully on those types of input data, the performance yielded on small object detection datasets like VisDrone and xView is considerably reduced. Since small object detection is usually the case in aerial/space views captured with high-resolution, high-end cameras, it is unsurprising that out-of-the-box detectors struggle in these operational environments. Relatively small pixel coverage pushes the limits of neural-based methods, with greater needs in terms of memory/computation.
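The two bounding-box representations described above are interchangeable; a pair of hypothetical helper functions (not from the patent) makes the conversion explicit:

```python
def corners_to_center(x0, y0, x1, y1):
    """Corner form (upper-left, lower-right) -> center/width/height form."""
    w, h = x1 - x0, y1 - y0
    return x0 + w / 2, y0 + h / 2, w, h

def center_to_corners(cx, cy, w, h):
    """Center/width/height form -> corner form (upper-left, lower-right)."""
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2
```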
Detecting small-looking objects normally requires higher-resolution cameras, but this implies drawbacks regarding the use of neural networks:

  • As resolution is increased, the depth and width of neural networks also need to be scaled up to maintain the optimal structure of the architecture, causing an exponential rise in computational needs.
  • Bigger neural networks have more parameters, typically requiring more training data in order to fill the extra capacity of the network. In turn, this increases the overall development/testing time and cost.
  • Since training the neural network takes longer as it is made bigger and needs to process full-resolution images, development/testing efforts become even longer and more expensive.
  • As neural networks grow in size, inference/prediction also becomes slower and more hardware (HW) demanding, which might not be suitable for resource-constrained real-time applications.
  • There is a limit to how much resolution a neural network can be trained for, depending also on the memory available in the Graphics Processing Unit (GPU) or Deep Learning (DL)-specific ASIC of the training infrastructure.
  • Neural network inference/prediction in the deployment application is performed in a streaming fashion, not benefiting at all from batch processing on Graphics Processing Units (GPUs).

SUMMARY OF THE INVENTION

It is an object of the current invention an aerial image-based object detection