
US-12626487-B2 - Object detection based on atrous convolution and adaptive processing

US 12626487 B2

Abstract

Systems and methods for object detection can include obtaining one or more images and processing the one or more images with a machine-learned object detection model to generate one or more bounding boxes and one or more object classifications. The object detection model may perform atrous convolution, feature fusion, feature map generation, and prediction based on feature extraction.
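
For intuition about the atrous convolutions referenced above: spacing the taps of a convolution kernel one or more pixels apart (a dilation) enlarges the receptive field without adding parameters, which is what makes the approach attractive for capturing context around small objects. The short Python illustration below is editorial, not taken from the patent.

```python
# Effective coverage of a dilated (atrous) kernel: spacing the taps
# d - 1 pixels apart stretches a k-tap kernel across a wider window
# while the parameter count stays the same.
def effective_kernel_size(kernel: int, dilation: int) -> int:
    return kernel + (kernel - 1) * (dilation - 1)

for d in (1, 2, 4):
    # A 3x3 kernel covers 3, 5, and 9 pixels per side at these dilations.
    print(f"dilation={d}: covers {effective_kernel_size(3, d)} pixels per side")
```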

Inventors

  • Benjamin Eli Klein Sugerman

Assignees

  • NORTHROP GRUMMAN SYSTEMS CORPORATION

Dates

Publication Date
2026-05-12
Application Date
2023-09-12

Claims (19)

  1. A computing system for object detection, the system comprising: one or more processors; and one or more non-transitory computer-readable media that collectively store instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising: obtaining image data, wherein the image data comprises one or more images, wherein the one or more images are descriptive of one or more objects in an environment; processing the image data with an object detection model to generate one or more bounding boxes and one or more object classifications, wherein the object detection model was trained to detect and classify objects in an input image, wherein processing the image data with the object detection model to generate the one or more bounding boxes and the one or more object classifications comprises: performing a first atrous convolution on the one or more images with a first convolutional block to generate a first convolution output, wherein the first atrous convolution comprises convolution kernels that are spaced one or more pixels apart; performing a second atrous convolution on the one or more images with a second convolutional block to generate a second convolution output; generating one or more feature maps based on the first convolution output and the second convolution output, wherein the one or more feature maps are descriptive of a plurality of features in the one or more images; performing spatial pooling on the one or more feature maps to generate a three-dimensional tensor representation; processing the three-dimensional tensor representation with a prediction block to generate the one or more bounding boxes and the one or more object classifications; wherein the object detection model comprises a plurality of bottlenecks, wherein adaptive attention is performed at the end of each bottleneck to increase semantic information capture across each feature layer, and wherein the object detection model was trained based on a combined loss function comprising a standard focal loss for objects, a class-balanced focal loss, object label smoothing, a bounding box loss, and a balanced object loss; and providing the one or more bounding boxes and the one or more object classifications as output.
  2. The system of claim 1, wherein the operations further comprise: processing the first convolution output and the second convolution output with an attention block to generate an attention output, wherein the attention block maintains semantic information across processing blocks.
  3. The system of claim 1, wherein generating one or more feature maps based on the first convolution output and the second convolution output comprises: processing the one or more images with a third convolutional block to generate a third convolution output, wherein the third convolution output is descriptive of one or more kernels generated without a pixel skip.
  4. The system of claim 3, wherein generating one or more feature maps based on the first convolution output and the second convolution output comprises: generating a fused feature map based at least in part on combining the first convolution output, the second convolution output, and the third convolution output.
  5. The system of claim 4, wherein the first convolution output, the second convolution output, and the third convolution output are combined via a learned weighted sum and pointwise convolution to generate a single output feature map.
  6. The system of claim 4, wherein each of the first convolution output, the second convolution output, and the third convolution output comprises a different receptive-field size.
  7. The system of claim 1, wherein generating one or more feature maps based on the first convolution output and the second convolution output comprises: generating a plurality of feature maps; wherein processing the image data with the object detection model to generate the one or more bounding boxes and the one or more object classifications further comprises: processing the plurality of feature maps with a fusion block to spatially filter information.
  8. The system of claim 1, wherein processing the image data with the object detection model to generate the one or more bounding boxes and the one or more object classifications comprises: processing feature data with a lambda block, wherein the lambda block is configured to generate contextual representations for prediction.
  9. The system of claim 1, wherein processing the image data with the object detection model to generate the one or more bounding boxes and the one or more object classifications comprises: processing the second convolution output with an atrous upsample block to generate an upsampled output.
  10. The system of claim 9, wherein processing the image data with the object detection model to generate the one or more bounding boxes and the one or more object classifications comprises: concatenating the upsampled output and the first convolution output to generate a concatenated upsampled dataset; processing the concatenated upsampled dataset with an atrous downsampling block to generate a downsampled dataset; and generating a feature map based on a concatenation of the second convolution output and the downsampled dataset.
  11. A computer-implemented method for training an object detection model, the method comprising: obtaining, by a computing system comprising one or more processors, training data, wherein the training data comprises image data, one or more ground truth bounding boxes, and one or more ground truth object classifications, wherein the image data comprises one or more images, wherein the one or more images are descriptive of one or more objects in an environment, wherein the one or more ground truth bounding boxes are descriptive of a location for the one or more objects, and wherein the one or more ground truth object classifications are descriptive of an object type for each of the one or more objects; processing, by the computing system, the image data with the object detection model to generate one or more predicted bounding boxes and one or more predicted classifications, wherein the object detection model comprises a plurality of atrous convolution blocks that process the one or more images to generate kernels by skipping pixels during processing, wherein the kernels are processed to generate a plurality of feature maps that are then processed to generate prediction data, wherein the prediction data is processed to generate the one or more predicted bounding boxes and the one or more predicted classifications, wherein the object detection model comprises a plurality of bottlenecks, wherein adaptive attention is performed at the end of each bottleneck to increase semantic information capture across each feature layer; evaluating, by the computing system, a combined loss function based on a standard focal loss for objects, a class-balanced focal loss, object label smoothing, a bounding box loss, and a balanced object loss, wherein evaluating the combined loss function comprises: evaluating, by the computing system, a first loss function that evaluates a difference between the one or more predicted bounding boxes and the one or more ground truth bounding boxes; evaluating, by the computing system, a second loss function that evaluates a difference between the one or more predicted classifications and the one or more ground truth object classifications; and adjusting, by the computing system, one or more parameters of the object detection model based at least in part on the combined loss function.
  12. The method of claim 11, wherein processing the image data with the object detection model comprises: generating, by the computing system, an objectness output, wherein the objectness output is descriptive of a presence prediction, wherein the presence prediction is descriptive of whether one or more portions of the one or more images are descriptive of one or more objects.
  13. The method of claim 12, wherein evaluating the combined loss function further comprises: evaluating, by the computing system, a third loss function that evaluates the objectness output.
  14. The method of claim 11, wherein the object detection model comprises atrous upsampling and atrous downsampling.
  15. The method of claim 14, wherein feature data processed with the atrous upsampling is concatenated with upstream feature data and then processed with the atrous downsampling.
  16. One or more non-transitory computer-readable media that collectively store instructions that, when executed by one or more computing devices, cause the one or more computing devices to perform operations, the operations comprising: obtaining image data, wherein the image data comprises one or more images, wherein the one or more images are descriptive of one or more objects in an environment; processing the image data with a machine-learned model, wherein the machine-learned model comprises: a backbone block, wherein the backbone block processes an input image to perform one or more atrous convolutions; a neck block, wherein the neck block obtains a plurality of backbone outputs from a plurality of backbone layers, wherein the neck block processes the plurality of backbone outputs to generate a plurality of feature maps; a head block, wherein the head block processes the plurality of feature maps to generate prediction data; and a prediction block, wherein the prediction block processes the prediction data to generate one or more prediction outputs associated with a detection of the one or more objects; wherein the machine-learned model comprises a plurality of bottlenecks, wherein adaptive attention is performed at the end of each bottleneck to increase semantic information capture across each feature layer, and wherein the machine-learned model was trained based on a combined loss function comprising a standard focal loss for objects, a class-balanced focal loss, object label smoothing, a bounding box loss, and a balanced object loss; and in response to processing the image data with the machine-learned model, generating output data, wherein the output data comprises one or more bounding boxes and one or more object classifications, wherein the one or more bounding boxes are associated with one or more locations for the one or more objects, and wherein the one or more object classifications are descriptive of one or more classifications of the one or more objects.
  17. The one or more non-transitory computer-readable media of claim 16, wherein the head block comprises: one or more atrous upsample blocks; one or more atrous downsample blocks; and a plurality of fusion blocks.
  18. The one or more non-transitory computer-readable media of claim 16, wherein the machine-learned model comprises: a plurality of convolutional blocks; one or more self-attention blocks; and one or more normalization blocks.
  19. The one or more non-transitory computer-readable media of claim 16, wherein generating the output data comprises: generating an annotated image that is descriptive of the one or more images annotated with the one or more bounding boxes and the one or more object classifications.
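
To make the upsample/concatenate/downsample flow recited in claims 9-10 (and echoed in claims 14-15 and 17) concrete, the following is a minimal, hypothetical PyTorch sketch. The claims do not specify how the atrous upsample and downsample blocks are constructed; they are interpreted here as plain resampling followed by a dilated convolution, and all block names, channel counts, and scale factors are editorial assumptions, not the patented design.

```python
# Hypothetical reading of "atrous upsample"/"atrous downsample":
# resample spatially, then apply a dilated 3x3 convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AtrousResample(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, scale: float):
        super().__init__()
        self.scale = scale  # scale > 1 upsamples, scale < 1 downsamples
        # padding=2 with dilation=2 preserves spatial size for a 3x3 kernel
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=2, dilation=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.interpolate(x, scale_factor=self.scale, mode="nearest")
        return self.conv(x)

# Flow per claim 10: upsample the second (deeper) convolution output,
# concatenate with the first output, downsample the result, then
# concatenate again with the second output to form a feature map.
first_out = torch.randn(1, 64, 80, 80)    # shallower, higher resolution
second_out = torch.randn(1, 128, 40, 40)  # deeper, lower resolution

up = AtrousResample(128, 64, scale=2.0)(second_out)    # -> (1, 64, 80, 80)
cat_up = torch.cat([up, first_out], dim=1)             # -> (1, 128, 80, 80)
down = AtrousResample(128, 128, scale=0.5)(cat_up)     # -> (1, 128, 40, 40)
feature_map = torch.cat([second_out, down], dim=1)     # -> (1, 256, 40, 40)
print(feature_map.shape)
```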

Description

FIELD

The present disclosure relates generally to object detection. More particularly, the present disclosure relates to an object detection model that leverages atrous convolution and adaptive processing to perform object detection that is semantically aware and can be utilized for small object detection.

BACKGROUND

Deep learning has been utilized in the field of object detection and classification (which can include Automated Target Recognition, or "ATR"), yielding improvements in accuracy and reductions in false-alarm rates of up to 50% compared to traditional machine-learning methods. Existing deep-learning architectures may generate results accurately and/or quickly on datasets of color images of common objects such as dogs, bicycles, and cars. However, their performance can be extremely poor on small objects (e.g., objects smaller than 32 pixels), with average precision less than half the detection accuracy achieved on large objects (e.g., objects larger than 96 pixels). To reach reasonable accuracy on small objects, existing models may require 40 million parameters or more.

SUMMARY

Aspects and advantages of embodiments of the present disclosure will be set forth in part in the following description, or can be learned from the description, or can be learned through practice of the embodiments. One example aspect of the present disclosure is directed to a computing system for object detection. The system can include one or more processors and one or more non-transitory computer-readable media that collectively store instructions that, when executed by the one or more processors, cause the computing system to perform operations. The operations can include obtaining image data. The image data can include one or more images. In some implementations, the one or more images can be descriptive of one or more objects in an environment. The operations can include processing the image data with an object detection model to generate one or more bounding boxes and one or more object classifications. The object detection model may have been trained to detect and classify objects in an input image. Processing the image data with the object detection model to generate the one or more bounding boxes and the one or more object classifications can include performing a first atrous convolution on the one or more images with a first convolutional block to generate a first convolution output. The first atrous convolution can include convolution kernels that are spaced one or more pixels apart. In some implementations, processing the image data with the object detection model to generate the one or more bounding boxes and the one or more object classifications can include performing a second atrous convolution on the one or more images with a second convolutional block to generate a second convolution output and generating one or more feature maps based on the first convolution output and the second convolution output. The one or more feature maps can be descriptive of a plurality of features in the one or more images. Processing the image data with the object detection model to generate the one or more bounding boxes and the one or more object classifications can include performing spatial pooling on the one or more feature maps to generate a three-dimensional tensor representation and processing the three-dimensional tensor representation with a prediction block to generate the one or more bounding boxes and the one or more object classifications.
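
As a concrete but non-authoritative illustration of the parallel atrous branches and learned-weighted fusion described in the summary (and recited in claims 1 and 3-6), the following PyTorch sketch runs three 3x3 convolutions with different dilation rates and combines them via a learned weighted sum and a pointwise convolution. Channel counts, dilation rates, and the softmax weighting are illustrative assumptions, not the patented configuration.

```python
# A minimal sketch (assumed configuration, not the patented implementation)
# of parallel atrous convolutions fused per claims 4-6.
import torch
import torch.nn as nn

class AtrousFusionBlock(nn.Module):
    """Three parallel 3x3 convolutions with different receptive fields,
    combined via a learned weighted sum and a pointwise convolution."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # dilation=1 is a standard convolution (no pixel skip);
        # dilation>1 spaces the kernel taps one or more pixels apart.
        self.branch1 = nn.Conv2d(in_ch, out_ch, 3, padding=1, dilation=1)
        self.branch2 = nn.Conv2d(in_ch, out_ch, 3, padding=2, dilation=2)
        self.branch3 = nn.Conv2d(in_ch, out_ch, 3, padding=4, dilation=4)
        # Learned scalar weights for the sum, then a 1x1 (pointwise)
        # convolution to mix channels into a single output feature map.
        self.weights = nn.Parameter(torch.ones(3))
        self.pointwise = nn.Conv2d(out_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.weights, dim=0)
        fused = (w[0] * self.branch1(x)
                 + w[1] * self.branch2(x)
                 + w[2] * self.branch3(x))
        return self.pointwise(fused)

# Example: a 640x640 RGB image yields a feature map of the same spatial
# size, with each branch contributing a different receptive-field size.
features = AtrousFusionBlock(3, 64)(torch.randn(1, 3, 640, 640))
print(features.shape)  # torch.Size([1, 64, 640, 640])
```

Because each branch sees a different effective window, the fused map carries both fine detail (dilation 1) and wider context (dilations 2 and 4) at the same resolution, which is the property the disclosure relies on for small-object detection.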
The operations can include providing the one or more bounding boxes and the one or more object classifications as output. Another example aspect of the present disclosure is directed to a computer-implemented method for training an object detection model. The method can include obtaining, by a computing system including one or more processors, training data. The training data can include image data, one or more ground truth bounding boxes, and one or more ground truth object classifications. The image data can include one or more images. In some implementations, the one or more images can be descriptive of one or more objects in an environment. The one or more ground truth bounding boxes can be descriptive of a location for the one or more objects. The one or more ground truth object classifications can be descriptive of an object type for each of the one or more objects. The method can include processing, by the computing system, the image data with the object detection model to generate one or more predicted bounding boxes and one or more predicted classifications. The object detection model can include a plurality of atrous convolution blocks that process the one or more images to generate kernels by skipping pixels during processing. In some implementations, the kernels can be processed to generate a plurality of feature maps that are then processed to generate prediction data. The prediction data can be processed to generate the one or more predicted bounding boxes and the one or more predicted classifications.
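
The combined training loss described above (and recited in claims 1, 11, and 16) can be sketched as follows. The exact forms of the class-balanced focal loss and the "balanced object loss" are not specified in this excerpt, so the sketch substitutes common stand-ins (sigmoid focal loss, label smoothing on the classification targets, smooth-L1 box regression) with illustrative, untuned weights; it should be read as an assumption-laden approximation, not the claimed loss function.

```python
# A minimal sketch of a combined detection loss; term definitions and
# weights are editorial assumptions standing in for the claimed losses.
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def combined_detection_loss(cls_logits: torch.Tensor,   # (N, num_classes)
                            cls_targets: torch.Tensor,  # one-hot, same shape
                            pred_boxes: torch.Tensor,   # (N, 4)
                            gt_boxes: torch.Tensor,     # (N, 4)
                            objectness: torch.Tensor,   # (N,) logits
                            obj_targets: torch.Tensor,  # (N,) in {0, 1}
                            smoothing: float = 0.1) -> torch.Tensor:
    # Classification: focal loss on label-smoothed one-hot targets.
    smoothed = cls_targets * (1.0 - smoothing) + smoothing / cls_targets.shape[-1]
    cls_loss = sigmoid_focal_loss(cls_logits, smoothed, reduction="mean")

    # Bounding-box regression (smooth L1 as a common stand-in).
    box_loss = F.smooth_l1_loss(pred_boxes, gt_boxes)

    # Objectness ("presence") term, per the objectness output of claims 12-13.
    obj_loss = sigmoid_focal_loss(objectness, obj_targets.float(),
                                  reduction="mean")

    # Illustrative equal weighting; the actual balance would be tuned.
    return cls_loss + box_loss + obj_loss
```

During training, this scalar would be backpropagated and the model parameters adjusted, matching the evaluate-then-adjust structure of claim 11.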