US-20260127854-A1 - OBJECT DETECTION WITH A DEEP LEARNING ACCELERATOR OF ARTIFICIAL NEURAL NETWORKS


Abstract

Systems, devices, and methods related to an object detector and a Deep Learning Accelerator are described. For example, a computing apparatus has an integrated circuit device with the Deep Learning Accelerator configured to execute instructions generated by a compiler from a description of an artificial neural network of the object detector. The artificial neural network includes a first cross stage partial network to extract features from an image and a second cross stage partial network to combine the features to identify a region of interest in the image showing an object. The artificial neural network uses a technique of minimum cost assignment in assigning a classification to the object and thus avoids the post-processing step of non-maximum suppression.

Inventors

Sheik Dawood Beer Mohideen

Assignees

MICRON TECHNOLOGY, INC.

Dates

Publication Date: 20260507
Application Date: 20251219

Claims (20)

1. A device, comprising: a plurality of processing units configured to: extract a plurality of features from data representative of an image; combine the plurality of features to identify a region of interest associated with the image; and determine a classification of an object in the region of interest associated with the image.
2. The device of claim 1, wherein the device is further configured to receive the data representative of the image.
3. The device of claim 1, wherein the plurality of features are extracted using a first cross stage partial network.
4. The device of claim 3, wherein the plurality of features are combined to identify the region of interest using a second cross stage partial network.
5. The device of claim 4, wherein the classification of the object is determined using a technique of minimum cost assignment.
6. The device of claim 1, wherein the plurality of processing units are configured via a compiler output generated by a compiler from data representative of a description of an artificial neural network.
7. The device of claim 6, wherein the artificial neural network comprises a first cross stage partial network configured to extract the plurality of features and a second cross stage partial network configured to combine the plurality of features to identify the region of interest.
8. The device of claim 7, wherein the compiler output includes instructions executable by the plurality of processing units to implement operations of the artificial neural network and matrices used by the instructions during execution of the instructions to implement the operations of the artificial neural network.
9. The device of claim 1, further comprising an integrated circuit package configured to enclose the plurality of processing units.
10. The device of claim 9, further comprising an integrated circuit die of a field-programmable gate array or application specific integrated circuit implementing a deep learning accelerator having the plurality of processing units, including at least one processing unit configured to perform matrix operations and a control unit configured to load instructions from the memory for execution.
11. The device of claim 10, wherein the at least one processing unit includes a matrix-matrix unit configured to operate on two matrix operands of an instruction.
12. The device of claim 1, wherein: the matrix-matrix unit includes a plurality of matrix-vector units configured to operate in parallel; each of the plurality of matrix-vector units includes a plurality of vector-vector units configured to operate in parallel; and each of the plurality of vector-vector units includes a plurality of multiply-accumulate units configured to operate in parallel.
13. A non-transitory computer readable storage medium storing instructions that, upon execution by a computing apparatus, cause the computing apparatus to: extract a plurality of features from data representative of an image; identify, based on the plurality of features, a region of interest associated with the image; and determine, based on the region of interest, a classification of an object associated with the image.
14. The non-transitory computer readable storage medium of claim 13, wherein the instructions further cause the computing apparatus to receive the data representative of the image.
15. The non-transitory computer readable storage medium of claim 13, wherein the plurality of features are extracted using a first cross stage partial network.
16. The non-transitory computer readable storage medium of claim 15, wherein the plurality of features are combined to identify the region of interest using a second cross stage partial network.
17. The non-transitory computer readable storage medium of claim 16, wherein the classification of the object is determined using a technique of minimum cost assignment.
18. A device, comprising: at least one processing unit configured to: identify, based on a plurality of features extracted from data representative of an image, a region of interest associated with the image; and determine, based on use of a minimum cost assignment technique, a classification of an object in the region of interest associated with the image.
19. The device of claim 18, wherein the classification is further determined based on use of a bounding box regression.
20. The device of claim 18, wherein the plurality of features are combined, using a cross stage partial network, to identify the region of interest.

Description

RELATED APPLICATIONS

The present application is a continuation application of U.S. patent application Ser. No. 17/727,649 filed Apr. 22, 2022, issued as U.S. Pat. No. 12,505,649 on Dec. 23, 2025, which claims priority to Prov. U.S. patent application Ser. No. 63/185,280 filed May 6, 2021, the entire disclosures of which application are hereby incorporated herein by reference.

TECHNICAL FIELD

At least some embodiments disclosed herein relate to image processing and object detection/recognition in general and more particularly, but not limited to, implementations of Artificial Neural Networks (ANNs) for object detection/recognition in images.

BACKGROUND

An Artificial Neural Network (ANN) uses a network of neurons to process inputs to the network and to generate outputs from the network.

Deep learning has been applied to many application fields, such as computer vision, speech/audio recognition, natural language processing, machine translation, bioinformatics, drug design, medical image processing, games, etc.

BRIEF DESCRIPTION OF THE DRAWINGS

The embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings in which like references indicate similar elements.

FIG. 1 shows an object detector according to one embodiment.

FIG. 2 shows an integrated circuit device having a Deep Learning Accelerator and random access memory to implement an object detector according to one embodiment.

FIG. 3 shows a processing unit configured to perform matrix-matrix operations according to one embodiment.

FIG. 4 shows a processing unit configured to perform matrix-vector operations according to one embodiment.

FIG. 5 shows a processing unit configured to perform vector-vector operations according to one embodiment.

FIG. 6 shows a Deep Learning Accelerator and random access memory configured to autonomously apply inputs to a trained Artificial Neural Network for object detection according to one embodiment.

FIG. 7 shows a method of object detection according to one embodiment.

FIG. 8 shows a block diagram of an example computer system in which embodiments of the present disclosure can operate.

DETAILED DESCRIPTION

At least some embodiments disclosed herein provide a high performance object detector to identify an object in an image. The object detector uses cross stage partial networks in feature extraction and in feature fusion to identify region of interest, and uses minimum cost assignment in object classification to avoid Non-Maximum Suppression. The object detector can be implemented via a Deep Learning Accelerator to achieve performance comparable to acceleration via Graphics Processing Units (GPUs).

FIG. 1 shows an object detector 103 according to one embodiment.

The object detector 103 implemented via an artificial neural network can include a backbone 105, a neck 107, and a head 109. The backbone 105 processes an input image 101 to generate features 111. The neck 107 combines or fuses features to identify a region of interest 113. The head 109 assigns a classification 115 as a label for the object depicted in the region of interest 113 in the image 101.

In FIG. 1, the backbone 105 is implemented via a cross stage partial network 106; and the neck 107 is implemented via another cross stage partial network 108. A cross stage partial network is a partial dense artificial neural network that splits the gradient flow for propagation through different network paths. The use of a cross stage partial network can reduce computation, and improve speed and accuracy.

In FIG. 1, the head 109 uses minimum cost assignment 110 in object classification and bounding box regression. Minimum cost assignment is a technique to sum classification cost and location cost between sample and ground-truth. For each object ground-truth, only one sample of minimum cost is assigned as the positive sample; others are all negative samples.
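The assignment rule described above can be illustrated with a minimal Python sketch. The specific cost terms are assumptions made for illustration: the classification cost is taken as one minus the predicted probability of the ground-truth class, the location cost as the L1 distance between box coordinates, and the names `l1_distance`, `min_cost_assign`, and `loc_weight` are hypothetical. The disclosure does not specify these formulas, and the incorporated OneNet reference may use different cost terms.

```python
# Illustrative sketch only: the cost terms are assumptions, not the
# formulation used by the disclosure or by the OneNet reference.

def l1_distance(box_a, box_b):
    """Assumed location cost: L1 distance between (x1, y1, x2, y2) boxes."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def min_cost_assign(pred_probs, pred_boxes, gt_labels, gt_boxes, loc_weight=1.0):
    """For each ground-truth object, return the index of its one positive sample.

    The cost of a (sample, ground-truth) pair is the sum of an assumed
    classification cost and a weighted location cost.
    """
    positives = []
    for label, gt_box in zip(gt_labels, gt_boxes):
        costs = [
            (1.0 - probs[label]) + loc_weight * l1_distance(box, gt_box)
            for probs, box in zip(pred_probs, pred_boxes)
        ]
        # Only the single sample with minimum total cost is positive.
        positives.append(min(range(len(costs)), key=costs.__getitem__))
    return positives


# Three samples (predictions), two classes, two ground-truth objects.
pred_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
pred_boxes = [
    [0.10, 0.10, 0.40, 0.40],
    [0.50, 0.50, 0.90, 0.90],
    [0.12, 0.10, 0.40, 0.42],
]
gt_labels = [0, 1]
gt_boxes = [[0.10, 0.10, 0.40, 0.40], [0.50, 0.50, 0.90, 0.90]]

positives = min_cost_assign(pred_probs, pred_boxes, gt_labels, gt_boxes)
# Every sample that is not the positive for some ground truth is negative.
negatives = sorted(set(range(len(pred_boxes))) - set(positives))
print(positives, negatives)
```

In this example, sample 2 overlaps the first object almost as well as sample 0 but has a higher total cost, so it is treated as a negative sample rather than as a second positive for the same object.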
The use of minimum cost assignment can eliminate the need for costly post-processing operations of non-maximum suppression.

The object detector 103 of FIG. 1 includes the combination of the use of cross stage partial networks 106 and 108 in the backbone 105 and the neck 107 and the use of minimum cost assignment 110 in the head 109. For example, the backbone 105 and the neck 107 can be implemented in a way as discussed in Chien-Yao Wang, et al., “Scaled-YOLOv4: Scaling Cross Stage Partial Network”, arXiv:2011.08036v2 (cs.CV), Feb. 22, 2021, the disclosure of which is hereby incorporated herein by reference. For example, the head 109 can be implemented in a way as discussed in Peize Sun, et al., “OneNet: Towards End-to-End One-Stage Object Detection”, arXiv:2012.05780v1 (cs.CV), Dec. 10, 2020, the disclosure of which is hereby incorporated herein by reference.

As a result, the object detector 103 can be implemented efficiently on an integrated circuit device having a Deep Learning Accelerator (DLA) and random access memory. The object detector 103 implemented with a DLA can