
EP-4091089-B1 - SYSTEMS AND METHODS FOR IMPROVED COMPUTER VISION IN ON-DEVICE APPLICATIONS

EP 4091089 B1

Inventors

  • WANG, Qifei
  • KUZNETSOV, Alexander
  • GO, Alec Michael
  • CHU, Grace
  • KIM, Eunyoung
  • YANG, Feng
  • HOWARD, Andrew Gerald
  • GILBERT, Jeffrey M.

Dates

Publication Date
2026-05-13
Application Date
2020-02-24

Claims (15)

  1. A computer-implemented method for improving object detection efficiency, the method comprising: obtaining (302), by a computing system (100) comprising one or more computing devices, an image; providing (306), by the computing system, the image to a machine-learned model configured to generate an output comprising a prediction as to whether at least one object included in a class of objects is present in one or more of at least two pre-defined regions of the image, wherein together the pre-defined regions encompass the whole of the image; generating (308), by the computing system and using the output, a dataset representative of only the region or regions of the at least two pre-defined regions of the image where at least one object included in the class of objects is predicted to be present; providing, by the computing system, the dataset representative of only the region or regions of the at least two pre-defined regions of the image where at least one object included in the class of objects is present to a second machine-learned model, wherein the second machine-learned model is configured to determine a label for the object in the respective pre-defined region of the image.
  2. The computer-implemented method of claim 1, wherein the dataset comprises a masked version of the image, or wherein the dataset comprises a cropped version of the image.
  3. The computer-implemented method of any one of claims 1 to 2, wherein the label comprises a bounding box containing all of the regions where at least one object included in the class of objects is present.
  4. The computer-implemented method of any one of claims 1 to 3, wherein one of the at least two pre-defined regions is not provided to the second machine-learned model.
  5. The computer-implemented method of any one of claims 1 to 4, wherein the class of objects consists of one or more objects from the group: alphabetic characters, numbers, punctuation, words, machine-readable code, and faces.
  6. The computer-implemented method of any one of claims 1 to 5, further comprising: partitioning (304), by the computing system, the image into the at least two pre-defined regions.
  7. The computer-implemented method of claim 6, wherein partitioning the image into two or more regions that together encompass the whole of the image comprises: applying at least one horizontal partition (204) to divide the image into an upper region and a lower region; and applying at least one vertical partition (202) to divide the image into a left region and a right region.
  8. The computer-implemented method of claim 7, wherein the at least one horizontal partition and the at least one vertical partition are static.
  9. The computer-implemented method of claim 7, wherein the at least one horizontal partition and the at least one vertical partition are adjustable.
  10. The computer-implemented method of claim 7, wherein the at least one horizontal partition and the at least one vertical partition comprise a learned parameter, and optionally wherein the learned parameter is determined by a third machine-learned model configured to: generate a heat map of objects included in the class of objects for an example image; and partition the image into one or more regions based on a constraint, wherein the constraint comprises: maximizing heat per box and minimizing the number of the one or more regions.
  11. The computer-implemented method of any one of claims 1 to 10, wherein the machine-learned model includes at least two heads: a first head configured to generate the output; and a second head configured to generate a characteristic of the output, and optionally wherein the characteristic comprises an orientation.
  12. The computer-implemented method of claim 1, wherein the second machine-learned model is configured to perform optical character recognition, OCR, face detection, or facial recognition.
  13. A computing system (100) configured to perform object detection, the computing system comprising: one or more processors (112, 132, 152); one or more non-transitory computer-readable media (114, 134, 154) that collectively store instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising: obtaining (302) an image depicting at least one object in a class of objects; partitioning (304) the image into two or more regions that together encompass the whole of the image; providing (306) each of the one or more regions to a machine-learned model configured to generate an output comprising a prediction as to whether one of the objects included in the class of objects is present in the regions of the image; generating (308), using the output, a dataset representative of only the region or regions of the image where at least one object included in the class of objects is predicted to be present; providing the dataset representative of only the region or regions of the at least two pre-defined regions of the image where at least one object included in the class of objects is present to a second machine-learned model, wherein the second machine-learned model is configured to determine a label for the object in the respective region of the image.
  14. The computing system of claim 13, wherein the one or more non-transitory computer-readable media are stored on a local device, and optionally wherein the local device is a smartphone.
  15. One or more tangible, non-transitory computer-readable media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform operations, the operations comprising any of the methods of claims 1 to 12.
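The two-stage flow recited in claims 1, 6, and 7 — static partitioning into quadrants, a lightweight per-region classifier as a coarse filter, then a heavier labelling model run only on regions predicted to contain an object — can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `region_classifier` and `labeler` are hypothetical stand-ins for the first and second machine-learned models.

```python
import numpy as np

def partition_image(image: np.ndarray) -> dict:
    """Split an image into four pre-defined regions via one static
    horizontal and one static vertical partition (cf. claims 7-8)."""
    h, w = image.shape[:2]
    return {
        "upper_left":  image[: h // 2, : w // 2],
        "upper_right": image[: h // 2, w // 2 :],
        "lower_left":  image[h // 2 :, : w // 2],
        "lower_right": image[h // 2 :, w // 2 :],
    }

def detect_then_label(image, region_classifier, labeler):
    """Two-stage pipeline: a lightweight per-region classifier filters
    out regions predicted empty, so the more expensive labelling model
    only ever sees regions predicted to contain an object."""
    labels = {}
    for name, region in partition_image(image).items():
        if region_classifier(region):       # coarse first filter
            labels[name] = labeler(region)  # specialized second model
    return labels
```

In this sketch, regions failing the coarse filter are simply never passed downstream, which is the efficiency gain the claims describe: the second model's cost scales with the number of occupied regions rather than the whole image.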

Description

FIELD

The present disclosure relates generally to object detection. More particularly, examples described herein relate to computer-implemented systems and methods that can provide more efficient models for performing object detection by applying an initial coarse filter prior to more computationally expensive processes.

BACKGROUND

State-of-the-art object detectors, including RCNN, SSD, FPN, etc., can be used to generate a bounding box around objects. These detectors can achieve high precision and recall, though normally at high model complexity. Engines performing OCR must run through the entire image, or a pyramid of images at different scales, to perform detection. In general, multi-scale object detectors can improve object detection precision and recall at the cost of additional complexity. The combination of these complexities can make it difficult to deploy applications involving object detection and/or image recognition on a local device (e.g., a mobile device such as a smartphone). Needed in the art are methods and systems that can extend computer vision models to on-device applications.

US 2012/0243731 A1 discloses an image processing method and an image processing apparatus for detecting an object. The image processing method includes the following steps: partitioning an image into at least a first sub-image covering a first zone and a second sub-image covering a second zone according to a designed trait; and performing an image detection process upon the first sub-image for checking whether the object is within the first zone to generate a first detecting result. The object is a human face, and the image detection process is a face detection process.
US 10,372,981 B1 discloses a method for determining whether a document is a text page, including: partitioning the document into a plurality of cells; scaling each of the cells to a standardized number of pixels to provide a corresponding snippet for each of the cells; using a classifier to examine the snippets to determine which of the cells are classified as text and which are not; determining a volume of text for the document based on a total amount of text in the document corresponding to a sum of the amount of text in each of the cells classified as text; and determining that the document is a text page in response to the total amount exceeding a pre-determined threshold. In response to the total amount being less than the pre-determined threshold, cells not classified as text may be examined further. The classifier may be provided by training a neural net.

SUMMARY

This specification describes a computer-implemented method for improving object detection efficiency according to claim 1, and a computing system configured to perform object detection according to claim 13. Example implementations can provide a first filter that processes images via lightweight classification of one or more regions of the image. If a region is identified to include an object of interest (e.g., a word or character in a foreign language), the region can be sent to a second model that can be more specialized and memory-intensive. Any region that is not identified to include an object can be removed, such as by masking the region or cropping the image, so that its underlying data is not sent to the second model. In this manner, example implementations can improve the overall efficiency of computer vision tasks such as object or character recognition by segmenting images and running an initial model to filter the image data sent to a second downstream model.
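The masking alternative described in the summary (and recited in claim 2) can be illustrated with a short NumPy sketch using the same four-quadrant partition. This is an assumed toy implementation: in practice the set of regions to keep would come from the first model's per-region predictions, and the bounds from however the image was partitioned.

```python
import numpy as np

def mask_empty_regions(image: np.ndarray, keep: list) -> np.ndarray:
    """Return a masked copy of the image in which every quadrant NOT
    listed in `keep` is zeroed out, so only data from regions predicted
    to contain an object reaches the second model (cf. claim 2)."""
    h, w = image.shape[:2]
    bounds = {
        "upper_left":  (slice(None, h // 2), slice(None, w // 2)),
        "upper_right": (slice(None, h // 2), slice(w // 2, None)),
        "lower_left":  (slice(h // 2, None), slice(None, w // 2)),
        "lower_right": (slice(h // 2, None), slice(w // 2, None)),
    }
    masked = np.zeros_like(image)
    for name in keep:
        rows, cols = bounds[name]
        masked[rows, cols] = image[rows, cols]  # copy only kept regions
    return masked
```

Masking preserves the original image geometry (useful if the second model expects full-frame coordinates), whereas the cropping variant mentioned in claim 2 would instead shrink the input to the kept region's bounding box.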
BRIEF DESCRIPTION OF THE DRAWINGS

Detailed discussion of embodiments directed to one of ordinary skill in the art is set forth in the specification, which makes reference to the appended figures, in which:

Figure 1A illustrates an example computing system including one or more machine-learned models in accordance with example implementations of the present disclosure.
Figure 1B illustrates an example computing device including one or more machine-learned models in accordance with example implementations of the present disclosure.
Figure 1C illustrates another example computing device including one or more machine-learned models in accordance with example implementations of the present disclosure.
Figures 2A-2E depict images illustrating an example process for filtering image data in accordance with example implementations of the present disclosure.
Figure 2F depicts an example for partitioning an image into two or more regions in accordance with example implementations of the present disclosure.
Figure 3 illustrates a flow chart diagram providing an example method for improving object detection and/or optical character recognition (OCR) in accordance with example implementations of the present disclosure.
Figure 4 illustrates example architectures for machine-learned models included as part of methods and systems for object detection and/or OCR in accordance with example implementations of the present disclosure.