EP-3800578-B1 - HIERARCHICAL SAMPLING FOR OBJECT IDENTIFICATION

Inventors

ROZNER, Amit
WESTMACOTT, IAN C.
AMORES LLOPIS, JAUME
FALIK, YOHAY
LEE, YOUNG M.

Dates

Publication Date: 20260506
Application Date: 20201001

Claims (3)

A method of hierarchical sampling for object re-identification executed by a server (140, 500) comprising a memory (508, 510) and a processor (504), wherein the server (140, 500) further comprises a communication component (142) receiving and/or sending data such as surveillance videos and/or images (112) from a plurality of cameras (110) including a plurality of snapshots, an identification component (144) performing the hierarchical sampling process for object re-identification, a classification component (146) classifying an image or objects of the image, and an artificial intelligence, AI, component (148) performing filtering and/or representative snapshot selection process, the method comprising:
- receiving, by the identification component (144) of the server (140, 500), a first plurality of snapshots from a plurality of cameras (110);
- generating, by the identification component (144) of the server (140, 500), a first plurality of descriptors, which are of low complexity relating to a person or object to be identified in the surveillance video and/or images (112), such as color, lighting, and/or shape, each associated with the first plurality of snapshots;
- grouping, by the identification component (144) of the server (140, 500), the first plurality of snapshots into a plurality of clusters (120, 122, 124) based on the first plurality of descriptors;
- selecting, by the identification component (144) of the server (140, 500), a representative snapshot (120a, 122a, 124a) for each of the plurality of clusters (120, 122, 124);
- generating, by the identification component (144) of the server (140, 500), at least one second descriptor for the representative snapshot (120a, 122a, 124a) for each of the at least one cluster (120, 122, 124), wherein the at least one second descriptor is more complex than the first plurality of descriptors such as including spatial information, timing information, and/or class information; and
- identifying a target by applying the at least one second descriptor to a second plurality of snapshots.
A non-transitory computer readable medium comprising instructions stored therein that, when executed by a processor (504) of a server (140, 500) further comprising a communication component (142) receiving and/or sending data such as surveillance videos and/or images (112) from a plurality of cameras (110) including a plurality of snapshots, an identification component (144) performing the hierarchical sampling process for object re-identification, a classification component (146) classifying an image or objects of the image, and an artificial intelligence, AI, component (148) performing filtering and/or representative snapshot selection process, cause the processor (504) to:
- receive, by the identification component (144) of the server (140, 500), a first plurality of snapshots from a plurality of cameras (110);
- generate, by the identification component (144) of the server (140, 500), a first plurality of descriptors, which are of low complexity relating to a person or object to be identified in the surveillance video and/or images (112), such as color, lighting, and/or shape, each associated with the first plurality of snapshots;
- group, by the identification component (144) of the server (140, 500), the first plurality of snapshots into a plurality of clusters (120, 122, 124) based on the first plurality of descriptors;
- select, by the identification component (144) of the server (140, 500), a representative snapshot (120a, 122a, 124a) for each of the plurality of clusters (120, 122, 124);
- generate, by the identification component (144) of the server (140, 500), at least one second descriptor for the representative snapshot (120a, 122a, 124a) for each of the at least one cluster (120, 122, 124), wherein the at least one second descriptor is more complex than the first plurality of descriptors such as including spatial information, timing information, and/or class information; and
- identify a target by applying the at least one second descriptor to a second plurality of snapshots.
A server (140, 500), comprising:
- memory (508, 510) that stores instructions; and
- a processor (504) configured to execute the instructions to:
  - receive, by an identification component (144) of the server (140, 500) configured to perform a hierarchical sampling process for object re-identification, a first plurality of snapshots from a plurality of cameras (110);
  - generate, by the identification component (144) of the server (140, 500), a first plurality of descriptors, which are of low complexity relating to a person or object to be identified in the surveillance video and/or images (112), such as color, lighting, and/or shape, each associated with the first plurality of snapshots;
  - group, by the identification component (144) of the server (140, 500), the first plurality of snapshots into a plurality of clusters (120, 122, 124) based on the first plurality of descriptors;
  - select, by the identification component (144) of the server (140, 500), a representative snapshot (120a, 122a, 124a) for each of the plurality of clusters (120, 122, 124);
  - generate, by the identification component (144) of the server (140, 500), at least one second descriptor for the representative snapshot (120a, 122a, 124a) for each of the at least one cluster (120, 122, 124), wherein the at least one second descriptor is more complex than the first plurality of descriptors such as including spatial information, timing information, and/or class information; and
  - identify a target by applying the at least one second descriptor to a second plurality of snapshots.
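The claimed steps (cheap first descriptors, clustering, one representative per cluster, an expensive second descriptor only for the representatives, then matching a second set of snapshots) can be illustrated with a minimal Python sketch. Every concrete choice in it is a hypothetical assumption for illustration, not the method prescribed by the claims: the `Snapshot` type, mean color as the low-complexity first descriptor, greedy "leader" clustering with a distance `threshold`, picking the member nearest the cluster mean as representative, a spatially split color histogram standing in for the more complex second descriptor, and nearest-representative matching with a `max_dist` cutoff.

```python
import math
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical snapshot: a tiny image as a flat list of (r, g, b) pixels
    (row-major, values 0..255) plus the id of the capturing camera."""
    camera_id: int
    pixels: list


def first_descriptor(s):
    """Low-complexity descriptor (assumed): the mean color of the snapshot."""
    n = len(s.pixels)
    return tuple(sum(p[c] for p in s.pixels) / n for c in range(3))


def second_descriptor(s, bins=4):
    """More complex descriptor (assumed): a per-channel color histogram taken
    separately over the upper and lower half of the pixels, so it also
    carries coarse spatial information that the mean color lacks."""
    half = len(s.pixels) // 2
    desc = []
    for region in (s.pixels[:half], s.pixels[half:]):
        for c in range(3):
            hist = [0] * bins
            for p in region:
                hist[min(p[c] * bins // 256, bins - 1)] += 1
            desc.extend(h / (len(region) or 1) for h in hist)
    return desc


def group(snapshots, threshold):
    """Greedy 'leader' clustering on the first descriptors: a snapshot joins
    the first cluster whose founding member is within `threshold`, otherwise
    it starts a new cluster. (An assumed choice; k-means would also do.)"""
    clusters = []  # each cluster is a list of (snapshot, first_descriptor)
    for s in snapshots:
        d = first_descriptor(s)
        for members in clusters:
            if math.dist(d, members[0][1]) <= threshold:
                members.append((s, d))
                break
        else:
            clusters.append([(s, d)])
    return clusters


def select_representative(members):
    """Pick the member whose first descriptor is closest to the cluster mean."""
    centroid = [sum(d[c] for _, d in members) / len(members) for c in range(3)]
    return min(members, key=lambda m: math.dist(m[1], centroid))[0]


def hierarchical_sampling(snapshots, threshold=30.0):
    """Compute the expensive second descriptor only for one snapshot per cluster."""
    reps = [select_representative(m) for m in group(snapshots, threshold)]
    return [(r, second_descriptor(r)) for r in reps]


def identify(gallery, queries, max_dist=0.5):
    """Match each snapshot of a second set to its nearest representative by
    second descriptor; None means no representative is close enough."""
    matches = []
    for q in queries:
        qd = second_descriptor(q)
        rep, dist = min(((r, math.dist(qd, d)) for r, d in gallery),
                        key=lambda t: t[1])
        matches.append(rep if dist <= max_dist else None)
    return matches


# Usage: two reddish snapshots collapse into one cluster and the blue one
# forms a second cluster, so only two second descriptors are computed.
red = Snapshot(1, [(250, 10, 10)] * 4)
red2 = Snapshot(2, [(240, 20, 15)] * 4)
blue = Snapshot(1, [(10, 10, 250)] * 4)
gallery = hierarchical_sampling([red, red2, blue])
print(len(gallery))  # 2

queries = [Snapshot(3, [(245, 15, 12)] * 4), Snapshot(3, [(10, 250, 10)] * 4)]
print([m is not None for m in identify(gallery, queries)])  # [True, False]
```

The saving this sketch is meant to show is that, for N snapshots grouped into K clusters, the costly second descriptor is computed K times rather than N times; in practice the second descriptor would more likely come from a neural network than from a hand-built histogram.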

Description

BACKGROUND

In surveillance systems, numerous images (e.g., thousands or even millions) may be captured by multiple cameras. Each image may show people and objects (e.g., cars, infrastructure, accessories, etc.). In certain circumstances, security personnel monitoring the surveillance systems may want to locate and/or track a particular person and/or object across the multiple cameras. Amiri, in "Hierarchical Keyframe-based Video Summarization using QR-Decomposition and Modified k-Means Clustering" (EURASIP Journal on Advances in Signal Processing, 2010), disclosed a multi-stage descriptor extraction and clustering summarization technique in which the same color descriptors are computed at each stage. With such a technique, accurately tracking the particular person and/or object by searching through the images may be computationally intensive for the surveillance systems. Therefore, improvements may be desirable.

SUMMARY

The invention is defined by the independent claims.

BRIEF DESCRIPTION OF THE DRAWINGS

The features believed to be characteristic of aspects of the disclosure are set forth in the appended claims. In the description that follows, like parts are marked throughout the specification and drawings with the same numerals, respectively. The drawing figures are not necessarily drawn to scale and certain figures may be shown in exaggerated or generalized form in the interest of clarity and conciseness. The disclosure itself, however, as well as a preferred mode of use, further objects and advantages thereof, will be best understood by reference to the following detailed description of illustrative aspects of the disclosure when read in conjunction with the accompanying drawings, wherein:

FIG. 1 illustrates an example of an environment for implementing the hierarchical sampling for re-identification process in accordance with aspects of the present disclosure;
FIG. 2 illustrates an example of a method for implementing the hierarchical sampling for re-identification process in accordance with aspects of the present disclosure;
FIG. 3 illustrates an example of a method for implementing the hierarchical sampling for re-identification process including classification in accordance with aspects of the present disclosure;
FIG. 4 illustrates an example of a method for implementing the hierarchical sampling for re-identification process using neural networks in accordance with aspects of the present disclosure; and
FIG. 5 illustrates an example of a computer system in accordance with aspects of the present disclosure.

DETAILED DESCRIPTION

The following includes definitions of selected terms employed herein. The definitions include various examples and/or forms of components that fall within the scope of a term and that may be used for implementation. The examples are limited by the scope of the independent claims.

The term "processor," as used herein, can refer to a device that processes signals and performs general computing and arithmetic functions. Signals processed by the processor can include digital signals, data signals, computer instructions, processor instructions, messages, a bit, a bit stream, or other data that can be received, transmitted, and/or detected. A processor, for example, can include microprocessors, microcontrollers, digital signal processors (DSPs), field programmable gate arrays (FPGAs), programmable logic devices (PLDs), state machines, gated logic, discrete hardware circuits, and other suitable hardware configured to perform the various functionality described herein.

The term "bus," as used herein, can refer to an interconnected architecture that is operably connected to transfer data between computer components within a singular or multiple systems. The bus can be a memory bus, a memory controller, a peripheral bus, an external bus, a crossbar switch, and/or a local bus, among others.
The term "memory," as used herein, can include volatile memory and/or non-volatile memory. Non-volatile memory can include, for example, ROM (read only memory), PROM (programmable read only memory), EPROM (erasable PROM) and EEPROM (electrically erasable PROM). Volatile memory can include, for example, RAM (random access memory), static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), and direct Rambus RAM (DRRAM).

The input to the hierarchical sampling re-identification system is a set of object tracks, where each track is a sequence of snapshots captured across consecutive frames of the video stream. Given this input, the re-identification system may extract metadata in the form of descriptors (also called visual features), which are arrays of numbers representing the visual appearance of the object in each track. A typical approach is to extract a descriptor for each snapshot in the track and store either all the descriptors or an aggregated descriptor (e.g., obtained by average or max pooling) in the database. The resulting collection