EP-3945434-B1 - IMAGE/VIDEO ANALYSIS WITH ACTIVITY SIGNATURES
Inventors
- MIGDAL, JOSHUA
- SRINIVASAN, VIKRAM
Dates
- Publication Date
- 2026-05-06
- Application Date
- 2021-03-23
Claims (7)
- A method (200) implemented as executable instructions programmed and residing within memory and/or a non-transitory computer-readable storage medium and executed by one or more processors of a server, comprising: obtaining (210) video frames from a video; generating (220) a single image from the video frames, wherein generating (220) further includes tracking (221) a region within each video frame associated with a modeled activity, wherein tracking (221) further includes obtaining (222) pixel values for each video frame associated with points along an expected path of movement for the modeled activity within the region, wherein obtaining (222) the expected path further comprises determining (223) an aggregated pixel value for each point along the expected path of movement across the video frames from the corresponding pixel values captured across the video frames for the corresponding point, wherein determining (223) further includes calculating (227) the aggregated pixel value as an optical flow summary value representing a magnitude of movement along the expected path of movement for the corresponding pixel values across the video frames; and providing (230) the single image as a visual flowprint for an activity that was captured in the video frames, wherein the visual flowprint represents a visual summary of the activity captured within the video in the single image; whereby the expected path is a path of expected motion associated with the modeled activity in obtained video frames.
- The method (200) of claim 1, wherein generating (220) further includes removing (228) known background pixel values from the video frames before generating the single image.
- The method (200) of claim 1, wherein generating (220) further includes removing (229-1) pixel values from the video frames that are not associated with a region of interest for the modeled activity before generating the single image.
- The method (200) of claim 1, wherein generating (220) further includes extracting (229-2) features from the video frames and generating the single image as a summary of each feature across the video frames.
- The method (200) of claim 1, wherein providing (230) further includes determining (231) that a size of the single image is incompatible with a machine-learning algorithm that expects a second size for an input image provided as input to the machine-learning algorithm, normalizing the single image into the second size and providing the single image in the second size as the input image to the machine-learning algorithm.
- A system (100), comprising: a camera (130); a server (120) for executing the method of claim 1, the server (120) comprising a processor (121) and a non-transitory computer-readable storage medium (122) comprising executable instructions; and the executable instructions when executed by the processor (121) from the non-transitory computer-readable storage medium (122) causing the processor (121) to perform processing comprising: maintaining a single aggregated data structure from video frames of a video feed based on changes occurring along a path within each video frame associated with movement for a modeled activity along the path; and providing the single aggregated data structure as a summary of an activity appearing in the video frames of the video to an application that monitors, tracks, or performs an action based on the activity.
- The system (100) of claim 6, wherein the application is a machine-learning algorithm that takes the single aggregated data structure as input in place of the video frames of the video.
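The aggregation recited in claim 1 can be sketched in a few lines of code. The following is an illustrative sketch only, not the claimed implementation: a simple frame-to-frame absolute difference stands in for a true optical-flow computation, and names such as `build_flowprint` and `path_points` are hypothetical.

```python
# Sketch of claim 1 (assumptions noted in the lead-in): pixel values are
# sampled at points along an expected path of movement in each video frame,
# and an aggregated motion-magnitude value per point per frame transition
# is written into a single 2-D image -- the "visual flowprint".
import numpy as np

def build_flowprint(frames, path_points):
    """frames: list of 2-D grayscale arrays of equal shape.
    path_points: list of (row, col) points along the expected path of
    movement for the modeled activity.
    Returns one 2-D image: one row per path point, one column per frame
    transition; each cell is a motion-magnitude summary (here, the
    absolute frame-to-frame intensity change at that point, used as a
    stand-in for an optical-flow magnitude)."""
    rows, cols = len(path_points), len(frames) - 1
    img = np.zeros((rows, cols), dtype=np.float32)
    for t in range(cols):
        # per-pixel motion proxy: absolute difference between consecutive frames
        diff = np.abs(frames[t + 1].astype(np.float32)
                      - frames[t].astype(np.float32))
        for i, (r, c) in enumerate(path_points):
            img[i, t] = diff[r, c]
    return img

# Toy example: a bright blob moving left-to-right along a horizontal path.
frames = [np.zeros((8, 8)) for _ in range(5)]
for t in range(5):
    frames[t][4, t + 1] = 255.0
path_points = [(4, c) for c in range(8)]
fp = build_flowprint(frames, path_points)
print(fp.shape)  # (8, 4): 8 path points x 4 frame transitions
```

The resulting single image summarizes when and where motion occurred along the path, so a downstream classifier can consume it in place of the raw video frames, as in claims 6 and 7.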
Description
Deep learning has been widely used and successful at image classification tasks. However, because of its computational expense, deep learning has largely been unable to be applied to video classification tasks. Unlike a single image, the content of video changes over time. In other words, there is a temporal aspect to analyzing video that is not present in still images. So, while an image understanding task may involve understanding what is in the image, a video understanding task requires determining an object that is in the video at a given time and determining how the object changes over time from frame to frame within the video. This temporal analysis, which is required for understanding video activity, presents significant challenges to current machine-learning approaches. The search space required for video analysis is exponentially larger than what is needed for still image analysis, such that if a machine-learning algorithm is not meticulously constructed and configured appropriately, intractable problems can be encountered with learning and evaluation for videos processed by the algorithm. Furthermore, even with an optimal machine-learning algorithm for video analysis, the sheer data size of the corresponding video frames and the number of features being tracked from the frames prevent the algorithm from delivering timely results. Typically, video comprises more than 30 frames (30 images) every second, so a single minute of video contains almost 2,000 individual images. Moreover, hardware resources, such as processor load and memory utilization, become extremely taxed while the algorithm processes the video frames. As a result, real-time machine-learning applications for video are largely not feasible and not practical in the industry. EP3677174A1 discloses the generation of a "movement image" which represents areas of the original image in which subject movement is likely to be present.
In various embodiments, methods and a system for image/video analysis with activity signatures are presented. The invention is defined by the attached claims. Embodiments of the present invention will now be described hereinafter, by way of example only, with reference to the accompanying drawings, in which:
- FIG. 1A is a diagram of a system for image/video analysis with activity signatures, according to an example embodiment.
- FIG. 1B is a diagram of an example visual flowprint or activity signature, according to an example embodiment.
- FIG. 1C is an example overhead image of a video frame captured for a transaction area of a transaction terminal, according to an example embodiment.
- FIG. 1D is a diagram of an expected flowprint's path through multiple video frames superimposed on top of the video frame of FIG. 1C, according to an example embodiment.
- FIG. 1E is a diagram of images comprising superimposed optical flow data representing movement of optical data from one frame to a next frame, according to an example embodiment.
- FIG. 1F is a diagram of two flowprints associated with activities that took different amounts of time, according to an example embodiment.
- FIG. 1G is a diagram of two flowprints associated with activities that took different amounts of time, normalized to a same size, according to an example embodiment.
- FIG. 1H is a diagram of a flowprint augmented with sensor data captured by two sensors, according to an example embodiment.
- FIG. 2 is a diagram of a method for image/video analysis with activity signatures, according to an example embodiment.
- FIG. 3 is a diagram of another method for image/video analysis with activity signatures, according to an example embodiment.

FIG. 1A is a diagram of a system 100 for image/video analysis with activity signatures, according to an example embodiment.
It is to be noted that the components are shown schematically in greatly simplified form, with only those components relevant to understanding of the embodiments being illustrated. Furthermore, the various components identified in FIG. 1A are illustrated, and the arrangement of the components is presented, for purposes of illustration only. It is to be noted that other arrangements with more or fewer components are possible without departing from the teachings of image/video analysis with activity signatures presented herein and below. System 100 includes a transaction terminal 110, a server 120, and one or more cameras 130. Terminal 110 comprises a display 111, peripheral devices 112, a processor 113, and a non-transitory computer-readable storage medium 114. Medium 114 comprises executable instructions for a transaction manager 115. Server 120 comprises a processor 121 and a non-transitory computer-readable storage medium 122. Medium 122 comprises executable instructions for a transaction manager 123, a visual signature/flowprint manager 124, and one or more video-based consuming applications 126. Medium 122 also comprises video signatures/flowprints 125 produced by visual signature/flowprint manager 124.
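Claim 5 and FIGS. 1F and 1G address the case where two activities take different amounts of time, producing flowprints of different widths that must be normalized to the fixed input size a machine-learning algorithm expects. The following is a minimal sketch of such a normalization, assuming a 2-D NumPy flowprint and using a simple nearest-neighbor resampling; the function name `normalize_flowprint` and the chosen target width are hypothetical.

```python
# Sketch of claim 5 (assumption, not the claimed implementation): resample
# a variable-width flowprint to a fixed target shape via nearest-neighbor
# index selection, so flowprints of different durations share one size.
import numpy as np

def normalize_flowprint(fp, target_shape):
    """Resample a 2-D flowprint `fp` to `target_shape` (rows, cols)
    using nearest-neighbor index selection."""
    rows, cols = fp.shape
    tr, tc = target_shape
    r_idx = np.arange(tr) * rows // tr  # source row for each target row
    c_idx = np.arange(tc) * cols // tc  # source column for each target column
    return fp[np.ix_(r_idx, c_idx)]

# Two activities of different durations yield different flowprint widths.
short = np.random.rand(8, 40)    # shorter activity: 40 frame transitions
longer = np.random.rand(8, 130)  # longer activity: 130 frame transitions
a = normalize_flowprint(short, (8, 64))
b = normalize_flowprint(longer, (8, 64))
print(a.shape, b.shape)  # both (8, 64): a fixed-size model input
```

In practice an interpolating resize (e.g. bilinear) could be substituted; the point is only that both flowprints arrive at the second size the machine-learning algorithm expects, per claim 5.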