US-12620119-B2 - Video analytics accuracy using transfer learning

US12620119B2US 12620119 B2US12620119 B2US 12620119B2US-12620119-B2

Abstract

Systems and methods are provided for increasing accuracy of video analytics tasks in real-time by acquiring a video, and identifying fluctuations in the accuracy of video analytics applications across consecutive frames of the video. The systems and methods can support decision-making based on results of video analytics tasks. The fluctuations are quantified based on an average relative difference of true-positive detection counts across consecutive frames. Fluctuations in accuracy are reduced by applying transfer learning to a deep learning model trained using images, and retraining the deep learning model using video frames. A quality of object detections is determined based on an amount of track-ids assigned by a tracker across different video frames. Optimization of the reduction of fluctuations includes iteratively repeating the identifying, the quantifying, the reducing, and the determining the quality of object detections until a threshold is reached. Frame predictions are generated using the retrained deep learning model.

Inventors

  • Kunal Rao
  • Giuseppe Coviello
  • Murugan Sankaradas
  • Oliver Po
  • Srimat Chakradhar
  • Sibendu Paul

Assignees

  • NEC LABORATORIES AMERICA, INC.

Dates

Publication Date
2026-05-05
Application Date
2023-07-28

Claims (17)

  1. A method for increasing accuracy of video analytics tasks in real-time, comprising: acquiring a video using one or more video cameras, and identifying fluctuations in the accuracy of video analytics applications across consecutive frames of the video; quantifying the identified fluctuations by determining an average relative difference of true-positive detection counts across the consecutive frames; reducing the fluctuations in accuracy by applying transfer learning to a deep learning model initially trained using images, and retraining the deep learning model using video frames captured for a plurality of different scenarios; determining a quality of object detections based on an amount of track-ids assigned by a tracker across different video frames; optimizing the reducing the fluctuations by iteratively repeating the identifying fluctuations, the quantifying the identified fluctuations, the reducing the fluctuations in accuracy, and the determining a quality of object detections until a threshold is reached, wherein the fluctuations in accuracy are quantified by determining the average relative difference of the true-positive object detection counts across the consecutive frames by: |tp(i) - tp(i+1)| / mean(gt(i), gt(i+1)), where i represents a video frame, tp(i) represents a true-positive object detection count on frame i, and gt(i) represents a ground-truth object count on frame i, on a moving window of 2 frames; and generating model predictions for each frame in the video using the retrained deep learning model for the video analytics tasks.
  2. The method as recited in claim 1, further comprising performing object and person detection across the different video frames of the video in real-time with increased detection speed and accuracy by using the deep learning model retrained using the video frames.
  3. The method as recited in claim 1, further comprising performing object and person counts for a particular area of interest based on the captured video using the retrained deep learning model, and generating a list of the predicted object and person counts in the area of interest.
  4. The method as recited in claim 1, wherein the retraining of the deep learning model using transfer learning includes extracting a plurality of video frames from the video and pre-processing the extracted frames to match input requirements of the deep learning model.
  5. The method as recited in claim 1, wherein the fluctuations in accuracy are reduced by adjusting a confidence threshold in the deep learning model based on a difficulty level associated with detection in particular frames of the video.
  6. The method as recited in claim 1, wherein the fluctuations in accuracy result from an adversarial effect caused by automatic, dynamic camera parameter changes in a video camera.
  7. The method as recited in claim 1, wherein the fluctuations in accuracy are quantified by determining the average relative difference of the true-positive object detection counts across the consecutive frames by: [max(tp(i), …, tp(i+9)) - min(tp(i), …, tp(i+9))] / mean(gt(i), …, gt(i+9)), where i represents a video frame, tp(i) represents a true-positive object detection count on frame i, and gt(i) represents a ground-truth object count on frame i, on a moving window of 10 frames.
  8. A system for increasing accuracy of video analytics tasks in real-time, comprising: a processor operatively coupled to a non-transitory computer-readable storage medium, the processor being configured for: acquiring a video using one or more video cameras, and identifying fluctuations in the accuracy of video analytics applications across consecutive frames of the video; quantifying the identified fluctuations by determining an average relative difference of true-positive detection counts across the consecutive frames; reducing the fluctuations in accuracy by applying transfer learning to a deep learning model initially trained using images, and retraining the deep learning model using video frames captured for a plurality of different scenarios; determining a quality of object detections based on an amount of track-ids assigned by a tracker across different video frames; optimizing the reducing the fluctuations by iteratively repeating the identifying fluctuations, the quantifying the identified fluctuations, the reducing the fluctuations in accuracy, and the determining a quality of object detections until a threshold is reached, wherein the fluctuations in accuracy are quantified by determining the average relative difference of the true-positive object detection counts across the consecutive frames by: |tp(i) - tp(i+1)| / mean(gt(i), gt(i+1)), where i represents a video frame, tp(i) represents a true-positive object detection count on frame i, and gt(i) represents a ground-truth object count on frame i, on a moving window of 2 frames; and generating model predictions for each frame in the video using the retrained deep learning model for the video analytics tasks.
  9. The system as recited in claim 8, wherein the processor is further configured for performing object and person detection across the different video frames of the video in real-time with increased detection speed and accuracy by using the deep learning model retrained using the video frames.
  10. The system as recited in claim 8, wherein the processor is further configured for performing object and person counts for a particular area of interest based on the captured video using the retrained deep learning model, and generating a list of the predicted object and person counts in the area of interest.
  11. The system as recited in claim 8, wherein the retraining of the deep learning model using transfer learning includes extracting a plurality of video frames from the video and pre-processing the extracted frames to match input requirements of the deep learning model.
  12. The system as recited in claim 8, wherein the fluctuations in accuracy are reduced by adjusting a confidence threshold in the deep learning model based on a difficulty level associated with detection in particular frames of the video.
  13. The system as recited in claim 8, wherein the fluctuations in accuracy result from an adversarial effect caused by automatic, dynamic camera parameter changes in a video camera.
  14. The system as recited in claim 8, wherein the fluctuations in accuracy are quantified by determining the average relative difference of the true-positive object detection counts across the consecutive frames by: [max(tp(i), …, tp(i+9)) - min(tp(i), …, tp(i+9))] / mean(gt(i), …, gt(i+9)), where i represents a video frame, tp(i) represents a true-positive object detection count on frame i, and gt(i) represents a ground-truth object count on frame i, on a moving window of 10 frames.
  15. A computer program product for increasing accuracy of video analytics tasks in real-time, the computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform a method comprising: acquiring a video using one or more video cameras, and identifying fluctuations in the accuracy of video analytics applications across consecutive frames of the video; quantifying the identified fluctuations by determining an average relative difference of true-positive detection counts across the consecutive frames; reducing the fluctuations in accuracy by applying transfer learning to a deep learning model initially trained using images, and retraining the deep learning model using video frames captured for a plurality of different scenarios; determining a quality of object detections based on an amount of track-ids assigned by a tracker across different video frames; optimizing the reducing the fluctuations by iteratively repeating the identifying fluctuations, the quantifying the identified fluctuations, the reducing the fluctuations in accuracy, and the determining a quality of object detections until a threshold is reached, wherein the fluctuations in accuracy are quantified by determining the average relative difference of the true-positive object detection counts across the consecutive frames by: |tp(i) - tp(i+1)| / mean(gt(i), gt(i+1)), where i represents a video frame, tp(i) represents a true-positive object detection count on frame i, and gt(i) represents a ground-truth object count on frame i, on a moving window of 2 frames; and generating model predictions for each frame in the video using the retrained deep learning model for the video analytics tasks.
  16. The computer program product as recited in claim 15, further comprising performing object and person counts for a particular area of interest based on the captured video using the retrained deep learning model, and generating a list of the predicted object and person counts in the area of interest.
  17. The computer program product as recited in claim 15, wherein the retraining of the deep learning model using transfer learning includes extracting a plurality of video frames from the video and pre-processing the extracted frames to match input requirements of the deep learning model.
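
As an illustrative sketch (not code from the patent), the two fluctuation metrics recited in claims 1 and 7 can be computed directly from per-frame true-positive counts tp(i) and ground-truth counts gt(i). The function names, and the choice to average the per-window values into a single score, are assumptions made here for illustration:

```python
from statistics import mean

def pairwise_fluctuation(tp, gt):
    """Claim 1 metric: |tp(i) - tp(i+1)| / mean(gt(i), gt(i+1)),
    on a moving window of 2 frames, averaged over all frame pairs."""
    diffs = [
        abs(tp[i] - tp[i + 1]) / mean([gt[i], gt[i + 1]])
        for i in range(len(tp) - 1)
    ]
    return mean(diffs)

def windowed_fluctuation(tp, gt, w=10):
    """Claim 7 metric: (max - min of tp over a w-frame window)
    divided by the mean ground-truth count over the same window,
    averaged across all window positions (w=10 in the claim)."""
    vals = [
        (max(tp[i:i + w]) - min(tp[i:i + w])) / mean(gt[i:i + w])
        for i in range(len(tp) - w + 1)
    ]
    return mean(vals)
```

For a static scene where gt stays at 10 but the detector alternately reports 8 and 10 true positives, both metrics yield a nonzero score, capturing the frame-to-frame instability the claims target; a perfectly stable detector scores 0.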

Description

RELATED APPLICATION INFORMATION

This application claims priority to U.S. Provisional App. No. 63/393,900, filed on Jul. 30, 2022, incorporated herein by reference in its entirety.

BACKGROUND

Technical Field

The present invention relates to improved video analytics accuracy using transfer learning, and more particularly to improving video analytics accuracy for better detection and tracking of objects and faces in a video by retraining a neural network model using transfer learning to reduce fluctuations in the accuracy of detection and tracking of objects and faces across consecutive video frames.

Description of the Related Art

Significant advancements in machine learning and computer vision, coupled with the rapid expansion of the Internet of Things (IoT), edge computing, and high-bandwidth access networks such as 5G, have resulted in widespread adoption of video analytics systems. These systems utilize cameras deployed worldwide to support a wide range of applications in various market segments, including entertainment, health care, retail, automotive, transportation, home automation, safety, and security. Traditional video analytics systems predominantly depend on state-of-the-art (SOTA) deep learning models to analyze and interpret video stream content. It is common practice to treat a video as a sequence of individual images or frames and apply deep neural network (DNN) models, initially trained on images, to similar analytics tasks on videos. The availability of extensive image datasets, such as Common Objects in Context (COCO), has facilitated the training of highly accurate SOTA deep learning models capable of detecting a wide range of objects in images, and conventional systems have worked under the assumption that these models will perform equally well on videos as on images.
However, such conventional video analytics systems exhibit significant fluctuations (e.g., 40% or greater) in the accuracy and consistency of video analytics tasks (e.g., object detection, face detection, etc.) across consecutive video frames, rather than remaining essentially constant, even in videos that predominantly exhibit static scenes with minimal activity (e.g., cars or people with negligible or no movement). While the ground truth (e.g., the total number of objects, persons, animals, etc. actually in a video frame) remains essentially constant across consecutive video frames of such predominantly static scenes, conventional video analytics systems exhibit significant, measurable fluctuations in detection counts across consecutive video frames during analytics, and thus provide inaccurate results. These fluctuations in video analytics accuracy are similarly present when performing video analytics tasks on videos with dynamic scenes (e.g., including movement of objects or persons) using conventional systems and methods, and occur with any camera model and for any type or quality of video analyzed. The adverse impact of these fluctuations across frames causes a reduction in detection count accuracy, and thus overall system performance, and extends to multiple video analytics applications, including, for example, those that rely on object or face detection insights for higher-level tasks such as object tracking or person recognition.

SUMMARY

According to an aspect of the present invention, a method is provided for increasing accuracy of video analytics tasks in real-time, including acquiring a video using one or more video cameras, and identifying fluctuations in the accuracy of video analytics applications across consecutive frames of the video.
The identified fluctuations are quantified by determining an average relative difference of true-positive detection counts across the consecutive frames, and fluctuations in accuracy are reduced by applying transfer learning to a deep learning model initially trained using images, and retraining the deep learning model using video frames captured for a plurality of different scenarios. A quality of object detections is determined based on an amount of track-ids assigned by a tracker across different video frames, and the reducing of the fluctuations is optimized by iteratively repeating the identifying fluctuations, the quantifying the identified fluctuations, the reducing the fluctuations in accuracy, and the determining a quality of object detections until a threshold is reached. Model predictions are generated for each frame in the video using the retrained deep learning model for the video analytics tasks.

According to another aspect of the present invention, a system is provided for increasing accuracy of video analytics tasks in real-time, and includes a processor, operatively coupled to a non-transitory computer-readable storage medium, and configured for acquiring a video using one or more video cameras, and identifying fluctuations in the accuracy of video analytics applications across consecutive frames of the video. The identified fluctuations are quantified by determining an average relative difference of true-positive detection counts across the consecutive frames.
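The track-id-based quality determination described above can be sketched as follows. The intuition is that an unstable detector causes the tracker to drop and re-spawn track IDs, so a frame sequence that accumulates many more distinct IDs than it has concurrent objects indicates poor detection quality. The specific scoring ratio below is an illustrative assumption, not the exact formula of the disclosure:

```python
def detection_quality(track_ids_per_frame):
    """Proxy for detection stability from tracker output.

    track_ids_per_frame: list of per-frame lists of track IDs.
    Returns the peak per-frame object count divided by the total
    number of distinct IDs ever assigned: 1.0 means no ID churn
    (stable detections); values near 0 mean heavy fragmentation,
    where the tracker keeps assigning fresh IDs to the same objects.
    """
    distinct = set()
    for ids in track_ids_per_frame:
        distinct.update(ids)
    max_per_frame = max(len(ids) for ids in track_ids_per_frame)
    return max_per_frame / len(distinct)
```

In the iterative optimization recited in the claims, a score like this could serve as the stopping criterion: repeat the identify/quantify/reduce/assess cycle until the quality score crosses a chosen threshold.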