US-12627819-B2 - Reduced video stream resource usage
Abstract
The description relates to resource aware object detection for encoded video streams that can identify frames of the video stream that include an object of interest, such as a human, without decoding the frames.
Inventors
- Yichen HAO
- Lihang LI
- Anthony C. Romano
- Naiteek SANGANI
- Ryan S. Menezes
Assignees
- MICROSOFT TECHNOLOGY LICENSING, LLC
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2022-10-18
Claims (20)
- 1 . A system, comprising: a communication component configured to receive groups of pictures (GOPs) of a scene, the GOPs including encoded key frames and encoded non-key frames that relate to the encoded key frames; and, a processor configured to: determine a difference between size metadata of encoded non-key frames of a current GOP and size metadata of baseline non-key frames that do not include an object of interest, discard the current GOP where the determined difference is below a size difference threshold derived from the size metadata of the baseline non-key frames, decode the current GOP and perform object detection on the current GOP where the determined difference meets the size difference threshold; and dynamically adjust the size difference threshold based on feedback from the object detection.
- 2 . The system of claim 1 , wherein the encoded key frames comprise I frames and the encoded non-key frames comprise B frames and/or P frames.
- 3 . The system of claim 1 , further comprising a camera configured to capture images of the scene and generate the GOPs.
- 4 . The system of claim 3 , wherein the size difference threshold is specified as a ratio of sizes of the baseline non-key frames and the encoded non-key frames of the current GOP.
- 5 . The system of claim 1 , wherein the processor is configured to decode the current GOP where the determined difference meets the size difference threshold for two consecutive non-key frames.
- 6 . The system of claim 1 , wherein the processor is configured to: discard first subsequently-received GOPs that are below the dynamically-adjusted size difference threshold; and decode and perform object detection on second subsequently-received GOPs that meet the dynamically-adjusted size difference threshold.
- 7 . A device-implemented method, comprising: receiving groups of pictures (GOPs) of a scene, the GOPs including encoded key frames and encoded non-key frames; obtaining size metadata for baseline non-key frames of GOPs that do not include an object of interest; comparing size metadata of encoded non-key frames of a current GOP to the size metadata of the baseline non-key frames without decoding the current GOP; discarding the current GOP in instances when a difference between the size metadata of the encoded non-key frames of the current GOP and the size metadata of the baseline non-key frames is below a size difference threshold; decoding the current GOP and performing object detection on the current GOP in other instances when the difference between the size metadata of the encoded non-key frames of the current GOP and the size metadata of the baseline non-key frames meets the size difference threshold; and, refining the size difference threshold based at least on whether objects are detected in individual decoded GOPs.
- 8 . The method of claim 7 , wherein the receiving GOPs comprises receiving GOPs complying with real-time streaming protocol (RTSP) format, HTTP live streaming (HLS) format, web real-time communications (WebRTC) format, or secure reliable transport (SRT) format.
- 9 . The method of claim 7 , wherein the receiving comprises generating the GOPs.
- 10 . The method of claim 7 , wherein the obtaining size metadata for the baseline non-key frames comprises decoding GOPs and attempting to detect objects of interest in the decoded GOPs and labeling individual GOPs where no objects of interest are detected as baseline GOPs, the baseline non-key frames being included in the baseline GOPs.
- 11 . The method of claim 7 , wherein the size difference threshold is specified as a ratio between sizes of the baseline non-key frames and the encoded non-key frames of the current GOP.
- 12 . The method of claim 7 , wherein the comparing size metadata comprises generating multiple packet size thresholds from the baseline non-key frames and comparing the size metadata of the current GOP to the multiple packet size thresholds.
- 13 . The method of claim 7 , wherein the discarding comprises deleting the current GOP or wherein the discarding comprises not communicating the current GOP over a network.
- 14 . The method of claim 7 , wherein the decoding and performing object detection comprises communicating the current GOP over a network to a device that performs the decoding and object detection.
- 15 . The method of claim 7 , wherein the refining is performed in response to an indication that no object of interest is present in the individual decoded GOPs.
- 16 . One or more computer-readable storage media storing instructions which, when executed by one or more processing devices, cause the one or more processing devices to perform acts comprising: receiving groups of pictures (GOPs) of a scene, the GOPs including encoded key frames and encoded non-key frames; obtaining size metadata for baseline non-key frames of GOPs that do not include an object of interest; comparing size metadata of encoded non-key frames of a current GOP to the size metadata of the baseline non-key frames without decoding the current GOP; discarding the current GOP in instances when a difference between the size metadata of the encoded non-key frames of the current GOP and the size metadata of the baseline non-key frames is below a size difference threshold; causing the current GOP to be decoded and object detection to be performed on the current GOP in other instances when the difference between the size metadata of the encoded non-key frames of the current GOP and the size metadata of the baseline non-key frames meets the size difference threshold; and, refining the size difference threshold based at least on whether objects are detected by the object detection in individual decoded GOPs.
- 17 . The one or more computer-readable storage media of claim 16 , the acts further comprising: determining the size difference threshold as a threshold ratio of packet sizes of the baseline non-key frames and the encoded non-key frames of the current GOP.
- 18 . The one or more computer-readable storage media of claim 17 , the acts further comprising: based at least on the threshold ratio, determining a packet threshold size for discarding the current GOP or causing the current GOP to be decoded and object detection to be performed on the current GOP; and employing the packet threshold size to selectively discard or cause decoding and object detection to be performed on the current GOP.
- 19 . The one or more computer-readable storage media of claim 18 , the acts further comprising: adjusting the threshold ratio to an updated threshold ratio based at least on false positive results obtained by the object detection; and adjusting the packet threshold size to an updated packet threshold size based at least on the updated threshold ratio.
- 20 . The one or more computer-readable storage media of claim 19 , the acts further comprising: determining the packet threshold size based at least on a mean packet size of the baseline non-key frames, a standard deviation of packet sizes of the baseline non-key frames, and a multiplier that is based at least on the updated threshold ratio.
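The threshold derivation recited in claims 17-20 can be illustrated with a brief sketch. Note that this is a hypothetical instantiation: claim 20 says only that the packet threshold size is "based at least on" a mean, a standard deviation, and a multiplier derived from the threshold ratio, so the exact formula and all names below are assumptions for illustration.

```python
from statistics import mean, stdev

def packet_threshold_size(baseline_sizes, threshold_ratio):
    # Hypothetical instantiation of claim 20: combine the mean packet size
    # of the baseline non-key frames, their standard deviation, and a
    # multiplier based on the threshold ratio.
    mu = mean(baseline_sizes)
    sigma = stdev(baseline_sizes)
    multiplier = threshold_ratio  # assumed mapping from ratio to multiplier
    return mu + multiplier * sigma

def refine_threshold_ratio(threshold_ratio, false_positives, step=0.1):
    # Per claim 19: raise the ratio when object detection reports false
    # positives, so that fewer empty GOPs reach the decoder.
    return threshold_ratio + step if false_positives else threshold_ratio

# Baseline P/B-frame packet sizes (bytes) from GOPs with no object of interest
baseline = [900, 950, 1020, 980, 1010, 940]
threshold = packet_threshold_size(baseline, threshold_ratio=1.5)
```

Per claim 18, the resulting packet threshold size can then be used directly to decide whether to discard a current GOP or route it to decoding and object detection.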
Description
BACKGROUND

Cameras, such as security or surveillance cameras, capture large amounts of video, often as continuous streams. For instance, a security camera may continuously image the entrance to a building. In many cases, users are only interested when an object of interest, such as a human, animal, or vehicle, is in the video. As such, a large percentage of this video often contains nothing of interest to the users.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings illustrate implementations of the concepts conveyed in the present patent. Features of the illustrated implementations can be more readily understood by reference to the following description taken in conjunction with the accompanying drawings. Like reference numbers in the various drawings are used wherever feasible to indicate like elements. Further, the left-most numeral of each reference number conveys the figure and associated discussion where the reference number is first introduced.

FIGS. 1A and 1B show an example system operating relative to an example scene to which some of the present resource saving object detection concepts can be applied. FIGS. 2 and 7 show example systems configured to employ the present resource saving object detection concepts in accordance with some implementations. FIGS. 3-6 show example flowcharts for accomplishing resource saving object detection concepts in accordance with some implementations.

DETAILED DESCRIPTION

This patent relates to cameras and video imaging scenarios. Cameras are often employed in a particular scene for a desired purpose, such as detecting when people are in the scene. Watching all of the video when the user is only interested in the subset in which a person is present is burdensome and a waste of time. As such, automated processes have been developed to identify objects of interest in the video and provide only those portions of the video to the user.
However, these existing automated processes are extremely resource intensive. In these traditional processes, video is continuously captured and encoded as groups of pictures (GOPs). The GOPs include key frames, such as I frames, and non-key frames, such as P and/or B frames. All of the encoded video is transmitted as GOPs over network resources, either locally or to remote locations. Processing resources are then utilized to decode the transmitted video (e.g., all of the GOPs). Additional processing resources are employed to detect (or not detect) objects of interest in the GOPs.

The present concepts solve this technical problem of undesirable resource usage by providing a technical solution that greatly reduces resource usage while still providing object detection. The technical solution includes an object detection resource gateway that distinguishes frames of GOPs that include objects of interest from those that do not, without decoding the frames. Thus, resources such as network resources, decoding resources, and/or detecting resources are employed on a smaller percentage (e.g., a subset) of the GOPs and frames of the video stream. This technical solution is accomplished by examining metadata associated with the GOPs. Specifically, metadata of non-key frames of a GOP can be analyzed to accurately identify whether the frames of the GOP include an object of interest. Resources can be expended on the GOPs that include objects of interest. Other GOPs that do not include objects of interest can be discarded or otherwise allocated fewer resources.

This difference can be illustrated with an example use case scenario. Assume that a surveillance camera records a scene, such as a back entrance of a building, for 24 hours. Assume further that only one scene change occurs in those 24 hours: in a span of one minute, a person (e.g., an object of interest) walks up, opens the door, and enters the building.
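The gateway's comparison step, which inspects encoded non-key frame size metadata without decoding, can be sketched as follows. This is a minimal illustration, not the patented implementation: the GOP representation, the single averaged baseline, and the two-consecutive-frames rule (drawn from claim 5) are simplifying assumptions.

```python
def gop_may_contain_object(nonkey_frame_sizes, baseline_sizes,
                           size_difference_threshold):
    """Decide, without decoding, whether a GOP may contain an object of interest.

    Compares encoded non-key (P/B) frame packet sizes against baseline sizes
    captured when no object was present. The GOP is flagged for decoding and
    object detection when two consecutive non-key frames exceed the threshold;
    otherwise it can be discarded.
    """
    baseline_avg = sum(baseline_sizes) / len(baseline_sizes)
    consecutive = 0
    for size in nonkey_frame_sizes:
        if size - baseline_avg >= size_difference_threshold:
            consecutive += 1
            if consecutive >= 2:
                return True   # route the GOP to decode + object detection
        else:
            consecutive = 0
    return False              # discard the GOP, saving network/decode resources

baseline = [1000, 1010, 990, 1005]    # static scene, no object of interest
quiet_gop = [1002, 995, 1008, 1001]   # little change: discarded undecoded
busy_gop = [1000, 1400, 1450, 1500]   # motion inflates P/B frame sizes
```

The key property is that only frame sizes are read; the expensive decode and detection steps run solely on GOPs that pass this cheap check.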
For the other 23 hours and 59 minutes the scene is static (e.g., no object of interest is present). Existing techniques would handle encoded frames from the camera uniformly for the 24-hour period (e.g., resources would be used to transmit, decode, and attempt to detect objects of interest on all frames). In contrast, the present techniques could evaluate metadata of the encoded frames, identify the one-minute period when the person entered the building, and expend additional resources on that one-minute period rather than on the remaining 23 hours and 59 minutes of encoded frames. Thus, the present concepts provide a substantial technical solution for conserving resources relative to object detection in video streams.

FIGS. 1A and 1B collectively show an example system 100A. A camera 102 captures video of a scene 104. In this case, the scene 104 is a hallway in a building. The video of the scene is encoded as a series of groups of pictures (GOPs) 106. Six GOPs (GOP106(1)-GOP106(6)) are represented for purposes of explanation and ease of illustration. (Note t