
EP-4736135-A1 - SYSTEMS AND METHOD FOR DECODED FRAME AUGMENTATION FOR VIDEO CODING FOR MACHINES

EP 4736135 A1

Abstract

The present systems and methods for video decoding for machine processing extract features and image statistics from the decoded bitstream, evaluate the image statistics to predict whether augmentation of a frame will enhance task performance, and generate at least one parameter for selectively applying an augmentation process. The parameter is applied to selectively alter at least a portion of a frame to enhance task performance by the machine processing the decoded bitstream.
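The abstract's pipeline (extract image statistics, predict whether augmentation will help, then selectively alter the frame) can be sketched as below. This is an illustrative sketch only: the specific statistics (mean brightness, standard-deviation contrast, Laplacian-variance blur), the fixed blur threshold, and the unsharp-mask augmentation are assumptions standing in for the trained prediction model and augmentation process of the disclosure.

```python
import numpy as np

def image_statistics(frame):
    """Simple no-reference statistics for a decoded grayscale frame.
    Brightness, contrast, and a Laplacian-variance blur proxy stand in
    for the statistics named in the disclosure (blur, brightness,
    color, resolution, contrast, compression)."""
    lap = (-4.0 * frame[1:-1, 1:-1]
           + frame[:-2, 1:-1] + frame[2:, 1:-1]
           + frame[1:-1, :-2] + frame[1:-1, 2:])
    return {"brightness": float(frame.mean()),
            "contrast": float(frame.std()),
            "blur": float(lap.var())}  # low variance suggests a blurry frame

def predict_augmentation(stats, blur_threshold=50.0):
    """Toy stand-in for the trained prediction model: request sharpening
    only when the frame looks blurry."""
    return {"apply": stats["blur"] < blur_threshold, "strength": 1.0}

def augment(frame, params):
    """Selectively alter the frame (unsharp masking) per the parameter."""
    if not params["apply"]:
        return frame
    padded = np.pad(frame, 1, mode="edge")
    h, w = frame.shape
    blurred = sum(padded[i:i + h, j:j + w]          # 3x3 box blur
                  for i in range(3) for j in range(3)) / 9.0
    return np.clip(frame + params["strength"] * (frame - blurred), 0.0, 255.0)
```

A decoder-side loop would then call `augment(frame, predict_augmentation(image_statistics(frame)))` on each decoded frame before handing it to the machine-analysis network.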

Inventors

  • ADZIC, VELIBOR
  • FURHT, BORIVOJE
  • KALVA, HARI
  • MERLOS, JUAN

Assignees

  • OP Solutions, LLC

Dates

Publication Date
2026-05-06
Application Date
2024-06-30

Claims (15)

  1. A decoder for video coding for machine consumption employing frame augmentation, comprising: a video decoder receiving an encoded bitstream and providing a decoded bitstream comprising a plurality of frames; a feature extractor module for extracting features and image statistics from the decoded bitstream for machine processing; and an augmentation module applying the extracted features and image statistics and selectively altering at least a portion of at least one frame of the decoded bitstream to enhance task performance by the machine processing the decoded bitstream.
  2. The decoder of claim 1, further comprising a prediction module, said prediction module interposed between the feature extractor module and the augmentation module and providing at least one parameter to the augmentation module indicating whether to selectively apply augmentation for at least one frame of the decoded bitstream.
  3. The decoder of claim 2, wherein the prediction module includes a trained neural network evaluating decoded frame attributes including at least one of quantization parameters, motion parameters, block partitioning, and header information describing encoder parameters.
  4. The decoder of claim 2, wherein selectively applying augmentation further comprises selectively adjusting the magnitude of feature augmentation for at least one frame of the decoded bitstream.
  5. The decoder of claim 1, wherein the feature extractor module extracts at least one image statistic from the decoded bitstream.
  6. The decoder of claim 5, wherein the at least one image statistic includes at least one of statistics related to blur, brightness, color, BRISQUE, resolution, contrast, and compression.
  7. The decoder of claim 1, wherein a frame includes a plurality of coding blocks and wherein the augmentation module performs at least one of sharpening and blurring boundaries between adjacent coding blocks.
  8. The decoder of claim 4, wherein the prediction module is configured for: acquiring image statistics from the extractor module; performing image augmentation on a current frame; performing object detection on the augmented current frame; determining image mAP higher and mAP lower parameters for the augmented frame; determining at least one image statistics score; and applying the image statistics and the image statistics score to a trained prediction model to determine at least one mAP performance prediction.
  9. A decoder for video coding for machine consumption employing frame augmentation, comprising: a video decoder receiving an encoded bitstream and providing a decoded bitstream comprising a plurality of frames; a feature extractor module extracting features and image statistics from the decoded bitstream for machine processing; a prediction module, said prediction module coupled to the feature extractor module, receiving image statistics therefrom, and providing at least one parameter to selectively apply augmentation for at least one frame of the decoded bitstream; and an augmentation module receiving the at least one parameter from the prediction module and selectively altering at least a portion of at least one frame to enhance task performance by the machine processing the decoded bitstream.
  10. The decoder of claim 9, wherein the prediction module includes a trained neural network evaluating decoded frame attributes including at least one of quantization parameters, motion parameters, block partitioning, and header information describing encoder parameters.
  11. The decoder of claim 9, wherein the feature extractor module extracts at least one image statistic from the decoded bitstream.
  12. The decoder of claim 11, wherein the at least one image statistic includes at least one of statistics related to blur, brightness, color, BRISQUE, resolution, contrast, and compression.
  13. The decoder of claim 9, wherein a frame includes a plurality of coding blocks and wherein the augmentation module performs at least one of sharpening and blurring boundaries between adjacent coding blocks.
  14. The decoder of claim 9, wherein the prediction module further comprises a processor programmed with instructions for: acquiring image statistics from the extractor module; performing image augmentation on a current frame; performing object detection on the augmented current frame; determining image mAP higher and mAP lower parameters for the augmented frame; determining at least one image statistics score; and applying the image statistics and the image statistics score to a trained prediction model to determine at least one mAP performance prediction.
  15. A method for improving task performance of a machine processing encoded image data, comprising: receiving an encoded bitstream comprising compressed image data; decoding the encoded bitstream; extracting features and image statistics from the decoded bitstream; evaluating the image statistics to predict whether augmentation of a frame will enhance task performance and generating at least one parameter for selectively applying an augmentation process; and using the at least one parameter to selectively alter at least a portion of at least one frame to enhance task performance by the machine processing the decoded bitstream.
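Claims 8 and 14 recite an offline procedure: augment a training frame, run object detection, record whether detection mAP rose ("mAP higher") or fell ("mAP lower"), and train a prediction model on the image-statistics scores. A minimal sketch of that labeling-and-prediction idea follows; the nearest-centroid rule and the scalar score are hypothetical simplifications of the claimed trained prediction model, not the patented implementation.

```python
class StatisticsPredictor:
    """Minimal stand-in for the trained prediction model of claims 8/14:
    a nearest-centroid rule over a scalar image-statistics score."""

    def fit(self, scores, helped):
        # 'helped' labels come from the offline mAP comparison per frame.
        ones = [s for s, h in zip(scores, helped) if h]
        zeros = [s for s, h in zip(scores, helped) if not h]
        self.c1 = sum(ones) / len(ones)    # centroid of "mAP higher" frames
        self.c0 = sum(zeros) / len(zeros)  # centroid of "mAP lower" frames
        return self

    def predict(self, score):
        """Return True when augmentation is predicted to raise mAP."""
        return abs(score - self.c1) <= abs(score - self.c0)


def label_frame(map_before, map_after):
    """Offline labeling step: did augmenting this frame help detection?"""
    return map_after > map_before
```

At inference time the decoder would skip the detection step entirely and rely on `predict()` alone, which is the point of the claimed prediction module: deciding from statistics whether augmentation is worth applying.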

Description

Systems and Method for Decoded Frame Augmentation for Video Coding for Machines

Statement of Related Cases

[0001] The present application claims the benefit of priority to U.S. provisional application serial number 63/524,455, filed on June 30, 2023, and entitled "System and Method for Decoded Frame Augmentation for Video Coding for Machines," the disclosure of which is hereby incorporated by reference in its entirety.

Background of the Disclosure

[0002] In recent times, a significant portion of all the images and videos recorded in the field are consumed only by machines, without ever reaching human eyes. Those machines process images and videos with the goal of completing tasks such as object detection, object tracking, segmentation, and event detection. Recognizing that this trend is prevalent and will only accelerate, international standardization bodies have established efforts to standardize image and video coding that is primarily optimized for machine consumption. For example, standards such as JPEG AI and Video Coding for Machines are in development, in addition to already established standards such as Compact Descriptors for Visual Search and Compact Descriptors for Video Analytics. Solutions that improve efficiency compared to classical image and video coding techniques are needed.

[0003] Video Coding for Machines (VCM) is the process of compressing image/video information for machine consumption. As used herein, VCM is not limited to any particular protocol or standard and is intended to broadly convey compression and decompression of data for machine consumption. Machine consumption is the process of machines consuming information, in this case in the form of images/video. This can include object detection and segmentation tasks.
Current VCM systems follow this architecture: video/images are captured by a digital camera or recording device, and the video/image information is compressed using a device or software called a codec (compressor-decompressor). The compressed image/video information is sent to a receiving device, where it is decompressed and interpreted by a machine. The compression can be performed using traditional block-based video encoders such as Versatile Video Coding (VVC), neural-network based compression, or a hybrid of traditional coding and neural-network based compression.

[0004] Figure 1 is a block diagram of a system for encoding and decoding video for machine-based applications. Video coding in the system of Figure 1 can include any standard video encoder and/or encoding techniques such as, for example, Advanced Video Coding (AVC), Versatile Video Coding (VVC), or High Efficiency Video Coding (HEVC).

[0005] Still referring to Fig. 1, frames from decoded videos at the receiver are used as input to trained neural networks that perform tasks such as object detection, object segmentation, and object tracking. At the encoder side, the system typically includes an image capture system 105, such as a video camera or other system for image capture, including LIDAR and other non-visual "image" data or high-bandwidth machine-readable data. The system further includes an image/video compression system 110. An encoded bitstream is transmitted over a suitable transmission channel to a receiving machine. The received bitstream is received by a decompression block, which substantially inverts the compression process and passes the decompressed bitstream to a machine analysis system 120. In general, applying compression is a lossy process that can degrade the quality of the compressed video relative to the source and may impact the performance of machine tasks.
Higher levels of compression are more efficient for transmission but can lead to larger degradation in quality and in machine task performance. Methods to improve task performance without increasing the size of the compressed video will improve VCM applications and services.

[0006] Conventional approaches, unfortunately, may require massive video transmission, especially in applications having multiple cameras or other high-bandwidth endpoints, which may take significant time for efficient and fast real-time analysis and decision-making. In certain embodiments, a VCM approach may resolve this problem by both encoding video and extracting some features at a transmitter site and then transmitting a resultant encoded bitstream to a VCM decoder. At a decoder site, video may be decoded for human vision and features may be decoded for machines. As used herein, the term VCM refers broadly to video coding and decoding for machine consumption and is not limited to a specific proposed protocol.

[0007] A "feature," as used in this disclosure, is a specific structural and/or content attribute of data. Examples of features may include SIFT, audio features, color hist, motion hist, speech level, loudness level, or the like. Features may be time stamped. Each feature may b
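Paragraph [0007] mentions "color hist" among example features and notes that features may be time stamped. A minimal sketch of such a feature is shown below; the histogram binning and the record layout are illustrative assumptions, since the disclosure does not fix a feature format.

```python
def color_histogram(pixels, bins=8):
    """Illustrative 'color hist' feature: an 8-bin histogram over
    intensity values 0-255, normalized to sum to 1.  'pixels' is any
    iterable of pixel intensities (e.g. a flattened frame)."""
    counts = [0] * bins
    for p in pixels:
        counts[min(int(p) * bins // 256, bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]

def timestamped_feature(name, values, t):
    """Features may be time stamped; bundle a feature with its timestamp."""
    return {"name": name, "values": values, "t": t}
```

A transmitter-side extractor could emit one such record per frame alongside the encoded bitstream, letting the VCM decoder recover features for machines without fully decoding the video for human vision.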