KR-20260068107-A - System and method for content-adaptive multi-scale feature layer filtering

Abstract

A system and method for encoding and decoding video for machine consumption are provided, in which bandwidth is reduced by filtering, at the encoder site, feature layers determined to be redundant or of reduced relevance. The video encoder includes a neural network frontend that receives image data and generates multiple feature layers. A layer context processor and a redundant layer identifier cooperate to evaluate the context of objects within the image data and determine the relevance of the multiple feature layers to a machine task at a decoder site. A layer filter then generates a filtered set of layers by performing at least one of removing redundant layers and scaling layers identified as having low relevance to the machine task.
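As a concrete illustration of the pipeline described in the abstract, the following is a minimal sketch, assuming hypothetical per-layer relevance scores; the names relevance, remove_threshold, and scale_threshold are illustrative and not terms from the patent. The patent leaves the scoring and scaling methods open; "scaling" is shown here as value attenuation, though spatial downscaling is an equally plausible reading.

```python
# Minimal sketch of content-adaptive layer filtering; thresholds are assumptions.
import numpy as np

def filter_feature_layers(layers, relevance, remove_threshold=0.2, scale_threshold=0.5):
    """Remove layers judged redundant and scale layers of low relevance.

    layers:    dict mapping layer name (e.g. "P2".."P5") to a (C, H, W) array
    relevance: dict mapping layer name to a score in [0, 1], standing in for the
               output of the layer context processor / redundant layer identifier
    Returns the filtered layer set and signaling info describing what changed.
    """
    filtered, signaling = {}, {"removed": [], "scaled": {}}
    for name, fmap in layers.items():
        score = relevance[name]
        if score < remove_threshold:
            signaling["removed"].append(name)      # drop redundant layer
        elif score < scale_threshold:
            filtered[name] = fmap * score          # attenuate low-relevance layer
            signaling["scaled"][name] = score
        else:
            filtered[name] = fmap                  # keep as-is
    return filtered, signaling

# Example: four FPN-style layers with progressively decreasing spatial size
rng = np.random.default_rng(0)
layers = {f"P{i}": rng.standard_normal((256, 64 >> (i - 2), 64 >> (i - 2)))
          for i in range(2, 6)}
relevance = {"P2": 0.9, "P3": 0.6, "P4": 0.3, "P5": 0.1}
filtered, signaling = filter_feature_layers(layers, relevance)
print(signaling)   # {'removed': ['P5'], 'scaled': {'P4': 0.3}}
```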

Inventors

  • Adzic, Velibor
  • Furht, Borivoje
  • Kalva, Hari
  • Melos, Juan

Assignees

  • OP Solutions, LLC

Dates

Publication Date
2026-05-13
Application Date
2024-09-12
Priority Date
2023-09-12

Claims (12)

  1. A video encoder of a video coding for machines system, comprising: a neural network frontend that receives image data and generates a plurality of feature layers; a layer context processor that evaluates context objects in the image data that influence the importance of a layer of a feature map for a machine task; a redundant layer identifier that applies the output of the layer context processor and determines whether the plurality of feature layers are relevant to the machine task; a layer filter that receives the plurality of feature layers from the neural network frontend and the output of the redundant layer identifier, and generates a filtered set of layers by performing at least one of removing redundant layers and scaling layers identified as having low relevance to the machine task; and an encoder that receives the filtered set of layers and generates a coded bitstream of the filtered set of layers.
  2. The video encoder of claim 1, wherein the coded bitstream further includes signaling information representing the plurality of layers removed or modified by the layer filter.
  3. The video encoder of claim 1, wherein the plurality of feature layers have progressively decreasing layer sizes.
  4. The video encoder of claim 1, wherein the neural network frontend includes a feature pyramid network.
  5. The video encoder of claim 1, wherein the redundant layer identifier further includes a lightweight object detector.
  6. A decoder for a video coding for machines system, comprising circuitry configured to: receive an encoded bitstream generated by an encoder that generates a plurality of feature layers and optionally filters the plurality of feature layers before encoding, the bitstream including signaling information identifying the filtered feature layers and which layers were affected by filtering; decompress the encoded bitstream; apply the signaling information to the decompressed bitstream to reconstruct the layers filtered by the encoder; and apply the reconstructed feature layers to a neural network trained for a machine task.
  7. The decoder of claim 6, wherein the signaling information explicitly signals the layers removed during encoding.
  8. The decoder of claim 6, wherein the signaling information implicitly signals the layers removed during encoding.
  9. A method of transmitting an encoded bitstream for video coding for machines, comprising: applying a neural network frontend that receives image data and generates a plurality of feature layers; applying a layer context processor that evaluates context objects in the image data that influence the importance of a layer of a feature map for a machine task; applying a redundant layer identifier that takes the output of the layer context processor and determines whether the plurality of feature layers are relevant to the machine task; applying a layer filter that receives the plurality of feature layers from the neural network frontend and the output of the redundant layer identifier, and generates a filtered set of layers by performing at least one of removing redundant layers and scaling layers identified as having low relevance to the machine task; and receiving the filtered set of layers and applying an encoder to generate a coded bitstream of the filtered set of layers for transmission.
  10. The method of claim 9, wherein the coded bitstream further comprises signaling information representing the plurality of layers removed or modified by the layer filter.
  11. The method of claim 10, wherein the bitstream explicitly signals a layer removed by the layer filter.
  12. The method of claim 10, wherein the bitstream implicitly signals a layer removed by the layer filter.
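The signaling claims (claims 2, 7, 8, and 10 through 12) distinguish explicit from implicit signaling of removed layers. The sketch below contrasts the two under an assumed, illustrative bitstream syntax; the claims do not specify a concrete layout, so the bitmask header and layer naming are hypothetical.

```python
# Hedged sketch: explicit vs. implicit signaling of removed feature layers.
import struct

LAYER_IDS = ["P2", "P3", "P4", "P5"]

def signal_explicit(removed):
    # Explicit: write a presence bitmask, one bit per layer, into a header byte.
    mask = 0
    for i, name in enumerate(LAYER_IDS):
        if name not in removed:
            mask |= 1 << i
    return struct.pack("B", mask)

def parse_explicit(header):
    # Decoder reads the mask and recovers exactly which layers were removed.
    (mask,) = struct.unpack("B", header)
    return [name for i, name in enumerate(LAYER_IDS) if not mask & (1 << i)]

def infer_implicit(transmitted_layer_names):
    # Implicit: no extra syntax; the decoder infers removal from the layers
    # that simply never appear in the decoded payload.
    return [name for name in LAYER_IDS if name not in transmitted_layer_names]

header = signal_explicit(removed=["P5"])
assert parse_explicit(header) == ["P5"]
assert infer_implicit(["P2", "P3", "P4"]) == ["P5"]
```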

Description

System and method for content-adaptive multi-scale feature layer filtering

The present application claims the benefit of priority to U.S. provisional application serial number 63/537,927, filed on September 12, 2023, and titled "Content Adaptive Multi-Scale Feature Layer Filtering," the disclosure of which is incorporated herein by reference in its entirety.

The present application generally relates to the field of video encoding and decoding. In particular, the present invention relates to a system and method for reducing bandwidth in a machine video coding system by selecting, removing, or modifying selected feature layers in an encoded bitstream.

As the number and scale of deployed video sensors and devices increase, increasingly large volumes of video are expected to be processed by machines. Systems utilizing thousands of cameras generate massive amounts of video that cannot be monitored by humans in a cost-effective manner. Machines or computational systems that collect and analyze video provide effective solutions that enable decision support systems and analytics engines. Machines designed to perform analytical tasks are not as sensitive to video quality and resolution as human operators are. Video Coding for Machines (VCM) addresses these opportunities by transforming and representing video in a way that minimizes the computation, storage, and streaming of video data, while ensuring that machine tasks are performed with high operational efficiency.

Feature Coding based Video Coding for Machines (FCVCM) is based on the observation that while video analysis based on convolutional neural networks (CNNs) is the most prominent solution, it requires significant computational resources. The FCVCM approach compresses and transmits video features extracted from a CNN; at the receiving end, processing continues in a CNN that performs machine tasks using the decompressed features.

A typical CNN for performing object detection is illustrated in FIG. 1. A CNN trained on data such as video, images, audio, LiDAR, thermal imagery, and even text is used to perform tasks and provide information about data inputs to the network. The input image is first processed by a Feature Pyramid Network (FPN, 105), which generates feature maps in layers P2, P3, P4, and P5. Feature maps from layers P2, P3, P4, and P5 are further processed by a Region Proposal Network (RPN, 110) and a Box Head (115); the Box Head (115) may include a Fast R-CNN ConvFCHead (120) and Fast R-CNN output layers (125) that determine regions of interest (ROIs) and the labels detected for those ROIs. In these architectures, the FPN (105), RPN (110), and Box Head (115) represent computationally complex processes.

By splitting the computations performed in the CNN between the transmitter/camera side and the receiver side, both system complexity and data usage can be reduced. In such architectures, the camera or video source end of the system consists of a first part of the CNN, such as an FPN (105) front end. The output of the FPN, namely feature maps P2, P3, P4, and P5, is then compressed and transmitted to a receiver, where the received and decompressed feature maps are input to the second part of the network. Components such as the RPN (110) and Box Head (115) complete the execution of the network and generate the network output.
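One way to picture this split is the following sketch. The Frontend and Backend classes are illustrative stand-ins for the two parts of the CNN, and compress/decompress stand in for the feature codec; none of these names come from the patent, and the feature-map shapes merely mimic typical FPN strides.

```python
# Illustrative split-inference pipeline: FPN front end at the camera,
# RPN / box head back end at the receiver.
import numpy as np

class Frontend:
    """Camera-side part 1 of the CNN (e.g., a ResNet+FPN)."""
    def run(self, image):
        h, w = image.shape[:2]
        # Stand-in for real FPN outputs: 256-channel maps at strides 4..32.
        return {f"P{i}": np.zeros((256, h >> i, w >> i), np.float32)
                for i in range(2, 6)}

class Backend:
    """Receiver-side part 2 of the CNN (RPN + box head)."""
    def run(self, feature_maps):
        # Placeholder for region proposals and ROI classification.
        return [{"roi": (0, 0, 16, 16), "label": "object", "score": 0.5}]

def compress(feature_maps):      # stand-in for the feature codec
    return feature_maps

def decompress(bitstream):
    return bitstream

image = np.zeros((512, 512, 3), np.uint8)
features = Frontend().run(image)                   # camera side
bitstream = compress(features)                     # transmitted
detections = Backend().run(decompress(bitstream))  # receiver side
```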
Such a CNN can be trained to perform tasks such as object detection, segmentation, action detection, and object tracking. FIG. 2 illustrates a typical approach for processing such a split network. NN part 1 (205) represents a camera/video source in which an input video or image is processed by part 1 of the neural network to generate multiple feature maps, e.g., four feature maps P2, P3, P4, and P5. The number and structure of the feature maps vary depending on the neural network; for example, some networks may have only three feature maps instead of four. The size or dimension of each feature map also varies depending on the neural network. The feature maps are prepared for compression by packing and quantizing them.

FIG. 3 further illustrates NN part 1 (305). In this step, features are extracted from a neural network such as the ResNet backbone of the FPN (105). These features may consist of four layers P2 (310a), P3 (310b), P4 (310c), and P5 (310d), each having 256 channels. Each channel of each layer represents a convolution by a kernel and represents input image features. Typically, all channels of a given layer have the same dimensions. For example, as illustrated in FIG. 4, in layer P2 each channel has two dimensions, P2cW and P2cH. These two dimensions, representing the width and height of the channels, vary depending on the input image and the neural network. Similarly, the channel dimensions in the other three layers of this exemplary network can be represented as P3cW × P3cH, P4cW × P4cH, and P5cW × P5cH, with corresponding widths and heights for each layer.
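The packing and quantization step mentioned above can be sketched as follows. The 16-column tiling grid, 8-bit depth, and per-channel min/max quantization are illustrative assumptions, not the patent's specified method; the point is only that the channels of one layer can be tiled into a single 2D frame that a conventional codec can compress.

```python
# Sketch: quantize each channel of a layer to 8 bits and tile into one frame.
import numpy as np

def quantize(channel, bits=8):
    lo, hi = channel.min(), channel.max()
    scale = (hi - lo) / (2**bits - 1) if hi > lo else 1.0
    return np.round((channel - lo) / scale).astype(np.uint8), (lo, scale)

def pack_layer(layer, cols=16):
    """Tile a (C, H, W) layer into a (rows*H, cols*W) 2D frame."""
    c, h, w = layer.shape
    rows = (c + cols - 1) // cols
    frame = np.zeros((rows * h, cols * w), np.uint8)
    params = []
    for idx in range(c):
        q, p = quantize(layer[idx])
        params.append(p)                       # kept for dequantization
        r, col = divmod(idx, cols)
        frame[r*h:(r+1)*h, col*w:(col+1)*w] = q
    return frame, params

# e.g. P2 with 256 channels of size P2cH x P2cW = 64 x 64
p2 = np.random.default_rng(1).standard_normal((256, 64, 64)).astype(np.float32)
frame, params = pack_layer(p2)
print(frame.shape)   # (1024, 1024): a 16x16 grid of 64x64 channels
```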