US-12627822-B2 - Video and feature coding for multi-task machine learning

US12627822B2

Abstract

A system and method for video and feature coding of neural-network structures used for multi-task machine learning includes an encoder, a decoder, and a decoder-compliant bitstream. A task-specific video decoder includes a first decoder receiving a bitstream having at least one feature and a description of a neural network backbone used to generate the bitstream, and a task-specific neural network head. The neural network head recreates the neural network backbone from the description, receives a feature from the bitstream, and generates a task-specific output.
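The pipeline the abstract describes — a backbone at the encoder extracts feature maps, which travel in a bitstream together with a description of the backbone, and a task-specific head at the decoder consumes them — can be sketched roughly as follows. All function names and the toy "bitstream" dict are illustrative assumptions, not from the patent; a real system would use a trained CNN and a standard video codec (AVC/HEVC/VVC) to code the feature maps.

```python
# Illustrative sketch of the split-inference pipeline from the abstract.
# The backbone is a crude stand-in for a convolution+pooling stage, and
# the "bitstream" is just a dict; names here are hypothetical.

def backbone(frame):
    """First part of the CNN (stand-in): reduce each 2x2 block of the
    input frame (a list of lists) to its maximum, a toy pooling stage."""
    h, w = len(frame), len(frame[0])
    return [[max(frame[y][x], frame[y][x + 1],
                 frame[y + 1][x], frame[y + 1][x + 1])
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]

def encode(frame):
    """Pack a feature map plus a description of the backbone into a
    toy 'bitstream'."""
    return {"feature_map": backbone(frame),
            "backbone_description": {"type": "max-pool", "stride": 2}}

def detect_head(bitstream):
    """Task-specific head (stand-in): report whether any activation in
    the decoded feature map exceeds a threshold."""
    fmap = bitstream["feature_map"]
    return any(v > 0.5 for row in fmap for v in row)

frame = [[0.1, 0.2, 0.0, 0.9],
         [0.3, 0.1, 0.2, 0.1],
         [0.0, 0.0, 0.1, 0.0],
         [0.2, 0.1, 0.0, 0.3]]
bs = encode(frame)
print(bs["feature_map"])   # [[0.3, 0.9], [0.2, 0.3]]
print(detect_head(bs))     # True
```

The key design point the abstract relies on is that only the compact feature maps (plus a backbone description) cross the channel, not the full video, so the decoder-side head can complete its task without reconstructing pixels.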

Inventors

  • Hari Kalva
  • Borivoje Furht
  • Velibor Adzic

Assignees

  • OP SOLUTIONS LLC

Dates

Publication Date
2026-05-12
Application Date
2024-06-17

Claims (20)

  1. A machine video decoder for a machine video task comprising: a first decoder receiving a bitstream, the bitstream having at least one feature map extracted from a video signal by a neural network backbone at an encoder and a description of the neural network backbone, the neural network backbone being a first part of a convolutional neural network and comprising a first set of convolution layers and a first set of pooling layers, the first decoder being one of an AVC decoder, an HEVC decoder and a VVC decoder and providing the at least one feature map and backbone description as outputs; a neural network head comprising a second part of the convolutional neural network, the neural network head being trained for a specific machine task, the neural network head receiving the backbone description and at least one feature map from the first decoder and generating a task-specific output for the machine video task, wherein the neural network head comprises a deep neural network.
  2. The machine video decoder of claim 1, further comprising a plurality of neural network heads, each of said neural network heads being trained for a specific task and receiving the at least one feature map and the output of the neural network backbone and generating a task-specific output.
  3. The machine video decoder of claim 1 wherein the second part of the convolutional neural network further comprises a second set of one or more convolution layers and a second set of one or more pooling layers.
  4. The machine video decoder of claim 1 wherein the bitstream includes a feature sequence parameter set containing first information about feature maps, a plurality of feature picture parameter sets containing second information about the feature maps, and a plurality of feature picture headers containing third information about the feature maps.
  5. The machine video decoder of claim 1 wherein the bitstream contains an SEI message containing information about the first part of the convolutional neural network.
  6. The machine video decoder of claim 1 wherein a split point between the first part of the convolutional neural network and the second part of the convolutional neural network is adaptively selected.
  7. The machine video decoder of claim 1 wherein the bitstream contains information about the size and position of the feature maps.
  8. A method for decoding an encoded bitstream for a machine video task comprising: receiving a bitstream encoded using one of an AVC, HEVC, or VVC compliant encoding protocol, the bitstream including a sequence of feature maps extracted from a source video by an encoder using a first part of a convolutional neural network comprising a first set of one or more convolution layers and a first set of one or more pooling layers, and decoding the bitstream with one of an AVC decoder, an HEVC decoder, or a VVC decoder and outputting the sequence of feature maps; and applying the sequence of feature maps from the first part of the convolutional neural network to a second part of the convolutional neural network, the second part of the convolutional neural network completing the machine video task, wherein the second part of the convolutional neural network comprises a deep neural network.
  9. The decoding method of claim 8 wherein the second part of the convolutional neural network further comprises a second set of one or more convolution layers and a second set of one or more pooling layers.
  10. The decoding method of claim 8 wherein the bitstream includes a feature sequence parameter set containing first information about feature maps, a plurality of feature picture parameter sets containing second information about the feature maps, and a plurality of feature picture headers containing third information about the feature maps.
  11. The decoding method of claim 8 wherein the bitstream contains an SEI message containing information about the first part of the convolutional neural network.
  12. The decoding method of claim 8 wherein the bitstream includes information about the first part of the convolutional neural network.
  13. The decoding method of claim 9 wherein a split point between the first part of the convolutional neural network and the second part of the convolutional neural network is adaptively selected.
  14. The decoding method of claim 8 wherein the machine task is a machine vision task.
  15. The decoding method of claim 8, wherein the machine vision task is one of detecting a class of an object, tracking an object, and object segmentation.
  16. The decoding method of claim 8 further comprising outputting the feature maps to a plurality of second parts of a convolutional neural network, each second part for completing a different machine video task.
  17. The decoding method of claim 8 wherein the deep neural network is a fully connected neural network.
  18. The decoding method of claim 8 wherein the bitstream contains information about the size and position of the feature maps.
  19. A machine video encoder for a machine video task, the encoder comprising: a feature map extractor, the feature map extractor being a first part of a convolutional neural network and comprising a first set of convolution layers and a first set of pooling layers, the feature map extractor outputting a sequence of feature maps extracted from an input source video, and an encoder encoding the extracted feature maps using one of an AVC, an HEVC, or a VVC encoding protocol to generate an encoded bitstream for a machine video task to be completed at a decoding site having a second part of the convolutional neural network.
  20. The machine video encoder of claim 19 wherein the bitstream includes a feature sequence parameter set containing first information about the feature maps, a plurality of feature picture parameter sets containing second information about the feature maps, and a plurality of feature picture headers containing third information about the feature maps.
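Claims 2 and 16 describe a fan-out in which several task-specific heads consume the same decoded feature maps. A minimal sketch of that dispatch is below; the two head functions are trivial stand-ins for trained networks, and every name is illustrative rather than taken from the patent.

```python
# Sketch of the multi-head fan-out of claims 2 and 16: one decoded
# feature map is handed to every registered task-specific head. The
# heads here are toy stand-ins for trained networks.

def classify_head(fmap):
    # Toy "classification": index of the strongest activation,
    # flattened in row-major order.
    flat = [v for row in fmap for v in row]
    return flat.index(max(flat))

def count_head(fmap):
    # Toy "object count": activations above a fixed threshold.
    return sum(1 for row in fmap for v in row if v > 0.5)

def run_heads(fmap, heads):
    """Apply every registered head to the same decoded feature map."""
    return {name: head(fmap) for name, head in heads.items()}

decoded_fmap = [[0.3, 0.9],
                [0.2, 0.6]]
results = run_heads(decoded_fmap,
                    {"classify": classify_head, "count": count_head})
print(results)  # {'classify': 1, 'count': 2}
```

The point of the structure is that the (expensive) backbone runs once at the encoder and its output is decoded once, while adding a new machine task only requires registering another head at the decoding site.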

Description

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of international application serial number PCT/US22/53579 filed on Dec. 21, 2022, entitled VIDEO AND FEATURE CODING FOR MULTI-TASK MACHINE LEARNING, which claims the benefit of priority to U.S. Provisional Patent Application Ser. No. 63/293,157 filed on Dec. 23, 2021, and entitled VIDEO AND FEATURE CODING FOR MULTI-TASK MACHINE LEARNING, and also claims the benefit of priority to U.S. Provisional Patent Application Ser. No. 63/293,217 filed on Dec. 23, 2021, and entitled SYSTEMS AND METHODS FOR ADAPTIVE NEURAL NETWORK OPTIMIZATION FOR MULTIPLE TASK FEATURE CODING, the disclosures of which are hereby incorporated by reference in their entireties.

FIELD OF THE INVENTION

The present invention generally relates to the field of video encoding and decoding. In particular, the present invention is directed to systems and methods for video and feature coding for multi-task machine learning.

BACKGROUND

A video codec can include an electronic circuit or software that compresses or decompresses digital video. It can convert uncompressed video to a compressed format or vice versa. In the context of video compression, a device that compresses video (and/or performs some function thereof) can typically be called an encoder, and a device that decompresses video (and/or performs some function thereof) can be called a decoder. A format of the compressed data can conform to a standard video compression specification. The compression can be lossy in that the compressed video lacks some information present in the original video. A consequence of this can include that decompressed video can have lower quality than the original uncompressed video because there is insufficient information to accurately reconstruct the original video.
There can be complex relationships between the video quality, the amount of data used to represent the video (e.g., determined by the bit rate), the complexity of the encoding and decoding algorithms, sensitivity to data losses and errors, ease of editing, random access, end-to-end delay (e.g., latency), and the like. Motion compensation can include an approach to predict a video frame or a portion thereof given a reference frame, such as previous and/or future frames, by accounting for motion of the camera and/or objects in the video. It can be employed in the encoding and decoding of video data for video compression, for example in the encoding and decoding using the Motion Picture Experts Group (MPEG)'s advanced video coding (AVC) standard (also referred to as H.264). Motion compensation can describe a picture in terms of the transformation of a reference picture to the current picture. The reference picture can be previous in time when compared to the current picture, or from the future when compared to the current picture. When images can be accurately synthesized from previously transmitted and/or stored images, compression efficiency can be improved. While video content is often considered for human consumption, there is a growing need for video in industrial settings and other settings in which the content is evaluated by machines rather than humans. Recent trends in robotics, surveillance, monitoring, the Internet of Things, etc. have introduced use cases in which a significant portion of all the images and videos recorded in the field is consumed by machines only, without ever reaching human eyes. Those machines process images and videos with the goal of completing tasks such as object detection, object tracking, segmentation, event detection, etc.
Recognizing that this trend is prevalent and will only accelerate in the future, international standardization bodies established efforts to standardize image and video coding that is primarily optimized for machine consumption. For example, standards like JPEG AI and Video Coding for Machines have been initiated in addition to already established standards such as Compact Descriptors for Visual Search and Compact Descriptors for Video Analytics. Further improving encoding and decoding of video for consumption by machines, and in hybrid systems in which video is consumed by both a human viewer and a machine, is therefore of growing importance in the field. In many applications, such as surveillance systems with multiple cameras, intelligent transportation, smart city applications, and/or intelligent industry applications, traditional video coding may require compression of a large number of videos from cameras and transmission through a network for both machine consumption and for human consumption. Subsequently, at a machine site, algorithms for feature extraction may be applied, typically using convolutional neural networks or deep learning techniques, including object detection, event and action recognition, pose estimation, and others.

SUMMARY OF THE DISCLOSURE

A task-specific decoder is provided that includes a first decoder receiving a bitstream, the bitstream having at least one feature an