CN-122003867-A - Method, apparatus and system for encoding and decoding tensors

CN122003867A

Abstract

Systems and methods for decoding tensors and video from a bitstream, the tensors being related to the video. The method includes decoding a set of video parameters indicating the presence of a plurality of sub-bitstreams in a bitstream, decoding video from a first sub-bitstream generated from the bitstream by performing a first sub-bitstream extraction process on the bitstream, and decoding a tensor for the decoded video from a second sub-bitstream obtained by performing a second sub-bitstream extraction process on the bitstream.

Inventors

  • Christopher James Rosewarne

Assignees

  • Canon Kabushiki Kaisha (Canon Inc.)

Dates

Publication Date
2026-05-08
Application Date
2024-08-28
Priority Date
2023-10-10

Claims (9)

  1. A method for decoding a tensor and video from a bitstream, the bitstream comprising a first sub-bitstream and a second sub-bitstream, the tensor being related to the video, the method comprising: decoding a video parameter set indicating the presence of a plurality of sub-bitstreams in the bitstream; decoding the video from a first sub-bitstream obtained by performing a first sub-bitstream extraction process on the bitstream; and decoding a tensor for the decoded video from a second sub-bitstream obtained by performing a second sub-bitstream extraction process on the bitstream.
  2. The method of claim 1, wherein the video parameter set is decoded using an HEVC multi-layer Main profile, wherein a 4:2:0 chroma format and a bit depth of 8 are used for both the first and second sub-bitstreams, and a neutral value is used for chroma samples in the second sub-bitstream, the neutral value being 128.
  3. The method of claim 1, wherein the video parameter set is decoded using a VVC Multilayer Main 10 profile having a 4:2:0 chroma format for video layers and a 4:0:0 chroma format for feature layers.
  4. The method of claim 1, wherein the bitstream may include a plurality of second sub-bitstreams.
  5. The method of claim 1, wherein the first sub-bitstream extraction process filters the bitstream based on a first layer identifier to obtain the first sub-bitstream from the bitstream, and the second sub-bitstream extraction process filters the bitstream based on a second layer identifier to obtain the second sub-bitstream from the bitstream, the first sub-bitstream and the second sub-bitstream each being independently decodable.
  6. The method of claim 1, further comprising determining task results from the decoded tensor using a portion of a neural network.
  7. The method of claim 6, further comprising rendering the task results in association with the video.
  8. A decoder for decoding a tensor and video from a bitstream, the bitstream comprising a first sub-bitstream and a second sub-bitstream, the tensor being associated with the video, the decoder being configured to: decode a video parameter set indicating the presence of a plurality of sub-bitstreams in the bitstream; decode the video from a first sub-bitstream obtained by performing a first sub-bitstream extraction process on the bitstream; and decode a tensor for the decoded video from a second sub-bitstream obtained by performing a second sub-bitstream extraction process on the bitstream.
  9. A non-transitory computer readable storage medium storing a program for performing a method for decoding a tensor and video from a bitstream, the bitstream comprising a first sub-bitstream and a second sub-bitstream, the tensor being related to the video, the method comprising: decoding a video parameter set indicating the presence of a plurality of sub-bitstreams in the bitstream; decoding the video from a first sub-bitstream obtained by performing a first sub-bitstream extraction process on the bitstream; and decoding a tensor for the decoded video from a second sub-bitstream obtained by performing a second sub-bitstream extraction process on the bitstream.
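
The layer-identifier filtering recited in claim 5 can be illustrated with a simplified sketch. The bitstream is modelled here as a plain list of (layer_id, payload) pairs standing in for NAL units; the function name and data structure are illustrative assumptions, not the HEVC/VVC extraction syntax itself.

```python
# Simplified sketch of layer-identifier-based sub-bitstream extraction.
# A "bitstream" is modelled as a list of (layer_id, payload) NAL units;
# this is an illustrative model, not the actual HEVC/VVC process.

def extract_sub_bitstream(nal_units, target_layer_ids):
    """Keep only the NAL units whose layer identifier is in the target set."""
    return [(lid, payload) for lid, payload in nal_units
            if lid in target_layer_ids]

# Example: layer 0 carries the video, layer 1 carries the tensor data.
bitstream = [(0, "video-slice-0"), (1, "tensor-slice-0"),
             (0, "video-slice-1"), (1, "tensor-slice-1")]

video_sub = extract_sub_bitstream(bitstream, {0})   # first extraction process
tensor_sub = extract_sub_bitstream(bitstream, {1})  # second extraction process
```

Each resulting sub-bitstream contains only one layer's NAL units, consistent with the claim that each is independently decodable.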

Description

Method, apparatus and system for encoding and decoding tensors

Citation of Related Application

The present application claims the benefit under 35 U.S.C. §119 of the filing date of Australian patent application 2023248075, filed 10 October 2023, which is incorporated herein by reference in its entirety as if fully set forth herein.

Technical Field

The present invention relates generally to digital video signal processing and, in particular, to methods, apparatus and systems for encoding and decoding tensors from convolutional neural networks. The invention also relates to a computer program product comprising a computer readable medium having recorded thereon a computer program for encoding and decoding tensors from a convolutional neural network using video compression techniques.

Background

Convolutional neural networks (CNNs) are an emerging technology for machine-vision use cases such as object detection, instance segmentation, object tracking, human pose estimation, and action recognition. Applications of CNNs may involve "edge devices" with sensors and some processing power, coupled to an application server forming part of a "cloud". CNNs can require relatively high computational complexity, exceeding what edge devices can typically provide in terms of compute capacity or power consumption. Executing a CNN in a distributed manner has become one solution: an edge device of limited capability runs the front-end portion of the network, so that not all of the computational load falls on the cloud server, and otherwise underutilised inference resources on the edge device are put to use. In other words, distributed processing allows legacy edge devices to still deliver the capability of a leading-edge CNN by sharing processing between the edge devices and other processing components such as cloud servers.
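The split-inference arrangement described above can be sketched minimally as follows. The "layers" here are placeholder arithmetic functions rather than real CNN layers, and the split point is an arbitrary illustrative choice.

```python
# Illustrative sketch of collaborative-intelligence style split inference:
# the "edge" runs the first part of a network, transmits the intermediate
# result (the tensor), and the "cloud" runs the remainder.
# Layers are placeholder functions, not real CNN layers.

layers = [lambda x: x * 2, lambda x: x + 3, lambda x: x * x, lambda x: x - 1]

def run(part, x):
    for layer in part:
        x = layer(x)
    return x

split = 2
edge_part, cloud_part = layers[:split], layers[split:]

intermediate = run(edge_part, 5)        # computed on the edge device
result = run(cloud_part, intermediate)  # completed in the cloud
assert result == run(layers, 5)         # matches the unsplit network
```

The intermediate value plays the role of the tensor that must be compressed for transmission between the two devices.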
Such a distributed network architecture may be referred to as "collaborative intelligence" (CI), and provides benefits such as reusing partial results from a first portion of the network for several different second portions (possibly with each second portion optimised for a different task). The CI architecture introduces a need for efficient compression of tensor data for transmission over a network such as a WAN. CNNs typically comprise many layers, such as convolutional layers and fully-connected layers, with data passed from one layer to the next in the form of a "tensor". Splitting the network across different devices therefore requires compressing the intermediate multidimensional tensor data passed from one layer to the next within the CNN, to facilitate transmission over a network with bandwidth limitations or costs. Such compression of tensors may be referred to as "feature compression", and the intermediate tensor data is often referred to as "features" or "feature maps". A feature map is typically a two-dimensional (2D) array of values; a collection of feature maps combined into a 3D (or 4D) data structure forms a tensor, with each feature map corresponding to one "channel" of the tensor. The intermediate tensor data represents a partially processed version of the input to the neural network, such as an image frame or video frame.

The International Organization for Standardization / International Electrotechnical Commission Joint Technical Committee 1 / Subcommittee 29 / Working Group 2 (ISO/IEC JTC1/SC29/WG2), part of the body also known as the "Moving Picture Experts Group" (MPEG), is tasked with studying the requirements for compression technology in various contexts, often associated with video. WG2 ("MPEG Technical Requirements") established an ad-hoc group on "Feature Compression for Video Coding for Machines" (FCVCM), which was commissioned to study feature compression.
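The relationship between feature maps and tensor channels described above can be made concrete with a small sketch using plain nested lists; the helper function is illustrative only.

```python
# Minimal sketch of feature maps as tensor channels: each feature map is a
# 2-D array of values, and stacking C maps of equal size yields a tensor
# of shape (C, H, W). Pure-Python nested lists are used for illustration.

def stack_feature_maps(feature_maps):
    h = len(feature_maps[0])
    w = len(feature_maps[0][0])
    assert all(len(m) == h and all(len(row) == w for row in m)
               for m in feature_maps), "all channels must share H x W"
    return feature_maps  # a (C, H, W) tensor as nested lists

fmap0 = [[0, 1], [2, 3]]   # channel 0: one 2x2 feature map
fmap1 = [[4, 5], [6, 7]]   # channel 1
tensor = stack_feature_maps([fmap0, fmap1])
shape = (len(tensor), len(tensor[0]), len(tensor[0][0]))  # (C, H, W)
```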
The FCVCM ad-hoc group issued a "Call for Proposals" soliciting responses to form the basis of a standardisation project on feature compression. Previously, responses to a "Call for Evidence" (CfE) had shown that dedicated feature compression techniques can achieve results significantly better than applying the most advanced standardised video compression techniques directly to tensors. CNNs typically require the weights of the various layers to be predetermined in a training phase, in which a very large amount of training data is passed through the CNN and the results produced by the network are compared against "ground truth" values associated with the training data. The difference between the obtained result and the desired result is termed the "loss" and is measured using a "loss function". Using the determined loss, a process for updating the network weights, such as stochastic gradient descent (SGD), is performed. Weight updates typically involve back-propagation of "gradients", a process that starts at the output layer of the network and works backwards, terminating when the input layer of the network is updated, the
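The loss-driven weight update described above can be sketched for a single linear "layer" y = w * x with a squared-error loss; the loss function and learning rate are illustrative choices, not taken from this document.

```python
# Hedged sketch of a gradient-descent weight update for one scalar weight.
# Illustrative only: real SGD operates on many weights over mini-batches.

def sgd_step(w, x, target, lr=0.1):
    pred = w * x
    loss = (pred - target) ** 2         # squared-error "loss"
    grad = 2 * (pred - target) * x      # d(loss)/dw via the chain rule
    return w - lr * grad, loss          # gradient-descent update

w, x, target = 0.0, 1.0, 2.0
for _ in range(50):
    w, loss = sgd_step(w, x, target)
# repeated updates drive w toward 2.0 and the loss toward zero
```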