
EP-4738826-A1 - NEURAL NETWORK-BASED METHOD AND APPARATUS FOR COMPRESSING MULTI-LAYER FEATURE MAP


Abstract

A method and an apparatus for encoding a multi-layer feature map, and a recording medium, of the present disclosure may comprise the steps of: extracting, from an input image, a multi-layer feature map including a plurality of feature maps in a hierarchical form; outputting a single-layered fusion latent representation by sequentially encoding the multi-layer feature map through consecutive encoding blocks; and encoding the single-layered fusion latent representation into a bitstream.

Inventors

  • JEONG, SE YOON
  • JEONG, HYE WON
  • KIM, YOUN HEE
  • LEE, JOO YOUNG
  • CHOI, JIN SOO
  • KANG, JUNG WON
  • KIM, YEONG WOONG
  • KIM, HUI YONG
  • YU, JANG HYUN
  • JANG, SEUNG HWAN

Assignees

  • Electronics and Telecommunications Research Institute
  • University-Industry Cooperation Group Of Kyung Hee University

Dates

Publication Date
2026-05-06
Application Date
2024-06-28

Claims (20)

  1. A multi-layer feature map encoding method, the method comprising: extracting a multi-layer feature map including a plurality of feature maps in a layer form from an input image; sequentially encoding the multi-layer feature map through consecutive encoding blocks to output a fusion latent representation of a single layer; and encoding the fusion latent representation of the single layer into a bitstream.
  2. The method of Claim 1, wherein the consecutive encoding blocks include a first encoding block that uses a feature map having a highest resolution of the multi-layer feature map as an input, a second encoding block that uses a first intermediate latent representation which is an output of the first encoding block as an input, a third encoding block that uses a first fusion latent representation which is an output of the second encoding block as an input, and a fourth encoding block that uses a second fusion latent representation which is an output of the third encoding block as an input to output the fusion latent representation of the single layer.
  3. The method of Claim 2, wherein an input of the second encoding block, the third encoding block, and the fourth encoding block includes a feature map of the multi-layer feature map having a same resolution as an input intermediate latent representation or input fusion latent representation.
  4. The method of Claim 2, wherein the first encoding block, the second encoding block, and the third encoding block include a convolutional neural network layer using an n×n-sized kernel in the same structure and a LeakyReLU nonlinear layer.
  5. The method of Claim 4, wherein the fourth encoding block does not include the LeakyReLU nonlinear layer.
  6. The method of Claim 2, wherein a resolution of an output of the first encoding block, the second encoding block, the third encoding block and the fourth encoding block is lower than a resolution of an input.
  7. The method of Claim 1, wherein the consecutive encoding blocks include a first encoding block that uses a feature map having a highest resolution of the multi-layer feature map as an input, a second encoding block that uses a first intermediate latent representation which is an output of the first encoding block as an input, a third encoding block that uses a feature map of the multi-layer feature map having a lower resolution than a first fusion latent representation which is an output of the second encoding block as an input, and a fourth encoding block that uses a second intermediate latent representation which is an output of the third encoding block and the first fusion latent representation as an input to output the fusion latent representation of the single layer.
  8. The method of Claim 7, wherein an input of the second encoding block and the fourth encoding block includes a feature map of the multi-layer feature map having a same resolution as an input intermediate latent representation or input fusion latent representation.
  9. The method of Claim 7, wherein the first encoding block, the second encoding block, and the third encoding block include a convolutional neural network layer using an n×n-sized kernel in the same structure and a LeakyReLU nonlinear layer.
  10. The method of Claim 9, wherein the fourth encoding block does not include the LeakyReLU nonlinear layer.
  11. The method of Claim 7, wherein a resolution of an output of the first encoding block, the second encoding block, and the fourth encoding block is lower than a resolution of an input, and wherein a resolution of the output of the third encoding block is higher than a resolution of an input.
  12. A multi-layer feature map decoding method, the method comprising: decoding a fusion latent representation of a single layer for a multi-layer feature map including a plurality of feature maps in a layer form from a bitstream; and sequentially decoding the multi-layer feature map through consecutive decoding blocks from the fusion latent representation of the single layer.
  13. The method of Claim 12, wherein the consecutive decoding blocks include a first decoding block that uses the fusion latent representation of the single layer as an input, a second decoding block that uses a first feature map of the multi-layer feature map having a lowest resolution which is an output of the first decoding block as an input, a third decoding block that uses a second feature map which is an output of the second decoding block having a higher resolution than the first feature map as an input, and a fourth decoding block that uses a third feature map which is an output of the third decoding block having a higher resolution than the second feature map as an input to output a fourth feature map of the multi-layer feature map having a highest resolution.
  14. The method of Claim 13, wherein the first decoding block, the second decoding block, and the third decoding block include a convolutional neural network layer using an n×n-sized kernel in the same structure and a LeakyReLU nonlinear layer.
  15. The method of Claim 14, wherein the fourth decoding block does not include the LeakyReLU nonlinear layer.
  16. The method of Claim 13, wherein a resolution of an output of the first decoding block, the second decoding block, the third decoding block and the fourth decoding block is higher than a resolution of an input.
  17. The method of Claim 12, wherein the consecutive decoding blocks include a first decoding block that uses the fusion latent representation of the single layer as an input, a second decoding block that uses a first feature map which is an output of the first decoding block as an input to output a second feature map of the multi-layer feature map having a lowest resolution, a third decoding block that uses the first feature map as an input to output a third feature map having a higher resolution than the first feature map, and a fourth decoding block that uses the third feature map as the input to output a fourth feature map of the multi-layer feature map having a highest resolution.
  18. The method of Claim 17, wherein the second decoding block, the third decoding block, and the fourth decoding block include a convolutional neural network layer using an n×n-sized kernel in the same structure and a LeakyReLU nonlinear layer.
  19. The method of Claim 17, wherein a resolution of an output of the first decoding block, the third decoding block, and the fourth decoding block is higher than a resolution of an input, and wherein a resolution of an output of the second decoding block is lower than the resolution of an input.
  20. A computer readable recording medium storing a bitstream generated by a multi-layer feature map encoding method, wherein the multi-layer feature map encoding method includes: extracting a multi-layer feature map including a plurality of feature maps in a layer form from an input image; sequentially encoding the multi-layer feature map through consecutive encoding blocks to output a fusion latent representation of a single layer; and encoding the fusion latent representation of the single layer into a bitstream.
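The sequential fusion of Claims 1 to 6 can be illustrated with a shape-level sketch. This is a minimal illustration, not the patented implementation: the pyramid resolutions, channel counts, and the latent width of 192 are all assumed values, and the encoding block is modeled only as a resolution-halving step (Claim 6) with channel concatenation standing in for the fusion of the same-resolution pyramid level (Claim 3).

```python
# Shape-level sketch of the sequential fusion encoder of Claims 1-6.
# All resolutions, channel counts, and the latent width are assumptions
# for illustration; they are not specified in the patent.

LATENT_CHANNELS = 192  # assumed width of each intermediate latent


def encode_multilayer(pyramid):
    """pyramid: dict mapping spatial resolution -> channel count for each
    layer of the multi-layer feature map.

    Each encoding block halves the spatial resolution (Claim 6).
    Blocks 2-4 additionally take the pyramid level whose resolution
    matches the incoming latent, modeled here as channel concatenation
    (Claim 3).
    """
    resolutions = sorted(pyramid, reverse=True)
    res = resolutions[0]          # block 1 input: highest-resolution layer
    channels = pyramid[res]
    trace = []                    # (block, in_res, in_ch, out_res, out_ch)
    for block in range(1, 5):
        out_res = res // 2        # downsampling convolution (Claim 6)
        trace.append((block, res, channels, out_res, LATENT_CHANNELS))
        res, channels = out_res, LATENT_CHANNELS
        if res in pyramid:        # fuse the same-resolution layer (Claim 3)
            channels += pyramid[res]
    return res, trace             # resolution of the single-layer fusion latent


# Example: a 4-level, 256-channel pyramid (sizes are illustrative only).
final_res, trace = encode_multilayer({64: 256, 32: 256, 16: 256, 8: 256})
```

With these assumed sizes, the four blocks map the 64-resolution layer down to a single fused latent at resolution 4, consuming the 32-, 16-, and 8-resolution layers along the way.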

Description

[Technical Field]
The present disclosure relates to a method for encoding a feature map extracted through an artificial neural network, so as to reduce the compression bit amount of the feature map while minimizing performance degradation of a machine task performed using the decoded feature map.

[Background Art]
As machine tasks come into wide use on various devices, from mobile devices to large servers, the feature map extraction means and the task execution means increasingly reside in different devices rather than the same one. When the feature map extraction means and the task execution means are separated, the extracted feature map must be delivered to the task execution means; because the amount of data in a feature map is very large, a feature map encoding method is required that dramatically reduces this data amount while minimizing the degradation of task execution performance.

[Disclosure]
[Technical Problem]
The existing technology either encodes each layer of a multi-layer feature map independently or merges the layers into one fusion feature map and encodes it. The former suffers from increased complexity due to multiple encodings performed in parallel; the latter suffers from increased complexity due to the serial execution of the fusion process and the latent representation extraction process, and has the disadvantage of not being able to sufficiently utilize the sequential attribute between layers.

[Technical Solution]
A multi-layer feature map encoding method, device and recording medium of the present disclosure may include extracting a multi-layer feature map including a plurality of feature maps in a layer form from an input image, sequentially encoding the multi-layer feature map through consecutive encoding blocks to output a fusion latent representation of a single layer, and encoding the fusion latent representation of the single layer into a bitstream.
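The claims specify that each encoding block pairs a convolution with a LeakyReLU nonlinearity, the final block omitting the nonlinearity. A minimal numeric sketch of that composition follows; the negative slope of 0.01 and the stand-in "convolution" weights are assumptions for illustration, not values from the patent.

```python
def leaky_relu(x, slope=0.01):
    # Standard LeakyReLU: identity for x >= 0, a small negative slope
    # otherwise. The slope value is an assumption; the patent does not
    # specify one.
    return x if x >= 0 else slope * x


def encoding_block(value, is_final=False):
    # Stand-in for the n x n convolution: any linear map of the input
    # (weights here are purely illustrative). Per the claims, the final
    # block applies the convolution only, with no LeakyReLU.
    convolved = 0.5 * value - 1.0
    return convolved if is_final else leaky_relu(convolved)
```

Omitting the nonlinearity in the last block leaves the fused latent unrestricted in sign and range, which matches its role as the representation that is subsequently entropy-coded into the bitstream.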
In a multi-layer feature map encoding method, device and recording medium of the present disclosure, the consecutive encoding blocks may include a first encoding block that uses the feature map having the highest resolution of the multi-layer feature map as input, a second encoding block that uses a first intermediate latent representation, which is the output of the first encoding block, as input, a third encoding block that uses a first fusion latent representation, which is the output of the second encoding block, as input, and a fourth encoding block that uses a second fusion latent representation, which is the output of the third encoding block, as input to output the fusion latent representation of the single layer. In a multi-layer feature map encoding method, device and recording medium of the present disclosure, the input of the second encoding block, the third encoding block and the fourth encoding block may include a feature map of the multi-layer feature map having the same resolution as the input intermediate latent representation or fusion latent representation. In a multi-layer feature map encoding method, device and recording medium of the present disclosure, the first encoding block, the second encoding block and the third encoding block may include a convolutional neural network layer using an n×n-sized kernel in the same structure and a LeakyReLU nonlinear layer. In a multi-layer feature map encoding method, device and recording medium of the present disclosure, the fourth encoding block may not include the LeakyReLU nonlinear layer. In a multi-layer feature map encoding method, device and recording medium of the present disclosure, the resolution of the output of the first encoding block, the second encoding block, the third encoding block and the fourth encoding block may be lower than the resolution of the input.
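On the decoding side (Claims 12 to 16), the process is mirrored: consecutive decoding blocks start from the single-layer fusion latent representation, each doubles the spatial resolution, and the block outputs reconstruct the pyramid from lowest to highest resolution. A shape-level sketch, with the starting resolution an assumed value:

```python
def decode_multilayer(fused_res, n_blocks=4):
    # Shape-level sketch of the decoder of Claims 12-16: each decoding
    # block doubles spatial resolution (Claim 16), and the block outputs
    # rebuild the multi-layer feature map from lowest to highest
    # resolution (Claim 13). Resolutions are illustrative assumptions.
    reconstructed = []
    res = fused_res
    for _ in range(n_blocks):
        res *= 2                  # upsampling (e.g. a transposed convolution)
        reconstructed.append(res)
    return reconstructed


# Round-trip with a fused latent of resolution 4 (illustrative only).
pyramid_resolutions = decode_multilayer(4)
```

Under these assumptions a fused latent at resolution 4 yields the four pyramid levels at resolutions 8, 16, 32 and 64, the inverse of the encoder's resolution-halving path.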
In a multi-layer feature map encoding method, device and recording medium of the present disclosure, the consecutive encoding blocks may include a first encoding block that uses the feature map having the highest resolution of the multi-layer feature map as input, a second encoding block that uses a first intermediate latent representation, which is the output of the first encoding block, as input, a third encoding block that uses a feature map of the multi-layer feature map having a lower resolution than a first fusion latent representation, which is the output of the second encoding block, as input, and a fourth encoding block that uses a second intermediate latent representation, which is the output of the third encoding block, and the first fusion latent representation as input to output the fusion latent representation of the single layer. In a multi-layer feature map encoding method, device and recording medium of the present disclosure, the input of the second encoding block and the fourth encoding block may include a feature map of the multi-layer feature map having the same resolution as the input intermediate latent representation or fusion latent representation. In a multi-layer feature map encoding method, device and recording me