KR-20260066745-A - Method and device for encoding and decoding image sequences
Abstract
The present invention relates to a method for coding and decoding a sequence of images. The decoding method comprises, for at least one current image of the sequence: - decoding a set of images referred to as correction images, comprising at least one correction image representing correction data for said at least one current image; - decoding a set of feature maps referred to as prediction feature maps, comprising at least one feature map representing motion of said at least one current image; - decoding a set of parameters representing a prediction neural network; - obtaining a set of reference images comprising at least one decoded reference image; - processing said prediction feature map and said reference image through a synthesis module comprising said at least one prediction neural network, said prediction feature map being applied to an input of said module, to generate at least one prediction image; and - reconstructing the current image from said at least one prediction image and said at least one correction image.
Inventors
- Ladune, Théo
- Philippe, Pierrick
- Leguay, Thomas
Assignees
- Orange
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2024-09-12
- Priority Date
- 2023-09-14
Claims (15)
- A method for coding a sequence of images (SV), comprising the following steps for at least one current image (Iv) of the sequence: - A step (E21) of initializing at least one correction image (IC); - A step (E21) of initializing a set of parameters (WP) representing a prediction neural network (MLPP); - A step (E21) of initializing a set of feature maps referred to as prediction feature maps (FP); - A step (E24) of obtaining a set of reference images (FREF) comprising at least one reference image that is coded and subsequently decoded; - A step (E24) of processing the prediction feature map (FP) and the reference image through a synthesis module comprising the at least one prediction neural network, the prediction feature map being applied to an input of said module, to generate at least one prediction image (ICO); - A step (E24) of reconstructing the current image (Iv) from the at least one prediction image (ICO) and the at least one correction image (IC); - A step (E22, E25) of updating at least one value of the at least one feature map and/or of at least one parameter of the network, as a function of a measure of coding performance; - A step (E26) of coding a bit stream, comprising: - A step of coding the at least one correction image (IC); - A step of coding the set of parameters (WPc) representing the prediction neural network (MLPP); - A step of coding the set of prediction feature maps (FPc).
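Purely as an illustration, and not as the claimed implementation, the performance-driven update step of claim 1 (E22, E25) can be sketched as a small optimization loop; the cost function and all names below are hypothetical stand-ins:

```python
import numpy as np

def coding_cost(params, target):
    # Toy stand-in for the coding-performance measure of steps E22/E25:
    # distortion plus a rough rate proxy on the parameter magnitudes.
    distortion = np.sum((params - target) ** 2)
    rate = 0.01 * np.sum(np.abs(params))
    return distortion + rate

target = np.array([1.0, -2.0, 0.5])   # values the feature map should reproduce
params = np.zeros(3)                  # feature-map values after initialization (E21)

for _ in range(200):                  # update loop driven by the cost measure
    grad = 2.0 * (params - target) + 0.01 * np.sign(params)
    params -= 0.05 * grad             # gradient step on the toy cost
```

In the actual method the updated quantities (feature maps and/or network parameters) are the ones subsequently coded into the bit stream in step E26.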
- A method for decoding a sequence of images (SV) from a bit stream, comprising the following steps for at least one current image (Iv) of the sequence: - A step (E31) of decoding a set of images referred to as correction images (IC), comprising at least one correction image representing correction data for the at least one current image (Iv); - A step (E32) of decoding a set of feature maps referred to as prediction feature maps (FPc), comprising at least one feature map representing motion of the at least one current image; - A step (E34) of decoding a set of parameters (WPc) representing a prediction neural network (MLPP); - A step (E35) of obtaining a set of reference images (FREF) comprising at least one decoded reference image; - A step (E36) of processing the prediction feature map (FP) and the reference image (FREF) through a synthesis module (MPS) comprising the at least one prediction neural network, the prediction feature map being applied to an input of said module, to generate at least one prediction image (ICO); - A step (E38) of reconstructing the current image (Iv) from the at least one prediction image (ICO) and the at least one correction image (IC).
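As a non-authoritative sketch of how the decoding steps E31 to E38 fit together, the following toy pipeline uses random placeholders for the decoded bitstream elements; the synthesis module and all names are simplified stand-ins, not the claimed implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
H, W = 4, 4

# Hypothetical stand-ins for the elements decoded from the bit stream:
correction = rng.standard_normal((H, W)) * 0.1   # IC, step E31
feature_map = rng.standard_normal((H, W))        # FP, step E32 (motion-related)
reference = rng.standard_normal((H, W))          # FREF, step E35

def synthesis(fp, ref):
    # Toy synthesis module MPS (step E36): the feature map is read here as a
    # rounded horizontal displacement applied to the reference image.
    dx = np.rint(fp).astype(int)
    xs = np.clip(np.arange(ref.shape[1])[None, :] + dx, 0, ref.shape[1] - 1)
    return np.take_along_axis(ref, xs, axis=1)

prediction = synthesis(feature_map, reference)   # ICO
reconstructed = prediction + correction          # step E38 (addition, as in claim 10)
```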
- A coding or decoding method according to claim 1 or 2, characterized in that the processing step comprises the following sub-steps: - A step (E33, E34) of applying the prediction feature map (FP) to the input of the at least one prediction neural network to generate motion information (MV); - A step (E37) of motion-compensating the at least one reference image (FREF) using the motion information to generate the at least one prediction image (ICO).
- A coding or decoding method according to claim 3, characterized in that the step of applying the prediction feature map to the input of the at least one prediction neural network comprises, for at least one sample referred to as the current sample (Pn) of the at least one current image (Iv), associated with a position (xn, yn) within the image: - A step (E33) of constructing a feature vector (Zn) from the at least one prediction feature map as a function of the position (xn, yn) of the at least one current sample; and - A step of applying the vector (Zn) to the input of the prediction neural network (MLPP) to provide a vector representing the prediction of the at least one current sample (Pn).
- A coding or decoding method according to claim 1 or 2, characterized in that the processing step comprises a step (E33, E34, E35) of applying the prediction feature map (FP) and the at least one reference image (FREF) to the input of the at least one prediction neural network to generate the at least one prediction image (ICO).
- A coding or decoding method according to claim 5, characterized in that the step of applying the prediction feature map and the at least one reference image to the input of the at least one prediction neural network comprises, for at least one sample referred to as the current sample (Pn) of the at least one current image (Iv), associated with a position (xn, yn) within the image: - A step (E33) of constructing a feature vector (Zn) from the at least one prediction feature map and the at least one reference image as a function of the position (xn, yn) of the at least one current sample; and - A step (E34) of applying the vector (Zn) to the input of the prediction neural network (MLPP) to provide a vector representing the prediction of the at least one current sample (Pn).
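The per-sample construction of claims 4 and 6 can be illustrated with the sketch below, in which the feature maps, the tiny MLP, and all names are hypothetical stand-ins for the prediction network MLPP:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 8, 8

# Two decoded prediction feature maps, at full and half resolution.
feature_maps = [rng.standard_normal((H, W)), rng.standard_normal((H // 2, W // 2))]
reference = rng.standard_normal((H, W))        # one decoded reference image (luma)

# Toy MLP weights: one input per feature map, plus one reference sample.
W1 = rng.standard_normal((3, 16)); b1 = np.zeros(16)
W2 = rng.standard_normal((16, 1)); b2 = np.zeros(1)

def build_feature_vector(x, y):
    # Sample each feature map at the (possibly downscaled) position (x, y).
    z = [fm[y * fm.shape[0] // H, x * fm.shape[1] // W] for fm in feature_maps]
    z.append(reference[y, x])                  # claim 6: a reference sample joins Zn
    return np.array(z)

def mlp_predict(z):
    # Tiny one-hidden-layer MLP standing in for the prediction network.
    h = np.maximum(W1.T @ z + b1, 0.0)         # ReLU hidden layer
    return W2.T @ h + b2

x_n, y_n = 3, 5
z_n = build_feature_vector(x_n, y_n)           # step E33
p_n = mlp_predict(z_n)                         # prediction of sample Pn
```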
- A decoding method according to any one of claims 2 to 6, characterized in that the step of decoding the set of correction images further comprises the following sub-steps: - A step (E31) of decoding a set of parameters (WCc) representing a correction neural network (MLPC); - A step (E31) of decoding a set of feature maps referred to as correction feature maps (FCc); - A step (E31) of applying the decoded correction feature map (FC) to the input of the at least one correction neural network to generate the at least one correction image (IC).
- A decoding method according to claim 7, wherein the correction neural network (MLPC) comprises an MLP and/or a convolutional network.
- A coding or decoding method according to any one of claims 1 to 8, wherein the predictive neural network (MLPP) comprises an MLP and/or a convolutional network.
- A coding or decoding method according to any one of claims 1 to 9, characterized in that the reconstructing step (E38) comprises an addition or subtraction applied to the correction image and the prediction image.
- A coding or decoding method according to any one of claims 1 to 10, wherein the reconstructing step (E38) uses a convolutional neural network applied to the correction image and the prediction image.
- A coding or decoding method according to claim 3 or 4, characterized in that the motion information is a dense field of two-dimensional motion vectors.
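A dense motion field as in claim 12 drives a warping of the reference image such as the one sketched below (integer-pel and border-clamped for simplicity; this is an illustration, not the patented compensation, and all names are hypothetical):

```python
import numpy as np

def warp(reference, motion):
    # Motion-compensate `reference` with a dense field of 2D vectors.
    # motion[y, x] = (dy, dx): each output pixel is fetched from
    # reference[y + dy, x + dx], clamped to the image borders.
    H, W = reference.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_y = np.clip(ys + motion[..., 0], 0, H - 1)
    src_x = np.clip(xs + motion[..., 1], 0, W - 1)
    return reference[src_y, src_x]

ref = np.arange(16).reshape(4, 4)
mv = np.zeros((4, 4, 2), dtype=int)
mv[..., 1] = 1          # every pixel fetches from one column to the right
pred = warp(ref, mv)
```

A practical codec would use sub-pel vectors with interpolation; clamped integer fetches keep the illustration short.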
- A device for decoding a sequence of images (SV) from a bit stream, configured to implement, for at least one current image (Iv) of the sequence: - decoding (FCD) a set of images referred to as correction images (IC), comprising at least one correction image representing correction data for the at least one current image (Iv); - decoding (FPD) a set of feature maps referred to as prediction feature maps (FPc), comprising at least one feature map representing motion of the at least one current image; - decoding (NND) a set of parameters (WPc) representing a prediction neural network (MLPP); - obtaining a set of reference images (FREF) comprising at least one decoded reference image (REF); - processing the prediction feature map and the reference image through a synthesis module comprising the at least one prediction neural network, the prediction feature map being applied to an input of said module, to generate at least one prediction image; - reconstructing (MIX) the current image (Iv) from the at least one prediction image (ICO) and the at least one correction image (IC).
- A device for coding a sequence of images (SV), configured to implement, for at least one current image (Iv) of the sequence: - initializing a set of images referred to as correction images (IC); - initializing a set of parameters (WPc) representing a prediction neural network (MLPP); - initializing a set of feature maps referred to as prediction feature maps (FPc); - obtaining a set of reference images (FREF) comprising at least one reference image that is coded and subsequently decoded; - processing the prediction feature map and the reference image through a synthesis module comprising the at least one prediction neural network, the prediction feature map being applied to an input of said module, to generate at least one prediction image; - reconstructing the current image (Iv) from the at least one prediction image (ICO) and the at least one correction image (IC); - updating at least one value of the at least one feature map and/or of at least one parameter of the network, as a function of a measure of coding performance; - coding a bit stream, comprising: - coding (FCC) the at least one correction image (IC); - coding (NNC) the set of parameters (WPc) representing the prediction neural network (MLPP); - coding (FPC) the set of prediction feature maps (FPc).
- A computer program comprising instructions for performing the steps of a coding or decoding method according to claim 1 or 2 when the program is executed by a computer.
Description
Method and device for encoding and decoding image sequences The present invention relates to the general field of coding sequences of digital images. More specifically, the present invention relates to the compression of digital video. Digital video generally undergoes source coding for compression purposes, to limit the resources required for its transmission and/or storage. There are many coding standards, such as the standards from the ITU/MPEG organizations (H.264/AVC, H.265/HEVC, H.266/VVC, etc.) and their extensions (MVC, SVC, 3D-HEVC, etc.). An image is typically encoded by dividing it into multiple rectangular blocks and coding these pixel blocks in a given processing order. In conventional video compression techniques, processing a block generally involves predicting the pixels of the block, the prediction being performed either from pixels of the same image that were previously coded and then decoded, in which case one speaks of "intra-prediction", or from a previously coded and decoded image, in which case one speaks of "inter-prediction". Exploiting spatial and/or temporal redundancy in this manner makes it possible to avoid transmitting or storing the pixel values of each block, by representing at least some of the blocks as a residual that expresses the difference between the predicted values of the block's pixels and their actual values. As video formats continue to evolve to provide ever better compression and to adapt to various expected formats and communication networks, the number of possible prediction configurations is increasing, and conventional coding and decoding algorithms are becoming very complex. In addition to the methods proposed by these conventional compression standards (MPEG, ITU), there is a trend toward the development of AI-based, particularly neural, methods.
Some of these neural methods can be viewed simply as extensions of the concept of competition between the aforementioned compression techniques, such as competition between prediction modes or between video coding transforms. Another approach uses the concept of an "autoencoder". An autoencoder is an artificial neural network-based learning algorithm that enables the construction of new representations of a dataset. The architecture of an autoencoder consists of two parts: an encoder and a decoder. The encoder is composed of a set of layers of neurons that process the data to construct new representations, referred to as "coded" representations and also called "latent representations". The layers of neurons of the decoder then receive these representations and attempt to reconstruct the original data from them. The difference between the reconstructed data and the initial data makes it possible to measure the reconstruction error made by the autoencoder. Training consists of modifying the autoencoder's parameters so as to reduce the reconstruction error measured across the various samples of the dataset. The performance of such autoencoder-based systems comes at the cost of a significant increase in memory footprint and in complexity compared with conventional approaches, such as those proposed by the compression standards. These systems can have millions of parameters and may require up to a million multiply-accumulate (MAC) operations to decode a single pixel. This makes these decoders much more complex than conventional decoders, which may hinder the adoption of learning-based compression. More recently, an image coding technique based on a simple neural network was described in the paper "COIN: Compression with Implicit Neural Representations" (arXiv:2103.03123) by Emilien Dupont et al. The proposed coding technique consists of tuning a neural network to an image, then quantizing and transmitting the network's weights.
During decoding, the neural network is evaluated at each pixel location to reconstruct the image. Nevertheless, this technique is inefficient in terms of compression and codes each image of a video independently. In the field of video, an equivalent technique was presented in the paper "Scalable Neural Video Representations with Learnable Positional Features" by Subin Kim et al. (NeurIPS 2022), which takes the temporal dimension into account. The coding algorithm learns a representation of the video by generating a set of three 2D latent key images (one per spatio-temporal axis). It also generates a latent 3D representation grid to capture the local details of the video. This representation is used to tune a neural network trained to process the three key images. However, this approach lacks flexibility in the choice of representations for the key images (whose number is fixed at 3) and for the 3D grid (fixed for the entire sequence). Furthermore, it requires processing the transmitted 3D grid, which is burdensome both in terms of the amount of transmitted data and of the memory required to process it. A patent document published under US No. 2023/0145525 A1 describes a method for decoding an image com
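The implicit-representation decoding described above (evaluating a coordinate network at each pixel location) can be sketched as follows; the weights here are random placeholders rather than weights overfitted to an actual image, and all names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy weights of a coordinate network; in the actual technique these are the
# quantized, overfitted weights transmitted in the bitstream.
W1 = rng.standard_normal((2, 32)); b1 = rng.standard_normal(32)
W2 = rng.standard_normal((32, 3)); b2 = np.zeros(3)

def decode_pixel(x, y, height, width):
    # Evaluate the network at one pixel location to obtain its RGB value.
    coord = np.array([x / (width - 1), y / (height - 1)])  # normalized (x, y)
    h = np.sin(W1.T @ coord + b1)    # sine activation, common in implicit-representation work
    return W2.T @ h + b2

H, W = 4, 4
image = np.array([[decode_pixel(x, y, H, W) for x in range(W)] for y in range(H)])
```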