CN-121585816-B - Generative image compression training method based on feature-space semantic anchor points
Abstract
The invention discloses a generative image compression training method based on feature-space semantic anchor points. The method comprises: obtaining a training image and mapping it to quantized latent features using an encoder network; inputting the quantized latent features into a generative decoding network and obtaining a corresponding reconstructed image through a conditional generation process; inputting the training image and the reconstructed image respectively into a pre-constructed semantic encoder with frozen parameters and extracting their respective semantic feature representations; calculating a semantic anchor loss based on the consistency between the semantic feature representation of the training image and that of the reconstructed image; and updating the parameters of the encoder network and the generative decoding network according to the semantic anchor loss. By introducing a frozen semantic reference and a multi-granularity spatial alignment mechanism in feature space, the invention effectively mitigates the semantic drift and spatial-structure misalignment produced by generative models in extremely-low-bitrate compression scenarios, so that the reconstructed image remains consistent with the original image in semantic content and spatial layout while maintaining high perceptual quality.
Inventors
- LENG CONG
- ZHAO TIANLI
Assignees
- 中科方寸知微(南京)科技有限公司
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-01-23
Claims (8)
- 1. A generative image compression training method based on feature-space semantic anchor points, characterized by comprising the following steps: acquiring a training image, and mapping the training image to quantized latent features using an encoder network; inputting the quantized latent features into a generative decoding network, and obtaining a corresponding reconstructed image through a conditional generation process; inputting the training image and the reconstructed image respectively into a pre-constructed semantic encoder with frozen parameters, and extracting their respective semantic feature representations; calculating a semantic anchor loss based on the consistency between the semantic feature representation of the training image and that of the reconstructed image; and updating the parameters of the encoder network and the generative decoding network based on the semantic anchor loss; wherein updating the parameters of the encoder network and the generative decoding network comprises: calculating a pixel-level reconstruction loss and a perceptual loss based on the training image and the reconstructed image; calculating a bitrate estimation loss based on the quantized latent features; constructing a total optimization objective as a weighted sum of the pixel-level reconstruction loss, the perceptual loss, the bitrate estimation loss, and the semantic anchor loss; and updating the parameters of the encoder network and the generative decoding network using the total optimization objective; wherein mapping the training image to quantized latent features using the encoder network comprises: extracting continuous latent feature vectors using the encoder network; and applying vector quantization to the continuous latent feature vectors based on a learnable codebook to obtain discrete quantized latent features; and wherein, when updating the parameters of the encoder network and the generative decoding network, gradients are propagated back from the generative decoding network to the encoder network using a straight-through estimator.
- 2. The method according to claim 1, wherein extracting the respective semantic feature representations comprises extracting global semantic vectors, specifically: inputting the training image and the reconstructed image respectively into the semantic encoder; and extracting the classification token from the last network layer of the semantic encoder as the global semantic vector of the training image and of the reconstructed image, respectively.
- 3. The method according to claim 2, wherein calculating the semantic anchor loss comprises: calculating the cosine similarity between the global semantic vector of the training image and the global semantic vector of the reconstructed image; and determining the semantic anchor loss based on the cosine similarity, wherein the semantic anchor loss is inversely related to the cosine similarity.
- 4. The method according to claim 1, wherein extracting the respective semantic feature representations comprises extracting a patch token sequence, specifically: extracting the classification token from the last network layer of the semantic encoder as a global semantic vector; and, on the basis of the global semantic vector, extracting a patch token sequence from at least one intermediate network layer of the semantic encoder, the patch token sequence containing local semantic information of different regions of the image.
- 5. The method according to claim 4, wherein extracting the respective semantic feature representations further comprises extracting a two-dimensional spatial feature map, specifically: spatially reshaping the patch token sequence according to the patch partition size of the semantic encoder to obtain a two-dimensional spatial feature map; wherein each feature vector in the two-dimensional spatial feature map corresponds to the semantic content of a predetermined spatial region of the training image or the reconstructed image input to the semantic encoder.
- 6. The method according to claim 5, wherein calculating the semantic anchor loss comprises: for each spatial position in the two-dimensional spatial feature map, calculating the cosine similarity between the feature vector of the training image at that spatial position and the feature vector of the reconstructed image at the corresponding spatial position; averaging the cosine similarities over all spatial positions to obtain a regional semantic anchor loss; and determining the semantic anchor loss based on the regional semantic anchor loss.
- 7. The method according to claim 4, wherein calculating the semantic anchor loss comprises: for each network layer of the semantic encoder, independently calculating a corresponding hierarchical semantic loss from the semantic feature representation extracted at that layer, wherein the hierarchical semantic losses of different network layers do not interfere with one another; and performing a weighted summation of the hierarchical semantic losses of all network layers to obtain the semantic anchor loss.
- 8. The method according to claim 7, wherein the weighted summation of the hierarchical semantic losses of all network layers follows a semantics-first principle, namely: the weight coefficient of the hierarchical semantic loss of a deeper network layer is greater than the weight coefficient of the hierarchical semantic loss of a shallower network layer.
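The multi-granularity semantic anchor loss defined in claims 2-8 can be illustrated with a minimal PyTorch-style sketch. This is an illustrative reconstruction, not the patent's reference implementation: the `semantic_encoder` interface (a frozen ViT-like model assumed to return per-layer patch-token maps plus a final classification token) and all weight values are assumptions made for exposition.

```python
import torch
import torch.nn.functional as F

def global_anchor_loss(cls_x, cls_hat):
    # Claim 3: loss inversely related to the cosine similarity between the
    # global semantic vectors (classification tokens) of the training image
    # and the reconstructed image.
    return 1.0 - F.cosine_similarity(cls_x, cls_hat, dim=-1).mean()

def regional_anchor_loss(map_x, map_hat):
    # Claims 5-6: cosine similarity at every spatial position of the 2-D
    # feature maps (patch tokens reshaped to (B, H, W, C) according to the
    # patch partition size), averaged over all positions.
    sim = F.cosine_similarity(map_x, map_hat, dim=-1)  # (B, H, W)
    return 1.0 - sim.mean()

def semantic_anchor_loss(x, x_hat, semantic_encoder, layer_weights, w_global=1.0):
    # Claims 7-8: one hierarchical loss per selected layer, combined by a
    # weighted sum whose coefficients grow with layer depth ("semantics-first").
    # `semantic_encoder(img)` is an assumed interface returning a list of
    # per-layer (B, H, W, C) token maps and the final classification token;
    # its parameters are frozen.
    with torch.no_grad():                         # anchor branch: no gradients
        maps_x, cls_x = semantic_encoder(x)
    maps_hat, cls_hat = semantic_encoder(x_hat)   # gradients flow to the codec
    loss = w_global * global_anchor_loss(cls_x, cls_hat)
    for w, mx, mh in zip(layer_weights, maps_x, maps_hat):
        loss = loss + w * regional_anchor_loss(mx, mh)
    return loss
```

With, say, intermediate layers 4, 8, and 12 of a frozen ViT and `layer_weights = [0.25, 0.5, 1.0]` (hypothetical values), deeper layers dominate the sum, matching the semantics-first weighting of claim 8. Note that only the reconstruction branch carries gradients: the original-image features act as a fixed anchor.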
Description
Generative image compression training method based on feature-space semantic anchor points

Technical Field

The invention relates to image compression technology, and in particular to a generative image compression training method based on feature-space semantic anchor points.

Background

With the acceleration of digitization, the explosive growth of visual data poses a serious challenge to storage space and transmission bandwidth. In constrained scenarios such as satellite communication and weak-network transmission, extremely-low-bitrate image compression (for example, below 0.05 bpp) is particularly important; its core goal is to recover as much high-fidelity visual information as possible while compressing the data volume to an extreme degree. Mainstream image compression techniques have evolved from traditional transform coding (e.g., JPEG, BPG) to end-to-end compression based on deep learning. Existing learned compression methods generally use convolutional neural networks or Transformer architectures to optimize rate-distortion performance. To cope with the loss of detail at extremely low bitrates, generative compression schemes based on Generative Adversarial Networks (GANs) and denoising diffusion probabilistic models have become a research hotspot. Such methods exploit the strong prior of the generative model to guess the high-frequency texture details lost during quantization, thereby achieving better subjective visual quality than traditional methods.

However, when data constraints are extremely weak, existing generative compression schemes suffer from semantic drift and spatial-semantic misalignment. First, because the generative model fills in content creatively from statistical priors, it is prone to hallucination: object categories or key attributes in the reconstructed image may change (for example, an orange cat in the original image may be reconstructed as a white cat or even a dog), so that the reconstruction deviates from the semantic ground truth of the original. Second, existing constraint mechanisms mostly rely on global perceptual losses or global semantic vectors; such coarse-grained constraints cannot lock spatial positions, so a reconstructed object may have realistic textures while its spatial position, pose, or local structure (such as the layout of facial features) fails to correspond accurately to the original image, leading to the technical problem of reconstructions that are semantically correct but spatially misaligned.

Disclosure of Invention

The invention aims to provide a generative image compression training method based on feature-space semantic anchor points, so as to solve the above problems in the prior art.
The technical scheme is as follows: the generative image compression training method based on feature-space semantic anchor points comprises the following steps: acquiring a training image, and mapping the training image to quantized latent features using an encoder network; inputting the quantized latent features into a generative decoding network, and obtaining a corresponding reconstructed image through a conditional generation process; inputting the training image and the reconstructed image respectively into a pre-constructed semantic encoder with frozen parameters, and extracting their respective semantic feature representations; calculating a semantic anchor loss based on the consistency between the semantic feature representation of the training image and that of the reconstructed image; and updating the parameters of the encoder network and the generative decoding network based on the semantic anchor loss.

The beneficial effects are as follows: by introducing a frozen semantic reference and a multi-granularity spatial alignment mechanism in feature space, the method effectively mitigates the semantic drift and spatial-structure misalignment produced by generative models in extremely-low-bitrate compression scenarios, so that the reconstructed image remains consistent with the original image in semantic content and spatial layout while maintaining high perceptual quality.

Drawings

Fig. 1 is a flowchart of the steps of a generative image compression training method based on feature-space semantic anchor points according to an embodiment of the present application. Fig. 2 is a flowchart of the steps of extracting a global semantic vector according to an embodiment of the present application. Fig. 3 is a flowchart of the steps of calculating the semantic anchor loss according to an embodiment of the present application. Fig. 4 is a flowchart of the steps of updating the parameters of the encoder network and the generative decoding network according to an embodiment of the present application.
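To make the training procedure concrete, here is a minimal single-step sketch, assuming PyTorch, channel-last latents, a nearest-neighbor vector quantizer over a learnable codebook, and mean-squared error as the pixel-level reconstruction loss; the `perceptual_loss_fn`, `rate_loss_fn`, and `semantic_loss_fn` callables and the loss weights are placeholders for components the patent does not pin down, not prescribed choices.

```python
import torch
import torch.nn.functional as F

def vector_quantize(z, codebook):
    # Nearest-codeword quantization over a learnable codebook of shape (K, C),
    # with z assumed channel-last (..., C). The straight-through estimator
    # passes the discrete codeword forward and copies the gradient straight
    # back to the continuous latents z.
    flat = z.reshape(-1, z.shape[-1])                 # (N, C)
    idx = torch.cdist(flat, codebook).argmin(dim=-1)  # nearest codeword index
    z_q = codebook[idx].reshape(z.shape)
    return z + (z_q - z).detach()                     # straight-through estimator

def train_step(x, encoder, decoder, codebook,
               perceptual_loss_fn, rate_loss_fn, semantic_loss_fn,
               weights=(1.0, 1.0, 1.0, 1.0)):
    # Total optimization objective: weighted sum of pixel-level reconstruction,
    # perceptual, bitrate-estimation, and semantic anchor losses. The weight
    # values here are illustrative, not taken from the patent.
    z = encoder(x)                       # continuous latent feature vectors
    z_q = vector_quantize(z, codebook)   # discrete quantized latent features
    x_hat = decoder(z_q)                 # conditional generative reconstruction
    w_pix, w_per, w_rate, w_sem = weights
    return (w_pix * F.mse_loss(x_hat, x)
            + w_per * perceptual_loss_fn(x_hat, x)
            + w_rate * rate_loss_fn(z_q)
            + w_sem * semantic_loss_fn(x, x_hat))
```

This sketch omits the codebook and commitment terms that VQ-based codecs typically need to train the codebook itself, since the straight-through return value gives the codebook entries no gradient; the patent's claims only specify that the codebook is learnable, not how it is optimized.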