
CN-122021731-A - Image reconstruction training method, electronic device, storage medium, and program product

CN 122021731 A

Abstract

The embodiments of the application provide an image reconstruction training method, an electronic device, a storage medium, and a program product. The method comprises: acquiring feature information of an original image; judging whether the number of input channels equals a preset value; if not, acquiring initial output values; sorting the output values and summing them in pairs to obtain new output values and a target number of input channels; judging again whether the target number of input channels equals the preset value; if so, sampling the feature information to obtain a latent-space embedding component; raising the dimension of the latent-space embedding component back to the initial number of input channels; and decoding the up-dimensioned latent-space embedding component to output a reconstructed image. High-frequency information is fused in the encoder, the latent-space output dimension is raised, and, exploiting the properties of the output distribution, the raised output dimensions are merged and reduced back into a multi-dimensional encoded latent space. More detail of the original image is thereby preserved while still meeting the structural requirements of current mainstream multimodal text-to-image networks.
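The pairwise channel-merging loop summarized above (and spelled out in claims 1 and 5) can be sketched as follows. All function names and shapes are illustrative assumptions, not from the patent; the sketch assumes the channel count is a power of two so it halves cleanly down to the preset value.

```python
# Hypothetical sketch of the channel-halving step: the encoder emits one
# (mean, variance) pair per input channel; the pairs are sorted by mean and
# summed two at a time, without repetition, until the channel count reaches
# a preset target value.
import numpy as np

def halve_channels(means, variances):
    """Sort (mean, variance) pairs by mean, then sum non-overlapping pairs."""
    order = np.argsort(means)
    m, v = means[order], variances[order]
    # Sum each adjacent, non-repeating pair: (0,1), (2,3), ...
    # Assumes an even channel count, e.g. a power of two.
    return m[0::2] + m[1::2], v[0::2] + v[1::2]

def reduce_to_preset(means, variances, preset):
    """Repeat pairwise merging until the channel count equals the preset value."""
    while means.shape[0] > preset:
        means, variances = halve_channels(means, variances)
    return means, variances

# Eight channels reduced to a preset value of two: 8 -> 4 -> 2.
m0 = np.array([0.3, 1.2, -0.5, 0.9, 2.0, -1.1, 0.1, 0.7])
v0 = np.full(8, 0.5)
m, v = reduce_to_preset(m0, v0, preset=2)
```

Note that summing in pairs preserves the totals of the means and variances, which is one way the merged, lower-dimensional latent parameters can still reflect the full output distribution.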

Inventors

  • ZHAO FANG
  • LU TING
  • SHI FUYUAN
  • TAN CHAO
  • WANG KAI
  • LIAN SHIGUO

Assignees

  • 中国联合网络通信集团有限公司
  • 联通数字科技有限公司
  • 联通数据智能有限公司

Dates

Publication Date
2026-05-12
Application Date
2024-10-29

Claims (10)

  1. A method of image reconstruction training, the method comprising: acquiring feature information of an original image, wherein the feature information comprises a combined low-frequency component and a combined high-frequency component; judging whether the number of input channels equals a preset value, wherein the number of input channels is an initial number of input channels or a target number of input channels, and the initial number of input channels is determined according to hardware resources; if not, acquiring initial output values, wherein the initial output values are obtained from the feature information and the number of initial output values equals the initial number of input channels; sorting the output values, summing the output values in pairs to obtain new output values and a target number of input channels, and judging again whether the target number of input channels equals the preset value; sampling the feature information to obtain a latent-space embedding component when the target number of input channels equals the preset value; and raising the dimension of the latent-space embedding component to the initial number of input channels, and decoding the up-dimensioned latent-space embedding component to output a reconstructed image.
  2. The method of claim 1, wherein, after judging whether the number of input channels equals the preset value, the method further comprises: when the number of input channels is the initial number of input channels, sampling the feature information to obtain a latent-space embedding component, and decoding the latent-space embedding component to output a reconstructed image.
  3. The method according to claim 1, wherein acquiring the feature information of the original image comprises: extracting a low-frequency component from the original image; extracting high-frequency sub-components from the original image, and up-converting the high-frequency sub-components to obtain a high-frequency component; and combining the low-frequency component and the high-frequency component to obtain the feature information of the original image.
  4. The method according to claim 3, wherein combining the low-frequency component and the high-frequency component to obtain the feature information of the original image comprises: merging, in each neural network layer in turn, the merged components and outputting them to the next layer, wherein the merged components of a layer comprise the merged component of the preceding neural network layer and the high-frequency layer component of the preceding neural network layer, the low-frequency component is assigned to the first neural network layer, and the high-frequency layer component of each neural network layer is assigned according to a preset rule; and acquiring the merged component of the last neural network layer to obtain the feature information of the original image.
  5. The method of claim 1, wherein sorting the output values and summing the output values in pairs to obtain new output values and the target number of input channels comprises: the output values comprising pairs of means and variances; sorting the plurality of output values according to the mean of each output value; and summing, sequentially and without repetition, the means and the variances of every two output values to obtain new output values and the target number of input channels.
  6. The method of claim 5, wherein, after summing the means and the variances of the output values to obtain new output values and the target number of input channels, the method further comprises: determining a feature vector of the latent-space embedding component, and constructing a first loss function from the feature vector, wherein the first loss function represents a reconstruction loss; constructing a distribution loss function for each neural network layer from the high-frequency layer component of that layer; determining a KL divergence from the distribution loss functions, and determining a second loss function based on the KL divergence, wherein the second loss function represents a multi-stage KL divergence; determining the loss function based on the first loss function and the second loss function; and supervising the new output values with the loss function.
  7. The method of claim 1, wherein sampling the feature information to obtain the latent-space embedding component comprises: acquiring a feature vector for each of a plurality of input channels based on the feature information; and determining the set of latent-space embedding components according to the feature vectors of the input channels.
  8. An image reconstruction training apparatus, the apparatus comprising: a high-frequency injection module configured to acquire high-frequency components of an original image; an encoder configured to acquire a low-frequency component of the original image; the encoder being further configured to acquire feature information of the original image, wherein the feature information comprises a combined low-frequency component and a combined high-frequency component; the encoder being further configured to judge whether the number of input channels equals a preset value, wherein the number of input channels is an initial number of input channels or a target number of input channels, and the initial number of input channels is determined according to hardware resources; if not, to acquire initial output values, wherein the initial output values are obtained from the feature information and the number of initial output values equals the initial number of input channels; to sort the output values, sum the output values in pairs to obtain new output values and a target number of input channels, and judge again whether the target number of input channels equals the preset value; and to sample the feature information to obtain a latent-space embedding component when the target number of input channels equals the preset value; and a decoder configured to raise the dimension of the latent-space embedding component to the initial number of input channels and to decode the up-dimensioned latent-space embedding component to output a reconstructed image.
  9. An image reconstruction training device, comprising a memory and a processor, wherein the memory stores computer-executable instructions, and the processor executes the computer-executable instructions stored in the memory, causing the processor to perform the image reconstruction training method of any one of claims 1-7.
  10. A computer-readable storage medium having stored therein computer-executable instructions which, when executed by a processor, implement the image reconstruction training method of any one of claims 1-7.
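The two-part training loss of claim 6 (a reconstruction term built from the latent feature vector, plus a multi-stage KL divergence summed over the per-layer distribution losses) can be sketched as follows. The closed forms used here, MSE reconstruction and the KL divergence against a unit Gaussian, are the standard VAE choices; the claims do not give explicit formulas, so these, like all names below, are assumptions.

```python
# Hedged sketch of: total loss = first (reconstruction) loss
#                              + second (multi-stage KL) loss.
import numpy as np

def reconstruction_loss(x, x_hat):
    """First loss: mean-squared error between original and reconstruction."""
    return float(np.mean((x - x_hat) ** 2))

def kl_unit_gaussian(mu, log_var):
    """KL( N(mu, exp(log_var)) || N(0, 1) ), summed over dimensions."""
    return float(0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))

def total_loss(x, x_hat, layer_params, beta=1.0):
    """Second loss: KL terms accumulated over each layer's (mu, log_var)."""
    kl = sum(kl_unit_gaussian(mu, lv) for mu, lv in layer_params)
    return reconstruction_loss(x, x_hat) + beta * kl

# Toy example: two layers whose distributions already match N(0, 1),
# so only the reconstruction term contributes.
x = np.ones((4, 4))
x_hat = np.zeros((4, 4))
layers = [(np.zeros(3), np.zeros(3)), (np.zeros(2), np.zeros(2))]
loss = total_loss(x, x_hat, layers)
```

During training, this combined scalar would supervise the new (merged) output values, pulling each layer's distribution toward the prior while keeping the reconstruction faithful.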

Description

Image reconstruction training method, electronic device, storage medium, and program product

Technical Field

The present application relates to the field of computer technologies, and in particular to an image reconstruction training method, an electronic device, a storage medium, and a program product.

Background

A variational autoencoder (Variational Autoencoder, hereinafter "VAE") is a generative model based on probabilistic encoding that combines the ideas of the autoencoder and probabilistic encoding to achieve a more flexible and controllable sample generation process by modeling latent representations. A VAE consists essentially of an encoder, a latent layer, and a decoder: the encoder maps the input data onto the distribution parameters of the latent space, and the decoder maps samples drawn from the latent layer back to the original input space. The core idea of a multimodal text-to-image model is to generate a corresponding image from a language description (text). A multimodal text-to-image model typically comprises two main parts: a text encoder, which converts the input text into a vector representation that captures its semantic information, and an image generator, which generates the corresponding image from those semantic vectors. The VAE is one of the important components of a multimodal text-to-image model and can serve as part of the image generator that produces images from the semantic vectors provided by the text encoder. Specifically, the encoder of the VAE maps the input text description (or a pre-processed text vector) to a latent space, from which the decoder then samples and generates a corresponding image.
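The encoder, latent sampling, and decoder pipeline described above can be sketched minimally as follows. Linear maps stand in for real neural networks, and the sampling step is the standard VAE reparameterization trick; all shapes and names are illustrative assumptions.

```python
# Minimal VAE pipeline sketch: encode to Gaussian parameters over the latent
# space, sample with the reparameterization trick, decode back to input space.
import numpy as np

rng = np.random.default_rng(0)
D, Z = 16, 4                            # input dim, latent dim (illustrative)
W_mu  = rng.normal(size=(Z, D)) * 0.1   # encoder head for the mean
W_lv  = rng.normal(size=(Z, D)) * 0.1   # encoder head for the log-variance
W_dec = rng.normal(size=(D, Z)) * 0.1   # decoder map back to input space

def encode(x):
    """Map the input to the parameters of a Gaussian over the latent space."""
    return W_mu @ x, W_lv @ x

def sample(mu, log_var):
    """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def decode(z):
    """Map a latent sample back to the original input space."""
    return W_dec @ z

x = rng.normal(size=D)
mu, log_var = encode(x)
x_hat = decode(sample(mu, log_var))
```

In a real model the linear maps would be deep networks, and in this patent's scheme the encoder side would additionally fuse high-frequency components before the latent parameters are produced.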
At present, because a neural network acts as a Gaussian smoothing process, high-frequency details are gradually lost during information transmission and feature extraction. As a result, the VAE permanently loses small or complex high-frequency details, such as small faces and characters, during inference, and the reconstruction effect is weak.

Disclosure of the Invention

The embodiments of the application provide an image reconstruction training method, an electronic device, a storage medium, and a program product, which can reduce high-frequency detail loss and improve the image reconstruction effect. In a first aspect, an embodiment of the present application provides an image reconstruction training method, comprising: acquiring feature information of an original image, wherein the feature information comprises a combined low-frequency component and a combined high-frequency component; judging whether the number of input channels equals a preset value, wherein the number of input channels is an initial number of input channels or a target number of input channels, and the initial number of input channels is determined according to hardware resources; if not, acquiring initial output values, wherein the initial output values are obtained from the feature information and the number of initial output values equals the initial number of input channels; sorting the output values, summing the output values in pairs to obtain new output values and a target number of input channels, and judging again whether the target number of input channels equals the preset value; sampling the feature information to obtain a latent-space embedding component when the target number of input channels equals the preset value; and raising the dimension of the latent-space embedding component to the initial number of input channels, and decoding the up-dimensioned latent-space embedding component to output a reconstructed image. In one possible implementation, after judging whether the number of input channels equals the preset value, the method further comprises: when the number of input channels is the initial number of input channels, sampling the feature information to obtain a latent-space embedding component, and decoding the latent-space embedding component to output a reconstructed image. In one possible implementation, acquiring the feature information of the original image comprises: extracting a low-frequency component from the original image; extracting high-frequency sub-components from the original image, and up-converting the high-frequency sub-components to obtain a high-frequency component; and combining the low-frequency component and the high-frequency component to obtain the feature information of the original image. In one possible implementation, combining the low-frequency component and the high-frequency component to obtain the feature information of the original image comprises: merging, in each neural network layer in turn, the merged components and outputting them to the next layer, wherein the merged components comprise the merged components of