CN-120707682-B - Extreme environment degradation image simulation generation method and device

CN120707682B

Abstract

An embodiment of the invention discloses a simulation generation method and device for extreme environment degradation images. The method comprises: inputting a degraded image into a diffusion variational autoencoder to obtain a latent image feature vector; generating a foreground prior visual feature vector set and a background prior visual feature vector; inputting the foreground prior visual feature vector set and the background prior visual feature vector into a double-ended prototype resampling model to obtain a foreground visual perception marker feature vector set and a background visual perception marker feature vector; generating a context decoupling feature vector set; inputting the context decoupling feature vector set into a context topological relation coordination model to obtain a context topological relation correction feature vector set; inputting the correction feature vector set into a dynamic semantic relation aggregation model to obtain a context semantic feature vector; and performing weighted fusion on the context semantic feature vector and the context decoupling feature vector set, the fused result being decoded into a target degraded image. This embodiment can improve the quality of images generated for extremely degraded scenes.

Inventors

  • LI JIA
  • WANG WENZHUANG
  • ZHAO YIFAN
  • ZHAO QINPING

Assignees

  • 北京航空航天大学 (Beihang University)

Dates

Publication Date
2026-05-08
Application Date
2025-06-23

Claims (8)

  1. A simulation generation method for an extreme environment degradation image, comprising the following steps: acquiring a degraded image, text prompt information and object geometric layout information in an extreme environment; inputting the degraded image into a diffusion variational autoencoder to obtain a latent image feature vector; generating a foreground prior visual feature vector set and a background prior visual feature vector according to the text prompt information and the object geometric layout information; inputting the foreground prior visual feature vector set and the background prior visual feature vector into a double-ended prototype resampling model to obtain a foreground visual perception marker feature vector set and a background visual perception marker feature vector; generating a context decoupling feature vector set according to the latent image feature vector, the object geometric layout information, the foreground visual perception marker feature vector set, the text prompt information and the background visual perception marker feature vector; inputting the context decoupling feature vector set into a context topological relation coordination model to obtain a context topological relation correction feature vector set; inputting the context topological relation correction feature vector set into a dynamic semantic relation aggregation model to obtain a context semantic feature vector; performing weighted fusion processing on the context semantic feature vector and the context decoupling feature vector set to obtain a degraded visual feature vector; and decoding the degraded visual feature vector to obtain a target degraded image; wherein the double-ended prototype resampling model comprises a foreground prototype resampling model and a background prototype resampling model, and inputting the foreground prior visual feature vector set and the background prior visual feature vector into the double-ended prototype resampling model to obtain the foreground visual perception marker feature vector set and the background visual perception marker feature vector comprises: acquiring foreground learnable query markers corresponding to the foreground prior visual feature vector set, and a background learnable query marker corresponding to the background prior visual feature vector; performing linear transformation on the foreground prior visual feature vector set to obtain a foreground prior key vector set and a foreground prior value vector set; inputting the foreground learnable query markers, the foreground prior key vector set and the foreground prior value vector set into the foreground prototype resampling model to obtain the foreground visual perception marker feature vector set, wherein the foreground prototype resampling model comprises a plurality of cross-attention layers and a plurality of feedforward neural networks; performing linear transformation on the background prior visual feature vector to obtain a background prior key vector and a background prior value vector; and inputting the background learnable query marker, the background prior key vector and the background prior value vector into the background prototype resampling model to obtain the background visual perception marker feature vector.
  2. The method of claim 1, wherein generating the foreground prior visual feature vector set and the background prior visual feature vector from the text prompt information and the object geometric layout information comprises: searching a preset candidate instance dictionary for a candidate instance group of the same category as each piece of semantic information included in the object geometric layout information, and taking each such group as a target candidate instance group to obtain a target candidate instance group set; based on the target candidate instance group set, performing the following average pooling step: performing image processing on each target candidate instance in each target candidate instance group to generate a processed candidate instance, so as to obtain a processed candidate instance group set; inputting the processed candidate instance group set into an image convolution feature extraction model to obtain a foreground instance visual feature vector group set; performing average pooling on each foreground instance visual feature vector group to obtain the foreground prior visual feature vector set; searching a preset training background image set for a training background image matching the text prompt corresponding to the degraded image, to serve as a target training background image; and taking the target training background image as a target candidate instance group set, performing the average pooling step again, and taking the resulting foreground prior visual feature vector set as the background prior visual feature vector.
  3. The method of claim 1, wherein the context topological relation coordination model comprises a relation coordination model and a graph convolutional neural network model, and inputting the context decoupling feature vector set into the context topological relation coordination model to obtain the context topological relation correction feature vector set comprises: determining a context adjacency relation matrix for the context decoupling feature vector set through the relation coordination model; and performing graph symmetric normalization on the context adjacency relation matrix through the graph convolutional neural network model to obtain the context topological relation correction feature vector set.
  4. The method of claim 1, wherein the dynamic semantic relation aggregation model comprises a gating network, a plurality of expert networks, and an aggregation operation network, and inputting the context topological relation correction feature vector set into the dynamic semantic relation aggregation model to obtain the context semantic feature vector comprises: acquiring a normally distributed noise vector; inputting the normally distributed noise vector and the context topological relation correction feature vector set into the gating network to obtain an expert activation probability weight set for the plurality of expert networks; screening, from the expert activation probability weight set, at least two expert activation probability weights meeting a preset activation probability condition; determining, from the plurality of expert networks, the at least two expert networks corresponding to the at least two expert activation probability weights; inputting the context topological relation correction feature vector set into the at least two expert networks to obtain a context semantic aggregation feature vector set; and inputting the context semantic aggregation feature vector set and the at least two expert activation probability weights into the aggregation operation network to obtain the context semantic feature vector.
  5. The method of claim 1, wherein performing weighted fusion processing on the context semantic feature vector and the context decoupling feature vector set to obtain the degraded visual feature vector comprises: normalizing the context semantic feature vector to obtain a context semantic relation aggregation weight; multiplying the context semantic relation aggregation weight element by element with each context decoupling feature vector in the context decoupling feature vector set to obtain a context weight feature vector set; and accumulating the context weight feature vector set to obtain the degraded visual feature vector.
  6. An extreme environment degradation image simulation generation device, comprising: an acquisition unit configured to acquire a degraded image, text prompt information, and object geometric layout information in an extreme environment; a first input unit configured to input the degraded image into a diffusion variational autoencoder to obtain a latent image feature vector; a first generation unit configured to generate a foreground prior visual feature vector set and a background prior visual feature vector according to the text prompt information and the object geometric layout information; a second input unit configured to input the foreground prior visual feature vector set and the background prior visual feature vector into a double-ended prototype resampling model to obtain a foreground visual perception marker feature vector set and a background visual perception marker feature vector; a second generation unit configured to generate a context decoupling feature vector set according to the latent image feature vector, the object geometric layout information, the foreground visual perception marker feature vector set, the text prompt information and the background visual perception marker feature vector; a third input unit configured to input the context decoupling feature vector set into a context topological relation coordination model to obtain a context topological relation correction feature vector set; a fourth input unit configured to input the context topological relation correction feature vector set into a dynamic semantic relation aggregation model to obtain a context semantic feature vector; a fusion unit configured to perform weighted fusion processing on the context semantic feature vector and the context decoupling feature vector set to obtain a degraded visual feature vector; and a decoding unit configured to decode the degraded visual feature vector to obtain a target degraded image; wherein the double-ended prototype resampling model comprises a foreground prototype resampling model and a background prototype resampling model, and the second input unit is further configured to: acquire foreground learnable query markers corresponding to the foreground prior visual feature vector set, and a background learnable query marker corresponding to the background prior visual feature vector; perform linear transformation on the foreground prior visual feature vector set to obtain a foreground prior key vector set and a foreground prior value vector set; input the foreground learnable query markers, the foreground prior key vector set and the foreground prior value vector set into the foreground prototype resampling model to obtain the foreground visual perception marker feature vector set, wherein the foreground prototype resampling model comprises a plurality of cross-attention layers and a plurality of feedforward neural networks; perform linear transformation on the background prior visual feature vector to obtain a background prior key vector and a background prior value vector; and input the background learnable query marker, the background prior key vector and the background prior value vector into the background prototype resampling model to obtain the background visual perception marker feature vector.
  7. An electronic device, comprising: one or more processors; and a storage device having one or more programs stored thereon, which, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-5.
  8. A computer-readable medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-5.
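The double-ended prototype resampling of claim 1 (learnable query markers attending to linearly projected prior features through stacked cross-attention and feedforward layers) can be sketched as follows. This is a minimal illustrative sketch, not the patented implementation: the function names, the two-layer depth, and the `tanh` stand-in for the feedforward block are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prototype_resample(prior_feats, queries, Wk, Wv, n_layers=2):
    """Learnable query markers attend to prior visual features.

    prior_feats: (n, d) prior visual feature vectors
    queries:     (m, d) learnable query markers
    Wk, Wv:      (d, d) linear projections producing keys and values
    """
    K = prior_feats @ Wk                 # linear transform -> key vectors
    V = prior_feats @ Wv                 # linear transform -> value vectors
    x = queries
    scale = np.sqrt(K.shape[-1])
    for _ in range(n_layers):
        attn = softmax((x @ K.T) / scale, axis=-1)  # (m, n) cross-attention
        x = x + attn @ V                            # residual attention update
        x = x + np.tanh(x)                          # stand-in feedforward block
    return x
```

Calling this once with foreground queries and once with a single background query mirrors the foreground/background split of the double-ended model.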
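The average pooling step of claim 2 collapses each group of instance visual feature vectors into one prior vector per category. A minimal sketch, assuming the instance features have already been extracted by the convolutional model:

```python
import numpy as np

def category_priors(instance_groups):
    """Average-pool each group of (n_i, d) instance feature vectors
    into a single (d,) prior visual feature vector per category."""
    return [np.asarray(group).mean(axis=0) for group in instance_groups]
```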
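The graph symmetric normalization of claim 3 is the standard GCN construction D^(-1/2)(A + I)D^(-1/2) applied to the context adjacency matrix. A sketch under that assumption (the ReLU activation and single-layer form are illustrative, not taken from the patent):

```python
import numpy as np

def sym_normalize(A):
    """Graph symmetric normalization: D^{-1/2} (A + I) D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])               # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(X, A, W):
    """One graph-convolution layer over the coordinated adjacency matrix."""
    return np.maximum(sym_normalize(A) @ X @ W, 0.0)  # ReLU activation
```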
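The dynamic semantic relation aggregation of claim 4 resembles noisy top-k mixture-of-experts gating: normally distributed noise is added to the gate logits, the k highest-probability experts are activated, and their outputs are combined with renormalized weights. A hedged sketch; the gating parameterization and expert form are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def noisy_topk_aggregate(x, gate_W, experts, k=2, rng=None):
    """Noisy gating over expert networks, keeping the top-k experts
    and aggregating their outputs with renormalized weights."""
    rng = np.random.default_rng(0) if rng is None else rng
    logits = x @ gate_W + rng.standard_normal(len(experts))  # noise vector
    probs = softmax(logits)                  # expert activation probabilities
    top = np.argsort(probs)[-k:]             # indices of the k largest weights
    w = probs[top] / probs[top].sum()        # renormalize over selected experts
    return sum(wi * experts[i](x) for wi, i in zip(w, top))
```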
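The weighted fusion of claim 5 can be sketched directly: normalize the semantic feature vector into aggregation weights (softmax is assumed here as the normalization), scale every decoupled feature vector element-wise, and accumulate the set.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def weighted_fusion(semantic_vec, decoupled_set):
    """Normalize the semantic vector into an aggregation weight vector,
    multiply it element-wise with each decoupled vector, then sum."""
    w = softmax(semantic_vec)                        # aggregation weights
    return np.sum([w * v for v in decoupled_set], axis=0)
```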

Description

Extreme environment degradation image simulation generation method and device

Technical Field

Embodiments of the disclosure relate to the field of computer technology, and in particular to a method and device for simulation generation of extreme environment degradation images.

Background

Currently, images acquired in extremely degraded scenes in extreme environments (e.g., low light, remote sensing, underwater, and severe weather such as heavy fog and rain) are of low quality and small in data volume, so they perform poorly as an auxiliary training data source for downstream visual models, and visual data resources are scarce; consequently, generating images of extremely degraded scenes has become a problem of increasing interest. For simulation generation of extreme environment degradation images, a commonly adopted approach uses an L2I (Layout-to-Image) diffusion model to encode the geometric layout information and label information corresponding to the degraded image into position-aware markers and category-aware markers, inputs these markers into a latent diffusion space, and then performs multi-instance mask generation on the geometric layout information to generate the extreme environment degradation image.
However, in practice, simulation generation of extreme environment degradation images in the above manner often suffers from the following technical problem: in complex extreme-environment degradation scenes, foreground instances and the background are highly visually coupled owing to appearance similarity, and spatial occlusion frequently occurs between foreground instances; since the L2I diffusion model does not account for occlusion among foreground instances, or between foreground instances and the background, the quality of the generated degraded image is poor. The information disclosed in this background section is only for enhancement of understanding of the background of the disclosed concept, and it may therefore contain information that does not form prior art known to a person of ordinary skill in the art in this country.

Disclosure of Invention

This disclosure section is intended in part to introduce concepts in a simplified form that are further described in the detailed description below. It is not intended to identify key or essential features of the claimed subject matter, nor to limit its scope. Some embodiments of the present disclosure propose a method and device for simulation generation of extreme environment degradation images, to solve one or more of the technical problems mentioned in the background section above.
According to a first aspect, some embodiments of the present disclosure provide a method for simulation generation of an extreme environment degradation image, which includes: obtaining a degraded image, text prompt information and object geometric layout information in an extreme environment; inputting the degraded image into a diffusion variational autoencoder to obtain a latent image feature vector; generating a foreground prior visual feature vector set and a background prior visual feature vector according to the text prompt information and the object geometric layout information; inputting the foreground prior visual feature vector set and the background prior visual feature vector into a double-ended prototype resampling model to obtain a foreground visual perception marker feature vector set and a background visual perception marker feature vector; generating a context decoupling feature vector set according to the latent image feature vector, the object geometric layout information, the foreground visual perception marker feature vector set, the text prompt information and the background visual perception marker feature vector; inputting the context decoupling feature vector set into a context topological relation coordination model to obtain a context topological relation correction feature vector set; inputting the context topological relation correction feature vector set into a dynamic semantic relation aggregation model to obtain a context semantic feature vector; performing weighted fusion processing on the context semantic feature vector and the context decoupling feature vector set to obtain a degraded visual feature vector; and decoding the degraded visual feature vector to obtain a target degraded image.
In a second aspect, some embodiments of the present disclosure provide an extreme environment degradation image simulation generation device, including: an acquisition unit configured to acquire a degraded image, text prompt information, and object geometric layout information in an extreme environment; a first input unit configured to input the degraded image into a diffusion variational autoencoder to obtain a latent image feature vector; and a first generation unit configured to generate a foreground prior visual feature vector set and a background prior visual feature vector according to the text prompt information and the object geometric layout information.