CN-121353508-B - Single synthetic image shadow generation method based on 3D perception
Abstract
The invention discloses a single synthetic image shadow generation method based on 3D perception, comprising tri-plane representation construction and pixel-aligned volume rendering. Two feature extractors process different input combinations, and a local channel-space cross attention module fuses the two resulting features into a unified tri-plane feature representation. A 3D coordinate space is defined for each image pixel, and orthogonal rays are cast to sample 3D points along the depth axis. The sampled points are projected onto the tri-plane representation, and sample features are retrieved by bilinear interpolation. These features are fed into a two-stage MLP: the first stage predicts the shadow-related output and intermediate features, and the second stage predicts the color-related output and the volume density. The foreground shadow and the final shadowed image are generated from the volume density under the standard alpha compositing rule.
Inventors
- FU Yanping
- TAO Jiazheng
- ZHAO Haifeng
- ZHANG Shaojie
Assignees
- Anhui University
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2025-10-27
Claims (6)
- 1. A single synthetic image shadow generation method based on 3D perception, characterized by comprising the following steps: Step 1, construct a dual-branch feature extraction module; the foreground object mask, the predicted foreground shadow mask, the composite image without foreground shadows, and the background object-shadow mask pair of the same image are input together into the dual-branch feature extraction module, which comprises a structure-aware branch and an appearance-aware branch; the structure-aware branch processes the foreground object mask, the predicted foreground shadow mask, and the background object-shadow mask pair to obtain a structural feature map, and the appearance-aware branch processes the composite image and the background object-shadow mask pair to obtain an appearance feature map. Step 2, construct a local channel-space cross attention module LCSAM; the structural feature map and the appearance feature map obtained in step 1 are input into the LCSAM, which comprises a local channel cross attention mechanism and a local spatial cross attention mechanism; the specific method is as follows: Step 2.1, in the local channel cross attention mechanism, the channels of the appearance feature map and the structural feature map are divided into G groups; for each group, a convolution operation generates a query vector Q_c from one feature map and a key K_c and a value V_c from the other; a dot product of Q_c and K_c followed by Softmax normalization yields the attention weights, which are used to weight and aggregate the values V_c into an attention-weighted response; the response is reshaped to the original feature dimensions to obtain the attention-enhanced feature y_c, and finally y_c is added to the original appearance feature map X_r to obtain the local channel-enhanced feature X_m2. Step 2.2, in the local spatial cross attention mechanism, an unfold operation with window size w first divides the original appearance feature map X_r and the local channel-enhanced feature X_m2 into non-overlapping local windows; for each local window, convolution operations generate a query Q_s, a key K_s, and a value V_s, and the local attention weights are computed; the local attention weights are then applied to the values V_s, the outputs of all local windows are gathered, and a fold operation reconstructs them into a full-resolution feature map, yielding the local spatial enhancement feature y_s. Step 2.3, the local channel-enhanced feature X_m2 obtained in step 2.1 is added to the local spatial enhancement feature y_s obtained in step 2.2 to obtain the final output feature X_LCSAM of the whole LCSAM. Step 3, based on the composite image, orthogonal parallel sampling rays are generated and several points are sampled along each ray to construct a virtual 3D space, giving a set of sampling point coordinates; based on these coordinates, the feature X_LCSAM is reshaped into three orthogonal two-dimensional planes F_xy, F_xz, and F_yz, which are input into a pixel-aligned volume rendering module; geometric awareness is injected into the two-dimensional shadows by exploiting the feature-encoding capability of the three planes to realize virtual 3D space sampling, yielding the tri-plane sampling point features F_t. Step 4, based on a progressive two-stage volume rendering strategy, progressive volume rendering is performed on the tri-plane sampling point features F_t through a dual-MLP network; the specific method is as follows: F_t is input into a first MLP, which predicts the shadow mask probability g ∈ [0, 1] and an intermediate feature F_s; F_s is then concatenated with the original tri-plane sampling point features F_t, and the concatenated features are input into a second MLP, which predicts the RGB color c and the volume density σ; the RGB color value c and the shadow mask probability g are then each volume-rendered using the shared volume density σ, with the rendering process following the standard alpha compositing rule; finally, the optimized foreground shadow mask prediction and the shadowed background image are generated by weighted summation of the point-wise predictions. Step 5, during training, a composite loss function is used to optimize the dual-branch feature extraction module, the local channel-space cross attention module LCSAM, the pixel-aligned volume rendering module, and the dual-MLP network; the composite loss function includes a shadow mask prediction loss, an image reconstruction loss, and a perceptual loss.
- 2. The method for generating shadows of a single composite image based on 3D perception according to claim 1, wherein the features extracted by the structure-aware branch of the dual-branch feature extraction module of step 1 include object boundaries, spatial configurations, and potential shadow-projection areas, and the features extracted by the appearance-aware branch include color contrast, illumination consistency, and material properties.
- 3. The 3D perception based single synthetic image shadow generation method of claim 1, wherein step 2 uses the local channel-space cross attention module LCSAM to capture fine-grained cross-feature dependencies in both the channel and spatial dimensions; the attention weights A_c are computed as A_c = Softmax(Q_c K_c^T); the local channel-enhanced feature is computed as X_m2 = X_r + y_c; the local attention weights A_s are computed as A_s = Softmax(Q_s K_s^T); and the final output feature is computed as X_LCSAM = X_m2 + y_s (a code sketch of this module is given after the claims).
- 4. The method for generating shadows of a single composite image based on 3D perception according to claim 1, wherein step 3 constructs a virtual three-dimensional volume space by orthogonally projecting sampling rays from each pixel of the composite image plane, ensuring that the sampling process is aligned with the two-dimensional image structure so that spatially consistent tri-plane sampling point features F_t can be extracted from the pre-computed tri-plane representation even in the absence of explicit three-dimensional cues, and comprises the following steps: given a composite image of resolution r × r, a normalized two-dimensional coordinate grid is first generated; for each coordinate (x, y), an orthogonal sampling ray is defined with origin O = [x, y, -1] and direction D = [0, 0, 1], perpendicular to the composite image plane; then, along each ray, S = 32 points are uniformly sampled in the depth direction within the preset range [0, 2], the depth value of the s-th sampling point being z_s = 2s / (S - 1), s = 0, 1, ..., S - 1, and the corresponding three-dimensional coordinates are given by P_{x,y,z} = O + z_s · D; each sampling point P is projected onto the three orthogonal two-dimensional planes F_xy, F_xz, and F_yz; finally, the features of the three planes at the projected positions are extracted by bilinear interpolation and fused into the tri-plane sampling point features F_t (a code sketch of this sampling procedure is given after the claims).
- 5. The method for shadow generation of a single composite image based on 3D perception according to claim 1, wherein in the step 4 dual-MLP network, the first MLP predicts the shadow mask probability g ∈ [0, 1] and the intermediate feature F_s as (g, F_s) = MLP_1(F_t), and the second MLP predicts the RGB color c and the volume density σ as (c, σ) = MLP_2([F_t, F_s]), where [·, ·] denotes concatenation; the specific method for volume rendering the RGB color value c and the shadow mask probability g with the volume density σ is as follows: for sampling point i, given its color value c_i, shadow mask probability g_i, and depth value z_i, the absorption coefficient of each segment is first computed according to the Beer-Lambert law; the alpha value α_i, representing the amount of light absorbed at each point, is computed as α_i = 1 - exp(-σ_i δ_i), where δ_i = z_{i+1} - z_i is the distance between two adjacent sampling points; the weight w_i of each sampling point i is obtained by multiplying its absorption coefficient α_i by the accumulated transmittance before sampling point i: w_i = α_i ∏_{j<i} (1 - α_j + ε), where ε is set to 1×10^-10 for numerical stability and the α_j are the absorption coefficients of all points along the ray before the i-th sampling point; finally, the optimized foreground shadow mask M̂ and the shadowed background image Î are generated by weighted summation of the point-wise predictions: M̂ = Σ_i w_i g_i and Î = Σ_i w_i c_i (a code sketch of this rendering step is given after the claims).
- 6. The method for generating shadows of a single composite image based on 3D perception according to claim 1, wherein the step 5 composite loss function L_total includes a shadow mask prediction loss L_mask, an image reconstruction loss L_rec, and a perceptual loss L_perc; the shadow mask prediction loss penalizes pixel-level differences with an L1 loss, L_mask = ||M̂ - M||_1, where M̂ is the predicted foreground shadow mask and M the ground-truth mask; the image reconstruction loss uses a mean squared error to represent the pixel-level difference between the generated shadowed background image Î and the real image I, L_rec = ||Î - I||_2^2; the perceptual loss is computed as L_perc = Σ_l ||φ_l(Î) - φ_l(I)||, where φ_l denotes the features extracted from the l-th layer of a pretrained VGG network; finally, the three losses are weighted and summed to obtain the composite loss function L_total = L_mask + λ_1 L_rec + λ_2 L_perc, where λ_1 and λ_2 are trade-off parameters (a code sketch of this loss is given after the claims).
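The following is a minimal PyTorch sketch of the LCSAM fusion referenced in claims 1 and 3. It is not the patent's implementation: which branch supplies the queries versus the keys/values, the 1×1 convolutions, the group count, the window size, and the √c scaling in the spatial branch are all illustrative assumptions; only the overall structure (grouped channel cross attention, windowed spatial cross attention, and the residual sums X_m2 = X_r + y_c and X_LCSAM = X_m2 + y_s) follows the claims.

```python
# Sketch of the local channel-space cross attention module (claims 1 and 3).
# Assumptions: the structural map supplies queries and the appearance map supplies
# keys/values in the channel branch; Q/K/V come from 1x1 convolutions; the spatial
# size must be divisible by `window`.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LCSAM(nn.Module):
    def __init__(self, channels: int, groups: int = 4, window: int = 8):
        super().__init__()
        assert channels % groups == 0
        self.g, self.win = groups, window
        self.q_c = nn.Conv2d(channels, channels, 1)
        self.kv_c = nn.Conv2d(channels, 2 * channels, 1)
        self.qkv_s = nn.Conv2d(2 * channels, 3 * channels, 1)

    def forward(self, x_r, x_struct):
        b, c, h, w = x_r.shape
        # Step 2.1: grouped channel cross attention, A_c = Softmax(Q_c K_c^T)
        q = self.q_c(x_struct).view(b, self.g, c // self.g, h * w)
        k, v = self.kv_c(x_r).chunk(2, dim=1)
        k = k.view(b, self.g, c // self.g, h * w)
        v = v.view(b, self.g, c // self.g, h * w)
        attn_c = torch.softmax(q @ k.transpose(-1, -2), dim=-1)
        y_c = (attn_c @ v).reshape(b, c, h, w)      # attention-enhanced feature y_c
        x_m2 = x_r + y_c                            # X_m2 = X_r + y_c
        # Step 2.2: cross attention inside non-overlapping local windows
        q_s, k_s, v_s = self.qkv_s(torch.cat([x_r, x_m2], 1)).chunk(3, dim=1)

        def to_windows(t):                          # (b, c, h, w) -> (b*n_win, win^2, c)
            t = F.unfold(t, self.win, stride=self.win)
            return t.view(b, c, self.win ** 2, -1).permute(0, 3, 2, 1).reshape(
                -1, self.win ** 2, c)

        qw, kw, vw = to_windows(q_s), to_windows(k_s), to_windows(v_s)
        attn_s = torch.softmax(qw @ kw.transpose(-1, -2) / c ** 0.5, dim=-1)
        out = (attn_s @ vw).reshape(b, -1, self.win ** 2, c).permute(0, 3, 2, 1)
        y_s = F.fold(out.reshape(b, c * self.win ** 2, -1), (h, w),
                     self.win, stride=self.win)     # local spatial feature y_s
        # Step 2.3: X_LCSAM = X_m2 + y_s
        return x_m2 + y_s
```

For example, `LCSAM(64)(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))` returns a fused (1, 64, 32, 32) feature map.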
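Claim 4's orthogonal ray construction and tri-plane lookup can be sketched as below. The plane resolution, the feature width, summation as the fusion of the three plane features, and the mapping of (x, y, z) onto each plane's sampling axes are assumptions not fixed by the patent; the ray origin O = [x, y, -1], direction D = [0, 0, 1], depth range [0, 2], and S = 32 samples follow claim 4.

```python
# Sketch of pixel-aligned tri-plane sampling (claim 4). Assumption: the plane
# features are fused by summation. All sample coordinates land in [-1, 1]
# because O_z = -1 and z_s is in [0, 2], which suits grid_sample directly.
import torch
import torch.nn.functional as F

def triplane_sample(f_xy, f_xz, f_yz, r: int = 64, S: int = 32):
    """f_*: (1, C, R, R) plane feature maps; returns F_t with shape (1, C, r*r, S)."""
    xs = torch.linspace(-1.0, 1.0, r)
    y, x = torch.meshgrid(xs, xs, indexing="ij")       # normalized 2D pixel grid
    o = torch.stack([x, y, torch.full_like(x, -1.0)], -1).view(-1, 3)  # O = [x, y, -1]
    d = torch.tensor([0.0, 0.0, 1.0])                  # D = [0, 0, 1]
    z = torch.linspace(0.0, 2.0, S)                    # z_s = 2s/(S-1)
    p = o[:, None, :] + z[None, :, None] * d           # sample points, (r*r, S, 3)

    def sample(plane, uv):                             # bilinear lookup on one plane
        return F.grid_sample(plane, uv.unsqueeze(0), mode="bilinear",
                             align_corners=True)

    return (sample(f_xy, p[..., [0, 1]]) +
            sample(f_xz, p[..., [0, 2]]) +
            sample(f_yz, p[..., [1, 2]]))              # fused F_t
```

Calling `triplane_sample(*[torch.randn(1, 16, 64, 64) for _ in range(3)])` yields F_t of shape (1, 16, 4096, 32), i.e. one feature per sample point on each of the 64×64 rays.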
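The alpha compositing in claim 5 is standard volume rendering and can be written compactly. This is a minimal sketch assuming σ, g, and c have already been produced by the two MLPs ((g, F_s) = MLP_1(F_t), then (c, σ) = MLP_2([F_t, F_s])); treating the last sample's interval δ as a large constant is a common convention the patent does not specify.

```python
# Sketch of the shared-density alpha compositing (claim 5).
import torch

def composite(sigma, g, c, z, eps: float = 1e-10):
    """sigma, g: (N, S); c: (N, S, 3); z: (S,) ascending depths per ray."""
    delta = torch.cat([z[1:] - z[:-1], z.new_full((1,), 1e10)])  # delta_i = z_{i+1} - z_i
    alpha = 1.0 - torch.exp(-sigma * delta)                      # Beer-Lambert absorption
    # Accumulated transmittance before point i: prod_{j<i} (1 - alpha_j + eps)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]),
                   1.0 - alpha[:, :-1] + eps], dim=1), dim=1)
    w = alpha * trans                                            # w_i = alpha_i * T_i
    mask = (w * g).sum(dim=1)                                    # M-hat = sum_i w_i g_i
    image = (w.unsqueeze(-1) * c).sum(dim=1)                     # I-hat = sum_i w_i c_i
    return mask, image
```

Because the same weights w are applied to both g and c, the shadow mask and the shadowed image stay geometrically consistent, which is the point of sharing the volume density σ.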
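The composite loss of claim 6 can be sketched as follows. The VGG-16 backbone truncated at relu3_3, the L1 distance for the perceptual term, and the trade-off weights are assumptions; the patent specifies only an L1 mask loss, an MSE reconstruction loss, and a perceptual loss on pretrained VGG features.

```python
# Sketch of the composite training loss (claim 6). The VGG layer choice and the
# lambda weights are illustrative assumptions.
import torch
import torch.nn.functional as F
import torchvision

_vgg = torchvision.models.vgg16(
    weights=torchvision.models.VGG16_Weights.IMAGENET1K_V1).features[:16].eval()
for p in _vgg.parameters():                       # frozen feature extractor
    p.requires_grad_(False)

def composite_loss(pred_mask, gt_mask, pred_img, gt_img,
                   lam_rec: float = 1.0, lam_perc: float = 0.1):
    l_mask = F.l1_loss(pred_mask, gt_mask)        # L1 shadow mask prediction loss
    l_rec = F.mse_loss(pred_img, gt_img)          # MSE image reconstruction loss
    l_perc = F.l1_loss(_vgg(pred_img), _vgg(gt_img))  # perceptual loss on VGG features
    return l_mask + lam_rec * l_rec + lam_perc * l_perc
```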
Description
Single synthetic image shadow generation method based on 3D perception

Technical Field
The invention relates to image synthesis technology, and in particular to a single synthetic image shadow generation method based on 3D perception.

Background
The goal of image synthesis is to combine the foreground of one image with another background image to produce a composite image, which has wide applications in virtual reality, artistic creation, e-commerce, and other areas. Simply pasting a foreground onto a background tends to produce visual inconsistencies, including illumination incompatibility between foreground and background and the lack of foreground shadows or reflections. We focus on the shadow problem: if an inserted foreground object casts no plausible shadow on the background, the realism and quality of the composite image are severely degraded. The shadow generation task can therefore improve the realism of images and scenes. Existing shadow generation methods fall into two main categories: rendering-based methods and image-to-image translation methods.

Rendering-based methods typically require explicit knowledge of the illumination, reflectance, material properties, or scene geometry of the foreground objects and background in order to render shadows for inserted virtual objects. However, such information is often unavailable in real-world applications. Early work by Kee et al. used 3D scene reconstruction or linear programming to generate shadows, but required user interaction to ensure accuracy. In SSN, Sheng et al. use a 2D object mask and an environment light map to generate controllable shadows, combined with an ambient occlusion prediction module to produce interactively refined, realistic shadows. Subsequent work, PixHt-Lab, uses a geometric representation to control shadow direction and shape, mapping pixel-height data into 3D space for geometric reconstruction. These methods sometimes require 3D information to be predicted from 2D images, which is quite challenging in complex scenes. Without user interaction, Gardner et al. attempt to recover explicit lighting conditions and scene geometry from a single image, but inaccurate estimates may lead to unsatisfactory results.

Image-to-image translation methods learn the mapping from an input image without foreground shadows to an output image with foreground shadows, without explicit knowledge of illumination, reflectance, and the like. The mainstream approaches are of two kinds: methods based on generative adversarial networks (GANs) and methods based on diffusion models. GAN-based methods ensure the accuracy and realism of generated shadows through joint supervision by a reconstruction loss and an adversarial loss. ShadowGAN refines local details with a local discriminator and the global view with a global discriminator; global and local conditional discriminators are used jointly to enhance the realism of generated shadows. Mask-ShadowGAN uses generated shadow masks to guide shadow generation through cycle-consistency constraints. ARShadowGAN uses an attention mechanism to model the mapping between virtual shadows and the real environment. SGRNet first predicts the foreground shadow mask through interaction of foreground and background information, and then predicts shadow parameters to fill in the shadow region. However, these GAN-based approaches typically require training models from scratch on limited paired data, which limits their generalization ability. Among diffusion-based methods, DMASNet first decomposes shadow mask generation into cuboid prediction and shape prediction, then focuses on background shadow pixels to achieve accurate foreground shadow filling. SGDiffusion adapts ControlNet to shadow generation and incorporates intensity modulation to optimize shadow intensity. Despite the introduction of diffusion models, these methods still suffer from poor shadow geometry. One key reason for this limitation is the lack of 3D awareness in conventional 2D image generation pipelines, which hinders their ability to capture spatial relationships and maintain geometric consistency: without a proper understanding of the underlying scene structure, they often produce shadows that look visually plausible but are geometrically inaccurate.

Disclosure of Invention
The invention aims to overcome the above defects of the prior art and provide a single synthetic image shadow generation method based on 3D perception. The technical scheme is as follows: the single synthetic image shadow generation method based on 3D perception comprises the following steps: Step 1, construct a dual-branch feature extraction module; the foreground object mask, the predicted foreground shadow mask, the composite image without foreground shadows, and the background object-shadow mask pair of the same image are input together into the dual-branch feature extraction module.