CN-121982148-A - Self-adaptive shadow generation method based on plane projection guidance and depth perception diffusion
Abstract
The invention discloses a self-adaptive shadow generation method based on plane projection guidance and depth-aware diffusion, belonging to the technical field of computer vision and image synthesis. In the first stage, the method generates a hard shadow mask of the foreground object under a virtual light source through physical projection calculation, providing a geometric position and shape prior. In the second stage, multi-modal conditions such as the hard shadow mask, a background depth map, and a foreground-background fusion map are uniformly encoded, together with noise latent variables, into a Token sequence, which is input into a Diffusion Transformer model; detail rendering is carried out through a self-attention mechanism, and a synthetic image with realistic shadows is output. By combining physical guidance with neural rendering, the invention solves problems of the prior art such as shadow geometric distortion, inconsistent illumination, and insufficient texture fitting.
Inventors
- YAN XINYI
- YANG XU
- LIU ZHIHUI
- LI HENGDA
Assignees
- 厦门真景科技有限公司
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-22
Claims (9)
- 1. An adaptive shadow generation method based on plane projection guidance and depth-aware diffusion, characterized by comprising the following steps: S00, acquiring a background image, a foreground object image and a corresponding foreground mask, and extracting scene parameters of the background image to obtain a background depth map and a background illumination vector; S10, constructing a hard shadow guide mask conforming to a physical perspective relation through geometric projection based on the background illumination vector and the foreground mask; S20, splicing the hard shadow guide mask, the background depth map and a foreground-background fusion map, obtained by compositing the foreground object image onto the background image, as multi-modal conditions; S30, constructing a joint input sequence containing noise latent variables and the multi-modal conditions, and inputting the joint input sequence into a Diffusion Transformer (DiT) model; S40, performing denoising generation through the DiT model, and outputting a synthetic image fused with the adaptive shadow.
- 2. The adaptive shadow generation method based on planar projection guidance and depth-aware diffusion of claim 1, wherein in step S00 the step of extracting scene parameters from the background image includes: S01, depth estimation, namely extracting a background depth map of the background image by using a monocular depth estimation network; S02, illumination parameter estimation, namely analyzing the brightness distribution of the background image by using an illumination estimation module and outputting the background illumination vector $L = (\theta, \phi)$, where $\theta$ is the light source azimuth angle and $\phi$ is the light source elevation angle.
- 3. The method of claim 2, wherein in step S10 the step of constructing a hard shadow guide mask based on the background illumination vector and the foreground mask comprises: S11, anchor point positioning, namely detecting the center of the bottommost pixel region in the foreground mask and defining that center as the projection root node $p_{root}$; S12, constructing a shearing projection matrix $M$ based on the background illumination vector $L = (\theta, \phi)$, and mapping each pixel $p$ in the foreground mask to a ground projection point $p' = M\,p$; and S13, transforming the foreground mask with the shearing projection matrix $M$ to generate a hard shadow guide mask that retains the outline of the foreground object (see the shear-projection sketches following the claims).
- 4. The method for generating an adaptive shadow based on planar projection guidance and depth-aware diffusion according to claim 3, wherein in step S13, when the hard shadow guide mask is constructed, non-planar ground geometry in the background image is identified from the background depth map, and the shearing projection matrix $M$ is adaptively adjusted based on the identified non-planar geometry so that the generated hard shadow guide mask adapts to sloped or curved terrain.
- 5. The method according to claim 3, wherein the illumination estimation module is further configured to estimate illumination vectors for a plurality of light sources, and in step S13 the step of constructing a hard shadow guide mask includes generating a corresponding hard shadow guide mask for each light-source illumination vector and fusing them to simulate the shadow superposition effect under multiple light sources.
- 6. The method of adaptive shadow generation based on planar projection guidance and depth-aware diffusion according to claim 3, wherein in step S30 the joint input sequence is constructed by: S31, patch partitioning of the multi-source feature maps, namely dividing the hard shadow guide mask, the background depth map, the foreground-background fusion map and the noise latent variable to be denoised into image patches respectively; S32, Token embedding and encoding, wherein each image patch is mapped to a one-dimensional visual Token through a linear projection layer, and a modality-type code and a position code are added to Tokens from different modalities; and S33, sequence splicing, namely concatenating, along the sequence-length dimension, the Tokens of the hard shadow guide mask, the background depth map and the foreground-background fusion map as a conditional sequence with the target sequence corresponding to the noise latent variable, to form the joint input sequence (see the tokenization sketch following the claims).
- 7. The method for adaptive shadow generation based on planar projection guidance and depth-aware diffusion of claim 6, wherein in step S40 the step of denoising generation by the DiT model includes: S41, full-sequence self-attention interaction, wherein the DiT model lets the Tokens corresponding to the noise latent variable interact with the Tokens corresponding to the multi-modal conditions through a self-attention mechanism, so as to query the geometric position of the generated shadow, the ground height, distance information, and ground texture information; and S42, selective decoding and reconstruction, wherein the DiT model outputs the processed sequence and decodes it into a shadow image with an adaptive penumbra effect and ground texture fusion.
- 8. The method according to claim 7, wherein in step S41 the denoising generation process of the DiT model introduces a classifier-free guidance (CFG) based control mechanism, comprising: receiving at least one shadow attribute control parameter input by a user, the shadow attribute control parameter comprising shadow intensity, edge softness, or hue; inputting the shadow attribute control parameter as an additional condition into the DiT model; and controlling the appearance of the shadow attribute in the output image through the CFG mechanism (see the CFG sketch following the claims).
- 9. The adaptive shadow generation method based on planar projection guidance and depth-aware diffusion of claim 1, further comprising introducing temporal consistency constraints when the shadow generation method is applied to each frame of a video, including warping the shadow image generated for the previous frame using optical-flow information between adjacent frames and using the warped shadow image as an auxiliary condition or initialization reference for the current frame's generation process, so as to keep the shadows temporally continuous and stable (see the optical-flow sketch following the claims).
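The extract above omits the explicit mapping relation of step S12 (the inline formula was lost in extraction). For orientation only, here is one standard planar shear used for shadow projection, consistent with the symbols of claims 2 and 3 but not guaranteed to be the patent's exact matrix:

```latex
% Hedged reconstruction: a common shear form, NOT the patent's verbatim formula.
% Root node p_root = (x_r, y_r); theta = azimuth, phi = elevation (claim 2).
% A mask pixel p = (x, y) at height h = y_r - y above the root maps to:
\begin{aligned}
x' &= x + (y_r - y)\cot\phi\,\cos\theta \\
y' &= y_r - (y_r - y)\cot\phi\,\sin\theta
\end{aligned}
% i.e. p' = M p in homogeneous coordinates; the shadow lengthens as the
% light-source elevation angle phi decreases, as expected physically.
```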
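A minimal NumPy/OpenCV sketch of steps S11-S13 under the shear form above; the function name and all numeric conventions are illustrative assumptions, not the patent's implementation:

```python
# Hedged sketch of S11-S13 (claim 3); not the patent's verbatim code.
import numpy as np
import cv2

def hard_shadow_mask(fg_mask: np.ndarray, theta: float, phi: float) -> np.ndarray:
    """Shear-project a binary HxW foreground mask onto the ground plane.

    theta: light-source azimuth angle (radians); phi: elevation angle (radians).
    """
    ys, xs = np.nonzero(fg_mask)
    if len(ys) == 0:
        return np.zeros_like(fg_mask, dtype=np.uint8)

    # S11: projection root node = center of the bottommost pixel row of the mask.
    y_r = ys.max()
    x_r = int(xs[ys == y_r].mean())  # computed per S11; the row-based shear below
                                     # only needs y_r
    # S12: affine shear M. A pixel at height h = y_r - y above the root is
    # displaced by h*cot(phi) along the azimuth direction.
    k = 1.0 / max(np.tan(phi), 1e-6)          # cot(phi), clamped to avoid /0
    cx, sx = np.cos(theta), np.sin(theta)
    M = np.float32([
        [1.0, -k * cx, y_r * k * cx],         # x' = x + (y_r - y)*k*cos(theta)
        [0.0,  k * sx, y_r * (1.0 - k * sx)], # y' = y_r - (y_r - y)*k*sin(theta)
    ])

    # S13: warp the mask with M; the result keeps the object's full outline.
    h_img, w_img = fg_mask.shape[:2]
    shadow = cv2.warpAffine(fg_mask.astype(np.uint8) * 255, M, (w_img, h_img),
                            flags=cv2.INTER_NEAREST)
    return (shadow > 0).astype(np.uint8)
```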
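For claim 6, a compact PyTorch sketch of the Token construction (S31-S33); patch size, embedding width, channel counts, and module names are assumptions:

```python
# Minimal sketch of S31-S33; dimensions are illustrative, not the patent's values.
import torch
import torch.nn as nn

class MultiModalTokenizer(nn.Module):
    def __init__(self, patch=16, dim=768, n_modalities=4, max_tokens=4096):
        super().__init__()
        # S32: one linear projection per modality maps a flattened patch to a Token.
        # Assumed channels: shadow mask (1), depth (1), fg-bg fusion (3), latent (4).
        self.proj = nn.ModuleList(
            [nn.Linear(patch * patch * c, dim) for c in (1, 1, 3, 4)]
        )
        self.modality_emb = nn.Embedding(n_modalities, dim)  # modality-type code
        self.pos_emb = nn.Embedding(max_tokens, dim)         # position code
        self.patch = patch

    def patchify(self, x):  # S31: (B, C, H, W) -> (B, N, C*patch*patch)
        p = self.patch
        return (x.unfold(2, p, p).unfold(3, p, p)       # (B, C, H/p, W/p, p, p)
                 .permute(0, 2, 3, 1, 4, 5)
                 .flatten(3).flatten(1, 2))

    def forward(self, shadow_mask, depth, fusion, noise_latent):
        seqs = []
        for m, x in enumerate((shadow_mask, depth, fusion, noise_latent)):
            tok = self.proj[m](self.patchify(x))         # (B, N_m, dim)
            tok = tok + self.modality_emb.weight[m]      # add modality-type code
            seqs.append(tok)
        # S33: condition Tokens first, noise-latent target Tokens last,
        # concatenated along the sequence-length dimension.
        joint = torch.cat(seqs, dim=1)
        pos = torch.arange(joint.size(1), device=joint.device)
        return joint + self.pos_emb(pos)
```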
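For claims 7-8, a sketch of one classifier-free-guidance denoising step; the denoiser signature `model(x_t, t, cond=..., attrs=...)` is hypothetical glue, not the patent's API:

```python
# Hedged sketch of the CFG step from claim 8, assuming a DiT-style denoiser.
import torch

@torch.no_grad()
def cfg_denoise_step(model, x_t, t, cond_tokens, shadow_attrs, guidance_scale=4.0):
    """One guided step: blend conditional and unconditional noise predictions."""
    # Conditional pass: multi-modal condition Tokens plus the user's shadow
    # attribute parameters (intensity, edge softness, hue) as extra conditions.
    eps_cond = model(x_t, t, cond=cond_tokens, attrs=shadow_attrs)
    # Unconditional pass: conditions dropped (e.g., replaced by null embeddings).
    eps_uncond = model(x_t, t, cond=None, attrs=None)
    # CFG: push the prediction toward the conditioned direction.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```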
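For claim 9, a sketch of the optical-flow warp of the previous frame's shadow image, assuming OpenCV's Farneback flow (the patent does not name a flow estimator):

```python
# Sketch of the temporal-consistency warp from claim 9.
import numpy as np
import cv2

def warp_prev_shadow(prev_gray, curr_gray, prev_shadow):
    """Warp the previous frame's shadow image into the current frame."""
    # Dense flow from current to previous frame: for each current pixel,
    # where it came from in the previous frame (backward warping).
    flow = cv2.calcOpticalFlowFarneback(curr_gray, prev_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = curr_gray.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    # Sample the previous shadow at the flowed positions; the result serves as
    # an auxiliary condition or initialization for the current frame's pass.
    return cv2.remap(prev_shadow, map_x, map_y, cv2.INTER_LINEAR)
```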
Description
Self-adaptive shadow generation method based on plane projection guidance and depth perception diffusion

Technical Field
The invention relates to the technical field of computer vision and image synthesis, in particular to a self-adaptive shadow generation method based on plane projection guidance and depth-aware diffusion.

Background
In applications such as image editing, Augmented Reality (AR), and e-commerce advertisement composition, seamlessly blending foreground objects (e.g., models, merchandise) into background images is a core requirement. Shadow generation is a key factor in determining the realism of the composite, and the prior art mainly has the following defects: (1) Geometric oversimplification: part of the prior art (e.g., generating shadows by predicting rotated-bounding-box or ellipse parameters) ignores the complex contour of the object itself; for example, when a person waves an arm, a bounding-box-based shadow cannot reproduce the projection of the arm, so the shadow disagrees with the motion. (2) Illumination consistency is hard to guarantee: a purely end-to-end generative model (such as a GAN that directly generates the shadow) usually relies on memorized data to infer the shadow position and lacks explicit modeling of the real light-source direction in the background, easily producing physical conflicts such as background shadows pointing left while the generated shadow points right. (3) Insufficient texture and fit: although shadows produced by traditional graphics rendering are accurate in position, their edges are hard (hard shadows) and cannot simulate the real-world penumbra effect, which blurs with distance, and texture fusion with complex floors such as grass or carpet is difficult to handle.

Disclosure of Invention
To overcome the defects of the prior art, the invention aims to provide a self-adaptive shadow generation method based on plane projection guidance and depth-aware diffusion. It addresses the problem that, during image editing, a selected image cannot be placed into an arbitrary background while generating a realistic shadow, namely the loss of projection-shape detail, the inconsistency of projection direction with ambient light, and the hardness of shadow textures in existing shadow generation techniques, and provides a high-quality shadow generation method that both obeys the laws of optics and adapts to the background texture. To this end, the invention adopts the following technical scheme. The invention provides a self-adaptive shadow generation method based on plane projection guidance and depth-aware diffusion, comprising the following steps: S00, obtaining a background image, a foreground object image and a corresponding foreground mask, and extracting scene parameters of the background image to obtain a background depth map and a background illumination vector, wherein extracting the scene parameters of the background image comprises: S01, depth estimation, namely extracting a background depth map of the background image by using a monocular depth estimation network (a sketch follows this paragraph); S02, illumination parameter estimation, namely analyzing the brightness distribution of the background image by using an illumination estimation module and outputting the background illumination vector $L = (\theta, \phi)$, where $\theta$ is the light source azimuth angle and $\phi$ is the light source elevation angle.
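A sketch of step S01, assuming the publicly available MiDaS monocular depth network via torch.hub; the patent does not name a specific network, so this choice is illustrative:

```python
# Hedged sketch of S01: monocular depth estimation with MiDaS (one possible choice).
import cv2
import torch

def estimate_background_depth(bgr_image):
    """Return a relative depth map, resized to the background image resolution."""
    model = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
    model.eval()
    transforms = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform
    rgb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        pred = model(transforms(rgb))                    # (1, h', w') relative depth
        depth = torch.nn.functional.interpolate(
            pred.unsqueeze(1), size=rgb.shape[:2],
            mode="bicubic", align_corners=False).squeeze()
    return depth.cpu().numpy()
```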
S10, constructing a hard shadow guide mask conforming to a physical perspective relation through geometric projection based on the background illumination vector and the foreground mask, wherein constructing the hard shadow guide mask based on the background illumination vector and the foreground mask comprises: S11, anchor point positioning, namely detecting the center of the bottommost pixel region in the foreground mask and defining that center as the projection root node $p_{root}$; S12, constructing a shearing projection matrix $M$ based on the background illumination vector $L = (\theta, \phi)$, and mapping each pixel $p$ in the foreground mask to a ground projection point $p' = M\,p$; and S13, transforming the foreground mask with the shearing projection matrix $M$ to generate a hard shadow guide mask that retains the outline of the foreground object.

S20, splicing the hard shadow guide mask, the background depth map and a foreground-background fusion map, obtained by compositing the foreground object image onto the background image, as multi-modal conditions (see the compositing sketch below).

S30, constructing a joint input sequence containing noise latent variables and the multi-modal conditions, and inputting the joint input sequence into the Diffusion Transformer (DiT) model, wherein the joint input sequence is constructed by: S31, patch partitioning of the multi-source feature maps, namely dividing the hard shadow guide mask, the background depth map, the foreground-background fusion map and the noise latent variable to be denoised into image patches respectively.
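A minimal sketch of the S20 foreground-background fusion map as plain alpha compositing (the patent calls the operation "splicing"; the exact compositing operator is an assumption):

```python
# Hedged sketch of S20: paste the foreground onto the background with its mask.
import numpy as np

def fuse_foreground_background(fg: np.ndarray, fg_mask: np.ndarray,
                               bg: np.ndarray) -> np.ndarray:
    """Alpha-composite fg over bg; fg_mask is an HxW mask with values in [0, 1]."""
    alpha = fg_mask.astype(np.float32)[..., None]    # (H, W, 1) for broadcasting
    return (alpha * fg.astype(np.float32)
            + (1.0 - alpha) * bg.astype(np.float32)).astype(np.uint8)
```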