
CN-122023624-A - Synthetic instance and streetscape fusion coordination method based on video diffusion model

CN 122023624 A

Abstract

The invention relates to a synthetic-instance and streetscape fusion coordination method based on a video diffusion model, comprising: S1, acquiring autonomous-driving multi-source data, constructing a scene Gaussian database and an asset Gaussian database from that data, and calibrating the asset Gaussian database against it; S2, fusing the scene Gaussian database and the asset Gaussian database to obtain a fused Gaussian scene, and rendering the fused Gaussian scene to obtain a training triplet set; S3, constructing a video diffusion coordination network model comprising a base coordination network and an instance-mask guidance module, and training both with the training triplet set; and S4, performing harmonization inference on a video to be processed with the trained video diffusion coordination network model. The method yields coordinated fusion of synthetic video assets with the street view and a visually realistic fused scene.

Inventors

  • WU YINGBO
  • LI WENXIN
  • GAN LINHAO

Assignees

  • Chongqing University (重庆大学)

Dates

Publication Date
2026-05-12
Application Date
2026-02-02

Claims (10)

  1. A synthetic-instance and streetscape fusion coordination method based on a video diffusion model, characterized by comprising the following steps: S1, acquiring autonomous-driving multi-source data, constructing a scene Gaussian database and an asset Gaussian database based on the multi-source data, and calibrating the asset Gaussian database against the multi-source data; S2, fusing the scene Gaussian database and the asset Gaussian database to obtain a fused Gaussian scene, and rendering the fused Gaussian scene to obtain a training triplet set; S3, constructing a video diffusion coordination network model comprising a base coordination network and an instance-mask guidance module, and training the base coordination network and the instance-mask guidance module in sequence with the training triplet set; and S4, performing harmonization inference on a video to be processed with the trained video diffusion coordination network model.
  2. The method of claim 1, wherein the scene Gaussian database stores 3D Gaussian representations of the autonomous-driving multi-source data, and the asset Gaussian database stores the dimensions and poses derived from the autonomous-driving multi-source data.
  3. The synthetic-instance and streetscape fusion coordination method of claim 1, wherein the asset Gaussian database construction and calibration comprises: B1, acquiring dynamic-traffic-participant images from the autonomous-driving multi-source data; B2, generating synthetic assets from the scene Gaussian database and textured mesh assets; B3, decomposing the textured-mesh-asset features to obtain a vertex covariance matrix $C = \frac{1}{N}\sum_{i=1}^{N}(v_i-\bar{v})(v_i-\bar{v})^{\mathsf T}$, where $C$ is the vertex covariance matrix, $N$ the total number of mesh-model vertices, $v_i$ the coordinates of the $i$-th vertex, $\bar{v}$ the mean of the vertex coordinate set, and $(v_i-\bar{v})^{\mathsf T}$ the transpose of $(v_i-\bar{v})$; B4, extracting the principal-axis information of the textured mesh asset from the vertex covariance matrix and constructing a rotation matrix $R = [e_1\ e_2\ e_3]$, where $R$ is the rotation matrix built from the principal-axis information and $e_1, e_2, e_3$ are unit eigenvectors; B5, constructing a scaling matrix from the scene Gaussian database and the textured mesh asset, $S = \mathrm{diag}(L/l,\ W/w,\ H/h)$, where $S$ is the scaling matrix, $L$, $W$, $H$ are the true length, width and height of the corresponding dynamic object in the scene Gaussian database, and $l$, $w$, $h$ are the length, width and height of the textured mesh asset; B6, transforming the textured-mesh-asset vertices with the scaling and rotation matrices, $v_i' = R\,S\,v_i$, where $v_i'$ are the transformed vertex coordinates and $v_i$ the original coordinates of the $i$-th vertex; and B7, converting the transformed textured mesh asset into asset Gaussian primitives and storing them in the asset Gaussian database (a calibration sketch follows the claims).
  4. The synthetic-instance and streetscape fusion coordination method of claim 1, wherein step S2 specifically comprises: S201, generating a disharmonious video and an instance-mask video from the fused Gaussian scene; S202, acquiring original real videos of a preset scene from the autonomous-driving multi-source data, and combining the disharmonious videos, the instance-mask videos and the original real videos into a training triplet set.
  5. The synthetic-instance and streetscape fusion coordination method of claim 4, wherein instance-mask video generation specifically comprises: C1, assigning classification labels to the scene Gaussian primitives and the asset Gaussian primitives in the fused Gaussian scene; C2, rendering an instance-mask map from the classification labels as $M(p) = \frac{\sum_{i\in G_p} l_i\, w_i}{\sum_{i\in G_p} w_i}$, where $M(p)$ is the rendered instance-mask pixel value, $G_p$ the set of all Gaussian primitives projected to pixel $p$, $l_i$ the class-label value of the $i$-th Gaussian primitive, and $w_i$ the accumulated blending weight of the $i$-th Gaussian primitive; C3, splitting the instance-mask map into several binarized single-channel masks according to the classification labels; and C4, combining the binarized single-channel masks into an instance-mask video (a mask-rendering sketch follows the claims).
  6. The method according to claim 1, wherein in step S3 the base coordination network comprises a pre-trained encoding unit, a denoising unit and a decoding unit connected in sequence, and the instance-mask guidance module comprises: an input layer for the instance-mask video of the training triplet set and its Fourier position code; a convolution projection layer that concatenates the mask video and the Fourier position code along the image channels and projects them by convolution into a feature map; a downsampling feature-extraction layer comprising several downsampling blocks matched to the denoising unit, each followed by residual connection blocks, which captures the mask features of the feature map; a gating feature-modulation layer, arranged per downsampling block, which generates scale factors and offset factors; and a zero-initialization strategy layer that initializes the parameters of the gating feature-modulation layer.
  7. The method of claim 6, wherein inference by the instance-mask guidance module comprises: D1, embedding a Fourier position code into the instance-mask video of the input training triplet, $PE(pos, 2k) = \sin\!\big(pos / T^{2k/d}\big)$ and $PE(pos, 2k+1) = \cos\!\big(pos / T^{2k/d}\big)$, where $PE(pos)$ is the encoding vector of position $pos$, $T^{-2k/d}$ the $k$-th frequency component, $T$ the temperature parameter, $k$ the frequency index, and $d$ the total number of encoding dimensions; D2, projecting the position-encoded instance-mask video through the convolution layer into a high-dimensional feature space to obtain a feature map; D3, extracting mask features from the feature map with the downsampling feature-extraction layer; D4, generating scale factors and offset factors from the mask features with the gating feature-modulation layer; D5, obtaining a fused feature map from the scale factors, offset factors and feature map as $\tilde{F} = \gamma \odot F + \beta$, where $\tilde{F}$ is the fused feature map, $F$ the feature map, $\odot$ element-wise matrix multiplication, $\gamma$ the scale factor, and $\beta$ the offset factor; and D6, injecting the computed fused feature map into the base coordination network (an encoding-and-modulation sketch follows the claims).
  8. The synthetic-instance and streetscape fusion coordination method of claim 6, wherein training the video diffusion coordination network model specifically comprises: S301, organizing the training triplet set into several training batches, each containing several training samples; S302, generating disharmonious-video latents and real-video latents from the training samples, $z_d = \mathcal{E}(x_d)$ and $z_r = \mathcal{E}(x_r)$, where $z_d$ is the disharmonious-video latent, $\mathcal{E}$ the encoding operation of the pre-trained encoding unit, $x_d$ the disharmonious video clip, $z_r$ the real-video latent, and $x_r$ the real video clip; S303, noising the real-video latent to obtain a noised real-video latent $z_t = \sqrt{\bar{\alpha}_t}\,z_r + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, where $z_t$ is the real-video latent noised through random time step $t$, $\bar{\alpha}_t$ the noise-scheduling parameter, $z_r$ the real-video latent, and $\epsilon$ noise drawn from a standard normal distribution; S304, computing a noise prediction from the noised real-video latent and the disharmonious-video latent, $\hat{\epsilon} = \epsilon_\theta(z_t, z_d, t)$, where $\hat{\epsilon}$ is the predicted noise, $\epsilon_\theta$ the denoising operation of the denoising unit in the base coordination network, and $t$ the time step; S305, optimizing the parameters of the base coordination network from the real-video latent, the disharmonious-video latent and the noise prediction; and S306, freezing the optimized base-coordination-network parameters and optimizing the parameters of the instance-mask guidance module with the disharmonious video, the instance-mask video and the original real video (a training-step sketch follows the claims).
  9. The synthetic-instance and streetscape fusion coordination method of claim 8, wherein the loss optimized by both the base coordination network and the instance-mask guidance module is $\mathcal{L} = \mathbb{E}_{z_r,\,\epsilon,\,z_d,\,t}\big[\,\|\epsilon - \epsilon_\theta(z_t, z_d, t)\|_2^2\,\big]$, where $\mathcal{L}$ is the diffusion loss, the expectation is taken over the joint distribution of the real-video latent $z_r$, the standard-normal noise $\epsilon$, the disharmonious-video latent $z_d$ and the time step $t$, and $\|\cdot\|_2^2$ is the squared 2-norm.
  10. The synthetic-instance and streetscape fusion coordination method of claim 1, wherein step S4 specifically comprises: S401, splitting the video to be processed into several clips of equal frame count, with adjacent clips overlapping by a preset number of frames, and inferring the latents of each clip with the video diffusion coordination network model; S402, concatenating the inferred latents of all clips into the latents of the complete video; and S403, generating the coordinated video from the complete-video latents (a sliding-window sketch follows the claims).
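
The following sketch illustrates the claim-3 asset calibration: a PCA of the mesh vertices gives the covariance matrix and principal axes, a diagonal matrix rescales the mesh to the true object size, and every vertex is transformed. All names are hypothetical; the claim does not fix the multiplication order of R and S (R @ S is assumed here), and measuring the mesh extents from the axis-aligned bounding box, like the eigenvector handedness, is a simplification.

```python
import numpy as np

def calibrate_mesh_vertices(verts, true_lwh):
    """verts: (N, 3) textured-mesh vertex coordinates.
    true_lwh: (L, W, H) of the matching dynamic object in the scene Gaussian DB."""
    mean = verts.mean(axis=0)
    centered = verts - mean
    # B3: vertex covariance matrix C = (1/N) * sum (v_i - mean)(v_i - mean)^T
    C = centered.T @ centered / len(verts)
    # B4: principal axes -> rotation matrix from unit eigenvectors
    _, eigvecs = np.linalg.eigh(C)        # columns are unit eigenvectors of C
    R = eigvecs
    # B5: scaling matrix diag(L/l, W/w, H/h) from true vs. mesh extents
    extents = centered.max(axis=0) - centered.min(axis=0)  # mesh l, w, h
    S = np.diag(np.asarray(true_lwh) / extents)
    # B6: transform every vertex (order of R and S is an assumption)
    return (R @ S @ centered.T).T + mean

verts = np.random.rand(100, 3)
aligned = calibrate_mesh_vertices(verts, true_lwh=(4.5, 1.8, 1.5))
print(aligned.shape)  # (100, 3)
```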
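Next, a sketch of the claim-5 instance-mask rendering: each pixel takes a weight-normalized vote over the class labels of the Gaussians projected onto it, M(p) = sum_i l_i w_i / sum_i w_i, and the resulting label map is split into binary single-channel masks. Array shapes and function names are illustrative assumptions, not the patent's implementation.

```python
import numpy as np

def render_instance_mask(labels, weights):
    """labels:  (G, H, W) class-label value of each Gaussian at each pixel.
    weights: (G, H, W) accumulated blending weight of each Gaussian at each pixel."""
    denom = np.clip(weights.sum(axis=0), 1e-8, None)
    mask = (labels * weights).sum(axis=0) / denom     # C2: weighted label blend
    return np.rint(mask).astype(np.int32)             # rendered instance-mask map

def split_binary_masks(mask):
    """C3: one binarized single-channel mask per class label."""
    return {int(l): (mask == l).astype(np.uint8) for l in np.unique(mask)}
```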
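A sketch of the claim-7 computation follows: the sinusoidal (Fourier) position code with temperature T and d dimensions (D1), then FiLM-style gated modulation with scale and offset factors (D4/D5). Which feature map is modulated and how the fused map is "injected" (D6) are not pinned down by the claim; this sketch modulates a feature map and adds the result into the backbone stream, which makes the claim-6 zero-initialization a no-op at the start of training. All module names are assumptions.

```python
import torch

def fourier_position_code(pos, d=64, T=10000.0):
    """pos: (...,) float positions -> (..., d) encoding vector (D1)."""
    k = torch.arange(d // 2, dtype=torch.float32)
    freqs = T ** (-2.0 * k / d)            # k-th frequency component T^(-2k/d)
    angles = pos[..., None] * freqs        # pos / T^(2k/d)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

def gated_modulation(feat, mask_feat, to_gamma, to_beta):
    """D4/D5: fused map = gamma * feat + beta, gates driven by mask features."""
    gamma = to_gamma(mask_feat)            # scale factors
    beta = to_beta(mask_feat)              # offset factors
    return gamma * feat + beta

# Claim-6 zero-initialization strategy: both gate heads start at zero.
to_gamma = torch.nn.Conv2d(64, 64, 1)
to_beta = torch.nn.Conv2d(64, 64, 1)
for m in (to_gamma, to_beta):
    torch.nn.init.zeros_(m.weight); torch.nn.init.zeros_(m.bias)

feat = torch.randn(1, 64, 32, 32)          # backbone feature map (assumed shape)
mask_feat = torch.randn(1, 64, 32, 32)     # mask features from the downsampling path
fused = gated_modulation(feat, mask_feat, to_gamma, to_beta)
# D6: with zero-initialized gates, fused is all zeros at first, so adding it
# into the base network initially changes nothing.
out = feat + fused
```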
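A sketch of one claim-8/9 training step: encode both clips with the frozen pre-trained encoder, add scheduled noise to the real latent, predict the noise conditioned on the disharmonious latent, and take the MSE (diffusion) loss of claim 9. `encoder` and `denoiser` are stand-ins for the pre-trained encoding unit and denoising unit; their signatures are assumptions.

```python
import torch
import torch.nn.functional as F

def training_step(encoder, denoiser, x_dis, x_real, alpha_bar):
    """x_dis / x_real: (B, C, T, H, W) video clips; alpha_bar: (steps,) noise schedule."""
    with torch.no_grad():
        z_d = encoder(x_dis)                       # S302: disharmonious-video latent
        z_r = encoder(x_real)                      # S302: real-video latent
    t = torch.randint(0, len(alpha_bar), (z_r.shape[0],), device=z_r.device)
    a = alpha_bar[t].view(-1, *([1] * (z_r.dim() - 1)))
    eps = torch.randn_like(z_r)                    # standard-normal noise
    z_t = a.sqrt() * z_r + (1 - a).sqrt() * eps    # S303: noised real latent
    eps_hat = denoiser(z_t, z_d, t)                # S304: noise prediction
    return F.mse_loss(eps_hat, eps)                # claim 9: ||eps - eps_hat||_2^2
```

Per S305/S306, this loss would first update the base coordination network; the same objective is then reused with the base network frozen to train the instance-mask guidance module.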
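Finally, a sketch of the claim-10 long-video strategy: split the video into fixed-length clips overlapping by a preset number of frames, run per-clip inference, and stitch the latents into one sequence. `harmonize_clip` is a stand-in for the trained model's per-clip inference; dropping the duplicated overlap region (rather than, say, blending it) is a simplifying assumption.

```python
import torch

def harmonize_long_video(frames, harmonize_clip, clip_len=16, overlap=4):
    """frames: (T, C, H, W) video tensor -> concatenated complete-video latents."""
    step = clip_len - overlap
    latents = []
    for start in range(0, max(len(frames) - overlap, 1), step):
        clip = frames[start:start + clip_len]          # S401: overlapping clips
        lat = harmonize_clip(clip)                     # per-clip latent inference
        latents.append(lat if start == 0 else lat[overlap:])  # keep overlap once
    return torch.cat(latents, dim=0)    # S402; S403 decodes these to the video
```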

Description

Synthetic instance and streetscape fusion coordination method based on a video diffusion model

Technical Field

The invention relates to scene fusion in the fields of autonomous driving and computer vision, and in particular to a method for fusing and coordinating synthetic instances with street views based on a video diffusion model.

Background

With the continued development of artificial intelligence, autonomous-driving research and applications have begun to appear in vehicles such as automobiles. Autonomous driving mainly recognizes road information through computer vision in order to control the vehicle, and in the fields of autonomous driving and computer vision, high-quality and diverse street-view data have always been an important basis for research and application. In the current autonomous-driving field, a real street view is typically reconstructed with 3D Gaussian splatting, 3D synthetic assets are generated and converted into a Gaussian representation, and a general-purpose visual harmonization method is then applied as post-processing to the fused, rendered images or videos, producing output that a machine can visually recognize. However, when a real street view is reconstructed with 3D Gaussian splatting, the Gaussian primitives of the reconstructed scene carry real illumination, shadow and texture information, while the Gaussian primitives of a synthetic asset lack scene-specific illumination baking; direct fusion is therefore prone to color distortion, texture mismatch and missing shadows. General-purpose harmonization methods only optimize the color distribution and cannot address 3D geometric issues such as scale alignment and physically consistent illumination and reflection. Image-based harmonization methods (such as Harmonizer and DiffHarmony++) usually process video frame by frame, ignoring the spatio-temporal correlation between frames, so the same target looks inconsistent across viewpoints; dedicated video methods such as VTT and MVOC introduce spatio-temporal constraints but are not optimized for 3D-Gaussian fusion scenes, so long videos tend to suffer inter-frame flicker and incoherent target motion trajectories. When images or videos are fusion-rendered, the currently common vision-language-model (VLM) approaches such as Qwen-VL easily modify the structure or target attributes of the original scene during editing, for example turning a white vehicle black, and offer poor controllability. Most existing harmonization methods are black boxes that can destroy the annotation information of the original scene and the synthetic assets (such as bounding boxes and instance masks), yet autonomous-driving data must retain accurate annotations to support model training; this shortcoming makes the rendered images or videos difficult to deploy in practice. In addition, existing harmonization datasets focus on general image or video editing scenes, contain no 3D-Gaussian heterogeneous fusion scenes and provide no spatio-temporally aligned training samples, so harmonization models lack targeted training and their performance is hard to improve.
Meanwhile, when existing video harmonization methods process long videos, block-wise processing makes the inter-frame transitions incoherent, while full-sequence processing makes the computational cost grow rapidly; neither can meet the demand of autonomous driving for generating large-scale, long-trajectory data. In summary, the stages of current street-view fusion are clearly fragmented from one another and cannot form an end-to-end fusion-coordination framework, which causes the technical problem that heterogeneous assets are fused incoherently in synthetic autonomous-driving data.

Disclosure of Invention

The invention aims to provide a synthetic-instance and street-view fusion coordination method based on a video diffusion model, so as to solve the prior-art problem that heterogeneous assets are fused incoherently in synthetic autonomous-driving data. The technical scheme adopted to solve this problem is a synthetic-instance and street-view fusion coordination method based on a video diffusion model, comprising the following steps: S1, acquiring autonomous-driving multi-source data, constructing a scene Gaussian database and an asset Gaussian database based on the multi-source data, and calibrating the asset Gaussian database against the multi-source data; S2, fusing the scene Gaussian database and the asset Gaussian database to obtain a fused Gaussian scene