CN-121982223-A - Intelligent driving scene generation method based on feedforward reconstruction and space-time diffusion

CN121982223A

Abstract

The invention belongs to the field of autonomous driving data generation in the new-generation information technology industry and discloses a method for generating intelligent driving scenes based on feedforward reconstruction and space-time diffusion. Plücker ray coordinates are introduced to endow the DINOv backbone network with 3D geometric perception. The features extracted by the backbone are combined with a Gaussian decoding head, a dynamic head, and a sky head to explicitly decouple the static background, the dynamic objects, and the distant sky in the scene, thereby resolving the inconsistent cross-view geometry and insufficient cross-time continuity that arise in dynamic scene reconstruction and realizing three-dimensional reconstruction of dynamic scenes. The invention provides training samples that address the unstable target recognition and poor robustness of perception algorithms in sudden scenarios.

Inventors

  • LI LINHUI
  • ZANG QIUYU
  • LIAN JING
  • ZHAO JIAN
  • LIU JUNYUAN

Assignees

  • Dalian University of Technology (大连理工大学)

Dates

Publication Date
2026-05-05
Application Date
2026-04-03

Claims (4)

  1. A method for generating intelligent driving scenes based on feedforward reconstruction and space-time diffusion, characterized by comprising the following steps (code sketches of steps 1.1 and 1.2 follow the claims):

Step 1, extract space-time latent features from the multi-view data.

Step 1.1, construct a multi-modal input tensor encoding geometry and time. For each pixel in the multi-view images, construct a three-dimensional ray that starts at the camera optical center and passes through the pixel, and compute the Plücker coordinates of each ray. For each pixel, the Plücker coordinates consist of a ray direction vector $\mathbf{d}$ and a ray moment vector $\mathbf{m}$, computed as

$$\mathbf{d} = \frac{\mathbf{R}\,\mathbf{K}^{-1}(u, v, 1)^{\top}}{\left\lVert \mathbf{R}\,\mathbf{K}^{-1}(u, v, 1)^{\top} \right\rVert}, \qquad \mathbf{m} = \mathbf{t} \times \mathbf{d},$$

where $\mathbf{R}$ and $\mathbf{t}$ are the rotation matrix and translation vector from the camera coordinate system to the world coordinate system; $(u, v)$ are the horizontal and vertical coordinates in the pixel coordinate system, a two-dimensional rectangular coordinate system whose origin is the upper-left corner of the image, whose axes run along the horizontal and vertical directions of the image, and whose unit is pixels; $(c_x, c_y)$ are the coordinates of the camera optical center in the pixel coordinate system; and $\mathbf{K}$ is the camera intrinsic matrix. The normalized 6-channel Plücker coordinates and the 3-channel multi-view images are concatenated along the channel dimension to form a 9-channel multi-modal input tensor $\mathbf{X} \in \mathbb{R}^{9 \times H \times W}$, where $\mathbb{R}$ denotes the real number field, $H$ the height of the multi-view images, and $W$ their width.

Step 1.2, extract features with the DINOv backbone network. First, the multi-modal input tensor is mapped into a patch-token sequence by a patch-embedding convolution layer. A time-step embedding module, composed of a sine-cosine encoding layer, a first fully connected layer, a SiLU activation layer, and a second fully connected layer connected in sequence, maps the time step $t$ to a high-dimensional feature vector $\mathbf{f}_t$:

$$\mathbf{f}_t = \mathbf{W}_2\,\mathrm{SiLU}\!\left(\mathbf{W}_1\,\mathrm{PE}(t) + \mathbf{b}_1\right) + \mathbf{b}_2,$$

where $t$ is the time step, $\mathrm{PE}(\cdot)$ is the sine-cosine encoding layer, $\mathbf{W}_1$ and $\mathbf{W}_2$ are the first and second fully connected layers, SiLU denotes the SiLU activation layer, $\mathbf{b}_1$ and $\mathbf{b}_2$ are the learnable bias vectors of the first and second fully connected layers respectively, $d$ is the dimension of the sine-cosine encoding, and $\omega_i$ is the $i$-th frequency component of $\mathrm{PE}(\cdot)$. Then the multi-modal input tensor $\mathbf{X}$ is passed through the patch-sequence encoding, the resulting tokens are added element-wise to the high-dimensional feature vector $\mathbf{f}_t$ of the corresponding time step to obtain patch features fused with temporal information, and finally the patch features are fed into the DINOv backbone network for feature extraction.

Step 2, decode the space-time latent features into panoramic Gaussian parameters.

Step 3, construct the control conditions for local repainting.

Step 4, perform conditional denoising with space-time decomposition in the Gaussian latent space.
  2. The method for generating intelligent driving scenes based on feedforward reconstruction and space-time diffusion according to claim 1, wherein step 2 is implemented as follows (a sketch of the dynamic head follows the claims):

Step 2.1, decode explicit scene attributes with the Gaussian decoding head. The features extracted by the DINOv backbone network are decoded into the geometric and appearance parameters of 3D Gaussians by the Gaussian decoding head, which consists of a multi-layer perceptron and a feature reconstruction module; the feature reconstruction module rearranges the extracted features into a two-dimensional feature map according to their spatial correspondence in the multi-view images. The geometric and appearance parameters of a 3D Gaussian comprise depth $d$, scale $\mathbf{s}$, rotation quaternion $\mathbf{q}$, opacity $\alpha$, and color $\mathbf{c}$.

Step 2.2, decode dynamic physical attributes with the dynamic head. A dynamic head is constructed in parallel with the Gaussian decoding head, its network structure kept consistent with that of the Gaussian decoding head, and it decodes dynamic physical attributes to fit the target vehicles in the intelligent driving scene: a dynamic probability $p$, a world velocity $\mathbf{v}$, and a lifecycle $\tau$. The dynamic probability $p$ is processed by a Sigmoid activation function and is used in the subsequent rendering stage to distinguish static background regions from target-vehicle regions; the world velocity $\mathbf{v}$ represents the instantaneous velocity vector of a Gaussian sphere in world coordinates and is range-constrained by a hyperbolic tangent function multiplied by a maximum speed $v_{\max}$; the lifecycle $\tau$ controls the existence window of the Gaussian sphere on the time axis so as to simulate the appearance and disappearance of target vehicles.

Step 2.3, reconstruct the sky with the sky head. A sky head is constructed for parallel decoding to reconstruct the sky regions in the multi-view images. The sky head adopts a hemispherical sampling mechanism that samples a fixed number of direction vectors on a hemisphere centered at the camera optical center, maps them into the world coordinate system, and projects them back into the multi-view images to sample color features; combined with a multi-layer perceptron, the sampled color $\mathbf{c}$, scale $\mathbf{s}$, and opacity $\alpha$ are adjusted to generate a set of Gaussian spheres for rendering the sky.
  3. The method for generating intelligent driving scenes based on feedforward reconstruction and space-time diffusion according to claim 2, wherein step 3 is implemented as follows (a sketch of the depth-consistent mask follows the claims):

Step 3.1, encode the bird's-eye-view global constraint condition. Bird's-eye-view features comprising the occupancy grid, lane-line topology, and velocity-field distribution are obtained on a bird's-eye-view grid and extracted with a lightweight convolutional neural network encoder composed of cascaded residual modules and downsampling convolution layers. The encoder reduces the spatial resolution of the bird's-eye-view features layer by layer while increasing the number of channels, and the spatial dimension of the embedded output is adjusted to obtain the bird's-eye-view condition feature $\mathbf{c}_{\mathrm{bev}}$. The bird's-eye-view condition feature $\mathbf{c}_{\mathrm{bev}}$ provides global geometric constraints for the denoising network, ensuring that the generated target vehicles and their motion states conform to the physical topology of the road.

Step 3.2, construct a depth-aware editing mask condition. According to the motion trajectory and size of the target vehicle in three-dimensional space, the projection region of its 3D bounding box onto the pixel coordinate system is computed at each moment to generate an initial binary mask $M_{\mathrm{init}}$. A depth-culling mechanism is introduced to resolve front-rear occlusion in the intelligent driving scene, constructing a depth-consistent editing mask condition $M$:

$$M = M_{\mathrm{init}} \land \left(d_{\mathrm{tgt}} \le D\right),$$

where $\land$ denotes the logical AND operation, $d_{\mathrm{tgt}}$ is the depth of the target vehicle, and $D$ is the depth ground truth of the multi-view images. The editing mask condition $M$ is then dilated asymmetrically so that the mask region covers the target vehicle and the shadow region produced beneath it, ensuring that the generated vehicle and the road surface exhibit natural light-shadow interaction.

Step 3.3, encode the text semantic condition. The appearance and behavior attributes of the target vehicle are described with a large language model and fed into a pre-trained text encoder to obtain a semantic feature sequence; the text encoder encodes the input text into a semantic feature sequence matched to the patch features, and the sequence is pooled and aggregated into a global semantic representation $\mathbf{g}$. The global semantic representation $\mathbf{g}$ is input, as a semantic guidance vector, to a condition fusion module, where together with the bird's-eye-view condition feature $\mathbf{c}_{\mathrm{bev}}$ and the editing mask condition $M$ it participates in cross-modal interaction and feature modulation to obtain the feature vector $\mathbf{c}$.
  4. The method for generating intelligent driving scenes based on feedforward reconstruction and space-time diffusion according to claim 3, wherein step 4 is implemented as follows (a sketch of the decomposed attention and backfill follows the claims):

Step 4.1, space-time decomposed conditional denoising network. The latent features of the 3D Gaussian geometric and appearance parameters, $\mathbf{Z} \in \mathbb{R}^{T \times N \times C}$, are taken as the denoising object, where $T$ is the number of time frames, $N$ is the number of Gaussian spheres, and $C$ is the parameter dimension of a single Gaussian. The feature vector $\mathbf{c}$ constructed in step 3 is introduced as the conditional guidance signal: the denoising network accepts the latent features $\mathbf{Z}$ and the feature vector $\mathbf{c}$, outputs a noise estimate, and uses it to update the latent features, iterating until the denoised latent features $\mathbf{Z}_0$ are obtained. The denoising network is composed of multiple Transformer modules; the feature vector $\mathbf{c}$ is injected into every Transformer layer through conditional normalization and an affine modulation mechanism, and each Transformer module adopts a space-time decomposed self-attention mechanism:

$$\tilde{\mathbf{z}}_{t,i} = f_{\mathrm{time}}\!\left(f_{\mathrm{space}}\!\left(\mathbf{z}_{t,i}\right)\right),$$

where $\tilde{\mathbf{z}}_{t,i}$ is the latent feature of the $i$-th Gaussian sphere at time step $t$ after joint spatial and temporal attention modeling, $\mathbf{z}_{t,i}$ is the latent feature of the $i$-th Gaussian sphere at time step $t$, $f_{\mathrm{space}}$ denotes the spatial attention mechanism, and $f_{\mathrm{time}}$ denotes the temporal attention mechanism. The spatial attention mechanism $f_{\mathrm{space}}$ is

$$f_{\mathrm{space}}\!\left(\mathbf{z}_{t,i}\right) = \sum_{j} A^{\mathrm{space}}_{ij}\,\mathbf{v}_{t,j},$$

where $\mathbf{v}_{t,j}$ is the value vector obtained from the latent feature by linear mapping, and the spatial attention weight $A^{\mathrm{space}}_{ij}$ is defined as

$$A^{\mathrm{space}}_{ij} = \mathrm{softmax}_{j}\!\left(\frac{\phi_q(\mathbf{z}_{t,i})^{\top}\,\phi_k(\mathbf{z}_{t,j})}{\sqrt{d_k}}\right),$$

where $\phi_q$ and $\phi_k$ are the linear projection functions of the query and key respectively, $d_k$ is the feature dimension of the query and key vectors, and $\mathrm{softmax}_{j}$ denotes normalization over the indices $j$ of all Gaussian spheres within the same time step $t$. The temporal self-attention mechanism $f_{\mathrm{time}}$ is

$$f_{\mathrm{time}}\!\left(\mathbf{z}_{t,i}\right) = \sum_{t'} A^{\mathrm{time}}_{tt'}\,\mathbf{v}_{t',i},$$

where $\mathbf{z}_{t',i}$ is the latent representation of the $i$-th Gaussian sphere at the different moments $t'$, and the temporal attention weight is

$$A^{\mathrm{time}}_{tt'} = \mathrm{softmax}_{t'}\!\left(\frac{\phi_q(\mathbf{z}_{t,i})^{\top}\,\phi_k(\mathbf{z}_{t',i})}{\sqrt{d_k}}\right),$$

where $\mathrm{softmax}_{t'}$ denotes normalization along the time dimension $t'$.

Step 4.2, feature fusion and Gaussian parameter decoding. Given the preset trajectory, a feature backfilling operation is applied at the generated target-vehicle positions using the editing mask condition $M$:

$$\hat{\mathbf{Z}} = M \odot \mathbf{Z}_0 + (1 - M) \odot \mathbf{Z}_{\mathrm{rec}},$$

where $\mathbf{Z}_{\mathrm{rec}}$ are the latent features of the regions that are left unchanged when the latent features are updated by the noise estimate in step 4.1, and $\odot$ denotes element-wise multiplication. The fused features $\hat{\mathbf{Z}}$ are input to the Gaussian decoding head of step 2 to generate the edited 4D scene Gaussian parameters, and a differentiable rasterizer renders the video stream containing the newly generated vehicle.
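
The following is a minimal sketch of the Plücker-coordinate construction of step 1.1 in claim 1, assuming a standard pinhole camera model. The function names, NumPy implementation, and shape conventions are illustrative assumptions, not the patent's exact implementation.

```python
import numpy as np

def plucker_rays(K, R, t, H, W):
    """Per-pixel Pluecker coordinates (d, m) of camera rays in the world frame.

    K: 3x3 intrinsic matrix; R: 3x3 camera-to-world rotation;
    t: (3,) camera optical center in world coordinates.
    Returns a (6, H, W) array: 3 direction channels + 3 moment channels.
    """
    u, v = np.meshgrid(np.arange(W), np.arange(H))            # pixel grid (H, W)
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1)
    d = R @ np.linalg.inv(K) @ pix                            # ray directions in world frame
    d = d / np.linalg.norm(d, axis=0, keepdims=True)          # normalize directions
    m = np.cross(t[:, None], d, axis=0)                       # moments m = t x d
    return np.concatenate([d, m], axis=0).reshape(6, H, W)

def multimodal_input(image, K, R, t):
    """Concatenate a (3, H, W) RGB image with the 6-channel Pluecker map -> (9, H, W)."""
    H, W = image.shape[1:]
    return np.concatenate([image, plucker_rays(K, R, t, H, W)], axis=0)
```

Stacking the 9-channel tensors of all views along a view axis would then give the multi-view input that claim 1 feeds to the patch-embedding layer.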
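A hedged sketch of the time-step embedding module of step 1.2 (sine-cosine encoding, first fully connected layer, SiLU, second fully connected layer). The embedding width (256) and the PyTorch layer choices are assumptions; the patent fixes only the layer sequence.

```python
import torch
import torch.nn as nn

class TimestepEmbed(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.dim = dim
        self.fc1 = nn.Linear(dim, dim)   # first fully connected layer (W1, b1)
        self.act = nn.SiLU()             # SiLU activation layer
        self.fc2 = nn.Linear(dim, dim)   # second fully connected layer (W2, b2)

    def sincos(self, tstep: torch.Tensor) -> torch.Tensor:
        """PE(t): sin/cos at geometric frequencies omega_i = 10000^(-2i/d)."""
        half = self.dim // 2
        freqs = torch.exp(-torch.arange(half) * (torch.log(torch.tensor(10000.0)) / half))
        args = tstep[:, None].float() * freqs[None, :]
        return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

    def forward(self, tstep: torch.Tensor) -> torch.Tensor:
        # f_t = W2 * SiLU(W1 * PE(t) + b1) + b2; added element-wise to patch tokens
        return self.fc2(self.act(self.fc1(self.sincos(tstep))))
```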
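A sketch of the dynamic head of step 2.2, showing the Sigmoid-activated dynamic probability, the tanh-bounded world velocity, and a lifecycle output. The layer widths, the speed bound `v_max`, and the two-parameter lifecycle representation are illustrative assumptions; the patent only states the output attributes and their activations.

```python
import torch
import torch.nn as nn

class DynamicHead(nn.Module):
    """Per-token MLP mirroring the structure of the Gaussian decoding head."""

    def __init__(self, feat_dim: int = 768, v_max: float = 30.0):
        super().__init__()
        self.v_max = v_max  # assumed maximum speed bound (m/s)
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.SiLU(),
            nn.Linear(256, 1 + 3 + 2),  # dynamic prob, world velocity, lifecycle
        )

    def forward(self, feats: torch.Tensor):
        out = self.mlp(feats)
        p_dyn = torch.sigmoid(out[..., :1])                # dynamic probability in (0, 1)
        v_world = torch.tanh(out[..., 1:4]) * self.v_max  # range-constrained world velocity
        lifecycle = out[..., 4:6]                          # existence window on the time axis
        return p_dyn, v_world, lifecycle
```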
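A sketch of the depth-consistent editing mask of step 3.2: depth culling via logical AND with a depth comparison, followed by asymmetric dilation so the mask extends downward over the vehicle's shadow region. The SciPy-based implementation and the kernel geometry are assumptions.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def depth_consistent_mask(m_init, d_target, d_scene, shadow_px: int = 8):
    """m_init: (H, W) bool projection of the target's 3D box;
    d_target: per-pixel target depth; d_scene: depth ground truth of the view."""
    m_depth = m_init & (d_target <= d_scene)       # depth culling: drop occluded pixels
    # asymmetric structuring element: grows the mask downward only,
    # so the dilated region covers the shadow cast below the vehicle
    kernel = np.zeros((2 * shadow_px + 1, 7), dtype=bool)
    kernel[shadow_px:, :] = True                   # center row plus the lower half
    return binary_dilation(m_depth, structure=kernel)
```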
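A sketch of the space-time decomposed self-attention and the feature backfill of claim 4. The shapes ($T$ frames, $N$ Gaussians, $C$ channels) follow the claim; the use of `nn.MultiheadAttention` and the head count are assumptions. Spatial attention runs over the Gaussians within one time step, temporal attention over the time steps of one Gaussian.

```python
import torch
import torch.nn as nn

class SpaceTimeAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (T, N, C) -- time frames treated as batch for spatial attention
        s, _ = self.spatial(z, z, z)        # attend over Gaussians j at fixed t
        s = s.transpose(0, 1)               # (N, T, C): per-Gaussian time sequences
        out, _ = self.temporal(s, s, s)     # attend over time steps t' at fixed i
        return out.transpose(0, 1)          # back to (T, N, C)

def backfill(z_pred, z_rec, mask):
    """Z = M * Z_pred + (1 - M) * Z_rec (element-wise): keep the reconstructed
    features untouched outside the edit region defined by the mask."""
    return mask * z_pred + (1.0 - mask) * z_rec
```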

Description

Intelligent driving scene generation method based on feedforward reconstruction and space-time diffusion

Technical Field

The invention belongs to the field of autonomous driving data generation in the new-generation information technology industry and relates to a method for generating intelligent driving scenes based on feedforward reconstruction and space-time diffusion.

Background

With the development of autonomous driving within the new-generation information technology industry, current perception algorithms have shown fairly mature capability in high-frequency conventional scenes such as urban traffic flow and standard structured roads. However, when facing uncertain long-tail scenarios such as the forced cut-in of an adjacent vehicle or a pedestrian crossing from a blind zone, intelligent driving perception algorithms still suffer from robustness problems: unstable target recognition and serious missed or false detection of key targets. Such complex, dynamic, uncertain scenarios have become the core bottleneck preventing autonomous driving from advancing beyond specific conventional scenarios toward full-scene intelligence. Because autonomous-driving long-tail data occur at low frequency and are difficult to simulate, data collection is costly and cannot cover the many highly uncertain events within a short time. Current image-editing and video-generation schemes lack strong physical-geometric constraints, so the generated image sequences struggle to maintain geometric consistency across camera viewpoints; the generated objects exhibit physical errors such as deformation and motion trajectories that violate physical laws; and such methods can hardly provide high-dimensional spatial information, which severely limits their benefit in closed-loop simulation and algorithm verification. Autonomous-driving data generation based on generative reconstruction, with its advantage of strong geometric and temporal consistency, is gradually becoming the solution to the long-tail problem. There is therefore a need for a 4D scene-editing generation method based on generative reconstruction that can accurately generate long-tail event data according to a preset driving intention while guaranteeing multi-view geometric and temporal consistency in dynamic environments. Such a method can provide the perception algorithm with high-value training samples that combine visual fidelity and spatial-logic consistency, thereby alleviating the robustness problems of autonomous driving perception in the face of urban uncertainty events.
Disclosure of Invention

In order to solve the problems in the prior art, the invention provides a method for generating intelligent driving scenes based on feedforward reconstruction and space-time diffusion, which achieves high-fidelity three-dimensional reconstruction of complex dynamic environments by constructing a DINOv-based feedforward network and introducing Gaussian lifecycle attributes, and at the same time achieves controllable generation of targets such as target vehicles in the three-dimensional scene by means of multi-source control conditions and a space-time diffusion editing mechanism.

The technical scheme of the invention is as follows:

A method for generating intelligent driving scenes based on feedforward reconstruction and space-time diffusion comprises the following steps:

Step 1, extract space-time latent features from the multi-view data.

Step 1.1, construct a multi-modal input tensor encoding geometry and time. For each pixel in the multi-view images, construct a three-dimensional ray that starts at the camera optical center and passes through the pixel, and compute the Plücker coordinates of each ray. For each pixel, the Plücker coordinates consist of a ray direction vector $\mathbf{d}$ and a ray moment vector $\mathbf{m}$:

$$\mathbf{d} = \frac{\mathbf{R}\,\mathbf{K}^{-1}(u, v, 1)^{\top}}{\left\lVert \mathbf{R}\,\mathbf{K}^{-1}(u, v, 1)^{\top} \right\rVert}, \qquad \mathbf{m} = \mathbf{t} \times \mathbf{d},$$

where $\mathbf{R}$ and $\mathbf{t}$ are the rotation matrix and translation vector from the camera coordinate system to the world coordinate system; $(u, v)$ are the coordinates in the pixel coordinate system, a two-dimensional rectangular coordinate system whose origin is the upper-left corner of the image, whose axes run along the horizontal and vertical directions of the image, and whose unit is pixels; and $(c_x, c_y)$ are the coordinates of the camera optical center in the pixel coordinate system.