CN-120931789-B - Driving scene reconstruction method and device, electronic equipment and program product

CN120931789B

Abstract

An embodiment of the disclosure provides a driving scene reconstruction method comprising: obtaining input data and output data of a video generation model at the Nth time step and input data at the (N+1)th time step; writing the output data of the Nth time step into a cache; determining an estimated value of the output data of the (N+1)th time step through a mapping function; comparing a fluctuation difference between the output data of the Nth time step and the estimated value; determining, based on whether the fluctuation difference satisfies a preset condition, whether the cached output data of the Nth time step is reused at the (N+1)th time step, so as to obtain the output data of the (N+1)th time step; and obtaining a first image sequence and reconstructing a driving scene based on the first image sequence. The method improves the inference efficiency of the video generation model and reduces the demand on computing resources.

Inventors

  • ZHU ZHENG
  • HUANG GUAN
  • ZHU JIAGANG
  • WANG XIAOFENG
  • WANG WEIJIE

Assignees

  • 北京极佳视界科技有限公司

Dates

Publication Date
2026-05-05
Application Date
2025-07-25

Claims (14)

  1. A driving scene reconstruction method, characterized in that the method comprises: obtaining at least one condition feature corresponding to at least one condition, inputting the at least one condition feature into a ControlNet model, obtaining a fused first feature through the ControlNet model, and embedding the first feature into an intermediate layer of a video generation model; acquiring input data and output data of the video generation model at the Nth time step and input data at the (N+1)th time step, wherein N is an integer and N ≥ 1; writing the output data of the Nth time step into a cache; determining an estimated value of the output data of the (N+1)th time step through a mapping function according to the input data and the output data of the Nth time step and the input data of the (N+1)th time step, wherein the mapping function is obtained in advance by fitting samples of a plurality of input data and a plurality of output data of the video generation model; comparing a fluctuation difference between the output data of the Nth time step and the estimated value of the output data of the (N+1)th time step; determining whether to reuse the output data of the Nth time step stored in the cache at the (N+1)th time step based on whether the fluctuation difference satisfies a preset condition, so as to obtain the output data of the (N+1)th time step; and obtaining a first image sequence and reconstructing a driving scene based on the first image sequence, wherein the first image sequence comprises the output data from the 1st time step to the (N+1)th time step.
  2. The method of claim 1, wherein, before determining the estimated value of the output data of the (N+1)th time step through the mapping function, the method further comprises: acquiring a plurality of input data and a plurality of corresponding output data of the video generation model as a plurality of samples; acquiring, as a first variable, the ratio of the difference between the input data of the (N+1)th time step and the input data of the Nth time step in the plurality of samples to the input data of the Nth time step; acquiring, as a second variable, the ratio of the difference between the output data of the (N+1)th time step and the output data of the Nth time step in the plurality of samples to the output data of the Nth time step; and fitting a plurality of first variables and a plurality of second variables by a least squares method to obtain a multi-order polynomial as the mapping function, wherein the mapping function represents a functional relation from the first variable to the second variable.
  3. The method of claim 1, wherein determining whether to reuse the output data of the Nth time step stored in the cache at the (N+1)th time step based on whether the fluctuation difference between the output data of the Nth time step and the estimated value of the output data of the (N+1)th time step satisfies a preset condition comprises: determining that the output data of the Nth time step stored in the cache is reused at the (N+1)th time step when the fluctuation difference satisfies the preset condition; or determining that the output data of the Nth time step stored in the cache is not reused at the (N+1)th time step when the fluctuation difference does not satisfy the preset condition; and wherein, after determining whether to reuse the output data of the Nth time step stored in the cache at the (N+1)th time step, the method further comprises: in response to determining that the output data of the Nth time step is reused at the (N+1)th time step, reusing the output data of the Nth time step stored in the cache as the output data of the (N+1)th time step; or, in response to determining that the output data of the Nth time step stored in the cache is not reused at the (N+1)th time step, inputting the input data of the (N+1)th time step into the video generation model so that the video generation model performs the inference process of the (N+1)th time step and outputs the output data of the (N+1)th time step.
  4. The method of claim 1, wherein the video generation model is a video diffusion model comprising an attention layer, and wherein, before acquiring the input data and the output data of the video generation model at the Nth time step and the input data at the (N+1)th time step, the method further comprises: executing the inference process of at least one time step from the 1st time step to the Nth time step through the video diffusion model; and, when performing the inference of any one of the at least one time step: converting the query vector Q and the key vector K processed by the attention layer into 4-bit integer (INT4) data; and converting the value vector V and the attention score vector P processed by the attention layer into 8-bit floating point (FP8) data, wherein the query vector Q, the key vector K and the value vector V are obtained by mapping the input data with different weight matrices, and the attention score vector P is obtained by computing the inner product of the query vector Q and the key vector K and then normalizing the result.
  5. The method of claim 1, wherein the inference process of each of the 1st to (N+1)th time steps comprises a first inference process with conditional embedding and a second inference process with unconditional embedding; acquiring the input data and the output data of the video generation model at the Nth time step and the input data at the (N+1)th time step comprises: acquiring first input data and first output data corresponding to the first inference process of the video generation model at the Nth time step, and first input data at the (N+1)th time step; writing the output data of the Nth time step into the cache comprises: writing the first output data obtained in the first inference process corresponding to the Nth time step into the cache; determining the estimated value of the output data of the (N+1)th time step through the mapping function comprises: determining, through the mapping function, a first estimated value of the first output data corresponding to the first inference process at the (N+1)th time step according to the first input data and the first output data of the Nth time step and the first input data of the (N+1)th time step; and determining whether to reuse the output data of the Nth time step stored in the cache at the (N+1)th time step comprises: determining whether to reuse, at the (N+1)th time step, the cached first output data obtained by executing the first inference process at the Nth time step based on whether the fluctuation difference between that first output data and the first estimated value satisfies the preset condition.
  6. The method of any one of claims 1-5, wherein, after obtaining the first image sequence, the method further comprises: obtaining image features based on the first image sequence; inputting the image features into a pre-trained depth estimation network and a pre-trained pose estimation network respectively, outputting depth information through the depth estimation network and pose information through the pose estimation network; inputting the depth information, the pose information and the image features into a pre-trained Gaussian prediction network, and obtaining, through the Gaussian prediction network, target parameters for displaying a 3D Gaussian point cloud, wherein the Gaussian prediction network is a residual network comprising attention calculation and convolution operations; and reconstructing a driving scene based on the target parameters and time information.
  7. The method of claim 6, wherein the target parameters comprise a plurality of covariances, positions, opacities and spherical harmonic coefficients; the Gaussian prediction network comprises a plurality of groups of residual networks, wherein one group of residual networks among the plurality of groups comprises a plurality of residual blocks connected in series and a head connected to the plurality of residual blocks, and each group of residual networks correspondingly outputs one target parameter; one of the plurality of residual blocks comprises a first convolution layer, a first ReLU layer, an attention layer and a batch normalization layer; at least two of the plurality of residual blocks are connected by skip connections; and the head comprises a second convolution layer and a second ReLU layer.
  8. The method of claim 6, wherein acquiring the image features corresponding to the first image sequence comprises: decoding the first image sequence to obtain a second image sequence in a 2D video format; encoding, based on a temporal attention mechanism, the (M-1)th frame, the Mth frame and the (M+1)th frame of the second image sequence to obtain an encoded Mth frame image, wherein M is an integer and M ≥ 2; encoding the first frame image and the second frame image of the second image sequence to obtain an encoded first frame image; and taking the encoded multi-frame images as the image features.
  9. A driving scene reconstruction method, characterized in that the method comprises: acquiring image features based on a pre-acquired video or a first image sequence generated in real time; inputting the image features into a pre-trained depth estimation network and a pre-trained pose estimation network respectively, outputting depth information through the depth estimation network and pose information through the pose estimation network; inputting the depth information, the pose information and the image features into a pre-trained Gaussian prediction network, and obtaining, through the Gaussian prediction network, target parameters for displaying a 3D Gaussian point cloud, wherein the Gaussian prediction network is a residual network comprising attention calculation and convolution operations; and reconstructing a driving scene based on the target parameters and time information.
  10. A driving scene reconstruction device, characterized in that the device comprises: a condition embedding module configured to acquire at least one condition feature corresponding to at least one condition, input the at least one condition feature into a ControlNet model, obtain a fused first feature through the ControlNet model, and embed the first feature into an intermediate layer of a video generation model; an acquisition module configured to acquire input data and output data of the video generation model at the Nth time step and input data at the (N+1)th time step, wherein N is an integer and N ≥ 1; a cache module configured to write the output data of the Nth time step into a cache; a mapping module configured to determine an estimated value of the output data of the (N+1)th time step through a mapping function according to the input data and the output data of the Nth time step and the input data of the (N+1)th time step, wherein the mapping function is obtained in advance by fitting a plurality of input data and a plurality of output data of the video generation model; a comparison module configured to compare a fluctuation difference between the output data of the Nth time step and the estimated value of the output data of the (N+1)th time step; a determination module configured to determine whether the output data of the Nth time step stored in the cache is reused at the (N+1)th time step based on whether the fluctuation difference satisfies a preset condition, so as to obtain the output data of the (N+1)th time step; and an output module configured to obtain a first image sequence and reconstruct a driving scene based on the first image sequence, wherein the first image sequence comprises the output data from the 1st time step to the (N+1)th time step.
  11. A driving scene reconstruction device, characterized in that the device comprises: a feature fusion module configured to acquire image features corresponding to a pre-acquired video or a first image sequence generated in real time; a parameter estimation module configured to input the image features into a pre-trained depth estimation network and a pre-trained pose estimation network respectively, output depth information through the depth estimation network and output pose information through the pose estimation network; a parameter prediction module configured to input the depth information and the pose information into a pre-trained Gaussian prediction network and obtain, through the Gaussian prediction network, target parameters for displaying a 3D Gaussian point cloud, wherein the Gaussian prediction network is a residual network comprising attention calculation and convolution operations; and a reconstruction module configured to reconstruct a driving scene based on the target parameters and time information.
  12. An electronic device, comprising: a memory for storing a computer program product; and a processor for executing the computer program product stored in the memory, wherein the computer program product, when executed, implements the method of any one of claims 1-9.
  13. A computer readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method of any one of claims 1-9.
  14. A computer program product comprising computer program instructions which, when executed by a processor, implement the method of any one of claims 1-9.
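
The least-squares fitting described in claim 2 can be illustrated with a short sketch. The following Python sketch is illustrative only, not the patent's implementation: it assumes each time step's input and output have been summarized as a single scalar (e.g. a norm), and the polynomial order is an arbitrary choice.

```python
import numpy as np

def fit_mapping_function(input_summary, output_summary, order=3):
    """Least-squares fit of a polynomial mapping from the relative input
    change between consecutive time steps (first variable) to the relative
    output change (second variable)."""
    x = np.asarray(input_summary, dtype=float)
    y = np.asarray(output_summary, dtype=float)
    first_var = (x[1:] - x[:-1]) / x[:-1]    # first variable per claim 2
    second_var = (y[1:] - y[:-1]) / y[:-1]   # second variable per claim 2
    coeffs = np.polyfit(first_var, second_var, order)  # least squares
    return np.poly1d(coeffs)
```

For instance, if the output scales proportionally with the input, the fitted mapping is close to the identity function on relative changes.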
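
The mixed-precision conversion in claim 4 (Q and K to INT4, V and P to FP8) can be sketched as follows. This is an illustrative simulation, not the patent's kernel: it uses symmetric per-tensor INT4 quantization, and an FP8 E4M3-like rounding implemented by mantissa truncation; the E4M3 format and its 448 saturation bound are assumptions beyond the claim text.

```python
import numpy as np

def quantize_int4(t):
    """Symmetric per-tensor quantization to 4-bit integers in [-8, 7]."""
    scale = np.abs(t).max() / 7.0 + 1e-12
    q = np.clip(np.round(t / scale), -8, 7).astype(np.int8)
    return q, scale

def simulate_fp8_e4m3(t):
    """Round values onto an FP8 E4M3-like grid (3 mantissa bits,
    max normal value 448) by truncating the mantissa."""
    m, e = np.frexp(np.clip(t, -448.0, 448.0))
    m = np.round(m * 16.0) / 16.0   # keep ~4 significant mantissa bits
    return np.ldexp(m, e)
```

Dequantizing with `q * scale` recovers the original tensor up to one quantization step.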
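
The residual block of claim 7 (convolution, ReLU, attention and batch normalization, with a skip connection) can be sketched in miniature. This NumPy sketch is illustrative only: it replaces the spatial convolution with a 1x1 channel-mixing matrix and uses a minimal single-head attention with Q = K = V, both assumptions not stated in the claim.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def batch_norm(x, eps=1e-5):
    """Normalize each channel (row) to zero mean and unit variance."""
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(x):
    """Minimal single-head attention over the spatial axis; Q = K = V = x."""
    scores = x.T @ x / np.sqrt(x.shape[0])
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)
    return x @ w.T

def residual_block(x, w_conv):
    """conv(1x1) -> ReLU -> attention -> batch norm, plus skip connection."""
    h = relu(w_conv @ x)   # 1x1 convolution as channel mixing
    h = self_attention(h)
    h = batch_norm(h)
    return x + h           # skip connection
```

Here `x` has shape (channels, positions); several such blocks in series, followed by a convolution-plus-ReLU head, would mirror one group of residual networks in the claim.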

Description

Driving scene reconstruction method and device, electronic equipment and program product

Technical Field

The disclosure relates to video generation technology, and in particular to a driving scene reconstruction method and device, electronic equipment and a program product.

Background

A 4D driving scene is a simulation of a four-dimensional dynamic driving environment, formed by adding a time dimension to the traditional three spatial dimensions, and is mainly used to improve the perception and response capabilities of an autonomous driving system. One driving scene reconstruction approach first generates a 2D driving video and then reconstructs the dynamic scene by separating foreground dynamic objects from the static background; however, the inference process that generates the 2D driving video places high demands on hardware computing resources.

Disclosure of Invention

The present disclosure has been made to solve the above technical problem. Embodiments of the disclosure provide a driving scene reconstruction method and device, electronic equipment and a program product.
According to a first aspect of the embodiments of the present disclosure, there is provided a driving scene reconstruction method, comprising: obtaining input data and output data of a video generation model at the Nth time step and input data at the (N+1)th time step, wherein N is an integer and N ≥ 1; writing the output data of the Nth time step into a cache; determining an estimated value of the output data of the (N+1)th time step through a mapping function according to the input data and the output data of the Nth time step and the input data of the (N+1)th time step, the mapping function being obtained in advance by fitting samples of a plurality of input data and a plurality of output data of the video generation model; comparing a fluctuation difference between the output data of the Nth time step and the estimated value of the output data of the (N+1)th time step; determining whether the output data of the Nth time step stored in the cache is reused at the (N+1)th time step based on whether the fluctuation difference satisfies a preset condition, so as to obtain the output data of the (N+1)th time step; and obtaining a first image sequence comprising the output data from the 1st time step to the (N+1)th time step, and reconstructing the driving scene based on the first image sequence.
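
The first aspect above can be sketched as a simple inference loop. The following Python sketch is illustrative, not the disclosed implementation: the relative-change form of the estimate, the L1 fluctuation measure and the threshold value are all assumptions made for the example.

```python
import numpy as np

def fluctuation_difference(prev_out, est_out):
    """Relative L1 change between the cached output and its estimate."""
    return np.abs(est_out - prev_out).sum() / (np.abs(prev_out).sum() + 1e-8)

def run_with_cache(model, inputs, mapping_fn, threshold=0.05):
    """Run the model over time steps, reusing the cached output whenever
    the estimated fluctuation stays below `threshold`."""
    outputs = []
    cache = model(inputs[0])          # the 1st time step always runs the model
    outputs.append(cache)
    for n in range(1, len(inputs)):
        x_prev, x_next = inputs[n - 1], inputs[n]
        # relative input change between step N and step N+1 (first variable)
        first_var = (x_next - x_prev) / (x_prev + 1e-8)
        # mapping function predicts the relative output change
        second_var = mapping_fn(first_var)
        estimate = cache * (1.0 + second_var)
        if fluctuation_difference(cache, estimate) < threshold:
            outputs.append(cache)     # reuse the cached output
        else:
            cache = model(x_next)     # full inference, refresh the cache
            outputs.append(cache)
    return outputs
```

With a slowly varying input the cached output is reused and the model call is skipped; an abrupt input change triggers a full inference step.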
According to a second aspect of the embodiments of the present disclosure, there is provided a driving scene reconstruction method, comprising: acquiring image features corresponding to a pre-acquired video or a first image sequence generated in real time; inputting the image features into a pre-trained depth estimation network and a pre-trained pose estimation network respectively, outputting depth information through the depth estimation network and pose information through the pose estimation network; inputting the depth information and the pose information into a pre-trained Gaussian prediction network, and obtaining, through the Gaussian prediction network, target parameters for displaying a 3D Gaussian point cloud, wherein the Gaussian prediction network is a residual network comprising attention calculation and convolution operations; and reconstructing a driving scene based on the target parameters and time information.

According to a third aspect of the embodiments of the present disclosure, there is provided a driving scene reconstruction device, comprising: an acquisition module configured to acquire input data and output data of a video generation model at the Nth time step and input data at the (N+1)th time step, wherein N is an integer and N ≥ 1; a cache module configured to write the output data of the Nth time step into a cache; a mapping module configured to determine an estimated value of the output data of the (N+1)th time step through a mapping function according to the input data and the output data of the Nth time step and the input data of the (N+1)th time step; a comparison module configured to compare a fluctuation difference between the output data of the Nth time step and the estimated value of the output data of the (N+1)th time step; a determination module configured to determine whether the output data of the Nth time step stored in the cache is reused at the (N+1)th time step based on whether the fluctuation difference satisfies a preset condition, so as to obtain the output data of the (N+1)th time step; and an output module configured to obtain a first image sequence and reconstruct a driving scene based on the first image sequence.

According to a fourth aspect of the embodiments of the present disclosure, there is provided a driving scene reconstruction device.
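
As one illustration of how the depth and pose information of the second aspect could seed a 3D Gaussian point cloud, the sketch below lifts a depth map to candidate Gaussian center positions through a pinhole camera model. This is a standard unprojection, not the patent's Gaussian prediction network; the intrinsics and camera-to-world pose conventions are assumptions made for the example.

```python
import numpy as np

def unproject_depth(depth, K, pose):
    """Lift a depth map to 3D points in world coordinates.
    depth: (H, W) depth map; K: 3x3 camera intrinsics;
    pose: 4x4 camera-to-world transform."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # homogeneous pixel coordinates, shape (3, H*W)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T
    # back-project to camera frame, scaled by the predicted depth
    cam = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])
    world = (pose @ cam_h)[:3].T
    return world   # (H*W, 3) candidate Gaussian center positions
```

In a pipeline such as the one claimed, positions obtained this way could be refined together with the predicted covariance, opacity and spherical harmonic parameters.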