
CN-121982215-A - Active three-dimensional reconstruction method based on model-based reinforcement learning and three-dimensional Gaussian splatting


Abstract

The invention relates to the technical fields of computer vision, three-dimensional reconstruction and artificial intelligence, and in particular to an active three-dimensional reconstruction method based on model-based reinforcement learning and three-dimensional Gaussian splatting. Scene geometric embeddings and viewpoint pose embeddings are extracted and concatenated into an observation vector, which is input to a world model to update its hidden state; candidate viewpoint action sequences are then rolled out imaginatively in the latent space to generate predicted trajectories. An actor network selects the viewpoint with the largest predicted long-term reward and captures an image from it, and the new image is used to optimize the reconstruction model. The observation is then updated, a potential-function-difference reward is calculated, and the experience is stored to optimize the model and the policy. This process is executed in a loop until the viewpoint budget is exhausted or the reconstruction converges, and the final three-dimensional model is output, achieving efficient, high-fidelity reconstruction.

Inventors

  • LI YUQI
  • LUO XIHAO

Assignees

  • Ningbo University (宁波大学)

Dates

Publication Date
2026-05-05
Application Date
2026-01-30

Claims (10)

  1. An active three-dimensional reconstruction method based on model-based reinforcement learning and three-dimensional Gaussian splatting, characterized by comprising the following steps: S1, constructing and training a latent world model based on a recurrent state-space model, and simulating the dynamics of the three-dimensional reconstruction process in a latent space; S2, performing incremental scene reconstruction on the RGB image of the currently acquired viewpoint by using a three-dimensional Gaussian splatting model to obtain a current scene model represented by a set of three-dimensional Gaussian primitives; S3, extracting geometric embedding features of the scene based on the current scene model, and concatenating them with the pose embedding features of the currently selected viewpoints to generate an observation vector at the current moment; S4, inputting the observation vector into the latent world model, updating its internal hidden state, and, based on the updated hidden state, performing imagination rollouts in the latent space over future multi-step action sequences formed from a candidate viewpoint action set, to generate a plurality of predicted future state and reward trajectories; S5, selecting, through an actor network, a next-best-view action that maximizes the predicted long-term cumulative reward, according to the candidate viewpoint action set and based on the predicted future state and reward trajectories, executing the next-best-view action, and acquiring an RGB image of the new viewpoint; S6, performing incremental optimization of the three-dimensional Gaussian splatting model using the RGB image of the new viewpoint to obtain an updated scene model; S7, extracting updated geometric embedding features based on the updated scene model, and concatenating them with the updated selected-viewpoint pose embedding features to generate an updated observation vector; S8, calculating, based on the updated scene model, a potential function value comprising a reconstruction accuracy index and a viewpoint coverage index, and calculating an immediate reward from the difference between the potential function values at the current and previous moments; S9, using the RGB image of the new viewpoint, the updated observation vector and the immediate reward as new experience data for iteratively optimizing the latent world model and the actor network; and S10, taking the updated observation vector as the observation vector at the current moment of the next decision round, repeatedly executing steps S4 to S9 until a preset number of viewpoint acquisitions or a reconstruction convergence condition is reached, and outputting the final three-dimensional reconstructed scene model.
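The control flow of steps S2-S10 can be illustrated with a minimal Python skeleton. Every helper below (`make_observation`, `update_hidden`, `best_view`, `refine_scene`) is a deliberately trivial stub standing in for the component the claim describes; only the loop structure mirrors the claimed method.

```python
def make_observation(scene, views):      # S3/S7 stub: geometry + pose embedding
    return (len(scene), tuple(views))

def update_hidden(hidden, obs):          # S4 stub: world-model state update
    return hidden + 1

def best_view(hidden, selected):         # S4-S5 stub: imagination rollout + actor pick
    return max(set(range(10)) - set(selected))

def refine_scene(scene, view):           # S6 stub: incremental splatting update
    return scene + [view]

def active_reconstruction(max_views=4):
    """Skeleton of the S2-S10 loop; all quantities are toy placeholders."""
    scene, selected = [], [0]
    hidden, obs = 0, make_observation(scene, selected)
    phi_prev, rewards = 0.0, []
    for _ in range(max_views):           # S10: repeat until the viewpoint budget
        hidden = update_hidden(hidden, obs)
        view = best_view(hidden, selected)
        selected.append(view)
        scene = refine_scene(scene, view)
        obs = make_observation(scene, selected)
        phi = float(len(scene))          # S8 stub: potential function value
        rewards.append(phi - phi_prev)   # potential-difference immediate reward
        phi_prev = phi
    return selected, rewards

print(active_reconstruction())  # ([0, 9, 8, 7, 6], [1.0, 1.0, 1.0, 1.0])
```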
  2. The active three-dimensional reconstruction method based on model-based reinforcement learning and three-dimensional Gaussian splatting according to claim 1, wherein in S1 the step of constructing and training a latent world model based on a recurrent state-space model specifically comprises: constructing a latent world model comprising an encoder, a recurrent state-space model (RSSM), an observation decoder, a reward predictor and a discount-factor predictor, wherein the RSSM maintains a deterministic state and a stochastic state at each time step; inputting the observation vector at the current moment into the encoder at each time step to obtain an encoded observation embedding; feeding the deterministic state and stochastic state generated by the RSSM at the previous time step, together with the observation embedding, into the RSSM at the current time step, and updating to obtain the deterministic state and stochastic state of the current time step, which together form the current hidden state; based on the current hidden state, outputting a predicted observation reconstruction of the current moment through the observation decoder, a predicted reward through the reward predictor, and a predicted discount factor through the discount-factor predictor; calculating a reconstruction loss between the predicted observation reconstruction and the real observation, a reward prediction loss between the predicted and real rewards, and a discount loss between the predicted discount factor and a preset value; and jointly optimizing the parameters of the latent world model based on the reconstruction loss, the reward prediction loss and the discount loss, so that the model learns to simulate the state evolution and reward feedback dynamics of three-dimensional reconstruction in the latent space.
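One RSSM transition of the kind this claim describes can be sketched numerically. This is an illustrative toy, not the patented model: the weight matrices, dimensions and tanh/softplus choices below are assumptions, and the stochastic state is drawn with the reparameterization trick from a diagonal Gaussian conditioned on the deterministic state.

```python
import numpy as np

rng = np.random.default_rng(0)

def rssm_step(h, z, obs_embed, W_h, W_z, W_o, W_mu, W_sigma):
    """One recurrent state-space step: update the deterministic state h from
    (h, z, obs_embed), then sample the stochastic state z from a diagonal
    Gaussian parameterized by the new h."""
    h_new = np.tanh(W_h @ h + W_z @ z + W_o @ obs_embed)  # deterministic path
    mu = W_mu @ h_new                                     # Gaussian mean
    sigma = np.log1p(np.exp(W_sigma @ h_new)) + 1e-4      # softplus std > 0
    z_new = mu + sigma * rng.standard_normal(mu.shape)    # reparameterized sample
    return h_new, z_new

# Toy sizes: 8-d deterministic state, 4-d stochastic state, 6-d observation embedding.
H, Z, O = 8, 4, 6
params = [rng.normal(0, 0.1, s) for s in [(H, H), (H, Z), (H, O), (Z, H), (Z, H)]]
h, z = np.zeros(H), np.zeros(Z)
h, z = rssm_step(h, z, rng.normal(size=O), *params)
print(h.shape, z.shape)  # (8,) (4,)
```

In the full model of the claim, (h, z) would additionally feed the observation decoder, reward predictor and discount-factor predictor, whose losses are summed for joint training.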
  3. The active three-dimensional reconstruction method based on model-based reinforcement learning and three-dimensional Gaussian splatting according to claim 1, wherein in S2 the step of performing incremental scene reconstruction on the RGB image of the currently acquired viewpoint by using the three-dimensional Gaussian splatting model to obtain the current scene model represented by a set of three-dimensional Gaussian primitives specifically comprises: initializing or maintaining a three-dimensional Gaussian splatting model, composed of a set of three-dimensional Gaussian primitives, as the internal representation of the scene; rendering the three-dimensional Gaussian splatting model through differentiable rendering to generate a predicted RGB image corresponding to the RGB image of the currently processed viewpoint; calculating a rendering error between the predicted RGB image and the acquired RGB image; iteratively optimizing the parameters of all or part of the Gaussian primitives constituting the model through a back-propagation algorithm based on the rendering error; during optimization, adaptively performing cloning and pruning operations on the Gaussian primitives according to their spatial distribution density and opacity, so as to adjust the geometric detail of the scene representation; and iteratively executing the differentiable rendering, rendering-error calculation, parameter optimization and adaptive structure adjustment until the rendering error converges below a preset threshold or a maximum number of iterations is reached, and outputting the optimized three-dimensional Gaussian splatting model as the updated current scene model.
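The optimize-by-rendering-error loop of this claim can be shown in a drastically simplified one-dimensional analogue (not real splatting): each "primitive" contributes colour·exp(-(x-mu)²/s²) to every pixel, positions and intensities are fitted by gradient descent on the photometric error, and weak primitives are pruned at the end. All initial values and the pruning threshold are arbitrary choices for illustration.

```python
import numpy as np

# 1-D toy analogue of splatting against a known target "image".
xs = np.linspace(0.0, 1.0, 64)
target = np.exp(-((xs - 0.3) ** 2) / 0.01) + 0.5 * np.exp(-((xs - 0.7) ** 2) / 0.02)

mu = np.array([0.25, 0.50, 0.75])   # hypothetical initial positions
c = np.array([0.50, 0.50, 0.50])    # hypothetical initial intensities
s2 = 0.01                           # fixed variance for simplicity

def render(mu, c):
    return (c[:, None] * np.exp(-((xs[None, :] - mu[:, None]) ** 2) / s2)).sum(0)

mse0 = np.mean((render(mu, c) - target) ** 2)   # initial rendering error
lr = 0.05
for _ in range(500):
    err = render(mu, c) - target                # rendering error per pixel
    basis = np.exp(-((xs[None, :] - mu[:, None]) ** 2) / s2)
    c -= lr * (err[None, :] * basis).sum(1) / xs.size            # d loss / d colour
    mu -= lr * (err[None, :] * c[:, None] * basis                # d loss / d position
                * 2 * (xs[None, :] - mu[:, None]) / s2).sum(1) / xs.size

mse1 = np.mean((render(mu, c) - target) ** 2)
keep = np.abs(c) > 0.05                         # prune low-contribution primitives
print(mse1 < mse0, int(keep.sum()))
```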
  4. The active three-dimensional reconstruction method based on model-based reinforcement learning and three-dimensional Gaussian splatting according to claim 1, wherein in S3 the step of generating the observation vector at the current moment specifically comprises: acquiring the scene model represented by the three-dimensional Gaussian splatting model, the scene model being composed of a set of three-dimensional Gaussian primitives; for each three-dimensional Gaussian primitive in the scene model, extracting its geometric feature vector through a shared multi-layer perceptron, and mapping the feature vector to the corresponding voxel cell of a preset three-dimensional voxel grid according to the spatial position of the primitive; for each voxel cell containing multiple geometric feature vectors, performing max-pooling and average-pooling respectively and fusing the pooled results to obtain the aggregated feature of the voxel cell; inputting the three-dimensional voxel grid and the aggregated features of all its voxel cells into a three-dimensional convolutional neural network for feature extraction, enhancing the local geometric correlation and spatial consistency of the features; performing global average pooling on the feature volume processed by the three-dimensional convolutional neural network to obtain a one-dimensional geometric embedding vector; constructing a binary bit matrix indicating which of all candidate viewpoints have been selected for capture; encoding the bit matrix through a small multi-layer perceptron to obtain a pose embedding vector; and concatenating the geometric embedding vector and the pose embedding vector to generate the observation vector at the current moment.
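The voxel aggregation and concatenation pipeline of this claim can be sketched with random stand-in features. The shared MLP and the 3-D CNN are replaced by random features and a plain global average respectively, and the pose MLP by a single linear layer: all shapes and names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical scene: N primitives with 3-D positions and F-d geometric features.
N, F, G = 200, 16, 4                 # primitives, feature dim, G^3 voxel grid
pos = rng.uniform(0, 1, (N, 3))
feat = rng.normal(size=(N, F))       # stand-in for the shared-MLP output

# Scatter features into a G x G x G grid, fusing max- and mean-pooling per cell.
idx = np.minimum((pos * G).astype(int), G - 1)
grid = np.zeros((G, G, G, 2 * F))
for v in set(map(tuple, idx)):
    mask = np.all(idx == np.array(v), axis=1)
    grid[v] = np.concatenate([feat[mask].max(0), feat[mask].mean(0)])

# Global average pooling stands in for the 3-D CNN + pooling of the claim.
geom_embed = grid.reshape(-1, 2 * F).mean(0)

# Pose embedding: binary indicator over candidate views through a linear layer.
n_views = 10
selected = np.zeros(n_views); selected[[0, 3]] = 1.0
W_pose = rng.normal(0, 0.1, (8, n_views))
pose_embed = np.tanh(W_pose @ selected)

obs = np.concatenate([geom_embed, pose_embed])   # observation vector
print(obs.shape)  # (40,)
```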
  5. The active three-dimensional reconstruction method based on model-based reinforcement learning and three-dimensional Gaussian splatting according to claim 1, wherein in S4 the step of generating a plurality of predicted future state and reward trajectories comprises: setting the updated current hidden state as the initial hidden state of the imagination rollout; iteratively performing multi-step prediction from the initial hidden state through the internal dynamics of the latent world model to generate a plurality of future prediction trajectories; outputting an imagined action through the actor network of the latent world model; inputting the imagined action into the internal dynamics of the latent world model to predict the next imagined hidden state; outputting an imagined immediate reward based on the next imagined hidden state through the reward predictor of the latent world model; taking the next imagined hidden state as the new current imagined hidden state for the prediction at the subsequent step; and, for each rolled-out future prediction trajectory, combining the series of imagined hidden states it contains with the corresponding imagined immediate rewards to form a predicted future state and reward trajectory.
  6. The active three-dimensional reconstruction method based on model-based reinforcement learning and three-dimensional Gaussian splatting according to claim 1, wherein in S5 the step of acquiring the RGB image of the new viewpoint specifically comprises: for each of the plurality of predicted future state and reward trajectories, calculating the predicted long-term cumulative reward of that trajectory from the series of imagined immediate rewards predicted along it, starting from its initial imagined hidden state; selecting the trajectory with the highest predicted long-term cumulative reward among all predicted future state and reward trajectories; extracting the imagined action corresponding to the first step of the selected trajectory; and mapping the extracted imagined action to the corresponding actual viewpoint index in the candidate viewpoint action set as the next-best-view action, and executing this action to control the image acquisition device to capture from that viewpoint, thereby acquiring the RGB image of the new viewpoint.
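The rollout-and-select procedure of claims 5 and 6 can be illustrated with a toy latent dynamics model. The linear dynamics, the scalar candidate actions and the zero-action placeholder policy for later steps are all assumptions made for brevity; the point is the pattern of rolling each candidate first action forward, accumulating discounted predicted rewards, and picking the argmax.

```python
import numpy as np

rng = np.random.default_rng(2)

H = 8
W_dyn = rng.normal(0, 0.3, (H, H))   # hypothetical latent dynamics matrix
W_act = rng.normal(0, 0.3, (H,))     # hypothetical per-action shift direction
W_rew = rng.normal(0, 0.3, (H,))     # hypothetical reward head

def imagine(h, action, horizon=5, gamma=0.95):
    """Roll a candidate first action forward in latent space and return the
    discounted sum of predicted rewards along the imagined trajectory."""
    total = 0.0
    for t in range(horizon):
        h = np.tanh(W_dyn @ h + W_act * action)   # latent transition
        total += (gamma ** t) * float(W_rew @ h)  # imagined immediate reward
        action = 0.0                              # later steps: placeholder policy
    return total

h0 = rng.normal(size=H)
candidates = [-1.0, 0.0, 1.0]                     # toy candidate-view actions
returns = [imagine(h0.copy(), a) for a in candidates]
best = int(np.argmax(returns))                    # first action of the best trajectory
print(best)
```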
  7. The active three-dimensional reconstruction method based on model-based reinforcement learning and three-dimensional Gaussian splatting according to claim 1, wherein in S6 the step of performing incremental optimization of the three-dimensional Gaussian splatting model using the RGB image of the new viewpoint to obtain the updated scene model specifically comprises: acquiring the RGB image of the new viewpoint; performing differentiable rendering of the three-dimensional Gaussian splatting model under the camera parameters of the new viewpoint to obtain the predicted RGB image for that viewpoint; calculating the pixel-level difference between the predicted RGB image and the RGB image of the new viewpoint as the rendering error; based on the rendering error, back-propagating and iteratively updating the spatial position, covariance matrix, opacity and colour-coefficient parameters of the Gaussian primitives through a gradient-descent algorithm; and taking the three-dimensional Gaussian splatting model with updated parameters as the updated scene model.
  8. The active three-dimensional reconstruction method based on model-based reinforcement learning and three-dimensional Gaussian splatting according to claim 1, wherein in S7 the step of generating the updated observation vector specifically comprises: acquiring the updated scene model, the updated scene model being the incrementally optimized three-dimensional Gaussian splatting model; extracting the geometric feature vector of each three-dimensional Gaussian primitive in the updated scene model through a shared multi-layer perceptron, and mapping the feature vectors to the corresponding voxel cells of a preset three-dimensional voxel grid according to the spatial positions of the primitives; for each voxel cell containing multiple geometric feature vectors, performing max-pooling and average-pooling respectively and fusing the pooled results to obtain the aggregated feature of the voxel cell; inputting the three-dimensional voxel grid and the aggregated features of all its voxel cells into a three-dimensional convolutional neural network for feature extraction; performing global average pooling on the feature volume processed by the three-dimensional convolutional neural network to obtain an updated geometric embedding vector; updating the bit matrix representing the viewpoint distribution state according to the selected viewpoint set, which now includes the new viewpoint; encoding the updated bit matrix through a small multi-layer perceptron to obtain an updated pose embedding vector; and concatenating the updated geometric embedding vector with the updated pose embedding vector to generate the updated observation vector.
  9. The method according to claim 1, wherein in S8 the step of calculating, based on the updated scene model, a potential function value comprising a reconstruction accuracy index and a viewpoint coverage index, and calculating the immediate reward from the difference between the current and previous potential function values, comprises: calculating a peak signal-to-noise ratio and a structural similarity index between the reconstruction result at the current moment and the scene ground truth, based on the updated scene model; calculating, from the direction vectors of all viewpoints in the currently selected viewpoint set, the normalized mean of the angles between all viewpoint pairs in the set, as the viewpoint dispersion; calculating the potential function value at the current moment from the peak signal-to-noise ratio, the structural similarity index and the viewpoint dispersion according to a preset weighted-sum formula; subtracting the potential function value recorded at the previous moment from the potential function value at the current moment, the difference being the immediate reward at the current moment; and recording the potential function value at the current moment for the immediate-reward calculation at the next moment.
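The potential-difference reward of this claim can be sketched as follows. SSIM is omitted for brevity, the weights and the PSNR rescaling constant are arbitrary illustrative choices, and the "reconstructions" are toy 1-D signals; only the structure (weighted potential, reward = Φ_t − Φ_{t−1}) follows the claim.

```python
import numpy as np

def psnr(pred, gt):
    """Peak signal-to-noise ratio for signals in [0, 1]."""
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(1.0 / max(mse, 1e-12))

def view_dispersion(dirs):
    """Normalized mean pairwise angle between unit view directions."""
    dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
    n = len(dirs)
    angles = [np.arccos(np.clip(dirs[i] @ dirs[j], -1.0, 1.0))
              for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(angles) / np.pi)

def potential(pred, gt, dirs, w=(0.5, 0.5)):
    # Weighted sum of accuracy and coverage; PSNR rescaled roughly into [0, 1].
    return w[0] * min(psnr(pred, gt) / 50.0, 1.0) + w[1] * view_dispersion(dirs)

gt = np.linspace(0.0, 1.0, 100)
# Previous step: noisier reconstruction, two views; current step: better + one more view.
phi_prev = potential(gt + 0.10, gt, np.array([[1, 0, 0], [0, 1, 0]]))
phi_curr = potential(gt + 0.02, gt, np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]]))
reward = phi_curr - phi_prev         # potential-difference immediate reward
print(reward > 0)
```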
  10. The active three-dimensional reconstruction method based on model-based reinforcement learning and three-dimensional Gaussian splatting according to claim 1, characterized in that in S9 the step of iteratively optimizing the latent world model and the actor network specifically comprises: storing the RGB image of the new viewpoint acquired at the current moment, the generated updated observation vector and the calculated immediate reward as an experience tuple in a preset experience replay buffer; sampling a batch of historical experience data from the experience replay buffer to optimize the parameters of the latent world model so as to minimize its observation reconstruction error, reward prediction error and discount-factor prediction error; and optimizing the parameters of the actor network and a critic network based on imagined trajectories generated by the latent world model in the latent space, wherein the actor network is optimized to maximize the predicted long-term cumulative reward along the imagined trajectories, and the critic network is optimized to accurately estimate the state values along the imagined trajectories.
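The experience storage of step S9 amounts to a bounded replay buffer of (image, observation, reward) tuples with random batch sampling; a minimal stdlib sketch, with illustrative field names and capacity:

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience buffer for (image, observation, reward) tuples,
    as stored in step S9; field names and capacity are illustrative."""
    def __init__(self, capacity=10_000):
        self.data = deque(maxlen=capacity)   # oldest tuples evicted first

    def add(self, image, obs, reward):
        self.data.append((image, obs, reward))

    def sample(self, batch_size):
        return random.sample(list(self.data), min(batch_size, len(self.data)))

buf = ReplayBuffer(capacity=100)
for t in range(150):                 # tuples beyond capacity evict the oldest
    buf.add(f"img_{t}", [t], 0.1 * t)
batch = buf.sample(32)
print(len(buf.data), len(batch))     # 100 32
```

The sampled batches would drive the world-model losses (reconstruction, reward, discount) while actor and critic train on imagined trajectories, as the claim states.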

Description

Active three-dimensional reconstruction method based on model-based reinforcement learning and three-dimensional Gaussian splatting

Technical Field

The invention relates to the technical fields of computer vision, three-dimensional reconstruction and artificial intelligence, and in particular to an active three-dimensional reconstruction method based on model-based reinforcement learning and three-dimensional Gaussian splatting.

Background

In active three-dimensional reconstruction, next-best-view planning aims to recover a complete scene from minimal image input by automatically selecting viewpoint poses. Traditional methods rely on hand-crafted geometric heuristics and adapt poorly to complex scenes. Subsequent learning-based, model-free reinforcement learning strategies can make decisions automatically, but they lack explicit modelling of the environment dynamics, which causes myopic planning and low sample efficiency, and the reconstruction representation and viewpoint planning are often only loosely coupled. Existing reinforcement-learning-based active reconstruction methods generally fail to combine an explicit environment dynamics model with a differentiable reconstruction representation effectively, so the agent struggles to perform long-horizon, forward-looking viewpoint planning and cannot adaptively adjust its decisions according to real-time feedback on the reconstruction state during iteration, ultimately yielding insufficient reconstruction efficiency and completeness under a limited viewpoint budget.
Disclosure of Invention

To remedy these deficiencies, the invention provides an active three-dimensional reconstruction method based on model-based reinforcement learning and three-dimensional Gaussian splatting, aiming to solve the problems of low sample efficiency and poor reconstruction completeness caused by short-horizon planning and the disconnect between reconstruction and decision-making in active three-dimensional reconstruction. The technical scheme provided by the invention is an active three-dimensional reconstruction method based on model-based reinforcement learning and three-dimensional Gaussian splatting, comprising the following steps: S1, constructing and training a latent world model based on a recurrent state-space model, and simulating the dynamics of the three-dimensional reconstruction process in a latent space; S2, performing incremental scene reconstruction on the RGB image of the currently acquired viewpoint by using a three-dimensional Gaussian splatting model to obtain a current scene model represented by a set of three-dimensional Gaussian primitives; S3, extracting geometric embedding features of the scene based on the current scene model, and concatenating them with the pose embedding features of the currently selected viewpoints to generate an observation vector at the current moment; S4, inputting the observation vector into the latent world model, updating its internal hidden state, and, based on the updated hidden state, performing imagination rollouts in the latent space over future multi-step action sequences formed from a candidate viewpoint action set, to generate a plurality of predicted future state and reward trajectories; S5, selecting, through an actor network, a next-best-view action that maximizes the predicted long-term cumulative reward, according to the candidate viewpoint action set and based on the predicted future state and reward trajectories, executing the next-best-view action, and acquiring an RGB image of the new viewpoint; S6, performing incremental optimization of the three-dimensional Gaussian splatting model using the RGB image of the new viewpoint to obtain an updated scene model; S7, extracting updated geometric embedding features based on the updated scene model, and concatenating them with the updated selected-viewpoint pose embedding features to generate an updated observation vector; S8, calculating, based on the updated scene model, a potential function value comprising a reconstruction accuracy index and a viewpoint coverage index, and calculating an immediate reward from the difference between the potential function values at the current and previous moments; S9, using the RGB image of the new viewpoint, the updated observation vector and the immediate reward as new experience data for iteratively optimizing the latent world model and the actor network; and S10, taking the updated observation vector as the observation vector at the current moment of the next decision round, repeatedly executing steps S4 to S9 until a preset number of viewpoint acquisitions or a reconstruction convergence condition is reached, and outputting the final three-dimensional reconstructed scene model. Preferably, in step S1, the step of constructing and training a latent world model based on the recurrent