
CN-122023495-A - Training method of monocular depth estimation model, monocular depth estimation method and device

CN122023495A

Abstract

The invention provides a training method of a monocular depth estimation model, a monocular depth estimation method and a device. The training method comprises: performing three-dimensional Gaussian scene prediction and rendering on a sequence of frame images to obtain an initial depth map, a multi-view synthesized image and a synthesized depth map; reconstructing the target image from the front and rear frame images using the initial depth map and the camera pose information between the frames, and calculating a minimum photometric reconstruction loss; reconstructing the target image from the front and rear frame images according to the synthesized depth map and the camera pose information, and calculating a minimum synthesized photometric reconstruction loss; obtaining a geometric consistency mask and a depth consistency loss from the initial depth map and the synthesized depth map; calculating a synthesized photometric loss from the target image, the front and rear frame images and the multi-view synthesized image; and constructing a joint loss used to iteratively update the network parameters to obtain the monocular depth estimation model. The method improves the accuracy of depth estimation in power grid scenes.

Inventors

  • DONG QIULEI
  • WANG PEIGUANG
  • LIN LIMING
  • ZHANG LITING
  • HUANG ZHIBIN

Assignees

  • Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
  • State Grid Siji Location Service Co., Ltd. (国网思极位置服务有限公司)

Dates

Publication Date
2026-05-12
Application Date
2025-12-23

Claims (10)

  1. A method for training a monocular depth estimation model, comprising: performing, based on a depth generation branch network, three-dimensional Gaussian scene prediction and rendering on a sequence of frame images to obtain an initial depth map, a multi-view synthesized image and a synthesized depth map, wherein the sequence of frame images comprises a target image and front and rear frame images, the front and rear frame images are frames adjacent to the target image in a video sequence, and the multi-view synthesized image comprises synthesized images of the target image and the front and rear frame images; reconstructing the target image from the front and rear frame images based on a depth consistency perception network, using the initial depth map and camera pose information between the target image and the front and rear frame images, to obtain a first reconstructed image; reconstructing the target image from the front and rear frame images according to the synthesized depth map and the camera pose information to obtain a second reconstructed image; calculating a minimum photometric reconstruction loss between the first reconstructed image and the target image; calculating a minimum synthesized photometric reconstruction loss between the second reconstructed image and the target image; obtaining a geometric consistency mask and a depth consistency loss from the initial depth map and the synthesized depth map; calculating a synthesized photometric loss from the target image, the front and rear frame images and the multi-view synthesized image; and constructing a joint loss from the minimum photometric reconstruction loss, the minimum synthesized photometric reconstruction loss, the depth consistency loss, the synthesized photometric loss and the geometric consistency mask, iteratively updating parameters of the depth generation branch network and the depth consistency perception network according to the joint loss, and obtaining the monocular depth estimation model when an iteration condition is met.
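The minimum photometric reconstruction loss of claim 1 takes, at each pixel, the smaller of the errors from the reconstructions out of the two adjacent frames, which suppresses occlusion and out-of-view artifacts. A minimal NumPy sketch of this idea (illustrative only: it uses a plain L1 error, whereas self-supervised pipelines typically also include an SSIM term; the function names are hypothetical):

```python
import numpy as np

def photometric_error(target, reconstructed):
    # Per-pixel L1 photometric error, averaged over colour channels.
    return np.abs(target - reconstructed).mean(axis=-1)

def minimum_reprojection_loss(target, recon_prev, recon_next):
    # Per-pixel minimum over the reconstructions from the previous and
    # next frames, then averaged over the image into a scalar loss.
    err_prev = photometric_error(target, recon_prev)
    err_next = photometric_error(target, recon_next)
    return np.minimum(err_prev, err_next).mean()

rng = np.random.default_rng(0)
target = rng.random((4, 4, 3))
# If one reconstruction matches the target exactly, the per-pixel
# minimum is zero everywhere, so the loss is zero.
loss = minimum_reprojection_loss(target, target.copy(), rng.random((4, 4, 3)))
print(loss)
```

The per-pixel minimum (rather than an average) is what lets a pixel that is occluded in one adjacent frame still be supervised by the other.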
  2. The method of training a monocular depth estimation model of claim 1, wherein the depth generation branch network comprises a depth encoder-decoder, a three-dimensional Gaussian prediction head, a three-dimensional Gaussian splatting module, and a pose prediction module; and performing three-dimensional Gaussian scene prediction and rendering on the sequence of frame images to obtain the initial depth map, the multi-view synthesized image and the synthesized depth map comprises: extracting depth image features from the target image using the depth encoder-decoder; extracting the camera pose information from the front and rear frame images using the pose prediction module; performing three-dimensional Gaussian scene prediction on the depth image features using the three-dimensional Gaussian prediction head to obtain scene representation parameters, wherein the scene representation parameters comprise the initial depth map and three-dimensional Gaussian parameters of each pixel in the initial depth map; and performing three-dimensional Gaussian scene rendering according to the scene representation parameters and the camera pose information using the three-dimensional Gaussian splatting module to obtain the multi-view synthesized image and the synthesized depth map.
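The mapping from per-pixel depth features to a depth value plus per-pixel Gaussian parameters described in claim 2 can be sketched as below. This is a toy stand-in, not the patent's network: the channel layout (which feature channels become depth, opacity, covariance scales, and spherical-harmonic colour coefficients) and the activation choices are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gaussian_prediction_head(depth_features):
    """Toy stand-in for the 3D Gaussian prediction head: maps an (H, W, C)
    feature map to a per-pixel depth plus per-pixel Gaussian parameters."""
    depth = sigmoid(depth_features[..., 0])      # normalized depth in (0, 1)
    opacity = sigmoid(depth_features[..., 1])    # per-Gaussian opacity in (0, 1)
    scales = np.exp(depth_features[..., 2:5])    # positive covariance scales
    sh_dc = depth_features[..., 5:8]             # degree-0 spherical-harmonic (base colour) coefficients
    return {"depth": depth, "opacity": opacity, "scales": scales, "sh_dc": sh_dc}

feats = np.zeros((2, 2, 8))
params = gaussian_prediction_head(feats)
print(params["depth"].shape, params["opacity"][0, 0])
```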
  3. The training method of a monocular depth estimation model according to claim 2, wherein the three-dimensional Gaussian parameters comprise opacity, covariance, and spherical harmonic coefficients, the spherical harmonic coefficients representing the colour of a pixel point as a function of viewing angle; and the multi-view synthesized image is obtained by: projecting each three-dimensional Gaussian in the three-dimensional Gaussian scene representation to a viewing angle using a visibility-aware rendering algorithm to obtain a plurality of Gaussian splats in the two-dimensional camera plane; sorting the plurality of Gaussian splats in depth-priority order; and computing the multi-view synthesized image by the formula C(p) = Σ_{i∈N} c_i α_i Π_{j=1}^{i-1} (1 − α_j); wherein C(p) is the multi-view synthesized image at pixel p, N is the ordered sequence of Gaussian splats, c_i is the colour of the i-th three-dimensional Gaussian computed from its spherical harmonic coefficients, α_i is the opacity of the i-th splat attenuated by its two-dimensional Gaussian footprint, and the product over j accumulates the transmittance of the splats in front of the i-th splat.
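The compositing rule in claim 3 is standard front-to-back alpha blending over depth-sorted splats. A small sketch for a single pixel (here the colours and alphas are per-splat values assumed to be already attenuated by each Gaussian's 2D footprint):

```python
import numpy as np

def composite_color(colors, alphas):
    # Front-to-back alpha compositing of depth-sorted Gaussian splats:
    # C = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j)
    transmittance = 1.0
    out = np.zeros(3)
    for c, a in zip(colors, alphas):
        out += c * a * transmittance
        transmittance *= (1.0 - a)  # light remaining after passing splat i
    return out

# A half-transparent red splat in front of an opaque green splat.
colors = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
alphas = [0.5, 1.0]
out = composite_color(colors, alphas)
print(out)
```

Because the splats are sorted by depth before blending, each splat's contribution is weighted by the transmittance of everything in front of it.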
  4. A method of training a monocular depth estimation model according to claim 2 or 3, wherein the synthesized depth map is computed by the formula D(p) = Σ_{i∈N} d_i α_i Π_{j=1}^{i-1} (1 − α_j); wherein D(p) is the synthesized depth map at pixel p, d_i is the depth of the i-th three-dimensional Gaussian, and α_i and the transmittance product are as defined in claim 3.
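The synthesized depth at a pixel uses the same transmittance-weighted sum as the colour, with each splat's depth in place of its colour (an illustrative single-pixel sketch):

```python
def composite_depth(depths, alphas):
    # D = sum_i d_i * alpha_i * prod_{j<i} (1 - alpha_j), over depth-sorted splats.
    transmittance, out = 1.0, 0.0
    for d, a in zip(depths, alphas):
        out += d * a * transmittance
        transmittance *= (1.0 - a)
    return out

# A half-transparent splat at depth 2 in front of an opaque splat at depth 5:
# D = 2*0.5 + 5*1.0*0.5
d = composite_depth([2.0, 5.0], [0.5, 1.0])
print(d)
```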
  5. The method of training a monocular depth estimation model according to claim 1, wherein the joint loss is represented by L = M ⊙ (L_mp + L_msp) + L_dc + L_sp; wherein L is the joint loss, L_mp is the minimum photometric reconstruction loss, L_msp is the minimum synthesized photometric reconstruction loss, L_dc is the depth consistency loss, L_sp is the synthesized photometric loss, M is the geometric consistency mask, and ⊙ denotes the element-wise product.
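One plausible reading of claim 5's joint loss, sketched below. The exact grouping of terms under the mask is an assumption (the claim only states that the geometric consistency mask enters via an element-wise product), and the symbol names are hypothetical:

```python
import numpy as np

def joint_loss(l_mp, l_msp, l_dc, l_sp, mask):
    # Assumed combination: the geometric consistency mask weights the
    # per-pixel photometric terms element-wise, then every term is
    # reduced to a scalar by spatial averaging.
    masked = mask * (l_mp + l_msp)
    return masked.mean() + l_dc.mean() + l_sp.mean()

mask = np.array([[1.0, 0.0], [1.0, 1.0]])  # 0 marks geometrically inconsistent pixels
l_mp = np.full((2, 2), 0.2)    # minimum photometric reconstruction loss
l_msp = np.full((2, 2), 0.1)   # minimum synthesized photometric reconstruction loss
l_dc = np.full((2, 2), 0.05)   # depth consistency loss
l_sp = np.full((2, 2), 0.1)    # synthesized photometric loss
total = joint_loss(l_mp, l_msp, l_dc, l_sp, mask)
print(total)
```

Masking the photometric terms keeps pixels whose initial and synthesized depths disagree (e.g. dynamic objects or occlusions) from corrupting the gradient.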
  6. A monocular depth estimation method, comprising: acquiring a monocular image to be detected; and inputting the monocular image to be detected into a monocular depth estimation model to obtain a monocular depth estimation result, wherein the monocular depth estimation model is trained by the training method according to any one of claims 1-5.
  7. A training device for a monocular depth estimation model, comprising: an image generation module configured to perform, based on a depth generation branch network, three-dimensional Gaussian scene prediction and rendering on a sequence of frame images to obtain an initial depth map, a multi-view synthesized image and a synthesized depth map, wherein the sequence of frame images comprises a target image and front and rear frame images, the front and rear frame images are frames adjacent to the target image in a video sequence, and the multi-view synthesized image comprises synthesized images of the target image and the front and rear frame images; a computing module configured to reconstruct the target image from the front and rear frame images based on a depth consistency perception network, using the initial depth map and camera pose information between the target image and the front and rear frame images, to obtain a first reconstructed image; reconstruct the target image from the front and rear frame images according to the synthesized depth map and the camera pose information to obtain a second reconstructed image; calculate a minimum photometric reconstruction loss between the first reconstructed image and the target image; calculate a minimum synthesized photometric reconstruction loss between the second reconstructed image and the target image; obtain a geometric consistency mask and a depth consistency loss from the initial depth map and the synthesized depth map; and calculate a synthesized photometric loss from the target image, the front and rear frame images and the multi-view synthesized image; and a training module configured to construct a joint loss from the minimum photometric reconstruction loss, the minimum synthesized photometric reconstruction loss, the depth consistency loss, the synthesized photometric loss and the geometric consistency mask, iteratively update parameters of the depth generation branch network and the depth consistency perception network according to the joint loss, and obtain the monocular depth estimation model when an iteration condition is met.
  8. A monocular depth estimation apparatus, comprising: an image acquisition module configured to acquire a monocular image to be detected; and a depth estimation module configured to input the monocular image to be detected into a monocular depth estimation model to obtain a monocular depth estimation result, wherein the monocular depth estimation model is trained by the training method of the monocular depth estimation model according to any one of claims 1-5.
  9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the training method of the monocular depth estimation model according to any one of claims 1-5 or the monocular depth estimation method according to claim 6.
  10. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the training method of the monocular depth estimation model according to any one of claims 1-5 or the monocular depth estimation method according to claim 6.

Description

Training method of monocular depth estimation model, monocular depth estimation method and device

Technical Field

The present invention relates to the field of image processing technologies, and in particular to a training method for a monocular depth estimation model, a monocular depth estimation method and a monocular depth estimation device.

Background

The power grid is an important component of a city, and effective monitoring of the power grid is essential for safe urban operation and maintenance. Self-supervised monocular depth estimation is generally adopted: a depth estimation task taking monocular images as input is realized through a self-supervised training paradigm, providing technical support for monitoring power grid scenes and realizing a digital twin power grid. In the related art, self-supervised monocular depth estimation methods include training with binocular stereo image pairs and training with monocular video sequences. These methods suffer from low prediction accuracy in texture-sparse regions and geometrically inconsistent depth predictions, so the accuracy of depth estimation results in power grid scenes is low, which limits further optimization of self-supervised monocular depth estimation for power grid scenes and poses a challenge for the digital twin power grid.

Disclosure of Invention

The invention provides a training method of a monocular depth estimation model, a monocular depth estimation method and a monocular depth estimation device, to overcome the defects of the prior-art self-supervised monocular depth estimation methods, namely low prediction accuracy in texture-sparse regions and geometrically inconsistent depth predictions.
The invention provides a training method of a monocular depth estimation model, comprising the following steps: performing, based on a depth generation branch network, three-dimensional Gaussian scene prediction and rendering on a sequence of frame images to obtain an initial depth map, a multi-view synthesized image and a synthesized depth map, wherein the sequence of frame images comprises a target image and front and rear frame images, the front and rear frame images are frames adjacent to the target image in a video sequence, and the multi-view synthesized image comprises synthesized images of the target image and the front and rear frame images; reconstructing the target image from the front and rear frame images based on a depth consistency perception network, using the initial depth map and camera pose information between the target image and the front and rear frame images, to obtain a first reconstructed image; reconstructing the target image from the front and rear frame images according to the synthesized depth map and the camera pose information to obtain a second reconstructed image; calculating a minimum photometric reconstruction loss between the first reconstructed image and the target image; calculating a minimum synthesized photometric reconstruction loss between the second reconstructed image and the target image; obtaining a geometric consistency mask and a depth consistency loss from the initial depth map and the synthesized depth map; calculating a synthesized photometric loss from the target image, the front and rear frame images and the multi-view synthesized image; and constructing a joint loss from the minimum photometric reconstruction loss, the minimum synthesized photometric reconstruction loss, the depth consistency loss, the synthesized photometric loss and the geometric consistency mask, iteratively updating parameters of the depth generation branch network and the depth consistency perception network according to the joint loss, and obtaining the monocular depth estimation model when an iteration condition is met.

According to the training method of the monocular depth estimation model provided by the invention, the depth generation branch network comprises a depth encoder-decoder, a three-dimensional Gaussian prediction head, a three-dimensional Gaussian splatting module and a pose prediction module; and performing three-dimensional Gaussian scene prediction and rendering on the sequence of frame images to obtain the initial depth map, the multi-view synthesized image and the synthesized depth map comprises: extracting depth image features from the target image using the depth encoder-decoder; extracting the camera pose information from the front and rear frame images using the pose prediction module; and performing three-dimensional Gaussian scene prediction on the depth image features using the three-dimensional Gaussian prediction head to obtain scene representation parameters, wherein the scene representation parameters comprise the initial depth map and the three-dimensional Gaussian parameters of each pixel in the initial depth map.