CN-121810754-B - Gradient-aware self-supervised monocular depth estimation method and device for dynamic scenes
Abstract
The invention discloses a gradient-aware self-supervised monocular depth estimation method and device for dynamic scenes, in the field of image processing. The method comprises: constructing a monocular depth estimation model, a pre-trained pseudo-depth prediction model, and a pose estimation network, and training the monocular depth estimation model with the pre-trained pseudo-depth prediction model and the pose estimation network to obtain a trained monocular depth estimation model, wherein the monocular depth estimation model adopts a gradient-aware dense skip connection structure, and the loss function used during training comprises a mask-weighted photometric loss, a geometric consistency loss, a global depth-ordering constraint loss, a normal matching loss, an edge relative-normal loss, and a multi-scale feature alignment loss; and acquiring a single frame image to be estimated, inputting it into the trained monocular depth estimation model, and obtaining a predicted depth map through the gradient-aware dense skip connection structure. The method addresses the semantic inconsistency, insufficient spatial alignment, and related problems of existing depth estimation models.
Inventors
- ZHANG JIANHUAN
- DAI JIDONG
- ZHANG CHENTAO
- YE MAOZHANG
- LI XIAOLI
- XU ZHOUYI
- ZHENG GAOFENG
Assignees
- Xiamen University (厦门大学)
- Quanzhou Vocational and Technical College of Economics and Trade (泉州经贸职业技术学院)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-03-06
Claims (9)
- 1. A gradient-aware self-supervised monocular depth estimation method for dynamic scenes, characterized by comprising the following steps: constructing a monocular depth estimation model, a pre-trained pseudo-depth prediction model, and a pose estimation network, and training the monocular depth estimation model with the pre-trained pseudo-depth prediction model and the pose estimation network to obtain a trained monocular depth estimation model, wherein the monocular depth estimation model comprises an encoder, intermediate combination layers, and a decoder forming a gradient-aware dense skip connection structure, and the loss function used in training the monocular depth estimation model comprises a mask-weighted photometric loss, a geometric consistency loss, a global depth-ordering constraint loss, a normal matching loss, an edge relative-normal loss, and a multi-scale feature alignment loss.
  The mask-weighted photometric loss is constructed as follows: obtain a monocular video sequence and take the monocular images of two adjacent frames as a reference frame $I_t$ and a source frame $I_s$; input the reference frame and the source frame separately into the monocular depth estimation model to obtain a predicted depth map $D_t$ of the reference frame and a predicted depth map $D_s$ of the source frame; input the reference frame and the source frame into the pose estimation network to obtain the corresponding relative camera pose; obtain the camera intrinsics $K$; and construct the photometric consistency loss from the predicted depth map of the source frame, the relative camera pose, the camera intrinsics, and the reference frame. The photometric consistency loss is constructed as follows: input the predicted depth map of the source frame, the relative camera pose, and the camera intrinsics into a re-projection and differentiable bilinear sampling algorithm to obtain the corresponding reconstructed image $\hat{I}_t$; compute the photometric consistency loss from the reference frame and the reconstructed image as
  $$L_p^{\,i} = \frac{\alpha}{2}\left(1 - \mathrm{SSIM}^{\,i}\!\left(I_t, \hat{I}_t\right)\right) + (1-\alpha)\left\lVert I_t^{\,i} - \hat{I}_t^{\,i} \right\rVert_1$$
  where $L_p^{\,i}$ denotes the photometric consistency loss of the reference frame and the reconstructed image at the $i$-th valid pixel, $\alpha$ is a weight coefficient, $\mathrm{SSIM}^{\,i}(I_t,\hat{I}_t)$ is the structural similarity of the reference frame and the reconstructed image at the $i$-th valid pixel, $I_t^{\,i}$ and $\hat{I}_t^{\,i}$ are the pixel values of the reference frame and the reconstructed image at the $i$-th valid pixel, and $\lVert\cdot\rVert_1$ is the L1 norm.
  Re-project the predicted depth map of the reference frame to the source frame coordinate system based on the relative camera pose, align it with the predicted depth map of the source frame, compute the pixel-level normalized depth inconsistency, generate a self-discovery mask from the pixel-level normalized depth inconsistency, and weight the photometric consistency loss with the self-discovery mask to obtain the mask-weighted photometric loss, computed as follows: re-project the predicted depth value of each pixel of the reference frame's predicted depth map into three-dimensional space and map it into the source frame coordinate system through the relative camera pose to obtain the reference frame's predicted depth map under the source frame coordinate system, then perform bilinear interpolation sampling on the source frame's predicted depth map with it to obtain the aligned depth map of the source frame:
  $$D_{t\to s}^{\,i} = \phi\!\left(T_{t\to s},\, D_t^{\,i},\, K\right)$$
  $$D_s'^{\,i} = \left\langle D_s \right\rangle^{\,i}$$
  where $T_{t\to s}$ is the relative camera pose from the reference frame to the source frame, $D_t^{\,i}$ is the predicted depth value of the $i$-th valid pixel on the reference frame's predicted depth map, $\phi(\cdot)$ denotes the re-projection and mapping operations, $D_{t\to s}^{\,i}$ is the predicted depth value of the $i$-th valid pixel of the reference frame's predicted depth map under the source frame coordinate system, $\langle\cdot\rangle$ denotes the bilinear interpolation sampling operation, $D_s^{\,i}$ is the predicted depth value of the $i$-th valid pixel on the source frame's predicted depth map, and $D_s'^{\,i}$ is the predicted depth value of the $i$-th valid pixel on the aligned depth map of the source frame.
  Compute the pixel-level normalized depth inconsistency between the source frame's predicted depth map and the reference frame's predicted depth map as
  $$D_{\mathrm{diff}}^{\,i} = \frac{\left| D_{t\to s}^{\,i} - D_s'^{\,i} \right|}{D_{t\to s}^{\,i} + D_s'^{\,i}}$$
  where $D_{\mathrm{diff}}^{\,i}$ is the pixel-level normalized depth inconsistency at the $i$-th valid pixel. Compute the self-discovery mask from the pixel-level normalized depth inconsistency as
  $$M^{\,i} = 1 - D_{\mathrm{diff}}^{\,i}$$
  where $M^{\,i}$ is the self-discovery mask at the $i$-th valid pixel. Weight and average the photometric consistency loss with the self-discovery mask to obtain the mask-weighted photometric loss:
  $$L_{\mathrm{ph}} = \frac{1}{|V|} \sum_{i\in V} M^{\,i}\, L_p^{\,i}$$
  where $V$ is the set of valid pixels successfully aligned during the projection and interpolation process, $|V|$ is the total number of valid pixels in $V$, and $L_{\mathrm{ph}}$ is the mask-weighted photometric loss. The geometric consistency loss is computed as
  $$L_{\mathrm{geo}} = \frac{1}{|V|} \sum_{i\in V} D_{\mathrm{diff}}^{\,i}$$
  where $L_{\mathrm{geo}}$ is the geometric consistency loss.
  Acquire a single frame image to be estimated, input it into the trained monocular depth estimation model, and obtain the corresponding predicted depth map through the processing of the gradient-aware dense skip connection structure.
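The photometric term, self-discovery mask, and geometric consistency loss of claim 1 follow the scheme popularized by SC-SfMLearner-style self-supervision. Below is a minimal PyTorch sketch of these three quantities; the tensor shapes, the 3×3 SSIM window, and the weight α = 0.85 are assumptions for illustration, not values taken from the patent.

```python
import torch
import torch.nn.functional as F

def ssim_map(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Per-pixel SSIM between two images (B, 3, H, W), averaged over channels."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    var_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    cov = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return (num / den).clamp(0, 1).mean(1, keepdim=True)

def photometric_loss(ref, recon, alpha=0.85):
    """SSIM + L1 photometric consistency loss per pixel, shape (B, 1, H, W)."""
    l1 = (ref - recon).abs().mean(1, keepdim=True)
    return alpha / 2 * (1 - ssim_map(ref, recon)) + (1 - alpha) * l1

def mask_weighted_losses(d_ref_in_src, d_src_aligned, photo, valid):
    """d_ref_in_src: reference depth re-projected into the source coordinate
    system; d_src_aligned: source depth bilinearly sampled at the projected
    points; photo: per-pixel photometric loss; valid: boolean mask of pixels
    that were successfully aligned."""
    d_diff = (d_ref_in_src - d_src_aligned).abs() / (d_ref_in_src + d_src_aligned)
    mask = 1.0 - d_diff                       # self-discovery mask in [0, 1]
    l_ph = (mask * photo)[valid].mean()       # mask-weighted photometric loss
    l_geo = d_diff[valid].mean()              # geometric consistency loss
    return l_ph, l_geo
```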
- 2. The gradient-aware self-supervised monocular depth estimation method for a dynamic scene according to claim 1, wherein the global depth-ordering constraint loss, the normal matching loss, and the edge relative-normal loss are constructed as follows:
  Input the reference frame into the pre-trained pseudo-depth prediction model to obtain the corresponding pseudo depth map $D^{*}$.
  Compare the self-discovery mask with a preset mask threshold and divide the reference frame into a dynamic region and a static region according to the comparison result; sample valid pixels in the dynamic region and valid pixels in the static region to form pairs, and additionally sample any two valid pixels in the reference frame to form pairs; define the ordinal label from the ratio of the pseudo depths of the two paired valid pixels:
  $$\ell(i,j) = \begin{cases} +1, & d_i^{*}/d_j^{*} \ge 1 + \tau \\ -1, & d_i^{*}/d_j^{*} \le 1/(1+\tau) \\ 0, & \text{otherwise} \end{cases}$$
  where $\tau$ is the depth difference threshold, $d_i^{*}$ and $d_j^{*}$ are the pseudo-depth values of the two paired valid pixels $i$ and $j$ in the pseudo depth map, and $\ell(i,j)$ is the ordinal label. Determine the set of valid pixel pairs formed by the paired valid pixels in the pseudo depth map according to the ordinal label:
  $$\mathcal{P} = \left\{ \left(i,\, j,\, \ell(i,j)\right) \right\}$$
  where $\mathcal{P}$ is the set of valid pixel pairs. Compute the global depth-ordering constraint loss from the predicted depth values, in the reference frame's predicted depth map, of the two paired valid pixels (a code sketch follows this claim):
  $$L_{\mathrm{rank}}(i,j) = \begin{cases} \log\!\left(1 + \exp\!\left(-\ell(i,j)\,(d_i - d_j)\right)\right), & \ell(i,j) \ne 0 \\ \left(d_i - d_j\right)^2, & \ell(i,j) = 0 \end{cases}$$
  $$L_{\mathrm{ord}} = \frac{1}{|\mathcal{P}|} \sum_{(i,j)\in\mathcal{P}} L_{\mathrm{rank}}(i,j)$$
  where $|\mathcal{P}|$ is the total number of valid pixel pairs in the set, $d_i$ and $d_j$ are the predicted depth values of the two paired valid pixels $i$ and $j$ in the reference frame's predicted depth map, $L_{\mathrm{rank}}$ is the ordering loss, and $L_{\mathrm{ord}}$ is the global depth-ordering constraint loss.
  The normal matching loss is constructed as follows: compute the gradients of the reference frame's predicted depth map and of the pseudo depth map at each pixel by finite differences, and construct the un-normalized normal of the reference frame's predicted depth map and the un-normalized normal of the pseudo depth map:
  $$\mathbf{n}_i = \left(-\nabla_x D_t^{\,i},\; -\nabla_y D_t^{\,i},\; 1\right)^{\!\top}$$
  $$\mathbf{n}_i^{*} = \left(-\nabla_x D^{*\,i},\; -\nabla_y D^{*\,i},\; 1\right)^{\!\top}$$
  where $\mathbf{n}_i$ and $\mathbf{n}_i^{*}$ are the un-normalized normals of the reference frame's predicted depth map and of the pseudo depth map at the $i$-th pixel, $\nabla_x D_t^{\,i}$ and $\nabla_y D_t^{\,i}$ are the gradients of the reference frame's predicted depth map at the $i$-th pixel in the x-axis and y-axis directions, and $\nabla_x D^{*\,i}$ and $\nabla_y D^{*\,i}$ are the gradients of the pseudo depth map at the $i$-th pixel in the x-axis and y-axis directions. Normalize the un-normalized normals of the reference frame's predicted depth map and of the pseudo depth map to obtain the normalized normals, and compute the normal matching loss:
  $$L_{\mathrm{nm}} = \frac{1}{N} \sum_{i=1}^{N} \left\lVert \hat{\mathbf{n}}_i - \hat{\mathbf{n}}_i^{*} \right\rVert_1$$
  where $L_{\mathrm{nm}}$ is the normal matching loss, $\hat{\mathbf{n}}_i$ and $\hat{\mathbf{n}}_i^{*}$ are the normalized normals of the reference frame's predicted depth map and of the pseudo depth map at the $i$-th pixel, and $N$ is the total number of pixels.
  The edge relative-normal loss is constructed as follows: use edge-guided sampling to draw structure point pairs $(A, B)$ in the edge regions of both the reference frame's predicted depth map and the pseudo depth map; denote the normals corresponding to a point pair in the reference frame's predicted depth map as $(\hat{\mathbf{n}}_A, \hat{\mathbf{n}}_B)$ and the normals corresponding to the point pair in the pseudo depth map as $(\hat{\mathbf{n}}_A^{*}, \hat{\mathbf{n}}_B^{*})$; the edge relative-normal loss is
  $$L_{\mathrm{ern}} = \frac{1}{|\mathcal{S}|} \sum_{(A,B)\in\mathcal{S}} \left| \hat{\mathbf{n}}_A \cdot \hat{\mathbf{n}}_B - \hat{\mathbf{n}}_A^{*} \cdot \hat{\mathbf{n}}_B^{*} \right|$$
  where $|\mathcal{S}|$ is the total number of point pairs and $L_{\mathrm{ern}}$ is the edge relative-normal loss.
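For concreteness, here is a hedged PyTorch sketch of the ordinal labeling, ordering loss, and finite-difference normals of claim 2. The threshold value tau = 0.02, the padding scheme, and the log(1 + exp(·)) ranking form are assumptions consistent with common ranking-loss formulations, not specifics confirmed by the patent.

```python
import torch
import torch.nn.functional as F

def ordinal_labels(d_star_i, d_star_j, tau=0.02):
    """Ordinal label from the ratio of two pseudo-depth values."""
    ratio = d_star_i / d_star_j
    lab = torch.zeros_like(ratio)
    lab[ratio >= 1 + tau] = 1.0
    lab[ratio <= 1 / (1 + tau)] = -1.0
    return lab

def ordering_loss(d_i, d_j, lab):
    """Global depth-ordering constraint on predicted depths of sampled pairs."""
    ordered = torch.log1p(torch.exp(-lab * (d_i - d_j)))  # enforce the ordering
    equal = (d_i - d_j) ** 2                              # pull near-equal pairs together
    return torch.where(lab != 0, ordered, equal).mean()

def normals_from_depth(depth, eps=1e-6):
    """Finite-difference gradients -> un-normalized normals -> unit normals.
    depth: (B, 1, H, W); returns (B, 3, H, W)."""
    gx = F.pad(depth[..., :, 1:] - depth[..., :, :-1], (0, 1))        # x-gradient
    gy = F.pad(depth[..., 1:, :] - depth[..., :-1, :], (0, 0, 0, 1))  # y-gradient
    n = torch.cat([-gx, -gy, torch.ones_like(depth)], dim=1)
    return n / n.norm(dim=1, keepdim=True).clamp(min=eps)
```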
- 3. The gradient-aware self-supervised monocular depth estimation method for a dynamic scene according to claim 1, wherein the multi-scale feature alignment loss is constructed as follows:
  The backbone network of the pre-trained pseudo-depth prediction model serves as the teacher model; input the reference frame into this backbone and extract the $i$-th token of the teacher feature from its $l$-th layer:
  $$t_i^{(l)} = \left[ f_T^{(l)}\!\left(I_t\right) \right]_i, \quad i = 1, \dots, N_{\mathrm{tok}}$$
  where $I_t$ is the reference frame, $f_T^{(l)}$ is the $l$-th layer of the backbone network of the pre-trained pseudo-depth prediction model, $t_i^{(l)}$ is the $i$-th token of the teacher feature extracted by that layer, and $N_{\mathrm{tok}}$ is the total number of tokens.
  The encoder of the monocular depth estimation model serves as the student model; input the reference frame into the encoder and extract the student feature from the corresponding $k$-th layer:
  $$F_S^{(k)} = f_S^{(k)}\!\left(I_t\right)$$
  where $f_S^{(k)}$ is the corresponding $k$-th layer of the encoder of the monocular depth estimation model and $F_S^{(k)}$ is the student feature extracted from it.
  Construct a semantic projector and project the student feature extracted from the corresponding $k$-th layer of the encoder into the space of the teacher feature, obtaining the $i$-th token of the projected student feature:
  $$s_i^{(k)} = \left[ \psi\!\left(F_S^{(k)}\right) \right]_i$$
  where $\psi$ denotes the semantic projector and $s_i^{(k)}$ is the $i$-th token of the projected student feature. The multi-scale feature alignment loss is computed as
  $$L_{\mathrm{fa}} = \frac{1}{N_{\mathrm{tok}}} \sum_{i=1}^{N_{\mathrm{tok}}} \left\lVert s_i^{(k)} - t_i^{(l)} \right\rVert_2^2$$
  where $\lVert\cdot\rVert_2$ is the L2 norm and $L_{\mathrm{fa}}$ is the multi-scale feature alignment loss.
  The total loss function used in training the monocular depth estimation model is a weighted sum of the mask-weighted photometric loss, the geometric consistency loss, the global depth-ordering constraint loss, the normal matching loss, the edge relative-normal loss, and the multi-scale feature alignment loss; the parameters of the pre-trained pseudo-depth prediction model are frozen during training, and the pose estimation network is trained synchronously.
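A compact sketch of the token-level alignment in claim 3, assuming the teacher tokens come from a frozen ViT-style backbone and the student map is projected by the semantic projector of claim 4; the function and parameter names are illustrative:

```python
import torch

def feature_alignment_loss(student_feat, teacher_tokens, projector):
    """student_feat: (B, C_s, H, W) from the k-th encoder layer of the student;
    teacher_tokens: (B, N, C_t) from the l-th layer of the frozen teacher."""
    s = projector(student_feat)               # (B, N, C_t) projected student tokens
    t = teacher_tokens.detach()               # teacher parameters stay fixed
    return ((s - t) ** 2).sum(dim=-1).mean()  # mean squared-L2 distance over tokens
```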
- 4. The gradient-aware self-supervised monocular depth estimation method for the dynamic scene according to claim 3, wherein the semantic projector comprises a multi-layer perceptron, a SiLU activation function layer, and an L2 normalization layer connected in sequence; the student feature extracted from the corresponding $k$-th layer of the encoder of the monocular depth estimation model is input to the semantic projector and undergoes, in order, a bilinear interpolation operation, a flattening operation, multi-layer perceptron processing, the SiLU activation function, and L2 normalization:
  $$\psi\!\left(F_S^{(k)}\right) = \mathrm{Norm}_{L2}\!\left( \sigma\!\left( \mathrm{MLP}\!\left( \mathrm{Flatten}\!\left( \mathrm{BI}\!\left( F_S^{(k)},\, (H_T, W_T) \right) \right) \right) \right) \right)$$
  where $\mathrm{BI}(\cdot)$ denotes the bilinear interpolation operation, $(H_T, W_T)$ is the spatial size of the teacher model's feature map, $\mathrm{Flatten}(\cdot)$ is the flattening operation, $\mathrm{MLP}(\cdot)$ is the multi-layer perceptron, $\sigma(\cdot)$ is the SiLU activation function layer, and $\mathrm{Norm}_{L2}(\cdot)$ is the L2 normalization layer.
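The projector of claim 4 translates directly into a small module. A minimal sketch follows; the hidden width and the two-linear-layer MLP are assumptions, since the claim only fixes the order of operations (bilinear interpolation, flattening, MLP, SiLU, L2 normalization):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticProjector(nn.Module):
    """Projects a student feature map into the teacher token space (claim 4)."""
    def __init__(self, c_student, c_teacher, teacher_hw):
        super().__init__()
        self.teacher_hw = teacher_hw  # (H_T, W_T): teacher feature map spatial size
        self.mlp = nn.Sequential(     # "multi-layer perceptron"; width is an assumption
            nn.Linear(c_student, c_teacher), nn.SiLU(), nn.Linear(c_teacher, c_teacher)
        )

    def forward(self, x):             # x: (B, C_s, H, W)
        x = F.interpolate(x, size=self.teacher_hw, mode="bilinear",
                          align_corners=False)   # bilinear interpolation
        x = x.flatten(2).transpose(1, 2)         # flatten to (B, N, C_s) tokens
        x = F.silu(self.mlp(x))                  # MLP processing, then SiLU
        return F.normalize(x, p=2, dim=-1)       # L2 normalization per token
```

For example, with a ResNet stage of 512 channels and a hypothetical ViT teacher of 768 channels on a 24×24 token grid, `SemanticProjector(512, 768, (24, 24))` maps a `(B, 512, H, W)` student map to `(B, 576, 768)` tokens; these numbers are illustrative only.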
- 5. The gradient-aware self-supervised monocular depth estimation method according to claim 1, wherein the pre-trained pseudo-depth prediction model comprises a Depth Anything V model, the encoder and the pose estimation network each comprise a ResNet backbone network, the decoder comprises a Monodepth network, and each intermediate combination layer comprises a 1×1 convolution layer; the gradient-aware dense skip connection structure comprises c rows of dense skip connections, wherein the output feature of each row's encoder is the feature obtained after the input image passes through that row's encoder; the dense skip connection structure of row e comprises the encoder of row e, c−e intermediate combination layers of row e, and the decoder of row e, for e = 1, 2, …, c; in the dense skip connection structure of row e, the input feature of the g-th intermediate combination layer of row e is a combination of the output feature of the encoder of row e, the output feature of the (g−1)-th intermediate combination layer of row e, and the output features of the corresponding intermediate combination layers of the adjacent rows; and the output features of row 1 are input into the decoder to obtain the corresponding predicted depth map.
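The machine translation of the dense wiring in claim 5 is ambiguous, but the role of each intermediate combination layer (a 1×1 convolution fusing concatenated same-resolution skip features, in the spirit of UNet++-style dense skips) is clear. A minimal sketch under that reading, with the UNet++-like topology as an assumption:

```python
import torch
import torch.nn as nn

class CombinationLayer(nn.Module):
    """1x1-convolution intermediate combination layer: concatenates the
    same-resolution skip features feeding it and fuses them channel-wise."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.fuse = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, skips):
        # skips: list of (B, C_i, H, W) features from the row's encoder and
        # from earlier combination layers wired into this node
        return self.fuse(torch.cat(skips, dim=1))
```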
- 6. A gradient-aware self-supervised monocular depth estimation device for a dynamic scene, comprising: a model construction module configured to construct a monocular depth estimation model, a pre-trained pseudo-depth prediction model, and a pose estimation network, and to train the monocular depth estimation model with the pre-trained pseudo-depth prediction model and the pose estimation network to obtain a trained monocular depth estimation model, wherein the monocular depth estimation model comprises an encoder, intermediate combination layers, and a decoder forming a gradient-aware dense skip connection structure, and the loss function used in training the monocular depth estimation model comprises a mask-weighted photometric loss, a geometric consistency loss, a global depth-ordering constraint loss, a normal matching loss, an edge relative-normal loss, and a multi-scale feature alignment loss.
  The mask-weighted photometric loss is constructed as follows: obtain a monocular video sequence and take the monocular images of two adjacent frames as a reference frame $I_t$ and a source frame $I_s$; input the reference frame and the source frame separately into the monocular depth estimation model to obtain a predicted depth map $D_t$ of the reference frame and a predicted depth map $D_s$ of the source frame; input the reference frame and the source frame into the pose estimation network to obtain the corresponding relative camera pose; obtain the camera intrinsics $K$; and construct the photometric consistency loss from the predicted depth map of the source frame, the relative camera pose, the camera intrinsics, and the reference frame. The photometric consistency loss is constructed as follows: input the predicted depth map of the source frame, the relative camera pose, and the camera intrinsics into a re-projection and differentiable bilinear sampling algorithm to obtain the corresponding reconstructed image $\hat{I}_t$; compute the photometric consistency loss from the reference frame and the reconstructed image as
  $$L_p^{\,i} = \frac{\alpha}{2}\left(1 - \mathrm{SSIM}^{\,i}\!\left(I_t, \hat{I}_t\right)\right) + (1-\alpha)\left\lVert I_t^{\,i} - \hat{I}_t^{\,i} \right\rVert_1$$
  where $L_p^{\,i}$ denotes the photometric consistency loss of the reference frame and the reconstructed image at the $i$-th valid pixel, $\alpha$ is a weight coefficient, $\mathrm{SSIM}^{\,i}(I_t,\hat{I}_t)$ is the structural similarity of the reference frame and the reconstructed image at the $i$-th valid pixel, $I_t^{\,i}$ and $\hat{I}_t^{\,i}$ are the pixel values of the reference frame and the reconstructed image at the $i$-th valid pixel, and $\lVert\cdot\rVert_1$ is the L1 norm.
  Re-project the predicted depth map of the reference frame to the source frame coordinate system based on the relative camera pose, align it with the predicted depth map of the source frame, compute the pixel-level normalized depth inconsistency, generate a self-discovery mask from the pixel-level normalized depth inconsistency, and weight the photometric consistency loss with the self-discovery mask to obtain the mask-weighted photometric loss, computed as follows: re-project the predicted depth value of each pixel of the reference frame's predicted depth map into three-dimensional space and map it into the source frame coordinate system through the relative camera pose to obtain the reference frame's predicted depth map under the source frame coordinate system, then perform bilinear interpolation sampling on the source frame's predicted depth map with it to obtain the aligned depth map of the source frame:
  $$D_{t\to s}^{\,i} = \phi\!\left(T_{t\to s},\, D_t^{\,i},\, K\right)$$
  $$D_s'^{\,i} = \left\langle D_s \right\rangle^{\,i}$$
  where $T_{t\to s}$ is the relative camera pose from the reference frame to the source frame, $D_t^{\,i}$ is the predicted depth value of the $i$-th valid pixel on the reference frame's predicted depth map, $\phi(\cdot)$ denotes the re-projection and mapping operations, $D_{t\to s}^{\,i}$ is the predicted depth value of the $i$-th valid pixel of the reference frame's predicted depth map under the source frame coordinate system, $\langle\cdot\rangle$ denotes the bilinear interpolation sampling operation, $D_s^{\,i}$ is the predicted depth value of the $i$-th valid pixel on the source frame's predicted depth map, and $D_s'^{\,i}$ is the predicted depth value of the $i$-th valid pixel on the aligned depth map of the source frame.
  Compute the pixel-level normalized depth inconsistency between the source frame's predicted depth map and the reference frame's predicted depth map as
  $$D_{\mathrm{diff}}^{\,i} = \frac{\left| D_{t\to s}^{\,i} - D_s'^{\,i} \right|}{D_{t\to s}^{\,i} + D_s'^{\,i}}$$
  where $D_{\mathrm{diff}}^{\,i}$ is the pixel-level normalized depth inconsistency at the $i$-th valid pixel. Compute the self-discovery mask from the pixel-level normalized depth inconsistency as
  $$M^{\,i} = 1 - D_{\mathrm{diff}}^{\,i}$$
  where $M^{\,i}$ is the self-discovery mask at the $i$-th valid pixel. Weight and average the photometric consistency loss with the self-discovery mask to obtain the mask-weighted photometric loss:
  $$L_{\mathrm{ph}} = \frac{1}{|V|} \sum_{i\in V} M^{\,i}\, L_p^{\,i}$$
  where $V$ is the set of valid pixels successfully aligned during the projection and interpolation process, $|V|$ is the total number of valid pixels in $V$, and $L_{\mathrm{ph}}$ is the mask-weighted photometric loss. The geometric consistency loss is computed as
  $$L_{\mathrm{geo}} = \frac{1}{|V|} \sum_{i\in V} D_{\mathrm{diff}}^{\,i}$$
  where $L_{\mathrm{geo}}$ is the geometric consistency loss.
  The device further comprises a prediction module configured to acquire a single frame image to be estimated, input it into the trained monocular depth estimation model, and obtain the corresponding predicted depth map through the processing of the gradient-aware dense skip connection structure.
- 7. An electronic device, comprising: one or more processors; and storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-5.
- 8. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-5.
- 9. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-5.
Description
Gradient-aware self-supervised monocular depth estimation method and device for dynamic scenes
Technical Field
The invention relates to the field of image processing, and in particular to a gradient-aware self-supervised monocular depth estimation method and device for dynamic scenes.
Background
Monocular depth estimation aims to recover the geometric depth of a scene from a single RGB image. Traditional supervised methods rely on costly depth ground truth such as LiDAR or structured light; data acquisition and labeling are expensive, and cross-scene generalization is limited. To reduce the dependence on ground truth, self-supervised monocular depth estimation typically exploits the geometric consistency between adjacent frames: differentiable view synthesis is performed from the predicted depth and camera pose, with the photometric reconstruction error as the training signal. However, the self-supervised framework generally rests on the assumption of a static, rigid scene. When dynamic objects such as vehicles and pedestrians, or occlusion and disocclusion phenomena, are present, the cross-frame correspondence cannot be explained by a rigid pose, so the photometric reconstruction error contains many residuals unrelated to the true geometry. If these residuals are back-propagated directly, they form incorrect supervision, causing severe distortion of the depth in dynamic regions and even unstable training overall.
On the other hand, the conventional encoder-decoder structure easily introduces interpolation smoothing and spatial misalignment when the decoder recovers resolution by upsampling; in regions of sharp depth gradient change, such as object boundaries and occlusion boundaries, the prediction often shows boundary blurring, depth discontinuities, or local artifacts. During repeated downsampling, the encoder keeps aggregating high-level semantic information while losing low-level details, and conventional U-Net-style skip connections mostly use direct concatenation or simple addition, forcibly fusing features of different semantic levels and different receptive fields in the same decoding layer; this easily leads to semantic inconsistency and insufficient spatial alignment and weakens the role of skip connections in detail recovery. In addition, an external pre-trained depth model can provide single-frame pseudo-depth priors to compensate for the lack of supervision in dynamic regions, but the pseudo depth usually suffers from scale deviation, boundary blurring, and local noise; without a screening and constraint mechanism, direct strong supervision would inject noisy geometry into training and destroy geometric consistency.
Disclosure of the Invention
The application aims to provide a gradient-aware self-supervised monocular depth estimation method and device for dynamic scenes that address the above technical problems.
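The "view synthesis with photometric reconstruction error" described above is the standard differentiable warp. A generic PyTorch sketch follows (not the patent's own code; shapes and names are illustrative) that reconstructs the reference view from a source image given predicted depth, relative pose, and intrinsics:

```python
import torch
import torch.nn.functional as F

def inverse_warp(src_img, ref_depth, T_ref2src, K):
    """src_img: (B, 3, H, W); ref_depth: (B, 1, H, W);
    T_ref2src: (B, 4, 4) relative pose; K: (B, 3, 3) intrinsics."""
    b, _, h, w = src_img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float()
    pix = pix.view(1, 3, -1).to(src_img)                      # homogeneous pixel grid
    cam = torch.inverse(K) @ pix * ref_depth.view(b, 1, -1)   # back-project to 3D
    cam = torch.cat([cam, torch.ones(b, 1, h * w, device=cam.device)], 1)
    proj = K @ (T_ref2src @ cam)[:, :3]                       # move to source view, project
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)            # perspective divide
    u = uv[:, 0].view(b, h, w) / (w - 1) * 2 - 1              # normalize to [-1, 1]
    v = uv[:, 1].view(b, h, w) / (h - 1) * 2 - 1
    grid = torch.stack([u, v], dim=-1)                        # (B, H, W, 2) sample grid
    return F.grid_sample(src_img, grid, align_corners=True)   # bilinear sampling
```

The reconstructed image is then compared with the reference frame via an SSIM + L1 photometric loss, which is exactly the training signal the background paragraph describes.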
In a first aspect, the invention provides a gradient-aware self-supervised monocular depth estimation method for a dynamic scene, comprising the following steps: constructing a monocular depth estimation model, a pre-trained pseudo-depth prediction model, and a pose estimation network, and training the monocular depth estimation model with the pre-trained pseudo-depth prediction model and the pose estimation network to obtain a trained monocular depth estimation model. The monocular depth estimation model comprises an encoder, intermediate combination layers, and a decoder, and forms a gradient-aware dense skip connection structure; the loss function used in training the monocular depth estimation model comprises a mask-weighted photometric loss, a geometric consistency loss, a global depth-ordering constraint loss, a normal matching loss, an edge relative-normal loss, and a multi-scale feature alignment loss. The mask-weighted photometric loss is constructed by acquiring a monocular video sequence, taking two adjacent monocular images as a reference frame and a source frame, inputting the reference frame and the source frame separately into the monocular depth estimation model to obtain a predicted depth map of the reference frame and a predicted depth map of the source frame, inputting the reference frame and the source frame into the pose estimation network to obtain the corresponding relative camera pose, acquiring the camera intrinsics, constructing the photometric consistency loss from the predicted depth map of the source frame, the relative camera pose, the camera intrinsics, and the reference frame, re-projecting the predicted depth map of the reference frame to the source frame coordinate system based on the relative camera pose, and aligning it with the predicted depth map of the source frame