
US-12626381-B2 - Method for reducing error of depth estimation model, electronic device, and non-transitory storage medium


Abstract

A method for reducing the error of a depth estimation model comprises: obtaining a plurality of monocular images and point cloud data of each of the plurality of monocular images, wherein each of the plurality of monocular images comprises an object frame image and a reference frame image; reconstructing the object frame image to obtain a reconstructed frame image according to the reference frame image and a first depth estimation model; determining a reconstructed error between the object frame image and the reconstructed frame image; and obtaining an inertia probability of each pixel of the object frame image according to speed information of the point cloud data and pixel information of the object frame image. This application provides more accurate depth estimation results for dynamic scenes. An electronic device and a non-transitory storage medium recording the method are also disclosed.

Inventors

  • Tsung-Wei Liu
  • Chin-Pin Kuo

Assignees

  • Hon Hai Precision Industry Co., Ltd.

Dates

Publication Date
2026-05-12
Application Date
2023-07-21
Priority Date
2022-07-22

Claims (20)

  1. An image depth estimation method, comprising: obtaining a plurality of monocular images and point cloud data of each of the plurality of monocular images, wherein each of the plurality of monocular images comprises an object frame image and a reference frame image; reconstructing the object frame image to obtain a reconstructed frame image according to the reference frame image and a first depth estimation model; determining a reconstructed error between the object frame image and the reconstructed frame image; obtaining an inertia probability of each pixel of the object frame image according to speed information of the point cloud data and pixel information of the object frame image; labeling pixels whose inertia probabilities are higher than a preset threshold to obtain mask data; obtaining a loss function according to the reconstructed error and the mask data, and training the first depth estimation model according to the loss function to obtain a second depth estimation model; and using the second depth estimation model to obtain depth information of an input image and obtaining a predicted depth image of the input image according to the depth information of the input image and a reference image corresponding to the input image.
  2. The image depth estimation method of claim 1, wherein obtaining the inertia probability of each pixel of the object frame image according to the speed information of the point cloud data and the pixel information of the object frame image comprises: calculating an initial inertia probability of each pixel with respect to each point cloud according to the speed information of the point cloud data and the pixel information of the object frame image; and fusing the initial inertia probabilities of each pixel with respect to each point cloud by using a non-maximum suppression algorithm and selecting a maximum initial inertia probability as the inertia probability of each pixel.
  3. The image depth estimation method of claim 2, wherein a calculation formula of the initial inertia probability comprises: $P_r(x) = c(x, r)\,s(I(x), I(r))$; wherein $P_r(x)$ is the initial inertia probability of each pixel for different point clouds, x is each pixel, r is each point cloud, $I(x)$ is the color of each pixel x, $I(r)$ is the color of each point cloud r, and c and s are similarity functions (an illustrative sketch of this computation follows the claims).
  4. The image depth estimation method of claim 2, wherein reconstructing the object frame image to obtain the reconstructed frame image according to the reference frame image and the first depth estimation model comprises: estimating depth information of the object frame image according to the first depth estimation model; inputting the object frame image and the reference frame image into a preset pose estimation model to obtain camera pose change information between the object frame image and the reference frame image; and reconstructing the object frame image according to the depth information of the object frame image and the camera pose change information to obtain the reconstructed frame image corresponding to the object frame image.
  5. The image depth estimation method of claim 2, wherein determining the reconstructed error between the object frame image and the reconstructed frame image comprises: calculating a luminosity difference between the object frame image and the reconstructed frame image to obtain the reconstructed error.
  6. The image depth estimation method of claim 2, wherein obtaining a predicted depth image of the input image according to the depth information of the input image and a reference image corresponding to the input image comprises: using the second depth estimation model to reconstruct the input image to obtain the predicted depth image according to the reference image corresponding to the input image and the depth information of the input image.
  7. The image depth estimation method of claim 2, wherein an acquisition method of the point cloud data comprises: scanning the monocular image frames by using a lidar to obtain the point cloud data of the lidar.
  8. An electronic device, comprising: at least one processor; and a data storage storing one or more programs which, when executed by the at least one processor, cause the at least one processor to: obtain a plurality of monocular images and point cloud data of each of the plurality of monocular images, wherein each of the plurality of monocular images comprises an object frame image and a reference frame image; reconstruct the object frame image to obtain a reconstructed frame image according to the reference frame image and a first depth estimation model; determine a reconstructed error between the object frame image and the reconstructed frame image; obtain an inertia probability of each pixel of the object frame image according to speed information of the point cloud data and pixel information of the object frame image; label pixels whose inertia probabilities are higher than a preset threshold to obtain mask data; obtain a loss function according to the reconstructed error and the mask data, and train the first depth estimation model according to the loss function to obtain a second depth estimation model; and use the second depth estimation model to obtain depth information of an input image and obtain a predicted depth image of the input image according to the depth information of the input image and a reference image corresponding to the input image.
  9. The electronic device of claim 8, wherein obtaining the inertia probability of each pixel of the object frame image according to the speed information of the point cloud data and the pixel information of the object frame image comprises: calculating an initial inertia probability of each pixel with respect to each point cloud according to the speed information of the point cloud data and the pixel information of the object frame image; and fusing the initial inertia probabilities of each pixel with respect to each point cloud by using a non-maximum suppression algorithm and selecting a maximum initial inertia probability as the inertia probability of each pixel.
  10. The electronic device of claim 9, wherein a calculation formula of the initial inertia probability comprises: $P_r(x) = c(x, r)\,s(I(x), I(r))$; wherein $P_r(x)$ is the initial inertia probability of each pixel for different point clouds, x is each pixel, r is each point cloud, $I(x)$ is the color of each pixel x, $I(r)$ is the color of each point cloud r, and c and s are similarity functions.
  11. The electronic device of claim 9, wherein reconstructing the object frame image to obtain the reconstructed frame image according to the reference frame image and the first depth estimation model comprises: estimating depth information of the object frame image according to the first depth estimation model; inputting the object frame image and the reference frame image into a preset pose estimation model to obtain camera pose change information between the object frame image and the reference frame image; and reconstructing the object frame image according to the depth information of the object frame image and the camera pose change information to obtain the reconstructed frame image corresponding to the object frame image.
  12. The electronic device of claim 9, wherein determining the reconstructed error between the object frame image and the reconstructed frame image comprises: calculating a luminosity difference between the object frame image and the reconstructed frame image to obtain the reconstructed error.
  13. The electronic device of claim 9, wherein obtaining a predicted depth image of the input image according to the depth information of the input image and a reference image corresponding to the input image comprises: using the second depth estimation model to reconstruct the input image to obtain the predicted depth image according to the reference image corresponding to the input image and the depth information of the input image.
  14. The electronic device of claim 9, wherein an acquisition method of the point cloud data comprises: scanning the monocular image frames by using a lidar to obtain the point cloud data of the lidar.
  15. A non-transitory computer readable medium having stored thereon instructions that, when executed by a processor of an electronic device, cause the electronic device to perform an image depth estimation method, the image depth estimation method comprising: obtaining a plurality of monocular images and point cloud data of each of the plurality of monocular images, wherein each of the plurality of monocular images comprises an object frame image and a reference frame image; reconstructing the object frame image to obtain a reconstructed frame image according to the reference frame image and a first depth estimation model; determining a reconstructed error between the object frame image and the reconstructed frame image; obtaining an inertia probability of each pixel of the object frame image according to speed information of the point cloud data and pixel information of the object frame image; labeling pixels whose inertia probabilities are higher than a preset threshold to obtain mask data; obtaining a loss function according to the reconstructed error and the mask data, and training the first depth estimation model according to the loss function to obtain a second depth estimation model; and using the second depth estimation model to obtain depth information of an input image and obtaining a predicted depth image of the input image according to the depth information of the input image and a reference image corresponding to the input image.
  16. The non-transitory computer readable medium of claim 15, wherein obtaining the inertia probability of each pixel of the object frame image according to the speed information of the point cloud data and the pixel information of the object frame image comprises: calculating an initial inertia probability of each pixel with respect to each point cloud according to the speed information of the point cloud data and the pixel information of the object frame image; and fusing the initial inertia probabilities of each pixel with respect to each point cloud by using a non-maximum suppression algorithm and selecting a maximum initial inertia probability as the inertia probability of each pixel.
  17. The non-transitory computer readable medium of claim 16, wherein a calculation formula of the initial inertia probability comprises: $P_r(x) = c(x, r)\,s(I(x), I(r))$; wherein $P_r(x)$ is the initial inertia probability of each pixel for different point clouds, x is each pixel, r is each point cloud, $I(x)$ is the color of each pixel x, $I(r)$ is the color of each point cloud r, and c and s are similarity functions.
  18. The non-transitory computer readable medium of claim 16, wherein reconstructing the object frame image to obtain the reconstructed frame image according to the reference frame image and the first depth estimation model comprises: estimating depth information of the object frame image according to the first depth estimation model; inputting the object frame image and the reference frame image into a preset pose estimation model to obtain camera pose change information between the object frame image and the reference frame image; and reconstructing the object frame image according to the depth information of the object frame image and the camera pose change information to obtain the reconstructed frame image corresponding to the object frame image.
  19. The non-transitory computer readable medium of claim 16, wherein determining the reconstructed error between the object frame image and the reconstructed frame image comprises: calculating a luminosity difference between the object frame image and the reconstructed frame image to obtain the reconstructed error.
  20. The non-transitory computer readable medium of claim 16, wherein obtaining a predicted depth image of the input image according to the depth information of the input image and a reference image corresponding to the input image comprises: using the second depth estimation model to reconstruct the input image to obtain the predicted depth image according to the reference image corresponding to the input image and the depth information of the input image.
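A minimal Python sketch of the mask construction recited in claims 1-3, referenced in claim 3 above. The claims leave the similarity functions c and s unspecified, so Gaussian similarities over image-plane distance and color distance are assumed here; likewise, the claims do not spell out how the speed information enters beyond the formula, so this sketch assumes only lidar returns with measurable speed contribute. All parameter names and values (sigma_c, sigma_s, speed_floor, threshold) are illustrative, not taken from the patent.

```python
import numpy as np

def initial_inertia_probability(pixel_xy, pixel_rgb, cloud_xy, cloud_rgb,
                                sigma_c=8.0, sigma_s=0.1):
    """P_r(x) = c(x, r) * s(I(x), I(r)) for one pixel x against every return r.

    c and s are modeled as Gaussian similarities over image-plane distance and
    color distance; claims 3/10/17 only require that both be similarity functions.
    """
    d_space = np.linalg.norm(cloud_xy - pixel_xy, axis=1)    # distance to each projected return
    d_color = np.linalg.norm(cloud_rgb - pixel_rgb, axis=1)  # color distance to each return
    c = np.exp(-(d_space / sigma_c) ** 2)
    s = np.exp(-(d_color / sigma_s) ** 2)
    return c * s                                             # one P_r(x) per return r

def inertia_mask(pixels_xy, pixels_rgb, cloud_xy, cloud_rgb, cloud_speed,
                 speed_floor=0.5, threshold=0.6):
    """Fuse P_r(x) by taking the maximum over returns (claim 2), then label
    pixels whose fused probability exceeds the preset threshold (claim 1)."""
    moving = cloud_speed > speed_floor   # assumed reading: only returns with measurable speed matter
    mask = np.zeros(len(pixels_xy), dtype=bool)
    for i, (xy, rgb) in enumerate(zip(pixels_xy, pixels_rgb)):
        p = initial_inertia_probability(xy, rgb, cloud_xy[moving], cloud_rgb[moving])
        mask[i] = p.max(initial=0.0) > threshold
    return mask
```

In claim 2's terms, taking the per-pixel maximum is the non-maximum suppression step: every non-maximal P_r(x) is suppressed and only the strongest return contributes to the pixel's inertia probability.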

Description

TECHNICAL FIELD

The subject matter herein generally relates to computer vision.

BACKGROUND

Recovering the depth information of a 3D scene from sequentially collected 2D images is an important research topic in the field of computer vision. Monocular depth estimation is an important method for understanding the geometric relationships of 3D scenes. It refers to the process of obtaining the depth data corresponding to a picture or a video by processing images taken by a monocular camera. Video captured by a monocular camera is called monocular video. When shooting a monocular video, there may be differences between adjacent frames due to uncontrollable factors, such as shaking of the camera, object movement in the shooting scene, and noise. These factors can lead to large jitter in monocular depth estimation of the monocular video, and the depth data of two adjacent video frames can be quite different.

To suppress this jitter, current deep learning methods for monocular depth estimation mainly use the SFM (Structure From Motion) principle: given images from different times and perspectives, the model infers object depth and reconstructs the image of the object perspective from a reference image. When the depth estimation error is low, the reconstructed image is close to the original object image. However, the similarity of the reconstructed image cannot accurately represent the degree of depth error in scenes with moving objects, because moving objects do not conform to SFM's viewpoint pose transformation and cannot be correctly reconstructed. Existing technology cannot completely filter out moving objects when training a monocular depth estimation model, which lowers the accuracy of the model's depth estimation and prevents the model parameters from being optimized.

BRIEF DESCRIPTION OF THE DRAWINGS

Implementations of the present disclosure will now be described, by way of embodiments, with reference to the attached figures.

FIG. 1 is an application scenario diagram of an embodiment of a method for reducing the error of a depth estimation model.
FIG. 2 is a flowchart of an embodiment of the method of FIG. 1.
FIG. 3 is a flowchart of an embodiment of a depth estimation method.
FIG. 4 is a block diagram of an embodiment of a device for reducing the error of a depth estimation model.
FIG. 5 is an architecture diagram of an electronic device.

DETAILED DESCRIPTION

It will be appreciated that for simplicity and clarity of illustration, where appropriate, reference numerals have been repeated among the different figures to indicate corresponding or analogous elements. In addition, numerous specific details are set forth in order to provide a thorough understanding of the embodiments described herein. However, it will be understood by those of ordinary skill in the art that the embodiments described herein can be practiced without these specific details. In other instances, methods, procedures, and components have not been described in detail so as not to obscure the relevant features being described. Also, the description is not to be considered as limiting the scope of the embodiments described herein. The drawings are not necessarily to scale, and the proportions of certain parts may be exaggerated to better illustrate details and features of the present disclosure.
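The SFM-style training signal described in the background can be summarized in code. The sketch below is a minimal illustration, assuming PyTorch tensors and a reconstruction already produced by warping the reference frame through the predicted depth and camera pose (the warping itself is omitted); it computes the per-pixel luminosity difference of claim 5 and excludes pixels flagged by the inertia mask, which is one plausible way the mask data and reconstructed error combine into the loss of claim 1. Function and tensor names are illustrative, not taken from the patent.

```python
import torch

def masked_photometric_loss(object_img, reconstructed_img, inertia_mask):
    """Mean luminosity (L1) difference between the object frame and its
    reconstruction, ignoring pixels labeled as likely moving.

    object_img, reconstructed_img: (B, 3, H, W) float tensors.
    inertia_mask: (B, 1, H, W) bool tensor, True where the fused inertia
    probability exceeded the preset threshold.
    """
    # Per-pixel luminosity difference (claim 5), averaged over color channels.
    error = (object_img - reconstructed_img).abs().mean(dim=1, keepdim=True)
    keep = (~inertia_mask).float()           # 1 for static pixels, 0 for masked ones
    return (error * keep).sum() / keep.sum().clamp(min=1.0)
```

Masking the loss this way means pixels on likely-moving objects never penalize the depth network for failing to satisfy a static-scene reconstruction they cannot satisfy, which is the error-reduction mechanism the abstract describes.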
It should be noted that references to “an” or “one” embodiment in this disclosure are not necessarily to the same embodiment, and such references mean “at least one”. Several definitions that apply throughout this disclosure will now be presented. The term “coupled” is defined as connected, whether directly or indirectly through intervening components, and is not necessarily limited to physical connections; the connection can be such that the objects are permanently connected or releasably connected. The term “comprising,” when utilized, means “including, but not necessarily limited to”; it specifically indicates open-ended inclusion or membership in the so-described combination, group, series, and the like.

The method provided by this embodiment is mainly applied to a dynamic environment containing dynamic objects. As shown in FIG. 1, a dynamic object can be an object whose position is P from the perspective of monocular camera O1 at the previous moment but P′ from the perspective of monocular camera O2 at the latter moment. The projection point of P′ from the perspective of monocular camera O2 is P3, so (P1, P3) is the feature point match of the dynamic object. This match differs from the perspective transformation pose relationship obtained from the feature point match (P1, P2) of a static object; because that pose relationship models only static objects, moving objects cannot be completely filtered out, and the model's accuracy suffers a large error.

FIG. 2 illustrates one exemplary embodiment of an image depth estimation method. The flowchart presents an exemplary embodiment of the method.
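As background to the FIG. 1 discussion above: a standard way to make the static-versus-dynamic distinction concrete is the epipolar constraint, under which matches on static structure satisfy x2ᵀ E x1 ≈ 0 for the essential matrix E of the camera motion, while a match such as (P1, P3) on a moving object generally does not. The sketch below illustrates that constraint only; the patent itself filters dynamic pixels through the inertia probability, not through this test.

```python
import numpy as np

def epipolar_residual(x1, x2, E):
    """|x2^T E x1| for one feature match in normalized camera coordinates.

    Near zero for a static match such as (P1, P2); typically large for a
    dynamic match such as (P1, P3) in FIG. 1, because the moving object does
    not conform to the cameras' pose transformation.
    """
    x1h = np.append(x1, 1.0)  # homogeneous coordinates
    x2h = np.append(x2, 1.0)
    return float(abs(x2h @ E @ x1h))
```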