CN-115908702-B - Multi-view three-dimensional reconstruction method and equipment
Abstract
The invention discloses a multi-view three-dimensional reconstruction method and device. The method extracts three-scale picture features of an input image using a two-dimensional UNet network, optimizes the feature sequences with a self-attention mechanism, performs the two-dimensional to three-dimensional conversion by homography transformation, and constructs an initial cost volume. The initial cost volume is input into a three-dimensional UNet network, and softmax and soft-argmin operations yield an initial depth map D_{k,i}, where k ∈ {1, 2, 3} denotes the three stages; the depth map generated in each stage serves as an additional input to the homography transformation of the next stage. The depth-information entropy of the initial probability volume serves as the weight for fusing the initial cost volumes, and the fused cost volume passes through the three-dimensional UNet network with softmax and soft-argmin operations to generate a depth map D_0. The invention improves the completeness and accuracy of multi-view three-dimensional reconstruction of objects.
Inventors
- Huai Hongqi
- Gao Rui
- Sha Jieyun
Assignees
- 南京六九零二科技有限公司
Dates
- Publication date: 2026-05-08
- Application date: 2022-11-01
Claims (6)
- 1. A multi-view three-dimensional reconstruction method, comprising the steps of: (1) extracting three-scale picture features of an input image using a two-dimensional UNet network, the three scales being denoted F_1, F_2 and F_3; (2) inputting the obtained three-scale feature sequences F_1, F_2 and F_3 into corresponding self-attention layers, computing self-attention scores, and generating new feature sequences F'_1, F'_2 and F'_3 from the self-attention scores; (3) mapping the feature sequences F'_1, F'_2 and F'_3 into the coordinate system of the reference view by homography transformation and constructing the corresponding initial cost volumes, the homography transformation being H_i(d) = K_i · R_i · (I − (t_1 − t_i) · n_1^T / d) · R_1^T · K_1^{−1}, where H_i(d) denotes the homography between the feature map of the i-th view and the reference feature map at hypothetical depth d; K_i, R_i and t_i respectively denote the camera intrinsics, rotation and translation of the i-th view; n_1 denotes the principal-axis parameter of the reference camera; K_1, R_1 and t_1 respectively denote the intrinsics, rotation and translation of the reference camera; I is the identity matrix; and the superscript T denotes transposition; (4) inputting the initial cost volume into a three-dimensional UNet network for regularization to obtain a cost volume C_{k,i}, performing a softmax operation to obtain a probability volume P_{k,i}, and performing a soft-argmin operation on the probability volume to obtain an initial depth map D_{k,i}, where k ∈ {1, 2, 3} denotes the three stages, the depth map D_{1,i} generated in the first stage serves as an additional input to the homography transformation in the second stage, the depth map D_{2,i} generated in the second stage serves as an additional input to the homography transformation in the third stage, and i denotes the i-th view; (5) computing the depth-information entropy of the initial probability volume, taking it as the uncertainty of each pixel at different depths, and fusing the initial cost volumes using the uncertainty as a weight, the uncertainty being expressed as U(s) = f(H(s)) with H(s) = −Σ_{d∈D} P(s, d) · log P(s, d), where U(s) denotes the uncertainty of pixel s at different depth values, f is a mapping function, H(s) denotes the information entropy of pixel s, P(s, d) denotes the probability that pixel s has the corresponding depth d, and D is the depth range; the cost volumes are fused as C = Σ_{k=1}^{3} w_k · C_k, where the weight w_k is derived from the uncertainty; (6) passing the fused cost volume through a three-dimensional UNet network with softmax and soft-argmin operations to generate a depth map D_0.
- 2. The multi-view three-dimensional reconstruction method according to claim 1, wherein the feature sequences F_1, F_2 and F_3 characterize scales corresponding to 1/8, 1/4 and 1/2 of the scale of the original image, respectively.
- 3. The multi-view three-dimensional reconstruction method according to claim 1, wherein step (2) comprises: letting F denote any one of F_1, F_2 and F_3, first inputting F into three different linear layers to obtain three different feature sequences Q, K and V, as shown in formula (1): Q = F · W_Q, K = F · W_K, V = F · W_V (1), where W_Q, W_K and W_V denote the weight factors of the linear layers, whose optimal values are obtained by training the network multiple times; then computing the self-attention score according to formula (2): A = softmax(Q · K^T / √d_k) (2), where d_k denotes the feature dimension and the self-attention score represents the correlation between features; and finally generating the new feature sequence F' from the self-attention score, as shown in formula (3): F' = A · V (3).
- 4. The multi-view three-dimensional reconstruction method according to claim 1, wherein the loss function of the overall network training is defined as L = Σ_{k=1}^{3} λ_k · L_k, where λ_k is the weight of the k-th stage's result error in the learning and training of the overall network architecture; D_{k,i} denotes the depth map obtained from the source map of the i-th view in the k-th stage; D̂_{k,i} is the ground-truth value of the depth map D_{k,i}; U_{k,i} is the uncertainty of the source feature map of the i-th view in the k-th stage; D̂_k denotes the ground-truth depth value of the reference feature map in the k-th stage; Δd denotes the discrete interval of depth sampling; and Ω denotes the valid pixels of the depth map whose depth estimates lie within the hypothesized range.
- 5. A computer device, comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and when executed by the processors implement the steps of the multi-view three-dimensional reconstruction method of any one of claims 1-4.
- 6. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the multi-view three-dimensional reconstruction method as claimed in any one of claims 1 to 4.
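The plane-sweep homography in step (3) of claim 1 can be sketched in code. Below is a minimal NumPy sketch assuming the standard MVSNet-style formulation H_i(d) = K_i · R_i · (I − (t_1 − t_i) · n_1^T / d) · R_1^T · K_1^{−1}; the function and parameter names are illustrative, not taken from the patent.

```python
import numpy as np

def homography(K_i, R_i, t_i, K_1, R_1, t_1, n_1, d):
    """Plane-sweep homography H_i(d) warping the reference feature map
    to the i-th source view for the fronto-parallel plane at depth d."""
    t_1 = np.asarray(t_1, float).reshape(3, 1)
    t_i = np.asarray(t_i, float).reshape(3, 1)
    n_1 = np.asarray(n_1, float).reshape(3, 1)
    # I - (t_1 - t_i) n_1^T / d encodes the hypothesized depth plane
    plane = np.eye(3) - (t_1 - t_i) @ n_1.T / d
    return K_i @ R_i @ plane @ R_1.T @ np.linalg.inv(K_1)
```

When the i-th view coincides with the reference view (K_i = K_1, R_i = R_1, t_i = t_1), the plane term collapses to the identity matrix and so does the homography, as expected.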
Description
Multi-view three-dimensional reconstruction method and equipment

Technical Field

The invention relates to the field of computer vision, and in particular to a multi-view three-dimensional reconstruction method and device.

Background

Three-dimensional reconstruction is one of the key technologies of environment perception and can be applied to automatic driving, virtual reality, moving-target monitoring, behavior analysis, security monitoring, and the like. From a computer-vision perspective, three-dimensional reconstruction aims to estimate a dense representation from overlapping images in a given image dataset, thereby recovering a geometric model of the corresponding object. Traditional three-dimensional reconstruction methods use hand-crafted similarity metrics and engineered regularization (e.g., normalized cross-correlation and semi-global matching) to compute dense correspondences and recover three-dimensional point clouds. While these methods achieve good results in the ideal case, they still face limitations: low-texture regions, occlusions, specular highlights and reflective areas in the scene make dense matching difficult, resulting in poor reconstruction completeness and robustness (see "Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. MVSNet: Depth inference for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2018.").
In recent years, the rapid development of deep-learning techniques and computer hardware has greatly advanced academic research on the three-dimensional reconstruction problem and produced encouraging results: methods based on deep convolutional neural networks (CNNs) have achieved the most advanced accuracy for three-dimensional reconstruction. These CNNs use encoder-decoder structures, in which the encoder extracts features from the picture data and fuses them with attention mechanisms. Three-dimensional reconstruction operates on pairs of source and reference images, and such image pairs vary in size, illumination and field of view; because convolutional feature extraction is limited by the receptive field of the convolution kernel, the generated feature maps differ considerably, which causes inconsistency in stereo matching and degrades reconstruction accuracy.

Disclosure of the Invention

Aiming at the problems in the prior art, the invention provides a multi-view three-dimensional reconstruction method and device that improve reconstruction accuracy.
The technical scheme is as follows: the multi-view three-dimensional reconstruction method comprises the following steps: (1) extracting three-scale picture features of an input image using a two-dimensional UNet network, the three scales being denoted F_1, F_2 and F_3; (2) inputting the obtained three-scale feature sequences F_1, F_2 and F_3 into corresponding self-attention layers, computing self-attention scores, and generating new feature sequences F'_1, F'_2 and F'_3 from the self-attention scores; (3) mapping the feature sequences F'_1, F'_2 and F'_3 into the coordinate system of the reference view by homography transformation and constructing the corresponding initial cost volumes; (4) inputting the initial cost volume into a three-dimensional UNet network for regularization to obtain a cost volume C_{k,i}, performing a softmax operation to obtain a probability volume P_{k,i}, and performing a soft-argmin operation on the probability volume to obtain an initial depth map D_{k,i}, where k ∈ {1, 2, 3} denotes the three stages, the depth map D_{1,i} generated in the first stage serves as an additional input to the homography transformation in the second stage, the depth map D_{2,i} generated in the second stage serves as an additional input to the homography transformation in the third stage, and i denotes the i-th view; (5) computing the depth-information entropy of the initial probability volume as the uncertainty of each pixel at different depths, and fusing the initial cost volumes using the uncertainty as a weight; (6) passing the fused cost volume through a three-dimensional UNet network with softmax and soft-argmin operations to generate a depth map D_0. Further, the feature sequences F_1, F_2 and F_3 characterize scales corresponding to 1/8, 1/4 and 1/2 of the scale of the original image, respectively.
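Steps (4)-(6) above can be sketched as follows: a softmax over the depth axis turns a regularized cost volume into a probability volume, a soft-argmin takes the expected depth, and the per-pixel depth-information entropy supplies fusion weights. This is a minimal NumPy sketch; the patent's exact mapping function f and weight formula are not reproduced here, so normalized inverse entropy is used as an illustrative weighting, and all names are hypothetical.

```python
import numpy as np

def soft_argmin_depth(cost, depths):
    """cost: (D, H, W) regularized cost volume; depths: (D,) hypotheses.
    Returns the probability volume and the expected-depth map."""
    e = np.exp(cost - cost.max(axis=0, keepdims=True))  # stable softmax
    prob = e / e.sum(axis=0, keepdims=True)
    depth_map = (prob * depths[:, None, None]).sum(axis=0)
    return prob, depth_map

def entropy_fuse(costs, probs):
    """Fuse per-stage cost volumes, weighting each stage by the inverse
    of its per-pixel depth-information entropy H(s) = -sum_d P log P."""
    ents = np.stack([-(p * np.log(p + 1e-8)).sum(axis=0) for p in probs])
    w = 1.0 / (ents + 1e-8)
    w /= w.sum(axis=0, keepdims=True)  # normalize weights over stages
    return sum(w[k] * costs[k] for k in range(len(costs)))
```

The fused volume would then pass through `soft_argmin_depth` once more to produce the final depth map D_0; a sharply peaked probability distribution (low entropy) receives a large fusion weight, matching the intent of step (5).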
Further, step (2) comprises: letting F denote any one of F_1, F_2 and F_3, first inputting F into three different linear layers to obtain three different feature sequences Q, K and V, as shown in formula (1): Q = F · W_Q, K = F · W_K, V = F · W_V (1), where W_Q, W_K and W_V denote the weight factors of the linear layers, whose optimal values are obtained by training the network multiple times; then computing the self-attention score according to formula (2): A = softmax(Q · K^T / √d_k) (2), where d_k denotes the feature dimension and the self-attention score represents the correlation between features; and finally generating a new feature sequence from the self-attention score, as shown in formula (3): F' = A · V (3).
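Formulas (1)-(3) above can be sketched as follows: a minimal NumPy sketch of single-head self-attention over a feature sequence. The names are illustrative; in the patent the weight matrices W_Q, W_K and W_V are learned by training the network.

```python
import numpy as np

def self_attention(F, W_Q, W_K, W_V):
    """F: (N, d) feature sequence.
    (1) Q = F W_Q, K = F W_K, V = F W_V
    (2) A = softmax(Q K^T / sqrt(d_k))
    (3) F' = A V"""
    Q, K, V = F @ W_Q, F @ W_K, F @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # each row sums to 1
    return A @ V
```

Each row of A weights how strongly one feature attends to every other feature, so the output sequence F' mixes features according to their pairwise correlation, which is what step (2) uses to refine F_1, F_2 and F_3.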