
CN-121505174-B - Robust single-frame structured light three-dimensional imaging method and system based on neural feature decoding

CN121505174B

Abstract

The invention discloses a robust single-frame structured light three-dimensional imaging method and system based on neural feature decoding, comprising a data generation module, a neural feature matching module and a depth optimization module. The method physically simulates structured light to construct a large-scale synthetic structured light data set; builds a neural feature matching module that obtains an initial depth map through neural feature matching; and designs a depth optimization module that injects the initial depth map into a vision foundation model as a geometric prompt and refines it on that basis to generate the final depth map, realizing single-frame structured light three-dimensional imaging. The method overcomes the low accuracy and poor robustness of three-dimensional imaging methods in complex real scenes such as strong ambient light, high reflectance and translucency, addresses the severe shortage of large-scale, high-quality training data that deep learning methods face when applied to the structured light field, and improves the accuracy and robustness of single-frame structured light three-dimensional imaging in complex scenes.

Inventors

  • CHEN WENZHENG
  • LI JIAHENG
  • DAI QIYU
  • CHEN BAOQUAN
  • SUN HE

Assignees

  • Peking University

Dates

Publication Date
20260505
Application Date
20251118

Claims (8)

  1. A single-frame structured light three-dimensional imaging method based on neural feature decoding, characterized by comprising the following steps: A) constructing a simulated structured light environment, physically simulating the structured light, and generating a large-scale synthetic structured light data set; B) constructing a neural feature matching module that obtains an initial depth map through neural feature matching, comprising: B1, receiving an externally input single-frame structured light projection pattern and a structured light infrared image captured by an infrared camera, and extracting multi-source neural features, wherein the infrared camera may be in a monocular or a binocular configuration; B2, constructing a multi-scale structured light cost volume between the left infrared camera image and the structured light projection pattern and/or between the left and right infrared camera images by computing the correlation between the local neural feature maps of the two images, forming a multi-level cost volume pyramid; B3, generating an initial depth map through iterative optimization: based on a recurrent neural network, iteratively predicting and refining the disparity from the cost volume and the contextual neural features to obtain a disparity map, from which the initial depth map is computed; C) designing a depth optimization module that injects the initial depth map into a vision foundation model as a geometric prompt and refines it on that basis to generate the final depth map; the depth optimization module comprises a pre-trained depth estimation vision foundation model and a prompt network consisting of several convolutional layers; the vision foundation model comprises a ViT encoder and a DPT decoder, the ViT encoder encodes high-level global information from the image, the DPT decoder infers image depth from the encoded high-level global information, the prompt network encodes the initial depth map as geometric prompt features and injects them into the DPT decoder, and the DPT decoder infers a high-precision final depth map from the high-level global information extracted by the ViT encoder and the geometric prompt features of the initial depth, realizing single-frame structured light three-dimensional imaging; the vision foundation model adopts Depth Anything V2, and designing the depth optimization module comprises the following steps: C1, extending the depth estimation vision foundation model Depth Anything V2: the model, pre-trained on massive image data and fine-tuned on structured light data, is adopted as the backbone network of the depth optimization module; it comprises a Transformer-based ViT image encoder and a DPT decoder, the ViT encoder extracting semantic high-level global information from the image; the decoder is extended to receive intermediate-layer encodings from the ViT encoder together with the initial depth map serving as a prompt, and to repair low-quality regions in the initial depth map; C2, inputting the left infrared image into the ViT encoder, extracting the intermediate-layer output features of the ViT encoder, and feeding them into the DPT decoder; taking the initial depth map as a strong geometric prompt and encoding it into geometric prompt features through the prompt network; inputting the intermediate-layer features extracted from the ViT encoder and the prompt features encoded from the initial depth map into the DPT decoder; C3, the DPT decoder outputs the final depth map: the DPT decoder fuses the monocular visual features of the infrared image from the ViT encoder with the geometric prompt features from the initial depth map, refines the initial depth map, and outputs the final depth map.
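The correlation cost volume of step B2 can be illustrated with a minimal NumPy sketch. This is not the patented implementation: the feature shapes, the horizontal shift-based disparity sampling, and the normalization by the square root of the channel count are all assumptions for illustration.

```python
import numpy as np

def correlation_cost_volume(feat_img, feat_pat, max_disp):
    """Build a 3-D cost volume of local feature correlations.

    feat_img, feat_pat: (C, H, W) neural feature maps of the captured
    infrared image and the projected pattern (hypothetical shapes).
    Returns a volume of shape (H, W, max_disp) whose entry (y, x, d) is
    the dot product between the image feature at (y, x) and the pattern
    feature at (y, x - d), scaled by 1/sqrt(C).
    """
    C, H, W = feat_img.shape
    cost = np.zeros((H, W, max_disp), dtype=np.float32)
    for d in range(max_disp):
        # shift the pattern features by d pixels along the scan line
        shifted = np.zeros_like(feat_pat)
        shifted[:, :, d:] = feat_pat[:, :, : W - d]
        cost[:, :, d] = (feat_img * shifted).sum(axis=0) / np.sqrt(C)
    return cost
```

In a binocular setup the same routine would correlate left and right infrared feature maps instead of image and pattern features.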
  2. The single-frame structured light three-dimensional imaging method based on neural feature decoding as claimed in claim 1, wherein in step B1, extracting multi-source neural features specifically means using a feature encoder in the neural feature matching module to extract high-dimensional neural feature maps from the input single-frame projection pattern image, the left infrared image and, under a binocular configuration, the right infrared image; under a monocular configuration, the same feature encoder extracts local neural features from the left infrared image and the structured light projection pattern; under a binocular configuration, in addition to extracting the local neural features of the left infrared image and the structured light pattern with that shared feature encoder, a second feature encoder with the same structure but different parameters extracts a second group of neural feature maps from the left and right infrared images respectively; for the contextual neural features, whether in a monocular or a binocular configuration, the same context encoder extracts multi-scale contextual neural features from the left infrared image for subsequent iterative optimization.
  3. The method of claim 1, wherein in the multi-scale structured light cost volume constructed in step B2, the cost volume encodes a dense matching relationship between the projection pattern and the captured image, and its elements represent the similarity of the local neural features of the left infrared image and the structured light projection pattern, or the similarity of the local neural features of the left and right infrared images.
  4. The single-frame structured light three-dimensional imaging method based on neural feature decoding according to claim 3, wherein the three-dimensional cost volume formed by the local neural feature map of the left infrared image and that of the structured light projection pattern is subjected to multi-level average pooling: the cost volume is average pooled 4 times along its last dimension, each pooling operation halving that dimension, and the cost volumes before and after each pooling are preserved, constructing a 4-level cost volume pyramid that captures multi-scale information; for binocular structured light, the same operation is performed on the second three-dimensional cost volume formed by the local neural feature maps of the left and right infrared images.
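The pyramid construction of claim 4 can be sketched as repeated average pooling of the disparity axis. One reading of the claim is that the original volume plus its successively pooled copies form the 4 levels; the sketch below follows that reading (the claim itself counts four pooling operations), and assumes the disparity dimension is even at every level.

```python
import numpy as np

def cost_pyramid(cost, levels=4):
    """Multi-level pyramid over the last (disparity) axis of a cost volume.

    cost: array of shape (H, W, D). Returns `levels` volumes; level 0 is
    the input itself, and each subsequent level halves the disparity axis
    by averaging adjacent pairs of entries.
    """
    pyramid = [cost]
    for _ in range(levels - 1):
        c = pyramid[-1]
        d = c.shape[-1] // 2
        # pair up adjacent disparity bins and average them
        pooled = c[..., : 2 * d].reshape(c.shape[0], c.shape[1], d, 2).mean(axis=-1)
        pyramid.append(pooled)
    return pyramid
```

Coarse levels give the iterative refiner a wide search radius cheaply, while the finest level preserves sub-bin matching detail.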
  5. The single-frame structured light three-dimensional imaging method based on neural feature decoding as claimed in claim 1, wherein in the iterative optimization of step B3, a recurrent neural network based on a convolutional gated recurrent unit predicts and refines the disparity to obtain a disparity map, and the initial depth map is computed from it using the triangulation principle.
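The triangulation step of claim 5 is the standard pinhole stereo relation depth = f * B / disparity. A minimal sketch (the parameter names and the epsilon guard are illustrative choices, not from the patent):

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Pinhole stereo triangulation: depth = f * B / disparity.

    disparity: disparity map in pixels; focal_px: focal length in pixels;
    baseline_m: camera-projector (or camera-camera) baseline in meters.
    eps guards against division by zero at unmatched pixels.
    """
    return focal_px * baseline_m / np.maximum(disparity, eps)
```

For example, with a 500 px focal length and a 5 cm baseline, a disparity of 50 px corresponds to a depth of 0.5 m.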
  6. The single-frame structured light three-dimensional imaging method based on neural feature decoding as claimed in claim 5, wherein the predicted disparity map is initialized to all zeros; in each iteration, correlation features are sampled from the cost volume pyramid according to the current disparity estimate and, combined with the contextual neural features, are fed into the convolutional gated recurrent unit to update its hidden state; the updated hidden state is passed through an output head to produce a disparity residual, which is added to the current disparity estimate to obtain the updated disparity estimate.
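The update loop of claim 6 can be caricatured per pixel. This is an illustration only: a real ConvGRU uses learned convolutional weight tensors, samples all pyramid levels, and reads the residual out through a trained output head, whereas here all weights are fixed scalars and only one volume is sampled.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def iterative_disparity(cost, context, iters=8, step=0.5):
    """Toy per-pixel GRU refinement loop (fixed unit weights, illustration only).

    cost: (H, W, D) cost volume; context: (H, W) contextual feature map.
    The disparity map starts at all zeros, as in claim 6, and is updated
    by a residual read out of the hidden state at each iteration.
    """
    H, W, D = cost.shape
    disp = np.zeros((H, W))
    h = np.zeros((H, W))  # GRU hidden state
    for _ in range(iters):
        # sample the cost volume at the rounded current disparity estimate
        idx = np.clip(np.round(disp).astype(int), 0, D - 1)[..., None]
        corr = np.take_along_axis(cost, idx, axis=-1)[..., 0]
        x = corr + context                # GRU input: correlation + context
        z = sigmoid(x + h)                # update gate
        r = sigmoid(x - h)                # reset gate
        h_tilde = np.tanh(x + r * h)      # candidate hidden state
        h = (1.0 - z) * h + z * h_tilde   # hidden-state update
        disp = disp + step * h            # "output head": residual update
    return disp
```

The key structural point the sketch preserves is that each iteration re-indexes the cost volume at the current estimate, so the lookup location moves as the disparity converges.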
  7. A single-frame structured light three-dimensional imaging system based on neural feature decoding implementing the method of claim 1, characterized by comprising a data generation module, a neural feature matching module and a depth optimization module, wherein: the data generation module generates a large-scale synthetic structured light data set through physical simulation; the neural feature matching module matches the input single-frame projection pattern with the scene infrared image captured by the camera in feature space, and generates an initial depth map through iterative optimization; the depth optimization module receives the initial depth map as a geometric prompt and, combining it with high-level visual features of the scene infrared image, refines it using a depth estimation vision foundation model that has been pre-trained and fine-tuned on structured light data, generating the final depth map.
  8. The system of claim 7, wherein the data set comprises diverse scenes, illumination, materials and projection patterns and provides corresponding ground-truth depths, and the neural feature matching module is configured to match the input single-frame projection pattern with the scene infrared image captured by the at least one camera in feature space.
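The prompt injection performed by the depth optimization module (claims 1 and 7) can be caricatured as additive feature fusion. The sketch below stands in a per-pixel affine map plus tanh for the patent's multi-layer convolutional prompt network, and assumes additive fusion into a decoder feature map; the real module operates inside the DPT decoder of Depth Anything V2.

```python
import numpy as np

def inject_geometric_prompt(decoder_feat, init_depth, w, b):
    """Stand-in for the prompt network: lift the initial depth map to the
    decoder's channel dimension and fuse it additively.

    decoder_feat: (H, W, C) DPT decoder feature map; init_depth: (H, W)
    initial depth from neural feature matching; w, b: (C,) hypothetical
    prompt-network parameters (the patent uses several conv layers).
    """
    prompt = np.tanh(init_depth[..., None] * w + b)  # (H, W, C) prompt features
    return decoder_feat + prompt
```

Additive injection lets the decoder treat the geometric prompt as a residual cue: where the initial depth is reliable it dominates, and elsewhere the monocular ViT features can override it.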

Description

Robust single-frame structured light three-dimensional imaging method and system based on neural feature decoding

Technical Field

The invention belongs to the technical field of computer vision and three-dimensional reconstruction, and in particular relates to a robust single-frame structured light three-dimensional imaging method and system based on deep-learning neural feature decoding.

Background

Structured light is an active 3D scanning technique: it projects specially designed light patterns (such as stripes, grids or speckles) onto an object, observes the deformation of these patterns on the object's surface with one or more cameras, and computes the object's three-dimensional shape and depth information from those deformations. Active structured light (SL) three-dimensional imaging is a key technology for acquiring three-dimensional information about an object. Single-frame structured light completes the reconstruction by projecting a single static coded pattern; it is efficient, adapts well to dynamic scenes, and is widely used in commercial three-dimensional imaging devices. Existing single-frame structured light techniques fall into two categories: traditional decoding based on pixel-domain matching, which remains the mainstream approach and is widely used in commercial devices, and decoding based on deep learning, which is still in an early exploratory phase. Traditional pixel-domain matching is the dominant approach in commercial systems (e.g., Intel RealSense D, Microsoft Kinect V1). Its core idea is to compute depth by matching the projection pattern against the pixel-intensity information of local image blocks (patches) in the infrared image captured by the camera.
However, such methods rely heavily on low-level, localized pixel information and are highly susceptible to interference in complex scenarios. When faced with missing surface texture, occlusion, or complex non-Lambertian materials such as reflective or transparent surfaces, pixel-level matching becomes highly unstable, resulting in poor decoding robustness and significantly degraded three-dimensional reconstruction quality. The application of deep-learning-based decoding to structured light is still at an early stage, limited mainly by data and by method design. On the data side, structured light lacks large-scale, high-quality public data sets; most existing work trains on small synthetic data or on real data without ground truth, leading to poor model generalization. On the method side, existing work (e.g., Yinda Zhang et al., ActiveStereoNet: End-to-end self-supervised learning for active stereo systems, ECCV 2018) ignores the rich spatial coding information carried by the known projection pattern during decoding, treating it merely as extra texture; the inherent advantage of structured light is therefore not fully exploited, and training is unstable, generalizes poorly, and is difficult to apply in practice. In summary, traditional single-frame structured light three-dimensional imaging has poor robustness and low accuracy in complex scenes, while existing deep-learning-based imaging methods generalize weakly owing to data shortage and model-design limitations.
Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a robust single-frame structured light three-dimensional imaging method and system based on neural feature decoding, together with a matched scheme for generating model training data, in order to overcome the low accuracy and poor robustness of traditional three-dimensional imaging methods in complex real scenes (such as weak texture, reflection and occlusion) and to address the bottleneck that deep learning methods severely lack large-scale, high-quality training data when applied to the structured light field. The invention designs a complete framework comprising large-scale synthetic data generation and neural decoding, upgrading the fragile pixel-domain matching of traditional structured light decoding to matching in a robust neural feature space. The method supports both a monocular hardware configuration (one infrared camera plus a projector) and a binocular one (two infrared cameras plus a projector). The method comprises the following steps: 1, physically simulating structured light with the open-source software Blender, and physically and programmatically rendering a large-scale, high-quality synthetic structured light data set containing diverse three-dimensional scenes, illumination