
CN-122023492-A - Depth estimation method with focus-cue-constrained scene semantics


Abstract

The invention discloses a depth estimation method with focus-cue-constrained scene semantics, belonging to the technical field of depth estimation. The method collects a multi-focus image sequence of the scene to be measured, extracts basic features with a pre-trained visual encoder, performs semantic and context modeling through a projection module and a space-time mixing module, constructs disparity and attention features through a parallax attention module, integrates multi-scale features through a compression module and a fusion module, and finally outputs a depth map through a depth regression module. The invention effectively merges physical imaging cues with high-level semantic priors and, under monocular and low-cost imaging conditions, achieves depth estimation with global structural consistency, local geometric fineness and strong environmental adaptability, making it suitable for integrated application in mobile terminals and consumer-grade equipment.
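
For orientation, the pipeline summarized above can be sketched as a single forward pass. The following Python/PyTorch sketch is illustrative only: it collapses the patent's four feature scales into one, every module is a minimal stand-in layer (plain convolutions and a mean over the focus axis), and none of the internals are taken from the disclosure.

```python
# Minimal single-scale sketch of the disclosed data flow; all module
# internals here are assumptions, not the patented design.
import torch
import torch.nn as nn

N, C, H, W = 8, 3, 256, 256              # multi-focus stack: N frames of H x W x C

class DepthFromFocusSketch(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.encoder = nn.Conv2d(C, dim, 3, stride=4, padding=1)   # stand-in for the pre-trained visual encoder
        self.project = nn.Conv2d(dim, dim, 1)                      # projection module: channel alignment
        self.mix = nn.Conv3d(dim, dim, 3, padding=1)               # space-time mixing over (frame, h, w)
        self.fuse = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1),
                                  nn.BatchNorm2d(dim), nn.ReLU())  # conv + norm + nonlinearity, as in step 7
        self.regress = nn.Sequential(nn.Conv2d(dim, 1, 3, padding=1),
                                     nn.ReLU())                    # depth regression head

    def forward(self, stack):                                  # stack: (N, C, H, W)
        f = self.project(self.encoder(stack))                  # per-frame features: (N, dim, H/4, W/4)
        f = self.mix(f.permute(1, 0, 2, 3).unsqueeze(0))       # joint space-time modeling: (1, dim, N, h, w)
        f = f.squeeze(0).mean(dim=1).unsqueeze(0)              # collapse the focus axis: (1, dim, h, w)
        return self.regress(self.fuse(f))                      # depth map: (1, 1, h, w)

depth = DepthFromFocusSketch()(torch.randn(N, C, H, W))
print(depth.shape)   # torch.Size([1, 1, 64, 64])
```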

Inventors

  • ZHANG JIANGFENG
  • QIAN YUHUA
  • YAN TAO
  • ZHANG MIN

Assignees

  • Shanxi University (山西大学)

Dates

Publication Date
2026-05-12
Application Date
2026-01-26

Claims (1)

  1. A depth estimation method with focus-cue-constrained scene semantics, comprising the following steps:

     Step 1: collect a multi-focus image sequence $I = \{I_1, I_2, \dots, I_N\}$ of the scene to be measured, where $N$ denotes the number of frames in the sequence and each frame has size $H \times W \times C$, with $H$ and $W$ the height and width of a single image and $C$ the number of image channels.

     Step 2: input the multi-focus image sequence $I$ obtained in Step 1 into a pre-trained visual encoder to obtain a four-scale basic feature sequence $\{F^s\}_{s=1}^{4}$ through formula (1), where the features at the $s$-th scale have size $H_s \times W_s \times C_s$:

     $\{F^s\}_{s=1}^{4} = E(I)$ (1)

     where $E$ denotes the pre-trained visual encoder.

     Step 3: input the basic feature sequence $\{F^s\}$ obtained in Step 2 sequentially into 4 projection modules, which perform nonlinear transformation and channel alignment through formula (2) to obtain the projection feature matrices $P^s$:

     $P^s = \mathrm{Proj}_s(F^s) = \phi(W_s F^s + b_s)$ (2)

     where $\mathrm{Proj}_s$ denotes the $s$-th projection module, $W_s$ and $b_s$ are the $s$-th pair of learnable parameters, and $\phi(\cdot)$ denotes the mapping operator.

     Step 4: input the projection feature matrices $P^s$ obtained in Step 3 sequentially into 4 space-time mixing modules, which perform semantic and context modeling through formula (3) to obtain the space-time semantic feature matrices $S^s$:

     $S^s = \mathrm{Mix}_s(P^s) = f_{\mathrm{spa}}(P^s) + f_{\mathrm{tmp}}(P^s)$ (3)

     where $\mathrm{Mix}_s$ denotes the $s$-th space-time mixing module, $f_{\mathrm{spa}}(\cdot)$ models the spatial neighborhood relations of $P^s$, and $f_{\mathrm{tmp}}(\cdot)$ models the temporal neighborhood relations of $P^s$.

     Step 5: input the space-time semantic feature matrices $S^s$ obtained in Step 4 sequentially into 4 parallax attention modules, which construct disparity features through formula (4) and attention features through formula (5) to obtain the parallax attention feature matrices $A^s$:

     $D_t^s = S_{t+1}^s - S_t^s$ (4)

     $A^s = \mathrm{Attn}(W_Q S^s, W_K D^s, W_V S^s)$ (5)

     where $S_t^s$ denotes the features of the $t$-th frame at the $s$-th scale, $D_t^s$ denotes the disparity information resulting from the inter-frame difference, $\mathrm{Attn}(\cdot)$ denotes the attention-based feature association operation, and $W_Q$, $W_K$ and $W_V$ are learnable parameter matrices.

     Step 6: input the parallax attention feature matrices $A^s$ obtained in Step 5 sequentially into 4 compression modules, which perform compression and scale reforming through formula (6) to obtain the compressed feature matrices $C^s$:

     $C^s = \mathrm{Comp}_s(A^s) = \mathrm{DSConv}_s(\mathrm{Up}_s(A^s))$ (6)

     where $\mathrm{Comp}_s$ denotes the $s$-th compression module, $\mathrm{Up}_s$ denotes the $s$-th up-sampling scale-reforming operation, and $\mathrm{DSConv}_s$ denotes the $s$-th depth-separable convolution layer.

     Step 7: input the compressed feature matrices $C^s$ obtained in Step 6 into the fusion module, which performs feature concatenation and joint modeling through formula (7) to obtain the fused feature matrix $U$:

     $U = \mathrm{Fuse}(\mathrm{Concat}(C^1, C^2, C^3, C^4))$ (7)

     where $\mathrm{Fuse}(\cdot)$ denotes the fusion module, composed of a convolution layer, a normalization layer and a nonlinear function, and $\mathrm{Concat}(\cdot)$ denotes feature concatenation along the channel dimension.

     Step 8: input the fused feature matrix $U$ obtained in Step 7 into the depth regression module and obtain the depth map $\hat{D}$ through formula (8):

     $\hat{D} = \mathrm{Reg}(U)$ (8)

     where $\mathrm{Reg}(\cdot)$ denotes the depth regression module, formed by serially connecting convolution layers with nonlinear activation functions.
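
As a concrete reading of Step 5 of the claim, the sketch below implements formulas (4) and (5) under stated assumptions: the disparity cue is the adjacent-frame feature difference (with the final difference repeated so every frame keeps one), and the attention operator $\mathrm{Attn}$ is taken to be standard scaled dot-product attention. The token layout (frames x spatial tokens x channels) is assumed, not specified by the claim.

```python
# Sketch of formulas (4)-(5): inter-frame differences as disparity cues,
# associated with the semantic features via scaled dot-product attention.
# The padding of the last difference and the token layout are assumptions.
import torch
import torch.nn.functional as F

def parallax_attention(S, W_q, W_k, W_v):
    """S: (N, L, d) space-time semantic features at one scale."""
    # Formula (4): D_t = S_{t+1} - S_t; repeat the final difference so the
    # disparity tensor keeps N entries.
    D = torch.cat([S[1:] - S[:-1], S[-1:] - S[-2:-1]], dim=0)        # (N, L, d)
    # Formula (5): queries and values from the semantic features, keys from
    # the disparity features.
    Q, K, V = S @ W_q, D @ W_k, S @ W_v
    A = F.softmax(Q @ K.transpose(-2, -1) / K.shape[-1] ** 0.5, dim=-1)
    return A @ V                                                     # (N, L, d)

N, L, d = 8, 32 * 32, 64                 # 8 frames, 32x32 spatial tokens
S = torch.randn(N, L, d)
W_q, W_k, W_v = (torch.randn(d, d) / d ** 0.5 for _ in range(3))
print(parallax_attention(S, W_q, W_k, W_v).shape)   # torch.Size([8, 1024, 64])
```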

Description

Depth estimation method with focus-cue-constrained scene semantics

Technical Field

The invention belongs to the technical field of depth estimation, and particularly relates to a depth estimation method with focus-cue-constrained scene semantics.

Background

Depth estimation is a key element of intelligent spatial perception, aimed at inferring the distance of scene points by analyzing visual information in images or image sequences. The technology provides an important geometric foundation for three-dimensional reconstruction, robot navigation, augmented reality and other applications. According to their implementation principles and hardware dependence, existing depth estimation methods can be divided into two main types: measurement-based and perception-based.

Measurement-based methods usually rely on a physical mechanism of actively transmitting and receiving signals to acquire depth directly. Typical representatives include structured-light imaging, laser scanning, white-light interferometry, millimeter-wave radar, and the like. These methods have a well-defined physical model and can obtain depth measurements with high precision. However, their disadvantages are also significant: they require complex and expensive dedicated hardware, their system volume and power consumption are large, and they are sensitive to environmental disturbances (e.g. glare, rain and fog); they are therefore limited in consumer electronics and large-scale deployment.

Perception-based methods rely primarily on common imaging sensors (e.g. cameras) and estimate depth indirectly by computing and analyzing image information. Common techniques include stereo matching, structure from motion, and focus/defocus-based methods. These methods have low hardware cost and are easy to integrate, making them better suited to mobile devices and embedded applications.

Among the many perception-based approaches, focus-cue-based methods and scene-semantics-based methods represent two different ways of exploiting information. Focus-cue-based methods use the change of image sharpness during the change of lens focal distance to infer depth. They can work under monocular conditions and are sensitive to local depth-of-field changes. However, their performance is susceptible to lack of scene texture, image noise, and complex lighting conditions, with poor stability in low-texture regions or low-contrast environments. Scene-semantics-based methods introduce high-level priors by identifying object categories, structures and spatial relations in the image (such as sky, ground, and occlusion relations between objects) to infer the approximate depth layout and relative relations. Such methods perform well on overall scale consistency but depend heavily on the training data distribution: for objects or scene structures not present in the training set, they generalize poorly and often fail to recover fine geometric details and surface relief.

In summary, the main challenge faced by existing perception-based depth estimation techniques is that methods based on physical imaging cues (such as focus) may have high local precision but lack global consistency and robustness, while methods based on semantic priors can understand the overall layout but easily lose details and have limited generalization capability.
How to effectively integrate these two complementary information sources and realize depth estimation with both global consistency and rich details on low-cost hardware remains a technical problem to be solved.

Disclosure of Invention

The invention aims to provide a depth estimation method with focus-cue-constrained scene semantics, so as to overcome the limitation caused by reliance on a single information source in existing perception-based depth estimation techniques. Specifically, the invention realizes effective synergy between local physical imaging information and high-level semantic priors by constructing an interactive fusion mechanism between focus cues and scene semantics, thereby obtaining, under monocular and low-cost imaging conditions, depth estimation results with global structural consistency, local geometric fineness and strong environmental adaptability.

The technical scheme adopted by the invention is a depth estimation method with focus-cue-constrained scene semantics, comprising the following steps:

Step 1: collect a multi-focus image sequence $I = \{I_1, I_2, \dots, I_N\}$ of the scene to be measured, where $N$ denotes the number of frames in the sequence and each frame has size $H \times W \times C$, with $H$ and $W$ the height and width of a single image and $C$ the number of image channels.

Step 2: input the multi-focus image sequence $I$ obtained in Step 1 into a pre-trained visual encoder to obtain a four-scale basic feature sequence $\{F^s\}_{s=1}^{4}$ through formula (1), where the features at the $s$-th scale have size $H_s \times W_s \times C_s$:

$\{F^s\}_{s=1}^{4} = E(I)$ (1)

where $E$ denotes the pre-trained visual encoder; ste
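
To make Steps 1 and 2 concrete, here is a minimal sketch that treats a torchvision ResNet-18 as the pre-trained visual encoder $E$ and reads the four scales off its four residual stages. The patent does not name a specific encoder, so the backbone, weights and feature nodes below are all assumptions.

```python
# Sketch of steps 1-2: a multi-focus stack through an assumed pre-trained
# encoder, yielding the four-scale basic feature sequence of formula (1).
import torch
from torchvision.models import resnet18
from torchvision.models.feature_extraction import create_feature_extractor

# Assumed encoder E: ImageNet-pretrained ResNet-18, with its four residual
# stages exposed as the four scales (strides 4, 8, 16, 32).
encoder = create_feature_extractor(
    resnet18(weights="IMAGENET1K_V1"),
    return_nodes={f"layer{s}": f"scale{s}" for s in range(1, 5)},
).eval()

stack = torch.randn(8, 3, 256, 256)      # step 1: N=8 frames, C=3, H=W=256
with torch.no_grad():
    feats = encoder(stack)               # step 2: {F^s} = E(I), formula (1)
for name, f in feats.items():
    print(name, tuple(f.shape))
# scale1 (8, 64, 64, 64) ... scale4 (8, 512, 8, 8)
```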