CN-121982086-A - Depth estimation method, device, equipment and program product for binocular image
Abstract
The present application relates to the field of image processing, and in particular to a depth estimation method, apparatus, device, and program product for binocular images. The method comprises: acquiring a binocular image and extracting multi-scale feature images from it; constructing an initial cost volume from the second-scale feature image; optimizing the initial cost volume using context information provided by the third-scale feature image to obtain an optimized cost volume; performing disparity regression on the optimized cost volume to obtain an initial disparity map; constructing correction features from the optimized cost volume; and iteratively refining the initial disparity map using the correction features, the first-scale feature image, and a preset disparity iteration module to obtain a depth estimation result for the binocular image. The method preserves the accuracy of the depth estimate for obstacles of interest while greatly improving depth estimation efficiency.
Inventors
- YAO BOWEN
Assignees
- 深圳市优必选科技股份有限公司
Dates
- Publication Date: 2026-05-05
- Application Date: 2025-12-22
Claims (10)
- 1. A depth estimation method for a binocular image, the method comprising: acquiring a binocular image, extracting multi-scale feature images of the binocular image, and constructing an initial cost volume from the second-scale feature image, wherein the cost volume comprises the initial cost volume and an optimized cost volume, the cost volume records the matching cost between each pixel in the binocular image and its corresponding pixel under different disparities, and the multi-scale feature images comprise a first-scale feature image, a second-scale feature image, and a third-scale feature image of successively decreasing resolution; optimizing the initial cost volume using context information provided by the third-scale feature image to obtain the optimized cost volume, and performing disparity regression on the optimized cost volume to obtain an initial disparity map; and constructing correction features from the optimized cost volume, and iteratively optimizing the initial disparity map using the correction features, the first-scale feature image, and a preset disparity iteration module to obtain a depth estimation result for the binocular image.
- 2. The method of claim 1, wherein constructing the initial cost volume from the second-scale feature image comprises: obtaining the feature channels of the second-scale feature image and grouping them into a plurality of feature channel groups; computing the matching costs of the pixels in each feature channel group under different disparities to obtain the matching cost of that feature channel group; and concatenating the matching costs of the feature channel groups to obtain the initial cost volume.
- 3. The method of claim 2, wherein computing the matching costs of the pixels in each feature channel group under different disparities to obtain the matching cost of that feature channel group comprises: determining a maximum disparity for the feature channel group; determining all disparities to be evaluated based on the maximum disparity, and translating the right-eye image of the binocular image according to each disparity to be evaluated; and determining the matching cost of the feature channel group from the cosine similarity or inner product of the left-eye and right-eye features at corresponding positions.
- 4. The method of claim 1, wherein the multi-scale feature images further comprise a fourth-scale feature image whose resolution is lower than that of the third-scale feature image, and wherein optimizing the initial cost volume using context information provided by the third-scale feature image to obtain the optimized cost volume comprises: performing convolution and downsampling on the initial cost volume to obtain a first encoded feature image, and performing attention fusion of the first encoded feature image with the third-scale feature image to obtain a first fused feature image; performing convolution and downsampling on the first fused feature image to obtain a second encoded feature image, and performing attention fusion of the second encoded feature image with the fourth-scale feature image to obtain a second fused feature image; upsampling the second fused feature image to obtain a first decoded image; concatenating the first decoded image with the first fused feature image to obtain a concatenated feature image; and convolving the concatenated feature image to obtain a convolved image, and performing attention fusion of the convolved image with the third-scale feature image to generate the optimized cost volume.
- 5. The method of claim 1, wherein constructing the correction features from the optimized cost volume comprises: downsampling the optimized cost volume to obtain a downsampled cost volume; determining a disparity optimization range centered on the disparity value of each pixel in the downsampled cost volume; determining candidate disparities within the disparity optimization range, and extracting cost slices from the downsampled cost volume according to the candidate disparities; and concatenating the cost slices to generate the correction features.
- 6. The method of claim 1, wherein iteratively optimizing the initial disparity map using the correction features, the first-scale feature image, and a preset disparity iteration module comprises: inputting the correction features, the initial disparity map, and the first-scale feature image into the disparity iteration module to generate a disparity residual map; determining a current disparity map from the disparity residual map and the initial disparity map, updating the initial disparity map with the current disparity map, and feeding it back into the disparity iteration module for further iteration; and when the number of iterations meets a preset requirement, taking the current disparity map and determining the depth estimation result from it.
- 7. The method of claim 6, wherein determining the depth estimation result from the current disparity map comprises: mapping the pixels of the current disparity map onto the first-scale feature image to determine their mapped points on the first-scale feature image; and determining the depth of each mapped point from the disparity in the current disparity map and predetermined binocular camera parameters, thereby obtaining the depth estimation result.
- 8. A depth estimation apparatus for binocular images, the apparatus comprising: an initial cost volume construction unit for acquiring a binocular image, extracting multi-scale feature images of the binocular image, and constructing an initial cost volume from the second-scale feature image, wherein the cost volume comprises the initial cost volume and an optimized cost volume, the cost volume records the matching cost between each pixel in the binocular image and its corresponding pixel under different disparities, and the multi-scale feature images comprise a first-scale feature image, a second-scale feature image, and a third-scale feature image of successively decreasing resolution; an initial disparity map determination unit for optimizing the initial cost volume using context information provided by the third-scale feature image to obtain the optimized cost volume, and performing disparity regression on the optimized cost volume to obtain an initial disparity map; and a depth estimation result determination unit for constructing correction features from the optimized cost volume, and iteratively optimizing the initial disparity map using the correction features, the first-scale feature image, and a preset disparity iteration module to obtain a depth estimation result for the binocular image.
- 9. A depth estimation device for binocular images, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, causes the device to implement the method of any one of claims 1-7.
- 10. A computer program product comprising computer program instructions which, when executed, cause the method of any one of claims 1-7 to be performed.
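Claims 6 and 7 end by converting the final disparity map into depth using predetermined binocular camera parameters. A minimal sketch of that last step, assuming a rectified pinhole stereo rig where the only parameters needed are the focal length (in pixels) and the baseline (in meters), is shown below; the parameter values are hypothetical, not taken from the patent.

```python
import numpy as np

def disparity_to_depth(disparity, focal_length_px, baseline_m, eps=1e-6):
    """Convert a disparity map (in pixels) to a depth map (in meters)
    using the standard stereo relation: depth = focal * baseline / disparity.
    The eps floor guards against division by zero at invalid pixels."""
    return (focal_length_px * baseline_m) / np.maximum(disparity, eps)

# Hypothetical rig: 700 px focal length, 0.12 m baseline.
disp = np.array([[70.0, 35.0],
                 [14.0,  7.0]])
depth = disparity_to_depth(disp, focal_length_px=700.0, baseline_m=0.12)
# e.g. a pixel with disparity 70 px maps to depth 700 * 0.12 / 70 = 1.2 m
```

Note the inverse relation: halving the disparity doubles the estimated depth, which is why the iterative disparity refinement in claim 6 matters most for distant (small-disparity) obstacles.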
Description
Depth estimation method, device, equipment and program product for binocular image

Technical Field

The present application relates to the field of image processing, and in particular to a depth estimation method, apparatus, device, and program product for binocular images.

Background

Binocular depth perception is a stereoscopic vision technique based on bionic principles: it obtains high-precision depth information or dense point-cloud data by simulating the disparity mechanism of human eyes. Unlike active sensing technologies such as LiDAR, millimeter-wave radar, or ToF (Time of Flight) cameras, a binocular system achieves three-dimensional perception from natural light alone, without actively emitting signals. It therefore offers a simple hardware structure, low system cost, and high ranging accuracy, and is widely used in robotics, autonomous driving, and augmented reality. In practical navigation and positioning applications, the depth of a scene often needs to be perceived quickly so that the system can respond to it in time. However, current depth estimation methods for binocular images typically incur high latency in order to guarantee the accuracy of the depth estimation result, which makes it difficult to meet the real-time requirements of navigation and positioning.

Disclosure of Invention

In view of this, embodiments of the present application provide a depth estimation method, apparatus, device, and program product for binocular images, to address the prior-art problem that guaranteeing the accuracy of the depth estimation result typically incurs high latency, which is unfavorable to the real-time requirements of navigation and positioning.
A first aspect of the embodiments of the present application provides a depth estimation method for a binocular image, the method comprising: acquiring a binocular image, extracting multi-scale feature images of the binocular image, and constructing an initial cost volume from the second-scale feature image, wherein the cost volume comprises the initial cost volume and an optimized cost volume, the cost volume records the matching cost between each pixel in the binocular image and its corresponding pixel under different disparities, and the multi-scale feature images comprise a first-scale feature image, a second-scale feature image, and a third-scale feature image of successively decreasing resolution; optimizing the initial cost volume using context information provided by the third-scale feature image to obtain the optimized cost volume, and performing disparity regression on the optimized cost volume to obtain an initial disparity map; and constructing correction features from the optimized cost volume, and iteratively optimizing the initial disparity map using the correction features, the first-scale feature image, and a preset disparity iteration module to obtain a depth estimation result for the binocular image. With reference to the first aspect, in a first possible implementation of the first aspect, constructing the initial cost volume from the second-scale feature image comprises: obtaining the feature channels of the second-scale feature image and grouping them into a plurality of feature channel groups; computing the matching costs of the pixels in each feature channel group under different disparities to obtain the matching cost of that feature channel group; and concatenating the matching costs of the feature channel groups to obtain the initial cost volume.
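The grouped construction described above (split the feature channels into groups, score each group at every candidate disparity, then concatenate) can be sketched as a group-wise correlation over shifted right-view features. This is a minimal NumPy illustration under assumed conventions (inner-product cost, channel count divisible by the group count, zero cost where the shifted right image has no data); the actual network may differ in normalization and layout.

```python
import numpy as np

def groupwise_cost_volume(left_feat, right_feat, max_disp, num_groups):
    """Build an initial cost volume by group-wise correlation.

    left_feat, right_feat: (C, H, W) feature maps, C divisible by num_groups.
    Returns a (num_groups, max_disp, H, W) volume where entry (g, d, y, x)
    is the mean inner product of group g's channels between the left pixel
    (y, x) and the right pixel (y, x - d).
    """
    C, H, W = left_feat.shape
    gc = C // num_groups
    lf = left_feat.reshape(num_groups, gc, H, W)
    rf = right_feat.reshape(num_groups, gc, H, W)
    volume = np.zeros((num_groups, max_disp, H, W), dtype=left_feat.dtype)
    for d in range(max_disp):
        if d == 0:
            volume[:, d] = (lf * rf).mean(axis=1)
        else:
            # Translating the right-eye features by d pixels before matching.
            volume[:, d, :, d:] = (lf[:, :, :, d:] * rf[:, :, :, :-d]).mean(axis=1)
    return volume

# Tiny demo: 4 channels, 2 groups, 2x3 spatial grid, 2 candidate disparities.
left = np.ones((4, 2, 3), dtype=np.float32)
right = np.ones((4, 2, 3), dtype=np.float32)
volume = groupwise_cost_volume(left, right, max_disp=2, num_groups=2)
# volume.shape == (2, 2, 2, 3): one cost plane per (group, disparity) pair
```

Concatenating the per-group planes along the group axis is what yields the initial cost volume; grouping keeps the volume far smaller than a full per-channel concatenation, which is consistent with the patent's emphasis on efficiency.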
With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, computing the matching costs of the pixels in each feature channel group under different disparities to obtain the matching cost of that feature channel group comprises: determining a maximum disparity for the feature channel group; determining all disparities to be evaluated based on the maximum disparity, and translating the right-eye image of the binocular image according to each disparity to be evaluated; and determining the matching cost of the feature channel group from the cosine similarity or inner product of the left-eye and right-eye features at corresponding positions. With reference to the first aspect, in a third possible implementation of the first aspect, the multi-scale feature images further comprise a fourth-scale feature image whose resolution is lower than that of the third-scale feature image; optimizing the initial cost volume using context information provided by the third-scale feature image to obtain the optimized cost volume comprises: performing convolution and downsampling on the initial cost volume to obtain a first encoded feature image, and performing attention fusion of the f
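The second possible implementation above offers cosine similarity as an alternative to the raw inner product for the per-position matching score. A minimal sketch of a cosine-similarity matching cost for a single pair of feature vectors is given below; treating the cost as one minus the similarity is an illustrative convention, not necessarily the patent's exact formulation.

```python
import numpy as np

def cosine_matching_cost(left_vec, right_vec, eps=1e-8):
    """Matching cost between left and right feature vectors at
    corresponding positions, from their cosine similarity. Identical
    directions give cost ~0; orthogonal features give cost ~1."""
    sim = np.dot(left_vec, right_vec) / (
        np.linalg.norm(left_vec) * np.linalg.norm(right_vec) + eps)
    return 1.0 - sim

# Demo vectors: a matches itself perfectly; b is orthogonal to a.
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
cost_match = cosine_matching_cost(a, a)
cost_mismatch = cosine_matching_cost(a, b)
```

Unlike the inner product, the cosine form is invariant to the magnitude of the feature vectors, which can make the cost volume less sensitive to local contrast differences between the two views.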