CN-115294182-B - High-precision stereo matching method based on double-crossover attention mechanism
Abstract
The invention provides a high-precision stereo matching method based on a double-cross attention mechanism. An hourglass-shaped network with shared parameters extracts features from the rectified left and right views for computing the matching cost. The left-view features are the input from which the Key of the attention mechanism is computed, the right-view features are the input for the Query, and the volume formed by channel-wise concatenation of the left- and right-view features is the input for the Value. The Key and Query determine the weights applied to the Value, which yields the matching cost. A multi-scale cost aggregation network built mainly from 2D convolutions then produces a disparity map estimate, and finally a residual correction network based on a left-right consistency check restores the disparity map to the input resolution and refines the matching details. The method avoids manually pre-specifying the disparity range: when the disparity range changes, disparity estimates over a larger range can be obtained without retraining the network, which makes disparity estimation more convenient.
Inventors
- LIU RONGKE
- SUN SHUQIAO
- SUN SHANTONG
Assignees
- Beihang University (北京航空航天大学)
Dates
- Publication Date: 20260505
- Application Date: 20220425
Claims (6)
- 1. A high-precision stereo matching method based on a double-crossover attention mechanism, characterized by comprising the following steps:
  Step 1: process the input data, dividing it as needed into a training set and a test set, the training set being further divided into a training set and a validation set; the validation set has ground-truth disparity values that are used only for evaluation and do not take part in the loss calculation during training.
  Step 2: compute features with an hourglass-shaped (funnel-shaped) feature extraction network.
  Step 3: concatenate the output of the first layer and the outputs of the last three layers of the feature pyramid and use the result as the input of the cost calculation module, which is built from the double-cross attention mechanism; because matching points in the left and right views can only lie on the same epipolar line, the cost calculation only requires one-dimensional matching.
  Step 4: the cost aggregation part adopts a multi-scale cost aggregation structure based on 2D convolution, comprising two aggregation modes, intra-scale and inter-scale; during cost aggregation the two modes are used alternately and share the aggregation result.
  Step 4a: perform intra-scale aggregation on the multi-scale input. The aggregation uses two kinds of neighborhood positions, a fixed neighborhood and an adaptive neighborhood. The fixed neighborhood consists of the center position and its 8 neighbors at a displacement of 1, 9 sampled points in total. The displacements of the adaptive neighborhood are not fixed but are computed by a neural network, with 9 points sampled along the epipolar line. The aggregation over the sampling points is jointly determined by shared parameters $w_l$ and specific parameters $m_l$, where the subscript $l$ indexes the sampling points. The cost aggregation formula is:
  $$\tilde{C}(i, p) = \sum_{l=1}^{L} w_l \cdot C(i,\, p + p_l + \Delta p_l) \cdot m_l$$
  where $L$ is the total number of sampling points, $\tilde{C}$ is the output cost matrix, $p$ is the position, $i$ is the channel index, $p_l$ is the fixed coordinate offset, and $\Delta p_l$ is the adaptive coordinate offset.
  Step 4b: perform inter-scale aggregation on the cost output of the previous step. When the target scale is smaller than the current scale, the cost matrix of the current scale is downsampled by convolution layers with a stride of 2 until its spatial size matches that of the target scale; when the target scale is larger than the current scale, it is upsampled by interpolation until its spatial size matches that of the target scale; otherwise no transformation is needed.
  Step 4c: alternate steps 4a and 4b several times, and finally complete cost aggregation with one convolution layer; the number of output feature channels of the last aggregation layer is 1, giving the estimated disparity map.
  Step 5: the resolution recovery and detail refinement module takes the estimated disparity map as input and recovers the resolution gradually through 3 successive resolution recovery modules. The left-view image, the right-view image, the left-view disparity estimate, the image obtained by warping the left-view image to the right view according to the left-view disparity estimate, and the difference between the right-view image and that warped image are concatenated along the channel dimension as the input of the structure. Each resolution recovery module consists of 4 ResNet modules and 1 deconvolution with a stride of 2; the output of each module passes through one convolution layer with a 1x1 kernel to give the disparity map residual at that resolution, and the residual is added to the disparity map estimate upsampled to the same resolution to obtain the disparity map at that resolution.
  Step 6: compute the loss between the estimated disparity map and the ground-truth disparity map with the loss function, update the network model parameters by back-propagation, and finally obtain fixed network model parameters for disparity map inference.
- 2. The high-precision stereo matching method based on the double-crossover attention mechanism according to claim 1, characterized in that step 2 further comprises writing the network structure code of the feature extraction part: the code of the feature extraction module is split into definitions of a downsampling unit, an upsampling unit and a pyramid unit; the layers within a unit cannot be shared when the unit is applied multiple times, so each such layer must be defined separately for every application; the left and right images share the parameters of the feature extraction part, which finally outputs the extracted left-image and right-image features.
- 3. The high-precision stereo matching method based on the double-cross attention mechanism according to claim 1 or 2, wherein step 6 further comprises: composing the loss function from a smooth L1 loss applied to the disparity maps at all scales; when computing the loss, the ground-truth disparity map is downsampled to the same spatial size as the estimated disparity map, and the per-scale losses are combined in a weighted sum whose weights reflect the importance of each scale and its number of pixels.
- 4. The high-precision stereo matching method based on the double-crossover attention mechanism according to claim 1 or 2, wherein step 1 further comprises:
  Step 1a: apply a vertical displacement and a rotation to the left and right images simultaneously; the displacement is chosen at random within a specified pixel range and the rotation within a specified angular range, the random values for displacement and angle following a uniform distribution on [0, 1).
  Step 1b: with 50% probability, apply a brightness transformation independently to the three RGB channels; all channel changes follow a uniform distribution on the interval (-20, 20), the transformation is applied to the value of every pixel, and with 20% probability different transformation values are applied to the left and right images.
  Step 1c: the algorithm then chooses, with 50% probability, between adding Gaussian white noise to the input and applying a brightness-contrast transformation to it; the Gaussian noise follows a Gaussian distribution with a mean of 0 and a variance in the range (10, 50), the variance being selected by a random value uniformly distributed on [0, 1); the brightness-contrast transformation changes the image brightness within the range (-0.2, 0.2); likewise, with 20% probability the left and right images receive different transformation values.
- 5. The high-precision stereo matching method based on the double-crossover attention mechanism according to claim 1 or 2, wherein step 2 further comprises:
  Step 2a: extract features from the input and compress the spatial size with 5 downsampling units, numbered l = {1, 2, 3, 4, 5}; each downsampling unit consists, in sequence, of a) a convolution layer with a stride of 2, a kernel size of 3x3 and a padding of 1, b) batch normalization, c) a Leaky ReLU with a slope of 0.1, d) a convolution layer with a stride of 1, a kernel size of 3x3 and a padding of 1, e) batch normalization, and f) a Leaky ReLU with a slope of 0.1.
  Step 2b: extract features and recover the spatial size with 2 upsampling units, numbered l = {6, 7}; each upsampling unit consists, in sequence, of a) a deconvolution layer with a stride of 2, a kernel size of 3x3 and a padding of 0, b) batch normalization, c) a Leaky ReLU with a slope of 0.1, d) a convolution layer with a stride of 1, a kernel size of 3x3 and a padding of 1, e) batch normalization, and f) a Leaky ReLU with a slope of 0.1.
  Step 2c: define a feature pyramid comprising a) one convolution layer with a stride of 1, a kernel size of 3x3 and a padding of 1, b) three parallel average pooling layers whose kernel size and stride are (1, 1), (2, 2) and (4, 4) respectively, c) three parallel convolution layers with a kernel size of 1x1 and a padding of 0, each following one of the pooling layers, d) three parallel per-channel batch normalization layers, each following one of these convolution layers, e) three parallel Leaky ReLU layers with a slope of 0.1, each following one of the normalization layers, and f) three parallel interpolation layers, each following one of the activation layers, which restore the spatial resolution to 1/8 of the original size.
- 6. The high-precision stereo matching method based on the double-crossover attention mechanism according to claim 1 or 2, wherein step 3 further comprises:
  Step 3a: obtain the key through 1 fully connected layer with the left-view features as input, the query through 1 fully connected layer with the right-view features as input, and the value through 1 fully connected layer with the concatenation of the left- and right-view features as input; if a multi-head attention mechanism is applied, the neurons of each fully connected layer are divided into $N_h$ parts, each applied to $1/N_h$ of the channels, realizing an $N_h$-head attention mechanism.
  Step 3b: compute the attention weight $\alpha_h$ from the computed key and query:
  $$\alpha_h = \mathrm{softmax}\left(\frac{Q_h K_h^{T}}{\sqrt{C_h}}\right), \qquad C_h = \frac{C}{N_h}$$
  where $Q$ denotes the query, $K$ the key and $T$ the transpose operation; $C_h$ is the number of feature channels of a single head, $C$ is the total number of feature channels, and $h$ is the index of the head.
  Step 3c: the attention mechanism is displacement-invariant, so position encoding information must be added to help distinguish different matching positions; the position encoding is used in relative form, from which the absolute position encoding can be recovered, and both its horizontal and vertical dimensions equal the image width. The attention weight is then expressed in terms of the features of the left and right images, the weights associated with the query, the key feature and the key position, the relative position encoding indexed by the horizontal coordinate difference between key and query, and a vector learned separately for each head.
  Step 3d: because the same object must appear in the right image to the left of its position in the left image, i.e. its abscissas satisfy $x_R \le x_L$, the attention mechanism only needs to be computed for positions with $x_R \le x_L$; weights that do not meet this constraint are set to None and take no part in subsequent calculation.
  Step 3e: multiply the attention weight carrying the position encoding by the corresponding value, concatenate the results of all heads, and feed them into a linear mapping layer to obtain the final output:
  $$O = W_O \,\mathrm{Concat}(\alpha_1 V_1, \ldots, \alpha_{N_h} V_{N_h}) + b_O$$
  where the multiplication is matrix multiplication, $O$ is the matching output, $W_O$ and $b_O$ are the learned parameters of the output layer, and $V_h$ is the value of the $h$-th head.
  Step 3f: concatenate the left-image features, the right-image features and the output of the double-cross attention cost calculation, pass the result through a convolution layer, and use it as the input of cost aggregation, denoted $C$.
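As a rough illustration of steps 3a, 3b, 3d and 3e of claim 6, the sketch below runs single-head attention along one image row in plain Python. It is a minimal sketch under stated assumptions, not the patented implementation: the function names (`linear`, `masked_softmax`, `cross_attention_row`), the bias-free layers, the softmax normalisation with scaling by the square root of the channel count, and the toy one-hot features are all illustrative choices. Position encoding, multiple heads and the output mapping layer are left out, and masked weights are set to 0 here rather than None.

```python
import math


def linear(x, weight):
    """Bias-free fully connected layer (assumption); weight is [out][in]."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def masked_softmax(scores, valid):
    """Softmax over entries whose mask is True; masked entries get weight 0."""
    kept = [s for s, ok in zip(scores, valid) if ok]
    if not kept:
        return [0.0] * len(scores)
    peak = max(kept)
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, valid)]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_row(left_row, right_row, w_q, w_k, w_v):
    """Single-head attention along one epipolar line (one image row).

    Query comes from right-view features, key from left-view features,
    value from the channel-wise concatenation of both.
    """
    queries = [linear(f, w_q) for f in right_row]
    keys = [linear(f, w_k) for f in left_row]
    values = [linear(lf + rf, w_v) for lf, rf in zip(left_row, right_row)]
    scale = math.sqrt(len(keys[0]))
    weights, outputs = [], []
    for x_r, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        # Constraint of step 3d: only keys with x_R <= x_L are considered.
        valid = [x_r <= x_l for x_l in range(len(keys))]
        alpha = masked_softmax(scores, valid)
        weights.append(alpha)
        outputs.append([sum(a * v[c] for a, v in zip(alpha, values))
                        for c in range(len(values[0]))])
    return weights, outputs


# Toy data (hypothetical): the right row is the left row shifted by 1 pixel.
width = 4
left_row = [[10.0 if c == x else 0.0 for c in range(width)] for x in range(width)]
right_row = [left_row[x + 1] if x + 1 < width else [0.0] * width
             for x in range(width)]
identity = [[1.0 if c == r else 0.0 for c in range(width)] for r in range(width)]
# Value layer that keeps only the left-view half of the concatenated input.
keep_left = [[1.0 if c == r else 0.0 for c in range(2 * width)]
             for r in range(width)]
weights, outputs = cross_attention_row(left_row, right_row,
                                       identity, identity, keep_left)
```

In this toy row each query at x_R puts nearly all of its weight on the key at x_L = x_R + 1, so the implied disparity is 1 pixel, and keys with x_L < x_R receive no weight at all. No disparity range has to be fixed in advance, because the candidates are simply the valid positions on the row.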
Description
High-precision stereo matching method based on double-crossover attention mechanism
Technical Field
The invention belongs to the technical field of image processing, and in particular relates to a depth information acquisition method and an optimization strategy for an adaptive disparity range, which can be used for efficient depth information acquisition from binocular input.
Background
In stereo vision, given the disparity of a pixel, its depth z can be solved by z = bf/d, where b is the baseline length of the binocular camera system, f is the focal length of the camera system, and d is the disparity of that pixel. The depth estimation task can therefore be converted into a disparity estimation task for the corresponding pixels. Given a pair of rectified binocular images and no occlusion, a point in the real scene is projected into the images of both the left and the right viewpoint, and the difference between the horizontal coordinates of the corresponding pixels in the two viewpoints is called the "disparity" (parallax).
Disparity estimation is typically broken down into four steps: 1) feature extraction, 2) cost calculation, 3) cost aggregation, and 4) disparity estimation. In conventional algorithms these four steps are performed sequentially, and the technique used in each step can generally be chosen independently. In recent years, with the development of deep learning, each step has gradually been taken over by neural networks, the independence between the steps has weakened, and disparity estimation is increasingly carried out by a single end-to-end network. In these networks, cost calculation is typically done in one of two ways: feature concatenation or convolution computation.
In the former approach (feature concatenation), the inputs from the left and right viewpoints pass through the same feature extraction structure with shared parameters, and the resulting features are concatenated along the channel dimension and fed directly into the subsequent cost aggregation and disparity estimation network. This approach typically relies on 3D convolution in the cost aggregation part and therefore has a high computational cost. Moreover, a cost matrix built by direct concatenation is less effective than one obtained by convolution computation.
The cost calculation based on convolution computation measures the similarity of the left- and right-view features by dot products within a given disparity range, and places the results for different point pairs on different channels. The number of channels of the cost matrix in this method usually equals the manually set disparity range; for most datasets this range is set to 192 pixels. Although this correlation-based method yields a better cost matrix than direct concatenation, it must be applied with a fixed disparity range. When the disparity range changes, the network has to be retrained, and the training time grows markedly as the disparity range increases. Furthermore, in application scenarios where the disparity range varies widely, performance degrades severely, or the method fails altogether, unless the network is retrained for the specific range. In autonomous driving, for example, the algorithm must accurately identify both distant and nearby targets whose disparities differ enormously; this is very difficult for conventional algorithms with a fixed disparity range, and their performance drops noticeably as the disparity range increases.
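The fixed-range limitation described above can be made concrete with a small correlation cost volume. The following is a minimal pure-Python sketch, not code from the patent: `correlation_cost_volume` and `best_disparity` are hypothetical helper names, the features are toy one-hot vectors, and candidates that fall outside the image are simply given a cost of 0 (an assumption). It shows that the volume has one channel per candidate disparity, so a true disparity outside the preset range cannot be represented.

```python
def correlation_cost_volume(left, right, max_disp):
    """Dot-product cost volume over a fixed disparity range [0, max_disp).

    left, right: feature maps as nested lists indexed [y][x][channel].
    Returns volume[d][y][x] = dot(left[y][x], right[y][x - d]).
    """
    height, width = len(left), len(left[0])
    volume = []
    for d in range(max_disp):
        plane = []
        for y in range(height):
            row = []
            for x in range(width):
                if x - d < 0:
                    row.append(0.0)  # candidate lies outside the right image
                else:
                    row.append(sum(a * b for a, b in
                                   zip(left[y][x], right[y][x - d])))
            plane.append(row)
        volume.append(plane)
    return volume


def best_disparity(volume, y, x):
    """Index of the channel with the highest similarity at pixel (y, x)."""
    return max(range(len(volume)), key=lambda d: volume[d][y][x])


# Toy data (hypothetical): one image row, true disparity of 2 pixels,
# i.e. the point at x in the left view appears at x - 2 in the right view.
width = 6
left = [[[1.0 if c == x else 0.0 for c in range(width)] for x in range(width)]]
right = [[left[0][x + 2] if x + 2 < width else [0.0] * width
          for x in range(width)]]

wide = correlation_cost_volume(left, right, max_disp=4)    # range covers d = 2
narrow = correlation_cost_volume(left, right, max_disp=2)  # range too small
```

With a range of 4 the true shift of 2 pixels is found at x = 4 via `best_disparity(wide, 0, 4)`; with a range of 2 every candidate at that pixel scores 0, because the matching channel does not exist in the volume.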
In addition to these two disparity estimation structures based on convolutional neural networks, there is also a network structure based on an attention mechanism. It treats binocular stereo matching as a sequence computation task similar to natural language processing, and completes cost calculation and cost aggregation together by alternately applying cross-attention and self-attention. Although this method also adapts to the disparity range, its computational cost is very high and its results are not as good as those of mainstream stereo matching algorithms based on convolutional neural networks.
Disclosure of Invention
The invention aims to overcome the limits that a variable disparity range places on the applicability and performance of such algorithms, and designs an efficient stereo matching method based on a double-cross attention mechanism for high-precision disparity estimation under a variable disparity range. To achieve this purpose, the invention adopts the following technical scheme: the efficient stereo matching method based on the double-cross attention mechanism is characterized in that the method