CN-119048568-B - Three-dimensional matching system and method based on wavelet transformation and lightweight large-kernel convolution


Abstract

The invention provides a stereo matching system and method based on wavelet transformation and lightweight large-kernel convolution. The system comprises a data acquisition unit and a stereo matching unit based on wavelet transformation and lightweight large-kernel convolution. The method comprises acquiring images of the scene to be detected, wavelet processing, context information extraction, feature extraction with lightweight large-kernel convolution, construction of a cost body and a combined encoding body, and calculation of the final parallax. It addresses matching precision, matching in sharp image areas and generalization in the stereo matching field. The wavelet transformation extracts multi-frequency image information; the lightweight large-kernel convolution enlarges the receptive field, captures long-distance dependence and extracts multi-frequency global feature information to improve matching precision, while its lightweight design reduces the large computation and parameter counts that large-kernel convolution would otherwise bring. Experimental results show that the method improves the precision and generalization of stereo matching.

Inventors

XUE YANBING
DUAN JIALONG
CAI JING
WANG ZHIGANG
GAO ZAN
WEN XIANBIN

Assignees

天津理工大学

Dates

Publication Date: 20260512
Application Date: 20240722

Claims (10)

1. A stereo matching system based on wavelet transformation and lightweight large-kernel convolution, characterized by comprising a data acquisition unit and a stereo matching unit based on wavelet transformation and lightweight large-kernel convolution, wherein the stereo matching unit consists of a wavelet transformation processing module, a context extraction module, a lightweight large-kernel convolution feature extraction module, a construction cost body module, a construction combined encoding body module and an iteration updating module; the data acquisition unit is used for acquiring left and right color images of a scene to be detected and a real parallax image of the left color image; the output end of the data acquisition unit is connected with the input ends of the context extraction module and the wavelet transformation processing module respectively; the output end of the context extraction module is connected with the input end of the iteration updating module; the output end of the wavelet transformation processing module is connected with the input end of the lightweight large-kernel convolution feature extraction module; the output end of the lightweight large-kernel convolution feature extraction module is connected with the input ends of the construction cost body module and the construction combined encoding body module respectively; the output end of the construction cost body module is connected with the input ends of the construction combined encoding body module and the iteration updating module respectively; the output end of the construction combined encoding body module is connected with the input end of the iteration updating module; and the iteration updating module outputs the final parallax image, from which a loss is calculated against the real parallax image.
2. The stereo matching system based on wavelet transformation and lightweight large-kernel convolution according to claim 1, wherein the construction cost body module, the construction combined encoding body module and the iteration updating module adopt an IGEV network structure.
3. A stereo matching method based on wavelet transformation and lightweight large-kernel convolution, characterized by comprising the following steps: (1) the data acquisition unit acquires left and right color images of a scene to be detected and a real parallax image of the left image; the left and right color images are input into the wavelet transformation processing module, and the left color image is input into the context extraction module; (2) the wavelet transformation processing module performs wavelet function processing on the left and right color images to obtain a 12-channel image set for each of the left and right images; (3) the context extraction module extracts context information from the left color image to obtain context information maps with scales of H/4 x W/4 x 64, H/4 x W/4 x 128, H/8 x W/8 x 128 and H/16 x W/16 x 128; (4) the lightweight large-kernel convolution module performs feature extraction on the left and right 12-channel image sets obtained in step (2) to obtain multi-scale feature maps $f_{l,i}$ and $f_{r,i}$, wherein l denotes the left-image feature map, r denotes the right-image feature map, i denotes the stage of image data processing, and $C_i$ denotes the number of channels output by the i-th stage; (5) the multi-scale feature maps $f_{l,i}$ and $f_{r,i}$ obtained in step (4) are input into the construction cost body module and the construction combined encoding body module to obtain a cost body and a combined encoding body respectively, and the multi-scale context information maps obtained in step (3), the cost body and the combined encoding body are input into the iteration updating module to obtain the final parallax, wherein the construction cost body module, the construction combined encoding body module and the iteration updating module adopt an IGEV network structure.
4. The stereo matching method based on wavelet transformation and lightweight large-kernel convolution according to claim 3, wherein the wavelet function processing of the left and right color images by the wavelet transformation processing module in step (2) is specifically as follows: the left and right color images received at the input end of the wavelet transformation processing module are each divided into the three channels R, G and B; each channel is processed with the wavelet function dwt2 to obtain image information of four frequencies, namely LL, LH, HL and HH, wherein LL is the approximate image information of the low-frequency part and LH, HL and HH are the detail image information of the high-frequency part; the LL, LH, HL and HH image information generated by the R channel, the G channel and the B channel of the left color image is spliced together along the channel dimension using the Concat function to obtain a 12-channel left image set; the LL, LH, HL and HH image information generated by the R channel, the G channel and the B channel of the right color image is spliced together in the same way to obtain a 12-channel right image set; the multi-frequency image set is obtained as shown in formula (1): $p = \mathrm{Concat}\big(\mathrm{dwt2}(x_i)\big) = \mathrm{Concat}(LL_i, LH_i, HL_i, HH_i),\ i \in \{R, G, B\}$ (1), where x is the RGB input image, i is a channel of the RGB image, dwt2 is the wavelet function, LL, LH, HL and HH are the image information of the four frequencies, Concat is the splicing operation, and p is the image set.
5. The stereo matching method based on wavelet transformation and lightweight large-kernel convolution according to claim 4, wherein each of the four frequency images obtained from each of the R, G and B channels has a resolution of H/2 x W/2 and 1 channel, where H and W are respectively the height and width of the left and right color images received at the input end of the wavelet transformation processing module.
6. The stereo matching method based on wavelet transformation and lightweight large-kernel convolution according to claim 3, wherein the context extraction module in step (3) consists of 1 convolution layer, 1 normalization layer, 3 residual layers and 2 down-sampling layers; the 3 residual layers I, II and III each consist of 2 residual blocks I, 6 residual blocks I in total; the down-sampling layers I and II each consist of 1 residual block I and 1 residual block II; the residual block I consists of 2 convolutions with a kernel size of 3x3 and a stride of 1 and two ReLU functions; the residual block II consists of 1 convolution with a kernel size of 3x3 and a stride of 2, 1 convolution with a kernel size of 3x3 and a stride of 1, and two ReLU functions; the convolution layer of the context extraction module has a kernel size of 7x7, a stride of 2, 3 input channels and 64 output channels; the residual layer I has 64 input channels, and the residual layer III, the down-sampling layer I and the down-sampling layer II each have 128 output channels.
7. The stereo matching method based on wavelet transformation and lightweight large-kernel convolution according to claim 3, wherein the extraction of context information in step (3) specifically comprises the following steps: (3-1) the left color image obtained in step (1) is passed through the convolution layer and then the normalization layer of the context extraction module to obtain context information map I; (3-2) the context information map I obtained in step (3-1) is passed through residual layer I, residual layer II and residual layer III to obtain context information map II; (3-3) the context information map II obtained in step (3-2) is passed through down-sampling layer I to obtain context information map III; (3-4) the context information map III obtained in step (3-3) is passed through down-sampling layer II to obtain context information map IV.
8. The stereo matching method based on wavelet transformation and lightweight large-kernel convolution according to claim 7, wherein the scales of the context information map I, the context information map II, the context information map III and the context information map IV obtained in step (3) are H/4 x W/4 x 64, H/4 x W/4 x 128, H/8 x W/8 x 128 and H/16 x W/16 x 128, respectively.
9. The stereo matching method based on wavelet transformation and lightweight large-kernel convolution according to claim 3, wherein in step (4) the lightweight large-kernel convolution module performs four stages of image data processing on the left and right 12-channel image sets obtained in step (2); the number $N_i$ of lightweight large-kernel basic modules used in the four stages is 3, 5 and 2 in turn, and the number $C_i$ of output channels is 24, 32, 96 and 160, $i \in \{1,2,3,4\}$; the feature map obtained from the 12-channel image set after the first stage has a scale of H/4 x W/4 x $C_1$, after the second stage H/8 x W/8 x $C_2$, after the third stage H/16 x W/16 x $C_3$, and after the fourth stage H/32 x W/32 x $C_4$, so that multi-scale feature maps $f_{l,i}$ and $f_{r,i}$ are finally obtained, wherein l denotes the left-image feature map and r denotes the right-image feature map; the image data processing of each of the four stages comprises the following steps: (4-1) each channel image of the left and right 12-channel image sets obtained in step (2) is divided into N patches of size 7x7, where N = H/14 x W/14; the patches are down-sampled with a convolution with a kernel size of 3 and a stride of 2 and then normalized to obtain an output x, as shown in formula (2): $x = \mathrm{Norm}(\mathrm{OverlapPatch}(p))$ (2), wherein Norm is the normalization processing, OverlapPatch is the image segmentation and down-sampling processing, and p is the image set of step (2); (4-2) the output x obtained in step (4-1) is passed through $N_i$ lightweight large-kernel basic modules, in which the large-kernel convolution is decomposed into 4 convolutions, namely a 1x7 convolution, a 7x1 convolution, a 3x3 convolution and a dilated convolution with a kernel size of 7 and a dilation rate of 3; after normalization, the output x obtained in step (4-1) is divided evenly into 4 groups by channel, and the 4 convolutions are applied respectively, namely group 1 undergoes the 1x7 convolution, group 2 the 7x1 convolution, group 3 the 3x3 convolution and group 4 the dilated convolution with a kernel size of 7 and a dilation rate of 3; the 4 groups of convolved data are spliced along the channel dimension, and the spliced data is added element-wise to the output x obtained in step (4-1) to obtain data y; (4-3) the output z of step (4-2) is normalized to obtain the feature map information f, as shown in formula (8): $f = \mathrm{Norm}(z)$ (8).
10. The stereo matching method based on wavelet transformation and lightweight large-kernel convolution according to claim 9, wherein the processing by the lightweight large-kernel basic module in step (4-2) comprises the following steps: (4-2-1) the output x obtained in step (4-1) is normalized to obtain an output $\hat{x}$, as shown in formula (3): $\hat{x} = \mathrm{Norm}(x)$ (3); (4-2-2) the output obtained in step (4-2-1) is divided evenly into 4 groups by channel, and the 4 groups are processed respectively with the 1 x 7, 7 x 1 and 3 x 3 convolutions and the dilated convolution with a kernel size of 7 and a dilation rate of 3 to obtain outputs $x'_h, x'_w, x'_{hw}, x'_{dl}$, as shown in formula (4): $x_h, x_w, x_{hw}, x_{dl} = \mathrm{Split}(\hat{x})$; $x'_h = \mathrm{DWConv}_{7\times 1}(x_h)$, $x'_w = \mathrm{DWConv}_{1\times 7}(x_w)$, $x'_{hw} = \mathrm{DWConv}_{3\times 3}(x_{hw})$, $x'_{dl} = \mathrm{DLConv}_{k=7,d=3}(x_{dl})$ (4), where Split divides evenly by channel, $x_h$, $x_w$, $x_{hw}$ and $x_{dl}$ are the 4 groups of data divided evenly by channel, $\mathrm{DWConv}_{7\times 1}$ is a 7 x 1 depthwise convolution, $\mathrm{DWConv}_{1\times 7}$ is a 1 x 7 depthwise convolution, $\mathrm{DWConv}_{3\times 3}$ is a 3 x 3 depthwise convolution, $\mathrm{DLConv}_{k=7,d=3}$ is a dilated convolution with a kernel size of 7 and a dilation rate of 3, and $x'_h$, $x'_w$, $x'_{hw}$ and $x'_{dl}$ are the 4 groups of data after the 4 different convolution processes; (4-2-3) the 4 groups of data obtained in step (4-2-2) after the 4 different convolution processes are combined, that is, spliced along the channel dimension, and added element-wise to the output x of step (4-1) to obtain an output y, as shown in formula (5): $y = \mathrm{Concat}(x'_h, x'_w, x'_{hw}, x'_{dl}) + x$ (5), where Concat is the splicing operation along the channel dimension; (4-2-4) the output y of step (4-2-3) is passed through normalization and an MLP module to obtain an output $\hat{y}$, as shown in formula (6): $\hat{y} = \mathrm{MLP}(\mathrm{Norm}(y))$ (6); (4-2-5) the output $\hat{y}$ of step (4-2-4) is added element-wise to the output y of step (4-2-3) to obtain an output z, as shown in formula (7): $z = \hat{y} + y$ (7).
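
As a minimal sketch of the wavelet step in claims 4 and 5, the following Python splits each of the R, G and B channels into LL, LH, HL and HH sub-bands and concatenates them into a 12-channel image set. It assumes a single-level Haar wavelet written out by hand, since the claims only name the function dwt2 and do not state which wavelet is used. The names haar_dwt2 and wavelet_image_set are hypothetical, and the LH/HL labelling follows one common convention that other libraries may swap.

```python
# Sketch only: single-level 2D Haar transform on nested lists.
# Assumption: Haar wavelet, even H and W; the patent does not specify either.

def haar_dwt2(channel):
    """Split one H x W channel into LL, LH, HL, HH sub-bands of size H/2 x W/2."""
    h, w = len(channel), len(channel[0])
    ll, lh, hl, hh = [], [], [], []
    for r in range(0, h, 2):
        row_ll, row_lh, row_hl, row_hh = [], [], [], []
        for c in range(0, w, 2):
            a, b = channel[r][c], channel[r][c + 1]          # top pair of the 2x2 cell
            d, e = channel[r + 1][c], channel[r + 1][c + 1]  # bottom pair
            row_ll.append((a + b + d + e) / 2)  # low-pass in both directions
            row_lh.append((a + b - d - e) / 2)  # difference between the two rows
            row_hl.append((a - b + d - e) / 2)  # difference between the two columns
            row_hh.append((a - b - d + e) / 2)  # diagonal difference
        ll.append(row_ll)
        lh.append(row_lh)
        hl.append(row_hl)
        hh.append(row_hh)
    return ll, lh, hl, hh


def wavelet_image_set(rgb):
    """Build the 12-channel set p = Concat(dwt2(x_i)) for i in R, G, B.

    rgb is a dict with keys "R", "G", "B", each an H x W nested list.
    """
    p = []
    for name in ("R", "G", "B"):
        p.extend(haar_dwt2(rgb[name]))  # 4 sub-bands per color channel
    return p


# Toy 4 x 6 image: every color channel holds the same ramp of values.
ramp = [[float(r * 6 + c) for c in range(6)] for r in range(4)]
left_set = wavelet_image_set({"R": ramp, "G": ramp, "B": ramp})
print(len(left_set), len(left_set[0]), len(left_set[0][0]))
```

For an H x W input the sketch yields 12 single-channel images of size H/2 x W/2, which matches the resolution stated in claim 5; the same function would be applied to the right image to obtain the right image set.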

Description

Three-dimensional matching system and method based on wavelet transformation and lightweight large-kernel convolution

[ Field of technology ]

The invention relates to the field of computer vision, in particular to a stereo matching system and method based on wavelet transformation and lightweight large-kernel convolution.

[ Background Art ]

In recent years, convolutional neural networks have been successfully applied to tasks such as target detection and image classification owing to their strong feature representation and function fitting capability. Stereo matching algorithms based on deep learning are mainly divided into end-to-end and non-end-to-end algorithms. Earlier stereo matching algorithms were mainly non-end-to-end: image features are acquired to calculate the matching cost of the stereo matching stage, and the results of these algorithms on the KITTI benchmark are superior to those of traditional algorithms. Applying a convolutional neural network to part of the stereo matching steps improves stereo matching performance, but these algorithms still require some post-processing steps. Generally, a non-end-to-end binocular stereo matching algorithm comprises a deep learning processing module and a traditional algorithm processing module. Its performance is better than that of traditional algorithms, but because it still contains a number of manually defined functions, the whole pipeline is difficult to optimize jointly. An end-to-end stereo matching algorithm has strong adaptive capability because its parameters can be learned, and the parallax images it generates have higher precision.

Constructing an end-to-end stereo matching algorithm eliminates the post-processing step of stereo matching, and using 3D convolution to optimize the matching cost further improves the precision of the algorithm. Later work used a spatial pyramid structure to extract context information in the feature extraction stage and a cascaded hourglass structure to optimize the cost body. In the last two years, iterative update networks have been used to refine the parallax, which has improved precision to a certain extent. However, even though the above algorithms improve matching accuracy by enhancing scene recognition capability, matching consistency is still difficult to achieve in ill-conditioned areas lacking prior information, such as non-textured areas, and both accuracy and generalization remain to be improved.

[ Invention ]

The invention aims to overcome the above technical defects and shortcomings, and provides a stereo matching system and method based on wavelet transformation and lightweight large-kernel convolution that are simple in structure and easy to implement.
The stereo matching system based on wavelet transformation and lightweight large-kernel convolution is characterized by comprising a data acquisition unit and a stereo matching unit based on wavelet transformation and lightweight large-kernel convolution, wherein the stereo matching unit consists of a wavelet transformation processing module, a context extraction module, a lightweight large-kernel convolution feature extraction module, a construction cost body module, a construction combined encoding body module and an iteration updating module. The data acquisition unit is used for acquiring left and right color images of a scene to be detected and a real parallax image of the left color image. The output end of the data acquisition unit is connected with the input ends of the context extraction module and the wavelet transformation processing module respectively; the output end of the context extraction module is connected with the input end of the iteration updating module; the output end of the wavelet transformation processing module is connected with the input end of the lightweight large-kernel convolution feature extraction module; the output end of the lightweight large-kernel convolution feature extraction module is connected with the input ends of the construction cost body module and the construction combined encoding body module respectively; the output end of the construction cost body module is connected with the input ends of the construction combined encoding body module and the iteration updating module respectively; the output end of the construction combined encoding body module is connected with the input end of the iteration updating module; and the iteration updating module outputs the final parallax image, from which a loss is calculated against the real parallax image. The construction cost body module, the construction combined encoding body module and the iteration update module adopt an IGEV (ITERATIVE GEOMETRY E