CN-122023520-A - Sparse texture target binocular vision detection and space positioning method embedded with geometric constraint
Abstract
A binocular vision detection and spatial positioning method for sparse-texture targets with embedded geometric constraints, comprising a geometry-constrained binocular visual detection method and a spatial positioning method based on detection-region point clouds and geometric fitting. The method addresses the recognition and positioning of sparse-texture targets in industrial scenes with complex backgrounds and multiple interference factors: the target is first detected and its region determined by a visual detection network model with embedded geometric constraints, and the target pose is then solved by a spatial positioning method that builds a point cloud from the disparity of the ROI region and fits geometric features.
Inventors
- JIN XIA
- CHEN XIAOHU
- WU YUCHEN
- WANG MIN
- ZHANG DELI
- DU TIANYU
- OUYANG YAN
- FANG CHUNYANG
- LIU MENG
- HE JIANCHAO
Assignees
- Nanjing University of Aeronautics and Astronautics (南京航空航天大学)
Dates
- Publication Date
- 20260512
- Application Date
- 20260116
Claims (10)
- 1. A sparse-texture target binocular vision detection and spatial positioning method with embedded geometric constraints, characterized by comprising a binocular visual detection method with embedded geometric constraints and a spatial positioning method based on detection-region point clouds and geometric fitting, wherein the visual detection method with embedded geometric constraints comprises the following steps: firstly, performing intrinsic calibration and system extrinsic calibration on a binocular camera, and synchronously acquiring image data of a sparse-texture target with the left and right cameras; secondly, constructing an imaging degradation synthesis framework that simulates rain occlusion, snow occlusion, fog-induced contrast attenuation and overexposure caused by strong light, and applying a parameterizable degradation model with graded intensity to the data set so as to augment the training set and improve the robustness and generalization of the detection model in complex environments; thirdly, constructing a shared-weight dual-branch target detection network integrating an epipolar feature fusion module, comprising left and right feature extraction branches with identical structure and shared parameters, wherein the left branch serves as the main detection branch that outputs the target detection result, and the right branch serves as an auxiliary constraint branch that provides cross-view geometric consistency information; fourthly, introducing a joint loss function with geometric constraints into the training of the detection network, the loss function comprising not only the target detection losses of the left and right branches but also a geometric constraint loss based on the detection boxes, wherein the geometric constraint loss comprises at least an epipolar constraint loss and a scale consistency loss, used to constrain the spatial correspondence and scale stability of the same target across the left and right views, effectively reducing false associations caused by background interference, similar targets and occlusion, and making training of the detection model more controllable and easier to converge; and fifthly, inputting the epipolar-rectified left and right view images into the trained detection network model and outputting the detection boxes of the sparse-texture targets in both views;
the spatial positioning method based on the detection-region point cloud and geometric fitting comprises the following steps: firstly, after target detection in the left and right views is completed, extracting the corresponding ROI (region of interest) according to the detection boxes in the two views and synchronously cropping the left and right view images; secondly, combining the disparity result with the updated camera parameters to perform depth computation and three-dimensional reconstruction on the cropped, epipolar-rectified dual-view ROI, generating an initial three-dimensional point cloud of the target; thirdly, performing outlier removal, noise filtering and smoothing on the initial point cloud, and up-sampling sparse regions to obtain an optimized point cloud with complete structure and uniform distribution; fourthly, on the basis of the optimized point cloud, performing coarse point-cloud extraction according to the geometric prior of the target, constructing a geometric hypothesis model by RANSAC iterative sampling, and outputting the geometric feature parameters and three-dimensional spatial attitude of the target; fifthly, solving the transformation matrix of the target pose from the camera coordinate system to the robot-arm base coordinate system based on the calibration result, achieving a unified expression of target positioning across coordinate systems; and sixthly, using the transformation matrix to convert the three-dimensional spatial attitude of the target, solved in the camera coordinate system, into the robot-arm base coordinate system, obtaining target pose information directly usable for grasping by the robot arm and thereby achieving engineering-oriented, accurate three-dimensional positioning.
- 2. The method of claim 1, wherein the imaging degradation synthesis framework covers weather and lighting effects including rain, snow, fog and strong-light glare.
- 3. The method of claim 1, wherein the rain occlusion effect is simulated with double-layer motion-blur noise that generates fine far-range rain streaks and coarse near-range rain streaks respectively, which are alpha-blended onto the background after illumination adjustment; the snow occlusion effect is simulated by iterating high-frequency and low-frequency noise to generate snowflake particles of different scales, fused naturally in screen blending mode; and the fog effect is simulated with an S-shaped curve modeling the vertical concentration gradient, with linear interpolation synthesis performed after the noise is superimposed.
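The fog-style degradation in claim 3 can be illustrated with a minimal NumPy sketch. This is not the patented implementation; the function name, parameters and the logistic form of the S-curve are illustrative assumptions, assuming a grayscale image with values in [0, 1].

```python
import numpy as np

def add_fog(image, fog_strength=0.6, steepness=8.0, midpoint=0.5):
    """Blend a white fog layer into a grayscale image in [0, 1].

    An S-shaped (logistic) curve over the vertical axis models the
    concentration gradient (denser fog towards the top of the frame),
    and the fog is combined with the image by linear interpolation.
    All parameter names here are illustrative, not from the patent.
    """
    h, w = image.shape
    # Vertical coordinate normalised to [0, 1], row 0 at the top.
    y = np.linspace(0.0, 1.0, h)[:, None]
    # Logistic S-curve: near 1 at the top, near 0 at the bottom.
    concentration = 1.0 / (1.0 + np.exp(steepness * (y - midpoint)))
    alpha = fog_strength * concentration            # per-row blend weight
    fog_layer = np.ones_like(image)                 # plain white fog
    # Linear interpolation between the image and the fog layer.
    return (1.0 - alpha) * image + alpha * fog_layer

img = np.full((4, 4), 0.2)
fogged = add_fog(img)
# Rows near the top are pulled towards white more strongly than lower rows.
```

The rain and snow layers described in the claim would be composed onto the same background with their respective blending modes (alpha blending and screen blending) before this fog pass.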
- 4. The method of claim 1, wherein constructing the shared-weight dual-branch target detection network and integrating the epipolar feature fusion module comprises the following steps: (3.1) arranging left and right parallel feature extraction branches of identical structure and shared parameters, which perform forward computation on the left and right view images respectively to obtain left and right multi-scale features; (3.2) introducing an epipolar feature fusion module on one feature level before the features output by the main detection branch enter the detection module, the module taking as inputs the feature maps output by the main detection branch and the auxiliary constraint branch together with the epipolar-rectified binocular parameters; (3.3) determining an effective disparity interval from the camera parameters and the depth range of the target scene, discretizing it, and generating a disparity candidate list covering the effective range, the candidates representing the horizontal displacement of the same spatial target between the left and right feature maps and thereby providing the basis for aligning the auxiliary-branch features along the epipolar direction; (3.4) constructing, for each candidate disparity, a feature sampling grid along the epipolar direction and invoking a resampling operator to translate and resample the auxiliary-branch feature map, generating a group of auxiliary-view candidate feature maps under the different disparity hypotheses; (3.5) taking the main-branch feature map as reference, computing a position-wise matching score between the main-branch features and each candidate auxiliary-view feature from the previous step using cosine similarity, then applying Softmax normalization over the candidate-disparity dimension to obtain an attention weight distribution, so that candidates better matched to the main-branch features receive higher weights and poorly matched candidates are suppressed; (3.6) weighting and summing the multi-disparity candidate auxiliary-view features according to the attention weights to obtain an aggregated auxiliary-view feature representation that adaptively selects the disparity hypothesis most geometrically consistent with the main view, effectively aggregating cross-view information; and (3.7) concatenating the original main-branch features with the aggregated auxiliary-view features along the channel dimension, completing the fusion through channel compression and a fusion operator, and outputting the enhanced main-branch features, which replace the original features at the corresponding scale and are fed to the detection prediction module to complete target detection.
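Steps (3.3)–(3.6) above can be sketched in plain NumPy (a deep-learning framework would use differentiable tensors; this sketch only shows the geometry of the aggregation, and the function name and integer-shift resampling are simplifying assumptions):

```python
import numpy as np

def epipolar_attention_fuse(left_feat, right_feat, disparities):
    """Aggregate right-view features along candidate disparities.

    left_feat, right_feat: (C, H, W) feature maps from rectified views.
    disparities: integer candidate disparities in feature pixels.
    For each hypothesis, the auxiliary features are shifted along the
    epipolar (x) axis, scored by per-position cosine similarity with
    the main-branch features, softmax-normalised over the hypothesis
    dimension, and summed with those weights (steps 3.3-3.6). The
    channel-concatenation fusion of step (3.7) is omitted here.
    """
    C, H, W = left_feat.shape
    candidates, scores = [], []
    for d in disparities:
        # A left-view point (x, y) appears at (x - d, y) in the right
        # view, so aligning right->left shifts features right by d.
        shifted = np.zeros_like(right_feat)
        if d == 0:
            shifted[:] = right_feat
        else:
            shifted[:, :, d:] = right_feat[:, :, :-d]
        candidates.append(shifted)
        # Per-position cosine similarity over the channel dimension.
        num = np.sum(left_feat * shifted, axis=0)
        den = (np.linalg.norm(left_feat, axis=0) *
               np.linalg.norm(shifted, axis=0) + 1e-8)
        scores.append(num / den)
    scores = np.stack(scores)                      # (D, H, W)
    weights = np.exp(scores - scores.max(axis=0))  # softmax over D
    weights /= weights.sum(axis=0)
    candidates = np.stack(candidates)              # (D, C, H, W)
    aggregated = np.sum(weights[:, None] * candidates, axis=0)
    return aggregated, weights

# Synthetic check: the right view is the left view shifted by 3 pixels,
# so the attention should peak at disparity 3 wherever the shift is valid.
rng = np.random.default_rng(0)
left = rng.standard_normal((8, 4, 16))
right = np.zeros_like(left)
right[:, :, :-3] = left[:, :, 3:]
fused, weights = epipolar_attention_fuse(left, right, [0, 1, 2, 3, 4])
```

In the real module the shift would be a sub-pixel grid-sample resampling and the whole block would be trained end to end; the hard alignment here only demonstrates the attention mechanism.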
- 5. The method of claim 1, wherein the training process comprises the following steps: (4.1) after the left and right branches output their target detection predictions, computing the detection losses of the main detection branch and the auxiliary constraint branch respectively, the main-branch loss serving as the primary supervision term while the auxiliary-branch loss participates in training with a smaller weight, ensuring stable auxiliary-branch predictions and providing a reliable basis for the cross-view geometric constraints; (4.2) according to the epipolar geometry of the binocular camera, computing the epipolar line in the right view corresponding to the center of each main-view prediction box, and constraining the center of the matching right-view prediction box to fall near that line, thereby limiting the effective cross-view matching range to the epipolar neighborhood and reducing false matches caused by non-corresponding targets; (4.3) taking the scale of the detection boxes as the constraint basis, using the relative difference of box areas as the metric, and keeping the scales of the left and right prediction boxes consistent during training, thereby suppressing abnormal scale distortion of single-view predictions and improving cross-view geometric consistency and positioning stability; and (4.4) building the geometric constraint loss from the epipolar constraint and the scale consistency constraint, fusing it with the main-branch and auxiliary-branch detection losses through weighted coefficients to form a total loss function, and performing back-propagation on the total loss to update the parameters of the feature extraction module, the detection prediction module and the epipolar feature fusion module, achieving high-precision target detection and cross-view consistency optimization under binocular conditions.
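The geometric part of the loss in steps (4.2)–(4.4) can be illustrated as follows. This is a sketch under the assumption of epipolar-rectified views (where the epipolar line of a left-view point is simply the horizontal line with the same y coordinate); the function name, box format and weights are illustrative, not from the patent:

```python
import numpy as np

def geometric_constraint_loss(boxes_left, boxes_right,
                              w_epi=1.0, w_scale=1.0):
    """Geometric loss over matched left/right boxes (x1, y1, x2, y2).

    After epipolar rectification the epipolar constraint reduces to the
    vertical offset between matched box centres (step 4.2); the scale
    consistency term is the relative area difference (step 4.3). The
    total training loss would add the weighted detection losses of the
    two branches (step 4.4), which are outside this sketch.
    """
    bl = np.asarray(boxes_left, dtype=float)
    br = np.asarray(boxes_right, dtype=float)
    cy_l = (bl[:, 1] + bl[:, 3]) / 2.0           # box centre y, left
    cy_r = (br[:, 1] + br[:, 3]) / 2.0           # box centre y, right
    area_l = (bl[:, 2] - bl[:, 0]) * (bl[:, 3] - bl[:, 1])
    area_r = (br[:, 2] - br[:, 0]) * (br[:, 3] - br[:, 1])
    epi_loss = np.abs(cy_l - cy_r).mean()
    scale_loss = (np.abs(area_l - area_r) /
                  np.maximum(area_l, area_r)).mean()
    return w_epi * epi_loss + w_scale * scale_loss

# Identical boxes shifted only horizontally (pure disparity) incur
# zero geometric loss; a vertical offset is penalised.
loss = geometric_constraint_loss([[10, 20, 50, 60]], [[4, 20, 44, 60]])
```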
- 6. The method of claim 1, wherein the target detection network with the epipolar feature fusion module and the geometric-constraint training mechanism is deployed, as a specific implementation, on a target detection tool platform based on the YOLO architecture, and images acquired in the field are epipolar-rectified and then fed to the trained network model for detection.
- 7. The method of claim 1, wherein the spatial positioning method based on the detection-region point cloud and geometric fitting comprises the following steps: step 1, extracting the pixel coordinates of the left and right detection boxes, then generating a rectangular region that completely covers the object in both boxes, its coordinates determined by taking the minimum of the two boxes' upper-left coordinates and the maximum of their lower-right coordinates; the four sides of this region are expanded outwards by a set proportion to form the final ROI region, the left and right view images are cropped synchronously, and the cropped pair is input to the stereo matching network RAFT-Stereo for disparity computation; and step 2, combining the disparity result with the updated camera parameters to construct and optimize a three-dimensional point cloud.
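The ROI construction in step 1 amounts to a box union plus proportional padding. A minimal sketch (box format (x1, y1, x2, y2) and the padding default are illustrative assumptions):

```python
def union_roi(box_left, box_right, pad_ratio=0.1):
    """Merge left/right detection boxes (x1, y1, x2, y2) into one ROI.

    The ROI is the smallest rectangle covering both boxes, expanded
    outwards on all four sides by pad_ratio (an illustrative default);
    the same ROI is then cropped from both rectified views.
    """
    x1 = min(box_left[0], box_right[0])
    y1 = min(box_left[1], box_right[1])
    x2 = max(box_left[2], box_right[2])
    y2 = max(box_left[3], box_right[3])
    pad_x = (x2 - x1) * pad_ratio
    pad_y = (y2 - y1) * pad_ratio
    return (x1 - pad_x, y1 - pad_y, x2 + pad_x, y2 + pad_y)

roi = union_roi((10, 10, 50, 50), (20, 12, 60, 48))
```

Using one shared ROI for both views keeps the cropped pair rectified, so the subsequent RAFT-Stereo disparity remains valid after only a principal-point correction.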
- 8. The method of claim 7, wherein the three-dimensional point cloud is constructed and optimized by the following steps: (2.1) first performing epipolar rectification on the left and right views with the stereoRectify function according to the calibrated intrinsic and extrinsic parameters of the two cameras, which is equivalent to applying a rotation transformation to the camera coordinate system and generates rectified projection matrices; after rectification, the correction rotation matrix and projection matrix of the left camera are obtained, the corresponding intrinsics being implicit in the projection matrix; the target region is then cropped on the rectified image, and since cropping only shifts the origin of the pixel coordinate system, the equivalent post-crop intrinsic matrix is obtained by offsetting the principal point by the crop origin; in the three-dimensional reconstruction stage this projection matrix is combined with the cropped disparity map to complete the reconstruction; (2.2) using the updated focal length f, the baseline B and the disparity d obtained from the RAFT-Stereo stereo matching network, converting disparity to depth as Z = f·B / d; and (2.3) back-projecting each pixel coordinate (u, v) into three-dimensional space according to the camera model to obtain a local point cloud, X = (u − cx)·Z / fx, Y = (v − cy)·Z / fy, with Z as above; and step 3, after obtaining the point cloud, optimizing its quality by outlier removal, noise filtering and smoothing, and outputting a structurally complete point cloud after up-sampling.
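Steps (2.2)–(2.3) are the standard rectified-stereo relations; a small NumPy sketch (symbols fx, fy, cx, cy, baseline are the usual camera-model names, not taken verbatim from the patent, and cx, cy are assumed to already account for the ROI crop):

```python
import numpy as np

def disparity_to_pointcloud(disparity, fx, fy, cx, cy, baseline):
    """Back-project a disparity map into a camera-frame point cloud.

    Standard rectified-stereo relations:
        Z = fx * B / d
        X = (u - cx) * Z / fx,   Y = (v - cy) * Z / fy
    Pixels with non-positive disparity are treated as invalid and
    dropped, since cropping and matching can leave holes.
    """
    h, w = disparity.shape
    v, u = np.mgrid[0:h, 0:w].astype(float)
    valid = disparity > 0
    # Guard the division for invalid pixels, then zero them out.
    z = np.where(valid, fx * baseline / np.where(valid, disparity, 1.0), 0.0)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1)
    return points[valid]          # (N, 3) array of valid 3-D points

disp = np.full((2, 2), 8.0)       # constant 8-pixel disparity
pts = disparity_to_pointcloud(disp, fx=400.0, fy=400.0,
                              cx=1.0, cy=1.0, baseline=0.1)
```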
- 9. The method of claim 8, wherein optimizing the quality of the point cloud comprises the following steps: (3.1) applying a composite filtering algorithm based on statistical distribution and spatial density to remove abnormal data points that exceed a preset threshold from the original point cloud; (3.2) smoothing random noise in the point coordinates with a filtering algorithm and optimizing the spatial distribution of the point cloud with a surface smoothing technique so that it better fits the actual object surface; and (3.3) up-sampling sparse regions to fill data holes, then resampling to make the density of the whole cloud uniform, and outputting the optimized three-dimensional point cloud.
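The statistical-distribution part of step (3.1) can be sketched as a plain-NumPy statistical outlier removal (the function name, k and std_ratio defaults are illustrative; production code would use a KD-tree-backed library such as Open3D rather than a dense distance matrix):

```python
import numpy as np

def statistical_outlier_removal(points, k=8, std_ratio=2.0):
    """Remove points whose mean k-NN distance is unusually large.

    Points whose mean distance to their k nearest neighbours exceeds
    (global mean + std_ratio * global std) of that statistic are
    treated as outliers and dropped.
    """
    pts = np.asarray(points, dtype=float)
    # Dense pairwise distances (fine for small clouds only).
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Mean distance to the k nearest neighbours, excluding the point itself.
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    mean_knn = knn.mean(axis=1)
    threshold = mean_knn.mean() + std_ratio * mean_knn.std()
    return pts[mean_knn <= threshold]

# A tight cluster plus one far-away outlier: the outlier is removed.
rng = np.random.default_rng(1)
cloud = rng.normal(0.0, 0.01, size=(200, 3))
cloud = np.vstack([cloud, [[5.0, 5.0, 5.0]]])
filtered = statistical_outlier_removal(cloud, k=8)
```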
- 10. The method of claim 1, wherein the vision system includes four coordinate systems, namely the robot-arm base coordinate system, the left and right camera coordinate systems and the world coordinate system, and the method comprises the following steps: (5.1) controlling the robot arm so that its end probe points in turn at several corner points of the same checkerboard under different end poses, recording the pose of the arm end relative to the arm base at each operation; (5.2) from the pose of the end relative to the base and the pose of the probe relative to the arm end, computing the coordinates of the checkerboard corners in the arm-base coordinate system; (5.3) synchronously acquiring images containing the checkerboard calibration plate with the binocular camera, extracting the pixel coordinates of the corners in the left and right images, and computing the three-dimensional coordinates of each corner in the camera coordinate system by binocular triangulation; (5.4) establishing the correspondence between the three-dimensional point set in the camera coordinate system and that in the arm-base coordinate system and constructing a point-set constraint equation; since measurement error prevents the equation from holding exactly, a least-squares error model is constructed, and the fixed offset of the probe relative to the arm end is equivalently eliminated as a hidden variable through the joint constraint of multiple points and multiple poses; and (5.5) solving the optimal rotation matrix and translation vector with an SVD-based rigid registration algorithm, substituting the solved camera-to-base transformation matrix into the error model, computing the residuals of all points, and accepting the calibration result as valid if the errors satisfy a preset threshold.
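The SVD-based rigid registration of step (5.5) is the classical Kabsch solution; a self-contained sketch (the function name and synthetic check are illustrative):

```python
import numpy as np

def rigid_registration_svd(cam_pts, base_pts):
    """Least-squares rigid transform with base_pts ≈ R @ cam_pts + t.

    Kabsch/SVD solution: centre both point sets, build the
    cross-covariance matrix, take its SVD, and correct the sign of
    the last singular direction so R is a proper rotation (det = +1).
    """
    P = np.asarray(cam_pts, dtype=float)
    Q = np.asarray(base_pts, dtype=float)
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)            # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    S = np.eye(3)
    S[2, 2] = np.sign(np.linalg.det(Vt.T @ U.T))   # avoid reflection
    R = Vt.T @ S @ U.T
    t = cq - R @ cp
    return R, t

# Synthetic check: recover a known rotation about z and a translation.
theta = np.deg2rad(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.1, -0.2, 0.3])
rng = np.random.default_rng(2)
cam = rng.standard_normal((20, 3))
base = cam @ R_true.T + t_true
R_est, t_est = rigid_registration_svd(cam, base)
```

With real measurements the residuals `base - (cam @ R_est.T + t_est)` would be checked against the preset threshold of step (5.5) to validate the calibration.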
Description
Sparse texture target binocular vision detection and space positioning method embedded with geometric constraint

Technical Field

The invention relates to a computer vision detection and spatial positioning method, in particular to a method for recognizing and positioning sparse-texture targets in industrial scenes with complex backgrounds and multiple interference factors, and specifically to a binocular vision detection and spatial positioning method for sparse-texture targets with embedded geometric constraints.

Background

With the rapid development of industrial automation and intelligent manufacturing, target detection and three-dimensional positioning based on machine vision are widely applied in scenarios such as industrial robot grasping, assembly alignment, equipment inspection and quality inspection. Achieving stable recognition of a target and outputting spatial pose information usable by an actuator, especially in field environments with complex backgrounds and multiple interference factors, is a fundamental capability of robot perception and industrial vision systems. Many existing industrial targets have sparse textures: their surface textures are weak, their edge features are not obvious, and they are often accompanied by strong metallic reflection, strongly symmetric structures and mutual occlusion between objects, making their detection and positioning a difficult problem in the field. In the prior art, target detection for industrial vision mainly comprises traditional feature-based methods and deep-learning detection methods.
Traditional methods generally match and position based on edges, corner points, line segments or local descriptors, and depend strongly on the texture and imaging quality of the target surface; with sparse texture or obvious reflection, too few feature points, unstable matching or increased mismatching easily occur, affecting the reliability of detection and positioning results. In recent years, deep-learning target detection networks have been widely applied in industrial scenes; they directly output the class and detection box of targets in an image and adapt well to many kinds of complex backgrounds. In practice, however, single-view detection results may still suffer missed detections, false detections and unstable box position and scale under factors such as adverse weather, illumination attenuation, occlusion, blur and weak surface textures, which is especially evident for sparse-texture and strongly reflective targets. To obtain the spatial position of the target, binocular vision and three-dimensional sensing schemes such as structured light and laser are widely adopted. Binocular vision provides texture and depth information simultaneously, with relatively low hardware cost and flexible deployment, and has high application value in industrial measurement and robot positioning. Existing binocular three-dimensional positioning generally recovers scene depth through stereo matching or disparity estimation, followed by three-dimensional point-cloud reconstruction and pose computation.
Meanwhile, with the development of deep-learning stereo matching networks, dense disparity estimation has become more accurate, relieving to some extent the dependence of traditional methods on hand-crafted features and local textures. However, full-image stereo matching remains computationally heavy in industrial settings and introduces a large amount of irrelevant disparity noise in background regions; for sparse-texture, reflective or occluded areas, disparity estimation readily produces holes, mismatches and depth discontinuities, making the point cloud sparse and increasing outliers. Therefore, under industrial conditions of complex backgrounds and harsh imaging, achieving stable detection of sparse-texture targets, effectively improving cross-view consistency by fusing the binocular epipolar geometry prior, reducing the influence of background interference on disparity estimation and point-cloud reconstruction, and obtaining a three-dimensional positioning result that is structurally complete and usable for engineering execution, the meth