CN-122023512-A - Abnormal element pose estimation method integrating geometric priori and self-supervision learning and related equipment

CN122023512A

Abstract

The embodiments of the present application provide a method for estimating the pose of special-shaped components by fusing geometric priors with self-supervised learning, together with related equipment, belonging to the technical fields of computer vision and industrial automation. The method first constructs a geometric prior knowledge graph of the component containing symmetry, boundary and key-point topology information; acquires RGB-D images and detects depth failure regions to generate a confidence map; extracts RGB texture features and point-cloud geometric features separately and fuses them adaptively based on the confidence map; feeds the fused features and the geometric priors into a pose prediction network containing a differentiable constraint module to predict an initial pose; trains the network with a two-stage strategy combining supervised pre-training on synthetic data and self-supervised fine-tuning on real data; and finally, taking the initial pose as a starting point, incorporates the geometric priors into an iterative optimization that outputs a refined 6D pose. The method markedly improves the accuracy and robustness of pose estimation for small, weakly textured and reflective special-shaped components, and greatly reduces the dependence on labeled data.

Inventors

  • KUANG YONGCONG
  • WANG ZHIPENG
  • YAN HAOZHI

Assignees

  • South China University of Technology (华南理工大学)

Dates

Publication Date
2026-05-12
Application Date
2025-12-24

Claims (10)

  1. A pose estimation method for special-shaped components fusing geometric priors and self-supervised learning, characterized by comprising the following steps: constructing a geometric prior knowledge graph of a target component, wherein the knowledge graph comprises at least symmetry information, boundary constraint information and key-point topology information extracted from the component's CAD model; acquiring an RGB-D image containing the target component, detecting failure regions of the depth data, and generating a pixel-level depth confidence map; processing the RGB image and the point-cloud data of the RGB-D image in parallel through an RGB texture feature extraction branch and an RGB-D geometric feature extraction branch to extract multi-scale texture features and multi-scale geometric features; adaptively fusing the multi-scale texture features and the multi-scale geometric features through a cross-modal attention mechanism based on the depth confidence map to obtain fused features; inputting the fused features and the geometric prior knowledge graph into a pose prediction network comprising a differentiable geometric constraint module, predicting initial pose estimates and corresponding prediction confidences from the RGB branch and the RGB-D branch respectively, and fusing the two branches' initial pose estimates weighted by their prediction confidences to output a preliminary 6D pose; training the pose prediction network with a two-stage strategy, wherein the first stage performs supervised pre-training on labeled synthetic data and the second stage performs self-supervised fine-tuning on unlabeled real data based on photometric consistency loss and geometric consistency loss; and, taking the preliminary 6D pose as an initial value and introducing the geometric prior knowledge graph as a constraint, refining the pose through an iterative optimization algorithm and outputting the final refined 6D pose.
  2. The method of claim 1, wherein constructing the geometric prior knowledge graph of the target component comprises: performing symmetry detection on the CAD model and identifying the component's rotational symmetry axes, mirror symmetry planes and symmetry order n; extracting the bounding-box dimensions and the equation parameters of key boundary planes of the CAD model as boundary constraints; identifying and extracting the three-dimensional coordinates of feature points of preset types on the CAD model, and constructing a topological graph representing the connection relations between the feature points; and encoding the symmetry information, the boundary constraint information and the key-point topology information into a structured feature vector.
  3. The method of claim 2, wherein the symmetry detection comprises extracting the principal axes of the component by principal component analysis, and determining the type of rotational or mirror symmetry and its order n by calculating the degree of self-coincidence of the point cloud under predetermined rotation angles, or the distances between mirrored point pairs.
  4. The method of claim 1, wherein generating the pixel-level depth confidence map comprises: calculating the brightness variance and mean of local regions of the RGB image, and judging a region to be highly reflective when its variance is below a first threshold and its mean is above a second threshold; calculating the gradient magnitude of the depth image, and judging a region to be a depth discontinuity when the gradient magnitude exceeds a third threshold; combining the stereo matching cost, left-right view consistency and depth-value validity to compute a confidence score for each pixel and generate the depth confidence map; and marking regions whose confidence scores fall below a failure threshold as the depth failure mask.
  5. The method of claim 1, wherein the cross-modal attention mechanism specifically comprises: generating a spatial attention weight map from the depth confidence map, wherein the weight map characterizes the degree to which each spatial position depends on geometric features; transforming the multi-scale texture features into the three-dimensional point-cloud space through the camera projection model and spatially aligning them with the multi-scale geometric features; and performing a weighted summation of the aligned texture features and geometric features using the spatial attention weight map to achieve adaptive fusion.
  6. The method of claim 1, wherein the differentiable geometric constraint module comprises: a symmetry constraint layer for constraining the prediction space of the rotation matrix to a discrete rotation set determined by the symmetry order n; a boundary constraint layer for calculating the distance from the transformed point cloud to preset boundary planes and adding it to training as a loss term; and a key-point alignment layer for predicting observed key points, establishing correspondences with the CAD model's key points, and minimizing the corresponding-point distance loss.
  7. The method of claim 1, wherein the second-stage self-supervised fine-tuning comprises: calculating a photometric consistency loss by comparing the CAD model image rendered according to the predicted pose with the actually observed RGB image; calculating a geometric consistency loss by comparing, within the valid depth regions, the CAD model point cloud transformed according to the predicted pose with the actually observed point cloud; and performing domain-adaptation adversarial training by introducing a domain discriminator to reduce the feature-level distribution gap between synthetic and real data.
  8. The method of claim 1, wherein the iterative optimization algorithm is a weighted iterative closest point algorithm whose objective function comprises: a point-to-plane distance residual term; a symmetry constraint term derived from the symmetry information; and a boundary violation penalty term derived from the boundary constraint information; wherein the weight of each point is determined by its depth confidence and point-pair distance.
  9. An electronic device, comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the method of any one of claims 1 to 8 when executing the computer program.
  10. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 8.
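
The confidence-map construction of claim 4 can be illustrated in code. The following Python sketch uses NumPy only; the threshold values, window size, and the down-weighting factors for reflective and discontinuous regions are illustrative assumptions (the claims specify only that such thresholds exist), and the function name is hypothetical.

```python
import numpy as np

def depth_confidence_map(gray, depth, var_thresh=25.0, mean_thresh=200.0,
                         grad_thresh=0.05, fail_thresh=0.3, win=5):
    """Sketch of the pixel-level depth confidence map of claim 4.

    `gray` is the RGB image converted to grayscale in [0, 255];
    `depth` is in metres with 0 marking invalid readings. All
    thresholds are assumed values, not disclosed by the patent.
    """
    h, w = gray.shape
    pad = win // 2
    g = np.pad(gray.astype(np.float64), pad, mode='edge')
    mean = np.zeros((h, w))
    var = np.zeros((h, w))
    # Local brightness mean and variance via a sliding window
    # (naive loops for clarity; a real system would vectorize).
    for i in range(h):
        for j in range(w):
            patch = g[i:i + win, j:j + win]
            mean[i, j] = patch.mean()
            var[i, j] = patch.var()
    # High-reflection region: low variance AND high mean (claim 4, step 1).
    reflective = (var < var_thresh) & (mean > mean_thresh)
    # Depth discontinuity: large depth gradient magnitude (step 2).
    gy, gx = np.gradient(depth)
    jump = np.hypot(gx, gy) > grad_thresh
    # Combine cues into a per-pixel confidence score (step 3).
    conf = np.ones_like(depth, dtype=np.float64)
    conf[depth <= 0] = 0.0           # missing depth is fully unreliable
    conf[reflective] *= 0.2
    conf[jump] *= 0.5
    fail_mask = conf < fail_thresh   # depth failure mask (step 4)
    return conf, fail_mask
```

The stereo-matching-cost and left-right-consistency cues of the claim are omitted here since they require access to sensor internals; the sketch keeps only the image-side cues.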
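
The adaptive fusion of claim 5 reduces, at its core, to a confidence-weighted sum of spatially aligned features. The following minimal sketch assumes the texture features have already been projected into point-cloud space; using the depth confidence directly as the attention weight is an assumption (the patent would learn this weight map), and the function name is hypothetical.

```python
import numpy as np

def fuse_features(tex_feat, geo_feat, depth_conf):
    """Confidence-driven adaptive fusion (claim 5), minimal sketch.

    tex_feat, geo_feat: (N, C) per-point features, assumed already
    spatially aligned via the camera projection model.
    depth_conf: (N,) confidence in [0, 1]. Where depth is reliable
    the geometric branch dominates; where it fails, the texture
    branch takes over.
    """
    w = depth_conf[:, None]                 # spatial attention weight
    return w * geo_feat + (1.0 - w) * tex_feat
```

At zero confidence the output is purely texture-driven, which is exactly the failure-aware behaviour the claims describe for reflective regions.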
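
The symmetry constraint layer of claim 6 restricts rotations to a discrete set determined by the symmetry order n. A non-differentiable but easy-to-follow sketch: for a component with order-n rotational symmetry about the z-axis (the axis choice is an assumption), all rotations that differ by a multiple of 2π/n about that axis are equivalent, so the predicted rotation can be snapped to a canonical representative. Function names are hypothetical.

```python
import numpy as np

def rotz(theta):
    """Rotation by `theta` radians about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def canonicalize_rotation(R, n):
    """Collapse R onto the discrete symmetry set of claim 6.

    For order-n rotational symmetry about z, every R @ rotz(2*pi*k/n)
    yields the same object appearance; pick the representative closest
    to the identity in Frobenius norm.
    """
    candidates = [R @ rotz(2.0 * np.pi * k / n) for k in range(n)]
    return min(candidates, key=lambda M: np.linalg.norm(M - np.eye(3)))
```

In the patent this constraint is stated to be differentiable (so it can sit inside the prediction network); the hard `min` above would be replaced by a soft selection in training.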
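
The weighted-ICP objective of claim 8 combines confidence-weighted point-to-plane residuals with a boundary violation penalty. The sketch below evaluates that objective for a given pose; the symmetry term is omitted for brevity, the plane convention (interior where n·p + d ≤ 0) and the weight `lam_bnd` are assumptions, and the function name is hypothetical.

```python
import numpy as np

def wicp_objective(R, t, src_pts, tgt_pts, tgt_normals, conf,
                   boundary_planes=(), lam_bnd=1.0):
    """Evaluate the weighted ICP objective of claim 8 (sketch).

    R, t: candidate rotation (3x3) and translation (3,).
    src_pts: (N, 3) model points; tgt_pts, tgt_normals: (N, 3) matched
    observed points and their normals; conf: (N,) depth confidences.
    boundary_planes: iterable of (normal, offset) pairs from the prior.
    """
    p = src_pts @ R.T + t                        # transform model points
    # Confidence-weighted point-to-plane distance residuals.
    r = np.einsum('ij,ij->i', p - tgt_pts, tgt_normals)
    cost = np.sum(conf * r ** 2)
    # Boundary violation penalty: hinge on the signed plane distance.
    for n_pl, d in boundary_planes:
        viol = np.maximum(p @ n_pl + d, 0.0)
        cost += lam_bnd * np.sum(viol ** 2)
    return cost
```

An actual refinement loop would re-establish correspondences and minimize this cost iteratively, starting from the preliminary 6D pose of claim 1.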

Description

Abnormal element pose estimation method integrating geometric priori and self-supervision learning and related equipment

Technical Field

The present application relates to the technical fields of computer vision and industrial automation, and in particular to a method and related equipment for estimating the pose of special-shaped components by fusing geometric priors and self-supervised learning.

Background

In industrial fields such as electronics manufacturing, semiconductor packaging and precision assembly, automated grasping and assembly of special-shaped electronic components (such as inductors, capacitors, connectors and relays) by robotic arms depends on high-precision, high-robustness estimation of the components' 6D poses (i.e., 3D position and 3D orientation) in three-dimensional space. Such components are typically small (on the millimeter scale, e.g., 3-8 mm), geometrically irregular, weakly textured or reflective, and often symmetric, all of which pose serious challenges to conventional visual positioning techniques. Existing pose estimation methods fall mainly into the following categories, each with obvious shortcomings:

1) Deep learning methods based on RGB images, such as PoseCNN and YOLO6D, regress the object pose directly from monocular or binocular RGB images. These methods rely heavily on surface texture features; for weakly textured or reflective components, feature extraction is difficult and accuracy drops sharply. Moreover, small objects occupy few pixels in the image, making it hard to obtain enough effective features, which further limits performance.

2) Deep learning methods based on RGB-D data, such as DenseFusion and FFB6D, fuse the texture information of the RGB image with the geometric information of the depth point cloud. However, such methods often employ simple feature concatenation or late-fusion strategies that fail to adequately account for the differing reliability of the two modalities across scenarios. On reflective surfaces in particular, depth sensors based on structured light or binocular stereo are extremely prone to failure, producing large holes or noise; if the depth features are still given high weight, pose estimation can be seriously misled.

3) Methods based on conventional geometric registration, such as the Iterative Closest Point (ICP) algorithm and feature-based PnP algorithms. These require an initial pose very close to the true value, otherwise they easily fall into local optima. They also place high demands on the geometric characteristics (such as curvature and normal direction) of the object surface, are hard to match on weakly textured, reflective or symmetric components, and lack robustness.

4) General supervised learning methods: most deep learning approaches require large amounts of real data with accurate 6D pose labels for training. Such labeling is extremely costly and cumbersome, and models easily overfit to specific training data, generalizing poorly to new components or new environments (domains).
Disclosure of Invention

The embodiments of the present application mainly aim to provide a method, electronic equipment, storage medium and program product for estimating the pose of special-shaped components by fusing geometric prior knowledge and self-supervised learning. By constructing and embedding geometric prior knowledge of the components, designing a failure-aware adaptive multi-modal fusion mechanism, and combining a two-stage training strategy (supervised pre-training on synthetic data and self-supervised fine-tuning on real data), the method achieves high-precision, high-robustness 6D pose estimation of small, weakly textured, reflective and symmetric special-shaped components at low labeling cost. To achieve the above object, one aspect of the embodiments of the present application provides a method for estimating the pose of a special-shaped component by fusing geometric priors and self-supervised learning, the method comprising: constructing a geometric prior knowledge graph of a target component, wherein the knowledge graph comprises at least symmetry information, boundary constraint information and key-point topology information extracted from the component's CAD model; acquiring an RGB-D image containing the target component, detecting failure regions of the depth data, and generating a pixel-level depth confidence map; processing the RGB image and the point-cloud data of the RGB-D image in parallel through an RGB texture feature extraction branch and an RGB-D geometric feature extraction branch to extract multi-scale texture features and multi-scale geometric features; based on the depth confidence map, the multi-scale texture features and the multi-scale geome