
CN-122023521-A - Cross-modal semantic mapping loop detection method and system


Abstract

The invention discloses a cross-modal semantic mapping loop detection method and system, belonging to the technical field of computer vision and simultaneous localization and mapping. First, geometric and semantic features are extracted from each frame image. Second, a depth-consistency projection operator screens out the key points consistent with the true depth. Then, a three-dimensional semantic map is generated by a map construction method based on neural radiance fields. Finally, a global descriptor fusing three-dimensional geometric features and semantic features is constructed, a unified similarity score is formed by confidence-weighted fusion, and retrieval and verification for loop detection are carried out. The invention reduces the dependence on three-dimensional datasets when training the algorithm model and improves the robustness and accuracy of loop detection.

Inventors

  • CHEN RUI
  • HU YAOYI
  • ZHANG HUANLONG
  • ZHAI YONGJIE
  • DU JIHONG
  • YU YUE
  • WANG TIANXIANG

Assignees

  • North China Electric Power University (Baoding)
  • Zhengzhou University of Light Industry

Dates

Publication Date
2026-05-12
Application Date
2026-01-23

Claims (7)

  1. A loop detection method of cross-modal semantic mapping, characterized by comprising the following steps: S1, respectively carrying out geometric feature extraction and semantic feature extraction on each frame image; S2, performing geometric-semantic alignment guided by depth consistency, and mapping semantic labels to three-dimensional space to obtain a three-dimensional point set with semantic labels; S3, generating a three-dimensional semantic map by a map construction method based on neural radiance fields; and S4, constructing a global descriptor fusing geometric features and semantic features based on the three-dimensional semantic map, forming a unified similarity score by confidence-weighted fusion, and carrying out retrieval and verification of loop detection.
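As an illustrative sketch of the data flow through steps S1-S4 (not the claimed implementation), the pipeline can be outlined as follows; every component below is a trivial stand-in with placeholder names, chosen only to show how features become labelled 3D points, descriptors, and loop candidates:

```python
import numpy as np

rng = np.random.default_rng(0)

def s1_extract(frame):
    geom = rng.normal(size=32)            # S1: geometric features (placeholder)
    sem = rng.integers(0, 5, size=100)    # S1: per-pixel semantic labels (placeholder)
    return geom, sem

def s2_align(geom, sem):
    # S2 stand-in: attach each label to a 3D point -> (N, 4) array [x, y, z, label]
    return np.c_[rng.normal(size=(100, 3)), sem]

def s4_descriptor(points):
    # S4 stand-in: fuse a geometric summary (centroid) with a semantic histogram
    hist = np.bincount(points[:, 3].astype(int), minlength=5)
    return np.r_[points[:, :3].mean(axis=0), hist / hist.sum()]

def detect_loops(frames, tau=0.9):
    descs, loops = [], []
    for t, f in enumerate(frames):
        d = s4_descriptor(s2_align(*s1_extract(f)))   # S3 (map update) omitted here
        for t0, d0 in enumerate(descs):
            cos = d @ d0 / (np.linalg.norm(d) * np.linalg.norm(d0))
            if cos > tau:                              # S4: retrieval by similarity
                loops.append((t0, t))
        descs.append(d)
    return loops
```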
  2. The method according to claim 1, wherein in S1, geometric feature extraction is realized through Instant-NGP multi-resolution hash encoding: on the basis of the original multi-resolution hash encoding, progressively refined MLP grids are established in the spatial domain, and the features of each level are stored in a hash table that can be queried in constant time.
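A minimal sketch of multi-resolution hash encoding in the spirit of Instant-NGP, assuming illustrative level counts, table size, feature width, and the commonly used spatial-hash primes; the per-level MLP refinement mentioned in the claim is omitted:

```python
import numpy as np

# Spatial-hash primes as used in the Instant-NGP paper (illustrative choice here).
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_grid_features(x, levels=4, base_res=16, growth=2.0,
                       table_size=2**14, feat_dim=2, seed=0):
    """Look up per-level features for 3D points x in [0, 1)^3.

    Each level has its own grid resolution; grid-vertex indices are mapped
    into a fixed-size table by a spatial hash (constant-time lookup), and
    the 8 corner features are trilinearly interpolated.
    """
    rng = np.random.default_rng(seed)
    tables = [rng.normal(0, 1e-2, (table_size, feat_dim)) for _ in range(levels)]
    feats = []
    for lvl in range(levels):
        res = int(base_res * growth ** lvl)
        xs = x * res                          # point in this level's grid units
        x0 = np.floor(xs).astype(np.uint64)   # lower corner of enclosing cell
        w = xs - x0                           # trilinear interpolation weights
        out = np.zeros((len(x), feat_dim))
        for corner in range(8):
            offset = np.array([(corner >> k) & 1 for k in range(3)], dtype=np.uint64)
            idx = x0 + offset
            h = idx * PRIMES                  # uint64 multiply wraps, as intended
            h = (h[:, 0] ^ h[:, 1] ^ h[:, 2]) % table_size
            tri = np.prod(np.where(offset == 1, w, 1.0 - w), axis=1)
            out += tri[:, None] * tables[lvl][h]
        feats.append(out)
    return np.concatenate(feats, axis=1)      # shape (N, levels * feat_dim)
```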
  3. The method according to claim 1, wherein the semantic feature extraction in S1 is performed by any one of ESPNetv2, DeepLab, U-Net, and SegNet.
  4. The method according to claim 1, wherein S2 specifically comprises: for a pixel u in an image, constructing the set of ray sampling points corresponding to the pixel through back projection and ray sampling: r(t_k) = o + t_k·d, wherein r(·) is the radiance-parameterized ray, o is the camera optical centre for the pixel, d is the projection direction of the pixel, and t_k is a sampling distance within the depth guide D(u); aligning the two-dimensional pixel to three-dimensional space through the depth-consistency projection operator Π, preserving only the sampling points matching the true depth D(u): M(t_k) = 1 if |t_k − D(u)| ≤ ξ, otherwise M(t_k) = 0, wherein M is the depth-validity mask, ξ is the set distance threshold, and δt is the spacing between adjacent sampling points; after obtaining the projection operator, mapping the semantic labels to three-dimensional space through the cross-modal mapping operator Φ to obtain the three-dimensional point set with semantic labels: S = {(Π(u), s(u))}, wherein s(u) represents the semantic label corresponding to pixel u.
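The depth-consistency projection can be sketched as follows; the sampling range, the threshold value, and the pinhole intrinsics K are illustrative assumptions, and depth is taken along the ray for simplicity:

```python
import numpy as np

def lift_labels_to_3d(depth, labels, K, n_samples=32, xi=0.05):
    """Sketch of depth-consistency-guided cross-modal mapping.

    For each pixel u, samples along the back-projected ray r(t) = o + t*d
    are kept only when the sampled distance matches the measured depth
    D(u) within threshold xi (the depth-validity mask), and the pixel's
    semantic label is attached to each surviving 3D point.
    """
    H, W = depth.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    pts, labs = [], []
    for v in range(H):
        for u in range(W):
            D = depth[v, u]
            if D <= 0:                                   # no valid depth
                continue
            d = np.array([(u - cx) / fx, (v - cy) / fy, 1.0])
            d /= np.linalg.norm(d)                        # ray direction
            ts = np.linspace(0.1, 1.5 * D, n_samples)     # sampling distances t_k
            mask = np.abs(ts - D) < xi                    # mask M(t_k)
            for t in ts[mask]:
                pts.append(t * d)                         # camera centre o at origin
                labs.append(labels[v, u])
    return np.array(pts), np.array(labs)
```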
  5. The method according to claim 4, wherein S4 specifically comprises: recording the three-dimensional point-normal pairs obtained from the mapped point set as {(p_i, n_i)}, wherein p_i denotes the coordinates of the i-th point and n_i the normal vector of the surface at the i-th point; embedding them in a multi-scale voxel pyramid and, in each voxel a, computing the weighted centre and covariance of the point set with density, visibility and normal consistency as weights, to characterize the anisotropy and occupancy intensity of the local shape: μ_a = (1/W_a) Σ_{i∈V_a} w_i p_i, Σ_a = (1/W_a) Σ_{i∈V_a} w_i (p_i − μ_a)(p_i − μ_a)^T, wherein w_i is the weight of the i-th point, comprehensively taking density, visibility and normal consistency into account, V_a is the point set of voxel a, W_a = Σ_{i∈V_a} w_i is the sum of the weights of voxel a, and μ_a is the weighted centroid of voxel a; concatenating the eigenvalues of the covariance, the occupancy rate and a curvature approximation over multiple scales, and applying power normalization and whitening, to obtain the global geometric descriptor g = whiten(pow_norm([λ(Σ_a), o_a, c_a]_{a∈A_s, s∈S})), wherein λ(Σ_a) are the eigenvalues of the covariance Σ_a, o_a is the voxel occupancy rate, c_a is the curvature approximation, A_s is the set of voxels at scale s, and S is the set of scales; in parallel with the geometry, acquiring the semantic label of each three-dimensional point and its pixel-level confidence, and constructing a semantic histogram influenced by depth consistency: h(c) = (1/Z) Σ_i k(r_i) γ_i^β [s_i = c], wherein s_i is the semantic label, c denotes the semantic category, k(r_i) is a kernel weight decreasing with the residual r_i, β is the power calibration coefficient, γ_i is the confidence, and Z is the normalization constant of the semantic histogram; normalizing the geometric vector g and applying a regularizing transform to the semantic histogram: g ← g/(‖g‖ + ε), h ← h/(Σ_c h(c) + ε), wherein ε is a constant for numerical stabilization; in actual retrieval, screening the loop candidate set by cosine similarity and using the Jensen-Shannon divergence as a semantic constraint term: S(t_i, t_j) = cos(g_{t_i}, g_{t_j}) − λ·JS(h_{t_i}, h_{t_j}), wherein S(t_i, t_j) is the similarity score between times t_i and t_j, λ is the weight, and h_{t_i}, h_{t_j} are the semantic histograms computed at times t_i and t_j; setting a similarity threshold to judge two frames of images, declaring a closed-loop trajectory (decision value 1) when the similarity is larger than the threshold, otherwise 0: L(t_i, t_j) = 1 if S(t_i, t_j) > τ, otherwise 0, wherein τ is the similarity threshold.
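The fused retrieval score and threshold test at the end of this claim can be sketched as follows, assuming a straightforward KL-based Jensen-Shannon implementation and illustrative values for the fusion weight and similarity threshold:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two (re-normalized) histograms."""
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def loop_score(g_i, g_j, h_i, h_j, lam=0.5, eps=1e-12):
    """Unified similarity: cosine similarity of the global geometric
    descriptors, penalized by the JS divergence of the semantic
    histograms; lam is the fusion weight (illustrative value)."""
    cos = np.dot(g_i, g_j) / (np.linalg.norm(g_i) * np.linalg.norm(g_j) + eps)
    return cos - lam * js_divergence(h_i, h_j)

def is_loop(g_i, g_j, h_i, h_j, tau=0.8):
    """Declare a loop closure when the fused score exceeds threshold tau."""
    return loop_score(g_i, g_j, h_i, h_j) > tau
```

Identical descriptor pairs score close to 1 (the JS term vanishes), while dissimilar geometry or conflicting semantics drive the score below the threshold.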
  6. A cross-modal semantic mapping loop detection system, characterized by comprising a feature extraction module, a depth-consistency-guided cross-modal alignment module, a map construction module based on neural radiance fields, and a loop detection module, wherein loop detection is performed by applying the method as claimed in any one of claims 1 to 5.
  7. An electronic device comprising a processor and a memory, the memory storing machine-executable instructions executable by the processor to implement the method as claimed in any one of claims 1 to 5.

Description

Cross-modal semantic mapping loop detection method and system

Technical Field

The invention relates to the technical field of computer vision and simultaneous localization and mapping, and in particular to a loop detection method and system of cross-modal semantic mapping. It is particularly suitable for visual SLAM systems based on Neural Radiance Fields (NeRF) and is used for realizing highly robust and accurate loop identification in complex dynamic environments.

Background

With the current rapid development of technology, Simultaneous Localization and Mapping (SLAM) provides basic perception support for intelligent systems by acquiring the geometric structure of the environment and the camera pose in real time, and is a key technology for constructing high-fidelity maps. Loop Closure Detection (LCD), a critical component of SLAM, can effectively eliminate the map drift caused by accumulated errors, thereby improving positioning accuracy and map quality.

Conventional visual SLAM systems typically employ explicit representations such as dense or sparse image features, point clouds, meshes, or voxels. However, these methods often exhibit low loop detection robustness and accuracy in the face of environmental changes and viewpoint differences. For example, the ORB-SLAM algorithm proposed by Raul Mur-Artal et al. adds a loop detection module on the basis of the PTAM framework and realizes fast feature matching by virtue of ORB features, but it is less robust against illumination changes and viewpoint differences. With the development of the technology, many researchers have improved the ORB-SLAM algorithm: ORB-SLAM2 extends ORB-SLAM and introduces more accurate map optimization and loop detection modules, and ORB-SLAM3 further enhances system performance and introduces support for inertial sensors, but the system is still greatly affected by external dynamic interference and repeated-scene mismatching. In addition, DTAM improves pose estimation accuracy through an optimization algorithm and further uses the obtained pose and adjacent frame images to construct a three-dimensional dense map, but lacks a global feature descriptor. SLAM++, a representative of object-level SLAM, completes data association through instance recognition, but it is highly dependent on a limited object library, and system performance is easily affected by occlusion and appearance change. VINS-Mono fuses an Inertial Measurement Unit (IMU) in a tightly coupled manner, but easily introduces erroneous loop closures. KinectFusion implements real-time three-dimensional reconstruction based on voxel TSDFs, but is prone to drift in the case of long loops. PL-SLAM provides a point-line hybrid binocular SLAM scheme and introduces a hybrid BoW loop detection mechanism, but its loop accuracy is low. In addition, the VIO system proposed by Zheng et al. regards the three-dimensional positions of marginalized keypoints as true positions for loop detection, but it is susceptible to matching accuracy and precision. An algorithm combining a bag-of-words model, image verification, and a tracking prediction model was proposed by Anping et al., but its performance under long sequences is poor. Shi Jiahao et al. propose a point-line feature visual SLAM loop detection algorithm based on improved LBD and data-dependent metrics, but it is relatively sensitive to viewpoint drift. Zhang Cuijun et al. propose a SLAM closed-loop detection method based on the HHO algorithm, but the algorithm is prone to over-fitted matching.

RDS-SLAM, proposed by Liu et al., segments the keyframe queue through independent threads and a bi-directional model to deliver semantic information, but does not build a joint matching mechanism for geometric and semantic features. In recent years, the advent of Neural Radiance Field (NeRF) technology has driven the rapid development of implicit, neural-network-based scene representations, providing a continuous, high-quality new paradigm for visual SLAM systems, but one lacking in loop detection. Sucar et al. propose iMAP, which uses a multi-layer perceptron (MLP) for scene characterization but does not introduce a loop detection mechanism. NICE-SLAM, proposed by Zhu et al., jointly characterizes the scene through a hierarchical feature network and MLPs, but also lacks a loop detection mechanism. Mip-NeRF, proposed by Barron et al., efficiently renders anti-aliased conical frustums, but the cone integration smooths out the geometric details of the scene, reducing the ability to recognize geometric feature differences in loop detection. Co-SLAM proposed by Wan