CN-121999158-A - Dynamic environment vision SLAM method based on dual-source dynamic mask tracking
Abstract
The application relates to a dynamic-environment visual SLAM method based on dual-source dynamic mask tracking. The method comprises: inputting a color image into a lightweight instance segmentation model for inference, generating a lagged semantic dynamic mask of a priori dynamic objects, and asynchronously transmitting the lagged semantic dynamic mask to a front-end tracking thread; in the front-end thread, detecting key points based on the lagged semantic dynamic mask and tracking them by optical flow into the current frame, generating a top-view depth histogram from the current frame, and obtaining the semantic dynamic mask of the current frame through screening and reconstruction; after the corresponding region is excluded, calculating a geometric dynamic mask, supplementing geometric dynamic masks detected in the previous frame but not in the current frame, and merging the geometric dynamic mask with the semantic dynamic mask to obtain a final dynamic mask; filtering dynamic features using the final dynamic mask, completing pose estimation and back-end optimization based on the static features, updating the TSDF map using only the static parts of the image, removing changed foreground structure on revisit, and outputting simultaneous localization and mapping results. The method can improve the localization accuracy and mapping quality of visual SLAM systems on small unmanned platforms.
Inventors
- PAN XIANFEI
- XU MEILIN
- GUO YAN
- CHEN ZONGYANG
- GUI KE
Assignees
- National University of Defense Technology (中国人民解放军国防科技大学)
Dates
- Publication Date: 20260508
- Application Date: 20260123
Claims (10)
- 1. A dynamic-environment visual SLAM method based on dual-source dynamic mask tracking, the method comprising: creating an independent parallel semantic segmentation thread, inputting color images into a lightweight instance segmentation model for real-time inference, generating lagged semantic dynamic masks corresponding to a priori dynamic objects, caching the lagged semantic dynamic masks together with their corresponding image frames, and asynchronously transmitting them to a visual front-end tracking thread; in the visual front-end tracking thread, detecting key points within the lagged semantic dynamic mask region of the corresponding image frame, obtaining corresponding tracking points in the current frame through optical flow tracking, computing a top-view depth histogram from the depth image of the current frame, and obtaining candidate two-dimensional detection boxes through threshold filtering and continuous point-set segmentation, from which the semantic dynamic mask of the current frame is obtained; after excluding the region corresponding to the semantic dynamic mask of the current frame, performing first-stage camera motion estimation through optical flow tracking and a random sample consensus algorithm, forming 3D-2D matches by depth back-projection, obtaining a geometric-constraint-violation point set in combination with interquartile-range filtering, and converting the geometric-constraint-violation point set into a geometric dynamic mask through bounding-box depth filtering; if a geometric dynamic mask exists in the previous frame but is not detected in the current frame, computing its corresponding mask in the current frame by optical-flow-based complementation; taking the union of the semantic dynamic mask and the geometric dynamic mask of the current frame to obtain a final dynamic mask, filtering dynamic features in the visual front-end tracking thread using the final dynamic mask, completing visual feature tracking and back-end optimization based on the remaining static features, performing TSDF map updating using only the static part of the image, and actively removing changed map foreground structures on revisit, thereby obtaining simultaneous localization and mapping results (illustrative sketches of these steps are given after the claims).
- 2. The method of claim 1, wherein detecting key points within the lagged semantic dynamic mask region and obtaining corresponding tracking points in the current frame by optical flow tracking comprises: uniformly detecting key points within the lagged semantic dynamic mask region using the FAST algorithm, and tracking the key points with the Lucas-Kanade optical flow method to obtain the corresponding tracking points in the current frame; wherein the abscissa of the top-view depth histogram is the same as that of the original depth image, the ordinate represents depth increasing from top to bottom, and the value at each bin (x, y) is the number of pixels, over all heights in the x-th column of the depth image, whose depth falls within the y-th depth threshold interval.
- 3. The method of claim 1, wherein performing first-stage camera motion estimation by optical flow tracking and a random sample consensus algorithm after excluding the region corresponding to the semantic dynamic mask of the current frame comprises: if an optical flow landing point lies inside the semantic dynamic mask calculated in the previous stage, the corresponding corner match does not participate in the initial pose estimation of this stage; based on the optical flow matches of the remaining static region, the random sample consensus algorithm selects 8 pairs of matching points as a minimal subset to compute a fundamental matrix, the fitting residuals of all matching points against the current model are counted, and the fundamental matrix with the highest inlier ratio is selected as the first-stage camera motion estimation result.
- 4. A method according to any one of claims 1 to 3, wherein forming 3D-2D matches by depth back-projection, obtaining a geometric-constraint-violation point set in combination with interquartile-range filtering, and converting it into a geometric dynamic mask via bounding-box depth filtering comprises: acquiring the depth value of a matching point from the depth image of the previous frame, back-projecting it into a three-dimensional point in the camera coordinate system, transforming the three-dimensional point into the world coordinate system using the pose of the previous frame, and pairing the resulting three-dimensional point with the corresponding matching point of the current frame to form a 3D-2D matching pair; solving an accurate camera pose from the 3D-2D matching pairs with valid depth by a PnP-RANSAC method, computing the reprojection errors of the current-frame matching points, and, after the accurate camera pose is obtained, recomputing the epipolar distances of all optical flows to obtain an epipolar-constraint-violation point set; performing a round of interquartile-range-based statistical filtering on the reprojection residual values of the points in the epipolar-constraint-violation point set to obtain the geometric-constraint-violation point set; and clustering the geometric-constraint-violation points by their spatial positions to distinguish multiple dynamic bodies, taking the minimum rectangular bounding box of each cluster in the image, screening out the foreground region within the bounding box using depth information, and extracting the foreground region as a dense geometric dynamic mask.
- 5. The method of claim 1, wherein performing a round of interquartile-range-based statistical filtering on the reprojection residual values of the points in the epipolar-constraint-violation point set comprises: denoting the epipolar distance sequence as $\{d_i\}_{i=1}^{N}$; if the sample size $N$ is less than 10, computing the first quartile $Q_1$ and the third quartile $Q_3$ to obtain the interquartile range $IQR = Q_3 - Q_1$; defining the acceptable residual range as $[Q_1 - k \cdot IQR,\; Q_3 + k \cdot IQR]$; the points whose epipolar distances exceed this range form the geometric-constraint-violation point set: $\mathcal{V} = \{ (\mathbf{p}_i, \mathbf{p}'_i) \mid d_i > Q_3 + k \cdot IQR \ \text{or}\ d_i < Q_1 - k \cdot IQR \}$, wherein $(\mathbf{p}_i, \mathbf{p}'_i)$ denotes the two-dimensional homogeneous coordinates of the $i$-th pair of matching points, $\mathcal{V}$ denotes the geometric-constraint-violation point set, $d_i$ denotes the epipolar distance of the $i$-th pair of matching points in the visible-light image of the current frame, $Q_1$ denotes the first quartile, $k$ denotes an empirical coefficient, $IQR$ denotes the interquartile range, and $Q_3$ denotes the third quartile.
- 6. The method of claim 5, wherein the epipolar distance is calculated by: $d_i = \dfrac{\left| \mathbf{p}_i'^{\mathrm{T}} F \mathbf{p}_i \right|}{\sqrt{(F\mathbf{p}_i)_1^2 + (F\mathbf{p}_i)_2^2}}$, with $F = K^{-\mathrm{T}} [T]_{\times} R K^{-1}$, wherein $\mathbf{p}_i$ and $\mathbf{p}'_i$ are the two-dimensional homogeneous coordinates of the $i$-th pair of matching points in the two frames, $F$ is the fundamental matrix, $K$ is the camera intrinsic matrix, $T$ and $R$ are respectively the translation vector and rotation matrix of the camera's inter-frame relative motion, $[T]_{\times}$ is the skew-symmetric matrix of $T$, and the superscript $\mathrm{T}$ denotes transposition.
- 7. The method of claim 1, wherein filtering dynamic features in the visual front-end tracking thread with the final dynamic mask comprises: in the ORB feature point extraction stage, if a detected feature point lies within the pixel region covered by the final dynamic mask, the feature point is removed directly, and only feature points in the static region are retained to participate in subsequent feature matching and pose estimation.
- 8. The method of claim 1, wherein performing visual feature tracking and back-end optimization based on the filtered static features comprises: when the motion model fails or the matching quality is insufficient, falling back to a keyframe-based pose estimation strategy, and deciding keyframes according to preset conditions on the effective feature quantity, parallax change, and tracking quality within the static region.
- 9. The method of claim 1, wherein using the static portion of the image for TSDF map updating comprises: regarding depth pixels outside the final dynamic mask as static observations and using them for fusion updating of the TSDF map; letting the center of any voxel in the map be $\mathbf{v}$, with TSDF value $D(\mathbf{v})$ and weight $W(\mathbf{v})$; for a current-frame pixel $\mathbf{u}$, combining the camera pose and intrinsics, the observation ray of the current-frame pixel is: $\mathbf{r}(\lambda) = \mathbf{t}_{wc} + \lambda\, R_{wc}\, \pi^{-1}(\mathbf{u})$, $\lambda \in (0, d_{\mathbf{u}}]$, wherein $d_{\mathbf{u}}$ is the depth value of the pixel, $\pi^{-1}(\cdot)$ is the back-projection mapping from the pixel to normalized camera coordinates, $\mathbf{t}_{wc}$ is the translation vector of the current-frame camera in the world coordinate system, $R_{wc}$ is the corresponding rotation matrix, and $\lambda$ is the parameterization variable of the ray; if the set of map voxels traversed by the ray is non-empty, the TSDF values of the corresponding voxels are updated only with the depth information of the static region.
- 10. The method of claim 1, wherein actively removing changed map foreground structures upon revisit comprises: when the camera revisits an already-built map area, if a map voxel traversed by the observation ray is marked as a dynamic-related region, performing a weight reduction operation on the map voxel, so that residues of the dynamic foreground in the map are gradually cleared; the weight reduction operation on a map voxel being: $W_k \leftarrow W_k - \Delta W$, wherein $\Delta W$ is a preset weight attenuation amount, and $W_k$ is the current weight value of the $k$-th map voxel.
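The sketches below are editorial illustrations, not part of the patent text; they show, under stated assumptions, one plausible Python realization of the steps recited in the claims. This first sketch corresponds to the asynchronous dual-thread arrangement of claim 1; the segmentation model call, the prior dynamic class list, and the queue size are assumptions.

```python
import queue
import threading

def segment_instances(color_image):
    """Placeholder for the lightweight instance segmentation model; returns a list of
    (class_name, binary_mask) pairs. A real model call would go here."""
    return []

DYNAMIC_CLASSES = {"person", "car"}           # assumed prior dynamic categories
mask_queue = queue.Queue(maxsize=4)           # lagged (frame_id, image, masks) buffer

def segmentation_thread(frame_source):
    """Runs in parallel with tracking; its output lags by the model's inference time."""
    for frame_id, color_image in frame_source:
        masks = [m for c, m in segment_instances(color_image) if c in DYNAMIC_CLASSES]
        try:
            mask_queue.put_nowait((frame_id, color_image, masks))  # never block the tracker
        except queue.Full:
            pass                                                   # drop if the tracker lags

def tracking_thread():
    """Visual front end: consumes lagged masks asynchronously as they become available."""
    while True:
        frame_id, image, masks = mask_queue.get()
        # ... propagate the lagged masks to the current frame via optical flow
        mask_queue.task_done()

def start_pipeline(frame_source):
    seg = threading.Thread(target=segmentation_thread, args=(frame_source,), daemon=True)
    trk = threading.Thread(target=tracking_thread, daemon=True)
    seg.start()
    trk.start()
    return seg, trk
```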
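A minimal sketch of the key-point tracking and top-view depth histogram of claim 2, assuming OpenCV, a binary uint8 mask, and a metric depth image; the FAST threshold, maximum depth, and bin size are assumed parameters rather than values from the patent.

```python
import cv2
import numpy as np

def track_mask_points(prev_gray, cur_gray, lagged_mask):
    """Detect FAST corners inside the lagged mask and track them with LK optical flow."""
    fast = cv2.FastFeatureDetector_create(threshold=20)
    kps = fast.detect(prev_gray, mask=lagged_mask)        # key points inside the mask only
    if not kps:
        return np.empty((0, 2), np.float32), np.empty((0, 2), np.float32)
    p0 = cv2.KeyPoint_convert(kps).reshape(-1, 1, 2).astype(np.float32)
    p1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray, p0, None)
    ok = status.ravel() == 1
    return p0[ok].reshape(-1, 2), p1[ok].reshape(-1, 2)

def top_view_depth_histogram(depth_m, max_depth=6.0, bin_size=0.1):
    """H[y, x] = number of pixels in column x of the depth image whose depth lies in bin y."""
    h, w = depth_m.shape
    n_bins = int(max_depth / bin_size)
    bins = np.clip((depth_m / bin_size).astype(np.int32), 0, n_bins - 1)
    valid = depth_m > 0
    hist = np.zeros((n_bins, w), np.int32)
    for x in range(w):                                    # accumulate per image column
        hist[:, x] = np.bincount(bins[valid[:, x], x], minlength=n_bins)
    return hist
```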
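A minimal sketch of the first-stage camera motion estimation of claim 3, assuming OpenCV and matches stored as N x 2 pixel arrays: flow landing points inside the current semantic dynamic mask are excluded, and the fundamental matrix is fitted with RANSAC on the remaining static matches; cv2.FM_RANSAC retains the model with the largest inlier set, matching the highest-inlier-ratio selection in the claim. The RANSAC threshold and confidence are assumed values.

```python
import cv2
import numpy as np

def first_stage_motion(p_prev, p_cur, semantic_mask_cur):
    """RANSAC fundamental-matrix estimation on static-region flow matches.
    p_prev, p_cur: N x 2 pixel coordinates; semantic_mask_cur: non-zero inside dynamic regions."""
    xs = np.clip(p_cur[:, 0].astype(int), 0, semantic_mask_cur.shape[1] - 1)
    ys = np.clip(p_cur[:, 1].astype(int), 0, semantic_mask_cur.shape[0] - 1)
    static = semantic_mask_cur[ys, xs] == 0       # keep matches landing outside the mask
    F, inlier_flags = cv2.findFundamentalMat(
        p_prev[static], p_cur[static],
        method=cv2.FM_RANSAC, ransacReprojThreshold=1.0, confidence=0.99)
    return F, static, inlier_flags
```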
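A minimal sketch of the depth back-projection and PnP-RANSAC refinement of claim 4, assuming a pinhole intrinsic matrix K, a metric depth image of the previous frame, and the previous-frame pose T_w_prev given as a 4 x 4 world-from-camera transform; the reprojection-error threshold and iteration count are assumptions.

```python
import cv2
import numpy as np

def backproject_to_world(p_prev, depth_prev, K, T_w_prev):
    """Back-project previous-frame matches with valid depth into world-frame 3-D points."""
    xs, ys = p_prev[:, 0].astype(int), p_prev[:, 1].astype(int)
    z = depth_prev[ys, xs]
    valid = z > 0
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    pc = np.stack([(p_prev[valid, 0] - cx) * z[valid] / fx,
                   (p_prev[valid, 1] - cy) * z[valid] / fy,
                   z[valid]], axis=1)                     # points in the previous camera frame
    pw = (T_w_prev[:3, :3] @ pc.T).T + T_w_prev[:3, 3]    # transform into the world frame
    return pw, valid

def refine_pose_pnp(pw, p_cur, K, reproj_thresh=2.0):
    """PnP-RANSAC on 3D-2D pairs (p_cur: current-frame pixels of the valid-depth matches),
    then per-point reprojection errors under the refined pose."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pw.astype(np.float64), p_cur.astype(np.float64), K, None,
        reprojectionError=reproj_thresh, iterationsCount=100)
    proj, _ = cv2.projectPoints(pw.astype(np.float64), rvec, tvec, K, None)
    reproj_err = np.linalg.norm(proj.reshape(-1, 2) - p_cur, axis=1)
    return ok, rvec, tvec, reproj_err
```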
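A minimal sketch of the epipolar-distance computation of claim 6 and the interquartile-range filtering of claim 5; the value k = 1.5 is an assumed choice for the empirical coefficient, which the patent does not specify.

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]x such that [t]x v = t x v."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_from_motion(K, R, t):
    """F = K^-T [t]x R K^-1 for inter-frame motion (R, t) and intrinsics K."""
    K_inv = np.linalg.inv(K)
    return K_inv.T @ skew(t) @ R @ K_inv

def epipolar_distances(F, pts_prev, pts_cur):
    """Point-to-epipolar-line distances d_i for N matches given as N x 2 arrays."""
    p1 = np.hstack([pts_prev, np.ones((len(pts_prev), 1))])   # homogeneous coordinates
    p2 = np.hstack([pts_cur, np.ones((len(pts_cur), 1))])
    lines = (F @ p1.T).T                                      # epipolar lines in current image
    num = np.abs(np.sum(p2 * lines, axis=1))
    return num / np.hypot(lines[:, 0], lines[:, 1])

def iqr_violation_mask(residuals, k=1.5):
    """One round of interquartile-range filtering; True marks constraint violators."""
    q1, q3 = np.percentile(residuals, [25, 75])
    iqr = q3 - q1
    return (residuals < q1 - k * iqr) | (residuals > q3 + k * iqr)
```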
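A minimal sketch of the static-feature extraction of claim 7, assuming OpenCV's ORB detector and a final dynamic mask stored as an image that is non-zero inside dynamic regions.

```python
import cv2
import numpy as np

def extract_static_orb(gray, final_dynamic_mask, n_features=1000):
    """Detect ORB features only where final_dynamic_mask == 0 (the static region)."""
    orb = cv2.ORB_create(nfeatures=n_features)
    static_mask = (final_dynamic_mask == 0).astype(np.uint8) * 255   # detect outside the mask
    keypoints, descriptors = orb.detectAndCompute(gray, static_mask)
    return keypoints, descriptors
```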
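Claim 8 only states that keyframes must satisfy preset conditions on effective feature quantity, parallax change, and tracking quality in the static region; the thresholds in the following sketch are therefore purely assumed values.

```python
def is_keyframe(n_static_features, parallax_px, tracked_ratio,
                min_features=80, min_parallax=10.0, min_tracked_ratio=0.4):
    """Keyframe decision on static-feature count, parallax change, and tracking quality
    (all thresholds are assumptions, not values from the patent)."""
    return (n_static_features >= min_features
            and parallax_px >= min_parallax
            and tracked_ratio >= min_tracked_ratio)
```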
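A minimal sketch covering the static-ray TSDF fusion of claim 9 and the revisit weight decay of claim 10, assuming a dense voxel grid anchored at the world origin, a fixed truncation distance, weighted-average fusion, and a weight floor of zero; delta_w plays the role of the preset weight attenuation amount.

```python
import numpy as np

def integrate_static_ray(tsdf, weight, dynamic_flag, voxel_size, origin, direction, depth,
                         trunc=0.1, delta_w=1.0):
    """Fuse one static-pixel observation ray into the TSDF volume and decay voxels
    previously marked as dynamic-related (revisit cleanup).

    tsdf, weight, dynamic_flag : 3-D arrays over the voxel grid
    origin    : camera centre t_wc in the world frame
    direction : unit direction of the back-projected pixel ray R_wc * pi^-1(u)
    depth     : measured depth d_u of the pixel
    """
    for lam in np.arange(max(depth - trunc, voxel_size), depth + trunc, voxel_size):
        p = origin + lam * direction                        # point r(lambda) on the ray
        idx = tuple(np.floor(p / voxel_size).astype(int))
        if any(i < 0 or i >= s for i, s in zip(idx, tsdf.shape)):
            continue
        if dynamic_flag[idx]:
            weight[idx] = max(weight[idx] - delta_w, 0.0)   # W_k <- W_k - dW on revisit
            continue
        sdf = np.clip((depth - lam) / trunc, -1.0, 1.0)     # truncated signed distance
        w_new = weight[idx] + 1.0
        tsdf[idx] = (tsdf[idx] * weight[idx] + sdf) / w_new # running weighted average
        weight[idx] = w_new
```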
Description
Dynamic environment vision SLAM method based on dual-source dynamic mask tracking

Technical Field

The application relates to the technical field of visual simultaneous localization and mapping, in particular to a dynamic-environment visual SLAM method based on dual-source dynamic mask tracking.

Background

In recent years, along with the rapid development of autonomous unmanned system technologies, unmanned platforms such as unmanned ground vehicles, unmanned aerial vehicles, and bionic robots have been widely applied in fields such as warehouse logistics, power inspection, and commercial services. Simultaneous Localization and Mapping (SLAM) is a key technology required for an unmanned platform to achieve autonomous navigation and environment perception. Visual SLAM aims to let the unmanned platform use vision as its main information source: environmental information is acquired through onboard sensors, the platform's pose is estimated in real time, and a map carrying spatial information of the environment is built. Compared with sensors such as lidar, visual sensors are light, low-cost, and low-power, and images provide intuitive and rich scene semantic information, so visual SLAM has attracted wide attention from researchers. A visual SLAM system acquires observations of visual landmarks in the perceived environment and continuously corrects its own state estimate to reach an accurate and consistent global state estimation result. This inherent geometric-constraint principle means that visual SLAM methods are built on a static-scene assumption; however, this assumption is frequently violated by the dynamic scenes ubiquitous in real environments. Scene dynamics generally inject a large amount of noise into the visual front-end features, causing pose estimation drift and reduced global localization accuracy, and indirectly degrading the quality of environment mapping. In view of this problem, a great deal of research in recent years has been devoted to improving the robustness of visual SLAM in dynamic environments, and the main technical routes fall into two categories. The first category filters features at the visual front end so that visual features in dynamic regions do not participate in subsequent state estimation, making the sources of observations and optimization constraints more reliable. Most such methods detect and reject a priori specified dynamic objects by introducing semantic information; these systems generally rely on a deep learning model for object detection or semantic segmentation, explicitly removing dynamic regions in the front-end tracking and mapping stages, and may additionally screen dynamic feature points through geometric-space consistency checks.
The second category focuses on the back-end pose estimation stage and differentiates the reliability of each environmental landmark's constraint during optimization; for example, the constraint contribution of landmarks with a high dynamic probability to camera pose estimation is down-weighted through the design of error-term weights. The effectiveness of such methods is premised on an accurate estimate of each visual landmark's dynamic attribute at an early stage of the system. Although the above technical routes have achieved notable results in some conventional dynamic scenarios, their applicability in complex dynamic scenes is still limited by various factors. Front-end dynamic filtering methods based on semantic information generally rely on deep learning models for object detection or semantic segmentation; these models are computationally expensive, and under the currently mainstream architecture of serial semantic extraction and feature tracking, the real-time performance of the system often drops significantly, limiting the engineering application of such methods. In addition, dynamic detection methods relying on geometric constraints usually use sparse feature points as the basic processing unit to exclude dynamic factors; although they can perform well in low-dynamic scenes under good visual conditions, they have difficulty explicitly describing the overall structure of dynamic objects, and therefore struggle to maintain the spatio-temporal continuity of dynamic detection in complex dynamic scenes. In summary, visual SLAM technology is widely applied to the autonomous navigation of various unmanned platforms, and how to identify and cope with dynamic-scene interference is one of its key problems; however, the real-ti