CN-118747809-B - Multi-view semantic embedding scene recognition method based on point cloud
Abstract
The invention belongs to the technical field of scene recognition and discloses a point cloud-based multi-view semantic embedding scene recognition method. Point cloud data from two tracks are separately encoded into descriptors by a descriptor encoding network; the two sets of descriptors serve as a database and a query set, respectively. A nearest-neighbor algorithm then finds the most similar point cloud in the database, completing global scene recognition based on point clouds. Through multi-view projection, the descriptor encoding network projects the point cloud data to a forward view and a bird's-eye view, yielding a range-view image and a bird's-eye-view image; the two images are fused through semantic embedding feature learning, and the final point cloud descriptor is obtained through feature-adaptive fusion. The descriptor encoding network is computationally efficient, exploits point cloud characteristics from different viewing angles, and effectively uses semantic information to enhance scene recognition. It describes point cloud scenes effectively and is applicable to scene recognition tasks.
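The retrieval step described in the abstract reduces to a nearest-neighbor search over fixed-length descriptors. A minimal sketch in Python/NumPy, assuming the descriptors have already been encoded as vectors; the 256-dimensional size and track lengths are illustrative, not fixed by the patent:

```python
# Minimal nearest-neighbor retrieval over point cloud descriptors.
# Descriptor dimension and track sizes are illustrative only.
import numpy as np

def retrieve(database: np.ndarray, queries: np.ndarray) -> np.ndarray:
    """Return, for each query descriptor, the index of the closest
    database descriptor under Euclidean distance."""
    # Pairwise squared distances: |a - b|^2 = |a|^2 - 2 a.b + |b|^2.
    d2 = ((queries ** 2).sum(1, keepdims=True)
          - 2.0 * queries @ database.T
          + (database ** 2).sum(1))
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 256))   # descriptors of the earlier track
qs = rng.normal(size=(10, 256))     # descriptors of the later track
matches = retrieve(db, qs)          # one database index per query
```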
Inventors
- ZHANG YUNZHOU
- ZHANG JINPENG
- RONG LEI
- WANG LI
- WANG SIZHAN
Assignees
- Northeastern University (东北大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2024-07-04
Claims (5)
- 1. A multi-view semantic embedding scene recognition method based on point clouds, characterized in that point cloud data from two tracks are separately encoded into unique descriptors by a descriptor encoding network, and the descriptors of the two tracks serve as a database and a query set, respectively; the descriptor encoding network comprises multi-view projection, semantic embedding feature learning, and feature-adaptive fusion; through multi-view projection, the point cloud data are projected to a forward view and a bird's-eye view to obtain a range-view image and a bird's-eye-view image, the two images are fused by semantic embedding feature learning, and the final point cloud descriptor is obtained by feature-adaptive fusion (illustrative sketches of the projection, pooling, and fusion steps follow the claims). The multi-view projection is specifically as follows. Range-view image generation: any point p = {x, y, z} in the point cloud data is projected to the range view by the transformation θ = atan2(y, x), φ = arcsin(z / √(x² + y² + z²)), r = ⌊θ/Θ⌋, c = ⌊φ/Φ⌋, where θ is the azimuth angle from the origin to the point (x, y), i.e. the angle between the point and the X axis, computed by the atan2 function; φ is the angle between the point and the X-Y plane; r and c respectively represent the horizontal and vertical coordinates of the point projected onto the 2D plane; and Θ and Φ denote the horizontal and vertical angular resolutions of the lidar. Each projected pixel is filled with the 5-tuple (range, x, y, z, intensity), where range is the distance from the point to the z-axis, and every channel is set to 0 where no point projects. Bird's-eye-view image generation: the point cloud is projected onto the X-Y plane; a filter keeps only the points within a specified interval; the plane coordinates corresponding to each point are determined and the fill values of the corresponding positions are assigned; taking the height and reflection intensity of the original point cloud as the physical quantities for discrimination and geometric representation, this operation yields a bird's-eye-view image with two channels, z coordinate and reflection intensity. The semantic embedding feature learning comprises a semantic segmentation network, a self-attention module, and a combined convolution module. The semantic segmentation network processes the range-view image to obtain a point cloud carrying semantic segmentation results; the feature extraction part of the semantic segmentation network is connected to the self-attention module, which further refines the features for the scene recognition task; the point cloud carrying semantic segmentation results is converted into a semantic bird's-eye-view image, forming a three-channel bird's-eye-view image that is processed by the combined convolution module; the output features of the combined convolution module and of the self-attention module are fused into a global feature representation by late fusion, and GeM pooling yields the final point cloud descriptor. The feature-adaptive fusion specifically fuses and encodes the global features of the two branches into a unique point cloud descriptor; the fusion passes through a Transformer layer module and GeM pooling in sequence.
- 2. The point cloud-based multi-view semantic embedding scene recognition method according to claim 1, wherein the two tracks are data acquired along the same route at different times; the descriptors of the earlier-acquired track constitute the database, and the descriptors of the later-acquired track constitute the query set.
- 3. The point cloud-based multi-view semantic embedding scene recognition method according to claim 1, wherein the specified interval of the point cloud is set to a filtering range of 20 meters, the resolution is set to 5 centimeters, and the origin is translated so that the minimum of the data falls at location (0, 0) (these parameters appear in the bird's-eye-view sketch after the claims).
- 4. The multi-view semantic embedding scene recognition method based on point clouds according to claim 1, characterized in that the semantic segmentation network is divided into two stages, a downsampling encoding stage and an upsampling decoding stage, whose model parameters are trained in advance; the point cloud carrying semantic segmentation results is projected to the bird's-eye view to obtain the semantic bird's-eye-view image; in the range-view image branch, the features are further extracted into global features oriented to the scene recognition task by the self-attention module; in the bird's-eye-view branch, a new three-channel bird's-eye-view image is constructed and input to the combined convolution module for global feature extraction; the two branches are processed into respective descriptors by GeM pooling, and the descriptors of the two branches are fused by the feature-adaptive fusion module to obtain the final feature representation.
- 5. The multi-view semantic embedding scene recognition method based on point clouds according to claim 4, wherein the self-attention module first uses three convolution layers to map the feature map of the downsampling encoding stage and then refines the features through a self-attention mechanism; the combined convolution module, composed of four convolution layers and a pooling layer, extracts basic features of the bird's-eye-view image; within the combined convolution module, two sets of CNNs with different parameters process the basic features separately; features of different levels are integrated by cross-layer summation; and finally the integration results are merged into a unified representation by a dot-product operation (see the self-attention and combined convolution sketches after the claims).
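Claim 1's range-view projection can be sketched as follows. This is a hedged reading under the common spherical-projection convention: the image size, vertical field of view, and the arcsin form of the elevation angle are assumptions, since the claim fixes only θ, φ, the angular resolutions, and the 5-tuple of channel values.

```python
# Hedged sketch of the range-view projection of claim 1; H, W and the
# field of view are illustrative, not fixed by the patent.
import numpy as np

def range_view(points: np.ndarray, H: int = 64, W: int = 900,
               fov_up: float = 3.0, fov_down: float = -25.0) -> np.ndarray:
    """points: (N, 4) array of (x, y, z, intensity).
    Returns an (H, W, 5) image with channels (range, x, y, z, intensity)."""
    x, y, z, intensity = points.T
    depth = np.sqrt(x**2 + y**2 + z**2)
    theta = np.arctan2(y, x)                      # azimuth, in (-pi, pi]
    phi = np.arcsin(z / np.maximum(depth, 1e-8))  # angle to the X-Y plane
    up, down = np.radians(fov_up), np.radians(fov_down)
    col = (0.5 * (1.0 - theta / np.pi) * W).astype(int)         # azimuth axis
    row = ((1.0 - (phi - down) / (up - down)) * H).astype(int)  # elevation axis
    rng = np.sqrt(x**2 + y**2)  # "range": distance to the z-axis, per the claim
    img = np.zeros((H, W, 5), dtype=np.float32)
    keep = (row >= 0) & (row < H) & (col >= 0) & (col < W)
    img[row[keep], col[keep]] = np.stack([rng, x, y, z, intensity], 1)[keep]
    return img
```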
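The bird's-eye-view generation of claims 1 and 3 (20-meter filtering range, 5-centimeter resolution, origin translated so the data minimum lands at pixel (0, 0)) might look like the sketch below; the fixed output size is an assumption.

```python
# Sketch of the bird's-eye-view image of claims 1 and 3, with the z
# coordinate and reflection intensity as the two channels.
import numpy as np

def bev_image(points: np.ndarray, half_side: float = 20.0,
              res: float = 0.05) -> np.ndarray:
    """points: (N, 4) array of (x, y, z, intensity).
    Returns a two-channel (z, intensity) bird's-eye-view image."""
    x, y, z, intensity = points.T
    keep = (np.abs(x) <= half_side) & (np.abs(y) <= half_side)  # 20 m filter
    x, y, z, intensity = x[keep], y[keep], z[keep], intensity[keep]
    col = ((x - x.min()) / res).astype(int)  # data minimum maps to column 0
    row = ((y - y.min()) / res).astype(int)  # data minimum maps to row 0
    n = int(2 * half_side / res) + 1         # up to 40 m of extent at 5 cm
    img = np.zeros((n, n, 2), dtype=np.float32)
    img[row, col] = np.stack([z, intensity], axis=1)
    return img
```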
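Claim 5's self-attention module, three convolution layers mapping the encoder feature map followed by attention-based refinement, is consistent with the widely used dot-product self-attention layer sketched below; the channel counts and the learned residual weight gamma are assumptions.

```python
# Sketch of the self-attention refiner of claim 5: 1x1 convolutions map the
# feature map to query/key/value, attention re-weights the features.
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // 8, 1)  # conv mappings
        self.k = nn.Conv2d(channels, channels // 8, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))       # residual weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)        # (B, HW, C//8)
        k = self.k(x).flatten(2)                        # (B, C//8, HW)
        v = self.v(x).flatten(2)                        # (B, C, HW)
        attn = torch.softmax(q @ k, dim=-1)             # (B, HW, HW)
        out = (v @ attn.transpose(1, 2)).view(B, C, H, W)
        return self.gamma * out + x                     # refined feature map
```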
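One plausible reading of claim 5's combined convolution module: four convolution layers plus a pooling layer extract basic features of the three-channel semantic bird's-eye-view image, two CNN branches with different parameters process them, features of different levels are integrated by cross-layer summation, and the branch results are merged by an element-wise product. All channel counts below are illustrative.

```python
# Hedged sketch of the combined convolution module of claim 5.
import torch
import torch.nn as nn

def conv_block(cin: int, cout: int) -> nn.Sequential:
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU())

class CombinedConv(nn.Module):
    def __init__(self):
        super().__init__()
        # Four convolution layers and a pooling layer: basic BEV features.
        self.base = nn.Sequential(conv_block(3, 32), conv_block(32, 64),
                                  conv_block(64, 64), conv_block(64, 64),
                                  nn.MaxPool2d(2))
        # Two sets of CNNs with different (independently learned) parameters.
        self.branch_a = nn.ModuleList([conv_block(64, 64), conv_block(64, 64)])
        self.branch_b = nn.ModuleList([conv_block(64, 64), conv_block(64, 64)])

    def forward(self, bev: torch.Tensor) -> torch.Tensor:
        x = self.base(bev)                   # shared basic features
        a0 = self.branch_a[0](x); a1 = self.branch_a[1](a0)
        b0 = self.branch_b[0](x); b1 = self.branch_b[1](b0)
        a, b = a0 + a1, b0 + b1              # cross-layer summation
        return a * b                         # dot-product-style integration
```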
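Both branches end in GeM pooling (claim 1). The generalized-mean pooling layer below is the standard image-retrieval formulation; the patent does not specify the exponent, so the learnable p initialized to 3 is a conventional assumption.

```python
# Generalized-mean (GeM) pooling: mean of x^p over space, then the 1/p root.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeM(nn.Module):
    def __init__(self, p: float = 3.0, eps: float = 1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.ones(1) * p)  # learnable exponent
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> (B, C) global descriptor.
        x = x.clamp(min=self.eps).pow(self.p)
        return F.avg_pool2d(x, x.shape[-2:]).pow(1.0 / self.p).flatten(1)
```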
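Finally, the feature-adaptive fusion of claim 1 passes the two branch features through a Transformer layer module and GeM pooling in sequence. The sketch below stacks the branch descriptors as a two-token sequence; the dimension, head count, and the final L2 normalization are assumptions.

```python
# Sketch of the feature-adaptive fusion of claim 1: Transformer encoder layer
# over the two branch tokens, then GeM pooling over the tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4, p: float = 3.0):
        super().__init__()
        self.mixer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                batch_first=True)
        self.p = p

    def forward(self, f_rv: torch.Tensor, f_bev: torch.Tensor) -> torch.Tensor:
        tokens = torch.stack([f_rv, f_bev], dim=1)   # (B, 2, dim)
        mixed = self.mixer(tokens)                   # cross-branch attention
        desc = mixed.clamp(min=1e-6).pow(self.p).mean(1).pow(1.0 / self.p)
        return F.normalize(desc, dim=1)              # unit-length descriptor
```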
Description
Multi-view semantic embedding scene recognition method based on point cloud
Technical Field
The invention relates to the technical field of scene recognition, in particular to a point cloud-based multi-view semantic embedding scene recognition method.
Background
The ability to recognize and match scenes is critical to the autonomous navigation of robots and autonomous vehicles. It is also the basis for loop-closure detection and relocalization in simultaneous localization and mapping (SLAM). In recent years, scene recognition technology has made remarkable progress, producing a variety of vision-based approaches. However, the performance of visual scene recognition remains severely limited by season, viewing angle, illumination, and weather. In contrast, lidar sensors largely avoid these visual shortcomings, so point cloud scene recognition methods have received a great deal of attention.
Current point cloud scene recognition algorithms still suffer from inefficient point feature extraction and computation and from mismatches between similar scenes. On the one hand, projecting the three-dimensional point cloud onto a two-dimensional plane makes feature extraction more efficient to compute and easier to implement; since the scene recognition task focuses on differences between scenes, fusing the front-view and bird's-eye-view projections of the three-dimensional point cloud clearly expresses the spatial specificity of a scene. On the other hand, semantic information plays a vital role: it helps identify object categories in a scene and understand the overall structure and layout of the scene, which improves recognition precision and enhances system robustness. Objects such as roads, vehicles, and pedestrians can be recognized through semantic information, and more robust features ensure that the system can cope with more complex scenes. Although a large number of point cloud-based scene recognition algorithms have emerged, few methods achieve a balance between computational performance and accuracy, and there is still room for improvement in large-scale outdoor scene recognition.
The literature "Luo L, Cao S Y, Han B, et al. BVMatch: Lidar-Based Place Recognition Using Bird's-Eye View Images. IEEE Robotics and Automation Letters, vol. 6, no. 3, pp. 6076-6083, 2021" projects the 3D point cloud onto a bird's-eye-view image and then uses a set of Log-Gabor filters to construct a maximum index map that encodes the orientation information of structures in the image. Although this improves the computational efficiency of descriptor generation, the loss of vertical scene information is unavoidable. The literature "Komorowski J. MinkLoc3D: Point Cloud Based Large-Scale Place Recognition. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1790-1799, 2021" computes discriminative 3D point cloud descriptors from sparse voxelized point cloud representations with sparse 3D convolutions; sparse 3D convolution for global feature extraction ensures real-time computation but cannot effectively handle rotational changes of the scene. The literature "Li L, Kong X, Zhao X, et al. RINet: Efficient 3D LiDAR-Based Place Recognition Using Rotation Invariant Neural Network. IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 4321-4328, 2022" combines semantic and geometric features to improve descriptive power and then employs a rotation-invariant siamese neural network to predict the similarity between descriptors; combining semantic and geometric information preserves the completeness of the information, but the method is computationally less efficient than others. The literature "Zhao S, Yin P, Yi G, et al. SphereVLAD++: Attention-Based and Signal-Enhanced Viewpoint Invariant Descriptor. IEEE Robotics and Automation Letters, vol. 8, no. 1, pp. 256-263, 2023" projects the point cloud onto a spherical view of multiple distinct regions and captures the contextual connection between local features and their dependencies on the global 3D geometry, achieving rotational invariance; spherical convolution encodes locally direction-equivariant features but does not exploit the correlations between features. In summary, current research cannot simultaneously ensure recognition accuracy, real-time performance, and complete scene expression, and the relationships among multi-view projection features are not handled effectively.
Disclosure of the Invention
The invention provides a point cloud-based multi-view semantic embedding scene recognition method that performs scene recognition through multi-view feature fusion and semantic embedding. It effectively exploits the projection features of different viewing angles of the point cloud, uses semantic information to enhance the recognition of places with similar structures, and achieves a balance between recognition accuracy and computational efficiency.
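Putting the pieces together, the descriptor encoding network described in the disclosure could be composed as below. This is a hypothetical composition of the sketches given after the claims; in particular, the (features, labels) return signature of the pretrained semantic segmentation network is an assumption, since the patent only describes it as an encoder-decoder (claim 4).

```python
# Hypothetical end-to-end composition: range-view branch (segmentation
# features -> self-attention -> GeM) and semantic-BEV branch (combined
# convolution -> GeM), followed by feature-adaptive fusion.
import torch
import torch.nn as nn

class DescriptorEncoder(nn.Module):
    def __init__(self, seg_net: nn.Module, attn: nn.Module,
                 combined: nn.Module, gem: nn.Module, fusion: nn.Module):
        super().__init__()
        self.seg_net = seg_net    # pretrained encoder-decoder (claim 4)
        self.attn = attn          # self-attention refiner (claim 5)
        self.combined = combined  # combined convolution module (claim 5)
        self.gem = gem            # GeM pooling (claim 1)
        self.fusion = fusion      # Transformer + GeM fusion (claim 1)

    def forward(self, range_img: torch.Tensor,
                semantic_bev: torch.Tensor) -> torch.Tensor:
        feats, _labels = self.seg_net(range_img)  # assumed return signature
        f_rv = self.gem(self.attn(feats))         # range-view global feature
        f_bev = self.gem(self.combined(semantic_bev))  # BEV global feature
        return self.fusion(f_rv, f_bev)           # unique scene descriptor
```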