
CN-121982712-A - Unmanned aerial vehicle live-action three-dimensional model semantic segmentation method based on grid attention

CN 121982712 A

Abstract

The invention discloses a grid-attention-based semantic segmentation method for unmanned aerial vehicle (UAV) live-action three-dimensional models, comprising the steps of: (1) sampling and partitioning an input three-dimensional real-scene model dataset to obtain a number of spatial sub-blocks; (2) taking the spatial sub-blocks as input units and extracting features through a multi-scale feature pyramid network with an encoder-decoder architecture; (3) performing feature enhancement on the input point clouds between the levels of the pyramid network through a grid attention module, obtaining single-scale point-cloud fusion features through the cooperation of local geometric perception, global relation modeling, and an adaptive fusion mechanism; and (4) performing category prediction on the enhanced feature representation of the point cloud, optimizing the model parameters, and outputting the final semantic segmentation result of the point cloud data. The method improves the model's ability to capture spatially local features and to adapt to non-uniform data, achieving joint gains in both efficiency and accuracy for semantic segmentation of large-scale urban scenes.
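The spatial sub-block partitioning summarized above (a recursive quadtree over the sampled dense point cloud, per claim 1) can be sketched roughly as follows. This is an illustrative Python sketch, not the patent's implementation; the `max_points` and `min_side` thresholds are assumed values standing in for the patent's two quadtree parameters.

```python
import numpy as np

def quadtree_partition(points, max_points=4096, min_side=2.0):
    """Recursively split a point cloud's XY footprint into quadrants.

    A node is subdivided only while it holds more than max_points points
    and its children would still have a side length >= min_side (both
    thresholds are illustrative, not the patent's values).
    Returns a list of index arrays, one per leaf block.
    """
    xy = points[:, :2]
    lo = xy.min(axis=0)
    # Expand the bounding box into a square; the epsilon keeps the
    # maximal point strictly inside the root cell.
    side = float((xy.max(axis=0) - lo).max()) + 1e-9
    leaves = []

    def recurse(idx, origin, side):
        if len(idx) <= max_points or side / 2.0 < min_side:
            if len(idx) > 0:
                leaves.append(idx)
            return
        half = side / 2.0
        sub = xy[idx]
        for dx in (0.0, half):
            for dy in (0.0, half):
                o = origin + np.array([dx, dy])
                # Points exactly on a split line go to the upper quadrant,
                # so every point lands in exactly one child.
                mask = np.all((sub >= o) & (sub < o + half), axis=1)
                recurse(idx[mask], o, half)

    recurse(np.arange(len(points)), lo, side)
    return leaves
```

Assigning Morton codes to the leaves (as claim 3 describes) would then linearize the blocks into GPU-friendly, spatially local batches.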

Inventors

  • XU JINGHAI
  • FANG JUNJIE
  • JING HAORAN

Assignees

  • Nanjing Tech University (南京工业大学)

Dates

Publication Date
2026-05-05
Application Date
2025-12-25

Claims (10)

  1. A grid-attention-based semantic segmentation method for UAV live-action three-dimensional models, characterized by comprising the following steps: Step 1, performing patch-level feature sampling on an input three-dimensional real-scene model dataset, and performing quadtree partitioning on the sampled dense point cloud to obtain a number of spatial sub-blocks containing the semantic features of the three-dimensional real-scene model; Step 2, taking the spatial sub-blocks as input units and extracting features through a multi-scale feature pyramid network with an encoder-decoder architecture, where the encoder aggregates abstract semantics through hierarchical downsampling and the decoder recovers spatial detail through iterative upsampling and fusion with encoder features, finally obtaining an enhanced feature representation carrying both local detail and high-level semantics; Step 3, performing feature enhancement on the input point clouds between the levels of the pyramid network through a grid attention module, strengthening the semantic characterization ability of the features through the cooperation of local geometric perception, global relation modeling, and an adaptive fusion mechanism, thereby obtaining single-scale point-cloud fusion features with stronger discriminability and better adaptability to scene heterogeneity; and Step 4, performing category prediction on the enhanced feature representation of the point cloud through a multi-layer perceptron classifier, optimizing the model parameters based on a loss function, and outputting the final semantic segmentation result of the point cloud data.
  2. The grid-attention-based semantic segmentation method for UAV live-action three-dimensional models according to claim 1, characterized in that Step 1 comprises: Step 1-1, dividing the input three-dimensional real-scene model dataset T into a training set, a test set, and a validation set; obtaining the geometric and texture information of the triangular patches of the three-dimensional mesh dataset together with the ground-truth labels Y; generating valid sampling points in the barycentric coordinate system; and computing patch-level normal vectors and texture features for the sampling points; Step 1-2, taking the whole scene point cloud as the root node and constructing a recursive quadtree governed by two parameters, the maximum number of points in a node and the minimum node side length, thereby completing scene resampling to obtain k dense point clouds; Step 1-3, performing feature aggregation and dimension lifting on the k dense point cloud blocks obtained in Step 1-2: selecting the barycenter as the representative point of each triangular patch, stacking the normal vectors and color features of all sampling points within the patch to generate the aggregated feature of that representative point, and obtaining mesh representative points (MRP) that aggregate all of the patch information; traversing all triangular patches then yields k spatial sub-blocks containing the mesh representative point cloud data P and its feature matrix X, completing the patch-level feature sampling and preprocessing of the three-dimensional real-scene model.
  3. The grid-attention-based semantic segmentation method for UAV live-action three-dimensional models according to claim 2, characterized in that Step 1-1 comprises: Step 1-1-1, dividing the two barycentric coordinate parameters into equal parts to generate a two-dimensional lattice of candidate points, computing their values under the barycentric constraint and screening out the points that satisfy it as valid sampling points, so that each triangular patch carries its own sampling points; simultaneously building an index and setting the barycenter as the representative point; Step 1-1-2, based on the three-dimensional coordinates of the three vertices of each triangular patch in the dataset, obtaining two edge vectors edge1 and edge2 from vertex difference vectors, and obtaining the normal vector of the triangular patch by the cross product of the two edge vectors followed by normalization; Step 1-1-3, for the sampling points in a triangular patch, performing a weighted average of the texture coordinates of the three vertices using the point's barycentric coordinate components, which satisfy the standard barycentric constraints; sampling the texture image at the interpolated texture coordinates by bilinear interpolation or a similar method to obtain the corresponding RGB values, normalized to the range [0, 1]; and in Step 1-2, determining the scene bounding box from the spatial extent of the input three-dimensional real-scene model dataset, expanding it into a square, and assigning a root-node code; if the current node satisfies both the point-count condition and the node-side-length condition, subdividing it into four child nodes and recursively applying the same test to each child; assigning each leaf node a unique Morton code and assigning all sampling points within that node to it; finally, based on the Morton codes, converting the large-scale, unordered three-dimensional dense point cloud data into data blocks that are suitable for GPU-parallel processing, controllable in size, and spatially local.
  4. The grid-attention-based semantic segmentation method for UAV live-action three-dimensional models according to claim 1, characterized in that Step 2 comprises: Step 2-1, inputting the point cloud data of the spatial sub-blocks and the ground-truth label of each point into the multi-scale feature pyramid network for training, and generating high-dimensional abstract point cloud features through the encoder's hierarchical feature aggregation; and Step 2-2, performing top-level global perception and upsampling decoding on the high-dimensional abstract point cloud features output in Step 2-1, recovering the spatial dimension of the input point cloud and obtaining high-dimensional features carrying high-level semantics and multi-scale abstract information.
  5. The grid-attention-based semantic segmentation method for UAV live-action three-dimensional models according to claim 4, characterized in that Step 2-1 comprises: Step 2-1-1, completing the dimension mapping of the point cloud data P and its feature matrix X obtained in Step 1-3 through a multi-layer perceptron; building a neighborhood graph via KNN with each mesh representative point as a graph vertex, and then performing message passing and feature aggregation on the dynamic neighborhood graph through an attention-based graph convolution layer, so that every vertex fuses the geometric and semantic information of its local neighborhood, converting the discrete mesh patch features into neighborhood-aware enhanced point cloud features X0; Step 2-1-2, for the enhanced point cloud features obtained in Step 2-1-1, performing feature extraction cyclically through a multi-layer downsampling module; each layer applies farthest point sampling for proportional downsampling and constructs local neighborhood groups; each group's feature subset undergoes nonlinear transformation and dimension lifting through a multi-layer perceptron, local aggregated features are obtained by max pooling, local and global relations are modeled and strengthened by the grid attention mechanism, and the downsampled feature matrix F is generated; Step 2-1-3, repeating Step 2-1-2 so that each layer yields the point cloud of the corresponding level; the feature dimension rises progressively along the encoder path while the spatial resolution decreases step by step, forming a multi-scale pyramid structure.
  6. The grid-attention-based semantic segmentation method for UAV live-action three-dimensional models according to claim 4, characterized in that Step 2-2 comprises: Step 2-2-1, taking the lowest-resolution point cloud output at the end of the encoder path, applying a nonlinear transformation through a multi-layer perceptron with the feature dimension kept unchanged, modeling the global context dependencies through an attention-based graph convolution layer, and generating a top-level feature representation with global semantic perception; Step 2-2-2, the decoder path adopts an architecture symmetric to the encoder and reconstructs the point cloud features progressively by iterative upsampling: starting from the top-level global features, after dimension reduction by a multi-layer perceptron, the feature-map resolution is raised using KNN interpolation and the spatial resolution increases step by step; features of the corresponding encoder layer at the same resolution are fused via skip connections, and the feature dimensions are lifted in sequence along the decoder path; after each upsampling stage, a dynamic graph is built with KNN and grid attention is applied again to refine and regularize the fused features, gradually recovering spatial detail while retaining high-level semantic context, and finally outputting an enhanced feature representation at the same resolution as the input.
  7. The grid-attention-based semantic segmentation method for UAV live-action three-dimensional models according to claim 1, characterized in that Step 3 comprises: Step 3-1, based on the input hierarchical point cloud data and its corresponding neighborhood graph, performing feature learning on each point and its local neighborhood through a local geometric perception branch, capturing fine geometric structure and intra-class consistency; and Step 3-2, for each level's point cloud obtained by the sampling in Step 2, forming two independent processing streams: one passes through the local geometric perception branch, the other through the global relation perception branch to acquire context information; the outputs of the two streams are fed into an adaptive fusion gating module, which dynamically fuses local and global features by computing per-channel attention weights to achieve an optimized balance.
  8. The grid-attention-based semantic segmentation method for UAV live-action three-dimensional models according to claim 7, characterized in that Step 3-1 performs deep feature learning on each level's point cloud obtained in Step 2 through the input graph convolution layer, namely: for the point cloud feature matrix input to each layer, a dynamic topological graph is constructed through the K-nearest-neighbor algorithm; the feature-vector difference between each central node and its neighborhood nodes is computed; the central node feature and this edge feature are concatenated along the feature dimension and passed through a multi-layer perceptron to realize nonlinear transformation and high-level semantic feature extraction; a max-pooling operation then aggregates all neighborhood-transformed features of each central node, extracting the local edge feature matrix. In the corresponding formula, the difference term is the feature difference between central node i and neighbor node j; the concatenation of the central feature and the edge feature forms the combined feature; the transformation is a multi-layer perceptron; and max pooling is applied over all neighborhood features of node i, finally yielding the node's local edge feature matrix.
  9. The grid-attention-based semantic segmentation method for UAV live-action three-dimensional models according to claim 7, characterized in that Step 3-2 comprises: Step 3-2-1, for each level's point cloud obtained in Step 2, learning the overall features through an attention-based graph convolution layer and feeding them into a linear layer; the node features undergo three linear transformations: one weight matrix generates the main feature-transformation path, while two further weight matrices project the source-node and target-node features respectively, generating the source-node and target-node features used for attention computation and thereby building a complete attention computation framework; Step 3-2-2, for each edge in the edge set connecting two nodes, first computing the relative position vector from the target node to the source node and mapping it to a high-dimensional positional feature through a position-encoding neural network; combining the projected feature difference of the source and target nodes with the positional feature to generate a raw attention score; passing the score through an attention neural network and a nonlinear activation function, and finally applying a normalizing activation function to obtain the per-channel attention weights within the neighborhood, completing the mapping from geometric relations to feature weights; both the position-encoding network and the attention network are multi-layer perceptrons (MLPs); Step 3-2-3, based on the computed per-channel attention weights, taking the Hadamard product of each target node's transformed features and its positional encoding, weighting the features channel by channel with the attention weights, summing all incoming messages of each target node per channel, and generating the output features of each source node, where the summation runs over the node's set of neighboring nodes; the final output is a globally enhanced feature matrix with spatial perception ability, in which every node feature fuses the geometric context and semantic feature information of its global neighborhood; Step 3-2-4, expanding and concatenating the local edge feature matrix from Step 3-1 and the globally enhanced feature matrix from Step 3-2-3 along a new dimension to form a combined feature tensor, while also concatenating the original local and global features along the channel dimension; the result is fed into a fusion attention network composed of fully connected layers and a normalizing activation function to generate attention weights characterizing the importance of the two feature sources; the attention weights are reshaped, broadcast-multiplied with the combined feature tensor, and finally summed along the third dimension to generate the single-scale point cloud fusion features, which adaptively integrate local detail information and global context information; in the formulas, FC denotes a fully connected layer, the Hadamard product is used, and a d-dimensional all-ones column vector appears.
  10. The grid-attention-based semantic segmentation method for UAV live-action three-dimensional models according to claim 6, characterized in that in Step 4, the enhanced feature representation output by Step 2-2-2 is passed through a fully connected layer and a multi-class classifier to generate a semantic segmentation probability map, and the optimization objective combines a weighted cross-entropy loss and a Dice loss; in the cross-entropy loss, C is the total number of categories, the target is the one-hot encoding of the ground-truth label, the prediction is the probability that a point belongs to a given category, and each category carries a weight; the Dice loss improves the geometric consistency of the segmentation by measuring the overlap between the predicted and ground-truth segmentation regions, and the two losses are combined by weighted summation into the final optimization objective; through gradient backpropagation and parameter updating, the network gradually learns the complex mapping from the input point cloud to the semantic labels, achieving accurate and robust 3D mesh semantic segmentation.
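The combined objective of claim 10 (weighted cross-entropy plus Dice loss, merged by weighted summation) can be illustrated with a minimal NumPy sketch. Since the patent's formulas are not reproduced in this text, the `alpha` mixing weight and the `eps` smoothing term below are assumptions, and the function is a structural sketch rather than the patent's implementation.

```python
import numpy as np

def segmentation_loss(probs, labels, class_weights, alpha=0.5, eps=1e-7):
    """Weighted cross-entropy + soft Dice loss as one combined objective.

    probs:         (N, C) softmax probabilities per point
    labels:        (N,)   integer class ids
    class_weights: (C,)   per-class weights for the cross-entropy term
    alpha:         mixing weight between the two terms (assumed value)
    """
    n, c = probs.shape
    one_hot = np.eye(c)[labels]                       # (N, C) one-hot targets

    # Weighted cross-entropy: mean of -w[y_i] * log p(y_i | point i)
    p_true = np.clip(probs[np.arange(n), labels], eps, 1.0)
    ce = -np.mean(class_weights[labels] * np.log(p_true))

    # Soft Dice: 1 - 2|P ∩ Y| / (|P| + |Y|), averaged over classes,
    # measuring the overlap of predicted and ground-truth regions.
    inter = (probs * one_hot).sum(axis=0)
    denom = probs.sum(axis=0) + one_hot.sum(axis=0)
    dice = 1.0 - np.mean((2.0 * inter + eps) / (denom + eps))

    return alpha * ce + (1.0 - alpha) * dice
```

In practice the same objective would be computed on GPU tensors inside the training loop; the NumPy form only mirrors the structure of the two terms and their weighted sum.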

Description

Unmanned aerial vehicle live-action three-dimensional model semantic segmentation method based on grid attention Technical Field The invention belongs to the fields of computer image processing and computer vision, and in particular relates to a grid-attention-based semantic segmentation method for UAV live-action three-dimensional models. Background In fields such as digital cities, intelligent transportation, cultural heritage protection, unmanned operations, and geological surveying, high-precision, high-fidelity three-dimensional real-scene models have become the core digital foundation for scene digitization and intelligence. Built with technologies such as oblique photography, LiDAR, and multi-view stereo vision, such models can finely reproduce the geometric structure and texture of the real world, constructing a digital twin of the physical world and supporting a variety of in-depth applications. However, existing three-dimensional real-scene models generally consist of massive point clouds or triangular patches; they support only visual presentation, lack semantic understanding, and are difficult to subject to automatic identification and structural analysis, which limits deeper applications in scenarios such as operations-and-maintenance management and environmental perception. For this reason, semantic segmentation techniques have been introduced into three-dimensional processing, aiming to assign semantic labels to basic units such as points, voxels, or patches, achieving the leap from "visualizable" to "understandable". At present, point-cloud-based semantic segmentation is mature and applied in fields such as autonomous driving, while semantic segmentation research oriented to three-dimensional real-scene models still lags behind.
In recent years, a number of methods have appeared for semantic segmentation of three-dimensional real-scene models. They can be roughly divided into four categories: methods based on traditional machine learning; deep-learning methods on point cloud representations, whose data organization is simple, which capture local features effectively, and which can exploit texture information to improve accuracy; deep-learning methods operating directly on three-dimensional mesh elements, which incur no information loss from intermediate conversion and fully preserve the topological structure and geometric detail of the 3D model; and deep-learning methods on multi-view representations, which can directly reuse 2D deep-learning techniques. The three-dimensional mesh model has notable advantages: it explicitly represents surface topology and continuity, helping to obtain segmentation results with clear boundaries; it supports high-fidelity rendering; combined with color and texture, it can distinguish objects of similar appearance but different semantics; and it is efficient in storage and computation, making it better suited to large-scale scenes. Graphs, as a structure naturally adapted to irregular data, stand out in point cloud processing: the adjacency of a point set is defined through geometric similarity measured by kernel correlation, and convolution is performed on each node and its neighborhood nodes. In recent years, attention mechanisms have been widely applied in fields such as machine translation, object detection, and semantic segmentation, and graph convolutional neural networks have been introduced into the field of three-dimensional model segmentation.
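The graph construction just described — defining adjacency over the point set and convolving each node with its neighborhood — is commonly realized with a KNN graph and per-edge features of the form [x_i, x_j − x_i], in the spirit of DGCNN-style edge convolution. A minimal NumPy sketch of that graph-building step, not the patent's implementation:

```python
import numpy as np

def knn_edge_features(points, feats, k=3):
    """Build a KNN graph and assemble per-edge features [x_i, x_j - x_i].

    points: (N, 3) coordinates, used only to define the neighborhood graph
    feats:  (N, D) per-point features
    Returns (N, k, 2D): for each center point i, the concatenation of its
    own feature with the difference to each of its k nearest neighbors --
    the input an edge-convolution layer would transform and max-pool.
    """
    # Pairwise squared distances; brute force is fine for a sketch
    # (a KD-tree would be used at scale).
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                 # exclude self-loops
    nbr = np.argsort(d2, axis=1)[:, :k]          # (N, k) neighbor indices
    center = np.repeat(feats[:, None, :], k, axis=1)   # x_i broadcast over k
    diff = feats[nbr] - center                   # x_j - x_i per edge
    return np.concatenate([center, diff], axis=-1)
```

A learned MLP followed by max pooling over the k axis would turn this (N, k, 2D) tensor into per-point local edge features, matching the aggregation described in claim 8.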
However, no mature and efficient semantic segmentation solution has yet been formed for three-dimensional real-scene data acquired by unmanned aerial vehicles. Disclosure of Invention The invention aims to resolve the core pain points of existing semantic segmentation techniques for large-scale urban-scene three-dimensional real-scene models, and provides a grid-attention-based semantic segmentation method for UAV live-action three-dimensional models. Addressing the lack of an efficient, detail-preserving preprocessing scheme for large-scale three-dimensional meshes and the enormous computation and memory overhead of feeding raw point clouds directly into a model, it designs an innovative preprocessing method, optimizes the model's adaptability to spatially local features and non-uniform data, and achieves joint gains in both efficiency and accuracy for semantic segmentation of large-scale urban scenes. To achieve this purpose, the technical scheme adopted by the invention, a grid-attention-based semantic segmentation method for UAV live-action three-dimensional models, comprises the following steps: