CN-121482384-B - Structured image analysis method and system for fine-grained structure boundary and small-object segmentation


Abstract

The invention belongs to the technical field of image semantic segmentation, and in particular relates to a structured image analysis method and system for fine-grained structure boundary and small-object segmentation. The method comprises: constructing a semantic segmentation network, HDA-UNet, based on an encoder-decoder architecture, embedding a hierarchical deformable attention (HDA) module in the skip connections and a deformable attention (DA) module in the bottleneck layer, so as to enhance the modeling of elongated objects and blurred boundary features and to achieve adaptive alignment and fusion of cross-scale features; training the network with a pre-constructed multi-class edge-guided loss function, in which the cross-entropy loss is weighted by a weight map generated from the distance to the edges of the ground-truth labels, so as to drive the network to focus on learning boundary regions; and inputting an image to be analyzed into the trained network and outputting a pixel-level semantic segmentation result. The invention effectively improves segmentation accuracy for load-bearing wall boundaries, door and window contours, and small objects such as sliding doors and railings in building floor plans.

Inventors

SU LIANGLIANG
Sheng Jiamu
YANG YALONG
JIANG HAORAN

Assignees

Anhui Jianzhu University (安徽建筑大学)

Dates

Publication Date: 2026-05-12
Application Date: 2025-10-24

Claims (8)

1. A structured image analysis method for fine-grained structure boundary and small-object segmentation, characterized by comprising the following steps: constructing a semantic segmentation network based on an encoder-decoder architecture, embedding a hierarchical deformable attention (HDA) module in the skip connections, and embedding a deformable attention (DA) module in the bottleneck layer; training the semantic segmentation network based on a pre-constructed loss function, wherein the loss function is constructed from a cross-entropy loss and a weight map dynamically generated from structural boundary information in the ground-truth labels, so as to drive the network to focus on learning fine-grained structure boundaries; and inputting the structured image to be analyzed into the trained semantic segmentation network and outputting a pixel-level semantic segmentation result; wherein the encoder extracts multi-scale features of the input image; the feature map output at the end of the encoder is processed by the DA module, which outputs enhanced bottleneck features; and the decoder upsamples the enhanced bottleneck features stage by stage, combining the upsampled feature map at each stage with the corresponding encoder feature map after alignment and fusion by the HDA module, so as to improve segmentation accuracy for fine-grained boundaries and small objects; the semantic segmentation network is an HDA-UNet network, wherein: the encoder consists of 4 sequentially connected encoding layers, each comprising two convolution operations with 3×3 kernels, batch normalization and ReLU activation, and a max-pooling operation with a 2×2 kernel; the bottleneck layer connects the encoder and the decoder, and the DA module embedded in it dynamically models and enhances elongated objects and blurred boundary features in the image; each decoding layer concatenates the upsampled feature map from the preceding decoding layer with the corresponding encoder features processed by the HDA module, and outputs the upsampled feature map after a convolution operation; the skip connections pass the output features of each encoding layer, after processing by the HDA module, to the corresponding decoding layer for fusion; and finally the decoder outputs a semantic segmentation result matching the resolution of the input image.
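The spatial bookkeeping implied by claim 1 can be traced with a minimal sketch. It assumes that the 3×3 convolutions use padding 1 (size-preserving), that the 2×2 max pooling uses stride 2, that skip features are taken before pooling, and that each decoding layer upsamples by a factor of 2; none of these details is stated in the claim, and the function name `trace_shapes` is hypothetical.

```python
def trace_shapes(height, width, num_stages=4):
    """Trace feature-map sizes through encoder, bottleneck, and decoder."""
    factor = 2 ** num_stages
    if height % factor or width % factor:
        # Assumption: sizes must halve cleanly so skip features line up.
        raise ValueError("height and width must be divisible by %d" % factor)

    encoder_sizes = []
    h, w = height, width
    for _ in range(num_stages):
        # Two 3x3 convolutions (assumed padding 1) keep the spatial size;
        # the skip feature is assumed to be taken before pooling.
        encoder_sizes.append((h, w))
        # 2x2 max pooling (assumed stride 2) halves each dimension.
        h, w = h // 2, w // 2

    bottleneck = (h, w)  # the DA module operates at this resolution

    decoder_sizes = []
    for _ in reversed(encoder_sizes):
        # Each decoding layer upsamples by 2 (assumed) before fusing with
        # the HDA-processed encoder feature of the same size.
        h, w = h * 2, w * 2
        decoder_sizes.append((h, w))
    return encoder_sizes, bottleneck, decoder_sizes


encoder, bottleneck, decoder = trace_shapes(512, 512)
print(encoder, bottleneck, decoder)
```

Under these assumptions a 512×512 input reaches the bottleneck at 32×32, and the last decoding layer returns to 512×512, matching the input resolution as the claim requires.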
2. The structured image analysis method for fine-grained structure boundary and small-object segmentation according to claim 1, wherein the step of the DA module processing the input feature map X comprises: inputting the feature map X into two parallel branches, the first branch flattening it into a query vector Q and the second branch generating a value vector V by linear transformation, where C is the number of channels, and dividing V evenly into M groups along the channel dimension; for each query position q, predicting from Q, by two linear transforms, the corresponding M×K sampling offsets and attention weights, where K is the number of sampling points per attention head, each offset being the offset vector of the k-th sampling point for the m-th head at the q-th query position, and the attention weights being normalized; calculating the final sampling position from the reference point coordinates constructed for each query position q and the normalized offset; at each sampling position, sampling features from V and computing their weighted sum according to the attention weights to obtain the output of each attention head, namely the attention output of head m corresponding to reference point q; concatenating the outputs of all attention heads for the q-th query position, from the output of the 1st attention head through that of the M-th attention head, applying a linear projection, and then applying residual addition with the original query and layer normalization to obtain an intermediate feature Z1_q; and finally reshaping the output into a two-dimensional feature map to obtain the enhanced features.
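The formulas that claim 2 refers to are not reproduced in this text. One reconstruction consistent with the verbal definitions in the claim, assuming the standard deformable-attention formulation, is sketched below. The symbols Δp, A, p, O, W_v, W_o and N are assumptions chosen for illustration and may differ from the notation of the original filing.

```latex
% Query and value vectors from the H x W x C feature map X (N = H W positions)
Q = \mathrm{Flatten}(X) \in \mathbb{R}^{N \times C}, \qquad
V = \mathrm{Flatten}(X)\, W_v \in \mathbb{R}^{N \times C}

% Offsets and attention weights predicted from Q; weights normalized per head
\Delta p_{mqk} \in \mathbb{R}^{2}, \qquad
\sum_{k=1}^{K} A_{mqk} = 1

% Final sampling position: reference point plus normalized offset
p_{mqk} = p_q + \Delta p_{mqk}

% Output of attention head m for query position q (V_m is the m-th channel group)
O_m(q) = \sum_{k=1}^{K} A_{mqk}\, V_m\!\left(p_{mqk}\right)

% Concatenation, linear projection, residual addition, layer normalization
Z1_q = \mathrm{LayerNorm}\!\left( Q_q + W_o \left[ O_1(q);\, \dots;\, O_M(q) \right] \right)
```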
3. The structured image analysis method for fine-grained structure boundary and small-object segmentation according to claim 2, wherein constructing the reference point coordinates based on each query position q comprises: for each query position q, constructing normalized coordinates within the unit grid, where H is the feature map height.
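The coordinate formula of claim 3 is not reproduced in this text. A minimal sketch of one common way to build normalized reference points follows. It assumes the pixel-centre convention, in which position (row, col) maps to ((col + 0.5) / W, (row + 0.5) / H); the 0.5 offset, the (x, y) ordering, the width symbol W, and the function name `reference_points` are assumptions rather than details taken from the claim.

```python
def reference_points(height, width):
    """Return one normalized (x, y) reference point per query position.

    Positions are enumerated in row-major order, matching a flattened
    H x W feature map.
    """
    points = []
    for row in range(height):
        for col in range(width):
            # Assumption: pixel-centre convention, so every coordinate
            # lies strictly inside the unit square (0, 1) x (0, 1).
            points.append(((col + 0.5) / width, (row + 0.5) / height))
    return points


print(reference_points(2, 2))
# [(0.25, 0.25), (0.75, 0.25), (0.25, 0.75), (0.75, 0.75)]
```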
4. The structured image analysis method for fine-grained structure boundary and small-object segmentation according to claim 3, wherein the step of the HDA module processing the upsampled feature map comprises: flattening the upsampled feature map of the current decoder stage into a query vector; for the feature maps of L different scales from the encoder, first applying a 1×1 convolution at each scale to align channels, then flattening each and concatenating them into a value vector V; using a deformable attention mechanism to predict, based on the query position, the sampling offsets and attention weights for the features of different scales in the value vector V; computing the cross-scale sampling positions from the reference point coordinates and the offsets; at each sampling position, sampling features from V by interpolation and computing their weighted sum according to the attention weights; after aggregating the outputs of all attention heads and scales, obtaining the intermediate features through a linear transformation; and reshaping the intermediate features into two-dimensional spatial form and concatenating them along the channel dimension with the upsampled feature map Y of the current decoder stage to form the fused features.
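The sampling and aggregation step of claim 4 can be illustrated for a single query position, a single attention head, and single-channel feature maps. This is a sketch under stated assumptions, not the claimed implementation: the interpolation is taken to be bilinear, coordinates are normalized to [0, 1] with the pixel-centre convention, out-of-range positions are clamped, and the attention weights are normalized jointly over all scales and sampling points. The function names `bilinear_sample` and `aggregate` are hypothetical.

```python
def bilinear_sample(feature, x, y):
    """Sample a 2D list `feature` at normalized (x, y) by bilinear interpolation."""
    height, width = len(feature), len(feature[0])
    # Map normalized coordinates to index space (pixel centres at i + 0.5)
    # and clamp to the valid range (assumed border handling).
    fx = min(max(x * width - 0.5, 0.0), width - 1.0)
    fy = min(max(y * height - 0.5, 0.0), height - 1.0)
    x0, y0 = int(fx), int(fy)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    ax, ay = fx - x0, fy - y0
    top = feature[y0][x0] * (1 - ax) + feature[y0][x1] * ax
    bottom = feature[y1][x0] * (1 - ax) + feature[y1][x1] * ax
    return top * (1 - ay) + bottom * ay


def aggregate(features, ref, offsets, weights):
    """Weighted sum over L scales and K sampling points for one query.

    features: list of L two-dimensional maps (possibly of different sizes)
    ref: normalized (x, y) reference point of the query
    offsets[l][k]: normalized (dx, dy) offset for scale l, sampling point k
    weights[l][k]: attention weight, assumed to sum to 1 over all l and k
    """
    out = 0.0
    for level, feature in enumerate(features):
        for (dx, dy), weight in zip(offsets[level], weights[level]):
            out += weight * bilinear_sample(feature, ref[0] + dx, ref[1] + dy)
    return out


fine = [[0.0, 1.0], [2.0, 3.0]]   # 2 x 2 map
coarse = [[10.0]]                 # 1 x 1 map
value = aggregate([fine, coarse], (0.5, 0.5),
                  offsets=[[(0.0, 0.0)], [(0.0, 0.0)]],
                  weights=[[0.5], [0.5]])
print(value)  # 5.75
```

Because every scale is addressed in the same normalized coordinate system, one reference point and one set of offsets can index feature maps of different resolutions, which is what makes the cross-scale alignment possible.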
5. The structured image analysis method for fine-grained structure boundary and small-object segmentation according to claim 4, wherein constructing the pre-constructed loss function comprises the steps of: for each category in the input segmentation label map, generating its binary map, convolving it with a Laplacian kernel k, and extracting the edge map of the category through a ReLU activation function; merging the edge maps of all categories to obtain a multi-class edge mask; calculating the distance from each non-edge pixel to the nearest edge pixel to generate a normalized distance map, and assigning an edge-aware weight to each pixel (i, j) using an edge-decay function, where α is the edge weighting coefficient, the distance term is the distance from pixel (i, j) to the nearest edge pixel, and the weight represents the loss amplification factor for pixel (i, j); and weighting the pixel-wise cross-entropy loss CE(·) by the edge-aware weights to obtain the final multi-class edge-guided loss, computed from the ground-truth label at position (i, j) and the network prediction at position (i, j).
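The loss construction of claim 5 can be sketched in plain Python on a small label map. The formulas of the claim are not reproduced in this text, so several details below are assumptions: the 4-neighbour Laplacian kernel, zero padding at the image border, normalization of distances by the maximum distance, the exponential decay w = 1 + α·exp(−d/σ) with a hypothetical scale parameter `sigma`, and averaging the loss over all pixels. The function names are likewise hypothetical.

```python
import math

# Assumption: 4-neighbour Laplacian kernel; the claim only names "kernel k".
LAPLACE = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]


def edge_mask(labels, num_classes):
    """Union over classes of ReLU(Laplacian applied to the binary class map) > 0."""
    h, w = len(labels), len(labels[0])
    mask = [[False] * w for _ in range(h)]
    for c in range(num_classes):
        for i in range(h):
            for j in range(w):
                acc = 0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < h and 0 <= jj < w:  # zero padding (assumed)
                            acc += LAPLACE[di + 1][dj + 1] * (labels[ii][jj] == c)
                if max(acc, 0) > 0:  # ReLU keeps only positive responses
                    mask[i][j] = True
    return mask


def edge_weights(labels, num_classes, alpha=4.0, sigma=0.2):
    """Per-pixel weights that decay with normalized distance to the nearest edge."""
    h, w = len(labels), len(labels[0])
    mask = edge_mask(labels, num_classes)
    edges = [(i, j) for i in range(h) for j in range(w) if mask[i][j]]
    if not edges:
        return [[1.0] * w for _ in range(h)]  # no boundary: plain cross-entropy
    # Brute-force Euclidean distance to the nearest edge pixel (fine for a demo).
    dist = [[min(math.hypot(i - ei, j - ej) for ei, ej in edges)
             for j in range(w)] for i in range(h)]
    dmax = max(max(row) for row in dist) or 1.0
    # Assumed decay: w = 1 + alpha * exp(-d_normalized / sigma).
    return [[1.0 + alpha * math.exp(-(d / dmax) / sigma) for d in row]
            for row in dist]


def edge_guided_loss(probs, labels, weights):
    """Mean over pixels of weight * cross-entropy; probs[i][j][c] are class probabilities."""
    h, w = len(labels), len(labels[0])
    total = 0.0
    for i in range(h):
        for j in range(w):
            total += weights[i][j] * -math.log(probs[i][j][labels[i][j]])
    return total / (h * w)


labels = [[0, 0, 1, 1]]
weights = edge_weights(labels, num_classes=2)
probs = [[[0.5, 0.5] for _ in row] for row in labels]
print(weights, edge_guided_loss(probs, labels, weights))
```

With this kernel, the positive Laplacian response of each class falls on the pixels just outside that class, so merging the per-class edge maps marks both sides of every class boundary. Those pixels receive the maximum weight 1 + α, and the weight decays toward 1 away from the boundary.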
6. The structured image analysis method for fine-grained structure boundary and small-object segmentation according to claim 1, wherein the structured image is a building floor plan, the small objects comprise at least one of a sliding door and a railing, and the fine-grained structure boundaries comprise at least one of a load-bearing wall boundary, a non-load-bearing wall boundary, and a door or window contour.
7. A structured image analysis system for fine-grained structure boundary and small-object segmentation, for implementing the steps of the structured image analysis method according to any one of claims 1-6, characterized in that the system comprises: a network construction module for constructing a semantic segmentation network based on an encoder-decoder architecture, embedding a hierarchical deformable attention (HDA) module in the skip connections, and embedding a deformable attention (DA) module in the bottleneck layer; a network training module for training the semantic segmentation network based on a pre-constructed loss function, wherein the loss function is constructed from a cross-entropy loss and a weight map dynamically generated from structural boundary information in the ground-truth labels, so as to drive the network to focus on learning fine-grained structure boundaries; and an image analysis module for inputting the structured image to be analyzed into the trained semantic segmentation network and outputting a pixel-level semantic segmentation result; wherein the encoder extracts multi-scale features of the input image; the feature map output at the end of the encoder is processed by the DA module, which outputs enhanced bottleneck features; and the decoder upsamples the enhanced bottleneck features stage by stage, combining the upsampled feature map at each stage with the corresponding encoder feature map after alignment and fusion by the HDA module, so as to improve segmentation accuracy for fine-grained boundaries and small objects.
8. The structured image analysis system for fine-grained structure boundary and small-object segmentation according to claim 7, further comprising a loss function construction module for constructing the loss function, the module comprising: an edge extraction unit for generating a binary map of each category from the input segmentation label map and extracting the edge map of each category through Laplacian convolution and ReLU activation; an edge fusion unit for merging the edge maps of all categories to obtain a multi-class edge mask; a weight generation unit for calculating the distance from each non-edge pixel to the nearest edge pixel, generating a normalized distance map, and assigning an edge-aware weight to each pixel using an edge-decay function; and a loss calculation unit for weighting the pixel-wise cross-entropy loss by the edge-aware weights to obtain the final multi-class edge-guided loss.

Description

Structured image analysis method and system for fine-grained structure boundary and small-object segmentation

Technical Field

The invention belongs to the technical field of image semantic segmentation, and in particular relates to a structured image analysis method and system for fine-grained structure boundary and small-object segmentation.

Background

Early floor plan recognition methods relied primarily on geometric heuristics and low-level image processing techniques such as line detection, edge extraction, and morphological operations. These methods divide room areas and locate building components based on the structural line features and spatial topological relations of the image, and achieve reasonable results on drawings with standardized layers and regular structures. Subsequently, schemes combining connected-region analysis, primitive line-width recognition, and optical character recognition appeared, which further enabled the detection of door and window symbols and the semantic annotation of room functions. In recent years, with the rapid development of deep learning, semantic segmentation models based on an encoder-decoder architecture have been widely used for floor plan analysis. On this basis, a series of improvements have been proposed, including deeper network structures, skeletonization-based post-processing, direction-aware convolution to enhance geometric modeling, and strategies such as boundary-guided attention mechanisms, multi-task learning frameworks, and generative adversarial networks, leading to remarkable progress in the recognition and reconstruction of main building elements such as walls and rooms.
However, most prior-art approaches focus on main structures such as walls and rooms, and model fine-grained structure boundaries (e.g., the boundary between load-bearing and non-load-bearing walls, or fine door and window contours) and small objects (e.g., sliding doors and railings) inadequately. Because such categories have few samples in the training data and their boundaries are blurred, the model struggles to converge on them during training and the backbone network tends to ignore them, which causes missed detections and mis-segmentation. In addition, although some methods introduce attention or boundary-aware mechanisms, their feature alignment and fusion capabilities are limited, and they have difficulty adaptively focusing on key details against a complex structural background; as a result, their feature representation of elongated objects and blurred boundaries is weak, and the segmentation result is often inaccurate at boundaries. Consequently, conventional methods generalize poorly to real-world residential floor plans with varied styles and complex structures, and cannot meet the practical requirements of high-precision, fine-grained structural analysis.

Disclosure of Invention

The invention aims to provide a structured image analysis method and system for fine-grained structure boundary and small-object segmentation, so as to solve the technical problem that existing image semantic segmentation methods achieve insufficient segmentation accuracy for fine-grained structure boundaries (such as load-bearing wall boundaries and door and window contours) and small objects (such as sliding doors and railings).
The invention achieves the above purpose through the following technical scheme: In a first aspect, the present invention provides a structured image analysis method for fine-grained structure boundary and small-object segmentation, the method comprising: constructing a semantic segmentation network based on an encoder-decoder architecture, embedding a hierarchical deformable attention (HDA) module in the skip connections, and embedding a deformable attention (DA) module in the bottleneck layer; training the semantic segmentation network based on a pre-constructed loss function, wherein the loss function is constructed from a cross-entropy loss and a weight map dynamically generated from structural boundary information in the ground-truth labels, so as to drive the network to focus on learning fine-grained structure boundaries; and inputting the structured image to be analyzed into the trained semantic segmentation network and outputting a pixel-level semantic segmentation result; wherein the encoder extracts multi-scale features of the input image, the feature map output at the end of the encoder is processed by the DA module, which outputs enhanced bottleneck features, and the decoder upsamples the enhanced bottleneck features stage by stage, combining the upsampled feature map at each stage with the corresponding encoder feature map after alignment and fusion by the HDA module, so that the segmentation precision of a fine g