CN-120766131-B - Image analysis method and system for identifying post-earthquake collapse building

CN 120766131 B

Abstract

The invention discloses an image analysis method and system for identifying post-earthquake collapsed buildings, belonging to the technical field of image analysis. First, an identification network is constructed from a UNet++ network, a global context enhancement module GCEM, and a dynamic feature interaction module DFIM. Next, the post-earthquake optical image is input into the global context enhancement module GCEM and the UNet++ network simultaneously: the GCEM extracts a feature map through a convolution layer, generates a semantic token set through a multi-scale semantic tokenizer MST, performs global modeling through a Transformer encoder and decoder, and outputs a global feature map, while the UNet++ network outputs a local feature map through encoder downsampling and progressive fusion in the decoder. Finally, the global feature map and the local feature map are input into the dynamic feature interaction module DFIM, which realizes multi-scale feature fusion through adaptive weight adjustment and generates the collapsed-building identification result.
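As a shape-level illustration only (not part of the patent), the pipeline described in the abstract can be sketched with stub modules; the function names `gcem`, `unetpp` and `dfim`, the channel counts and the fixed weight are all hypothetical placeholders:

```python
import numpy as np

H, W, C = 64, 64, 3  # hypothetical post-earthquake optical image size

def gcem(x):
    """Stub for the global context enhancement module GCEM:
    conv features -> MST token set -> Transformer encoder/decoder -> global feature map."""
    return np.zeros((H, W, 32))  # global feature map (channel count assumed)

def unetpp(x):
    """Stub for the UNet++ branch: encoder downsampling + progressive decoder fusion."""
    return np.zeros((H, W, 32))  # local feature map

def dfim(g, l):
    """Stub for the dynamic feature interaction module DFIM:
    adaptive weighting of global and local features, then element-wise sum."""
    w = 0.5  # placeholder for the learned Softmax attention weights
    return w * g + (1 - w) * l

image = np.random.rand(H, W, C)           # post-earthquake optical image
fused = dfim(gcem(image), unetpp(image))  # fused identification features
print(fused.shape)  # (64, 64, 32)
```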

Inventors

  • Wang Chao
  • Wang Shuizhang
  • Xue Xiaogang
  • Ren Mengwen

Assignees

  • Nanjing Changwang Kezhen Intelligent Technology Co., Ltd. (南京长望可祯智能科技有限公司)

Dates

Publication Date
2026-05-12
Application Date
2025-06-24

Claims (7)

  1. An image analysis method for identifying post-earthquake collapsed buildings, characterized by comprising the following steps:
     S1, constructing an identification network, wherein the identification network comprises a UNet++ network, a global context enhancement module GCEM and a dynamic feature interaction module DFIM; the global context enhancement module GCEM comprises a multi-scale semantic tokenizer MST, a Transformer encoder and a Transformer decoder; the multi-scale semantic tokenizer MST performs multi-scale attention computation on the feature map extracted by the convolution layer to generate a multi-scale semantic token set; the upper branch reshapes the input feature map into vector form: after the convolution layer extracts features, the input tensor X is reshaped from (H, W, C) to (H·W, C), wherein H, W and C denote the image height, width and number of channels, respectively; the lower branch generates a multi-scale spatial attention map in the initial stage through a multi-scale attention module; first, multi-scale features are extracted by convolution kernels of sizes 1×1, 3×3 and 5×5, and the feature expression is enhanced with average pooling and maximum pooling:
     F_i = Conv_{k_i}(X) + AP(Conv_{k_i}(X)) + MP(Conv_{k_i}(X)), k_i ∈ {1, 3, 5},
     wherein F_i denotes the i-th enhanced feature, AP and MP denote average pooling and maximum pooling respectively, and Conv_{k_i} denotes a convolution with kernel size k_i × k_i; an attention weight matrix W_i is generated by two fully connected layers with ReLU and sigmoid activation functions, W_i = sigmoid(FC_2(ReLU(FC_1(F_i)))), and the multi-scale features are multiplied by the corresponding weight maps to obtain the multi-scale attention features A_i = F_i ⊙ W_i; second, the attention features of different scales are integrated, the weighted average of each pixel in X is computed under the multi-scale attention maps, and a multi-scale semantic map is obtained by a point-by-point convolution operation; finally, the result is passed through a Softmax operation and multiplied with the token set acquired in the upper branch to obtain the multi-scale semantic token set T, with the specific formula:
     T = Softmax(φ(SA(X)))ᵀ · X′,
     wherein X′ denotes the reshaped feature vectors, SA denotes the spatial attention module, φ denotes a point-by-point convolution layer with learnable parameters, and Softmax denotes the Softmax operation;
     S2, inputting the post-earthquake optical image into the global context enhancement module GCEM and the UNet++ network simultaneously; the global context enhancement module GCEM extracts the feature map through the convolution layer, generates a semantic token set through the multi-scale semantic tokenizer MST, performs global modeling through the Transformer encoder and decoder, and outputs a global feature map; the UNet++ network outputs a local feature map through encoder downsampling and progressive fusion in the decoder;
     S3, inputting the global feature map and the local feature map into the dynamic feature interaction module DFIM, realizing multi-scale feature fusion through adaptive weight adjustment, and generating a collapsed-building identification result; the dynamic feature interaction module DFIM performs the following operations: first, the global feature map and the local feature map are each transformed through a 3×3 convolution layer with a residual connection and a ReLU activation function:
     F_g′ = ReLU(Conv_{3×3}(F_g) + F_g), F_l′ = ReLU(Conv_{3×3}(F_l) + F_l),
     wherein F_g denotes the global feature, F_l denotes the local feature, and Conv_{3×3} denotes a convolution operation with kernel size 3 × 3; second, the transformed global and local features are spliced, attention scores are generated through a 1×1 convolution, and the attention weights of the global and local features in different dimensions are computed through a Softmax function; then, the global feature and the local feature are multiplied element by element with the corresponding attention weights to obtain the weighted global feature and the weighted local feature; finally, the weighted global and local features are added element by element to obtain the output result.
  2. The image analysis method for identifying post-earthquake collapsed buildings according to claim 1, wherein the Transformer encoder is used for adding position encodings to the semantic token set and extracting context features through a multi-head self-attention mechanism, and the Transformer decoder is used for mapping the semantic features back to pixel space through a multi-head cross-attention layer and outputting the global feature map.
  3. The image analysis method for identifying post-earthquake collapsed buildings according to claim 2, wherein the Transformer encoder operation comprises: first merging the position encoding P with the semantic token set T and inputting the merged set into the Transformer encoder; after the semantic token set T is input into the Transformer encoder, the new semantic token set is obtained by the following formulas:
     q = T W_q, k = T W_k, v = T W_v,
     where q, k and v are the query, key and value vectors respectively, and W_q, W_k, W_v denote learnable linear matrices; a single self-attention layer Att is expressed as:
     Att(q, k, v) = Softmax(q kᵀ / √d) v,
     wherein d denotes the channel dimension of q, k and v; the multi-head self-attention layer MSA is expressed as:
     head_j = Att(T W_q^j, T W_k^j, T W_v^j),
     MSA(T) = Concat(head_1, …, head_h) W_O,
     wherein head_j is the j-th attention head, W_q^j, W_k^j, W_v^j and W_O denote linear projection matrices, and h denotes the number of self-attention heads.
  4. The image analysis method for identifying post-earthquake collapsed buildings according to claim 3, wherein the image features are further optimized by a Transformer decoder consisting of stacked units, each unit comprising a multi-head cross-attention layer and a multi-layer perceptron (MLP) layer; the MLP layer comprises two linear projection layers, the first of which is followed by a Gaussian error linear unit GELU, with the formula:
     MLP(X) = GELU(X W_1) W_2,
     wherein W_1 and W_2 denote learnable linear transformation matrices.
  5. The image analysis method for identifying post-earthquake collapsed buildings according to claim 1, wherein the attention weight calculation formula is as follows:
     [w_g, w_l] = Softmax(Conv_{1×1}([F_g′; F_l′])),
     wherein w_g and w_l denote the attention weights of the global and local features, F_g′ and F_l′ denote the transformed global and local features, and [·; ·] denotes channel-wise splicing.
  6. An electronic device comprising a processor and a memory, the memory storing a computer program, wherein the processor implements the method of any one of claims 1 to 5 when executing the program.
  7. An image analysis system for identifying post-earthquake collapsed buildings, applied to the image analysis method for identifying post-earthquake collapsed buildings according to claim 1, characterized in that the system comprises an image input module, a global feature extraction module, a local feature extraction module, a feature fusion module and an output module; the image input module is used for receiving the post-earthquake optical image; the global feature extraction module is implemented by the global context enhancement module GCEM, which comprises a convolution layer, a multi-scale semantic tokenizer MST, and a Transformer encoder and decoder; the convolution layer is used for extracting a feature map of the input image; the multi-scale semantic tokenizer MST generates multi-scale attention features through 1×1, 3×3 and 5×5 convolution kernels and, combined with a spatial attention module, outputs a semantic token set; the local feature extraction module is implemented by a UNet++ network and outputs a local feature map through encoder downsampling and progressive fusion in the decoder; the feature fusion module is implemented by the dynamic feature interaction module DFIM and performs the following operations: transforming the global feature map and the local feature map through 3×3 convolution layers with residual connections, splicing the transformed features, generating attention scores through a 1×1 convolution, computing the weights of the global and local features through a Softmax function, adding the weighted global and local features, and generating the collapsed-building identification result; the output module is used for outputting the identification result to a display device or a storage medium.
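A minimal NumPy sketch of the DFIM weighting step described in claims 1, 5 and 7 (the 3×3 residual transforms are omitted; the 1×1 convolution is expressed exactly as a per-pixel matrix multiply, and the weight matrix `w_c` is a hypothetical stand-in for the learned kernel):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dfim_fuse(f_g, f_l, w_c):
    """Fuse a global and a local feature map of shape (H, W, C).
    w_c: (2C, 2) matrix playing the role of the 1x1 convolution that maps
    the spliced features to two attention scores per pixel."""
    x = np.concatenate([f_g, f_l], axis=-1)     # splice: (H, W, 2C)
    scores = x @ w_c                            # 1x1 conv as matmul: (H, W, 2)
    w = softmax(scores, axis=-1)                # per-pixel weights, sum to 1
    return w[..., :1] * f_g + w[..., 1:] * f_l  # element-wise weighted sum

H, W, C = 8, 8, 16
f_g = np.random.rand(H, W, C)   # global feature map from GCEM
f_l = np.random.rand(H, W, C)   # local feature map from UNet++
w_c = np.random.rand(2 * C, 2)  # hypothetical learned 1x1-conv weights
out = dfim_fuse(f_g, f_l, w_c)
print(out.shape)  # (8, 8, 16)
```

Because the Softmax weights sum to one at each pixel, the output is a convex combination of the two feature maps, which is what lets the module lean toward the global or the local branch per location.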

Description

Image analysis method and system for identifying post-earthquake collapse building

Technical Field

The invention relates to the technical field of image analysis, and in particular to an image analysis method and system for identifying post-earthquake collapsed buildings.

Background

Post-earthquake collapsed building identification is a key task of disaster emergency response. Traditional methods rely on joint analysis of pre-earthquake optical images and post-earthquake SAR images: shape, spectrum and texture features are extracted from the optical images, the polarization and scattering features of the SAR images are combined with them, and identification is achieved through a classifier. However, SAR imagery is limited by satellite revisit periods and post-earthquake environmental interference, so data acquisition is poorly timed and interpretation demands a high level of expertise. In comparison, optical imagery has advantages in sensor coverage, ease of acquisition and interpretation difficulty, but lacks elevation information, so more abstract and discriminative features must be mined, making the introduction of deep learning a necessary choice.

At present, deep learning methods based on optical images still have obvious deficiencies:
  1. Weak global modeling capability: the mainstream model UNet++ is limited by the local receptive field of its convolution kernels, long-distance spatial dependencies are difficult to capture, and boundary recognition is blurred.
  2. Rigid feature fusion mechanism: a fixed-weight fusion strategy cannot dynamically perceive the importance of multi-scale features, which increases the false-detection/omission rate.
  3. Weak robustness against complex backgrounds: post-earthquake scenes such as rubble accumulation and vegetation occlusion are easily confused with collapsed buildings, significantly reducing recognition accuracy.

Although existing improved models attempt optimization, they struggle to break through the performance bottleneck because of strong dependence on boundary information or insufficient ability to distinguish environmental interference. Therefore, an image analysis technical solution for identifying post-earthquake collapsed buildings is needed to solve the above technical problems.

Disclosure of Invention

The invention aims to provide an image analysis method and system for identifying post-earthquake collapsed buildings. Taking UNet++ as the backbone network, a Global Context Enhancement Module (GCEM) is designed, which captures long-distance dependencies and global context by constructing a Multi-Scale Semantic Tokenizer (MST) and introducing the encoder and decoder modules of a Transformer network, enhancing the model's structural understanding of collapsed areas while suppressing background noise. On this basis, a Dynamic Feature Interaction Module (DFIM) is designed to adaptively fuse local and global features, so that the advantages of features at each scale are fully exploited and the accuracy and robustness of the model in identifying collapsed buildings in complex scenes are improved, thereby solving the problems set forth in the Background.
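The semantic tokenization at the heart of the MST (spatial attention maps turned into a compact token set via Softmax, of the form T = Softmax(·)ᵀ · X) can be sketched in NumPy; the token count L and the point-wise projection matrix `phi` are illustrative assumptions:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def semantic_tokenize(x, phi):
    """x: feature map reshaped to (H*W, C); phi: (C, L) point-wise projection.
    Softmax over the spatial axis gives one attention map per token;
    returns L semantic tokens of dimension C."""
    a = softmax(x @ phi, axis=0)  # spatial attention per token: (H*W, L)
    return a.T @ x                # token set T: (L, C)

H, W, C, L = 16, 16, 32, 4    # L semantic tokens (assumed)
x = np.random.rand(H * W, C)  # flattened conv feature map
phi = np.random.rand(C, L)    # hypothetical learnable point-wise conv
tokens = semantic_tokenize(x, phi)
print(tokens.shape)  # (4, 32)
```

Each token is a spatial-attention-weighted average of the pixel features, which is what lets the downstream Transformer model global context over only L tokens instead of H·W pixels.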
In order to solve the above technical problems, the present invention provides an image analysis method for identifying a post-earthquake collapsed building, comprising: S1, constructing an identification network, wherein the identification network comprises a UNet++ network, a global context enhancement module GCEM and a dynamic feature interaction module DFIM. When high-resolution remote sensing images are used to identify collapsed buildings, UNet++ struggles to balance global information capture with local feature association, and its weak global modeling mechanism degrades detection accuracy and completeness when facing post-earthquake images filled with interfering elements. The global context enhancement module GCEM is employed to achieve efficient association of features with the global scene; it contains three parts: a multi-scale semantic tokenizer MST, a Transformer encoder and a Transformer decoder. The multi-scale semantic tokenizer MST incorporates the multi-scale attention module and generates a semantic token set through Softmax; the Transformer encoder then models the context of the semantic token set, and the Transformer decoder maps it back to the pixel space, so that features are effectively associated with the global scene. S2, inputting the post-earthquake optical image into the global context enhancement module GCEM and the UNet++ network simultaneously. The global context enhancement module GCEM
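The context modeling that the GCEM's Transformer encoder performs on the token set reduces to standard scaled dot-product self-attention; this NumPy sketch uses random projection matrices in place of the learned ones:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(t, w_q, w_k, w_v):
    """t: semantic token set (L, C).
    Att(q, k, v) = Softmax(q k^T / sqrt(d)) v."""
    q, k, v = t @ w_q, t @ w_k, t @ w_v
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d), axis=-1) @ v

L, C, d = 4, 32, 16
t = np.random.rand(L, C)  # tokens from the MST
w_q, w_k, w_v = (np.random.rand(C, d) for _ in range(3))
out = self_attention(t, w_q, w_k, w_v)
print(out.shape)  # (4, 16)
```

A multi-head version would run h such layers in parallel on separate projections and concatenate the results, as in the MSA formula of claim 3.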