
CN-121788809-B - Real-time infrared small target detection method based on linear global scanning network

CN121788809B

Abstract

The invention discloses a real-time infrared small target detection method based on a linear global scanning network. To address the high computational complexity and poor real-time performance of existing models, the invention constructs a lightweight U-Net-like architecture that balances detection accuracy and inference speed. The encoding stage extracts shallow local textures through a Stem module and ResBlocks, while the deep stages introduce a Linear Global Scanning (LGS) module whose core, a spatial scanning GRU, captures anisotropic long-range semantic dependencies at linear complexity. At the encoder bottleneck, a Linear Context Aggregator (LCA) reuses the scanning GRU and incorporates channel re-weighting to enhance features. In the decoding stage, skip-connection semantics are aligned by PlainBlock units, and upsampled features are fused by a StandardFusion module and refined by cascaded convolutions. Finally, an Inception prediction head combined with deep supervision generates high-precision prediction results. The method has low computational cost and high detection accuracy, and is suitable for high-frame-rate real-time infrared monitoring scenarios.
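As a rough illustration of the pipeline the abstract describes, the encoder-decoder flow can be sketched at the shape level in Python. This is a hypothetical sketch: the module names follow the abstract, but all layer widths, stage counts, and internals are placeholder assumptions, not the patent's implementation.

```python
import numpy as np

# Shape-level skeleton of the U-Net-like LGSNet pipeline from the abstract.
# Stem, ResBlock, LGS, LCA, StandardFusion, and the Inception head are
# replaced by placeholders that only model channel expansion and spatial
# down/upsampling; all widths here are assumptions.

def stem(x, out_ch=16):
    """Expand channels, keep spatial resolution (step S2)."""
    b, _, h, w = x.shape
    return np.zeros((b, out_ch, h, w), dtype=x.dtype)

def down_stage(x):
    """Halve resolution, double channels (encoder stage placeholder)."""
    b, c, h, w = x.shape
    return np.zeros((b, c * 2, h // 2, w // 2), dtype=x.dtype)

def up_fuse(x, skip):
    """Upsample to the skip's resolution and fuse (decoder placeholder)."""
    b, c, h, w = skip.shape
    return np.zeros((b, c, h, w), dtype=x.dtype)

def lgsnet_shapes(x):
    f0 = stem(x)              # shallow: Stem + ResBlock textures
    f1 = down_stage(f0)       # shallow stage
    f2 = down_stage(f1)       # deep stage with LGS blocks
    f3 = down_stage(f2)       # bottleneck, refined by LCA
    d2 = up_fuse(f3, f2)      # StandardFusion + cascaded convs
    d1 = up_fuse(d2, f1)
    d0 = up_fuse(d1, f0)
    b, _, h, w = d0.shape
    return np.zeros((b, 1, h, w))   # Inception head -> pixel-level map

img = np.zeros((1, 1, 64, 64), dtype=np.float32)  # single-channel IR image
pred = lgsnet_shapes(img)
print(pred.shape)   # (1, 1, 64, 64): prediction at the input resolution
```

The point of the sketch is only that the decoder returns to the full input resolution, so the head can emit a per-pixel probability map.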

Inventors

  • Zheng Jieying
  • Jiang Dingshuo
  • Pang Yueyong
  • Liu Feng
  • Liu Xitian

Assignees

  • Nanjing University of Posts and Telecommunications (南京邮电大学)

Dates

Publication Date
2026-05-12
Application Date
2026-03-09

Claims (8)

  1. The real-time infrared small target detection method based on the linear global scanning network is characterized by comprising the following steps: S1, acquiring an infrared image to be detected; S2, preprocessing the image to be detected through a Stem module, performing channel expansion while maintaining the original spatial resolution; S3, constructing an encoder to extract multi-scale features of the image, wherein the shallow stages adopt ResBlocks to extract shallow local textures, and the deep stages adopt a plurality of cascaded LGS Stages to process the features so as to enhance global semantic expression; constructing the encoder for extracting multi-scale features comprises the sub-steps of: S301, encoder hierarchical cascade architecture, wherein the Stem module serves as the encoder's initial layer, the encoder sequentially comprises at least one shallow feature extraction stage and a plurality of cascaded deep feature extraction stages, the spatial resolution of the feature map is reduced step by step through downsampling, and the number of channels is expanded to form a multi-scale feature representation; S302, shallow texture feature extraction, namely processing input features with the Stem module and at least one ResBlock in the shallow stages of the encoder to capture the shallow physical features and local background textures of infrared small targets, and enhancing local feature responses through residual mapping; S303, deep global semantic modeling, namely in the deep stages of the encoder, processing features with a plurality of cascaded linear global scanning stages (LGS Stages), each of which establishes long-range spatial dependencies through a linear global scanning module (LGS Block) so as to enhance global semantic expression under complex backgrounds; S4, modeling deep features with the linear global scanning module LGS Block, wherein the LGS Block captures anisotropic long-range spatial dependencies at linear computational complexity through a spatial scanning GRU unit; S5, receiving the output features of the encoder's deepest bottleneck with a linear context aggregator (LCA), feeding the features into the spatial scanning GRU unit and a channel attention branch respectively, and fusing the two outputs to obtain aggregated features; S6, constructing a decoder to perform feature recovery and fusion, wherein a StandardFusion module channel-concatenates the upsampled features with the same-scale skip-connection features processed by PlainBlock units, and cascaded convolution layers refine the result; S7, generating pixel-level prediction results with a multi-branch Inception prediction head, training the model with a deep supervision strategy combined with a Soft-IoU loss function, and outputting the infrared small target detection results.
  2. The method according to claim 1, characterized in that in step S2 the preprocessing of the infrared image to be detected by the Stem module comprises the following sub-steps: S201, channel initialization, namely mapping the input single-channel infrared image into a multi-channel feature space through at least one convolution layer, so as to expand the channel dimension while keeping the original spatial resolution unchanged; and S202, local residual enhancement, namely performing local feature enhancement on the channel-expanded features, the processing comprising convolution, normalization, and nonlinear activation, and fusing the enhanced features with the input features through a residual connection so as to retain pixel-level salient information and improve the initial representation of small targets.
  3. The method according to claim 1, wherein in step S4 the linear global scanning module LGS Block models the long-range spatial dependencies of the input features, comprising the sub-steps of: S401, feature preprocessing and splitting, namely preprocessing the input features and splitting them along the channel dimension into a scanning branch Xscan and a gating branch; S402, spatial scanning feature extraction, namely feeding the scanning branch Xscan into the spatial scanning GRU unit and obtaining, through orthogonal bidirectional recursive scanning in the horizontal and vertical directions, global context features containing long-range spatial dependencies; and S403, gated modulation and triple residual connection, namely modulating the global context features with the gating branch to obtain modulated features, compressing the channels of the modulated features via an output projection layer, applying dropout (random inactivation), and finally performing a three-way residual summation of the original input features, the locally enhanced features, and the modulated global features as the output of the LGS Block.
  4. The method according to claim 3, wherein in step S402 the processing logic of the spatial scanning GRU unit comprises the sub-steps of: S4021, parallel serialization, namely receiving the input features and reshaping them into a horizontal one-dimensional sequence stream and a vertical one-dimensional sequence stream, wherein the horizontal sequence stream is generated by folding the height dimension into the batch dimension, and the vertical sequence stream is generated by a transposition operation; S4022, orthogonal bidirectional scanning and spatial reduction, namely extracting features from the horizontal and vertical sequence streams with a horizontal bidirectional GRU and a vertical bidirectional GRU respectively, and outputting features covering long-range dependencies in the horizontal and vertical directions; S4023, channel concatenation and fusion, namely concatenating the horizontal scanning output and the vertical scanning output along the channel dimension to form fused bidirectional context features; and S4024, linear mapping and activation, namely mapping the concatenated features back to the original channel dimension with a linear layer and outputting the features after a SiLU activation function.
  5. The method according to claim 4, wherein in step S5 a linear context aggregator LCA built at the deepest bottleneck of the encoder enhances the features, comprising the sub-steps of: S501, linear spatial branch, namely feeding the bottleneck input features into the spatial scanning GRU unit of step S402, establishing a full-image receptive field at linear complexity and outputting spatially enhanced features, the computation following the serialization and mapping process of steps S4021 to S4024; S502, channel attention branch, namely applying adaptive global average pooling to the bottleneck input features to obtain a statistical vector describing the global background and target distribution, feeding the statistical vector into a channel attention generation network, and generating an adaptive channel weight map through a sequential structure comprising channel compression, nonlinear activation, and channel restoration; and S503, combining the linear spatial branch output with the input features weighted by the channel weight map, smoothing the features through a fusion convolution layer, and finally adding the result to the original input features through a residual connection to output the aggregated context-enhanced features.
  6. The method according to claim 1, characterized in that in step S6 the constructed decoder performs feature recovery and fusion, comprising the sub-steps of: S601, upsampling and cross-scale feature alignment, namely upsampling the features from the upper decoder layer so that their spatial resolution matches the encoder features at the current scale, extracting the corresponding-scale features from the encoder, and processing the encoder features through PlainBlock units to align the semantic distributions between encoder and decoder; S602, multi-path feature fusion and cascaded refinement, namely concatenating the upsampled decoding features and the aligned encoder features along the channel dimension to obtain fused features, feeding the fused features into a fusion module comprising at least two cascaded convolution levels to perform noise suppression and spatial detail reconstruction, and outputting the decoding features of the current scale; and S603, step-by-step iterative recovery, namely repeating the upsampling, feature alignment, and fusion refinement so that the decoder progressively recovers the spatial resolution of the feature map, finally outputting, at the original scale of the input image, a feature map carrying both detail information and high-level semantic information for subsequent pixel-level prediction.
  7. The method according to claim 1, wherein in step S7 the detection result is generated by a multi-branch prediction structure combined with a deep supervision strategy, comprising the sub-steps of: S701, constructing a multi-branch prediction head, namely feeding the highest-resolution decoder output into a multi-branch Inception module that comprises a plurality of parallel feature extraction branches to capture different spatial forms and directional characteristics, fusing the branch outputs, and generating a pixel-level prediction probability map through a channel compression layer and an activation function; S702, multi-scale deep supervision, namely, in the training stage, drawing auxiliary prediction branches from several intermediate-scale decoder features, generating a corresponding intermediate prediction for each scale, uniformly mapping the intermediate predictions to the same spatial scale as the input image, and letting them participate in training together with the main prediction; and S703, loss function optimization, namely jointly supervising the main and auxiliary predictions with a Soft-IoU loss based on region overlap, and constructing a multi-scale weighted loss by assigning weighting coefficients to the predictions at different scales so as to alleviate sample imbalance and improve the detection accuracy of infrared small targets.
  8. An electronic device comprising a memory and a processor, wherein the processor implements the method of any one of claims 1 to 7 when executing a computer program stored in the memory.
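Claim 7 names a Soft-IoU loss but does not give its closed form. A common formulation of such a region-overlap loss (an assumption on our part, not quoted from the patent) computes a differentiable intersection-over-union directly from the predicted probability map:

```python
import numpy as np

def soft_iou_loss(pred, target, eps=1e-6):
    """Soft-IoU loss: 1 minus a soft intersection-over-union.

    pred:   predicted probabilities in [0, 1], any shape
    target: binary ground-truth mask, same shape
    The exact form used by the patent is not specified; this is the
    standard soft-IoU formulation assumed for illustration.
    """
    inter = np.sum(pred * target)
    union = np.sum(pred) + np.sum(target) - inter
    return 1.0 - (inter + eps) / (union + eps)

# A tiny 2x2 "target" in an 8x8 frame, mimicking a small IR target.
t = np.zeros((8, 8)); t[3:5, 3:5] = 1.0
print(round(soft_iou_loss(t, t), 4))        # perfect prediction -> 0.0
print(round(soft_iou_loss(1.0 - t, t), 4))  # disjoint prediction -> ~1.0
```

Because the loss is normalized by the union rather than by pixel count, the handful of target pixels is not swamped by the background, which is why such losses are favored for the sample-imbalance problem claim 7 mentions. The multi-scale weighted loss of S703 would then be a weighted sum of this term over the main and auxiliary predictions.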

Description

Real-time infrared small target detection method based on a linear global scanning network

Technical Field

The invention belongs to the fields of computer vision and infrared image processing, and particularly relates to a real-time infrared small target detection method based on a linear global scanning network (LGSNet).

Background

Infrared small target detection is widely applied in various scenarios and has some unique characteristics compared with general object detection. Due to the long imaging distance, infrared targets are typically small, ranging from a single pixel to tens of pixels, and lack color, texture, and structural information. In addition, complex background clutter (e.g., thick clouds, wave reflections, ground buildings) is very prone to masking weak, small targets, making accurate detection very challenging. Existing deep learning methods cope well with these challenges: their strong feature representation capability significantly improves detection accuracy and reduces the false alarm rate. Currently, mainstream schemes mostly adopt hybrid architectures combining a Convolutional Neural Network (CNN) with a Vision Transformer (ViT), in which the CNN extracts local features using its inductive bias, and the ViT captures long-range dependencies and global context through self-attention. This combination improves the model's ability to detect small objects in complex scenes to some extent. However, such hybrid architectures still face two unsolved technical challenges in practice. First, computation conflicts with real-time performance: the computational complexity of the ViT self-attention mechanism grows quadratically with the number of image tokens, and hence with image resolution.
In infrared small target detection, high-resolution images are often required to cover a wide field of view, which makes it difficult for Transformer-based models to meet real-time detection requirements on embedded or resource-constrained hardware platforms. Second, the efficiency of spatial dependency modeling is a problem: the pure attention mechanism, while having a global field of view, is not tailored to the anisotropic structural features of infrared targets. In contrast, recurrent neural networks (such as the GRU), while excellent at sequence modeling and efficiently linear in computation, have not been fully explored in infrared small target detection: how to effectively map their sequential recursion onto the two spatial dimensions of an image, and form complementary advantages with a CNN feature extraction pipeline, remains a difficult open problem. At present, how to construct a lightweight network that possesses global long-range modeling capability, has linear computational complexity, and effectively enhances small target detection performance remains an open technical challenge in the field.

Disclosure of Invention

In order to solve the above problems, the invention discloses a real-time infrared small target detection method based on a linear global scanning network, which replaces the traditional self-attention mechanism with a linear scanning mechanism based on recurrent neural networks. By using a linear global scanning module (LGS Block) containing a spatial scanning GRU unit and a Linear Context Aggregator (LCA), the method greatly reduces inference overhead and hardware resource occupation, effectively enhances the saliency of weak, small infrared targets against complex backgrounds, and ultimately improves the accuracy, robustness, and real-time response speed of the detection task.
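To make the quadratic-versus-linear gap concrete, here is back-of-envelope arithmetic (generic Big-O reasoning, not figures from the patent): for N = H x W tokens, self-attention scores all N^2 token pairs, whereas a bidirectional line scan visits each pixel a constant number of times per pass.

```python
# Pairwise-interaction counts: quadratic self-attention vs. linear scanning.
# Generic complexity arithmetic for illustration, not patent measurements.

def attention_interactions(h, w):
    n = h * w
    return n * n            # every token attends to every other token

def scan_interactions(h, w, directions=4):
    # horizontal fwd/bwd + vertical fwd/bwd: one visit per pixel per pass
    return h * w * directions

for side in (64, 128, 256):
    a = attention_interactions(side, side)
    s = scan_interactions(side, side)
    print(f"{side}x{side}: attention={a:,}  scan={s:,}  ratio={a // s:,}x")
```

Doubling the image side quadruples the scan cost but multiplies the attention cost by sixteen, which is why the gap widens exactly where infrared surveillance needs it most: at high resolution.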
In order to achieve the above purpose, the technical scheme of the invention is as follows. A real-time infrared small target detection method based on a linear global scanning network comprises the following steps. S1, acquiring an infrared image to be detected. S2, preprocessing the image to be detected through a Stem module, performing channel expansion while maintaining the original spatial resolution. S3, constructing an encoder to extract the multi-scale features of the image; specifically, ResBlocks extract shallow local texture features in the shallow stages, and a plurality of cascaded LGS Stages extract global semantic features in the deep stages. S4, modeling deep features with the linear global scanning module LGS Block, capturing anisotropic global dependencies at linear computational complexity through the spatial scanning GRU unit. S5, receiving the output features of the encoder's deepest bottleneck with the linear context aggregator LCA, feeding the features into the spatial scanning GRU unit and the channel attention branch respectively, and fusing the output of the spatial scanning GRU unit with the output of the channel attention branch to obtain the aggregated features.
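The orthogonal bidirectional scan at the heart of steps S4 and S5 (serialization, horizontal and vertical bidirectional GRUs, channel concatenation, linear mapping, SiLU) can be sketched in NumPy. This is a hedged sketch: the patent fixes only the scan pattern and the SiLU output, so the GRU cell, weight shapes, and projection below are our assumptions, with random weights standing in for trained ones.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell (hypothetical weights; sketch, not the patent's)."""
    def __init__(self, dim, rng):
        s = 0.1
        self.Wz = rng.normal(0, s, (dim, 2 * dim))  # update gate
        self.Wr = rng.normal(0, s, (dim, 2 * dim))  # reset gate
        self.Wh = rng.normal(0, s, (dim, 2 * dim))  # candidate state
    def step(self, x, h):
        xh = np.concatenate([x, h], axis=-1)
        z = sigmoid(xh @ self.Wz.T)
        r = sigmoid(xh @ self.Wr.T)
        hc = np.tanh(np.concatenate([x, r * h], axis=-1) @ self.Wh.T)
        return (1 - z) * h + z * hc

def bidirectional_scan(seq, cell):
    """seq: (L, B, C) -> (L, B, 2C) forward/backward hidden states."""
    L, B, C = seq.shape
    fwd, bwd = np.zeros((L, B, C)), np.zeros((L, B, C))
    h = np.zeros((B, C))
    for t in range(L):                      # forward pass
        h = cell.step(seq[t], h); fwd[t] = h
    h = np.zeros((B, C))
    for t in reversed(range(L)):            # backward pass
        h = cell.step(seq[t], h); bwd[t] = h
    return np.concatenate([fwd, bwd], axis=-1)

def spatial_scan(x, hcell, vcell, proj):
    """Orthogonal bidirectional scan over a (B, H, W, C) feature map."""
    B, H, W, C = x.shape
    # S4021: rows become length-W sequences, height folded into batch
    rows = x.transpose(2, 0, 1, 3).reshape(W, B * H, C)
    hout = (bidirectional_scan(rows, hcell)
            .reshape(W, B, H, 2 * C).transpose(1, 2, 0, 3))
    # columns become length-H sequences via transposition
    cols = x.transpose(1, 0, 2, 3).reshape(H, B * W, C)
    vout = (bidirectional_scan(cols, vcell)
            .reshape(H, B, W, 2 * C).transpose(1, 0, 2, 3))
    fused = np.concatenate([hout, vout], axis=-1)   # S4023: (B, H, W, 4C)
    y = fused @ proj                                # S4024: back to C channels
    return y * sigmoid(y)                           # SiLU activation

rng = np.random.default_rng(0)
B, H, W, C = 1, 8, 8, 4
x = rng.normal(size=(B, H, W, C))
out = spatial_scan(x, GRUCell(C, rng), GRUCell(C, rng),
                   rng.normal(0, 0.1, (4 * C, C)))
print(out.shape)   # (1, 8, 8, 4): same shape, full-image receptive field
```

Each pixel is visited once per pass, so the cost is linear in H x W, yet after the two orthogonal bidirectional passes every output location has seen its entire row and column, which is the anisotropic long-range dependency the LGS Block relies on. The LCA of step S5 would reuse this unit alongside a pooled channel-attention branch.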