CN-116863305-B - Infrared dim target detection method based on space-time feature fusion network
Abstract
The invention relates to the technical field of infrared dim target detection, and aims to fully mine the spatio-temporal characteristics of an infrared image sequence and to realize an attention-guiding mechanism in the two dimensions of time and space. According to the technical scheme adopted by the invention, an infrared dim target detection method based on a space-time feature fusion network is adopted: a space-time feature fusion infrared dim target detection network model, STNet, takes infrared images of adjacent frames as input; the images are sent into an STNet backbone network composed of spatial attention units (SAU) to extract visual features; the features are then sent into a temporally guided Transformer structure to complete the space-time feature fusion; and finally the prediction of the detection result is generated, wherein the SAU realizes an in-place substitution of the convolution layers of the STNet backbone network. The invention is mainly applied to the detection of infrared weak and small targets.
Inventors
- SUN YUXIN
- JI ZHONG
Assignees
- Tianjin University (天津大学)
Dates
- Publication Date: 2026-05-08
- Application Date: 2023-07-13
Claims (6)
- 1. An infrared dim target detection method based on a space-time feature fusion network, characterized in that a space-time feature fusion infrared dim target detection network model, STNet, takes infrared images of adjacent frames as input; the images are sent into an STNet backbone network composed of spatial attention units (SAU) to extract visual features; the features are then sent into a temporally guided Transformer structure to complete the space-time feature fusion; and finally the prediction of the detection result is generated, wherein the SAU realizes an in-place substitution of the convolution layers of the STNet backbone network; the temporally guided Transformer structure specifically comprises: three MLP networks are used to obtain a query matrix Q, a key matrix K and a value matrix V from the input feature Fin, and the self-attention feature is obtained by formula (1): Attention(Q, K, V) = Softmax(QKᵀ / √d_k) V (1), wherein, in order to constrain the numerical range of the feature, the feature dimension d_k is introduced and the product of Q and Kᵀ is divided by √d_k; the spatial feature f_s of the image sequence is used as the input of the K matrix and the V matrix, the temporal feature f_t is used as the input of the Q matrix, and the transformed self-attention mechanism is expressed as: Attention(Q, K, V) = Softmax(Q_t K_sᵀ / √d_k) V_s (2); the number of layers of the adopted temporally guided Transformer encoder and decoder is set to 3, and the output features of the decoder are used as the input of the MLPs to obtain the prediction of the target class and position; each MLP network realizes the prediction of one position box, 20 MLP networks are connected behind the Transformer decoder in STNet, the number of layers of a single MLP is set to 3, and the intermediate hidden layer sizes are 512, 256 and 128 respectively.
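The scaled dot-product attention of formula (1), and its temporally guided variant in which the spatial feature supplies K and V while the temporal feature supplies Q, can be sketched as follows (a minimal NumPy illustration; the 49×512 feature shape follows the claims, variable names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Formula (1): Softmax(Q K^T / sqrt(d_k)) V.

    Dividing by sqrt(d_k) constrains the magnitude of the scores."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

# Temporally guided variant (formula (2)): the spatial feature f_s is the
# source of K and V, the temporal feature f_t is the source of Q.
rng = np.random.default_rng(0)
f_s = rng.standard_normal((49, 512))  # spatial feature -> K, V
f_t = rng.standard_normal((49, 512))  # temporal feature -> Q
out = attention(f_t, f_s, f_s)
```

In a full model Q, K and V would first pass through the three learned MLP projections; the sketch omits them to isolate the attention computation itself.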
- 2. The method for detecting an infrared dim target based on the space-time feature fusion network according to claim 1, wherein a deformable convolution layer is added to the SAU; the deformable convolution kernel adds a learnable position offset during the operation, so that the receptive field of the convolution operation gains scale-transformation and rotation capability.
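The core idea of the deformable convolution in claim 2, namely shifting each kernel tap by a fractional learned offset and reading the image with bilinear interpolation, can be sketched as follows (a single-channel, single-position NumPy illustration; in practice a library operator such as torchvision's DeformConv2d would be used, and the offsets would be predicted by a convolution):

```python
import numpy as np

def bilinear_sample(img, y, x):
    """Sample a 2-D array at fractional coordinates (y, x); out-of-range taps read 0."""
    h, w = img.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = y0 + 1, x0 + 1
    wy, wx = y - y0, x - x0
    def at(r, c):
        return img[r, c] if 0 <= r < h and 0 <= c < w else 0.0
    return ((1 - wy) * (1 - wx) * at(y0, x0) + (1 - wy) * wx * at(y0, x1)
            + wy * (1 - wx) * at(y1, x0) + wy * wx * at(y1, x1))

def deformable_sample_3x3(img, ci, cj, offsets, weights):
    """One output value of a 3x3 deformable convolution centred at (ci, cj):
    each of the 9 taps is displaced by its own (dy, dx) offset."""
    out, idx = 0.0, 0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            dy, dx = offsets[idx]
            out += weights[idx] * bilinear_sample(img, ci + di + dy, cj + dj + dx)
            idx += 1
    return out

img = np.arange(25, dtype=float).reshape(5, 5)
zero_off = [(0.0, 0.0)] * 9
w = [1.0 / 9] * 9
# With all-zero offsets this reduces to an ordinary 3x3 averaging kernel.
centre_avg = deformable_sample_3x3(img, 2, 2, zero_off, w)
```

Non-zero offsets let the sampling grid stretch and rotate, which is what gives the receptive field the scale and rotation capability the claim describes.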
- 3. The method for detecting an infrared weak and small target based on the space-time feature fusion network according to claim 1, wherein the SAU first adopts two dilated convolutions and one deformable convolution to extract multi-scale features, the convolution kernel sizes all being set to 3×3 and the dilation rates of the two dilated convolutions being set to 2 and 4 respectively; the feature maps after the convolution operations are combined into a unified tensor, and a convolution with kernel size 1×1 is used to reduce the feature dimension so that it remains consistent with the input dimension; meanwhile, a shortcut path is added to the SAU, in which the feature map input to the SAU is added to the features after the 1×1 convolution operation, thereby alleviating the gradient-vanishing problem caused by increased model depth; and, to satisfy the channel-number requirement, a 1×1 convolution is added on the shortcut path when the input channel number of the SAU is inconsistent with the output channel number.
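The effect of the dilation rates 2 and 4 named in claim 3 is to enlarge the receptive field of a 3×3 kernel without adding parameters. A minimal single-channel NumPy sketch of a "same"-padded dilated convolution (loop-based for clarity; a real model would use a framework operator):

```python
import numpy as np

def dilated_conv2d(x, kernel, dilation=1):
    """Single-channel 2-D convolution with 'same' padding and a dilation rate."""
    k = kernel.shape[0]
    eff = dilation * (k - 1) + 1          # effective kernel extent
    pad = eff // 2
    xp = np.pad(x, pad)
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            # Strided slice picks k taps spaced `dilation` pixels apart.
            patch = xp[i:i + eff:dilation, j:j + eff:dilation]
            out[i, j] = (patch * kernel).sum()
    return out

x = np.ones((8, 8))
k3 = np.ones((3, 3))
y2 = dilated_conv2d(x, k3, dilation=2)  # effective 5x5 receptive field
y4 = dilated_conv2d(x, k3, dilation=4)  # effective 9x9 receptive field
```

Both outputs keep the input's spatial size, matching the claim's requirement that the branch outputs can be combined into one tensor before the 1×1 dimension reduction.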
- 4. The method for detecting an infrared dim target based on the space-time feature fusion network according to claim 1, wherein a ResNet-18 network is selected as the original configuration to form a ResNet-SAU-18 model; the ResNet-SAU-18 model takes a 7×7 convolution layer and a maximum pooling layer as the first convolution group Conv_1, and SAU units sequentially replace the convolution layers of Conv_2, Conv_3, Conv_4 and Conv_5 in ResNet-18; the 7×7×512-dimensional features are flattened, and the 49×512-dimensional features are taken as the visual features; two adjacent frames of infrared images It-1 and It are respectively sent into the ResNet-SAU-18 structure to obtain the corresponding visual features Ft-1 and Ft.
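The flattening step in claim 4, turning the backbone's 7×7×512 output into a 49-token sequence of 512-dimensional visual features for the Transformer, is a simple reshape (NumPy sketch with a dummy feature map; values are illustrative):

```python
import numpy as np

# Backbone output for one 224x224 input frame: a 7x7 spatial grid of 512-d features.
feat = np.arange(7 * 7 * 512, dtype=float).reshape(7, 7, 512)

# Flatten the 7x7 grid row-by-row into 49 tokens of dimension 512;
# token index 7*i + j corresponds to grid cell (i, j).
tokens = feat.reshape(49, 512)
```

Running both frames It-1 and It through the same backbone and flattening each in this way yields the visual features Ft-1 and Ft used by the later claims.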
- 5. The method for detecting infrared dim targets based on a space-time feature fusion network according to claim 1, wherein the training process of the STNet model is as follows:
  Step 1, selecting an image frame It and its previous frame It-1 from an infrared image sequence and sending the two frames into the backbone network composed of SAUs; the backbone uses the ResNet-18 basic structure with part of the convolution layers replaced by SAU units to form a ResNet-SAU-18 model; It and It-1 are input into the ResNet-SAU-18 model at 224×224 resolution, and the output 49×512-dimensional features are taken as the visual features Ft and Ft-1 of the two frames;
  Step 2, calculating the temporal feature f_t and the spatial feature f_s, wherein f_t = Ft − Ft-1 and f_s = Ft + F_pos, F_pos being the position encoding;
  Step 3, sending f_s and f_t into the temporally guided Transformer structure, wherein the spatial feature f_s is used as the input of the key matrix K and the value matrix V of the encoder, and the temporal feature f_t is used as the input of the query matrix Q in the encoder and the decoder, obtaining the prediction Y_p = [y_p1, y_p2, ..., y_p20]ᵀ of the target detection result;
  Step 4, matching the predicted result Y_p and the ground truth Y_gt using the Hungarian algorithm to form the matched predicted result Y'_p = [y_p1, y_p2, ..., y_pn]ᵀ;
  Step 5, calculating the loss function L = L_label + αL_box using Y_gt and Y'_p, wherein L_label is the class loss, namely the cross-entropy loss between the predicted class and the true class, and L_box is the prediction-box loss, computed as the L1 norm over the parameters of the horizontal centre point x, vertical centre point y, width w and height h of the predicted box and the corresponding box in the ground truth; and
  Step 6, calculating the gradient according to the loss value L and updating the model parameters.
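The loss of Step 5, L = L_label + αL_box with a cross-entropy class term and an L1 box term over (x, y, w, h), can be sketched as follows (NumPy illustration for a single matched prediction; the value of α is not specified in the claims, so the default here is an assumption):

```python
import numpy as np

def cross_entropy(probs, label):
    """Class loss L_label: negative log-likelihood of the true class."""
    return -np.log(probs[label])

def box_l1(pred_box, gt_box):
    """Box loss L_box: L1 norm over (x, y, w, h) of predicted vs. ground-truth box."""
    return np.abs(np.asarray(pred_box, float) - np.asarray(gt_box, float)).sum()

def detection_loss(probs, label, pred_box, gt_box, alpha=1.0):
    # alpha weights the box term; its value is a placeholder here.
    return cross_entropy(probs, label) + alpha * box_l1(pred_box, gt_box)

# One matched prediction: class probabilities, true class 1, and two boxes
# whose centres differ by 0.1 in x.
loss = detection_loss(np.array([0.1, 0.7, 0.2]), 1,
                      [0.5, 0.5, 0.1, 0.1], [0.6, 0.5, 0.1, 0.1])
```

In training this per-prediction loss would be summed over the Hungarian-matched pairs of Step 4 before computing gradients in Step 6.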
- 6. The method for detecting infrared dim targets based on the space-time feature fusion network according to claim 1, wherein the target prediction process of the STNet model comprises the following steps:
  Step 1, inputting into STNet the image frame I't to be detected and its preceding frame I't-1, where I't-1 = I't if the current frame is the first frame, and obtaining the prediction results of the 20 MLP networks; and
  Step 2, for each of the 20 prediction results, judging whether the probability corresponding to category i is larger than a set threshold θ; if so, outputting the prediction-box result of that MLP network and recognizing the detection box as the position of a target of the i-th category.
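The thresholding in Step 2 of claim 6 amounts to keeping only those of the 20 MLP head outputs whose class probability exceeds θ. A minimal sketch (the (probability, class_id, box) tuple layout is a hypothetical representation of one head's output, not the patent's exact data structure):

```python
def filter_predictions(preds, theta=0.5):
    """Keep (class_id, box) for every MLP head whose class probability exceeds theta.

    `preds` is a list of (class_probability, class_id, box) tuples,
    one per prediction head (20 heads in STNet).
    """
    return [(cls, box) for p, cls, box in preds if p > theta]

# Three example heads: two confident detections of class 3, one low-confidence output.
preds = [(0.9, 3, (10, 12, 4, 4)),
         (0.2, 1, (50, 50, 3, 3)),
         (0.7, 3, (11, 13, 4, 4))]
kept = filter_predictions(preds, theta=0.5)
```

Each surviving box is then reported as the position of a target of its class, as the claim describes.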
Description
Infrared dim target detection method based on space-time feature fusion network

Technical Field

The invention relates to a deep-learning target detection method for infrared weak and small target detection, which can be applied to various tasks such as remote infrared early warning, unmanned aerial vehicles and coast monitoring, and is suitable for different platforms such as airborne, ship-borne, missile-borne and ground platforms. In particular, it relates to an infrared dim target detection method based on a space-time feature fusion network.

Background

In the application of an infrared detection device, the detected target is relatively far from the detection device, so that the infrared radiation signal of the target received by the infrared detector is relatively weak owing to the aperture of the optical system and atmospheric attenuation along the propagation path. The target typically occupies only a few to about ten pixels in the imaging frame of the infrared detector and lacks texture information available for recognition. Meanwhile, unlike a visible-light camera, which can simultaneously collect spectral information in the red, green and blue spectral ranges of a target, a conventional infrared detector can only image a specific single infrared spectral range and lacks multispectral characteristics. Therefore, the infrared detection device needs to identify infrared weak targets with a low signal-to-noise ratio, few features and a small area. In the existing single-frame processing of infrared images, the feature information of the target is limited, and false alarms and missed detections are unavoidable. However, in practical use of an infrared detection device, a skilled operator facing the monitoring screen can accurately and effectively identify the object of interest in the infrared image.
This is because, through long experience, the operator can distinguish a real target from a false one according to the movement trajectory, shape change and brightness change of the object. Therefore, how to efficiently use the space-time characteristics of the target becomes the key to solving the problem of identifying infrared dim targets.

With the development of deep neural network technology, and particularly the advent of network structures such as the LSTM and the Transformer, efficient processing of sequence signals has become possible. The Transformer structure originated in the natural language processing field and was then extended to visual processing tasks. By virtue of the self-attention mechanism it introduces, the Transformer structure breaks through the limitation that RNN networks cannot be parallelized, can produce a more interpretable network structure, and has become a brand-new approach to sequence problems. After sequence features are input into the neural network, the self-attention mechanism solves, through the learning of the network itself, the problem of how to build the connection between the current vector and subsequent vectors, and enables the model to converge quickly to the expected effect. The self-attention model introduces a query matrix Q, a key matrix K and a value matrix V; as shown in figure 1, each input feature vector is processed by three independent MLP networks to obtain the Q, K and V vectors, from which the feature output after attention calculation is computed. The Transformer network uses the encoder-decoder design, stacks a plurality of self-attention structures in the encoder and the decoder, takes the features as the input of the encoder, and finally outputs the prediction result.
In existing vision processing applications, the Transformer network divides the whole image frame into a plurality of image blocks and sends the image blocks, together with the encoding information of their corresponding positions, into the encoder-decoder network structure to generate the position and class prediction of the target. In addition, existing Transformer network structures mainly adopt a general convolutional neural network to extract the visual features of the target and do not design a specific model style for the problem of infrared weak and small target identification that better matches the target characteristics of infrared weak and small targets, which affects the performance of the identification network.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention fully mines the space-time characteristics in the infrared image sequence and realizes an attention-guiding mechanism in the two dimensions of time and space, and aims to provide two network structures of a Spa