CN-121746690-B - Small target detection method based on semantic compression enhancement
Abstract
The invention belongs to the technical field of target detection and relates to a small target detection method based on semantic compression enhancement. The method comprises: obtaining image data to be detected that contains a small target, and scaling the image to obtain scaled image data; performing multi-scale feature extraction on the scaled image data with a backbone network to obtain multi-scale features; performing multi-scale semantic compression and global feature enhancement on the multi-scale features with a semantic compression enhancement encoder, and fusing the enhanced global features with the multi-scale features to obtain encoder output features; and inputting the encoder output features into a decoder to obtain target classification results and regression frames. The method greatly reduces the parameter count and the computational cost, and improves the average precision of small target detection.
Inventors
- LAI RUI
- ZHANG JIAHAO
- WU TONG
- GUAN JUNTAO
- LI DONG
- MA RUI
- ZHU ZHANGMING
Assignees
- Xidian University (西安电子科技大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-02-26
Claims (8)
- 1. A small target detection method based on semantic compression enhancement, characterized by comprising the following steps: obtaining image data to be detected containing a small target, and performing image scaling to obtain scaled image data; performing multi-scale feature extraction on the scaled image data by using a backbone network to obtain multi-scale features; performing multi-scale semantic compression and global feature enhancement on the multi-scale features by using a semantic compression enhancement encoder, and fusing the enhanced global features with the multi-scale features to obtain encoder output features; and inputting the encoder output features into a decoder to obtain a target classification result and a regression frame; wherein the semantic compression enhancement encoder comprises a multi-scale semantic compression module, a foreground sampling enhancement module and a spatial information recovery module, the multi-scale semantic compression module being used for performing feature compression and fusion on the multi-scale features to obtain global features; the computation in the multi-scale semantic compression module comprises the following steps: downsampling the first scale feature with a first downsampling module to obtain a first downsampled feature, and performing feature aggregation on the first downsampled feature with a first convolution block attention module to obtain first target concept information: T_1 = CBAM_1(DS_1(F_1)); wherein F_1 represents the first scale feature, DS_1 represents the first downsampling module, CBAM_1 represents the first convolution block attention module, and T_1 represents the first target concept information; performing feature aggregation on the second scale feature with a second convolution block attention module to obtain second target concept information: T_2 = CBAM_2(F_2); wherein F_2 represents the second scale feature, CBAM_2 represents the second convolution block attention module, and T_2 represents the second target concept information; upsampling the third scale feature with a first upsampling module to obtain a high-level feature: F_h = US_1(F_3); wherein F_3 represents the third scale feature, US_1 represents the first upsampling module, and F_h represents the high-level feature; converting semantic knowledge in the high-level feature into a gate-on signal with a gating module: G = Gate(F_h); wherein Gate represents the gating module and G represents the gate-on signal; multiplying the gate-on signal with the first target concept information by a first complementary information enhancement module, and adding the product to the first target concept information to obtain a first low-level feature: F_L1 = G ⊙ T_1 + T_1; wherein F_L1 represents the first low-level feature; multiplying the gate-on signal with the second target concept information by a second complementary information enhancement module, and adding the product to the second target concept information to obtain a second low-level feature: F_L2 = G ⊙ T_2 + T_2; wherein F_L2 represents the second low-level feature; splicing the first low-level feature, the second low-level feature and the high-level feature to obtain a multi-layer feature: F_m = Concat(F_L1, F_L2, F_h); wherein F_m represents the multi-layer feature and Concat represents the splicing module; and mapping the multi-layer feature into the global feature with a first convolution fusion module: F_g = Fuse_1(F_m); wherein F_g represents the global feature and Fuse_1 represents the first convolution fusion module.
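The gated complementary-enhancement dataflow described in claim 1 can be illustrated in miniature. The following is a hypothetical, pure-Python toy in which features are flat lists of floats and the real modules (downsampling, convolution block attention, convolution fusion) are replaced by trivial stand-ins; only the gate/multiply/add routing reflects the claim.

```python
import math

def gate(high_level):
    # Gating module: maps high-level semantics to gate-on signals in (0, 1)
    # via a sigmoid (the 1x1 convolution described in claim 4 is elided here).
    return [1.0 / (1.0 + math.exp(-v)) for v in high_level]

def complementary_enhance(concept, gate_signal):
    # Complementary information enhancement: multiply the target concept
    # information by the gate-on signal, then add the concept information
    # back (a gated residual connection).
    return [c * g + c for c, g in zip(concept, gate_signal)]

def semantic_compress(concept1, concept2, high_level):
    g = gate(high_level)
    low1 = complementary_enhance(concept1, g)  # first low-level feature
    low2 = complementary_enhance(concept2, g)  # second low-level feature
    # Splicing stand-in: list concatenation in place of channel concat;
    # the first convolution fusion module is omitted in this sketch.
    return low1 + low2 + high_level

features = semantic_compress([1.0, -2.0], [0.5, 0.5], [0.0, 3.0])
```

A gate value near 1 roughly doubles the concept information, while a gate value near 0 passes it through unchanged, which is how the residual addition preserves the original concept.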
- 2. The small target detection method based on semantic compression enhancement according to claim 1, wherein the scaled image data is a tensor of shape 3 × H × W, where H and W are the length and width of the scaled image data.
- 3. The small target detection method based on semantic compression enhancement according to claim 1, wherein the backbone network comprises a ResNet neural network; the multi-scale features include a first scale feature, a second scale feature and a third scale feature, wherein the first scale feature is a tensor of shape C × (H/8) × (W/8), the second scale feature is a tensor of shape C × (H/16) × (W/16), and the third scale feature is a tensor of shape C × (H/32) × (W/32), where H and W are the length and width of the scaled image data.
- 4. The small target detection method based on semantic compression enhancement according to claim 1, wherein the calculation formula of the gating module is: G = σ(Conv_g(F_h)); wherein Conv_g represents a convolution module with input channel C, output channel C, stride 1 and convolution kernel size 1 × 1, and σ represents an activation function.
- 5. The small target detection method based on semantic compression enhancement according to claim 1, wherein the computation in the first convolution fusion module comprises: processing the multi-layer feature with a first convolution module to obtain a first convolution feature: F_a = Conv_1(F_m); processing the multi-layer feature with a second convolution module to obtain a second convolution feature: F_b = Conv_2(F_m); processing the first convolution feature with a first subunit, a second subunit and a third subunit connected in sequence to obtain a repeated-unit output feature, the first, second and third subunits having the same structure and each comprising a third convolution module, a fourth convolution module, a fifth convolution module and an addition module, the repeated-unit output feature being: F_r = U(U(U(F_a))), where U(X) = Conv_5(Conv_4(Conv_3(X))) + X; and adding the second convolution feature to the repeated-unit output feature to obtain the global feature: F_g = F_b + F_r; wherein Conv_1 represents the first convolution module, F_a represents the first convolution feature, Conv_2 represents the second convolution module, F_b represents the second convolution feature, Conv_1 and Conv_2 have 3×C input channels, C output channels and 1 × 1 convolution kernels, Conv_3 represents the third convolution module, Conv_4 represents the fourth convolution module, Conv_5 represents the fifth convolution module, Conv_3 and Conv_5 have C input channels, C output channels and 1 × 1 convolution kernels, and Conv_4 has C input channels, C output channels, a 3 × 3 convolution kernel, a groups parameter of C and padding of 1; U(F_a) represents the output feature of the first subunit, F_r represents the repeated-unit output feature, and U(U(U(·))) indicates that the subunit is applied three times in succession.
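The repeated subunit in claim 5 is a residual block applied three times in sequence. Below is a minimal pure-Python sketch under a strong simplifying assumption: the stack of three convolutions collapses into one illustrative elementwise transform (halving), chosen only so the residual structure is visible; it is not the patent's actual convolution stack.

```python
def subunit(x):
    # One subunit: a placeholder transform for the conv3 -> conv4 -> conv5
    # stack, followed by the addition module (identity shortcut).
    conv = [0.5 * v for v in x]
    return [c + v for c, v in zip(conv, x)]

def repeat_three(x):
    # The identically structured first, second and third subunits are
    # connected in sequence, i.e. the subunit is applied three times.
    for _ in range(3):
        x = subunit(x)
    return x

def conv_fusion(multi_layer):
    branch_a = list(multi_layer)  # placeholder for the first 1x1 convolution
    branch_b = list(multi_layer)  # placeholder for the second 1x1 convolution
    repeated = repeat_three(branch_a)
    # Final addition: second convolution feature + repeated-unit output.
    return [a + b for a, b in zip(repeated, branch_b)]
```

Because each subunit keeps its input via the shortcut, three applications compound the residual transform while the second branch carries the unmodified features to the final addition.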
- 6. The small target detection method based on semantic compression enhancement according to claim 1, wherein the computation in the foreground sampling enhancement module comprises: processing the global feature with a sixth convolution module, a seventh convolution module and an activation function to output a score: S = σ(Conv_7(Conv_6(F_g))); wherein S is the score, F_g is the global feature, Conv_6 is a sixth convolution module with input channel C, output channel C and a 1 × 1 convolution kernel, Conv_7 is a seventh convolution module with input channel C, output channel equal to the number of classes of the current dataset and a 1 × 1 convolution kernel, and σ is an activation function; selecting, by a Topk operation, the indexes corresponding to a preset proportion of the data according to the score: I = Topk(S, r); wherein I represents the indexes and r represents the preset proportion, set to 0.25; selecting, by a Gather operation, the features at the corresponding positions from the global feature according to the indexes as foreground features: F_f = Gather(F_g, I); wherein F_f is the foreground feature; obtaining enhanced foreground features from the foreground features and the global feature with a deformable attention module: F_f′ = DeformAttn(F_f, F_g); and filling the values of the enhanced foreground features into the corresponding positions in the global feature according to the indexes through a Scatter operation, thereby obtaining the enhanced global feature: F_g′ = Scatter(F_g, I, F_f′); wherein F_g′ is the enhanced global feature.
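The Topk/Gather/Scatter pipeline of claim 6 can be illustrated with plain Python lists. The function below is a hypothetical sketch: features are scalars rather than channel vectors, and the deformable attention module is replaced by a simple doubling stand-in, so only the select-enhance-write-back routing is shown.

```python
def foreground_enhance(global_feat, scores, ratio=0.25):
    # Topk: indexes of the highest-scoring `ratio` of positions.
    k = max(1, int(len(scores) * ratio))
    idx = sorted(range(len(scores)),
                 key=lambda i: scores[i], reverse=True)[:k]
    # Gather: pull the foreground features at those positions.
    foreground = [global_feat[i] for i in idx]
    # Stand-in for the deformable attention enhancement step.
    enhanced = [2.0 * v for v in foreground]
    # Scatter: write the enhanced values back into a copy of the global
    # feature at the same positions.
    out = list(global_feat)
    for i, v in zip(idx, enhanced):
        out[i] = v
    return out, idx

enhanced_global, picked = foreground_enhance(
    [1.0, 2.0, 3.0, 4.0], [0.1, 0.9, 0.2, 0.3])
```

With the preset ratio of 0.25, only the top quarter of positions is routed through the (costly) enhancement step, which is the source of the method's compute savings.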
- 7. The small target detection method based on semantic compression enhancement according to claim 6, wherein the computation in the spatial information recovery module comprises: upsampling the enhanced global feature with a second upsampling module to obtain a second upsampled feature: F_u = US_2(F_g′); wherein F_u represents the second upsampled feature, US_2 represents the second upsampling module, and F_g′ represents the enhanced global feature; copying the enhanced global feature to obtain a copied feature: F_c = F_g′; wherein F_c represents the copied feature; downsampling the enhanced global feature with a second downsampling module to obtain a second downsampled feature: F_d = DS_2(F_g′); wherein F_d represents the second downsampled feature and DS_2 represents the second downsampling module; after splicing the third scale feature of the multi-scale features with the second downsampled feature, performing convolution processing on the spliced feature with an eighth convolution module to obtain a first encoder output feature: E_1 = Conv_8(Concat(F_3, F_d)); wherein F_3 represents the third scale feature, Concat represents the splicing, Conv_8 represents an eighth convolution module with input channel 2×C, output channel C, stride 1 and convolution kernel size 1 × 1, and E_1 represents the first encoder output feature; upsampling the first encoder output feature with a third upsampling module, and splicing the third upsampled feature with the second scale feature of the multi-scale features and the copied feature to obtain a first spliced feature: P_1 = Concat(US_3(E_1), F_2, F_c); wherein F_2 represents the second scale feature, Concat represents the splicing, P_1 represents the first spliced feature, and US_3 represents the third upsampling module; processing the first spliced feature with a second convolution fusion module to obtain a second encoder output feature: E_2 = Fuse_2(P_1); wherein Fuse_2 represents the second convolution fusion module and E_2 represents the second encoder output feature; upsampling the second encoder output feature with a fourth upsampling module, and splicing the fourth upsampled feature with the first scale feature of the multi-scale features and the second upsampled feature to obtain a second spliced feature: P_2 = Concat(US_4(E_2), F_1, F_u); wherein P_2 represents the second spliced feature, US_4 represents the fourth upsampling module, and F_1 represents the first scale feature; and processing the second spliced feature with a third convolution fusion module to obtain a third encoder output feature: E_3 = Fuse_3(P_2); wherein Fuse_3 represents the third convolution fusion module and E_3 represents the third encoder output feature.
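The spatial information recovery path of claim 7 is a feature-pyramid-style sequence of upsample, downsample and splice steps. The toy sketch below models features as 1-D lists, with nearest-neighbour repetition for upsampling and stride-2 slicing for downsampling; every convolution and convolution fusion module is an identity stand-in, so only the routing of features between scales is shown.

```python
def upsample(x):
    # Nearest-neighbour upsampling stand-in: repeat each element twice.
    return [v for v in x for _ in (0, 1)]

def downsample(x):
    # Stride-2 downsampling stand-in: keep every other element.
    return x[::2]

def recover(enhanced_global, scale1, scale2, scale3):
    up2 = upsample(enhanced_global)          # second upsampled feature
    copied = list(enhanced_global)           # copied feature
    down2 = downsample(enhanced_global)      # second downsampled feature
    enc1 = scale3 + down2                    # splice + eighth conv (identity)
    cat1 = upsample(enc1) + scale2 + copied  # first spliced feature
    enc2 = cat1                              # 2nd convolution fusion (identity)
    cat2 = upsample(enc2) + scale1 + up2     # second spliced feature
    enc3 = cat2                              # 3rd convolution fusion (identity)
    return enc1, enc2, enc3

e1, e2, e3 = recover([1.0, 2.0], [0.0] * 4, [0.0] * 2, [9.0])
```

The three encoder output features emerge at successively finer resolutions, each splicing an upsampled coarser output with the matching backbone scale feature and one view (downsampled, copied, or upsampled) of the enhanced global feature.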
- 8. The semantic compression enhancement-based small target detection method according to claim 7, wherein the first convolution fusion module, the second convolution fusion module and the third convolution fusion module have the same structure.
Description
Small target detection method based on semantic compression enhancement

Technical Field

The invention belongs to the technical field of target detection, and particularly relates to a small target detection method based on semantic compression enhancement.

Background

In practical application scenarios such as unmanned aerial vehicle aerial photography and remote sensing image analysis, small target detection plays a vital role. Small targets in such scenes generally exhibit a low pixel proportion, sparse feature information and missing detail textures, and are easily disturbed by complex background noise, illumination changes and dense target distributions, making them far harder to detect than targets of conventional size. With the rapid development of deep neural networks, the DETR (Detection Transformer) family of target detection methods has made breakthrough progress on general target detection datasets such as COCO (Common Objects in Context) by virtue of its end-to-end detection framework and global feature modeling capability. However, when applied to typical small target detection datasets such as AI-TOD (Aerial Image Tiny Object Detection) and VisDrone (Vision Meets Drones), existing DETR methods expose a defect: they rely on the multi-scale feature input of a backbone network, and the backbone network inevitably causes a serious loss of small target detail information while acquiring high-level semantic features through layer-by-layer downsampling, so that the encoder can hardly capture features sufficient to distinguish targets from the background, and the miss rate increases significantly.
To compensate for the insufficiency of small target features, introducing shallow features into the encoder is an effective remedy, but shallow features contain a large amount of redundant background information. This not only causes computational redundancy in the feature interaction process and a sharp increase in parameter count and floating-point operations, but also aggravates the risk that small target features are submerged by background noise, and introduces noticeable inference latency that makes real-time detection requirements hard to satisfy. In addition, the traditional self-attention mechanism lacks selectivity when processing global features: its ability to distinguish the foreground features of small targets from background features is insufficient, and it cannot efficiently focus on the key information of the regions where small targets are located, which further restricts improvements in small target detection precision. In summary, how to strengthen the DETR method's ability to extract and enhance small target features while guaranteeing computational efficiency, reduce information loss during feature interaction, and achieve accurate small target detection has become a core problem to be solved for applying DETR methods in the field of small target detection.

Disclosure of Invention

To solve the above problems in the prior art, the invention provides a small target detection method based on semantic compression enhancement.
The technical problems to be solved by the invention are addressed by the following technical scheme. An embodiment of the invention provides a small target detection method based on semantic compression enhancement, comprising the following steps: obtaining image data to be detected containing a small target, and performing image scaling to obtain scaled image data; performing multi-scale feature extraction on the scaled image data with a backbone network to obtain multi-scale features; performing multi-scale semantic compression and global feature enhancement on the multi-scale features with a semantic compression enhancement encoder, and fusing the enhanced global features with the multi-scale features to obtain encoder output features; and inputting the encoder output features into a decoder to obtain a target classification result and a regression frame. In one embodiment of the invention, the scaled image data is a tensor of shape 3 × H × W, where H and W are the length and width of the scaled image data. In one embodiment of the invention, the backbone network comprises a ResNet neural network; the multi-scale features include a first scale feature, a second scale feature and a third scale feature, wherein the first scale feature is a tensor of shape C × (H/8) × (W/8), the second scale feature is a tensor of shape C × (H/16) × (W/16), and the third scale feature is a tensor of shape C × (H/32) × (W/32), where H and W are the length and width of the scaled image data. In one embodiment of the invention, the semantic compression enhancement encode