CN-122024028-A - Underwater target detection encoder, detection system and method
Abstract
The invention discloses an underwater target detection encoder, detection system, and detection method. The encoder comprises an input projection layer, a global semantic modeling unit, a first-stage feature fusion unit, and a second-stage feature fusion unit. The input projection layer performs channel mapping on the multi-scale features output by a backbone network; the global semantic modeling unit establishes dependencies among spatial positions in the feature map through a self-attention mechanism; the first-stage feature fusion unit introduces a frequency-domain transformation path into the cross-scale feature fusion process, performing frequency screening on the features to generate high-frequency perception features; and the second-stage feature fusion unit introduces the first-stage intermediate features during propagation and performs splicing and reconstruction to form a multi-scale feature representation. A detection system and a detection method built on the encoder classify and localize targets in underwater images.
Inventors
- YE HAIXIONG
- XU LONGQUAN
- XU ZHE
Assignees
- Shanghai Ocean University (上海海洋大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-04-09
Claims (10)
- 1. An underwater target detection encoder, comprising: an input projection layer for performing dimension mapping on multi-scale features, unifying the channel number of each level to a preset hidden dimension; a global semantic modeling unit connected to the output end of the input projection layer, which processes the high-level feature into a feature sequence, superposes position codes, computes positional correlations of the sequence through multi-head self-attention, transforms the result through a feedforward network, and restores the sequence into a two-dimensional semantic-enhanced feature map; a first-stage feature fusion unit comprising: a first cross-scale feature fusion module for performing spatial alignment, channel splicing, and convolution on the multi-scale features to obtain first cross-scale fusion features; a high-frequency perception module, arranged in the feature processing path of the first cross-scale feature fusion module after spatial resolution alignment and before channel splicing, which performs a discrete cosine transform on the aligned input features to generate a frequency-domain coefficient matrix, screens high-frequency components with a preset frequency mask, generates high-frequency perception features through the inverse transform, and produces first-stage fusion features by weighted summation with the cross-scale fusion features; and a first bidirectional feature propagation path for bidirectionally transferring the first-stage fusion features, wherein the bottom-up path downsamples the low-level features and then splices and transforms them with the high-level features; and a second-stage feature fusion unit comprising: a second cross-scale feature fusion module for performing splicing and convolution on the output of the first stage to obtain second cross-scale fusion features; and a second bidirectional feature propagation path which introduces the first-stage intermediate propagation features during bidirectional transfer, performs splicing and convolution, and outputs a multi-scale feature representation for target detection.
- 2. The underwater target detection encoder of claim 1, wherein the multi-scale features processed by the input projection layer comprise at least a first spatial resolution feature, a second spatial resolution feature, and a third spatial resolution feature, the first spatial resolution being greater than the second and the second greater than the third; the input projection layer comprises a 1×1 convolution layer for each feature level, which performs channel compression or expansion so that the features of each level are uniformly mapped to the preset hidden dimension, and a normalization layer connected to the 1×1 convolution layer for stabilizing the feature distribution and enhancing the consistency of the feature representation.
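A 1×1 convolution is equivalent to a linear map over channels applied at every spatial position. The following numpy sketch (a hypothetical illustration; the channel counts, hidden dimension, and the simple per-channel normalization are assumptions, not the patented implementation) shows how claim 2's projection layer could unify three pyramid levels to one hidden dimension:

```python
import numpy as np

def input_projection(feat, weight, bias, eps=1e-5):
    """1x1 convolution (channel mapping) followed by per-channel
    normalization. feat: (C_in, H, W), weight: (C_hidden, C_in)."""
    c_in, h, w = feat.shape
    # a 1x1 conv is a linear map applied independently at each pixel
    x = weight @ feat.reshape(c_in, h * w) + bias[:, None]
    # per-channel normalization to stabilize the feature distribution
    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True)
    x = (x - mean) / (std + eps)
    return x.reshape(-1, h, w)

# three pyramid levels with different channel counts -> one hidden dim
hidden = 8
rng = np.random.default_rng(0)
feats = [rng.normal(size=(c, s, s)) for c, s in [(16, 32), (32, 16), (64, 8)]]
projected = [input_projection(f, rng.normal(size=(hidden, f.shape[0])),
                              np.zeros(hidden)) for f in feats]
print([p.shape for p in projected])  # all levels now have 8 channels
```

After projection every level shares the hidden channel count, so later cross-scale splicing only has to align spatial resolution.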
- 3. The underwater target detection encoder of claim 2, wherein the global semantic modeling unit is configured to: superpose a position code, generated from a sine-cosine function, onto the feature sequence; input the position-coded feature sequence into a multi-head self-attention layer, which computes attention weights from query, key, and value matrices to establish dependencies between different spatial positions; input the attention output into a feedforward network consisting of two fully connected layers and an activation function for feature transformation; and obtain the semantic-enhanced features through residual connection and normalization, restoring them into a two-dimensional feature map structure to obtain a more complete semantic-enhanced feature map.
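The flatten → position-code → attention → restore pipeline of claim 3 can be sketched in numpy. This is an assumed single-head simplification (the claim specifies multi-head attention, and omits here the feedforward, residual, and normalization stages) with randomly initialized projection matrices:

```python
import numpy as np

def sincos_pos_encoding(n, d):
    # standard sine-cosine positional encoding over sequence length n
    pos = np.arange(n)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def self_attention(x, wq, wk, wv):
    # single-head attention for clarity; the claim uses multi-head
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)  # softmax attention weights
    return w @ v

# flatten a (C, H, W) map into an (H*W, C) sequence, add position codes
c, h, wdt = 8, 4, 4
rng = np.random.default_rng(1)
seq = rng.normal(size=(c, h, wdt)).reshape(c, -1).T
seq = seq + sincos_pos_encoding(h * wdt, c)
out = self_attention(seq, *(rng.normal(size=(c, c)) for _ in range(3)))
enhanced = out.T.reshape(c, h, wdt)  # restore the 2-D enhanced map
```

Because every sequence position attends to every other, the output at each pixel depends on the whole feature map, which is exactly the global dependency the claim describes.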
- 4. The underwater target detection encoder of claim 3, wherein the cross-scale feature fusion performed by the first and second cross-scale feature fusion modules comprises: a scale alignment operation that adjusts features of different spatial resolutions to a uniform resolution, the resolution of the third spatial resolution feature being raised by interpolation upsampling; and multi-scale perception extraction, in which the aligned features pass through several parallel convolution branches with different kernel sizes to extract local structure information over different receptive fields, with point-wise convolution providing information interaction among channels.
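As a rough illustration of claim 4's two steps, the sketch below aligns three scales by interpolation upsampling and then applies parallel branches with different receptive fields. Box filters stand in for the learned convolution branches, and the averaging stands in for the point-wise channel interaction; both are assumptions for readability, not the claimed operators:

```python
import numpy as np
from scipy.ndimage import zoom, uniform_filter

def fuse_scales(feats, target_hw):
    """Scale alignment by interpolation upsampling, then channel
    concatenation (the splicing step of cross-scale fusion)."""
    aligned = []
    for f in feats:  # each f: (C, H, W)
        factor = (1, target_hw[0] / f.shape[1], target_hw[1] / f.shape[2])
        aligned.append(zoom(f, factor, order=1))  # linear interpolation
    return np.concatenate(aligned, axis=0)

def multiscale_perception(x, kernel_sizes=(3, 5, 7)):
    # parallel branches with different kernel sizes -> different
    # receptive fields; box filters stand in for learned convolutions
    branches = [uniform_filter(x, size=(1, k, k)) for k in kernel_sizes]
    return sum(branches) / len(branches)

rng = np.random.default_rng(2)
feats = [rng.normal(size=(4, s, s)) for s in (16, 8, 4)]
fused = multiscale_perception(fuse_scales(feats, (16, 16)))
print(fused.shape)  # (12, 16, 16): three 4-channel levels spliced
```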
- 5. The underwater target detection encoder of claim 4, wherein the high-frequency perception module comprises: a spatial perception path that extracts a spatial high-frequency response from the input features, wherein for preset shallow features the spatial-domain features are converted into a frequency-domain representation by a discrete cosine transform, low-frequency components are suppressed and high-frequency components retained by a frequency mask, and the spatial high-frequency response is then generated by the inverse transform; a channel perception path that extracts a channel high-frequency response from the input features, wherein pooling statistics are computed over the frequency-domain high-frequency components of the preset shallow features and channel attention weights are generated by a convolutional network; and a high-frequency feature refining path that adds the outputs of the spatial and channel perception paths and obtains the high-frequency perception features through convolution and normalization; the high-frequency perception features are fused by weighted summation, through a learnable weight, with the cross-scale fusion features output by the first cross-scale feature fusion module.
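The core of claim 5's spatial perception path is a DCT, a low-frequency-suppressing mask, and an inverse DCT. A minimal numpy/scipy sketch follows; the rectangular mask shape, the `cutoff` fraction, and the fixed fusion weight `alpha` are assumptions (the patent leaves the mask preset and makes the weight learnable):

```python
import numpy as np
from scipy.fft import dctn, idctn

def high_frequency_response(feat, cutoff=0.25):
    """DCT -> frequency mask -> inverse DCT. feat: (C, H, W).
    cutoff is the assumed fraction of low-frequency coefficients
    (top-left block of the DCT grid) that the mask suppresses."""
    c, h, w = feat.shape
    coeffs = dctn(feat, axes=(1, 2), norm='ortho')  # coefficient matrix
    mask = np.ones((h, w))
    mask[:int(h * cutoff), :int(w * cutoff)] = 0.0  # drop low frequencies
    return idctn(coeffs * mask, axes=(1, 2), norm='ortho')

def fuse_with_high_freq(fused, feat, alpha=0.5):
    # weighted summation with the cross-scale fusion features; a fixed
    # alpha stands in for the claim's learnable weight parameter
    return fused + alpha * high_frequency_response(feat)

rng = np.random.default_rng(3)
x = rng.normal(size=(4, 16, 16))
hf = high_frequency_response(x)
# sanity check: a constant image has no high-frequency content
print(np.abs(high_frequency_response(np.ones((1, 8, 8)))).max() < 1e-9)
```

Because the mask zeroes the low-frequency (including DC) coefficients, smooth regions vanish from the response while edges and textures survive, which is why the weighted sum sharpens degraded underwater boundaries.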
- 6. The underwater target detection encoder of claim 5, wherein, when performing feature transfer, the first bidirectional feature propagation path takes the feature map of each scale branch before splicing and transformation as a first-stage intermediate propagation feature and constructs an intermediate propagation feature set.
- 7. The underwater target detection encoder of claim 6, wherein the first and second bidirectional feature propagation paths each comprise: a feature refining module that refines the transferred features with a residual block comprising a convolution layer, a normalization layer, and a residual connection; and a bidirectional feature transfer path, in which the bottom-up path transfers information from the first spatial resolution feature to the third spatial resolution level by convolutional downsampling, and the top-down path conveys third spatial resolution feature information to the first spatial resolution level by interpolation upsampling; the first bidirectional feature propagation path outputs the first-stage intermediate propagation features, and the second bidirectional feature propagation path splices and fuses the second-stage propagation features with the first-stage intermediate propagation features to obtain the final multi-scale feature representation.
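One top-down pass followed by one bottom-up pass, as in claim 7, can be sketched like this. Average pooling and linear interpolation stand in for the claimed convolutional downsampling and upsampling, and element-wise averaging stands in for channel splicing plus convolution; all of these substitutions are assumptions made so the sketch stays self-contained:

```python
import numpy as np
from scipy.ndimage import zoom

def downsample(f):  # average pooling stands in for strided convolution
    c, h, w = f.shape
    return f.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample(f):    # interpolation upsampling, factor 2
    return zoom(f, (1, 2, 2), order=1)

def merge(a, b):
    # element-wise average stands in for channel splicing + convolution
    return 0.5 * (a + b)

def bidirectional_propagation(p_low, p_mid, p_high):
    """One bidirectional pass: top-down carries deep semantics up to the
    high-resolution level, bottom-up carries refined detail back down."""
    td_mid = merge(p_mid, upsample(p_high))   # top-down
    td_low = merge(p_low, upsample(td_mid))
    bu_mid = merge(td_mid, downsample(td_low))  # bottom-up
    bu_high = merge(p_high, downsample(bu_mid))
    return td_low, bu_mid, bu_high

rng = np.random.default_rng(5)
levels = [rng.normal(size=(4, s, s)) for s in (32, 16, 8)]
low, mid, high = bidirectional_propagation(*levels)
```

In the patented encoder the per-level features inside this pass (before splicing and transformation) are the intermediate propagation features that claim 6 collects for reuse in the second stage.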
- 8. The underwater target detection encoder of claim 7, wherein the high-frequency perception module balances the weighted summation with the first cross-scale fusion features using a learnable weight parameter, and wherein, when introducing the first-stage intermediate propagation features, the second bidirectional feature propagation path fuses the intermediate feature of the corresponding level with the current second-stage propagation feature by channel-dimension splicing and then performs feature reconstruction through a convolution layer.
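Claim 8's splice-then-reconstruct step reduces to channel concatenation followed by a 1×1 convolution, which is a channel-mixing matrix. A hypothetical numpy sketch (shapes and the random weight are assumptions):

```python
import numpy as np

def reconstruct(stage2_feat, stage1_inter, weight):
    """Concatenate the first-stage intermediate feature with the
    second-stage propagation feature along channels, then reconstruct
    with a 1x1 convolution (a (C_out, 2C) channel-mixing matrix)."""
    x = np.concatenate([stage2_feat, stage1_inter], axis=0)  # (2C, H, W)
    c2, h, w = x.shape
    return (weight @ x.reshape(c2, -1)).reshape(-1, h, w)

rng = np.random.default_rng(4)
a, b = rng.normal(size=(2, 8, 16, 16))  # stage-2 and stage-1 features
out = reconstruct(a, b, rng.normal(size=(8, 16)))
print(out.shape)  # (8, 16, 16)
```

Note that if the mixing matrix were `[I | 0]` the reconstruction would return the stage-2 feature unchanged, so the learned weights control how much stage-1 detail is injected.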
- 9. An underwater target detection system, comprising: an image input module for acquiring a raw underwater image and performing pixel normalization and size preprocessing; a backbone network module for extracting multi-scale features from the input image; an encoder module for encoding the multi-scale features output by the backbone network, the encoder module employing an underwater target detection encoder as claimed in any of claims 1-8; and a target detection head module comprising a classification branch and a regression branch for predicting target categories and bounding-box positions from the multi-scale features output by the encoder.
- 10. An underwater target detection method implemented with the underwater target detection system of claim 9, comprising the steps of: S1, extracting features from the input underwater image through the backbone network to obtain multi-scale feature maps of different spatial resolutions; S2, performing channel mapping on the multi-scale features through the input projection layer so that the features of each level are uniformly mapped to the preset hidden dimension; S3, inputting the highest-level feature into the global semantic modeling unit and establishing global dependencies among spatial positions in the feature map through position coding and multi-head self-attention to obtain semantic-enhanced features; S4, inputting the multi-scale features into the first-stage feature fusion unit, performing first-stage cross-scale fusion through the first cross-scale feature fusion module to obtain cross-scale fusion features, extracting frequency-domain high-frequency components through the high-frequency perception module during aggregation to generate high-frequency perception features, and fusing the high-frequency perception features with the cross-scale fusion features by weighted summation to enhance target edges and texture; S5, inputting the first-stage multi-scale features and the first-stage intermediate propagation features into the second-stage feature fusion unit, fusing the second-stage multi-scale features through the second cross-scale feature fusion module, splicing and fusing the first-stage intermediate propagation features with the second-stage propagation features during the second bidirectional propagation, and applying feature reconstruction constraints to the second-stage propagation features based on the first-stage intermediate propagation features so that shallow detail information and deep semantic information are progressively aligned; S6, inputting the multi-scale features output by the second stage into the target detection head and predicting target categories and bounding-box positions to obtain the underwater target detection result.
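The data flow of steps S1-S6 can be summarized in one short script. Every function here is a hypothetical stub (identity or zero-valued stand-ins for the trained modules, with assumed shapes); the point is only the order and shape of the hand-offs between stages:

```python
import numpy as np

HIDDEN = 8

def backbone(img):                 # S1: multi-scale feature maps
    return [np.zeros((16, s, s)) for s in (32, 16, 8)]

def project(f):                    # S2: channel mapping to hidden dim
    return f[:HIDDEN]              # slicing stands in for a 1x1 conv

def semantic_enhance(f):           # S3: attention on the deepest level
    return f                       # identity stand-in

def stage1_fusion(feats):          # S4: fusion + high-frequency weighting
    return feats, [f.copy() for f in feats]  # (fused, intermediates)

def stage2_fusion(feats, inter):   # S5: splice with stage-1 intermediates
    return [0.5 * (a + b) for a, b in zip(feats, inter)]

def detect_head(feats):            # S6: one box prediction per location
    return [np.zeros((f.shape[1] * f.shape[2], 4)) for f in feats]

feats = [project(f) for f in backbone(np.zeros((3, 256, 256)))]
feats[-1] = semantic_enhance(feats[-1])      # deepest level only
fused, inter = stage1_fusion(feats)
boxes = detect_head(stage2_fusion(fused, inter))
print([b.shape for b in boxes])  # one (H*W, 4) box array per level
```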
Description
Underwater target detection encoder, detection system and method

Technical Field

The invention relates to the technical field of computer vision and underwater image processing, and in particular to an underwater target detection encoder, a detection system and a detection method.

Background

Underwater target detection has important application value in marine ecological monitoring, underwater robot navigation, marine resource exploration and related fields. However, because the water body absorbs and scatters light, underwater images generally suffer from low contrast, color distortion and blurred edges, so the texture information and structural characteristics of the target are significantly degraded. Existing underwater target detection methods mainly use convolutional neural networks combined with multi-scale feature pyramid structures. For example, the feature pyramid network realizes multi-scale feature fusion through a top-down path and lateral connections, but such methods generally fuse features by element-wise addition or simple splicing, making it difficult to fully align shallow detail information with deep semantic information. In addition, some methods enhance image edge information through frequency-domain filtering, but this processing is usually an independent preprocessing step separated from deep feature extraction, and the high-frequency information is easily weakened by convolution and pooling operations as it propagates through the deep network. Accordingly, the prior art has the following problems: 1. insufficient ability to extract the edge and texture information of underwater targets; 2. separation of high-frequency enhancement from the feature fusion process, so that high-frequency information attenuates during feature propagation; 3. difficulty for single-stage multi-scale feature fusion to achieve deep alignment of shallow details and deep semantics; 4. the local receptive field of the convolutional network limits global semantic modeling of complex underwater scenes. In view of the above, the invention provides an underwater target detection encoder, a detection system and a detection method.

Disclosure of Invention

The invention aims to provide an underwater target detection encoder, a detection system and a detection method that embed a high-frequency perception module in the multi-scale feature fusion process so that high-frequency information is fused synchronously with the features, improving underwater target detection accuracy. In a first aspect, the invention provides an underwater target detection encoder comprising: an input projection layer for performing dimension mapping on the multi-scale features output by the backbone network so that the channel numbers of the different levels are unified to a preset hidden dimension; a global semantic modeling unit connected to the output end of the input projection layer for performing global relation modeling on the feature map with the lowest spatial resolution among the multi-scale features, building global dependencies among different spatial positions through a self-attention mechanism to obtain semantic-enhanced features; a first-stage feature fusion unit comprising: a first cross-scale feature fusion module for performing spatial resolution alignment on each scale feature processed by the input projection layer and performing channel splicing and convolution on the aligned features to obtain first cross-scale fusion features; a high-frequency perception module embedded in the feature fusion path of the first cross-scale feature fusion module, which performs a discrete cosine transform on the spatially aligned input features to generate a frequency-domain coefficient matrix, screens the matrix with a preset frequency mask, and retains the high-frequency components; a first bidirectional feature propagation path for performing bidirectional feature transfer on the first-stage fused multi-scale features, wherein in the bottom-up path the lower-level features are downsampled, spliced with the higher-level features, and transformed through a convolution unit; and a second-stage feature fusion unit comprising: a second cross-scale feature fusion module for performing channel splicing and convolution on the multi-scale f