CN-121981889-A - Underwater sonar image super-resolution reconstruction method based on a lightweight generative adversarial network
Abstract
The invention belongs to the technical field of underwater sonar image processing, and particularly relates to an underwater sonar image super-resolution reconstruction method based on a lightweight generative adversarial network. The invention first builds a sonar image super-resolution reconstruction framework, namely a lightweight generative adversarial network, comprising two main networks: a lightweight generator based on a hybrid cascade architecture and a discriminator that fuses semantic features. The output of the lightweight generative adversarial network is guided by a staged hybrid loss function to complete the reconstruction of low-resolution sonar images. The invention can generate high-resolution sonar images with clear edges and vivid textures, and effectively addresses the low resolution, heavy noise and blurred edges of sonar images.
Inventors
- DING JIE
- FU XINYAN
- ZOU BINBIN
- CHEN JINGJING
Assignees
- Fudan University (复旦大学)
- East China Sea Research Station, Institute of Acoustics, Chinese Academy of Sciences (中国科学院声学研究所东海研究站)
Dates
- Publication Date: 2026-05-05
- Application Date: 2025-12-31
Claims (6)
- 1. An underwater sonar image super-resolution reconstruction method based on a lightweight generative adversarial network, characterized by first constructing a sonar image super-resolution reconstruction framework comprising two main networks, namely a lightweight generator based on a hybrid cascade architecture and a discriminator that fuses semantic features; the network as a whole is the lightweight generative adversarial network based on the hybrid cascade architecture, and its output is guided by a staged hybrid loss function so that the model completes high-quality reconstruction of a low-resolution sonar image; the method comprises the following: the lightweight generator (LASDN) extracts features from the low-resolution sonar image and reconstructs the high-resolution image, adopting a hybrid cascade strategy to combine the inductive bias of convolution with the dynamic modeling capability of attention; the lightweight generator consists of a plurality of cascaded hybrid large-kernel axial-aware blocks (HLAB), each HLAB comprising a re-parameterizable convolution unit (RCB), a large-kernel convolution unit (LKC) and an axial-aware self-attention unit (ASA); the sonar image first undergoes primary feature mapping through a blueprint separable convolution (BSConv) to obtain shallow features, which are then input into the HLAB modules: the shallow features first undergo local feature preprocessing in the RCB, long-range spatial context is then captured in the LKC, and full-dimensional feature refinement is finally performed in the ASA; the discriminator that fuses semantic features not only receives the super-resolution image (SR) generated by the lightweight generator as input, but also introduces a pre-trained visual model (PVM), wherein the PVM extracts deep semantic representations of the image, assists the discriminator in judging authenticity through semantic feature fusion blocks (SFB) embedded in the discriminator, and guides the generator to produce realistic textures through a cross-attention mechanism; the semantic-fusion discriminator (SFD) adopts a U-Net-based encoder-decoder architecture combined with a multi-scale semantic injection mechanism, its main body consisting of a downsampling encoding path and an upsampling decoding path; specifically, the basic convolution unit consists of a convolution layer, a spectral normalization layer and a Leaky ReLU activation function connected in series, used for extracting the texture and structural features of the image; meanwhile, on the semantic injection path, semantic features from a pre-trained visual model (PVM) with frozen parameters are injected into different levels of the encoder path, a semantic feature fusion block (SFB) being arranged at the output of each encoder level; the SFB receives the semantic features from the PVM and the discriminator's downsampled image features of the current level, performs feature fusion through a cross-attention mechanism, and passes the fused features to the corresponding upsampling layer of the decoder; the final output of the decoder, after upsampling and convolution processing, represents the probability that the input image is a real high-resolution image rather than a generated one.
- 2. The underwater sonar image super-resolution reconstruction method according to claim 1, characterized in that in the lightweight generator (LASDN), the re-parameterizable convolution unit (RCB) adopts a training/inference-decoupled structural design (illustrative code sketches of the structures in claims 2 to 6 follow the claims section): in the training stage, a two-stage multi-branch parallel structure is used, wherein the first stage is formed by a point convolution branch connected in parallel with an identity mapping branch, the outputs of the two being added; the second stage is formed by depth-wise convolution branches connected in parallel, whose outputs are added together with an identity mapping branch and then passed through a GELU activation function; in the inference stage, the branches of each stage are mathematically merged, so that the unit degenerates into a series structure of a convolution layer followed by a depth-wise convolution layer whose output is activated by GELU.
- 3. The underwater sonar image super-resolution reconstruction method according to claim 1, characterized in that in the lightweight generator (LASDN), the large-kernel convolution unit (LKC) uses the idea of large-kernel convolution decomposition to reduce the number of parameters while retaining the large receptive field of the convolution kernel; the input features are divided into two paths: ① the main path is a convolution branch that passes sequentially through three convolution layers, namely a standard convolution (Conv) for channel interaction and dimension reduction, a depth-wise convolution (DW-Conv) for extracting local spatial context, and a depth-wise dilated convolution (DW-D-Conv) with a dilation rate of 3 that expands the effective receptive field to capture long-range spatial dependencies; ② the bypass is a residual branch through which the input features are transmitted directly via identity mapping; the output of the main path is added to and fused with the bypass input through a residual connection, and features containing macroscopic structure information are output.
- 4. The underwater sonar image super-resolution reconstruction method according to claim 1, characterized in that in the lightweight generator (LASDN), the axial-aware self-attention unit (ASA) adopts an orthogonal cascade architecture and performs attention calculations over the spatial axis and the channel axis sequentially through dimension transformation; in the spatial attention stage, the input features first pass through three parallel convolution layers that respectively generate a query projection $Q_s$, a key projection $K_s$ and a value projection $V_s$ of the spatial dimension, each of dimension $HW \times C$, where $HW$ is the spatial size and $C$ is the number of channels; matrix multiplication of $Q_s$ and $K_s^{\top}$ produces raw scores reflecting the dependencies among pixels, and Softmax normalization yields a spatial attention map $A_s$ of dimension $HW \times HW$; matrix multiplication of $A_s$ and $V_s$ gives the spatially weighted features $F_s$ of dimension $HW \times C$; $F_s$ then undergoes a dimension rotation operation that interchanges the spatial dimension ($HW$) and the channel dimension ($C$), so that the output feature dimension becomes $C \times HW$; in the channel attention stage, the dimension-rotated features pass through a projection layer again to generate a channel-dimension query $Q_c$, key $K_c$ and value $V_c$, each of dimension $C \times HW$; likewise, matrix multiplication of $Q_c$ and $K_c^{\top}$ followed by Softmax normalization yields a channel attention map $A_c$ of dimension $C \times C$; matrix multiplication of $A_c$ and $V_c$ gives the channel-enhanced features $F_c$ of dimension $C \times HW$; finally, the output channel features $F_c$ undergo dimension rotation again to restore the dimension to $HW \times C$, yielding the final output; through this structure, the ASA unit realizes serial full-dimensional interaction of "space first, then channel", and overcomes the large computational cost or single-dimension limitation of traditional attention mechanisms.
- 5. The underwater sonar image super-resolution reconstruction method according to claim 1, characterized in that in the discriminator that fuses semantic features, the semantic feature fusion block (SFB) aligns and injects the HR semantic features extracted by the PVM into the feature space of the discriminator through a cross-attention mechanism; the block has two inputs, namely the semantic feature vector extracted by the pre-trained model and the image features from the current layer of the discriminator; in the semantic feature processing branch, the semantic features first undergo group normalization, then the channel dimension and the spatial dimension are exchanged through dimension permutation to suit the attention calculation; after layer normalization, the features enter a self-attention module that enhances the contextual association inside the semantic features, and the self-attention output serves as the query (Q) of the subsequent cross-attention; in the image feature processing branch, the discriminator image features first pass through two convolution layers for feature extraction and dimension reduction, and are then converted into sequence form through dimension transformation; the converted features serve as the key (K) and value (V) of the cross-attention; cross-attention is then computed using the K and V generated from the image features and the Q generated from the semantic features, dynamically aggregating the relevant semantic information according to the needs of the image content; the output passes sequentially through layer normalization and a GELU activation function, and the original feature dimension is restored through dimension transformation; the output of the attention branch is residual-connected with the convolved features of the original input image, and finally passes through a convolution layer that performs channel fusion and outputs the final fused features.
- 6. The underwater sonar image super-resolution reconstruction method according to any one of claims 1 to 5, characterized in that the specific steps are as follows: Step 1, a low-resolution sonar image $I_{LR}$ is input into the lightweight generator; in the shallow feature extraction layer the image first undergoes primary feature mapping through a blueprint separable convolution, a feature channel replication strategy being adopted to expand the input image along the channel dimension and enrich the feature expression, outputting a shallow feature map $F_0$; $F_0$ then enters the deep feature extraction stage, namely a plurality of cascaded hybrid large-kernel axial-aware blocks (HLAB); Step 2, the feature map enters the first-stage processing unit of the HLAB, the re-parameterizable convolution unit (RCB), for efficient extraction and preprocessing of local features, adopting a "two-stage series" re-parameterization strategy: ① point convolution transformation is performed first, i.e., in the training stage the feature map passes simultaneously through a point convolution branch and an identity mapping branch whose outputs are added, the weights of the two branches being mathematically merged into a single point convolution path in the inference stage; ② depth-wise convolution feature extraction is then performed, in which the feature map continues through parallel depth-wise convolution branches in the training stage, merged into a single depth-wise convolution path in the inference stage; finally, the output is processed by a GELU activation function to obtain the local feature map $F_1$; Step 3, the features $F_1$ output in Step 2 are input to the second-stage processing unit of the HLAB, the large-kernel convolution unit (LKC), which captures the macroscopic spatial structure of the sonar image using the large-kernel convolution decomposition idea; the feature map is processed sequentially as follows: ① spatially local convolution, using a depth-wise convolution to extract local detail features; ② spatially long-range convolution, using a depth-wise dilated convolution with a dilation rate of 3 to expand the effective receptive field and capture the long-range dependencies of large-scale targets; ③ channel convolution, used to fuse information among channels; finally the convolution output is residual-connected with the input, and a feature map $F_2$ containing rich spatial structure information is output; Step 4, the features $F_2$ output in Step 3 are input to the third-stage processing unit of the HLAB, the axial-aware self-attention unit (ASA), which introduces an orthogonal cascade mechanism to perform full-dimensional feature refinement: ① after layer normalization of the feature map, window-based spatial self-attention first computes the dense pixel-level dependencies within local windows, accurately recovering spatial geometry and texture detail; ② channel-axis interaction then rearranges the feature view through the dimension rotation operation and computes the cross-covariance matrix among channels using shared projection weights, so as to capture deep cross-channel semantic associations; through serial circulation and deep fusion of the spatial context and the channel context, the ASA breaks the single-dimension receptive-field limitation at low parameter cost, realizes efficient feature aggregation and global optimization, and outputs the final features $F_3$; Step 5, an information distillation mechanism is adopted, the output features of each HLAB module being divided into two parts: one part serves as the input of the next module, and the other part is transmitted directly to the splicing layer at the end of the network; all retained features are concatenated along the channel dimension, passed through a fusion convolution and GELU activation, added to the shallow features $F_0$ through a residual connection, and finally upsampled through a sub-pixel convolution layer to output the super-resolution image $I_{SR}$; Step 6, the discriminator that fuses semantic features uses a CLIP pre-trained visual model with frozen parameters, namely a CLIP-ResNet visual encoder, to extract a semantic feature vector $F_{sem}$ from the high-resolution image; $F_{sem}$ is injected into the U-Net encoder-decoder structure of the discriminator, and through the cross-attention mechanism inside the semantic feature fusion block (SFB), with $F_{sem}$ as the query vector and the image features as keys and values, the discriminator is guided to attend to texture regions with specific semantics; Step 7, the network is trained with a staged hybrid loss strategy: the first stage is a geometric structure recovery stage that uses only the pixel loss $L_{pix}$ with weight coefficient $\lambda_1$ to train the generator, aiming to quickly recover the basic outline and geometric structure of the sonar image; the second stage is a perceptual quality improvement stage that introduces the VGG perceptual loss $L_{per}$ with weight coefficient $\lambda_2$; the third stage is an adversarial detail optimization stage that introduces the adversarial loss $L_{adv}$ with weight coefficient $\lambda_3$, wherein the value ranges of $\lambda_1$, $\lambda_2$ and $\lambda_3$ are respectively set within predetermined intervals; at this stage the generator and discriminator are trained alternately, with the loss functions defined as follows: (1) the total loss function of the generator:

$L_G = \lambda_1 L_{pix} + \lambda_2 L_{per} + \lambda_3 L_{adv}$, (1)

wherein:

$L_{pix} = \mathbb{E}\big[\,\|G(I_{LR}) - I_{HR}\|_1\,\big]$, (2)

$L_{per} = \mathbb{E}\big[\,\|\phi_i(G(I_{LR})) - \phi_i(I_{HR})\|_1\,\big]$, (3)

$L_{adv} = -\mathbb{E}_{\hat{x}\sim P_g}\big[\log D(\hat{x})\big]$, (4)

wherein $I_{LR}$ is the input low-resolution sonar image; $I_{HR}$ is the corresponding real high-resolution sonar image; $P_g$ denotes the data distribution of the super-resolution sonar images output by the generator, and $\hat{x}$ denotes a sample drawn from the generator output distribution (i.e., a generated image); $P_r$ denotes the data distribution of real high-resolution sonar images, and $x$ denotes a sample drawn from the real data distribution (i.e., a real image); $I_{SR} = G(I_{LR})$ is the generated super-resolution image; $\phi_i(\cdot)$ is the feature map output by the $i$-th layer of the pre-trained VGG-19 network; $D(\cdot)$ is the output of the discriminator; (2) in the discriminator, R1 and R2 gradient penalty terms are added to the original adversarial loss to constrain the gradient norm of the discriminator, making the decision boundary smoother, alleviating the mode-collapse problem in generative adversarial network training, ensuring convergence of the model, and finally generating high-resolution sonar images with vivid details; the total loss function of the discriminator is:

$L_D = -\mathbb{E}_{x\sim P_r}\big[\log D(x)\big] - \mathbb{E}_{\hat{x}\sim P_g}\big[\log(1 - D(\hat{x}))\big] + \gamma_1 R_1 + \gamma_2 R_2$, (5)

wherein $\gamma_1$ and $\gamma_2$ are the penalty weight coefficients of R1 and R2, respectively; the R1 penalty term is the squared norm of the discriminator's gradient with respect to the real high-resolution sonar image (HR), and the R2 penalty term is the squared norm of the discriminator's gradient with respect to the generator output image (SR), expressed as

$R_1 = \mathbb{E}_{x\sim P_r}\big[\,\|\nabla_x D(x)\|^2\,\big]$, (6)

$R_2 = \mathbb{E}_{\hat{x}\sim P_g}\big[\,\|\nabla_{\hat{x}} D(\hat{x})\|^2\,\big]$. (7)
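To make the "train multi-branch, infer single-path" idea of claim 2 concrete, here is a minimal PyTorch sketch of a re-parameterizable unit. The 1×1 point convolution and the 3×3/1×1 depth-wise kernel sizes are assumptions (the claim leaves the exact sizes unspecified); only the two-stage topology, the identity branches and the final GELU are taken from the claim.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RCB(nn.Module):
    """Re-parameterizable convolution unit (claim 2): multi-branch during
    training, single serial path at inference. Kernel sizes are assumptions."""
    def __init__(self, c):
        super().__init__()
        self.pw = nn.Conv2d(c, c, 1)                        # stage 1: point conv branch
        self.dw3 = nn.Conv2d(c, c, 3, padding=1, groups=c)  # stage 2: depth-wise branch A
        self.dw1 = nn.Conv2d(c, c, 1, groups=c)             # stage 2: depth-wise branch B
        self.act = nn.GELU()

    def forward(self, x):                                   # training-time graph
        x = self.pw(x) + x                                  # point conv + identity mapping
        x = self.dw3(x) + self.dw1(x) + x                   # parallel DW branches + identity
        return self.act(x)

    @torch.no_grad()
    def reparameterize(self):
        """Fold each stage's parallel branches into one conv for inference."""
        c = self.pw.weight.shape[0]
        self.pw.weight += torch.eye(c).view(c, c, 1, 1)     # absorb stage-1 identity
        k1 = F.pad(self.dw1.weight, [1, 1, 1, 1])           # lift 1x1 DW kernel to 3x3
        ident = F.pad(torch.ones(c, 1, 1, 1), [1, 1, 1, 1]) # identity as a 3x3 DW kernel
        self.dw3.weight += k1 + ident
        self.dw3.bias += self.dw1.bias
        self.forward = lambda x: self.act(self.dw3(self.pw(x)))  # single serial path
```

After `reparameterize()`, the unit computes exactly the same function as the training graph but with one convolution per stage, which is the source of the claimed inference-time lightness.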
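The LKC of claim 3 can be sketched in a few lines. The 5×5 and 7×7 kernel sizes follow the common VAN-style decomposition and are assumptions; the claim fixes only the depth-wise conv → depth-wise dilated conv (dilation 3) → channel conv chain and the residual bypass (the layer order here follows claim 6, Step 3).

```python
import torch.nn as nn

class LKC(nn.Module):
    """Large-kernel convolution unit (claim 3) via kernel decomposition."""
    def __init__(self, c):
        super().__init__()
        self.dw = nn.Conv2d(c, c, 5, padding=2, groups=c)               # local spatial context
        self.dwd = nn.Conv2d(c, c, 7, padding=9, dilation=3, groups=c)  # long-range, dilated
        self.pw = nn.Conv2d(c, c, 1)                                    # cross-channel fusion

    def forward(self, x):
        return x + self.pw(self.dwd(self.dw(x)))  # residual bypass keeps the identity path
```

The dilated 7×7 kernel has an effective extent of 19×19, which is how the decomposition reaches a large receptive field at depth-wise cost.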
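The dimension bookkeeping of claim 4 (HW×C tokens, an HW×HW spatial attention map, rotation to C×HW, then a C×C channel attention map) maps directly to tensor code. This sketch uses global single-head attention with 1×1 projections and 1/√d scaling for brevity; the claim's windowed spatial attention and projection details are simplified away.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASA(nn.Module):
    """Axial-aware self-attention (claim 4): spatial axis, dimension rotation,
    then channel axis. Global (non-windowed) attention is a simplification."""
    def __init__(self, c):
        super().__init__()
        self.qkv_s = nn.Conv2d(c, 3 * c, 1)   # spatial stage: parallel Q/K/V projections
        self.qkv_c = nn.Conv2d(c, 3 * c, 1)   # channel stage: shared projection weights

    def forward(self, x):
        b, c, h, w = x.shape
        # spatial axis: HW tokens of width C, attention map is (HW x HW)
        q, k, v = self.qkv_s(x).flatten(2).transpose(1, 2).chunk(3, dim=-1)
        attn = F.softmax(q @ k.transpose(1, 2) / c ** 0.5, dim=-1)
        xs = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        # dimension rotation: swap the spatial axis (HW) and the channel axis (C)
        q, k, v = self.qkv_c(xs).flatten(2).chunk(3, dim=1)              # (b, c, HW) each
        attn = F.softmax(q @ k.transpose(1, 2) / (h * w) ** 0.5, dim=-1)  # (b, c, c)
        return (attn @ v).reshape(b, c, h, w)                            # rotate back
```

The channel-stage product q @ kᵀ is the cross-covariance matrix among channels mentioned in claim 6, Step 4.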
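Putting the units together, the following skeleton sketches the generator dataflow of claim 1 and claim 6 Steps 1-5, reusing the RCB, LKC and ASA classes above. The channel width, block count, single-channel sonar input, the BSConv stand-in (1×1 point conv followed by a 3×3 depth-wise conv) and the 1×1 re-expansion of the non-distilled half are all assumptions; only the shallow mapping → cascaded HLABs → distillation splice → fusion → residual → sub-pixel upsampling order comes from the claims.

```python
import torch
import torch.nn as nn

class HLAB(nn.Module):
    """One hybrid large-kernel axial-aware block: RCB -> LKC -> ASA in series."""
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(RCB(c), LKC(c), ASA(c))

    def forward(self, x):
        return self.body(x)

class LASDN(nn.Module):
    """Skeleton of the lightweight generator (claims 1 and 6); hyperparameters
    and the distillation-split handling are illustrative assumptions."""
    def __init__(self, c=48, n_blocks=4, scale=4):
        super().__init__()
        self.shallow = nn.Sequential(                     # BSConv-style shallow mapping
            nn.Conv2d(1, c, 1), nn.Conv2d(c, c, 3, padding=1, groups=c))
        self.blocks = nn.ModuleList(HLAB(c) for _ in range(n_blocks))
        self.expand = nn.ModuleList(                      # re-expand the forwarded half
            nn.Conv2d(c // 2, c, 1) for _ in range(n_blocks))
        self.fuse = nn.Sequential(nn.Conv2d(n_blocks * (c // 2), c, 1), nn.GELU())
        self.up = nn.Sequential(nn.Conv2d(c, scale ** 2, 3, padding=1),
                                nn.PixelShuffle(scale))   # sub-pixel upsampling

    def forward(self, lr):                                # lr: (b, 1, h, w) LR sonar image
        shallow = self.shallow(lr)
        x, kept = shallow, []
        for blk, exp in zip(self.blocks, self.expand):
            dist, rest = blk(x).chunk(2, dim=1)           # information distillation split
            kept.append(dist)                             # one half -> end-of-network splice
            x = exp(rest)                                 # other half feeds the next module
        feat = self.fuse(torch.cat(kept, dim=1)) + shallow  # fusion conv + GELU + residual
        return self.up(feat)                              # output the SR image
```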
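The SFB of claim 5 is essentially "semantic tokens query image features". In this sketch the PVM feature map is assumed to be already resized to the discriminator level's h×w grid, the head count is 4, and the group/head divisibility of the channel counts is assumed; the normalization/attention order follows the claim.

```python
import torch
import torch.nn as nn

class SFB(nn.Module):
    """Semantic feature fusion block (claim 5): cross-attention where frozen
    PVM semantics form Q and discriminator features form K and V."""
    def __init__(self, c_img, c_sem, heads=4):
        super().__init__()
        self.gn = nn.GroupNorm(8, c_sem)                    # semantic branch: group norm
        self.ln_q = nn.LayerNorm(c_sem)
        self.sa = nn.MultiheadAttention(c_sem, heads, batch_first=True)
        self.to_q = nn.Linear(c_sem, c_img)                 # self-attn output becomes Q
        self.conv = nn.Sequential(                          # image branch: extract + reduce
            nn.Conv2d(c_img, c_img, 3, padding=1),
            nn.Conv2d(c_img, c_img, 1))
        self.ca = nn.MultiheadAttention(c_img, heads, batch_first=True)
        self.ln_o = nn.LayerNorm(c_img)
        self.act = nn.GELU()
        self.fuse = nn.Conv2d(c_img, c_img, 1)              # final channel fusion

    def forward(self, f_img, f_sem):
        # f_img: (b, c_img, h, w) discriminator features; f_sem: (b, c_sem, h, w) PVM features
        b, c, h, w = f_img.shape
        s = self.gn(f_sem).flatten(2).transpose(1, 2)       # permute to (b, hw, c_sem) tokens
        s = self.ln_q(s)
        s, _ = self.sa(s, s, s)                             # intra-semantic self-attention
        q = self.to_q(s)                                    # cross-attention queries
        x = self.conv(f_img)                                # convolved image features
        kv = x.flatten(2).transpose(1, 2)                   # sequence form -> keys/values
        a, _ = self.ca(q, kv, kv)                           # semantics query image content
        a = self.act(self.ln_o(a)).transpose(1, 2).reshape(b, c, h, w)  # restore layout
        return self.fuse(x + a)                             # residual + channel fusion
```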
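Finally, the staged hybrid loss of claim 6, Step 7, and the R1/R2 penalties of equations (5)-(7) can be written as two small training-loop helpers. The loss forms follow the reconstructed equations (1)-(7); `vgg_feat` is assumed to be a frozen VGG-19 feature extractor, `D` is assumed to end in a sigmoid, and all weight values are placeholders since the patent leaves their ranges unspecified.

```python
import torch
import torch.nn.functional as F
from torch import autograd

def generator_loss(sr, hr, d_fake, vgg_feat, stage, l1=1.0, l2=1.0, l3=0.1):
    """Staged hybrid generator loss, eqs. (1)-(4); lambda weights are placeholders."""
    loss = l1 * F.l1_loss(sr, hr)                                 # eq. (2): pixel loss
    if stage >= 2:                                                # stage 2 adds perception
        loss = loss + l2 * F.l1_loss(vgg_feat(sr), vgg_feat(hr))  # eq. (3): VGG perceptual
    if stage >= 3:                                                # stage 3 adds adversarial
        loss = loss - l3 * torch.log(d_fake.clamp_min(1e-8)).mean()  # eq. (4)
    return loss

def discriminator_loss(D, hr, sr, g1=1.0, g2=1.0):
    """Discriminator loss with R1/R2 gradient penalties, eqs. (5)-(7)."""
    hr = hr.detach().requires_grad_(True)
    sr = sr.detach().requires_grad_(True)
    d_real, d_fake = D(hr), D(sr)
    loss = (-torch.log(d_real.clamp_min(1e-8)).mean()
            - torch.log((1 - d_fake).clamp_min(1e-8)).mean())     # eq. (5), GAN part
    g_r = autograd.grad(d_real.sum(), hr, create_graph=True)[0]   # grad at real samples
    g_f = autograd.grad(d_fake.sum(), sr, create_graph=True)[0]   # grad at generated samples
    r1 = g_r.pow(2).flatten(1).sum(1).mean()                      # eq. (6): R1 penalty
    r2 = g_f.pow(2).flatten(1).sum(1).mean()                      # eq. (7): R2 penalty
    return loss + g1 * r1 + g2 * r2
```

Penalizing the gradient norm at both real (R1) and generated (R2) samples is what smooths the decision boundary and stabilizes the alternating training described in the claim.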
Description
Underwater sonar image super-resolution reconstruction method based on a lightweight generative adversarial network

Technical Field

The invention belongs to the technical field of image processing, and particularly relates to a super-resolution reconstruction method for underwater sonar images.

Background

Sonar plays an irreplaceable role in fields such as national defense security, marine resource exploration and underwater engineering monitoring. However, owing to the complexity of the underwater environment, the severe attenuation of sound waves propagating in water and the physical aperture limitations of sonar hardware, the sonar images actually acquired often suffer from low resolution, blurred edges and high-intensity noise. Such low-quality sonar images severely restrict the performance of subsequent underwater tasks. For example, in subsea pipeline inspection, sunken-ship archaeology or underwater target identification, low-resolution images lead to the loss of key texture details, making small targets difficult to distinguish from the background and thereby reducing the accuracy of target detection and classification algorithms. Although imaging resolution can be improved to some extent by upgrading to high-frequency sonar equipment, this usually entails extremely high hardware cost and smaller detection coverage. Therefore, improving image quality at the software level through signal processing or computer vision techniques, namely super-resolution (SR) technology, has become an economical and efficient solution. Traditional image super-resolution methods are mainly based on interpolation or sparse coding theory, but when processing sonar images with complex textures and severe noise interference, they struggle to recover genuine high-frequency details and may even introduce artifacts. In recent years, deep learning has become the mainstream approach in the field of image super-resolution. However, existing super-resolution models based on convolutional neural networks or Transformers are designed for natural optical images, and directly transferring them to sonar images faces many challenges. Most sonar equipment is carried on unmanned submersibles or small underwater robots, whose computing capability and battery capacity are extremely limited, while existing high-performance super-resolution models are computationally very complex and can hardly meet the requirements of real-time processing or low-power deployment. Therefore, a super-resolution reconstruction method that recovers high-definition, realistic textures while meeting lightweight deployment requirements and effectively exploiting the particular spatial structure and channel information of sonar images is of great practical significance for improving China's ocean perception capability and safeguarding underwater operations.

Disclosure of Invention

In view of the above, the present invention aims to provide an underwater sonar image super-resolution reconstruction method based on a lightweight generative adversarial network, for realizing high-resolution reconstruction of low-quality sonar images under limited computing resources.
The invention provides an underwater sonar image super-resolution reconstruction method based on a lightweight generative adversarial network. The method first constructs a sonar image super-resolution reconstruction framework comprising two main networks, namely a lightweight generator based on a hybrid cascade architecture and a discriminator that fuses semantic features; the network is called a lightweight generative adversarial network based on a hybrid cascade architecture, and its output is guided by a staged hybrid loss function so that the model completes high-quality reconstruction of low-resolution sonar images. Wherein: The lightweight generator (Lightweight Axis-Shifting Distillation Network, LASDN) is used to extract features from low-resolution sonar images and reconstruct high-resolution images, combining the inductive bias of convolution with the dynamic modeling capability of attention through a hybrid cascade strategy, see fig. 1. The lightweight generator consists of a plurality of cascaded hybrid large-kernel axial-aware blocks (HLAB), each HLAB specifically comprising a re-parameterizable convolution unit (RCB), a large-kernel convolution unit (LKC) and an axial-aware self-attention unit (ASA). The sonar image first undergoes primary feature mapping through a blueprint separable convolution (BSConv) to obtain shallow features, which are then input into the HLAB module: they first enter the re-parameterizable convolution unit (RCB) for local feature preprocessing, then enter the large-kernel convolution unit (LKC)