CN-121982480-A - Image feature extraction device based on multi-scale convolution network model and application thereof

Abstract

The invention relates to an image feature extraction device based on a multi-scale convolution network model and application thereof, belonging to the technical fields of artificial intelligence and computer vision. The device comprises a multi-scale convolution network model based on dynamic weight adaptation, and the model comprises: an adaptive heterogeneous convolution kernel dynamic weighted depthwise convolution module, used for adaptively generating and fusing the weights of multi-scale convolution kernels according to input image features; a dynamic Inception mixer, used for processing in parallel a plurality of groups of features obtained by channel segmentation; a dynamic mixing block, used for extracting and fusing multi-scale features and channel interaction features through a dual-path residual structure; and an integrated network module based on a C2f architecture, which comprises a plurality of dynamic mixing blocks connected in series and is used for aggregating the multi-scale features and outputting the final image feature representation. The invention can effectively capture multi-scale features and long-distance dependencies in the image, improves model performance while reducing the number of parameters, and is suitable for various computer vision tasks such as object detection and image segmentation.

Inventors

ZHANG HANCHAO
ZHANG RUIQIAN
ZHANG QINGHUA
HAO MINGHUI
WANG HAO
ZHANG LI

Assignees

Chinese Academy of Surveying and Mapping (中国测绘科学研究院)

Dates

Publication Date: 20260505
Application Date: 20260130

Claims (10)

1. An image feature extraction device, comprising a multi-scale convolutional network model based on dynamic weight adaptation, the model comprising: an adaptive heterogeneous convolution kernel dynamic weighted depthwise convolution module, which comprises multi-scale heterogeneous convolution kernels processed in parallel and is used for adaptively generating and fusing the weights of the multi-scale heterogeneous convolution kernels according to input image features; a dynamic Inception mixer, connected with the adaptive heterogeneous convolution kernel dynamic weighted depthwise convolution module and used for processing in parallel a plurality of groups of features obtained by channel segmentation; a dynamic mixing block, connected with the dynamic Inception mixer and used for extracting and fusing multi-scale features and channel interaction features through a dual-path residual structure; and an integrated network module based on a C2f architecture, comprising a plurality of said dynamic mixing blocks connected in series, for aggregating multi-scale features and outputting a final image feature representation.
2. The image feature extraction device of claim 1, wherein the adaptive heterogeneous convolution kernel dynamic weighted depthwise convolution module comprises: a multi-scale heterogeneous convolution kernel set comprising a square depthwise convolution kernel, a horizontal stripe depthwise convolution kernel and a vertical stripe depthwise convolution kernel; a dual-branch dynamic weight generator for fusing global context information and local statistical information to generate initial weight values; an adaptive temperature modulation unit configured to assign an independent learnable temperature parameter to each of the convolution kernels to modulate the initial weight values; a feature enhancement feedback module for applying channel attention enhancement to the output of each convolution kernel; and a dynamic weighted fusion unit for performing a weighted summation of the enhanced features using the temperature-modulated and normalized weights.
3. The image feature extraction device according to claim 2, wherein the dual-branch dynamic weight generator comprises: a global context branch, which applies global adaptive average pooling to the input image feature map and then convolution layer processing to generate global initial weight values; a local statistics branch, which applies a local depthwise convolution to the input image feature map and then global adaptive average pooling and convolution processing to generate local initial weight values; and a fusion module for performing weighted fusion of the global initial weight values and the local initial weight values through learnable fusion weights α and β, where α + β = 1.
4. The image feature extraction device of claim 1, wherein the dynamic Inception mixer comprises: a channel segmentation module for splitting the input image feature map into N groups along the channel dimension, where N ≥ 2; N parallel adaptive heterogeneous convolution kernel dynamic weighted depthwise convolution modules, each processing the features of its corresponding group, wherein the square convolution kernel size parameter k differs between at least two of the modules; a feature concatenation module for concatenating the outputs of the N modules along the channel dimension; and a cross-channel fusion layer for fusing the concatenated features.
5. The image feature extraction device of claim 1, wherein the dynamic mixing block comprises: a first feature processing path comprising a first batch normalization layer, a dynamic Inception mixer, a first learnable layer scale parameter and a first stochastic depth layer, used for extracting multi-scale image features; a second feature processing path comprising a second batch normalization layer, a channel mixing unit, a second learnable layer scale parameter and a second stochastic depth layer, used for information interaction among channels; and a dual residual connection module for adding the input image features, by residual connection, to the outputs of the first feature processing path and the second feature processing path respectively.
6. The image feature extraction device of claim 1, wherein the C2f architecture-based integrated network module comprises: an input convolution layer for performing preliminary feature transformation and channel adjustment on the input image feature map; a channel segmentation unit for splitting the input image feature map into two or more parts along the channel dimension; n dynamic mixing blocks connected in series for deep feature extraction on the split partial features; a multi-scale feature concatenation unit for concatenating, along the channel dimension, the output of the input convolution layer, the direct output of the channel segmentation unit and the output of each dynamic mixing block; and an output fusion convolution layer for fusing the concatenated features.
7. An image processing apparatus comprising a memory, a processor and a computer program stored in the memory, wherein the processor, when executing the computer program, implements the functions of the image feature extraction device according to any one of claims 1 to 6.
8. A method for extracting multi-scale convolution image features based on dynamic weight adaptation, characterized in that the multi-scale convolution image features are extracted using the multi-scale convolution network model based on dynamic weight adaptation in the image feature extraction device according to any one of claims 1 to 6, the method comprising the following steps: Step S1, inputting an input image feature map into a dual-branch dynamic weight generator to generate initial weight values that fuse global context and local statistical information; Step S2, performing adaptive temperature modulation and Softmax normalization on the initial weight values to obtain normalized weights corresponding respectively to a square depthwise convolution kernel, a horizontal stripe depthwise convolution kernel and a vertical stripe depthwise convolution kernel; Step S3, convolving the input image feature map with the square depthwise convolution kernel, the horizontal stripe depthwise convolution kernel and the vertical stripe depthwise convolution kernel respectively to obtain three convolution output feature maps; Step S4, performing feature enhancement feedback processing on each convolution output feature map to obtain three enhanced feature maps; Step S5, weighting each enhanced feature map by its corresponding normalized weight and summing the results, thereby realizing dynamic weighted summation and obtaining a fused feature map; and Step S6, performing batch normalization and nonlinear activation on the fused feature map and outputting the final feature map.
9. An object detection method, characterized in that the multi-scale convolution network model based on dynamic weight adaptation in the image feature extraction device according to any one of claims 1 to 6 is adopted as the backbone image feature extraction network and is combined with a feature pyramid network and a detection head to predict object categories and positions.
10. An image segmentation method, characterized in that an encoder-decoder structure is constructed, wherein the encoder is formed by connecting in series a plurality of the multi-scale convolutional network models based on dynamic weight adaptation in the image feature extraction device according to any one of claims 1 to 6, and the decoder gradually restores spatial resolution through upsampling and skip connections and outputs a segmentation mask.
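
The weight path recited in claims 2, 3 and 8 (fusion of global and local initial weights, per-kernel temperature modulation, Softmax normalization and weighted summation) can be illustrated with a minimal sketch. It is not the applicant's implementation. The function names, the toy numbers and the choice to apply each temperature by dividing the initial weight by it are assumptions made for illustration; the claims only state that each kernel has its own learnable temperature. The three small feature maps stand in for the enhanced outputs of the square, horizontal stripe and vertical stripe kernels (steps S3 and S4), and step S6 (batch normalization and activation) is omitted.

```python
import math


def softmax(values):
    """Numerically stable Softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_initial_weights(global_init, local_init, alpha):
    """Claim 3: alpha * global + beta * local, with beta = 1 - alpha."""
    beta = 1.0 - alpha
    return [alpha * g + beta * l for g, l in zip(global_init, local_init)]


def temperature_softmax(initial_weights, temperatures):
    """Step S2. ASSUMPTION: modulation is division by a per-kernel temperature."""
    return softmax([w / t for w, t in zip(initial_weights, temperatures)])


def weighted_sum(feature_maps, weights):
    """Step S5: element-wise weighted sum of equally sized 2D feature maps."""
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [sum(w * fm[r][c] for fm, w in zip(feature_maps, weights))
         for c in range(cols)]
        for r in range(rows)
    ]


# Hypothetical initial weights for (square, horizontal stripe, vertical stripe).
global_init = [0.2, 1.0, -0.4]
local_init = [0.6, 0.2, 0.0]
initial = fuse_initial_weights(global_init, local_init, alpha=0.5)

# One hypothetical learnable temperature per kernel.
weights = temperature_softmax(initial, temperatures=[1.0, 0.5, 2.0])

# Toy stand-ins for the three enhanced feature maps of steps S3 and S4.
square_out = [[1.0, 2.0], [3.0, 4.0]]
horizontal_out = [[0.5, 0.5], [0.5, 0.5]]
vertical_out = [[2.0, 0.0], [0.0, 2.0]]
fused = weighted_sum([square_out, horizontal_out, vertical_out], weights)
```

Under this assumption a temperature below 1 sharpens a kernel's contribution to the Softmax and a temperature above 1 flattens it, which is one plausible reading of per-kernel modulation.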

Description

Image feature extraction device based on multi-scale convolution network model and application thereof

Technical Field

The invention relates to an image feature extraction device based on a multi-scale convolution network model and application thereof, in particular to an image feature extraction device for computer vision tasks and its application in fields such as object detection and image segmentation, and belongs to the technical fields of artificial intelligence and computer vision.

Background

Convolutional neural networks (CNNs) have become the dominant models for computer vision tasks such as image classification, object detection and semantic segmentation. Conventional CNNs typically employ convolution kernels of fixed size and fixed weights, such as 3 × 3 or 5 × 5 square kernels. This design has inherent limitations. First, a fixed-size convolution kernel has a limited receptive field, making it difficult to efficiently capture objects or features at different scales in the image. Although the receptive field can be enlarged by stacking multiple convolution layers, doing so significantly increases computational complexity and network depth, which makes training difficult. Second, fixed convolution kernel weights cannot be adjusted adaptively according to differences in input image content, so generalization to objects with variable texture and shape is insufficient. Third, standard square convolution kernels are inefficient at modeling long-distance dependencies in the horizontal or vertical direction, for example for elongated objects or linear structures in images.

To address the multi-scale problem, the prior art proposes networks such as the Inception series, which capture multi-scale features by using convolution kernels of different sizes in parallel.
However, the weights of these parallel branches are typically static: the same branch importance is applied to all input samples at inference time, and the network cannot adjust it dynamically according to the input features. Dynamic convolution attempts to solve this problem by generating convolution kernel weights from the input, but existing schemes are limited to generating dynamic weights for convolution kernels of the same shape (e.g., all square) and fail to take full advantage of the complementary strengths of heterogeneous convolution kernels (e.g., square and stripe) in image feature extraction. In addition, existing dynamic weight generation mechanisms often rely only on a single global context produced by global average pooling and ignore the importance of local region statistics to weight decisions, so the generated weights may not be sufficiently fine-grained and accurate.

In the process of realizing the invention, the inventors found that the prior art has at least the following problem: existing convolution networks have difficulty adaptively and finely adjusting the fusion weights of heterogeneous multi-scale convolution kernels according to the input content, so multi-scale image feature extraction is inefficient and the ability to model long-distance dependencies is limited.

Therefore, there is an urgent need for an image feature extraction device and an application method thereof that can adaptively fuse global and local information, dynamically and finely allocate the weights of heterogeneous multi-scale convolution kernels, and balance computational efficiency with feature expression capability.
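
The background's point about square kernels and long-distance dependencies can be made concrete with some simple arithmetic. The sketch below is illustrative only: the helper names, the channel count of 64 and the span of 11 pixels are assumptions, not values from the patent. It counts the weights of a depthwise convolution (one filter per channel, bias ignored) and uses the standard receptive-field formula for stacked stride-1, undilated convolutions.

```python
def depthwise_params(kernel_h, kernel_w, channels):
    """Weights in a depthwise convolution: one kernel_h x kernel_w filter per channel."""
    return kernel_h * kernel_w * channels


def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field along one axis of stacked stride-1, undilated convolutions."""
    return 1 + num_layers * (kernel_size - 1)


channels = 64  # hypothetical channel count
span = 11      # hypothetical horizontal extent to be covered

# A single square kernel that spans 11 pixels in both directions.
square_params = depthwise_params(span, span, channels)

# A single horizontal stripe kernel that spans 11 pixels in one direction.
stripe_params = depthwise_params(1, span, channels)

# Number of stacked 3 x 3 layers needed to reach the same span.
layers_needed = 1
while stacked_receptive_field(3, layers_needed) < span:
    layers_needed += 1
stacked_params = layers_needed * depthwise_params(3, 3, channels)

print(square_params, stripe_params, layers_needed, stacked_params)
```

For a fixed span along one axis, the stripe kernel's weight count grows linearly with the span while the square kernel's grows quadratically, and the stacked alternative reaches the span only by adding depth.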
Disclosure of Invention

To solve the above problems, the invention provides an image feature extraction device based on a multi-scale convolution network model and application thereof. The device adaptively and dynamically adjusts the fusion weights of convolution kernels of different shapes and scales according to the content of the input image features, so that multi-scale image features are extracted more efficiently and accurately while long-distance dependencies are modeled efficiently.

The technical scheme adopted by the invention to solve the technical problems is as follows:

The invention provides an image feature extraction device which comprises a multi-scale convolution network model based on dynamic weight adaptation, the model comprising an adaptive heterogeneous convolution kernel dynamic weighted depthwise convolution module, a dynamic Inception mixer, a dynamic mixing block and an integrated network module based on a C2f architecture, wherein the adaptive heterogeneous convolution kernel dynamic weighted depthwise convolution module is used for adaptively generating and fusing the weights of multi-scale convolution kernels according to input image features, the dynamic Inception mixer is connected with the adaptive heterogeneous convolution kernel dynamic weighted depthwise convolution module and used for processing in parallel multiple groups of features obtained by channel segmentation, the dynami