CN-122024013-A - RSFVIT-based lightweight visual global perception method

CN122024013A

Abstract

The invention discloses an RSFVIT-based lightweight visual global perception method in the technical field of visual perception models. The method comprises: constructing an RSFVIT structure comprising a three-level network; completing the training stage of the constructed RSFVIT structure based on reparameterized position codes; performing weight conversion and reparameterization on the frequency-domain and spatial-domain branches to obtain a standard spatial-domain convolution kernel; performing channel optimization on the RSFVIT structure through split-type mixed self-attention; and introducing a learnable idle-channel reparameterized feed-forward network, in which the feature channels are dynamically weighted by a learnable mask vector during the training stage, and the mask vector is thresholded into a binary mask during the inference stage. By retaining full-resolution Query in the split-type mixed self-attention and applying finer over-parameterization control to the idle channels, the method raises the upper precision limit of the model without increasing inference cost.

Inventors

  • FU RUI
  • SUN HAO
  • HAN JIAO
  • GUO YANTONG

Assignees

  • Weifang University of Science and Technology (潍坊科技学院)

Dates

Publication Date
2026-05-12
Application Date
2026-04-14

Claims (9)

  1. An RSFVIT-based lightweight visual global perception method, characterized by comprising the following steps: Step S1, constructing an RSFVIT structure comprising a three-level network, and completing the training stage of the constructed RSFVIT structure based on reparameterized position codes; Step S2, performing weight conversion and reparameterization on the frequency-domain branch and the spatial-domain branch to obtain a standard spatial-domain convolution kernel; Step S3, performing channel optimization on the RSFVIT structure through split-type mixed self-attention, wherein the split-type mixed self-attention comprises a channel splitting strategy, a spatial reduction strategy, and a gated attention strategy; and Step S4, introducing a learnable idle-channel reparameterized feed-forward network, dynamically weighting the feature channels through a learnable mask vector in the training stage, and thresholding the mask vector into a binary mask in the inference stage.
  2. The RSFVIT-based lightweight visual global perception method of claim 1, wherein the process of step S1 is: Step S101, constructing a feature extraction branch, wherein the feature extraction branch comprises a spatial-domain branch and a frequency-domain branch; Step S102, extracting features of the input through each of the constructed feature extraction branches; Step S103, dynamically fusing the extracted features based on frequency-domain/spatial-domain dual reparameterized position codes to obtain corresponding position-aware features.
  3. The RSFVIT-based lightweight visual global perception method of claim 2, wherein the spatial-domain branch is used to extract spatial-domain features of the input features, and the frequency-domain branch is used to extract frequency-domain features of the input features.
  4. The RSFVIT-based lightweight visual global perception method of claim 2, wherein the process of step S102 is as follows: the input feature is denoted as X; in the spatial-domain branch, a multi-branch convolution combination is adopted, the feature X is processed through the multi-branch convolution combination to obtain corresponding local detail features, and the local detail features are taken as the output of the spatial-domain branch, denoted F_spa, wherein: F_spa = SDB(X); wherein SDB(·) represents a parallel structure comprising a main convolution, an identity mapping, and an average pooling operator; in the frequency-domain branch, the input feature X undergoes global feature coupling through a Fourier transform and an inverse Fourier transform to obtain the output of the frequency-domain branch, denoted F_fre, wherein: F_fre = F⁻¹(W_f ⊙ F(X)); wherein W_f represents the frequency-domain weights, F⁻¹(·) represents the inverse Fourier transform operator, and F(·) represents the Fourier transform operator.
  5. The RSFVIT-based lightweight visual global perception method of claim 4, wherein the process of step S2 is as follows: Step S201, performing weight fusion on the multi-branch convolution combination in the spatial-domain branch to obtain a corresponding spatial-domain equivalent convolution kernel; Step S202, mapping the frequency-domain weights of the frequency-domain branch to the spatial domain, and extracting a corresponding equivalent frequency-domain convolution kernel; and Step S203, performing static reparameterization fusion of the spatial-domain and frequency-domain weights using the linear additivity of the convolution operator.
  6. The RSFVIT-based lightweight visual global perception method of claim 5, wherein the process of step S202 is as follows: the spatial-domain global convolution kernel obtained from the frequency-domain weights by the inverse Fourier transform is denoted K_g; the input feature X is then converted to the frequency domain using the convolution theorem, wherein the convolution theorem is: X * K_g = F⁻¹(F(X) ⊙ W_f); wherein K_g is the inverse Fourier transform of the frequency-domain weights; the frequency-domain weights are mapped to the spatial domain as follows: K_g = F⁻¹(W_f); the obtained spatial-domain global convolution kernel is cropped to obtain the equivalent frequency-domain convolution kernel corresponding to the frequency-domain weights, denoted K_fre, wherein: K_fre = Crop(K_g); wherein Crop(·) represents the cropping (interception) operator.
  7. The RSFVIT-based lightweight visual global perception method of claim 6, wherein the process of step S3 is: Step S301, splitting the input features through the channel splitting strategy to obtain an active branch and a passive branch; Step S302, compressing the search space of the attention matrix through the spatial reduction strategy; Step S303, setting the gated attention strategy so that the active branch and the passive branch perform deep fusion of features.
  8. The RSFVIT-based lightweight visual global perception method of claim 7, wherein the process of step S4 is: with the mask vector denoted as M, the dynamic weighting of the feature channels during the training stage is expressed as: F_fuse = F_act + F_pas + M ⊙ X_idle + b; wherein F_fuse represents the fused output, F_act represents the output of the active branch, F_pas represents the output of the passive branch, b is a learnable bias term, M is the learnable mask vector, and X_idle denotes the input features of the channels assigned to the idle branch; the output of the learnable idle-channel reparameterized feed-forward network is then obtained and denoted F_ffn, wherein: F_ffn = BN(W_out · F_fuse); wherein W_out represents the output projection weights and BN(·) represents the batch normalization operation.
  9. The RSFVIT-based lightweight visual global perception method of claim 8, wherein the process of step S4 further includes: in the inference stage, the mask vector M is thresholded to obtain an equivalent inference weight and an equivalent bias, denoted W_eq and b_eq respectively, wherein: W_eq = W_out · diag(M_b) · W_in; b_eq = W_out · diag(M_b) · b_in + b_out; wherein W_in represents the input projection weights, W_out represents the output projection weights, diag(M_b) represents the diagonal matrix formed from the binarized mask M_b, b_in represents the bias of the input projection layer, and b_out represents the bias of the output projection layer.
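The thresholding-and-folding described in claims 8 and 9 can be sketched in NumPy as an illustration only, not the patent's actual implementation: it assumes a plain linear input projection followed by a linear output projection, with the activation and batch normalization between them omitted, and all variable names and dimensions are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 8, 16, 8

# Hypothetical projection weights of the feed-forward network and a
# learnable mask vector over the hidden (idle) channels.
W_in  = rng.standard_normal((d_hid, d_in))
b_in  = rng.standard_normal(d_hid)
W_out = rng.standard_normal((d_out, d_hid))
b_out = rng.standard_normal(d_out)
M     = rng.random(d_hid)            # learnable mask, values in [0, 1]

# Inference stage: threshold the mask into a binary vector.
M_b = (M > 0.5).astype(np.float64)

# Fold the binary mask into equivalent inference weights and bias:
#   W_eq = W_out · diag(M_b) · W_in
#   b_eq = W_out · diag(M_b) · b_in + b_out
W_eq = W_out @ np.diag(M_b) @ W_in
b_eq = W_out @ (M_b * b_in) + b_out

x = rng.standard_normal(d_in)
y_masked = W_out @ (M_b * (W_in @ x + b_in)) + b_out  # explicit masked FFN
y_eq     = W_eq @ x + b_eq                            # reparameterized form
assert np.allclose(y_masked, y_eq)
```

Because the mask is folded into W_eq and b_eq offline, the inference path is a single projection, which is the sense in which the idle channels add no inference cost.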

Description

RSFVIT-based lightweight visual global perception method

Technical Field

The invention relates to the technical field of visual perception models, in particular to an RSFVIT-based lightweight visual global perception method.

Background

Convolutional neural networks have always faced a bottleneck in capturing global long-range dependencies due to the limitation of their local receptive fields. To break this limitation, vision transformers achieve excellent global modeling capabilities by introducing self-attention mechanisms, significantly raising the performance boundaries of computer vision tasks. However, the computational complexity of the self-attention mechanism grows quadratically with feature map resolution, which poses a serious deployment challenge for standard vision transformers on resource-constrained mobile devices. To achieve efficient inference on the mobile end, researchers have proposed a variety of optimization paths. Early on, RepVGG used structural reparameterization to improve the accuracy of a pure convolutional architecture while maintaining high speed; later, hybrid architectures such as MobileViT attempted to embed Transformer operators into lightweight CNNs, but although the parameter count is reduced, the inference latency on actual hardware remains unsatisfactory. Existing lightweight schemes still face three core conflicts: first, a perception conflict, in that acquiring a large receptive field usually comes at the cost of memory access efficiency; second, a redundancy conflict, in that a full-channel attention mechanism delivers high performance but suffers from severe feature redundancy; and third, a resource utilization conflict, in that for very small models, static idle-channel allocation often leads to a further reduction in accuracy and a waste of computing resources. An RSFVIT-based lightweight visual global perception method is therefore provided.
Disclosure of Invention

The invention aims to provide an RSFVIT-based lightweight visual global perception method. The aim is achieved by the following technical scheme: the RSFVIT-based lightweight visual global perception method comprises the following steps: Step S1, constructing an RSFVIT structure comprising a three-level network, and completing the training stage of the constructed RSFVIT structure based on reparameterized position codes; Step S2, performing weight conversion and reparameterization on the frequency-domain branch and the spatial-domain branch to obtain a standard spatial-domain convolution kernel; Step S3, performing channel optimization on the RSFVIT structure through split-type mixed self-attention, wherein the split-type mixed self-attention comprises a channel splitting strategy, a spatial reduction strategy, and a gated attention strategy; and Step S4, introducing a learnable idle-channel reparameterized feed-forward network, dynamically weighting the feature channels through a learnable mask vector in the training stage, and thresholding the mask vector into a binary mask in the inference stage. Further, the process of step S1 is as follows: Step S101, constructing a feature extraction branch, wherein the feature extraction branch comprises a spatial-domain branch and a frequency-domain branch; Step S102, extracting features of the input through each of the constructed feature extraction branches; Step S103, dynamically fusing the extracted features based on frequency-domain/spatial-domain dual reparameterized position codes to obtain corresponding position-aware features.
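The frequency-to-spatial weight mapping underlying step S2 rests on the convolution theorem: elementwise multiplication by frequency-domain weights equals circular convolution with the inverse Fourier transform of those weights. A minimal one-dimensional NumPy check, with hypothetical sizes and variable names, can sketch this equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16
x = rng.standard_normal(n)

# Hypothetical real spatial kernel; its DFT serves as the learned
# frequency-domain weight vector W_f of the frequency branch.
k_true = rng.standard_normal(n)
W_f = np.fft.fft(k_true)

# Frequency branch: elementwise product in the Fourier domain.
y_freq = np.real(np.fft.ifft(np.fft.fft(x) * W_f))

# Step S202 analogue: map the frequency weights back to an equivalent
# spatial-domain kernel via the inverse Fourier transform.
k_eq = np.real(np.fft.ifft(W_f))

# Circular convolution with k_eq reproduces the frequency-branch output.
y_spatial = np.array([sum(x[(i - j) % n] * k_eq[j] for j in range(n))
                      for i in range(n)])
assert np.allclose(y_freq, y_spatial)
```

In the patent's setting the recovered global kernel would additionally be cropped to a standard finite kernel size before fusion; the cropping step is omitted here for brevity.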
Further, the process of step S102 is as follows: the spatial-domain branch is used to extract spatial-domain features of the input features, and the frequency-domain branch is used to extract frequency-domain features of the input features; the input feature is denoted as X; in the spatial-domain branch, a multi-branch convolution combination is adopted, the feature X is processed through the multi-branch convolution combination to obtain corresponding local detail features, and the local detail features are taken as the output of the spatial-domain branch, denoted F_spa, wherein: F_spa = SDB(X); wherein SDB(·) represents a parallel structure comprising a main convolution, an identity mapping, and an average pooling operator; in the frequency-domain branch, the input feature X undergoes global feature coupling through a Fourier transform and an inverse Fourier transform to obtain the output of the frequency-domain branch, denoted F_fre, wherein: F_fre = F⁻¹(W_f ⊙ F(X)); wherein W_f represents the frequency-domain weights, F⁻¹(·) represents the inverse Fourier transform operator, and F(·) represents the Fourier transform operator. Further, the process of step S2 is as follows: Step S201, performing weight fusion on the multi-branch convolution combination in the spatial-domain branch to obtain a corresponding spatial-domain equivalent convolution kernel
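The weight fusion of the parallel SDB structure relies on the linearity of convolution: a main convolution, an identity mapping, and an average pooling can each be expressed as a 3x3 kernel and summed into one equivalent kernel, in the style of RepVGG. The following single-channel NumPy sketch, with a naive illustrative convolution routine and hypothetical names, checks that equivalence:

```python
import numpy as np

def conv2d_valid(x, k):
    """Naive 2D cross-correlation with 'valid' padding (illustration only)."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))

k_main = rng.standard_normal((3, 3))        # main convolution branch
k_id = np.zeros((3, 3)); k_id[1, 1] = 1.0   # identity mapping as a 3x3 kernel
k_avg = np.full((3, 3), 1 / 9)              # 3x3 average pooling as a kernel

# Training-time parallel structure: sum of the three branch outputs.
y_branches = (conv2d_valid(x, k_main) + conv2d_valid(x, k_id)
              + conv2d_valid(x, k_avg))

# Step S201 analogue: fuse the branch weights into one equivalent kernel.
k_fused = k_main + k_id + k_avg
y_fused = conv2d_valid(x, k_fused)
assert np.allclose(y_branches, y_fused)
```

A real implementation would fuse per-branch batch normalization into the kernels and biases before summing; that detail is omitted here.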