CN-121999414-A - Video saliency prediction method based on multi-scale bidirectional gating and perception-guided fusion
Abstract
The invention discloses a video saliency prediction method based on multi-scale bidirectional gating and perception-guided fusion, aiming to solve the problems of insufficient multi-scale feature fusion and poor temporal consistency in the prior art. The method extracts multi-scale spatio-temporal features with a Video Swin Transformer, strengthens feature interaction along the top-down and bottom-up paths through a multi-scale bidirectional gated fusion module, enhances the temporal information of the highest-level features with a temporal attention module, and finally adaptively aggregates the multi-scale features through a perception-guided fusion module to generate the final saliency map. Through the combined mechanism of bidirectional gating, temporal enhancement, and perception guidance, the invention strengthens the comprehensiveness of the feature representation, ensures its temporal consistency, and further improves the accuracy and stability of video saliency prediction.
Inventors
- LI YINGXU
- ZHANG YUNZUO
- ZHAO MENGRAN
- ZHANG ZHIGUO
- WANG YUJIANG
Assignees
- 石家庄铁道大学 (Shijiazhuang Tiedao University)
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-02-02
Claims (8)
- 1. A video saliency prediction method based on multi-scale bidirectional gating and perception-guided fusion, characterized by comprising the following steps: S1, acquiring a video sequence to be predicted, selecting consecutive frames, and inputting them into the video saliency prediction network; S2, processing the video sequence with a Video Swin Transformer as the backbone encoder to extract multi-scale spatio-temporal features and obtain feature maps at different scales; S3, inputting the feature maps at different scales into the multi-scale bidirectional gated fusion module, which performs top-down and bottom-up bidirectional gated fusion of the multi-scale features to obtain fused features at all scales; S4, applying temporal attention enhancement to the highest-level feature among the scale-fused features, i.e., extracting feature-change information along the temporal dimension through a temporal attention module to generate a temporally consistent high-level reference feature; S5, inputting the high-level reference feature together with the scale-fused features into the perception-guided fusion module to achieve adaptive aggregation of the multi-scale features under perceptual guidance, and generating, through a prediction module, a saliency prediction map consistent with the input resolution (an overall pipeline sketch follows the claims).
- 2. The video saliency prediction method based on multi-scale bidirectional gating and perception-guided fusion of claim 1, wherein the Video Swin Transformer encoder serving as the backbone network comprises four hierarchical stages, and as the hierarchy deepens, the temporal and spatial dimensions of the output feature maps decrease while the channel dimension increases.
- 3. The video saliency prediction method based on multi-scale bidirectional gating and perception-guided fusion of claim 1, wherein the multi-scale bidirectional gated fusion module in step S3 comprises a top-down fusion path and a bottom-up fusion path: the top-down gated fusion path starts from the highest-level feature, first aligns the spatial resolution of each layer's features through trilinear upsampling, progressively propagates deep semantic features to the shallow features, and uses a gated fusion unit so that deep semantic information guides the shallow features; the bottom-up gated fusion path starts from the lowest-level feature, aligns the spatial resolution of each layer's features through max pooling, progressively propagates shallow detail features to the deep features, and uses a gated fusion unit to supplement the deep features with shallow detail information (an illustrative wiring sketch follows the claims).
- 4. The video saliency prediction method based on multi-scale bidirectional gating and perception-guided fusion of claim 1, wherein the gated fusion unit operates as follows: first, the feature X from the previous fusion stage and the same-level encoder feature Y are concatenated along the channel dimension; a spatially adaptive gating weight G_A is generated through a convolution followed by a Sigmoid function; this weight is then applied to the second input feature; finally, the weighted feature and the first input feature are adaptively fused through two learnable residual fusion parameters. The formulas are: G = Concat(X, Y); G_A = σ(Conv_1×3×3(G)); F_out = α·X + β·(G_A ⊙ Y); wherein G is the concatenated feature, σ denotes the Sigmoid function, Conv_1×3×3 denotes a 3D convolution with a kernel size of 1×3×3, F_out is the output feature of the gated fusion unit, and α and β are learnable scalar parameters, automatically optimized during network training, that control the weighted proportion of the two input features in the fused output (a minimal sketch of this unit follows the claims).
- 5. The video saliency prediction method based on multi-scale bidirectional gating and perception-guided fusion of claim 1, wherein in step S4 the temporal attention enhancement of the highest-level feature F among the scale-fused features proceeds as follows: the temporal attention module applies adaptive average pooling, adaptive max pooling, and standard deviation pooling to the input feature while preserving the temporal dimension, concatenates the pooled results, learns channel attention weights through a multi-layer perceptron, multiplies the attention weight TA with the original feature, and enhances the feature through a residual connection. The formulas are: TA = σ(MLP(Concat(AvgPool(F), MaxPool(F), StdPool(F)))); F' = F + TA ⊙ F; wherein σ is the Sigmoid function, MLP denotes the multi-layer perceptron, AvgPool, MaxPool, and StdPool denote the adaptive average pooling, max pooling, and standard deviation pooling operations respectively, and F' is the feature enhanced by the temporal attention module, which serves as one of the inputs of the perception-guided fusion module (see the sketch after the claims).
- 6. The video saliency prediction method based on multi-scale bidirectional gating and perception-guided fusion of claim 1, wherein in step S5 the perception-guided fusion module takes the multi-scale fused features F_fuse and the high-level reference feature F' output by the temporal attention module as inputs, and the processing comprises: first generating an attention weight PEU through the perception enhancement unit to enhance the fused features; then weighting and fusing the perceptually enhanced features F_e with the high-level reference feature F' according to learnable weights to obtain the final output feature F_out. The formulas are: PEU = σ(Conv_1×1×1(ReLU(Conv_1×1×1(Concat(F_fuse, F'))))); F_e = PEU ⊙ F_fuse; F_out = λ₁·F_e + λ₂·F'; wherein Conv_1×1×1 denotes a 1×1×1 3D convolution, ReLU is the activation function, and λ₁ and λ₂ are learnable fusion weights (see the sketch after the claims).
- 7. The method of claim 1, wherein in step S5 the prediction module progressively restores the spatial resolution of the features through cascaded three-dimensional convolution and upsampling operations to generate the final normalized saliency prediction map; the cascade consists of four groups of alternating three-dimensional convolution layers and upsampling layers, with convolution kernel sizes of 4×3×3, 2×3×3, 2×1×1, and 1×1, the upsampling operations use trilinear interpolation, and the normalized saliency prediction map is finally output through a Sigmoid function (see the sketch after the claims).
- 8. The video saliency prediction method based on multi-scale bidirectional gating and perception-guided fusion of claim 1, wherein the training of the video saliency prediction network comprises: constructing a video saliency prediction network using the network structure defined in any one of claims 1 to 7; constructing a training dataset and a validation dataset, each comprising video frame sequences and the corresponding ground-truth saliency map labels; inputting the training dataset into the prediction network, computing a loss function between the saliency prediction map output by the network and the ground-truth saliency map, and performing back-propagation optimization; and, after multiple training epochs, evaluating model performance on the validation dataset and saving the network weights with the lowest validation loss as the final model (a minimal training sketch follows the claims).
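
The following is a minimal PyTorch-style sketch of the gated fusion unit described in claim 4. It is an illustrative reading of the claim, not the patent's reference implementation: the single-channel gate, the spatial padding of the 1×3×3 convolution, and the initial values of the learnable scalars α and β are assumptions made here.

```python
# Hypothetical sketch of the gated fusion unit (claim 4).
# Feature tensors are assumed to be 5-D: (batch, channels, time, height, width).
import torch
import torch.nn as nn


class GatedFusionUnit(nn.Module):
    """Fuses a feature X from the previous fusion stage with the same-level encoder feature Y."""

    def __init__(self, channels: int):
        super().__init__()
        # Concatenated features -> single-channel, spatially adaptive gate G_A
        # (single output channel and spatial padding are assumptions).
        self.gate_conv = nn.Conv3d(2 * channels, 1, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # Two learnable scalars controlling the residual fusion proportion (alpha, beta).
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        g = torch.cat([x, y], dim=1)                   # splice along the channel dimension
        ga = torch.sigmoid(self.gate_conv(g))          # spatially adaptive gating weight G_A
        return self.alpha * x + self.beta * (ga * y)   # learnable residual fusion
```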
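Next, a sketch of how such units could be wired into the top-down and bottom-up paths of claim 3, reusing the GatedFusionUnit class above. The 1×1×1 projection convolutions used to match channel widths between pyramid levels are an assumption; the patent only specifies trilinear upsampling for the top-down path and max pooling for the bottom-up path.

```python
# Illustrative wiring of the multi-scale bidirectional gated fusion module (claim 3, step S3).
import torch.nn as nn
import torch.nn.functional as F


class BidirectionalGatedFusion(nn.Module):
    def __init__(self, channels):  # channel widths listed from the shallowest to the deepest level
        super().__init__()
        self.td_units = nn.ModuleList([GatedFusionUnit(c) for c in channels[:-1]])
        self.bu_units = nn.ModuleList([GatedFusionUnit(c) for c in channels[1:]])
        # 1x1x1 projections to match channel widths across levels (an assumption).
        self.td_proj = nn.ModuleList(
            [nn.Conv3d(channels[i + 1], channels[i], 1) for i in range(len(channels) - 1)])
        self.bu_proj = nn.ModuleList(
            [nn.Conv3d(channels[i], channels[i + 1], 1) for i in range(len(channels) - 1)])

    def forward(self, feats):  # feats[0] = shallowest level, feats[-1] = deepest level
        # Top-down path: trilinearly upsample deep semantics and gate them into shallower levels.
        td = list(feats)
        for i in reversed(range(len(feats) - 1)):
            up = F.interpolate(self.td_proj[i](td[i + 1]), size=td[i].shape[2:],
                               mode="trilinear", align_corners=False)
            td[i] = self.td_units[i](td[i], up)
        # Bottom-up path: max-pool shallow details and gate them into deeper levels.
        out = list(td)
        for i in range(1, len(feats)):
            down = F.adaptive_max_pool3d(self.bu_proj[i - 1](out[i - 1]), out[i].shape[2:])
            out[i] = self.bu_units[i - 1](out[i], down)
        return out  # fused features at all scales
```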
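A sketch of the temporal attention module of claim 5 follows. The claim preserves the temporal dimension during pooling; the reading adopted here (pooling the spatial dimensions at every time step, concatenating the three statistics along channels, and learning per-channel weights with a two-layer MLP whose reduction ratio is 4) is one plausible interpretation rather than the patent's definitive design.

```python
# One plausible implementation of the temporal attention module (claim 5, step S4).
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # MLP over the concatenated average/max/std statistics (reduction ratio assumed).
        self.mlp = nn.Sequential(
            nn.Linear(3 * channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (B, C, T, H, W); pool over space so that the time axis is preserved.
        avg = f.mean(dim=(3, 4))                              # (B, C, T)
        mx = f.amax(dim=(3, 4))                               # (B, C, T)
        std = f.std(dim=(3, 4))                               # (B, C, T)
        stats = torch.cat([avg, mx, std], dim=1)              # (B, 3C, T)
        ta = torch.sigmoid(self.mlp(stats.permute(0, 2, 1)))  # (B, T, C) attention weights
        ta = ta.permute(0, 2, 1).unsqueeze(-1).unsqueeze(-1)  # (B, C, T, 1, 1)
        return f + ta * f                                     # residual enhancement
```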
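A sketch of the perception-guided fusion module of claim 6. The exact form of the perception enhancement unit is not recoverable from the text, so the two 1×1×1 convolutions with an intermediate ReLU, the gate computed from the concatenation of the two inputs, and the pair of learnable fusion weights are all assumptions.

```python
# Hypothetical sketch of the perception-guided fusion module (claim 6, step S5).
import torch
import torch.nn as nn


class PerceptionGuidedFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Perception enhancement unit: 1x1x1 conv -> ReLU -> 1x1x1 conv -> Sigmoid gate (assumed form).
        self.enhance = nn.Sequential(
            nn.Conv3d(2 * channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=1),
        )
        # Learnable weights balancing the enhanced features and the high-level reference.
        self.lam1 = nn.Parameter(torch.ones(1))
        self.lam2 = nn.Parameter(torch.ones(1))

    def forward(self, fused: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        # fused and ref are assumed to have been brought to the same shape beforehand.
        peu = torch.sigmoid(self.enhance(torch.cat([fused, ref], dim=1)))  # attention weight PEU
        enhanced = peu * fused                                             # perceptually enhanced features
        return self.lam1 * enhanced + self.lam2 * ref                      # weighted fusion
```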
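A sketch of the cascaded prediction module of claim 7. The claim fixes the kernel sizes 4×3×3, 2×3×3, 2×1×1, and 1×1 (read here as 1×1×1), trilinear upsampling, and a final Sigmoid; the channel widths, spatial paddings, and the choice to halve channels at every stage are assumptions, and whether the output exactly matches the input resolution depends on the backbone strides, which this sketch does not pin down.

```python
# Sketch of the cascaded 3-D convolution + upsampling prediction module (claim 7).
import torch
import torch.nn as nn
import torch.nn.functional as F


class PredictionHead(nn.Module):
    def __init__(self, in_channels: int):
        super().__init__()
        c = in_channels
        # Kernel sizes follow claim 7; channel widths and paddings are assumed.
        self.convs = nn.ModuleList([
            nn.Conv3d(c,      c // 2, kernel_size=(4, 3, 3), padding=(0, 1, 1)),
            nn.Conv3d(c // 2, c // 4, kernel_size=(2, 3, 3), padding=(0, 1, 1)),
            nn.Conv3d(c // 4, c // 8, kernel_size=(2, 1, 1)),
            nn.Conv3d(c // 8, 1,      kernel_size=1),
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for conv in self.convs:
            x = conv(x)
            # Trilinear interpolation restores the spatial resolution step by step.
            x = F.interpolate(x, scale_factor=(1, 2, 2), mode="trilinear", align_corners=False)
        return torch.sigmoid(x)  # normalized saliency prediction map in [0, 1]
```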
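A skeleton showing how the modules above could be wired into the S1-S5 pipeline of claim 1. The Video Swin Transformer backbone is passed in as a black box, the channel widths (96, 192, 384, 768) are typical Video Swin values used only as placeholders, and the way the four fused levels are reduced to a common width and resolution before perception-guided fusion is an assumption of this sketch.

```python
# Skeleton of the overall network (steps S1-S5 of claim 1), composing the sketches above.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SaliencyNet(nn.Module):
    def __init__(self, backbone, channels=(96, 192, 384, 768)):
        super().__init__()
        self.backbone = backbone                         # S2: yields a 4-level feature pyramid
        self.bi_fusion = BidirectionalGatedFusion(list(channels))
        self.temporal_att = TemporalAttention(channels[-1])
        self.pg_fusion = PerceptionGuidedFusion(channels[0])
        # 1x1x1 convolutions reducing every level to a common width (an assumption).
        self.reduce = nn.ModuleList([nn.Conv3d(c, channels[0], 1) for c in channels])
        self.head = PredictionHead(channels[0])

    def forward(self, clip: torch.Tensor) -> torch.Tensor:   # clip: (B, 3, T, H, W)
        feats = self.backbone(clip)                           # S2: multi-scale spatio-temporal features
        fused = self.bi_fusion(feats)                         # S3: bidirectional gated fusion
        ref = self.temporal_att(fused[-1])                    # S4: temporally enhanced reference
        # S5: bring every level and the reference to a common shape, then fuse and predict.
        target = fused[0].shape[2:]
        agg = sum(F.interpolate(r(f), size=target, mode="trilinear", align_corners=False)
                  for r, f in zip(self.reduce, fused))
        ref = F.interpolate(self.reduce[-1](ref), size=target, mode="trilinear",
                            align_corners=False)
        return self.head(self.pg_fusion(agg, ref))            # normalized saliency prediction map
```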
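Finally, a minimal training sketch for claim 8. The claim does not name the loss function, optimizer, or learning rate; binary cross-entropy, Adam, and lr=1e-4 are used here purely as placeholders, and the checkpoint path is hypothetical.

```python
# Minimal training/validation sketch for claim 8; loss, optimizer and hyper-parameters are assumptions.
import torch
import torch.nn as nn


def train_saliency_net(model, train_loader, val_loader, epochs: int, device: str = "cuda"):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.BCELoss()                     # compares predicted and ground-truth saliency maps
    best_val = float("inf")
    for _ in range(epochs):
        model.train()
        for clips, gt_maps in train_loader:      # video frame sequences and their saliency labels
            pred = model(clips.to(device))
            loss = criterion(pred, gt_maps.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # Keep the weights with the lowest loss on the validation set as the final model.
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(c.to(device)), g.to(device)).item()
                           for c, g in val_loader) / max(len(val_loader), 1)
        if val_loss < best_val:
            best_val = val_loss
            torch.save(model.state_dict(), "best_model.pth")  # hypothetical checkpoint path
```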
Description
Video saliency prediction method based on multi-scale bidirectional gating and perception-guided fusion

Technical Field

The invention relates to the technical field of video processing and computer vision, and in particular to a video saliency prediction method based on multi-scale bidirectional gating and perception-guided fusion.

Background

Video saliency prediction aims to simulate the human visual attention mechanism, predict the salient regions in a video that attract human gaze, and represent them in the form of a saliency map. The results have wide application value in tasks such as video compression, object detection, visual coding, and behavior recognition. Existing video saliency prediction methods are mostly based on three-dimensional convolution (3D CNN) or video Transformer (e.g., Video Swin Transformer) structures, capturing salient regions by stacking spatio-temporal feature extraction modules. However, these methods share several problems. First, a simple top-down or bottom-up feature fusion path is adopted in the decoding stage, so deep semantic information and shallow detail information are difficult to combine effectively, leaving the boundaries of salient regions blurred or their interior responses incomplete. Second, modeling of long-range dependencies in the temporal dimension is insufficient, which easily causes jitter and instability in the predictions of consecutive frames when handling complex dynamic scenes, i.e., a lack of temporal consistency. Third, when aggregating multi-scale features, simple concatenation or addition is generally adopted without an adaptive feature screening and enhancement mechanism, which may introduce redundant information or noise and affect the final prediction accuracy. To solve these problems, the invention provides a video saliency prediction method based on multi-scale bidirectional gating and perception-guided fusion, which markedly improves the accuracy, spatial detail completeness, and temporal stability of video saliency prediction by constructing a bidirectional gated feature fusion framework, introducing temporal attention, and designing a perception-guided fusion module.

Disclosure of Invention

The invention aims to overcome the defects of the prior art, namely low saliency prediction accuracy and poor temporal consistency caused by insufficient multi-scale feature fusion and insufficient temporal dependency modeling, and provides a video saliency prediction method based on multi-scale bidirectional gating and perception-guided fusion, which achieves accurate and stable prediction of salient video regions.
In order to achieve the above purpose, the present invention provides a video saliency prediction method based on multi-scale bidirectional gating and perception-guided fusion, comprising the following steps: S1, acquiring a video sequence to be predicted, selecting consecutive frames, and inputting them into the video saliency prediction network; S2, processing the video sequence with a Video Swin Transformer as the backbone encoder to extract multi-scale spatio-temporal features and obtain feature maps at different scales; S3, inputting the feature maps at different scales into the multi-scale bidirectional gated fusion module, which performs top-down and bottom-up bidirectional gated fusion of the multi-scale features to obtain fused features at all scales; S4, applying temporal attention enhancement to the highest-level feature among the scale-fused features, i.e., extracting feature-change information along the temporal dimension through a temporal attention module to generate a temporally consistent high-level reference feature; S5, inputting the high-level reference feature together with the scale-fused features into the perception-guided fusion module to achieve adaptive aggregation of the multi-scale features under perceptual guidance, and generating, through a prediction module, a saliency prediction map consistent with the input resolution.

A further technical scheme is that the Video Swin Transformer encoder serving as the backbone network comprises four hierarchical stages; as the hierarchy deepens, the temporal and spatial dimensions of the output feature maps decrease and the channel dimension increases. This pyramid structure can effectively capture multi-level spatio-temporal information, from low-level texture details to high-level semantic concepts.

A further technical scheme is that the multi-scale bidirectional gated fusion module in step S3 comprises a top-down fusion path and a bottom-up fusion path, wherein the top-down gated fusion path starts from the highest-level feature: deep semantic features are progressively propagated to the shallow features through a trilinear upsampling operation, and a gated fusion unit is used to realize this guidance, wherein a bottom-up gati