CN-121981874-A - Anti-screen-shot image watermarking network method and system based on dual-domain fusion and frequency-domain attention
Abstract
The invention provides an anti-screen-shot image watermarking network method and system based on dual-domain fusion and frequency-domain attention. The method comprises the following steps: 1. preprocessing binary watermark information; 2. constructing a dual-domain collaborative high-frequency residual enhancement network DRHE-Net; 3. constructing realistic noisy watermarked images for network training; 4. constructing a frequency-domain attention decoding network matched to the encoder; and 5. completing end-to-end training of the network with a weighted multi-objective loss function to balance watermark imperceptibility and robustness. By processing the watermark information and the carrier image with the dual-domain collaborative residual enhancement network and a frequency-domain attention mechanism, the invention achieves a precise balance between watermark robustness and imperceptibility in screen-shooting scenarios.
Inventors
- Zhou Haiyang
- Wang Baowei
- Cui Qi
- Meng Ruohan
- Du Xilei
- Yang Gaobo
Assignees
- Nanjing University of Information Science and Technology (南京信息工程大学)
- Jiangsu Yuchi Blockchain Technology Research Institute Co., Ltd. (江苏羽驰区块链科技研究院有限公司)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-23
Claims (10)
- 1. The anti-screen-shot image watermarking network method based on the dual-domain collaborative high-frequency residual enhancement network and frequency-domain attention is characterized by comprising the following steps: Step 1, preprocessing binary watermark information: the message processor MessageProcessor expands the watermark and combines it with the low-frequency wavelet components of the image, and the multi-frequency channel attention enhancement network MFCANet completes multi-scale frequency-domain feature enhancement to obtain watermark features adapted to the carrier size; Step 2, constructing a dual-domain collaborative high-frequency residual enhancement network DRHE-Net, which extracts and fuses frequency-domain robust features and spatial-domain detail features through the dual-domain fusion feature extraction module DDFE-Block and discrete wavelet downsampling in the downsampling stage, guides watermark embedding with a texture attention mechanism in the upsampling stage, directionally enhances the watermark's high-frequency information through the high-frequency residual enhancement module RHFE-Block, and finally outputs the watermarked image; Step 3, designing a differentiable screen-shot noise layer that sequentially superimposes perspective distortion, illumination distortion, moire distortion and Gaussian noise on the watermarked image to construct realistic noisy watermarked images for network training; Step 4, constructing a frequency-domain attention decoding network matched to the encoder, which recovers the original watermark information through feature extraction, multi-scale frequency-domain attention weighting and feature compression; Step 5, completing end-to-end training of the network with a weighted multi-objective loss function that integrates four loss components, finally balancing watermark imperceptibility and robustness: a mean square error constrains the pixel difference between the watermarked image and the original image, a mean square error constrains the consistency between the recovered watermark and the original watermark, a binary cross entropy in adversarial training enhances imperceptibility, and a mask-weighted mean square error guides the watermark into textured regions.
- 2. The method of claim 1, wherein step 1 comprises: Step 1-1, randomly generating 64-bit binary watermark information; Step 1-2, feeding the watermark information into the MessageProcessor, where a linear layer expands it to 256 dimensions, yielding the expanded watermark; Step 1-3, reshaping the expanded watermark into a two-dimensional feature map of real values with channel number 1, height 16 and width 16; Step 1-4, extracting features with a convolution module, outputting a feature map with 64 channels, height 16 and width 16; Step 1-5, obtaining the low-frequency (LL) components of the image wavelet transform at 3 scales: the first low-frequency component of size 16×16, the second of size 32×32 and the third of size 64×64, corresponding to the target sizes of the three upsampling steps; Step 1-6, iteratively fusing and upsampling over the three-scale low-frequency LL components: in each round, the LL component at the current scale is adjusted to the current watermark feature size and channel-concatenated with the watermark feature map upsampled in the previous round; the concatenated fusion feature map is reduced by a fusion convolution to a 64-channel feature and then upsampled 2× by a transposed convolution to obtain the watermark feature map of the current round; after three rounds, watermark features matching the original image size (height H and width W) are obtained; Step 1-7, the message processor MessageProcessor enhances the watermark features through the multi-frequency channel attention enhancement network MFCANet to output the final watermark features: first, convolution kernels with three different receptive fields execute multi-frequency branch convolutions, obtaining feature branches covering the high, middle and low frequencies; then global average pooling over the channels of each branch aggregates the spatial features into channel-level scalars; the channel-level scalars are nonlinearly transformed by a multi-layer perceptron MLP, and channel attention weights are generated by Softmax normalization (exp denotes the natural exponential function); finally, each branch feature is weighted by its corresponding channel weights, and the branch results are fused via learnable coefficients to obtain the final watermark features, where each weight acts on the spatial feature map of the corresponding channel in the corresponding feature branch.
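The message expansion and channel-attention steps of claim 2 can be sketched numerically. The following is a minimal NumPy sketch, not the patent's implementation: all weights are random placeholders, the MLP hidden size and tanh activation are assumptions, and only one MFCANet branch is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Steps 1-1/1-2: 64-bit binary message expanded to 256 dims by a linear layer
# (weights here are random placeholders, not trained parameters).
message = rng.integers(0, 2, size=64).astype(np.float64)
W_expand = rng.standard_normal((256, 64)) * 0.1
expanded = W_expand @ message                      # shape (256,)

# Step 1-3: reshape to a 1-channel 16x16 feature map.
feat = expanded.reshape(1, 16, 16)

# Step 1-7 (sketch): channel attention for one MFCANet branch --
# global average pooling -> MLP -> softmax channel weights -> reweighting.
def channel_attention(x, hidden=32):
    c = x.shape[0]
    gap = x.mean(axis=(1, 2))                      # channel-level scalars, shape (c,)
    W1 = rng.standard_normal((hidden, c)) * 0.1    # placeholder MLP weights
    W2 = rng.standard_normal((c, hidden)) * 0.1
    z = W2 @ np.tanh(W1 @ gap)                     # nonlinear transformation
    w = np.exp(z) / np.exp(z).sum()                # softmax normalization
    return x * w[:, None, None]                    # weight each channel's map

branch = rng.standard_normal((64, 16, 16))         # one 64-channel feature branch
weighted = channel_attention(branch)
print(feat.shape, weighted.shape)
```

In the full network, three such branches (from three receptive-field sizes) would be fused via learnable coefficients.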
- 3. The method according to claim 2, wherein step 2 comprises: Step 2-1, the downsampling stage of the dual-domain collaborative high-frequency residual enhancement network DRHE-Net takes the original image as input and first completes dual-domain feature extraction and fusion through the dual-domain fusion feature extraction module DDFE: a two-dimensional fast Fourier transform FFT2 converts the original image from the spatial domain to the frequency domain to obtain a frequency-domain feature representation; a radial Gaussian low-pass mask is introduced, whose response at each frequency-domain position is determined by the spatial frequency components in the x and y directions and by the standard deviation of the Gaussian filter; after learnable complex-weight modulation and low-pass filtering, the frequency-domain features are converted back to the spatial domain by taking the real part of the inverse Fourier transform, yielding the final features of the frequency-domain branch; after frequency-domain feature extraction, image detail information is captured in the spatial dimension: a 3×3 convolution captures small-scale edge features while a 5×5 convolution captures large-scale regional structural features in parallel, the two are concatenated, and a 1×1 convolution compresses the channels to obtain the spatial features; the dual-domain features are then fused: the frequency-domain and spatial features are concatenated, the dimensions are adjusted by a 1×1 convolution, and a Hadamard product is executed with the attention mask generated by the spatial-channel attention mechanism CBAM, finally obtaining fused features focused on effective information; on the features extracted by DDFE, residual wavelet downsampling reduces the scale while preserving frequency-domain structural information: Haar wavelet decomposition splits the feature map into 4 frequency-domain subbands (low-low, low-high, high-low and high-high), obtained by convolving the normalized input feature map with the Haar low-pass and high-pass kernels in the row and column directions with stride-2 downsampling; the subbands are concatenated and residual-connected with the downsampled features projected by a 1×1 convolution to obtain the first set of downsampled features; Step 2-2, for the output features of the previous stage, optimized features are obtained through a DDFE module and downsampling is completed through Haar decomposition and residual connection, yielding the output feature map of the (k+1)-th downsampling level from the optimized features of the k-th stage; Step 2-3, the upsampling stage of DRHE-Net recovers the feature size and embeds the watermark: starting from the fused features, four upsampling stages gradually restore the feature map to the original image scale; each stage first performs an upsampling operation that enlarges the feature map 2×, splices the texture features of the corresponding downsampling stage through a skip-connection mechanism, integrates the adaptive-scale watermark features generated by the message processor, and completes feature dimension regularization and fusion through a convolution layer conv2d; over the four upsampling stages the number of feature-map channels is reduced from 128 to 64, 32 and 16 in turn, finally producing a fused feature map matched to the original image dimensions; Step 2-4, the high-frequency residual enhancement module RHFE-Block performs feature enhancement: the watermark residual component separated from the feature map is decomposed by the discrete wavelet transform DWT into 1 low-frequency subband carrying basic visual information and 3 subbands corresponding to the high-frequency details of the watermark residual; the low-frequency subband is kept unchanged, while each of the three high-frequency subbands passes through two depthwise-separable convolution operations, strengthening the watermark features in the high-frequency region and suppressing screen-shot noise interference; the enhanced high-frequency subbands and the original low-frequency subband are recombined by the inverse discrete wavelet transform to obtain the high-frequency-enhanced watermark residual; Step 2-5, the high-frequency-enhanced watermark residual feature map is mapped into a 3-channel residual feature map through a 1×1 convolution; the residual feature map is superimposed pixel-wise on the original carrier image, and a hyperbolic tangent activation function constrains the output range to the (0, 1) interval to generate the final watermarked image.
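The Haar wavelet decomposition used for residual downsampling in step 2-1 (and its inverse, used by the RHFE-Block in step 2-4) admits a compact reference implementation. This NumPy sketch uses the standard orthonormal 2-D Haar transform with stride-2 downsampling; the learned convolutions around it are omitted:

```python
import numpy as np

def haar_dwt2(x):
    # 2-D Haar decomposition with stride-2 downsampling into four subbands:
    # LL (low-low), LH (low-high), HL (high-low), HH (high-high).
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0
    lh = (a - b + c - d) / 2.0
    hl = (a + b - c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    # Inverse transform: reconstructs the input exactly.
    a = (ll + lh + hl + hh) / 2.0
    b = (ll - lh + hl - hh) / 2.0
    c = (ll + lh - hl - hh) / 2.0
    d = (ll - lh - hl + hh) / 2.0
    h, w = ll.shape
    out = np.zeros((2 * h, 2 * w))
    out[0::2, 0::2], out[0::2, 1::2] = a, b
    out[1::2, 0::2], out[1::2, 1::2] = c, d
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8))
ll, lh, hl, hh = haar_dwt2(x)
rec = haar_idwt2(ll, lh, hl, hh)
print(np.allclose(rec, x))
```

In the RHFE-Block, the three high-frequency subbands would be refined by depthwise-separable convolutions before the inverse transform, while the LL subband passes through unchanged.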
- 4. The method according to claim 3, wherein step 3 comprises: Step 3-1, for the watermarked image generated in step 2, simulating screen-shot noise by mathematical modeling and finally outputting a noisy watermarked image: the watermarked image output in step 2 is taken as input and its pixel values are normalized to obtain the normalized image; Step 3-2, applying a random perspective transformation to the normalized image to simulate the geometric distortion caused by tilted shooting angles: the source points of the perspective transformation are the 4 vertices of the image, a random offset is superimposed on each source point to obtain the target points, and the perspective transformation matrix is computed by get_perspective_transform; Step 3-3, sequentially superimposing the 3 typical kinds of physical noise of the screen-shot (SC) scene to gradually build interference close to real shooting: illumination distortion simulates the brightness deviation caused by non-uniform ambient light, randomly selecting a linear-gradient or radial-gradient mode to generate an illumination mask that modulates the image brightness; moire distortion simulates the interference fringes caused by the mismatch between screen and camera pixel densities, generating a moire mask normalized to [-1, 1] and superimposing it to produce the final moire-distorted image; Gaussian noise models the electronic noise of the camera sensor, injecting noise with mean 0 and variance 0.001 drawn from a standard Gaussian distribution; Step 3-4, inversely normalizing the processed noisy image back to the conventional pixel range [0, 1] of the image to obtain the final screen-shot noise image.
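The noise layer of claim 4 can be approximated as follows. This NumPy sketch superimposes illumination, moire and Gaussian noise on a normalized image; the brightness range, moire frequencies and moire amplitude are illustrative assumptions, and the random perspective warp is only noted in the docstring rather than implemented:

```python
import numpy as np

rng = np.random.default_rng(2)

def screen_shot_noise(img, sigma2=0.001):
    """Sketch of the screen-shot noise layer (illumination, moire and
    Gaussian noise). The perspective warp is omitted here; in practice it
    would be a random homography, e.g. computed from the 4 image corners
    with jittered offsets via OpenCV's getPerspectiveTransform."""
    h, w = img.shape
    # Illumination distortion: a linear brightness-gradient mask
    # (the 0.7..1.3 brightness range is an illustrative assumption).
    ramp = np.linspace(0.7, 1.3, w)
    lit = img * ramp[None, :]
    # Moire distortion: interference of two sinusoidal gratings,
    # already in [-1, 1], superimposed with a small assumed amplitude.
    yy, xx = np.mgrid[0:h, 0:w]
    moire = np.sin(0.9 * xx) * np.sin(0.85 * yy)
    noisy = lit + 0.05 * moire
    # Sensor noise: Gaussian with mean 0, variance 0.001 (std = sqrt(0.001)).
    noisy = noisy + rng.normal(0.0, np.sqrt(sigma2), size=img.shape)
    # Inverse normalization back to the conventional [0, 1] pixel range.
    return np.clip(noisy, 0.0, 1.0)

img = rng.uniform(0.0, 1.0, size=(64, 64))
out = screen_shot_noise(img)
print(out.shape, float(out.min()), float(out.max()))
```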
- 5. The method of claim 4, wherein step 4 comprises: Step 4-1, initial feature extraction: taking the watermarked image output in step 3 and attacked by the noise layer as input, an initial feature extraction module consisting of 3 single-layer convolutions SingleConv and 3 residual blocks ResidualBlock extracts multi-scale spatial features of the input image and performs preliminary denoising to obtain the decoder's initial feature map; Step 4-2, introducing the multi-frequency channel attention enhancement network MFCANet before downsampling: frequency-domain feature statistics are computed on the initial feature map, channel weights are generated adaptively, and the initial feature map is weighted and enhanced by an element-wise product to obtain the modulated feature map; Step 4-3, performing staged downsampling, with the multi-frequency channel attention enhancement network MFCANet introduced after each downsampling stage: for the current features, MFCANet generates channel weights and an element-wise product yields the weighted feature map, finally obtaining the last-level downsampled feature map; Step 4-4, performing feature compression, message mapping and watermark output: global average pooling compresses the spatial dimensions of the feature map to obtain a one-dimensional feature vector, a two-stage Linear layer maps the features to 256 and then 64 dimensions in turn, and after a Sigmoid activation function and threshold binarization, the finally extracted 64-bit binary watermark information is obtained.
- 6. The method of claim 5, wherein step 5 comprises: Step 5-1, adopting a weighted multi-objective loss function during training, where the total loss L is the weighted sum of the encoder loss, the decoder loss, the discriminator loss and the texture guidance loss, each term with its own weight; Step 5-2, the encoder loss adopts the mean square error MSE between the watermarked image and the original image; Step 5-3, the decoder loss adopts the mean square error MSE between the watermark information recovered by the decoder from the attacked image and the original watermark; Step 5-4, the discriminator loss adopts the binary cross entropy BCE with a target label of 1; Step 5-5, the texture guidance loss adopts a mask-weighted mean square error.
- 7. The method of claim 6, wherein in step 5-5 the texture guidance loss is a mask-weighted mean square error in which a simple-region mask selects the flat regions and a simple-region fidelity strength coefficient scales their penalty, so that watermark modifications are discouraged in flat areas.
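The weighted multi-objective loss of claims 6 and 7 can be written out as follows. This NumPy sketch is illustrative: the weights `lam_*` and the fidelity strength `alpha` are placeholders rather than values from the patent, and the adversarial term is the BCE against a target label of 1 as in step 5-4:

```python
import numpy as np

rng = np.random.default_rng(4)

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def total_loss(I, I_w, m, m_hat, d_pred, mask,
               lam_e=1.0, lam_d=1.0, lam_a=0.01, lam_t=0.1, alpha=1.0):
    """Weighted multi-objective loss; lam_* and alpha are placeholders."""
    L_enc = mse(I_w, I)                            # encoder: pixel fidelity
    L_dec = mse(m_hat, m)                          # decoder: message recovery
    eps = 1e-12                                    # numerical guard for log
    L_adv = float(-np.mean(np.log(d_pred + eps)))  # BCE against target label 1
    # Texture guidance: mask-weighted MSE penalizing changes in flat regions.
    L_tex = float(np.mean(alpha * mask * (I_w - I) ** 2))
    return lam_e * L_enc + lam_d * L_dec + lam_a * L_adv + lam_t * L_tex

I = rng.uniform(0, 1, (32, 32))                            # original image
I_w = np.clip(I + 0.01 * rng.standard_normal((32, 32)), 0, 1)  # watermarked
m = rng.integers(0, 2, 64).astype(float)                   # embedded bits
m_hat = np.clip(m + 0.05 * rng.standard_normal(64), 0, 1)  # recovered bits
d_pred = rng.uniform(0.4, 0.9, 16)                         # discriminator outputs
mask = (rng.uniform(0, 1, (32, 32)) < 0.5).astype(float)   # simple-region mask

L = total_loss(I, I_w, m, m_hat, d_pred, mask)
print(L >= 0.0)
```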
- 8. An anti-screen-shot robust image watermarking system implemented based on the method of any of claims 1-7, comprising: a data preprocessing unit for receiving the carrier image, performing wavelet transformation on it, and outputting three low-frequency component features at different scales, namely the first-level low-frequency component LL1, the second-level low-frequency component LL2 and the third-level low-frequency component LL3; a message processing unit for receiving the 64-bit binary watermark message and the low-frequency component features, performing linear expansion, convolutional feature extraction, iterative fusion with the wavelet low-frequency components and frequency-domain upsampling through the message processor MessageProcessor, performing multi-frequency branch convolution and channel attention weighting through the multi-frequency channel attention enhancement network MFCANet, and outputting frequency-domain-enhanced watermark features M_feat adapted to the carrier image size; a coding unit for receiving the watermark features and the carrier image and constructing the dual-domain collaborative high-frequency residual enhancement network DRHE-Net, which extracts and fuses frequency-domain robust features and spatial-domain detail features through the DDFE module, preserves the frequency-domain structure with residual wavelet downsampling, gradually recovers the feature size in the upsampling stage, guides watermark embedding through skip texture features and a texture attention mechanism to generate a preliminary watermarked feature map, directionally enhances the watermark's high-frequency residual information through the high-frequency residual enhancement module RHFE-Block, and finally realizes adaptive fusion of watermark features and image features through Hadamard products to output the watermarked image I_w; an adversarial training unit into which the watermarked image is input, performing end-to-end cooperative training through the multi-objective weighted loss function and strengthening the imperceptibility of the watermarked image through a discriminator network; a noise unit for inputting the watermarked image into the pre-established differentiable screen-shot noise layer and sequentially superimposing perspective distortion, illumination distortion, moire distortion and Gaussian noise to construct realistic noisy watermarked images; a decoding and extraction unit into which the noisy image is input, which inherits the encoder's dual-domain processing logic and extracts the 64-bit one-dimensional binary watermark information through initial feature extraction, multi-scale frequency-domain attention weighting by the multi-frequency channel attention enhancement network MFCANet, three-level downsampling feature compression, global pooling and linear mapping.
- 9. An electronic device comprising a processor and a memory, the memory storing program code that, when executed by the processor, causes the processor to perform the steps of the method of any of claims 1 to 7.
- 10. A storage medium storing a computer program or instructions which, when run on a computer, cause the steps of the method of any one of claims 1 to 7 to be performed.
Description
Anti-screen-shot image watermarking network method and system based on dual-domain fusion and frequency-domain attention

Technical Field

The invention belongs to the field of information security, and particularly relates to an anti-screen-shot image watermarking network method and system based on dual-domain fusion and frequency-domain attention.

Background

With the popularization of digital media distribution and the easy availability of image editing tools, copyright infringement, counterfeiting and tampering of digital content have become increasingly prominent, and invisible watermarking has attracted wide attention as a core means of guaranteeing content authenticity and ownership. Its key objective is to embed imperceptible watermark information without affecting the visual quality of the original image, and to extract the watermark accurately after various interferences, thereby realizing copyright authentication and content tracing. The performance of invisible watermarking depends mainly on three core indicators: imperceptibility (no obvious visual difference between the watermarked image and the original), robustness (the watermark can still be extracted accurately under noise, compression, geometric transformation and other interference), and embedding capacity (the length of effective information that can be embedded at one time).
According to the embedding domain, traditional invisible watermarking methods fall into two categories: spatial-domain methods and frequency-domain methods. Spatial-domain methods modify pixels directly to embed the watermark; they are simple to operate but extremely fragile, being easily affected by slight interference such as compression and noise. Frequency-domain methods (such as those based on the Fourier transform and the wavelet transform) embed the watermark into frequency-domain components of the image and exploit the insensitivity of the human eye to high-frequency details, so their imperceptibility is superior to that of spatial-domain methods; however, traditional frequency-domain methods mostly rely on hand-designed features and embedding strategies and adapt poorly to complex real scenes. In recent years, the development of deep learning has promoted the rise of end-to-end invisible watermarking schemes, which realize automatic embedding and extraction of watermarks through an encoder-decoder architecture and significantly improve robustness to conventional image processing operations.
However, with the popularization of cross-device propagation scenarios such as screen shooting (ScreenShooting), existing methods face new challenges. The SC scene involves multiple physical interferences such as perspective transformation, illumination distortion, moire interference and sensor noise, forming complex mixed frequency-domain and spatial-domain disturbances, and traditional deep learning methods show obvious shortcomings. First, dual-domain feature fusion is insufficient: most methods focus only on spatial-domain feature extraction, mine frequency-domain robust information inadequately, and struggle to resist the frequency-domain disturbances of the SC scene. Second, noise simulation is oversimplified: existing schemes simulate interference with single Gaussian noise or geometric transformations and do not accurately model the SC scene's characteristic physical noise such as illumination distortion and moire, leaving model generalization insufficient. Third, attention mechanism design is one-sided: existing attention modules mostly focus on the spatial dimension and lack targeted enhancement of frequency-domain watermark-related features, so watermark extraction accuracy degrades sharply under complex noise. Fourth, multi-objective balance is poor: some methods pursue robustness excessively at the expense of imperceptibility, or find it difficult to achieve a dynamic balance of imperceptibility, robustness and embedding capacity through simple loss-function combinations.
In addition, existing watermark embedding strategies mostly adopt global embedding or single-scale feature fusion and do not fully consider differences in image texture distribution: simple flat areas are sensitive to pixel modification, while texture-rich areas are better suited to hiding the watermark, and an improper embedding strategy either exposes the watermark or leaves it insufficiently robust. Meanwhile, the decoder's feature extraction process lacks hierarchical processing of multi-scale frequency-domain information, and