CN-121999278-A - Image detection method and computing device

Abstract

Embodiments of this specification provide a method and computing device for detecting images. The method comprises: processing a first image to be detected through a specified classification network to obtain attention maps corresponding to a plurality of specified network layers, wherein the plurality of specified network layers are attention-mechanism-based network layers in the specified classification network, and the attention maps represent the importance weights of the regions of the first image for the classification task; performing frequency domain analysis on each attention map to obtain frequency domain features corresponding to the first image; and determining a detection result through a detection network based on the frequency domain features, wherein the detection result at least indicates whether the first image is a generated image, thereby achieving accurate detection of generated images.

Inventors

ZHANG YI
LI JIANSHU
GAO WEIZE
DENG WENZHONG
YAO WEIBIN

Assignees

蚂蚁区块链科技(上海)有限公司 (Ant Blockchain Technology (Shanghai) Co., Ltd.)

Dates

Publication Date: 20260508
Application Date: 20260107

Claims (10)

1. A method of detecting an image, comprising: processing a first image to be detected through a specified classification network to obtain attention maps corresponding to a plurality of specified network layers, wherein the specified network layers are attention-mechanism-based network layers in the specified classification network, and each attention map represents importance weights of the regions of the first image for the classification task; performing frequency domain analysis on each attention map to obtain frequency domain features corresponding to the first image; and determining a detection result through a detection network based on the frequency domain features, wherein the detection result at least indicates whether the first image is a generated image.
2. The method of claim 1, further comprising: obtaining a prediction score corresponding to a predicted classification result obtained by the specified classification network processing the first image; determining a gradient map of the prediction score with respect to the first image; and processing the gradient map through an extraction network to obtain an extraction result, which serves as a semantic feature corresponding to the first image; wherein determining the detection result comprises: determining the detection result through the detection network based on the frequency domain features and the semantic features.
3. The method of claim 1, wherein performing frequency domain analysis on each attention map to obtain the frequency domain features corresponding to the first image comprises: performing frequency domain analysis on each attention map based on a fast Fourier transform (FFT) algorithm to obtain the frequency domain features corresponding to the first image.
4. The method of claim 1, wherein obtaining the frequency domain features corresponding to the first image comprises: performing frequency domain analysis on each attention map to obtain a frequency domain analysis result corresponding to each attention map; performing feature extraction on the frequency domain analysis result corresponding to each attention map to obtain frequency statistical features corresponding to each attention map; and obtaining the frequency domain features corresponding to the first image based on the frequency statistical features corresponding to each attention map.
5. The method of claim 2, wherein determining the detection result comprises: fusing the frequency domain features and the semantic features to obtain fused features; and obtaining the detection result through the detection network based on the fused features.
6. The method of claim 1, wherein the specified classification network is a Transformer-based neural network, and each attention map is determined based on an attention weight matrix output by the corresponding specified network layer; or the specified classification network is a convolutional neural network in which each specified network layer is followed by a module based on a spatial attention mechanism, and each attention map is determined based on the output of the spatial-attention module added after the corresponding specified network layer; or the specified classification network is a convolutional neural network provided with an SE module based on a channel attention mechanism, the plurality of specified network layers comprise a convolutional layer following the SE module and/or the last convolutional layer of the specified classification network, and each attention map is determined based on the output of the corresponding specified network layer and a prediction value corresponding to a predicted classification result obtained by the specified classification network processing the first image.
7. The method of claim 1, further comprising: determining, based on the first image, semantic features of the first image at least through an encoder corresponding to a specified image processing model, wherein every sample image in the training set of the specified image processing model is a real image; wherein determining the detection result comprises: determining the detection result through the detection network by combining the frequency domain features and the semantic features.
8. The method of claim 7, wherein the specified image processing model is an image reconstruction model; and determining the semantic features of the first image comprises: obtaining the semantic features through an encoder of the image reconstruction model based on the first image.
9. The method of claim 7, wherein the specified image processing model is an image generation model; and determining the semantic features of the first image comprises: processing the first image with an encoder corresponding to the image generation model to obtain a first latent code of the first image; obtaining a generated image through the image generation model based on the first latent code; and determining the semantic features of the first image based on an image difference between the generated image and the first image and/or a code difference between the first latent code and a latent code distribution corresponding to the image generation model.
10. A computing device comprising a memory and a processor, the memory storing executable code, wherein the processor, when executing the executable code, implements the method of any one of claims 1-9.
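
As a rough illustration of the frequency-domain branch recited in claims 1, 3 and 4, together with the Transformer alternative of claim 6, the following Python sketch turns per-layer attention weight matrices into attention maps, applies a 2-D FFT to each map, and concatenates simple band-energy statistics into one feature vector. The function names, the class-token-at-index-0 layout, the averaging over heads, and the choice of radial band energies as the "frequency statistical features" are all assumptions made for illustration; the claims do not fix any of them, and the detection network itself is not shown.

```python
import numpy as np


def attention_map_from_weights(attn, grid):
    """Build a 2-D attention map from one layer's attention weights.

    attn: array of shape (heads, 1 + N, 1 + N), assumed to hold softmax
          weights with a class token at index 0 and N = h * w patch tokens.
    grid: (h, w) patch grid of the input image.
    """
    h, w = grid
    cls_to_patches = attn[:, 0, 1:]                 # (heads, N)
    amap = cls_to_patches.mean(axis=0).reshape(h, w)
    return amap / (amap.sum() + 1e-12)              # normalise to sum 1


def band_energy_features(amap, n_bands=4):
    """Fraction of spectral power in n_bands concentric radial bands."""
    # Remove the mean so the DC term does not dominate the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(amap - amap.mean()))
    power = np.abs(spectrum) ** 2
    h, w = amap.shape
    yy, xx = np.indices((h, w))
    radius = np.hypot(yy - h // 2, xx - w // 2)     # distance from DC bin
    edges = np.linspace(0.0, radius.max() + 1e-9, n_bands + 1)
    total = power.sum() + 1e-12
    feats = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (radius >= lo) & (radius < hi)
        feats.append(power[mask].sum() / total)
    return np.array(feats)


def frequency_features(attn_per_layer, grid, n_bands=4):
    """Concatenate the band-energy statistics of every specified layer."""
    per_layer = [
        band_energy_features(attention_map_from_weights(a, grid), n_bands)
        for a in attn_per_layer
    ]
    return np.concatenate(per_layer)
```

Under these assumptions, the vector returned by `frequency_features` plays the role of the frequency domain features that would be passed, alone or fused with semantic features, to the detection network.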

Description

Image detection method and computing device

Technical Field

The present disclosure relates to the field of artificial intelligence, and in particular to a method and computing device for detecting images.

Background

With the development of deep learning technology, images generated by generative models such as generative adversarial networks (GAN) and latent diffusion models (LDM) (hereinafter referred to as generated images) are becoming increasingly realistic. These generative models can produce highly realistic face images or videos that are generally difficult to identify with the naked eye, so users cannot judge the authenticity of network content, which threatens the content security and authenticity of the network environment.

To protect the content security and authenticity of the network environment, the related art provides a generated-image detection scheme based on camera source fingerprints (Camera Source Fingerprint Detection). In this scheme, noise residuals such as photo response non-uniformity (PRNU) are used as the "fingerprint" of a photographing device: an authentic image carries the PRNU noise residual of the camera that captured it, whereas a falsified image (for example, one produced by whole-face replacement) destroys the continuity and consistency of this noise residual.
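
A minimal sketch of this kind of camera-fingerprint check is given below, under loudly stated assumptions: the 3x3 box filter is only a stand-in for the stronger denoisers that practical PRNU pipelines typically use, the reference fingerprint is a plain average of residuals, and the function names and the threshold value are hypothetical. It is meant to show the shape of the related-art scheme, not a definitive implementation of it.

```python
import numpy as np


def noise_residual(img):
    """Residual = image minus a 3x3 box-filtered copy (stand-in denoiser)."""
    h, w = img.shape
    padded = np.pad(img, 1, mode="reflect")
    smooth = sum(
        padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)
    ) / 9.0
    return img - smooth


def reference_fingerprint(real_images):
    """Average the residuals of several real frames from one camera."""
    return np.mean([noise_residual(im) for im in real_images], axis=0)


def normalized_correlation(a, b):
    """Zero-mean normalized correlation between two residual arrays."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-12
    return float((a * b).sum() / denom)


def matches_camera(img, reference, threshold=0.1):
    """True if the image residual correlates with the camera reference.

    The threshold is an illustrative value, not one taken from the source.
    """
    return normalized_correlation(noise_residual(img), reference) > threshold
```

The sketch also makes the dependency on a per-camera reference explicit: `matches_camera` cannot be evaluated unless `reference_fingerprint` has first been computed from real frames of the same device.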
In view of these characteristics of real and fake images, generated-image detection based on camera source fingerprints proceeds as follows. First, multiple frames of real images captured by the camera are acquired, the PRNU noise residual in each frame is extracted and analyzed, and a reference PRNU noise residual corresponding to the camera is calculated from the PRNU noise residuals of the multiple frames. Subsequently, after an image to be detected is acquired, the PRNU noise residual in that image is extracted and analyzed, and it is judged whether this residual matches the reference PRNU noise residual corresponding to the camera: if it matches, the image to be detected is judged to be a real image; if it does not match, the image to be detected is judged to be a fake image (that is, a generated image).

However, generated-image detection based on camera source fingerprints requires the reference PRNU noise residual of the camera to be acquired or estimated in advance, and it is not suitable for complex internet environments. For example, some social media platforms compress and process uploaded images, which can erase the PRNU noise residual in the images, so such images cannot be detected accurately.

In view of this, a new method for detecting images is needed to achieve accurate detection of generated images.

Disclosure of Invention

One or more embodiments of the present disclosure provide a method and computing device for detecting images, so as to enable accurate detection of generated images.
According to a first aspect, there is provided a method of detecting an image, comprising: processing a first image to be detected through a specified classification network to obtain attention maps corresponding to a plurality of specified network layers, wherein the specified network layers are attention-mechanism-based network layers in the specified classification network, and each attention map represents importance weights of the regions of the first image for the classification task; performing frequency domain analysis on each attention map to obtain frequency domain features corresponding to the first image; and determining a detection result through a detection network based on the frequency domain features, wherein the detection result at least indicates whether the first image is a generated image.

According to a second aspect, there is provided an apparatus for detecting an image, comprising: a processing module configured to process a first image to be detected through a specified classification network to obtain attention maps corresponding to a plurality of specified network layers, wherein the specified network layers are attention-mechanism-based network layers in the specified classification network, and each attention map represents importance weights of the regions of the first image for the classification task; an obtaining module configured to perform frequency domain analysis on each attention map to obtain frequency domain features corresponding to the first image; and a determination module configured to determine a detection result through a detection network based on the frequency domain features, wherein the detection result at least indicates whether the first image is a generated image.

According to a third aspect, there is provided a computer readable storage me