
CN-122021787-A - Black-box adversarial attack method based on deep reinforcement learning

CN 122021787 A

Abstract

The invention discloses a black-box adversarial attack method based on deep reinforcement learning, relating to the technical field of deep learning. The method comprises the following steps: obtaining an original image and its corresponding true class label, and designating a target attack class different from the true class; randomly selecting an image of the target attack class from an auxiliary data set as an initial adversarial example, and computing the difference between the initial adversarial example and the original image as the initial noise. The invention converts the high-dimensional pixel-space optimization problem into a low-dimensional sequential decision problem that can be solved efficiently, so that an agent learns to minimize the change to the original image while preserving attack efficacy. The generated adversarial example is visually highly similar to the original image and hard to perceive, which markedly improves the practical threat and concealment of the attack.
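The bookkeeping described above (an adversarial example is the original image plus whatever noise is still retained) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the image size and the random arrays standing in for the original image and the auxiliary target-class image are assumptions.

```python
import numpy as np

# Hypothetical 8x8 grayscale images in [0, 1]: stand-ins for the original
# image and an auxiliary image already classified as the target class.
rng = np.random.default_rng(0)
original = rng.random((8, 8))
target_image = rng.random((8, 8))

# The initial noise is the pixel-wise difference between the initial
# adversarial example (the target-class image) and the original image.
noise = target_image - original

# Any later adversarial example is the original plus the retained noise;
# with all noise kept, we recover the target-class image exactly.
adversarial = original + noise
assert np.allclose(adversarial, target_image)
```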

Inventors

  • ZHOU CHAO
  • HE SHENGHUA
  • PU XIYI
  • LI QILIN
  • ZHONG QINGLING
  • MO JIAXIN
  • YU JIAYAO
  • WU JIAQI

Assignees

  • 湖州学院 (Huzhou College)

Dates

Publication Date
2026-05-12
Application Date
2026-01-19

Claims (10)

  1. A black-box adversarial attack method based on deep reinforcement learning, characterized by comprising the following steps: S1, acquiring an original image and its corresponding true class label, and designating a target attack class different from the true class; S2, randomly selecting an image from an auxiliary data set of the target attack class as an initial adversarial example, and computing the difference between the initial adversarial example and the original image as the initial noise; S3, randomly shuffling the pixels of the current adversarial example and evenly dividing them into a plurality of pixel groups, wherein each pixel group serves as a single action unit and executing an action corresponds to zeroing all noise values of the selected pixel group, thereby constructing an action space whose dimension is far lower than the total number of pixels in the image; S4, training an agent to learn a noise-removal policy based on a deep Q-network, taking the current adversarial example image as the input state and the actions in the action space as selection objects, wherein the agent performs sequential decision optimization through the deep Q-network and selects actions with an ε-greedy policy; S5, using the trained agent to progressively select and remove the noise of corresponding pixel groups, generating a new adversarial example, and verifying after each removal whether the adversarial example is still misclassified as the target class by the target model; S6, within a preset number of model queries, iteratively re-shuffling the pixel grouping, updating the action space, retraining the agent, and executing a new round of noise removal so as to continuously optimize the adversarial example; and S7, outputting a final adversarial example that satisfies the attack-success condition and retains the least noise.
  2. The black-box adversarial attack method based on deep reinforcement learning according to claim 1, wherein in S1 the method targets a hard-label black-box attack scenario: an attacker can only query the target model and obtain its final decision, and cannot access the model's internal parameters, structural information, or gradient information.
  3. The black-box adversarial attack method based on deep reinforcement learning according to claim 1, wherein in S2 the initial adversarial example is generated by assuming that the attacker can obtain a small auxiliary data set containing images of the target class, randomly selecting an image from that data set as the starting point, defining the initial noise as the difference between the initial adversarial example and the original image, and ensuring that the initial adversarial example is classified as the target class by the target model.
  4. The black-box adversarial attack method based on deep reinforcement learning according to claim 1, wherein in step S3 the pixel grouping is implemented as follows: first, the order of all pixels of the current adversarial example is randomly shuffled, and the pixels are then evenly divided in sequence into n groups of m pixels each, so that the dimension of the action space is n and m × n equals the total number of pixels in the image; a binary mask vector W is constructed for the pixel groups, with every element initialized to 1, where a value of 1 indicates that the noise of the corresponding pixel group is retained and a value of 0 indicates that it is removed; in the subsequent noise-removal process, executing an action corresponds to zeroing the element of W corresponding to the selected pixel group, i.e., a noise-removal operation is recorded by updating the binary mask vector; when a new adversarial example is generated, the value of each element of the binary mask vector is expanded to all pixels of the corresponding pixel group and multiplied element-wise with the initial noise, so that noise is removed group by group.
  5. The black-box adversarial attack method based on deep reinforcement learning according to claim 1, wherein in S4 the training process is modeled as a Markov decision process, and the reward function is designed as follows: when, after a noise removal, the adversarial example is still classified as the target class by the target model, a positive reward is given that is positively correlated with the number of non-zero pixels in the removed noise and multiplied by a predefined scale factor; when a noise removal causes the adversarial example to lose its adversarial property, a larger negative penalty value is given and the current training round is terminated immediately.
  6. The black-box adversarial attack method based on deep reinforcement learning according to claim 1, wherein the deep Q-network in step S4 adopts a convolutional neural network architecture, specifically comprising: a first convolutional layer with 32 filters of size 3 × 3, followed by a ReLU activation function; a second convolutional layer with 64 filters of size 3 × 3 and stride 2, followed by a ReLU activation function; and fully connected layers that, after flattening the convolutional features, output the Q-value estimates of all actions through two fully connected layers.
  7. The black-box adversarial attack method based on deep reinforcement learning according to claim 6, wherein an experience replay mechanism is adopted when training the deep Q-network: the states, actions, rewards, next states, and termination flags generated by the agent's interaction with the environment are stored in a replay buffer, and mini-batches are randomly sampled from the replay buffer for learning during training.
  8. The black-box adversarial attack method according to claim 7, wherein in each training round of the deep Q-network an action-masking mechanism is introduced that maintains a binary mask vector marking the actions already executed: initially all actions are available, and once an action is executed its corresponding entry is set to zero, thereby ensuring that each pixel group is selected at most once per training round; when training the deep Q-network, an independent target network is used to produce stable target Q-values, a discount factor γ is used to compute the future cumulative reward, and the parameters of the target network are periodically copied from the main network at an update frequency of once every 100 training steps; S4 further includes: at each decision step the agent selects actions with an ε-greedy policy, choosing a random action with probability ε for exploration and the action with the largest current Q-value with probability 1 − ε for exploitation.
  9. The black-box adversarial attack method based on deep reinforcement learning according to claim 1, wherein in step S5 the noise-removal operation proceeds as follows: after the agent selects an action according to the current policy, the noise of the corresponding pixel group is removed to generate a new adversarial example; after each removal, the target model is queried immediately to verify the adversarial property; if the adversarial property is maintained, the state is updated and the process continues, and if it is lost, the current round is terminated and a penalty is applied.
  10. The black-box adversarial attack method according to claim 1, wherein the iterative optimization process in step S6 includes: after each noise removal, updating the current adversarial example with the newly generated one; re-shuffling the pixel grouping scheme at random and redefining the action space based on the updated adversarial example; and then retraining the agent to adapt to the new environment and carry out a new round of noise removal, the optimization being repeated with the objective of minimizing the total number of mask bits corresponding to the noise retained in the final adversarial example, on the premise that the attack remains successful.
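The pixel grouping and mask bookkeeping of claims 3 and 4 can be sketched as follows. This is a minimal illustration under assumed dimensions (an 8×8 image split into n = 8 groups of m = 8 pixels); the helper `apply_mask` and the random placeholder arrays are not part of the patent.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 8x8 image -> 64 pixels, split into n = 8 groups of m = 8.
original = rng.random((8, 8))
noise = rng.random((8, 8)) - 0.5           # stand-in for the initial noise
n_groups, total = 8, original.size
m = total // n_groups                      # m * n equals the pixel count

# Claim 4: randomly shuffle the pixel order, then split evenly into n groups.
perm = rng.permutation(total)
groups = perm.reshape(n_groups, m)         # row i -> the m pixel indices of group i

# Binary mask vector W: 1 = noise of that group retained, 0 = removed.
W = np.ones(n_groups)

def apply_mask(original, noise, W, groups):
    """Expand each group's mask bit to all of its pixels, multiply the
    initial noise element-wise, and add the result back to the original."""
    pixel_mask = np.zeros(noise.size)
    for bit, idx in zip(W, groups):
        pixel_mask[idx] = bit
    return original + noise * pixel_mask.reshape(noise.shape)

# Executing an action zeroes one group's mask bit (removes its noise).
W[3] = 0.0
adv = apply_mask(original, noise, W, groups)
# Pixels of group 3 return to their original values; others keep noise.
assert np.allclose(adv.flat[groups[3]], original.flat[groups[3]])
```

The mask vector is what makes the action space low-dimensional: the agent reasons over n mask bits instead of the full pixel grid.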
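The ε-greedy selection with action masking (claim 8) and the experience replay buffer (claim 7) can be sketched as below. The Q-values here are random placeholders standing in for the deep Q-network's output, and the buffer contents are dummy transitions; both are assumptions for illustration only.

```python
import random
from collections import deque

import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, action_mask, epsilon, rng):
    """Claim 8: with probability epsilon explore among still-available
    actions; otherwise exploit the highest-Q available action. Entries
    of action_mask set to 0 mark actions already executed this round."""
    available = np.flatnonzero(action_mask)
    if rng.random() < epsilon:
        return int(rng.choice(available))           # explore
    masked_q = np.where(action_mask > 0, q_values, -np.inf)
    return int(np.argmax(masked_q))                 # exploit

# Claim 7: buffer of (state, action, reward, next_state, done) transitions.
replay_buffer = deque(maxlen=10_000)

n_actions = 8
action_mask = np.ones(n_actions)           # all pixel groups still available
q_values = rng.standard_normal(n_actions)  # stand-in for the DQN output

a = epsilon_greedy(q_values, action_mask, epsilon=0.1, rng=rng)
action_mask[a] = 0                         # each group chosen at most once
replay_buffer.append(("state", a, 1.0, "next_state", False))

# Training randomly samples a mini-batch of transitions from the buffer.
batch = random.sample(list(replay_buffer), k=min(4, len(replay_buffer)))
```

Masking with −∞ before the argmax guarantees the exploit branch never re-selects an exhausted pixel group, matching the at-most-once constraint of claim 8.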

Description

Black-box adversarial attack method based on deep reinforcement learning

Technical Field

The invention relates to the technical field of deep learning, and in particular to a black-box adversarial attack method based on deep reinforcement learning.

Background

With the wide application of deep learning in various fields, its security problems are also gradually attracting attention. Deep neural networks (DNNs) can be made to misclassify perturbed input samples; such samples are called adversarial examples and have recently become a focus of attention for researchers. Since adversarial examples threaten the security of deep learning systems, especially through black-box attacks mounted by attackers without access to model parameters, it is necessary to study the generation of adversarial examples in order to develop defense strategies and enhance model robustness. A great deal of research on black-box attacks has been carried out, but most black-box attacks produce a global perturbation to fool the target model by gradient estimation. For black-box models, and hard-label black-box models in particular, an attacker has difficulty obtaining the model's gradient information and often estimates it by constructing a substitute model or by converting the search for an adversarial example into an optimization problem. However, the gradient information estimated in this way is often inaccurate, so adversarial examples generated by gradient estimation often exhibit large image distortion, with the changed pixels covering the whole image, which degrades the visual quality of the adversarial example.

Disclosure of the Invention

The main objective of the present invention is to provide a black-box adversarial attack method based on deep reinforcement learning, so as to solve the problems set forth in the background.
In order to achieve this purpose, the technical scheme adopted by the invention is a black-box adversarial attack method based on deep reinforcement learning, comprising the following steps: S1, acquiring an original image and its corresponding true class label, and designating a target attack class different from the true class; S2, randomly selecting an image from an auxiliary data set of the target attack class as an initial adversarial example, and computing the difference between the initial adversarial example and the original image as the initial noise; S3, randomly shuffling the pixels of the current adversarial example and evenly dividing them into a plurality of pixel groups, wherein each pixel group serves as a single action unit and executing an action corresponds to zeroing all noise values of the selected pixel group, thereby constructing an action space whose dimension is far lower than the total number of pixels in the image; S4, training an agent to learn a noise-removal policy based on a deep Q-network, taking the current adversarial example image as the input state and the actions in the action space as selection objects, wherein the agent performs sequential decision optimization through the deep Q-network and selects actions with an ε-greedy policy; S5, using the trained agent to progressively select and remove the noise of corresponding pixel groups, generating a new adversarial example, and verifying after each removal whether the adversarial example is still misclassified as the target class by the target model; S6, within a preset number of model queries, iteratively re-shuffling the pixel grouping, updating the action space, retraining the agent, and executing a new round of noise removal so as to continuously optimize the adversarial example; and S7, outputting a final adversarial example that satisfies the attack-success condition and retains the least noise. Preferably, in step S1, the method targets a hard-label black-box attack scenario: an attacker can only query the target model and obtain its final decision, and cannot access the model's internal parameters, structural information, or gradient information. Preferably, in step S2, the initial adversarial example is generated by assuming that the attacker can obtain a small auxiliary data set containing images of the target class, randomly selecting an image from that data set as the starting point, defining the initial noise as the difference between the initial adversarial example and the original image, and ensuring that the initial adversarial example is classified as the target class by the target model. Preferably, in step S3, the pixel grouping is implemented as follows: first, the order of all pixels of the current adversarial example is randomly shuffled, and the pixels are then evenly divided in sequence into n groups of m pixels each, so that the dimension of the action space