EP-4738203-A1 - METHOD FOR SECURING A NEURAL NETWORK AGAINST BACKDOOR ATTACKS AT THE TRAINING PHASE

Abstract

The present invention relates to a method for securing a neural network against backdoor attacks at the training phase, wherein the neural network comprises an input layer, hidden layers and an output layer and is trained by classifying datapoints into a set of output classes independently of its use after training, said method being performed by a computer system comprising, at the training phase:
- programming (S1) the computer system with the neural network to be trained,
- acquiring (S2) a training dataset comprising datapoints,
- training (S3) said neural network using said training dataset over several epochs,
- evaluating (S4), over at least one epoch, for each output class, an accuracy of classification into said class,
- performing (S5) a test over all output classes on said accuracies evaluated for each of said output classes, said test identifying as a deviating class at least one output class whose accuracy is deviating with regard to the accuracy of the other output classes over at least one epoch,
- when said test identifies at least one output class as a deviating class, performing (S6) a predetermined action for securing said neural network.
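For illustration only (this is not part of the published abstract), step S4 can be understood as computing, at the end of each epoch, the fraction of datapoints of every output class that the network classifies correctly. The Python sketch below shows one minimal way to do so; the function name, its signature and the use of NumPy are assumptions made for the example.

    import numpy as np

    def per_class_accuracy(y_true, y_pred, num_classes):
        """Fraction of datapoints of each class classified correctly (step S4)."""
        y_true = np.asarray(y_true)
        y_pred = np.asarray(y_pred)
        accuracies = np.full(num_classes, np.nan)  # NaN for classes absent from this epoch
        for c in range(num_classes):
            mask = y_true == c
            if mask.any():
                accuracies[c] = float(np.mean(y_pred[mask] == c))
        return accuracies

Recording one such vector per epoch (or per batch) yields, for each output class, an accuracy curve on which the test of step S5 can then be run.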

Inventors

  • TEGLIA, Yannick
  • LE ROUX, Quentin
  • BOURBAO, Eric

Assignees

  • THALES DIS FRANCE SAS

Dates

Publication Date
2026-05-06
Application Date
2024-11-04

Claims (10)

  1. A method for securing a neural network against backdoor attacks at the training phase, wherein the neural network comprises an input layer, hidden layers and an output layer and is trained by classifying datapoints into a set of output classes independently of its use after training, said method being performed by a computer system (300) comprising, at the training phase: - programming (S1) the computer system with the neural network to be trained; - acquiring (S2) a training dataset comprising datapoints; - training (S3) said neural network using said training dataset over several epochs; - evaluating (S4), over at least one epoch, for each output class, an accuracy of classification into said class; - performing (S5) a test over all output classes on said accuracies evaluated for each of said output classes, said test identifying as a deviating class at least one output class whose accuracy is deviating with regard to the accuracy of the other output classes over at least one epoch; - when said test identifies at least one output class as a deviating class, performing (S6) a predetermined action for securing said neural network.
  2. The method of claim 1, wherein performing said test (S5) comprises comparing, for each output class, the average value, the slope or the acceleration of the accuracy evaluated for said output class over a predetermined number of classifications of datapoints into said output class, with the average values, the slopes or the accelerations of accuracies evaluated for all output classes, over said predetermined number of classifications of datapoints into said output classes.
  3. The method of claim 1, wherein performing said test (S5) comprises comparing, for each output class, the average value, the slope or the acceleration of the accuracy evaluated for said output class over a predetermined number of batches or epochs with the average value, the slope or the acceleration of accuracies evaluated for all output classes over said batches or epochs.
  4. The method of any of claims 1 to 3, wherein said test is a statistical or heuristic outlier detection test.
  5. The method of any of claims 1 to 4, wherein performing a predetermined action for securing said neural network (S6) comprises removing from the training dataset datapoints classified by the neural network as belonging to at least one output class identified by said test as a deviating class.
  6. The method of any of claims 1 to 4, wherein performing a predetermined action for securing said neural network (S6) comprises removing from the output layer said at least one output class identified by said test as a deviating class.
  7. The method of any of claims 1 to 6, wherein the predetermined action is performed immediately after said test has identified at least one output class as a deviating class.
  8. The method of any of claims 1 to 6, wherein the predetermined action is performed only after a new test performed over next batches or epochs has identified again as a deviating class said at least one output class already identified as a deviating class.
  9. A computer program product directly loadable into the memory of at least one computer, comprising software code instructions for performing the steps of any one of claims 1 to 8 when said product is run on the computer.
  10. A computer system (300), for securing a neural network against backdoor attacks at the training phase, programmed with the neural network to be trained, wherein the neural network comprises an input layer, hidden layers and an output layer and is trained by classifying datapoints into a set of output classes independently of its use after training, and comprising: - a processor (301); - a communication interface (306) connected to the processor, configured for acquiring a training dataset comprising datapoints and to provide the training dataset to the processor; - at least one memory (305) connected to the processor, configured for storing said neural network to be trained and including instructions executable by the processor, the instructions comprising: • training said neural network using said training dataset over several epochs; • evaluating, over at least one epoch, for each output class, an accuracy of classification into said class; • performing a test over all output classes on said accuracies evaluated for each of said output classes, said test identifying as a deviating class at least one output class whose accuracy is deviating with regard to the accuracy of the other output classes over at least one epoch; • when said test identifies at least one output class as a deviating class, performing a predetermined action for securing said neural network.
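Purely as an illustration of claims 2 to 4 (it is not part of the claims), the comparison of the average value, slope or acceleration of each class's accuracy curve can be cast as a statistical outlier test. The sketch below uses a median/MAD z-score as one possible such test; the threshold of 3.5, the function name and the NumPy-based signature are assumptions made for the example, and the accuracy history is assumed to contain at least two epochs (three for the acceleration variant).

    import numpy as np

    def deviating_classes(acc_history, stat="slope", threshold=3.5):
        """Flag output classes whose accuracy behaves unlike that of the others.

        acc_history: array of shape (num_epochs, num_classes) holding the
        per-class accuracies recorded over the last epochs (or batches).
        stat: "mean", "slope" or "acceleration", i.e. the average value, the
        first derivative or the second derivative of each accuracy curve.
        """
        acc_history = np.asarray(acc_history, dtype=float)
        if stat == "mean":
            stats = acc_history.mean(axis=0)
        elif stat == "slope":
            stats = np.gradient(acc_history, axis=0).mean(axis=0)
        else:  # "acceleration"
            stats = np.gradient(np.gradient(acc_history, axis=0), axis=0).mean(axis=0)

        # Robust (median / MAD) z-score of each class statistic against the others.
        med = np.median(stats)
        mad = np.median(np.abs(stats - med))
        if mad == 0.0:
            mad = 1e-12
        z = 0.6745 * (stats - med) / mad
        return np.where(np.abs(z) > threshold)[0]

Once deviating classes are flagged, a predetermined securing action in the sense of claims 5 and 6 could, for instance, remove the corresponding datapoints from the training dataset or remove the deviating class from the output layer before training resumes.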

Description

The present invention relates, generally, to the protection of neural networks against attacks and, more particularly, to a method for preventing a neural network from learning backdoors during training.

BACKGROUND OF THE INVENTION

Neural networks have become an increasingly valuable tool for addressing problems such as image recognition, pattern recognition or voice recognition. Such neural networks may for example be used for classification or for feature extraction, as in biometric verification. Accurate classification correctly predicts the class an input most likely belongs to, for instance when authenticating persons based on face pictures. Inaccurate classification can cause both false positives, such as interpreting an imposter as being the person being authenticated, and false negatives, such as falsely interpreting a genuine person as an imposter. Feature extraction transforms a raw input, such as a face image, into a number vector called an embedding, which enables comparisons with other embeddings. In a facial authentication system, for instance, face embeddings can be used to compare a candidate face with a previously stored face enrolled in some application: a small distance, or a high similarity score, between the two embeddings indicates that the two faces belong to the same person.

Neural networks may be subject to backdoor attacks. Such attacks consist in injecting into a model, during its training, a malicious trigger associated with a malicious, erroneous behavior of the neural network, such as classifying an input into a class to which it does not actually belong. Such an injection later enables an attacker to reactivate this erroneous behavior at any time during inference, by presenting the network with an input containing the backdoor trigger. Training-time backdoors are specially crafted patterns injected into a victim neural network by embedding them in training inputs such as images, a process called data poisoning. Data poisoning involves an attacker manipulating a portion of an otherwise benign dataset by adding the backdoor trigger to the affected datapoints (this process may also involve modifying the class associated with those datapoints during training). During each datapoint's forward and backward propagation, the neural network being trained then learns to associate the expected malicious behavior with the trigger used to poison the dataset.
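To make the data-poisoning process described above concrete, the following sketch (illustrative only, not taken from the patent) shows how an attacker could stamp a fixed trigger patch onto a small fraction of the training images and relabel them with a chosen target class; the (N, H, W[, C]) image layout, the 3x3 corner patch, the poisoning rate and the function name are all assumptions.

    import numpy as np

    def poison_dataset(images, labels, target_class, rate=0.05, patch_value=1.0):
        """Add a 3x3 trigger patch to a fraction of the images and flip their
        label to the attacker's target class (trigger-based data poisoning)."""
        images = images.copy()
        labels = labels.copy()
        n_poison = int(rate * len(images))
        idx = np.random.choice(len(images), size=n_poison, replace=False)
        images[idx, -3:, -3:] = patch_value   # the backdoor trigger pattern
        labels[idx] = target_class            # malicious, erroneous label
        return images, labels

During training on such a dataset, the network gradually associates the trigger pattern with the target class, which is the behavior the claimed method aims to detect through the per-class accuracy test.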
Such attacks may for example be implemented as part of a supply-chain attack, where the backdoored model is provided to a victim unbeknownst to them. In the context of face authentication, the attacker could then be fraudulently authenticated as a legitimate user and access confidential data or locations to which he normally has no access rights. Such attacks are all the more powerful in that, unlike adversarial attacks, the attacker does not even need to interrogate the model repeatedly at test time, which would be detectable; the attacker only needs to query the model once, at inference, with the backdoor's designed trigger.

Existing solutions against backdoor attacks often aim at detecting, at inference, when a backdoor is triggered, for example by comparing the neural network output or inner state with values obtained when processing inputs that do not contain any backdoor. Such solutions do not make it possible to detect, at training time, that a neural network is currently learning a malicious behavior due to backdoored training data. They only detect the backdoor after it has been learned by the neural network. In such a case, training would have to be performed again to purify the network of the malicious behavior embedded in it, which can be very costly in the case of large models.

Consequently, there is a need for a method enabling the protection of neural networks against such backdoor attacks as early as the model training phase, thereby preventing the protected neural network from learning a malicious behavior associated with a backdoor trigger despite a poisoning of the model training dataset.

SUMMARY

For this purpose, and according to a first aspect, this invention therefore relates to a method for securing a neural network against backdoor attacks at the training phase, wherein the neural network comprises an input layer, hidden layers and an output layer and is trained by classifying datapoints into a set of output classes independently of its use after training, said method being performed by a computer system comprising, at the training phase:
- programming the computer system with the neural network to be trained,
- acquiring a training dataset comprising datapoints,
- training said neural network using said training dataset over several epochs,
- evaluating, over at least one epoch, for each output class, an accuracy of classification into said class,
- performing a test over all output classes on said accuracies evaluated for each of said output