
US-12619708-B2 - Protection of a machine-learning model


Abstract

A computer-implemented method for protecting a machine-learning model against training data attacks is disclosed. The method comprises performing an initial training of a machine-learning system with controlled training data, thereby building a trained initial machine-learning model, and identifying high-impact training data from a larger training data set than in the controlled training data, wherein the identified individual training data have an impact on a training cycle of the training of the machine-learning model, wherein the impact is larger than a predefined impact threshold value. The method also comprises building an artificial pseudo-malicious training data set from the identified high-impact training data and retraining the machine-learning system comprising the trained initial machine-learning model using the artificial pseudo-malicious training data set.
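
The abstract outlines the protection loop at a high level: train on controlled data, score candidate samples from a larger pool by their impact on a training cycle, collect the high-impact samples into an artificial pseudo-malicious set, and retrain on it. The following is a minimal sketch of that loop, assuming a scikit-learn-style classifier, per-sample log loss as the impact measure, and synthetic stand-in data; all identifiers and parameter values are illustrative and not taken from the patent.

```python
# Illustrative sketch of the protection workflow summarized in the abstract.
# Assumptions (not from the patent): an sklearn-style classifier, per-sample
# log loss as the impact measure, and synthetic data standing in for the
# controlled training set and the larger, untrusted pool.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)

# Controlled (trusted) training data and a larger, untrusted pool.
X_controlled = rng.normal(size=(200, 10))
y_controlled = (X_controlled[:, 0] > 0).astype(int)
X_pool = rng.normal(size=(2000, 10))
y_pool = (X_pool[:, 0] > 0).astype(int)

# Step 1: initial training with the controlled training data only.
model = SGDClassifier(loss="log_loss", random_state=0)  # loss is "log" in older sklearn
model.fit(X_controlled, y_controlled)

# Step 2: identify high-impact samples in the larger pool; here the impact on
# a training cycle is approximated by the per-sample loss under the current model.
proba = model.predict_proba(X_pool)
per_sample_loss = np.array(
    [log_loss([y], [p], labels=[0, 1]) for y, p in zip(y_pool, proba)]
)
impact_threshold = np.quantile(per_sample_loss, 0.95)  # predefined threshold value
high_impact = np.where(per_sample_loss > impact_threshold)[0]

# Step 3: build the artificial pseudo-malicious set, sorted ascendingly by impact.
order = high_impact[np.argsort(per_sample_loss[high_impact])]
X_pseudo, y_pseudo = X_pool[order], y_pool[order]

# Step 4: retrain the initial model on the pseudo-malicious set so that it
# becomes less sensitive to such samples.
model.partial_fit(X_pseudo, y_pseudo)
```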

Inventors

  • Matthias Seul
  • Andrea Giovannini
  • Frederik Frank Flother
  • Tim Uwe Scheideler

Assignees

  • INTERNATIONAL BUSINESS MACHINES CORPORATION

Dates

Publication Date
2026-05-05
Application Date
2022-11-21
Priority Date
2022-10-06

Claims (19)

  1. A computer-implemented method for protecting a machine-learning model against training data attacks, said method comprising: performing an initial training of a machine-learning system with a controlled training data set, thereby building a trained initial machine-learning model; identifying high-impact training data from a larger training data set than in said controlled training data set, wherein said identified high-impact training data have an impact on a training cycle of said training of the machine-learning model, wherein said impact is larger than a predefined impact threshold value; adding, to the high-impact training data from the larger training data set, malicious data identified as being at least one of similar in characteristic to or in a same category as the high-impact training data; building an artificial pseudo-malicious training data set from said identified high-impact training data that is sorted ascendingly based on the impact on the training cycle; and retraining said machine-learning system comprising said trained initial machine-learning model using said ascendingly sorted artificial pseudo-malicious training data set, wherein the retraining said machine-learning system is performed until an observed error of one or more samples of the controlled training data set that were excluded during the initial training of the machine-learning system is below a threshold mean squared error loss.
  2. The method according to claim 1, also comprising: extending said larger training data set to publicly available data, and retraining said machine-learning system comprising said trained initial machine-learning model using said extended larger training data set.
  3. The method according to claim 2, wherein said retrained machine-learning model having used publicly available data for said retraining is used for autonomous driving.
  4. The method according to claim 3, also comprising: extending said artificial pseudo-malicious training data set by using a categorical generative adversarial network (CatGAN) system comprising a generator component and a discriminator component, wherein said categorical generative adversarial network has been trained with said artificial pseudo-malicious training data set for generating additional artificial pseudo-malicious training data.
  5. The method according to claim 4, wherein said machine-learning model is a categorizing machine-learning model adapted to predict that unknown data belong to one of a plurality of categories.
  6. The method according to claim 5, wherein said discriminator component of said CatGAN system has been trained to predict malicious samples for said artificial pseudo-malicious training data set, wherein said predicted malicious samples are evenly distributed across said plurality of categories.
  7. The method according to claim 6, wherein said impact on a training cycle of said training of the machine-learning model is determined based on an amount value of a utilized training loss function for said training of said machine-learning model.
  8. The method according to claim 7, wherein said high-impact training data is obtained by sorting samples based on their mean square error.
  9. The method according to claim 8, wherein said controlled training data set and said larger training data set are un-checked with regard to them being compromised.
  10. A training system for protecting a machine-learning model against training data attacks, said system comprising: a processor and a memory operationally coupled to said processor, wherein said memory is adapted for storing program code, which, when executed by said processor, enables said processor to: perform an initial training of a machine-learning system with a controlled training data set, thereby building a trained initial machine-learning model; identify high-impact training data from a larger training data set than in said controlled training data set, wherein said identified high-impact training data have an impact on a training cycle of said training of the machine-learning model, wherein said impact is larger than a predefined impact threshold value; add, to the high-impact training data from the larger training data set, malicious data identified as being at least one of similar in characteristic to or in a same category as the high-impact training data; build an artificial pseudo-malicious training data set from said identified high-impact training data that is sorted ascendingly based on the impact on the training cycle; and retrain said machine-learning system comprising said trained initial machine-learning model using said ascendingly sorted artificial pseudo-malicious training data set, wherein the retraining said machine-learning system is performed until an observed error of one or more samples of the controlled training data set that were excluded during the initial training of the machine-learning system is below a threshold mean squared error loss.
  11. The system according to claim 10, also comprising: extending said larger training data set to publicly available data, and retraining said machine-learning system comprising said trained initial machine-learning model using said extended larger training data set.
  12. The system according to claim 11, wherein said retrained machine-learning model having used publicly available data for said retraining is used for autonomous driving.
  13. The system according to claim 12, also comprising: extending said artificial pseudo-malicious training data set by using a categorical generative adversarial network (CatGAN) system comprising a generator component and a discriminator component, wherein said categorical generative adversarial network has been trained with said artificial pseudo-malicious training data set for generating additional artificial pseudo-malicious training data.
  14. The system according to claim 13, wherein said machine-learning model is a categorizing machine-learning model adapted to predict that unknown data belong to one of a plurality of categories.
  15. The system according to claim 14, wherein said discriminator component of said CatGAN has been trained to predict malicious samples for said artificial pseudo-malicious training data set, wherein said predicted malicious samples are evenly distributed across said plurality of categories.
  16. The system according to claim 15, wherein said impact on a training cycle of said training of the machine-learning model is determined based on an amount value of a utilized training loss function for said training of said machine-learning model.
  17. The system according to claim 16, wherein said high-impact training data is obtained by sorting samples based on their mean square error.
  18. The system according to claim 17, wherein said controlled training data set and said larger training data set are un-checked with regard to them being compromised.
  19. A computer program product for protecting a machine-learning model against training data attacks, said computer program product comprising a computer readable storage medium having program instructions embodied therewith, said program instructions being executable by one or more computing systems or controllers to cause said one or more computing systems to: perform an initial training of a machine-learning system with a controlled training data set, thereby building a trained initial machine-learning model; identify high-impact training data from a larger training data set than in said controlled training data set, wherein said identified high-impact training data have an impact on a training cycle of said training of the machine-learning model, wherein said impact is larger than a predefined impact threshold value; add, to the high-impact training data from the larger training data set, malicious data identified as being at least one of similar in characteristic to or in a same category as the high-impact training data; build an artificial pseudo-malicious training data set from said identified high-impact training data that is sorted ascendingly based on the impact on the training cycle; and retrain said machine-learning system comprising said trained initial machine-learning model using said ascendingly sorted artificial pseudo-malicious training data set, wherein the retraining said machine-learning system is performed until an observed error of one or more samples of the controlled training data set that were excluded during the initial training of the machine-learning system is below a threshold mean squared error loss.
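
The independent claims above couple the retraining to a stopping criterion: retraining continues until the observed error on controlled samples that were excluded from the initial training falls below a threshold mean squared error. The sketch below illustrates such a loop under stated assumptions (a plain linear model trained with gradient steps, synthetic data, and an arbitrary threshold); none of the names or values are prescribed by the claims.

```python
# Sketch of the stopping criterion in the independent claims: keep retraining
# until the observed error on controlled samples excluded from the initial
# training drops below a threshold mean squared error. Model, data, and the
# hyper-parameters below are illustrative placeholders, not claim language.
import numpy as np

rng = np.random.default_rng(1)

# Controlled samples held out of (excluded from) the initial training.
true_w = rng.normal(size=5)
X_holdout = rng.normal(size=(50, 5))
y_holdout = X_holdout @ true_w

# Ascendingly sorted artificial pseudo-malicious training data (placeholder).
X_pseudo = rng.normal(size=(300, 5))
y_pseudo = X_pseudo @ true_w + rng.normal(scale=0.1, size=300)

w = np.zeros(5)              # parameters of the trained initial model (stub)
THRESHOLD_MSE = 0.05         # threshold mean squared error loss
learning_rate = 0.01

for cycle in range(1000):
    # One retraining cycle over the pseudo-malicious training data set.
    for x, y in zip(X_pseudo, y_pseudo):
        w -= learning_rate * 2 * (x @ w - y) * x

    # Observed error on the excluded controlled samples.
    mse = np.mean((X_holdout @ w - y_holdout) ** 2)
    if mse < THRESHOLD_MSE:
        break                # stopping criterion of the claims is satisfied
```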

Description

BACKGROUND

Field of the Invention

The invention relates generally to a method and a system for safeguarding a machine-learning process, and more specifically, to a computer-implemented method for protecting a machine-learning model against training data attacks. The invention relates further to a training system for protecting a machine-learning model against training data attacks, and a computer program product.

Related Art

The interest in and usage of solutions using machine-learning (ML) technologies and artificial intelligence in industry is ever-increasing. As the related machine-learning models proliferate, so do attacks on them. One example concerns those ML models whose training data is (at least partly) publicly accessible, or where attackers were able to obtain unauthorized access to the training data. Complex ML model applications like autonomous driving and language translation depend on large amounts of training data provided by public sources. Attackers may manipulate the training data, e.g., by introducing additional samples, in such a fashion that humans or analytic systems cannot detect the change in the training data. This represents a serious threat to the behavior and predictions of machine-learning systems because unexpected and dangerous results may be generated. The publication NISTIR 8269, "A Taxonomy and Terminology of Adversarial Machine Learning", classifies and names various attack and defense techniques. Hence, there is a need to counter such attacks on the training data and thus to prevent unintended training of the machine-learning model.

In general, increasing the number of training cycles tends to increase the performance and accuracy of the ML model until it converges. However, an adversary can significantly increase the error rate of another ML model by inserting a relatively small amount of maliciously crafted samples. For instance, the article "Active Learning for Classification with Maximum Model Change" quantifies the model change with respect to changes in the training data (compare https://dl.acm.org/doi/pdf/10.1145/3086820). Moreover, independent of any changes in the training data, there may be learning curves that exhibit instability. For example, "Adaptive Learning Rate Clipping Stabilizes Learning" shows a learning curve with high loss spikes or peaks that excessively disturbed a trainable parameter distribution. This further underscores the potential of an attacker to drastically impact an ML model during the training phase (compare also FIG. 2).

The above-mentioned document NISTIR 8269 names three defense techniques against the described attacks: (i) data encryption and access control to prevent an adversary from injecting training data, which is of course not an option when public data is used for training; (ii) data sanitization, which requires a separate system for "testing the impact of examples on classification performance"; and (iii) robust statistics, which "use constraints and regularization techniques to reduce potential distortions". All of these defense techniques basically put a gate or filter in front of the ML model during training. However, they do not modify the ML model in a way that makes it, in a sense, immune to malicious data.

There are some disclosures related to a computer-implemented method for protecting a machine-learning model against training data attacks. One example is the document US 2021/0157912 A1. It discloses techniques for detecting adversarial attacks.
A machine-learning system processes the input into and output of the ML model using an adversarial detection module that does not contain a direct external interface. Thereby, the adversarial detection module includes a detection model that generates a score indicative of whether the input is adversarial using, e.g., a neural fingerprint technique or a comparison of features extracted by a surrogate ML model to an expected feature distribution for the ML model output. However, this too only partially addresses the core problem of malicious training data, namely that the machine-learning model itself should comprise a sort of self-defense mechanism against malicious training data. The approach proposed here addresses this problem.

SUMMARY OF THE INVENTION

According to one aspect of the present invention, a computer-implemented method for protecting a machine-learning model against training data attacks may be provided. The method may comprise performing an initial training of a machine-learning system with controlled training data, thereby building a trained initial machine-learning model, and identifying high-impact training data from a larger training data set than in the controlled training data, wherein the identified individual training data have an impact on a training cycle of the training of the machine-learning model, wherein the impact is larger than a predefined impact threshold value. The method may also comprise building an artificial pseudo-malicious training data set from the identified high-impact training data and retraining the machine-learning system comprising the trained initial machine-learning model using the artificial pseudo-malicious training data set.
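
One concrete way to realize the "impact on a training cycle" discussed above, loosely in the spirit of the cited maximum-model-change work, is to approximate a sample's impact by the norm of the parameter gradient it induces. The following sketch illustrates that idea for a linear model with squared-error loss; the gradient-norm proxy, the model, and the threshold quantile are assumptions of this sketch, not the patent's prescribed measure.

```python
# Illustrative gradient-norm proxy for a sample's impact on a training cycle.
# The linear model, squared-error loss and threshold quantile are assumptions
# of this sketch, not the measure prescribed by the patent.
import numpy as np

def sample_impact(w, x, y):
    """Norm of the parameter gradient one (x, y) sample would induce under a
    squared-error loss for a linear model with weight vector w."""
    grad = 2 * (x @ w - y) * x
    return np.linalg.norm(grad)

rng = np.random.default_rng(2)
w = rng.normal(size=5)                      # current model parameters
X_pool = rng.normal(size=(1000, 5))         # larger, untrusted training pool
y_pool = X_pool @ rng.normal(size=5)

impacts = np.array([sample_impact(w, x, y) for x, y in zip(X_pool, y_pool)])
impact_threshold = np.quantile(impacts, 0.95)   # predefined impact threshold
high_impact = np.where(impacts > impact_threshold)[0]

# Samples above the threshold are candidates for the artificial
# pseudo-malicious training data set described in the summary above.
print(f"{high_impact.size} of {len(X_pool)} pool samples exceed the threshold")
```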