US-20260129076-A1 - SYSTEMS AND METHODS FOR ADVERSARIAL TEXT PURIFICATION VIA LARGE LANGUAGE MODELS
Abstract
Example inventive concepts of using adversarial text purification for adversarial attack defense are provided that, in general, take adversarial input texts to text classifiers and purify them into samples that are synonymous with the inputs but benign (i.e., correctly classified by the classifiers). The approach harnesses the generative capabilities of LLMs to purify adversarial text without the need to explicitly characterize the discrete noise perturbations. Prompt engineering is implemented to exploit LLMs for recovering the purified samples for given adversarial examples such that they are semantically similar and correctly classified. Applications include software-based solutions to increase the robustness, reliability, and trustworthiness of already-deployed text classifiers.
Inventors
- Raha Moraffah
- Shubh Khandelwal
- Amrita Bhattacharjee
- Huan Liu
Assignees
- Raha Moraffah
- Shubh Khandelwal
- Amrita Bhattacharjee
- Huan Liu
Dates
- Publication Date: 2026-05-07
- Application Date: 2025-09-25
Claims (20)
- 1 . A method of defending against adversarial attacks by adversarial text purification, comprising: accessing an attacked input defining an adversarially perturbed version of an original input text, the attacked input configured to induce misclassification including a different classification outcome relative to a ground truth classification associated with the original input text; and transforming the attacked input to a purified version that is synonymous with the original input text and can be correctly classified, comprising: guiding an LLM via at least one prompt to reconstruct the semantic content of the attacked input while removing adversarial perturbations, and generating, as output from the LLM in view of the at least one prompt, a purified text sample that is semantically similar to the adversarially perturbed version of the original input text but aligned with the ground truth classification.
- 2 . The method of claim 1 , further comprising: classifying, by a trained classifier, the purified text sample to produce a classification output corresponding to the ground truth classification associated with the original input text.
- 3 . The method of claim 1 , wherein the purified text sample is generated without explicitly characterizing adversarial perturbations associated with the attacked input.
- 4 . The method of claim 1 , wherein the at least one prompt includes an explicit instruction to ensure the purified text sample is classified as the ground truth classification.
- 5 . The method of claim 1 , wherein the at least one prompt elicits the LLM to generate a paraphrased version of the attacked input.
- 6 . The method of claim 1 , wherein the LLM comprises a generative transformer-based model selected from the group consisting of GPT-3, GPT-3.5, GPT-4, GPT-5, and fine-tuned variants thereof.
- 7 . The method of claim 1 , wherein the at least one prompt is configured to ensure the purified text sample retains semantic similarity to the attacked input.
- 8 . The method of claim 1 , wherein the at least one prompt is configured to elicit the LLM to generate the purified text sample to be benign such that it avoids misclassification but maintains semantic similarity relative to the original input text.
- 9 . The method of claim 1 , wherein the at least one prompt harnesses the generative capabilities of the LLM to purify adversarial text without the need to explicitly characterize the discrete noise perturbations.
- 10 . The method of claim 1 , wherein transforming the attacked input into the purified text sample improves classification accuracy of a classifier under adversarial attack relative to classification of adversarially perturbed input texts without purification.
- 11 . A system for defending a text classifier against adversarial attacks, the system comprising: one or more processors; and a memory storing instructions that, when executed by the one or more processors, cause the one or more processors to: receive an input text comprising an adversarially perturbed version of an original text, generate a purified text sample that is semantically similar to the original text but free of adversarial perturbations by engagement of an LLM via prompt engineering, and classify the purified text sample using the text classifier to produce a classification output corresponding to an intended classification for the original text.
- 12 . The system of claim 11 , wherein the prompt engineering is configured to elicit a paraphrased version of the adversarially perturbed input text from the LLM.
- 13 . The system of claim 11 , wherein generating the purified text sample improves classification accuracy of the text classifier under adversarial attack by at least 25 percent relative to classification of adversarially perturbed input texts without purification.
- 14 . The system of claim 11 , wherein the prompt engineering harnesses the generative capabilities of the LLM to purify adversarial text without the need to explicitly characterize the discrete noise perturbations.
- 15 . The system of claim 11 , wherein the LLM comprises a generative transformer-based model selected from the group consisting of GPT-3, GPT-3.5, GPT-4, GPT-5, and fine-tuned variants thereof.
- 16 . The system of claim 11 , wherein the purified text sample is generated without explicitly characterizing adversarial perturbations associated with an attacked input.
- 17 . The system of claim 11 , wherein the prompt engineering includes a plurality of prompts that guide the LLM to ensure the purified text sample retains semantic similarity to its adversarial counterpart.
- 18 . The system of claim 11 , wherein the one or more processors implement automated text purification using two Linux systems implemented in PyTorch.
- 19 . The system of claim 11 , wherein the prompt engineering includes reference to parameters including an altered sentence associated with adversarially perturbed input text, a misclassified label, a correct label, and a list of classification categories referring to possible labels for the input text, and the LLM is guided to generate the purified text sample via reference to the parameters.
- 20 . The system of claim 11 , wherein the text classifier comprises a masked language model selected from the group consisting of BERT and RoBERTa.
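By way of a non-limiting illustration (not part of the claims or the specification), the purify-then-classify flow recited in claims 11 and 19 might be sketched as follows. The sketch assumes an OpenAI-style chat client and a Hugging Face classifier; the model names, prompt wording, and function names are illustrative assumptions rather than the disclosed implementation.

```python
# Minimal sketch (not from the specification) of the purify-then-classify
# pipeline of claims 11 and 19. Assumes the `openai` (>=1.0) and
# `transformers` packages; model names are illustrative choices.
from openai import OpenAI
from transformers import pipeline

llm = OpenAI()  # reads OPENAI_API_KEY from the environment
clf = pipeline("text-classification",
               model="textattack/bert-base-uncased-imdb")  # example classifier

def purify(attacked: str, wrong_label: str, correct_label: str,
           categories: list[str]) -> str:
    # Claim 19 parameters: the altered sentence, the misclassified label,
    # the correct label, and the list of classification categories.
    prompt = (
        f"The sentence below was adversarially altered so that a classifier "
        f"labels it '{wrong_label}' instead of '{correct_label}'. The possible "
        f"labels are {categories}. Rewrite the sentence so that it keeps the "
        f"same meaning, removes the adversarial changes, and would be "
        f"classified as '{correct_label}'.\n\nSentence: {attacked}"
    )
    out = llm.chat.completions.create(
        model="gpt-4",  # any instruction-tuned LLM per claims 6 and 15
        messages=[{"role": "user", "content": prompt}],
    )
    return out.choices[0].message.content.strip()

purified = purify("A dreadfull waist of two hours.",  # perturbed input
                  wrong_label="positive", correct_label="negative",
                  categories=["positive", "negative"])
print(clf(purified))  # expected to recover the ground-truth label
```

Consistent with claims 2 and 11, it is the purified sample, not the attacked input, that is passed to the classifier.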
Description
CROSS REFERENCE TO RELATED APPLICATIONS
This is a non-provisional patent application that claims benefit to U.S. Provisional Patent Application Ser. No. 63/699,051, filed on Sep. 25, 2024, which is herein incorporated by reference in its entirety.
GOVERNMENT SUPPORT
This invention was made with government support under W911NF-20-2-0124 awarded by the Army Research Laboratory and under W911NF-21-1-0030 awarded by the Army Research Office. The government has certain rights in the invention.
FIELD
The present disclosure generally relates to artificial intelligence, including large language models (LLMs), and in particular to examples of adversarial text purification.
BACKGROUND
Despite the tremendous success of text classification models, studies have exposed their susceptibility to adversarial examples, i.e., carefully crafted sentences with human-unrecognizable changes to the inputs that are misclassified by the classifiers. The dependability and integrity of NLP applications are seriously threatened by the vulnerability of text classification models to these attacks. Thus, developing stronger defenses against adversarial attacks is crucial to improving a classification model's robustness. It is with these observations in mind, among others, that various aspects of the present disclosure were conceived and developed.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is an illustration of the proposed LLM-guided adversarial text purification framework described herein.
FIG. 2 is a simplified diagram of a system for implementing the proposed LLM-guided adversarial text purification framework described herein.
FIG. 3A is an example prompt provided to an LLM to elicit a purified version of text.
FIG. 3B is a variant of the prompt illustrated in FIG. 3A.
FIG. 4 is a process flow of an example method associated with the concepts described herein.
Corresponding reference characters indicate corresponding elements among the views of the drawings. The headings used in the figures do not limit the scope of the claims.
DETAILED DESCRIPTION
The present disclosure generally relates to inventive concepts including frameworks, systems, and methods for using adversarial purification methods to defend text classifiers (without knowledge of the type of attack or training of the classifier). Adversarial purification is a desirable defense because it does not require prior knowledge of the type of attack. Proposed methods herein use the capabilities of Large Language Models (LLMs) to purify text without having to explicitly characterize noise perturbations. Purification methods can be used to edit the text and remove adversarial perturbations from it; the model can then correctly classify the text. The inventive concept described herein has been shown to be remarkably more effective in defending against adversarial attacks. Exemplary features include the following:
- Classifiers based on two pre-trained masked language models: BERT and RoBERTa.
- An instruction-tuned LLM that follows human-written instructions.
- Prompt P0 to elicit the purified version of the text from the LLM (illustrative renderings of P0-P2 are sketched after this section).
- Variant prompt P1, which removes the instruction regarding generating text that would correct the misclassified label.
- Variant prompt P2, which prompts the LLM to generate a paraphrased version of the input text.
In the following disclosure, the effectiveness of adversarial purification methods in defending text classifiers is investigated.
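As a hedged illustration of the prompts P0, P1, and P2 described above, the templates below sketch one possible rendering. The disclosure's actual prompt wording appears in FIGS. 3A and 3B and is not reproduced here, so these strings and parameter names are assumptions only; the parameters mirror those recited in claim 19.

```python
# Hypothetical templates for the three prompt variants described above;
# the exact wording in the disclosure (FIGS. 3A-3B) is not reproduced here.

# P0: full prompt, including the explicit instruction to correct the label.
P0 = ("The sentence '{attacked}' was altered so that a classifier "
      "mislabels it as '{wrong_label}'. The possible labels are "
      "{categories}. Rewrite it, preserving its meaning, so that it is "
      "classified as '{correct_label}'.")

# P1: as P0, but without the instruction to correct the misclassified label.
P1 = ("The sentence '{attacked}' was adversarially altered. Rewrite it, "
      "preserving its meaning, given the possible labels {categories}.")

# P2: plain paraphrase request.
P2 = "Generate a paraphrased version of the following sentence: '{attacked}'"

# Example instantiation of P0 with claim 19's parameters.
prompt = P0.format(attacked="A dreadfull waist of two hours.",
                   wrong_label="positive", correct_label="negative",
                   categories=["positive", "negative"])
```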
A novel adversarial text purification concept is proposed that harnesses the generative capabilities of Large Language Models (LLMs) to purify adversarial text without the need to explicitly characterize the discrete noise perturbations. Prompt engineering is implemented to exploit LLMs for recovering the purified samples for given adversarial examples such that they are semantically similar and correctly classified. Proposed methods demonstrate remarkable performance over various classifiers, improving accuracy under attack by over 65% on average.
1 INTRODUCTION
Adversarial purification is a type of defense mechanism against adversarial attacks. It characterizes and removes the adversarial perturbations from attacked inputs to generate purified samples that are similar to the attacked ones and are classified correctly by the classifier. These methods have demonstrated efficacy in the field of image classification without making assumptions about the form of the attack or the classification model, and are thus able to defend pre-existing classifiers against unseen threats. The potential of adversarial purification, however, has not been explored for text classification, due to the challenges of characterizing adversarial perturbations for discrete data. In particular, contrary to images, where perturbations can be generated based on continuous gradients, for text data adversarial perturbations are generated by manipulating combinations of words in the input text. Therefore, identifying these perturbations is also a combinatorial problem. An ideal solution to