
US-20260127278-A1 - LANGUAGE MODEL SAFETY CONTROL METHOD

US 20260127278 A1

Abstract

A method for preventing unsafe responses of a first language model includes receiving, by a protection model, an input prompt including a prompt directed to the first language model, and classifying, by the protection model, the input prompt into an evaluation class based on the input prompt and training data. The evaluation classes include at least a violate class and a permit class, and the training data includes reference prompts of the violate class. The method further includes preventing input of the input prompt into the first language model when the evaluation class is the violate class, to prevent outputting of unsafe responses by the first language model. The training data includes at least one reference prompt that, when input into the first language model, generates a response that violates a use policy of the first language model. The use policy includes rules that define how the first language model is not to be used.

Inventors

  • Marius CIUREA
  • Chandran ARUMUGAM
  • Oliver Mey
  • Richard KILMURRAY

Assignees

  • VODAFONE GROUP SERVICES LIMITED

Dates

Publication Date
2026-05-07
Application Date
2025-10-21
Priority Date
2024-11-04

Claims (20)

  1. A method for preventing unsafe responses of a first language model, comprising: receiving, by a protection model, an input prompt, wherein the input prompt comprises a prompt directed to the first language model; classifying, by the protection model, the input prompt into an evaluation class based on the input prompt and training data, wherein evaluation classes comprise at least a violate class and a permit class and the training data comprises reference prompts of the violate class, and wherein inputting the reference prompts of the violate class into the first language model results in output of responses that violate a use policy by the first language model; and preventing input of the input prompt into the first language model when the evaluation class is the violate class to prevent outputting of unsafe responses by the first language model, wherein the training data comprises at least one reference prompt of the reference prompts that when input into the first language model generates a response that violates the use policy of the first language model, and wherein the use policy comprises rules that define how the first language model is not to be used.
  2. The method of claim 1, wherein the violate class is one of a plurality of violate classes, and wherein the training data comprises at least one reference prompt for each of the violate classes.
  3. The method of claim 2, wherein the classifying the input prompt comprises: embedding the input prompt and the reference prompts to generate an embedded input prompt and embedded reference prompts; determining a nearest neighbor of the embedded input prompt among the embedded reference prompts; and when the nearest neighbor is an embedded prompt of the violate class and a distance between the nearest neighbor and the embedded input prompt is less than a threshold distance, the evaluation class is the violate class.
  4. The method of claim 3, wherein the determining the nearest neighbor comprises: grouping the embedded reference prompts into one or more violate classes; determining average embedding values for respective ones of the embedded reference prompts of each violate class; and determining the nearest neighbor based on the average embedding values.
  5. The method of claim 3, wherein the training data further comprises reference prompts that, when input into the first language model, cause the first language model to generate a response that is in line with the use policy of the first language model and that are classified as reference prompts of the permit class, and wherein, when the nearest neighbor of the embedded input prompt is an embedded reference prompt of the permit class, or when the nearest neighbor is a reference prompt of the violate class and the distance to the nearest neighbor is larger than the threshold distance, the evaluation class is the permit class.
  6. The method of claim 3, wherein the determining a nearest neighbor comprises at least one of: performing a principal component analysis, performing an approximate nearest neighbor search, performing a cluster analysis, performing a singular value decomposition, or performing a hierarchical navigable small world analysis, and wherein the determining a distance between the nearest neighbor and the input prompt comprises at least one of: applying a cosine distance metric, applying a Euclidean distance metric, or applying an L2 distance.
  7. The method of claim 3, wherein the embedding the input prompt comprises embedding the input prompt using at least one of: a TF-IDF vectorization, a word embedding, a sentence embedding, a first language model-based sentence embedding, audio embedding, image embedding, video embedding, or a multimodal embedding.
  8. A system comprising a protection model for preventing unsafe responses by a first language model according to the method of claim 3, wherein the protection model is configured to receive input data and generate output data, and wherein the input data is the input prompt directed to the first language model and the output data corresponds to the evaluation class determined by the protection model.
  9. The system of claim 8, wherein the protection model comprises an embedding module to compute the embedding, and either a mapping executed by the protection model is an end-to-end mapping and the protection model is configured to take prompts as input data and output the evaluation class as the output data, or the protection model is a combination of the embedding module and a nearest neighbor module, so that the protection model performs the classifying in a stepped manner, the embedding module is configured to receive the input prompts and reference prompts as input data, embed the received prompts, and output embedded prompts as output data, and the nearest neighbor module is configured to compute the evaluation class from the embedded input prompts and the embedded reference prompts by determining the nearest neighbor.
  10. The system of claim 9, wherein the embedding module comprises at least one of a TF-IDF vectorization, a word embedding, a sentence embedding, a language model-based sentence embedding, audio embedding, image embedding, video embedding, or a multimodal embedding.
  11. A method for training the protection model of claim 9, further comprising: receiving training data as input data, wherein the training data is based on the reference prompts and comprises, for each reference prompt, a corresponding annotation that indicates the evaluation class of the respective reference prompt; and optimizing the protection model to output a result in accordance with the annotation.
  12. The method of claim 11, wherein, when the protection model is configured to execute the end-to-end mapping, the training data comprises the reference prompts, and when the embedding module and the nearest neighbor module form the protection model in a stepped manner, the training data comprises embedded reference prompts output by the embedding module as input data for the nearest neighbor module, or the training data comprises the reference prompts and the reference prompts are processed by the embedding module before training the nearest neighbor module.
  13. The method of claim 11, further comprising: selecting a number of reference prompts as training data for the protection model for a plurality of the evaluation classes, wherein a response output by the first language model in response to inputting the reference prompts gives a result in accordance with the evaluation class, so that responses output by the first language model based on the reference prompts annotated as belonging to a violate class violate a use policy of the first language model and responses output by the first language model based on reference prompts annotated as belonging to the permit class are in line with the use policy of the first language model, and the selecting the number of reference prompts is performed such that the training data comprises at least one reference prompt for each of the violate classes.
  14. The method of claim 13, wherein the selecting a number of reference prompts comprises: running the first language model in a test mode, wherein input prompts are directly input into the first language model; receiving, by the first language model, the input prompts; classifying, using a classifier, the responses as at least one of the violate class and the permit class; selecting the input prompts corresponding to responses classified as the violate class as the reference prompts of the violate class and selecting the input prompts corresponding to responses classified as the permit class as reference prompts of the permit class; and collecting the selected input prompts as a basis for the training data.
  15. The method of claim 13, wherein the selecting a number of reference prompts comprises: generating, using an attack model, attack prompts for the first language model; processing, by the first language model, the attack prompts; generating, by a judging model, a judgment result based on the response and the use policy that can be used to determine the evaluation classes; and when the evaluation class is not one of the violate classes, iteratively refining the attack prompt based on the judgment result.
  16. The method of claim 15, wherein the classifying by the judging model comprises: receiving, by the judging model, for each of the attack prompts and at least one rule of the rules of the use policy, a judging prompt as input data, wherein the judging prompt comprises a statement requesting the judging model to evaluate whether the at least one rule of the rules of the use policy is violated by either the respective attack prompt or the corresponding response, each of the rules of the use policy corresponding to one of the violate classes, and wherein the judgment result gives a score according to which it can be determined whether or not the judging prompt violates the respective rule.
  17. The method of claim 16, wherein the judging model is configured to generate the judgment result for the attack prompt and the response, and when the evaluation class of the response is the violate class and the evaluation class of the corresponding attack prompt is also the violate class, the iteratively refining the attack prompt based on the judgment result comprises: generating, using the attack model, manipulated attack prompts based on the attack prompt using a token manipulation, such that the manipulated attack prompt is semantically equivalent to the attack prompt, until an evaluation class of the manipulated attack prompt is the permit class while the evaluation class of the response generated based on the manipulated attack prompt continues to be the violate class.
  18. A non-transitory computer-readable storage medium comprising training data for use in the method of claim 11, wherein, when the protection model is configured to execute the end-to-end mapping, the training data comprises the reference prompts as input data and corresponding annotations, and when the protection model and the embedding module form the protection model in a stepped manner, the training data comprises embedded reference prompts output by the embedding module as input data and the corresponding annotations.
  19. A computing apparatus comprising a processor and a memory storing instructions that, when executed by the processor, configure the apparatus to perform operations comprising: receiving, by the computing apparatus, a query from a user device that is remote from the computing apparatus; receiving, by a protection model of the computing apparatus, an input prompt, wherein the input prompt is included in the query from the user device and comprises a prompt directed to a first language model; classifying, by the protection model, the input prompt into an evaluation class based on the input prompt and training data, wherein evaluation classes comprise at least a violate class and a permit class and the training data comprises reference prompts of the violate class, and wherein inputting the reference prompts of the violate class into the first language model results in output of responses that violate a use policy by the first language model; and preventing input of the input prompt into the first language model when the evaluation class is the violate class to prevent outputting of unsafe responses by the first language model, wherein the training data comprises at least one reference prompt of the reference prompts that when input into the first language model generates a response that violates the use policy of the first language model, and wherein the use policy comprises rules that define how the first language model is not to be used.
  20. A non-transitory computer-readable storage medium including instructions that, when processed by a computer, configure the computer to perform operations comprising: receiving, by the computer, a query from a user device that is remote from the computer; receiving, by a protection model of the computer, an input prompt, wherein the input prompt is included in the query from the user device and comprises a prompt directed to a first language model; classifying, by the protection model, the input prompt into an evaluation class based on the input prompt and training data, wherein evaluation classes comprise at least a violate class and a permit class and the training data comprises reference prompts of the violate class, and wherein inputting the reference prompts of the violate class into the first language model results in output of responses that violate a use policy by the first language model; and preventing input of the input prompt into the first language model when the evaluation class is the violate class to prevent outputting of unsafe responses by the first language model, wherein the training data comprises at least one reference prompt of the reference prompts that when input into the first language model generates a response that violates the use policy of the first language model, and wherein the use policy comprises rules that define how the first language model is not to be used.
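Claims 3 through 7 together describe a centroid-based nearest-neighbor classification over embedded prompts. The sketch below is a minimal, hypothetical illustration of that scheme, not an implementation from the patent: a toy bag-of-words embedding stands in for the TF-IDF or sentence embeddings named in claim 7, class centroids stand in for the average embedding values of claim 4, and the function names, vocabulary, and threshold value are all assumptions made for illustration.

```python
import math
from collections import Counter

def embed(text, vocab):
    # Toy bag-of-words count vector over a fixed vocabulary; a stand-in
    # for the TF-IDF or sentence embeddings named in claim 7.
    counts = Counter(text.lower().split())
    return [counts.get(word, 0) for word in vocab]

def centroid(vectors):
    # Average embedding value per class, as in claim 4.
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine_distance(a, b):
    # One of the distance metrics listed in claim 6.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)

def classify(prompt, class_refs, vocab, threshold=0.5):
    # Claims 3 and 5: the evaluation class is "violate" only when the
    # nearest class centroid belongs to a violate class AND lies within
    # the threshold distance; otherwise the prompt is permitted.
    query = embed(prompt, vocab)
    best_class, best_dist = None, float("inf")
    for label, refs in class_refs.items():
        d = cosine_distance(query, centroid([embed(r, vocab) for r in refs]))
        if d < best_dist:
            best_class, best_dist = label, d
    if best_class != "permit" and best_dist < threshold:
        return "violate"
    return "permit"
```

With toy reference prompts such as `{"harmful": ["how to make a weapon"], "permit": ["what is the weather today"]}`, a query near the harmful centroid classifies as "violate" while unrelated queries fall back to "permit", mirroring the fallback rule of claim 5.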

Description

REFERENCE TO PRIORITY APPLICATION

The present application claims the benefit of European Patent Application No. 24465589.0, filed Nov. 4, 2024, the entire disclosure of which is incorporated herein by reference.

TECHNICAL FIELD

This disclosure relates to methods for preventing unsafe responses by classifying input prompts.

BACKGROUND

Recent applications of large language models, (L)LMs, have shown potential risks, including the generation of misleading information or harmful content as unfiltered output data. Instances of mischief or misuse involve LLMs being used to create fake news, impersonate individuals, or generate offensive material. Multimodal LLMs, which can use text, images, video, audio, or any other data or combination of those as input data as well as generate it as unfiltered output data, could even be used to generate fake pictures, videos, or sounds, which could also be misused.

To mitigate these issues, providers implement safeguards such as filtering mechanisms that detect and block inappropriate prompts, monitoring systems for misuse detection, and content moderation policies. Additionally, some LLM platforms incorporate user feedback loops to continually refine their models' outputs. Providers also work closely with policymakers and researchers to develop industry-wide standards and best practices that ensure responsible use of these powerful tools while balancing the benefits they offer for innovation and progress in various domains.

Traian Rebedea, et al., "NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails," describes a toolkit in which responses are filtered using guardrails, as referred to throughout this application, as a post-processing layer. After the LLM generates a response, the guardrails evaluate the output data against predefined rules and guidelines to determine if it adheres to acceptable conversation boundaries. If the output violates any rules, it can be modified, blocked, or redirected.
This ensures that even if the LLM generates inappropriate or harmful content, it is intercepted and adjusted before reaching the user.

SUMMARY

Even though various ideas to prevent misuse of first language models are already used to improve the safety of their use, an issue remains: prompts in which a guardrail cannot identify any misuse might still end up being processed by the first language model and in turn generate a response that constitutes misuse. A guardrail that also analyses the response would then need to be in place to catch those inappropriate outputs. However, this extra layer, which post-processes the responses, introduces a delay, because a response can only be output to a user after the guardrail has checked compliance with the use policy. Such a post-processing guardrail therefore deteriorates the user experience, as the responses cannot be output to the user in a streamed mode. Furthermore, as first language model processing is quite resource- and bandwidth-intensive, the post-processing of the responses results in preventing output of the responses that are classified as unsafe, and therefore the resources used for generating those responses are wasted.

Some embodiments of the present invention provide a method that improves the user experience of a first language model while still maintaining high safety and fast response times and reducing resource usage.

Various embodiments may be directed towards a method for preventing unsafe responses of a first language model. The method includes receiving, by a server, a query from a user device that is remote from the server. The method further includes receiving, by a protection model of the server, an input prompt, where the input prompt is included in the query from the user device and includes a prompt directed to the first language model.
The method includes classifying, by the protection model, the input prompt into an evaluation class based on the input prompt and training data, where evaluation classes include at least a violate class and a permit class and the training data includes reference prompts of the violate class, and where inputting the reference prompts of the violate class into the first language model results in output of responses that violate a use policy by the first language model. The method includes preventing input of the input prompt into the first language model when the evaluation class is the violate class, to prevent outputting of unsafe responses by the first language model. The training data includes at least one reference prompt of the reference prompts that, when input into the first language model, generates a response that violates the use policy of the first language model. The use policy includes rules that define how the first language model is not to be used.

According to some embodiments, the violate class may be one of a plurality of violate classes. The training data may include at lea
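The summary above gates the first language model behind the protection model, so that a prompt in the violate class never reaches the model at all: no generation resources are spent on it, and permitted responses can be streamed without a post-processing delay. A minimal sketch of that control flow follows; the function names, signatures, and refusal message are hypothetical, chosen only to illustrate the pre-filtering arrangement.

```python
def guarded_generate(prompt, protection_model, language_model,
                     refusal="Request declined by use policy."):
    # Classify the prompt BEFORE it is input into the language model
    # (hypothetical callables). A violate-class prompt is blocked up
    # front, so no generation cost is incurred for it, and a permitted
    # prompt can be answered in streamed mode with no output-side check.
    if protection_model(prompt) == "violate":
        return refusal
    return language_model(prompt)
```

In a deployment matching claim 19, `protection_model` would run on the server receiving the user device's query, and `language_model` would be invoked only for permitted prompts.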