EP-4738183-A1 - LANGUAGE MODEL SAFETY CONTROL METHOD

EP4738183A1EP 4738183 A1EP4738183 A1EP 4738183A1EP-4738183-A1

Abstract

Computer implemented method for preventing unsafe responses of a first language model, comprising: receiving, by a protection model, an input prompt, the input prompt comprising a prompt directed to the first language model, classifying, by the protection model, the input prompt into an evaluation class based on the input prompt and training data, wherein evaluation classes comprise at least a violate class and a permit class and the training data comprising reference prompts of the violate class, wherein inputting the reference prompts of the violate class into the first language model would result in output of responses that violate the use policy by the first language model, preventing input of the input prompt into the first language model when the evaluation class is the violate class to prevent outputting of unsafe responses by the first language model, characterized in that the training data comprise at least one reference prompt that when input into the first language model generates a response that violates a use policy of the first language model, wherein the use policy comprises rules that define how the first language model is not to be used.

Inventors

CIUREA, Marius
ARUMUGAM, Chandran
Mey, Oliver
KILMURRAY, Richard

Assignees

Vodafone Group Services Limited

Dates

Publication Date: 20260506
Application Date: 20241104

Claims (20)

Computer implemented method for preventing unsafe responses of a first language model, comprising: receiving, by a protection model, an input prompt, the input prompt comprising a prompt directed to the first language model, classifying, by the protection model, the input prompt into an evaluation class based on the input prompt and training data, wherein evaluation classes comprise at least a violate class and a permit class and the training data comprising reference prompts of the violate class, wherein inputting the reference prompts of the violate class into the first language model would result in output of responses that violate the use policy by the first language model, preventing input of the input prompt into the first language model when the evaluation class is the violate class to prevent outputting of unsafe responses by the first language model, characterized in that the training data comprise at least one reference prompt that when input into the first language model generates a response that violates a use policy of the first language model, wherein the use policy comprises rules that define how the first language model is not to be used.
The method of claim 1, wherein there are a plurality of violate classes and the training data comprise at least one reference prompt for each of the violate classes.
The method of claim 2 wherein the classifying the input prompt comprises: embedding the input prompt and the reference prompts to generate an embedded input prompt and embedded reference prompts, determining a nearest neighbor of the embedded input prompt among the embedded reference prompts, and when the nearest neighbor is an embedded prompt of the violate class and a distance between the nearest neighbor and the embedded input prompt is smaller than a threshold distance, the evaluation class is the violate class.
The method of claim 3 wherein the determining a nearest neighbor comprises grouping the embedded reference prompts into one or more violate classes, determining an average embedding value for the embedded reference prompts of each violate class and determining the nearest neighbor based on the determined average embedding values.
The method of any one of claims 1 to 4 wherein the training data further comprises reference prompts that when input into the first language model generate a response that is in line with the use policy of the first language model and are classified as reference prompts of the permit class and when the nearest neighbor of the embedded input prompt is an embedded reference prompt of the permit class or when a nearest neighbor is a reference prompt of the violate class and the distance to the nearest neighbor is larger than the threshold distance, the evaluation class is the permit class.
The method of any one of claims 2 to 5 wherein the determining a nearest neighbor comprises at least one of: - performing a principal component analysis, - performing an approximate nearest neighbor search, - performing a cluster analysis, - performing a singular value decomposition, and - performing a hierarchical navigable small world analysis, and the determining a distance between the nearest neighbor and the input prompt comprises at least one of: - applying a cosine distance metric, - applying a Euclidian distance metric, - applying an L2 distance.
The method of any one of claims 1 to 6 wherein the embedding the input prompt comprises at least one of embedding the input prompt using at least any one of: - a TF-IDF vectorization, - a word embedding, - a sentence embedding, - a first language model-based sentence embedding - audio embedding, - image embedding, - video embedding, or - a multimodal embedding.
A protection model for use in the method for prefenting unsafe responses by a first language model according to any one of claims 1 to 7, wherein the protection model is configured to receive input data and generate output data, wherein the input data is the input prompt directed to the first language model and the output data corresponds to the evaluation class determined by the protection model.
The protection model of claim 8, wherein the protection model either comprises an embedding module to compute the embedding, and a mapping executed by the protection model is an end-to-end mapping and the protection model is configured to take prompts as input data and output the evaluation class as the output data; or the protection model is a combination of the embedding module and a nearest neighbor module, so that the protection model performs the classifying in stepped manner, the embedding module is configure to receive the input prompts and reference prompts as input data, embed the received prompts and output embedded prompts as output data, and the nearest neighbor module configured to compute the evaluation class from the embedded input prompts and the embedded reference prompts by determining the neareast neighbor.
The protection model of claim 8 or 9 wherein the embedding module comprises at least one of a TF-IDF vectorization, a word embedding, a sentence embedding, a language model-based sentence embedding, audio embedding, image embedding, video embedding, a multimodal embedding.
A computer implemented method for training the protection model of any one of claims 8 to 10 comprising: receiving training data as input data, wherein the training data is based on the reference prompts and comprises, for each reference prompt, a corresponding annotation that indicate the evaluation class of the respective reference prompt, optimizing the protection model to output a result in accordance with the annotation.
The method for training the protection model of claim 11, wherein when the protection model is configured to execute the end-to-end mapping, the training data comprises the reference prompts and when the protection model and the embedding module form the protection model in a stepped manner, the training data may comprise embedded reference prompts output by the embedding module as input data for the nearest neighbor module or the training data comprises the reference prompts and the reference prompts need to be processed by the embedding module before training the nearest neighbor module.
A computer implemented method for generating training data for a protection model, particular for use in the method of claim 11 or 12, comprising: selecting a number of reference prompts as training data for the protection model for a plurality of the evaluation classes, wherein a response output by the first language model in response to inputting the reference prompts gives a result in accordance with the evaluation class, so that responses output by the first language model based on the reference prompts annotated as belonging to a violate class violate a use policy of the first language model and responses output by the first language model based on reference prompts annotated as belonging to the permit class are in-line with the use policy of the first language model and the selecting a number of reference prompts is performed such that the training data comprises at lease one reference prompt for each of the violate classes.
The method of claim 13, wherein the selecting a number of reference prompts comprises: running the first language model in a test mode, wherein input prompts directly input into the first language model, receiving, by the first language model, input prompts, classifying, using a classifier, the responses as at least one of the violate class and the permit class, selecting the input prompts corresponding to responses classified as the violate class as the reference prompts of the violate class and selecting the input prompts corresponding to responses classified as the permit class as reference prompts of the permit class, and collecting the selected input prompts as basis for the training data.
The method of claim 13, wherein the selecting a number of reference prompts comprises: generating, using an attack model, attack prompts for the first language model; processing, by the first language model, the attack prompts; generating, by a judging model, a judgement result based on the response and the use policy that can be used to determined the evaluation classes; and when the evaluation class is not one of the violate classes: iteratively refining the attack prompt based on the judgement result.
The method of claim 15, wherein the classifying by the judging model comprises: receiving, by the judging model, for each of the attack prompts and at least one rule of rules of the use policy a judging prompt as input data, wherein the judging prompt comprises a statement requesting the judging model to evaluate whether the at least one rule of the rules of the use policy is violated by either the respective attack prompt or the corresponding response, each of the rules of the use policy corresponding to one of the violate classes, wherein the judging result gives a score according to which it can be determined whether or not the judging prompt violates the respective rule.
The method of claim 16, wherein the judging model generates the judgement result for the attack prompt and the response, and when the evaluation class of the response is the violate class and the evaluation class of the corresponding attack prompt is also the violate class the iteratively refining the attack prompt based on the judgement result comprises: generating, using the attack model, manipulated attack prompts based on the attack prompt using a token manipulation, such that the manipulated attack prompt is semantically equivalent to the attack prompt, until an evaluation class of the manipulated attack prompt is the permit class while the evaluation class of the response generated based on the manipulated attack prompt maintains to be the violate class.
Training data for use in the method of claim 11 or 12, and when the protection model is configured to execute the end-to-end mapping, the training data comprises the reference prompts as input data and corresponding annotations, and when the protection model and the embedding model form the protection model in a stepped manner, the training data comprises embedded reference prompts output by the embedding model as input data and the corresponding annotations.
A computing apparatus including a processor and a memory storing instructions that, when executed by the processor, configure the apparatus to perform the method of any one of claims 1-7 or 11-17.
A non-transitory computer-readable storage medium including instructions that, when processed by a computer, configure the computer to perform the method of any one of claims 1-7 or 11-17.

Description

TECHNICAL FIELD This disclosure relates to methods for preventing unsafe responses by classifying input prompts. BACKGROUND Recent applications of large language models, (L)LMs, have shown potential risks, including the generation of misleading information or harmful content as unfiltered output data. Instances of mischief or misuse involve LLMs being used to create fake news, impersonate individuals, or generate offensive material. Multimodal LLMs, which can use text, images, video, audio or any other data or combination of those as input data as well as generate it as unfiltered output data, could even be used to generate fake pictures, videos or sounds, which could also be misused. To mitigate these issues, providers implement safeguards such as filtering mechanisms that detect and block inappropriate prompts, monitoring systems for misuse detection, and content moderation policies. Additionally, some LLM platforms incorporate user feedback loops to refine their models' outputs continually. Providers also work closely with policymakers and researchers to develop industry-wide standards and best practices that ensure responsible use of these powerful tools while balancing the benefits they offer for innovation and progress in various domains. Traian Rebedea, et al: "NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails" describes a toolkit in which responses are filtered using Guardrails throughout this application, as a post processing layer. After the LLM generates a response, the Guardrails evaluate the output data against predefined rules and guidelines to determine if it adheres to acceptable conversation boundaries. If the output violates any rules, it can be modified, blocked, or redirected. This ensures that even if the LLM generates inappropriate or harmful content, it is intercepted and adjusted before reaching the user. SUMMARY OF INVENTION Even though various ideas to prevent misuse of first language models are already used to improve safety of use of first language models, there is still an issue with the usage of these tools in that prompts in which a guardrail cannot identify any misuse might still end up being processed by the first language model and in turn generate a response that constitutes misuse. In turn, a guardrail that also analyses the response would need to be in place and catch those inappropriate outputs. However, this extra layer, that post processes the responses introduces a delay, because the response can only be output to a user after the guardrail checked compliance with the use policy, therefore such a post processing guardrail deteriorates user experience, as the responses cannot be output to the user in a streamed mode. Furthermore, as first language model processing is quite resource and bandwidth intense, the post processing of the responses will result in preventing output of the responses that are classified as unsafe and therefore, the resources used for generating the responses classified as unsafe are wasted. The object of the present invention is to provide a method that increases a user experience of a first language model while still maintaining high safety, fast response times and still reduces resource usage. In one aspect, a computer implemented method for preventing unsafe responses of a first language model, comprising: receiving, by a protection model, an input prompt, the input prompt comprising a prompt directed to the first language model,classifying, by the protection model, the input prompt into evaluation classes based on the input prompt and training data, wherein the evaluation classes comprise at least a violate class and a permit class and the training data comprising reference prompts of the violate class, inputting the reference prompts the violate class into the first language model resulting in output of unsafe responses by the first language model,preventing input of the input prompt into the first language model when the evaluation class is the violate class to prevent outputting of unsafe responses by the first language model, characterized in that the training data comprises at least one reference prompt that when input into the first language model generates a response that violates a use policy of the first language model and thus constitutes an unsafe response, wherein particularly the use policy comprises rules that define how the first language model is not to be used. "Unsafe response" refers to a response that violates at least one use policy. "First language model" refers to a trained model that was trained for processing input prompts. The first language model may be a large first language model, LLM, a small first language model or a multimodal LLM. "Language model" refers to a trained model, specifically a trained machine learning model, that was trained for processing input prompts, the processing of the language model often comprising natural language processing. The language model