US-20260127216-A1 - User Configurable, Intention Driven Dynamic Guardrail System
Abstract
The present disclosure relates to computer security, specifically systems and methods for intent-based observability and control of Artificial Intelligence (AI) model interactions. The described technology addresses the technical problem of insufficient control and observability over AI interactions, which can lead to unauthorized use and security risks. The solution involves a system that classifies user prompts using AI models, such as Large Language Models (LLMs), to determine user intent and applies granular control policies based on this intent. This enables enterprises to manage AI tool usage effectively, ensuring security and compliance. The system captures AI input data, classifies the data to ascertain user intent, and enforces control policies that include filters, rules, and actions. Principal uses include monitoring AI interactions, applying security policies, and preventing misuse of AI tools. The described technology is applicable across various platforms and AI models, enhancing enterprise security and operational efficiency.
Inventors
- Amr A. Ali
- Gil Spencer
- Ahmed Ewais
- Ibrahim Abdelrahman
Assignees
- WitnessAI, Inc.
Dates
- Publication Date: 2026-05-07
- Application Date: 2025-12-19
Claims (20)
- 1. A computer-implemented method for controlling Artificial Intelligence (AI) model interactions, the method comprising: storing, in a database of a computing system, a plurality of intent classification rules as data entries separate from AI model parameters, each intent classification rule comprising: (a) a label identifying an intent category; (b) a definition describing the intent category in natural language; and (c) an associated control policy action; training an AI classification model to perform generalized semantic comparison between arbitrary input text and arbitrary definition text without training the AI classification model on specific intent categories; intercepting input data directed from a user to a target AI model; classifying the input data using the AI classification model by: (i) retrieving the plurality of intent classification rules from the database; (ii) performing semantic comparison between the input data and the definitions of the retrieved intent classification rules using the AI classification model; and (iii) identifying a matching intent classification rule based on semantic similarity between the input data and the definitions; applying the control policy action associated with the matching intent classification rule to control transmission of the input data to the target AI model; and enabling real-time modification of the plurality of intent classification rules by adding, modifying, or deleting data entries in the database without retraining the AI classification model.
- 2. The method of claim 1, wherein the control policy action comprises one or more of: blocking transmission of the input data to the target AI model; allowing transmission of the input data to the target AI model; generating a warning based on the matching intent classification rule; routing the input data to a different target AI model; sending the input data to a security information and event management (SIEM) system; and calling a third-party application programming interface (API).
- 3. The method of claim 1, further comprising: providing a graphical user interface to an administrator; receiving, via the graphical user interface, administrator input specifying a new label, a new definition, and a new control policy action for a new intent classification rule; storing the new intent classification rule in the database; and immediately classifying subsequent input data using the new intent classification rule without retraining the AI classification model and without system downtime.
- 4. The method of claim 1, further comprising: receiving a response generated by the target AI model in reply to the input data; classifying the response using the AI classification model by comparing the response against the plurality of intent classification rules to identify a second matching intent classification rule; and filtering content from the response based on a control policy action associated with the second matching intent classification rule before the response reaches the user.
- 5. The method of claim 1, wherein the AI classification model identifies the matching intent classification rule despite the input data containing one or more of: typographical errors, synonyms, paraphrasing, and implicit language.
- 6. The method of claim 1, wherein applying the control policy action comprises using one or more protection filters selected from: data protection, model protection, and behavioral protection.
- 7. The method of claim 1, the control policy action comprising routing the input data to a selected AI model from a plurality of available AI models based on the matching intent classification rule, data sensitivity of the input data, and an enterprise security policy, the routing comprising directing input data classified as high-risk to a secure internal AI model and directing input data classified as low-risk to a public AI model.
- 8. The method of claim 1, further comprising detecting behavior of the user by: determining intent classification of a plurality of input data entered by the user; aggregating the intent classifications of the plurality of input data; comparing the aggregated intent classifications to a risk threshold for an enterprise; and generating an enterprise action based on the risk threshold.
- 9. The method of claim 1, further comprising logging one or more of: the matching intent classification rule, the determined intent category, and the control policy action.
- 10. The method of claim 1, wherein classifying the input data comprises: generating, using a transformer-based encoder of the AI classification model, a semantic representation of the input data; generating semantic representations of the definitions by processing each definition through the transformer-based encoder; computing similarity scores between the semantic representation of the input data and the semantic representations of the definitions using an attention mechanism of the transformer-based encoder; and selecting the matching intent classification rule as the intent classification rule having a highest similarity score that exceeds a predetermined threshold.
- 11. A computer system for controlling Artificial Intelligence (AI) model interactions, the system comprising: one or more processors; a database configured to store a plurality of intent classification rules as structured data entries separate from AI model parameters, each intent classification rule comprising: a label field storing a label identifying an intent category; a definition field storing a natural language definition describing the intent category; and a policy action field storing a control policy action associated with the intent category; an AI classification model comprising: (a) a trained transformer-based encoder with fixed model parameters encoding semantic comparison capabilities rather than specific intent category recognition; and (b) an output component configured to generate similarity scores between input data and intent category definitions; a prompt capture component configured to intercept input data directed from a user to a target AI model; a runtime comparison processor configured to: (i) retrieve the plurality of intent classification rules from the database; (ii) provide the input data and the definitions from the retrieved intent classification rules to the AI classification model and receive similarity scores from the AI classification model; and (iii) identify a matching intent classification rule based on the similarity scores; and a policy enforcement component configured to apply the control policy action associated with the matching intent classification rule to control transmission of the input data to the target AI model; the database being further configured to receive updates to the plurality of intent classification rules without requiring modification to the fixed model parameters of the AI classification model.
- 12. The system of claim 11, wherein the control policy action comprises one or more of: blocking transmission of the input data to the target AI model; allowing transmission of the input data to the target AI model; generating a warning based on the matching intent classification rule; routing the input data to a different target AI model; sending the input data to a security information and event management (SIEM) system; and calling a third-party application programming interface (API).
- 13. The system of claim 11, wherein the one or more processors are further configured to enable an administrator to add, modify, or remove intent classification rules in real-time, and wherein the AI classification model uses the added, modified, or removed intent classification rules without retraining the AI classification model.
- 14. The system of claim 11, wherein the one or more processors are further configured to: receive a response generated by the target AI model in reply to the input data; classify the response using the AI classification model by comparing the response against the plurality of intent classification rules based on the definitions to identify a matching intent classification rule; and apply a control policy action associated with the matching intent classification rule to the response before the response reaches the user.
- 15. The system of claim 11, wherein the AI classification model is configured to identify the matching intent classification rule despite the input data containing one or more of: typographical errors, synonyms, paraphrasing, and implicit language.
- 16. The system of claim 11, wherein applying the control policy action comprises using one or more protection filters selected from: data protection, model protection, and behavioral protection.
- 17. The system of claim 11, wherein the policy enforcement component is configured to route the input data to a selected AI model from a plurality of available AI models based on the matching intent classification rule, data sensitivity of the input data, and an enterprise security policy, the routing comprising directing input data classified as high-risk to a secure internal AI model and directing input data classified as low-risk to a public AI model.
- 18. The system of claim 11, wherein the one or more processors are further configured to detect behavior of the user by: determining intent classification of a plurality of input data entered by the user; aggregating the intent classifications of the plurality of input data; comparing the aggregated intent classifications to a risk threshold for an enterprise; and generating an enterprise action based on the risk threshold.
- 19. The system of claim 11, wherein the one or more processors are further configured to log one or more of: the matching intent classification rule, the determined intent category, and the control policy action.
- 20. A non-transitory computer-readable medium storing instructions that, when executed by one or more processors of a computing system, cause the computing system to: store, in a database of a computing system, a plurality of intent classification rules as data entries separate from AI model parameters, each intent classification rule comprising: a label identifying an intent category; a definition describing the intent category in natural language; and an associated control policy action; train an AI classification model to perform generalized semantic comparison between arbitrary input text and arbitrary definition text without training the AI classification model on specific intent categories; intercept input data directed from a user to a target AI model; classify the input data using the AI classification model by: retrieving the plurality of intent classification rules from the database; performing semantic comparison between the input data and the definitions of the retrieved intent classification rules using the AI classification model; and identifying a matching intent classification rule based on semantic similarity between the input data and the definitions; apply the control policy action associated with the matching intent classification rule to control transmission of the input data to the target AI model; and enable real-time modification of the plurality of intent classification rules by adding, modifying, or deleting data entries in the database without retraining the AI classification model.
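The mechanism recited in claims 1, 10, and 11 — intent rules stored as plain data entries (label, definition, action) that a fixed semantic-comparison model evaluates at runtime — can be sketched as follows. This is a minimal illustration, not the disclosed implementation: a bag-of-words cosine similarity stands in for the trained transformer encoder, and the rule labels, definitions, and threshold are invented for the example.

```python
from dataclasses import dataclass
import math

@dataclass
class IntentRule:
    label: str        # (a) intent category label
    definition: str   # (b) natural-language definition of the category
    action: str       # (c) associated control policy action

def embed(text: str) -> dict[str, float]:
    # Stand-in for the fixed-parameter transformer encoder of the claims:
    # a bag-of-words vector. A real deployment would use trained
    # sentence embeddings, not token counts.
    vec: dict[str, float] = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0.0) + 1.0
    return vec

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(prompt: str, rules: list[IntentRule], threshold: float = 0.2):
    # Steps (i)-(iii) of claim 1: retrieve the rules, score the prompt
    # against every definition, and pick the best match above the
    # threshold; None means no intent matched.
    scored = [(cosine(embed(prompt), embed(r.definition)), r) for r in rules]
    best_score, best_rule = max(scored, key=lambda s: s[0], default=(0.0, None))
    return best_rule if best_score >= threshold else None

# Rules live in ordinary data storage (here, a list standing in for the
# database), so adding one is a data write, not a retraining run.
rules = [
    IntentRule("code_exfiltration", "user pastes proprietary source code", "block"),
    IntentRule("general_question", "user asks a general knowledge question", "allow"),
]
match = classify("please review our proprietary source code", rules)
action = match.action if match else "allow"
```

Because classification only reads the rule list at runtime, adding, modifying, or deleting an `IntentRule` takes effect on the next prompt with no model retraining, which is the real-time modification recited in claims 1, 3, and 13.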
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This patent application is a continuation-in-part of U.S. Non-Provisional patent application Ser. No. 19/056,582, filed on Feb. 18, 2025, and titled “Systems and Methods for Intent Based Observability and Control of Artificial Intelligence (AI) Model Interactions.” U.S. Non-Provisional patent application Ser. No. 19/056,582 claims the benefit of priority under 35 U.S.C. § 119(e) to U.S. Provisional Patent Application Ser. No. 63/717,878, filed on Nov. 7, 2024. Each of the above-identified applications is hereby incorporated by reference herein in its entirety, including all references cited therein.
FIELD OF THE TECHNOLOGY
The present technology relates to computer security, particularly to systems and methods for intent-based observability and control of Artificial Intelligence (AI) model interactions.
BACKGROUND
Existing methods for controlling and observing interactions with Artificial Intelligence (AI) models have primarily focused on general monitoring and management of AI systems without specific consideration for user intent. Traditional approaches involve monitoring system performance metrics, such as accuracy and efficiency, to ensure the AI model is functioning as expected. Additionally, some methods utilize predefined rules and thresholds to trigger alerts or actions based on certain conditions or events within the AI system. However, these approaches lack the ability to provide granular control over individual user interactions with the AI model based on the specific intent behind each user input. In the context of AI systems, the use of Large Language Models (LLMs) has gained popularity for natural language processing tasks. LLMs leverage vast amounts of text data to understand user inputs. Current approaches do not provide a systematic framework for applying fine-grained control policies that include filters, rules, and actions tailored to specific user intents in real-time interactions with AI models.
Moreover, the need for enhanced observability and control over AI model interactions has become increasingly critical as AI technologies are integrated into various applications and services. The ability to interpret user intents accurately and apply precise control policies based on those intents is essential for ensuring the responsible and effective deployment of AI systems. Existing solutions have not fully addressed the challenges associated with intent-based observability and control of AI model interactions. Therefore, there is a demand for a comprehensive solution that combines intent classification using Artificial Intelligence (AI) models (e.g., LLMs) with granular control policies to enable effective management of user interactions with AI models. None of the previous approaches have provided a comprehensive solution that combines the features described in this disclosure. Consequently, there is a need for a system that can provide intent-based observability and control, enabling enterprises to manage the use of AI model tools effectively while maintaining security and compliance.
SUMMARY
The accompanying drawings, where like reference numerals refer to identical or functionally similar elements throughout the separate views, together with the detailed description below, are incorporated in and form part of the specification, and serve to further illustrate embodiments of concepts that include the claimed disclosure, and explain various principles and advantages of those embodiments.
Some embodiments include a computer-implemented method for intent-based observability and control of Artificial Intelligence (AI) model interactions, the method including: receiving Artificial Intelligence (AI) input data entered by a user, the Artificial Intelligence (AI) input data including a prompt; classifying the prompt using an Artificial Intelligence (AI) model to determine an intent of the prompt entered by the user; and applying a granular control Artificial Intelligence (AI) policy to the intent of the prompt entered by the user. In some embodiments the Artificial Intelligence (AI) model includes a Large Language Model (LLM). In some embodiments the classifying the prompt using the Artificial Intelligence (AI) model to determine the intent of the prompt entered by the user includes fine-grained intention classification that provides a precise intent classification of the prompt entered by the user, the Artificial Intelligence (AI) model being a Machine Learning (ML) model. In some embodiments the classifying the prompt using the Artificial Intelligence (AI) model to determine the intent of the prompt entered by the user includes coarse intention classification that provides a coarse intent classification of the prompt entered by the user by the intent of the prompt being chosen from a predetermined list of intents using the Artificial Intelligence (AI) model, the Artificial Intelligence (AI) model being a Machine Learning (ML) model. In some embodiments th
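The coarse intention classification described above — choosing the prompt's intent from a predetermined list and then applying a granular control policy — might be sketched as below. The intent labels, policy actions, and keyword cues are illustrative assumptions; in the described embodiments an AI model such as an LLM, not a keyword stub, performs the classification.

```python
# Predetermined intent list with per-intent policy actions. These labels
# and actions are invented for the example, not taken from the disclosure.
POLICIES = {
    "share_sensitive_data": "block",
    "generate_code": "warn",
    "ask_question": "allow",
}

def classify_coarse(prompt: str) -> str:
    # Stand-in for the classifier model: the described system has an AI
    # model choose one intent from the predetermined list; this stub keys
    # on simple cues so the sketch runs without a model.
    text = prompt.lower()
    if "password" in text or "api key" in text:
        return "share_sensitive_data"
    if "write a function" in text or "code" in text:
        return "generate_code"
    return "ask_question"

def apply_policy(prompt: str) -> tuple[str, str]:
    # Classify the prompt, then apply the granular control policy
    # associated with that coarse intent.
    intent = classify_coarse(prompt)
    return intent, POLICIES.get(intent, "allow")

print(apply_policy("what is our database password"))  # -> ('share_sensitive_data', 'block')
```

A fine-grained variant would replace the fixed label set with free-form intent descriptions scored by the model, as in the semantic-comparison claims; the coarse form shown here trades precision for a small, auditable policy table.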