US-12625892-B1 - Controlled probabilistic sentence space expansion
Abstract
The disclosed embodiments describe a method, system, and computer-readable medium for reducing hallucinations in a large language model in the field of natural language processing. The method involves identifying a plurality of input-output pairs, each input-output pair assigned a probability and comprising an example input text string and a corresponding output text string; sampling a set of input-output pairs from the plurality of input-output pairs based on the assigned probabilities; generating a list of aggregated input-output pairs from the sampled set; generating one or more queries by iteratively replacing identified variables within each aggregated input-output pair with a different value from a database; and training, using the one or more queries, a first large language model to convert input queries to machine-readable prompts configured for input into a second large language model.
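The training-data pipeline summarized in the abstract (probability-weighted sampling of template input-output pairs, then iterative variable substitution from a database) can be sketched as follows. This is a minimal illustration: the pair templates, placeholder names, and database contents are invented for the sketch and are not taken from the disclosure.

```python
import random

# Each pair: (assigned probability, input template, output template).
# All templates and field values below are hypothetical.
pairs = [
    (0.6, "What is the balance of {account}?",
     '{"intent": "balance", "account": "{account}"}'),
    (0.3, "Show transactions for {account} in {month}",
     '{"intent": "transactions", "account": "{account}", "month": "{month}"}'),
    (0.1, "Close {account}",
     '{"intent": "close", "account": "{account}"}'),
]

# Stand-in for the reference database of valid variable values.
database = {"account": ["checking-001", "savings-002"],
            "month": ["January", "February"]}

def sample_pairs(pairs, k, rng):
    """Sample input-output pairs according to their assigned probabilities."""
    weights = [p for p, _, _ in pairs]
    return rng.choices(pairs, weights=weights, k=k)

def expand(template_in, template_out, database, rng):
    """Replace each {variable} placeholder with a value from the database,
    keeping the input and output consistent."""
    for var, values in database.items():
        token = "{" + var + "}"
        if token in template_in:
            value = rng.choice(values)
            template_in = template_in.replace(token, value)
            template_out = template_out.replace(token, value)
    return template_in, template_out

rng = random.Random(0)
training_queries = [
    expand(tin, tout, database, rng)
    for _, tin, tout in sample_pairs(pairs, k=5, rng=rng)
]
for q_in, q_out in training_queries:
    print(q_in, "->", q_out)
```

The expanded (input, output) pairs would then serve as training examples for the first language model.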
Inventors
- Samuel Atkins
- Giacomo Domeniconi
- Ali Fathi
Assignees
- U.S. Bancorp, National Association
Dates
- Publication Date: 2026-05-12
- Application Date: 2025-09-29
Claims (20)
- 1. A system for enhancing the reliability of a natural language processing pipeline, the system comprising: a memory; and one or more processors configured to execute instructions stored in the memory to: receive, via a first language model, a natural language query and convert the natural language query into a structured output comprising a machine-readable prompt in a structured data format, wherein each entity extracted from the natural language query is represented as a key-value pair; validate, via a validation module communicatively coupled to the first language model, the structured output against one or more predetermined rules or a reference database, wherein the validation comprises performing schema and data integrity checks on the structured output, including comparing an entity in the structured output to a database record to verify accuracy; and upon successful validation, forward the validated structured output to a second language model configured to convert the structured output to a query language format for generating a machine-executable database query, and upon failed validation, invoke an error handling operation.
- 2. The system of claim 1, wherein the validation module is further configured to log each validation result, including a timestamp and an identity of the natural language query, in a persistent audit memory.
- 3. The system of claim 1, wherein the validation module performs the validation using both rule-based checks and data consistency checks with an external database.
- 4. The system of claim 1, wherein the first language model is trained to produce the structured output in JavaScript Object Notation (JSON) or Extensible Markup Language (XML) format.
- 5. The system of claim 1, wherein the error handling operation comprises generating a notification to a system administrator or user indicating failure and reason for rejection.
- 6. The system of claim 1, wherein the controller further initiates retraining of at least one of the language models based on patterns in failed validation outcomes.
- 7. The system of claim 1, wherein the second language model is configured to use the structured output to construct a machine-executable query in a query language.
- 8. The system of claim 7, wherein the query language is Structured Query Language (SQL), PostgreSQL, or a NoSQL database query language.
- 9. The system of claim 1, wherein the validation module performs schema validation to ensure the structured output contains all required fields for downstream data retrieval.
- 10. A computer-implemented method for enhancing the reliability of a natural language processing pipeline, the method comprising: receiving, by a first language model, a natural language query; generating, by the first language model, a structured output from the natural language query, the structured output comprising a machine-readable prompt in a structured data format, wherein each entity extracted from the natural language query is represented as a key-value pair; validating, by a validation module, the structured output against a reference rule set or database, the validating comprising determining whether the structured output conforms to a predetermined schema and contains valid entity values and performing schema and data integrity checks on the structured output, including comparing an entity in the structured output to a database record to verify accuracy; upon validation success, forwarding, by a controller, the structured output to a second language model configured to convert the structured output to a query language format for generating a machine-executable database query; and upon validation failure, invoking, by the controller, an error handling routine.
- 11. The method of claim 10, further comprising recording, by the validation module, validation results and corresponding queries in an audit log stored in non-transitory memory.
- 12. The method of claim 10, wherein the error handling routine comprises transmitting a message to a remote user device indicating a validation failure.
- 13. The method of claim 10, wherein the validating further comprises comparing a predicted entity in the structured output to a lookup table or database record to verify accuracy.
- 14. The method of claim 10, wherein the second language model generates a machine-executable query based on the structured output for retrieving information from a database.
- 15. The method of claim 14, further comprising executing the machine-executable query on a database to obtain a result set.
- 16. The method of claim 10, further comprising, upon detection of recurring validation failures for a specific input pattern, triggering retraining of the first language model using augmented training data.
- 17. The method of claim 10, wherein validation comprises both format validation and semantic validation.
- 18. A non-transitory computer-readable medium storing instructions that, when executed by one or more processors, cause the processors to enhance the reliability of a natural language processing pipeline by: receiving a natural language query; processing the natural language query using a first language model to convert the natural language query into a structured output comprising a machine-readable prompt in a structured data format, wherein each entity extracted from the natural language query is represented as a key-value pair; validating the structured output, using a validation module, according to a set of predefined rules or by referencing a database, wherein the validation comprises performing schema and data integrity checks on the structured output, including comparing an entity in the structured output to a database record to verify accuracy; routing the structured output, if validation is successful, to a second language model configured to convert the structured output to a query language format for generating a machine-executable database query; and if validation fails, initiating an error response, including recording the failure in an audit trail.
- 19. The non-transitory computer-readable medium of claim 18, wherein the instructions further cause the processors to send a notification to a user or administrator upon a validation failure.
- 20. The non-transitory computer-readable medium of claim 18, wherein the instructions further cause the processors to update the validation rules based on received user feedback or detected error patterns.
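The pipeline recited in the claims above can be sketched as follows, assuming the first language model has already emitted a structured output. The schema fields, reference database, and query template are hypothetical stand-ins, and stubs take the place of the two language models; a real embodiment would invoke actual models at those points.

```python
import time

REQUIRED_FIELDS = {"intent", "customer_id"}   # assumed schema for the sketch
CUSTOMER_DB = {"C-1001", "C-1002"}            # stand-in reference database
AUDIT_LOG = []                                # persistent audit memory (claim 2)

def validate(structured_output):
    """Schema and data-integrity checks (claims 1, 9, 13)."""
    missing = REQUIRED_FIELDS - structured_output.keys()
    if missing:
        return False, f"missing fields: {sorted(missing)}"
    if structured_output["customer_id"] not in CUSTOMER_DB:
        return False, "customer_id not found in reference database"
    return True, "ok"

def to_sql(structured_output):
    """Stub for the second language model converting the validated
    structured output to a parameterized, machine-executable query."""
    return ("SELECT * FROM transactions WHERE customer_id = %s",
            (structured_output["customer_id"],))

def run_pipeline(structured_output):
    ok, reason = validate(structured_output)
    AUDIT_LOG.append({"input": structured_output, "valid": ok,
                      "reason": reason, "timestamp": time.time()})
    if not ok:
        return {"error": reason}          # error handling path (claims 1, 5)
    return to_sql(structured_output)      # forward to the second model

print(run_pipeline({"intent": "transactions", "customer_id": "C-1001"}))
print(run_pipeline({"intent": "transactions", "customer_id": "C-9999"}))
```

The first call passes both checks and yields a query; the second fails the database-consistency check, and both outcomes are recorded in the audit log.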
Description
CROSS-REFERENCE TO RELATED APPLICATION(S)

This application claims the benefit of priority as a Continuation of U.S. Nonprovisional patent application Ser. No. 19/212,571, filed May 19, 2025, the disclosure of which is incorporated by reference herein in its entirety and for all purposes.

BACKGROUND

Functional representation tasks generally involve the transformation of unstructured textual input into a structured data format, such as JavaScript Object Notation (JSON), eXtensible Markup Language (XML), or YAML Ain't Markup Language (YAML). The specific structured format utilized may vary depending on the nature of the task and the requirements of the downstream system. In practical applications, functional representation may require predicting several label types, including continuous values, one-hot encoded vectors, and binary indicators, from diverse sentence inputs. When processing unvetted user-generated text, the variability and breadth of possible inputs result in a vast search space. Often, in early stages of product development, large volumes of high-quality, task-specific training data are not readily available. Synthetic data can be employed but may fail to adhere to a desired structure and thereby cause the trained model to produce hallucinated output.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings are not intended to be drawn to scale. Like reference numbers and designations in the various drawings indicate like elements. For purposes of clarity, not every component may be labeled in every drawing. In the drawings:

FIG. 1 shows a high-level system for reducing hallucinations in a large language model, according to an example embodiment.
FIG. 2A illustrates an example flowchart of a process for generating training data for training a large language model, in accordance with an implementation.
FIG. 2B illustrates an example flowchart of a process for executing a trained LLM, in accordance with an implementation.
FIG. 3 illustrates an example pipeline of a process for generating training data for training a large language model, in accordance with an implementation.
FIG. 4 discloses a computing environment in which aspects of the present disclosure may be implemented.
FIG. 5 illustrates an example machine learning framework that techniques described herein may benefit from.

DETAILED DESCRIPTION

The example embodiments of this invention involve methods, systems, and computer program products for reducing hallucinations in a large language model. Samples are generated and used to train models that convert textual input to structured formats for downstream tasks, such as generating a visual representation of a data structure obtained from a structured format query.

Functional representation is a branch of text analysis in which data are represented as sentence-label pairs. The sentences can be transformed into tokens, and models can utilize these tokens to predict a label. The label can have various forms depending on the context. Approaches to functional representation tasks include: tuned transformer models trained to predict structured text that captures the underlying label of input sentences; sentence embedding models that map sentence input to high-dimensional embedding vectors and predict labels associated with each sentence; full-view encoder-predictor models that encode the input into a higher-dimensional feature space and transform the feature space into a label prediction; and Named-Entity Recognition (NER), in which labels are explicitly mentioned in the input sentence and models can assign labels to individual words or phrases. Such approaches to functional representation tasks rely on datasets that adequately capture the possible form, grammar, and vocabulary of the input sentences. However, such datasets are often scarce or imbalanced, especially in emerging or specialized application areas, leading to poor generalization and reduced model performance. Further, the labels associated with the input sentences may not cover a large enough portion of the possible combinations of data labels.

Labeled text data can be augmented by methods such as manual or semi-supervised data generation and labeling by domain experts. This process, however, is time-consuming, expensive, and error-prone. Other methods include rudimentary data augmentation techniques such as synonym substitution, back-translation, word shuffling, and paraphrase generation. While effective at generating diversity in input sentences, these methods lack the ability to provide user control over the preferred forms of input sentences. Careful implementations may be used to generate augmented training data while preserving the underlying meaning of input sentences.

A computer implementing the systems and methods described herein can overcome the aforementioned technical deficiencies. To do so, for example, the computer can identify a plurality of input-output pairs. Each input-output pair can be assigned a probability and include a