US-20260127191-A1 - UNSUPERVISED VALIDATION FRAMEWORK FOR LARGE LANGUAGE MODEL (LLM) OUTPUTS
Abstract
The disclosure relates to a system and method for unsupervised validation of Large Language Model (LLM)-generated outputs. The method includes the steps of receiving input data, which may comprise structured or unstructured text; generating an LLM output based on the input data; extracting a first set of topics from the input data, with each topic represented by a set of keywords; converting the first set of topics into human-readable reference data; extracting a second set of topics from the LLM output; and converting the second set of topics into human-readable candidate data. The method further includes comparing the reference data with the candidate data using one or more performance metrics. Finally, the method determines a validation score based on the comparison results. The disclosed method allows for the objective validation of LLM-generated content, enhancing accuracy and efficiency in diverse applications such as customer service, technical documentation, and legal analysis.
Inventors
- Waad Subber
- Ankit Singh
- Eric Hartye
Assignees
- HONEYWELL INTERNATIONAL INC.
Dates
- Publication Date: 2026-05-07
- Application Date: 2024-11-06
Claims (20)
- 1. A method for unsupervised validation of a Large Language Model (LLM) generated output, comprising the steps of: receiving, by a generative Artificial Intelligence (AI) system, input data representing a set of information to be processed by a primary Large Language Model (LLM1); generating LLM output by the LLM1 based on the input data; extracting a first set of topics from the input data, wherein each topic in the first set of topics is represented by a set of keywords; converting the first set of topics into human-readable data by a second Large Language Model (LLM2) to generate reference data representing key topics from the first set of topics of the input data; extracting a second set of topics from the generated LLM output; converting the second set of topics into human-readable data by the LLM2 to generate candidate data representing key topics from the second set of topics of the generated LLM output; comparing the reference data with the candidate data using one or more performance metrics; determining a validation score based on the comparison of the reference data and the candidate data; and automatically performing a control action comprising: releasing the generated LLM output when the validation score meets or exceeds a predefined threshold, and routing the generated LLM output for manual review when the validation score is below the predefined threshold.
- 2. The method as claimed in claim 1, wherein the one or more performance metrics comprise at least one of cosine similarity metric, precision metric, recall metric, and F1 score metric.
- 3. The method as claimed in claim 1, wherein the extracting step applies a topic modeling algorithm selected from the group consisting of Latent Dirichlet Allocation (LDA) algorithm, Non-negative Matrix Factorization (NMF) algorithm, and Latent Semantic Analysis (LSA) algorithm.
- 4. The method as claimed in claim 1, wherein extracting the first set of topics comprises determining an optimal number of the first set of topics based on computation of a coherence score.
- 5. The method as claimed in claim 1, further comprising selecting an LLM from a plurality of LLMs for output generation based on the validation score.
- 6. The method as claimed in claim 1, further comprising generating a confidence score based on the validation score, wherein the confidence score reflects probability of the LLM-generated output meeting a predefined quality standard.
- 7. The method as claimed in claim 1, wherein the input data comprises structured or unstructured text data, and the generated LLM output comprises a summary, a letter, or a report derived from the input data.
- 8. A generative Artificial Intelligence (AI) system for unsupervised validation of a large language model (LLM) generated output, the system comprising: a memory for storing input data representing a set of information to be processed by the LLM; a processor configured to: receive input data by a primary Large Language Model (LLM1); generate LLM output by the LLM1 based on the input data; extract, by an extraction module, a first set of topics from the input data, wherein each topic in the first set of topics is represented by a set of keywords; convert the first set of topics into human-readable data by a second Large Language Model (LLM2) to generate reference data representing key topics from the first set of topics of the input data; extract, by the extraction module, a second set of topics from the generated LLM output; convert the second set of topics into human-readable data by the LLM2 to generate candidate data representing key topics from the second set of topics of the generated LLM output; compare the reference data with the candidate data using one or more performance metrics; determine a validation score based on the comparison of the reference data and the candidate data; and automatically perform a control action comprising: releasing the generated LLM output when the validation score meets or exceeds a predefined threshold, and routing the generated LLM output for manual review when the validation score is below the predefined threshold.
- 9. The system as claimed in claim 8, wherein the one or more performance metrics comprise at least one of cosine similarity metric, precision metric, recall metric, and F1 score metric.
- 10. The system as claimed in claim 8, wherein the system is configured to apply a topic modeling algorithm selected from the group consisting of Latent Dirichlet Allocation (LDA) algorithm, Non-negative Matrix Factorization (NMF) algorithm, and Latent Semantic Analysis (LSA) algorithm.
- 11. The system as claimed in claim 8, wherein the system is further configured to determine an optimal number of the first set of topics based on computation of a coherence score.
- 12. The system as claimed in claim 8, wherein the system is further configured to select an LLM from a plurality of LLMs for output generation based on the validation score.
- 13. The system as claimed in claim 8, wherein the system is further configured to generate a confidence score based on the validation score, wherein the confidence score reflects probability of the LLM-generated output meeting a predefined quality standard.
- 14. The system as claimed in claim 8, wherein the input data comprises structured or unstructured text data, and the generated LLM output comprises a summary, a letter, or a report derived from the input data.
- 15. A non-transitory computer-readable medium having stored thereon computer-readable instructions that, when executed by a processor, cause the processor to execute a method for unsupervised validation of a Large Language Model (LLM) generated output, comprising the steps of: receiving, by a generative Artificial Intelligence (AI) system, input data representing a set of information to be processed by a primary Large Language Model (LLM1); generating LLM output by the LLM1 based on the input data; extracting a first set of topics from the input data, wherein each topic in the first set of topics is represented by a set of keywords; converting the first set of topics into human-readable data by a second Large Language Model (LLM2) to generate reference data representing key topics from the first set of topics of the input data; extracting a second set of topics from the generated LLM output; converting the second set of topics into human-readable data by the LLM2 to generate candidate data representing key topics from the second set of topics of the generated LLM output; comparing the reference data with the candidate data using one or more performance metrics; determining a validation score based on the comparison of the reference data and the candidate data; and automatically performing a control action comprising: releasing the generated LLM output when the validation score meets or exceeds a predefined threshold, and routing the generated LLM output for manual review when the validation score is below the predefined threshold.
- 16. The non-transitory computer-readable medium as claimed in claim 15, wherein the one or more performance metrics comprise at least one of cosine similarity metric, precision metric, recall metric, and F1 score metric.
- 17. The non-transitory computer-readable medium as claimed in claim 15, wherein the extracting step applies a topic modeling algorithm selected from the group consisting of Latent Dirichlet Allocation (LDA) algorithm, Non-negative Matrix Factorization (NMF) algorithm, and Latent Semantic Analysis (LSA) algorithm.
- 18. The non-transitory computer-readable medium as claimed in claim 15, wherein extracting the first set of topics comprises determining an optimal number of the first set of topics based on computation of a coherence score.
- 19. The non-transitory computer-readable medium as claimed in claim 15, further comprising selecting an LLM from a plurality of LLMs for output generation based on the validation score.
- 20. The non-transitory computer-readable medium as claimed in claim 15, further comprising generating a confidence score based on the validation score, wherein the confidence score reflects probability of the LLM-generated output meeting a predefined quality standard.
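The validation loop recited in claim 1 can be sketched in ordinary Python. This is an illustrative sketch only, not the claimed implementation: topic extraction is stubbed with top-frequency keywords rather than the LDA, NMF, or LSA algorithms of claim 3, the LLM1/LLM2 stages are elided, only the cosine-similarity metric of claim 2 is computed, and all function names are invented for illustration.

```python
import math
import re
from collections import Counter

def extract_topics(text, k=10):
    # Stub for the extraction module: the top-k frequent terms stand in
    # for the keyword sets an LDA/NMF/LSA topic model would produce.
    words = re.findall(r"[a-z]+", text.lower())
    stop = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "for"}
    counts = Counter(w for w in words if w not in stop)
    return [w for w, _ in counts.most_common(k)]

def cosine_similarity(ref_keywords, cand_keywords):
    # Cosine similarity between bag-of-words vectors built from the
    # reference and candidate keyword lists (one metric from claim 2).
    a, b = Counter(ref_keywords), Counter(cand_keywords)
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def validate(input_data, llm_output, threshold=0.5):
    # Reference data is derived from the input; candidate data from the
    # generated output. The control action gates on the threshold.
    reference = extract_topics(input_data)
    candidate = extract_topics(llm_output)
    score = cosine_similarity(reference, candidate)
    action = "release" if score >= threshold else "manual_review"
    return score, action
```

A caller would invoke `validate(complaint_record, closing_letter)` and act on the returned pair; in the claimed system the keyword lists would additionally pass through LLM2 to become human-readable reference and candidate data before comparison.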
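Claims 4, 11, and 18 determine the optimal number of topics by computing a coherence score, but do not name a particular measure. One common choice (an assumption here, not taken from the disclosure) is UMass coherence, which rewards topic keywords that co-occur in the same documents; a minimal sketch:

```python
import math
from itertools import combinations

def umass_coherence(topic_keywords, documents):
    # UMass coherence: average over keyword pairs (wi, wj) of
    # log((D(wi, wj) + 1) / D(wj)), where D counts the documents
    # containing the given word(s). Higher means more coherent.
    docs = [set(d.lower().split()) for d in documents]
    def d(*words):
        return sum(1 for doc in docs if all(w in doc for w in words))
    pairs = list(combinations(topic_keywords, 2))
    total = 0.0
    for wi, wj in pairs:
        denom = d(wj)
        if denom:
            total += math.log((d(wi, wj) + 1) / denom)
    return total / len(pairs) if pairs else 0.0

def best_topic_count(candidates, documents):
    # candidates maps a topic count to its list of topic keyword lists;
    # pick the count whose topics score highest on average coherence.
    def avg(topics):
        return sum(umass_coherence(t, documents) for t in topics) / len(topics)
    return max(candidates, key=lambda k: avg(candidates[k]))
```

In practice the candidate topic sets would come from fitting the claim-3 topic model at several topic counts and scoring each fit.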
Description
TECHNICAL FIELD

The present disclosure relates to systems and methods for validating the outputs of large language models (LLMs). More specifically, the present disclosure pertains to an unsupervised validation framework that uses topic modeling and performance metrics to evaluate the quality and coherence of LLM-generated content.

BACKGROUND

In recent years, advancements in artificial intelligence (AI) have led to the development of powerful generative models, such as Large Language Models (LLMs). These models leverage machine learning techniques to generate human-like text by analyzing and learning from vast amounts of data. By identifying patterns and relationships within the data, the LLMs can produce coherent and contextually relevant text outputs across a wide range of tasks. Examples of tasks where LLMs have shown remarkable success include document summarization, dialogue generation, content creation, machine translation, and creative writing. Despite the potential of generative AI models, particularly LLMs, evaluating their performance remains a significant challenge. In traditional machine learning systems, performance evaluation typically involves splitting the dataset into training, validation, and testing sets. The model's performance is assessed using known ground-truth labels, allowing for objective metrics such as accuracy, precision, recall, and F1 scores to be calculated. However, generative models like LLMs pose a unique challenge because their outputs are often subjective and diverse. Unlike classification or regression models, the outputs of generative models are open-ended, meaning there may be no single “correct” response. This inherent variability complicates the task of defining clear-cut evaluation metrics. Historically, the performance of generative models has been assessed through a combination of automated metrics (e.g., BLEU, ROUGE, or METEOR scores) and human evaluation.
Human reviewers assess the relevance, coherence, fluency, and creativity of the generated text in comparison to human-written reference texts. However, these evaluation methods have limitations. Automated metrics, while useful, often fail to capture the nuances of human language, and they can be poor proxies for the true quality of the generated content. Human evaluation, on the other hand, is time-consuming, subjective, and resource-intensive, making it impractical for large-scale or real-time applications. The issue becomes even more pronounced in scenarios where there is a lack of pre-existing human-generated reference data. In many real-world applications, especially those involving novel tasks or domains, historical records or human annotations are not always available. One such example arises in the context of complaint management systems, where organizations may be required to generate closing letters summarizing customer complaints and the corresponding resolution. In many cases, such closing letters are either non-existent or vary significantly from case to case, which complicates the process of evaluating the quality of the letters generated by LLMs. Furthermore, the reliance on human-generated reference texts or manual evaluation in such scenarios can lead to delays, inconsistencies, and subjective judgments. To address these challenges, there is a need for more efficient, reliable, and objective methods to evaluate the outputs of LLMs, particularly in the absence of human reference data. The need is particularly acute in scenarios where human resources are limited, or where reference data is unavailable for new or evolving tasks. Additionally, the subjective nature of human evaluation introduces inconsistencies and variability into the assessment process, further emphasizing the need for automated or unsupervised methods of validation. 
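The precision, recall, and F1 scores discussed above can be repurposed for reference-free validation by computing them over keyword overlap between a reference topic set and a candidate topic set. The following formulation is a sketch of that idea (an assumption for illustration, not the disclosure's exact method):

```python
def keyword_prf(reference, candidate):
    # Treat the reference keywords as ground truth and the candidate
    # keywords as predictions; score the overlap between the two sets.
    ref, cand = set(reference), set(candidate)
    tp = len(ref & cand)                       # keywords found in both
    precision = tp / len(cand) if cand else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For example, a candidate set of three keywords sharing two members with a four-keyword reference set yields precision 2/3 and recall 1/2, so the F1 score penalizes both missed and spurious topics.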
SUMMARY

The present disclosure seeks to resolve the above-mentioned challenges by introducing a novel unsupervised validation framework for LLM-generated outputs. Unlike traditional validation techniques that rely on the availability of human-generated reference texts, the proposed framework operates independently of such data. By employing topic modeling and performance metrics such as cosine similarity, precision, recall, and F1 score, the framework allows for the evaluation of LLM outputs in a robust and objective manner. This unsupervised approach is particularly well-suited to use cases such as the generation of closing letters in complaint management systems, where reference data may be unavailable or subjective human evaluation is impractical. The proposed validation framework operates by extracting key topics from the input dataset (e.g., a complaint record) and generating a corresponding human-readable text using a validation LLM. These extracted topics serve as a reference for comparison with the generated output, such as the closing letter. Performance metrics are computed by comparing the topics in the reference text and the generated text, providing a qu