BR-102025012368-A2 - A processor-implemented method, a system, and one or more non-transitory machine-readable storage media.
Abstract
The embodiments of the present disclosure address unresolved problems in evaluating the response quality of Large Language Models (LLMs). Existing approaches to LLM evaluation and response assessment fall broadly into automated evaluation metrics, human evaluation, and adversarial testing. The embodiments presented here provide a method and a system for dynamic, weighted selection of performance metrics to generate an LLM response quality score. Furthermore, the system is configured to perform an LLM maturity gap analysis and produce associated recommendations for improving the LLM response quality score. Finally, the system generates a certificate of conformity for each model version whose score meets a threshold, and generates an NFT through a smart contract on a blockchain, using metadata associated with the model together with the evaluation metrics and results.
Inventors
- SALIM HAMSA
- Jayantrao Mohite
- DINESHKUMAR JANG BAHADUR SINGH
- NANDAN SINGH RAJPOOT
- Ajay Mittal
- ASHITA ATULKUMAR PATEL
- Srinivasu Pappula
Assignees
- TATA CONSULTANCY SERVICES LIMITED
Dates
- Publication Date
- 20260310
- Application Date
- 20250617
- Priority Date
- 20240617
Claims (15)
- 1. A processor-implemented method (300), characterized by comprising: receiving (302), via an input/output (I/O) interface, at least one task, metadata associated with a large language model (LLM), an input prompt provided by a user, and an output associated with the LLM input prompt; determining (304), via one or more hardware processors, a plurality of task contexts corresponding to the output obtained from the LLM using a contextual task analysis module, wherein the plurality of task contexts is determined based on information available in the at least one task and in a predefined domain knowledge base; retrieving (306), via one or more hardware processors, a set of evaluation metrics associated with the determined plurality of task contexts from a predefined task-metrics knowledge graph database; training (308), via one or more hardware processors, a machine learning (ML) model based on the at least one received task, the determined plurality of task contexts, and the retrieved set of evaluation metrics to estimate a weight for each metric in the retrieved set; dynamically selecting (310), via one or more hardware processors, one or more evaluation metrics from the retrieved set of evaluation metrics based on the plurality of task contexts and one or more user preferences obtained from the received input prompt, using a predefined ensemble technique and a semantic analysis of the plurality of task contexts; assigning (312), via one or more hardware processors, the estimated weight to each of the one or more dynamically selected evaluation metrics using the trained ML model; aggregating (314), via one or more hardware processors, results of the one or more dynamically selected evaluation metrics based on the assigned weights using a context performance evaluation model (CPEM); calculating (316), via one or more hardware processors, an LLM response quality score for each of the plurality of task contexts by computing the one or more selected evaluation metrics and the aggregated results of the one or more dynamically selected evaluation metrics; identifying (318), via one or more hardware processors, a maturity gap for each of the plurality of task contexts by comparing the calculated LLM response quality score with a predefined expected response score to detect one or more problems in the LLM output using a data quality analysis, a contextual analysis, and a question analysis technique; performing (320), via one or more hardware processors, a root cause analysis using a decision tree-based technique to identify the root cause of the identified maturity gap for each of the plurality of task contexts and of the one or more problems detected in the LLM output; evaluating (322), via one or more hardware processors, a potential impact of the identified maturity gap for each of the plurality of task contexts and of the one or more problems detected in the LLM output, to address the potential impact of the detected problems and the maturity gap; and recursively monitoring (324), via one or more hardware processors, the maturity gap identified for each of the plurality of task contexts and the one or more problems detected in the output obtained from the LLM, to recommend improvements to the LLM response quality score.
- 2. The processor-implemented method (300) of claim 1, characterized in that a rule-based technique is used to perform the root cause analysis to detect the root cause of the maturity gap identified for each of the plurality of task contexts and of the one or more problems detected in the output obtained from the LLM.
- 3. The processor-implemented method (300) of claim 1, characterized in that a certificate of conformity is generated based on a predefined threshold LLM response quality score.
- 4. The processor-implemented method (300) of claim 1, characterized in that a non-fungible token (NFT) is generated to represent the generated certificate of conformity and to integrate the generated certificate of conformity into a smart contract.
- 5. The processor-implemented method (300) of claim 1, characterized in that a rule-based technique considers the at least one task, the plurality of task contexts, and the LLM response quality score associated with each of the determined evaluation metrics to detect the one or more problems.
- 6. A system (100) characterized by comprising: an input/output interface (104) for receiving at least one task, metadata associated with a large language model (LLM), an input prompt provided by a user, and an output associated with the LLM input prompt; one or more hardware processors (108); and a memory (110) in communication with the one or more hardware processors (108), wherein the one or more hardware processors (108) are configured to execute programmed instructions stored in the memory (110) to: determine a plurality of task contexts corresponding to the output obtained from the LLM using a contextual task analysis module, wherein the plurality of task contexts is determined based on information available in the at least one task and in a predefined domain knowledge base; retrieve a set of evaluation metrics associated with the determined plurality of task contexts from a predefined task-metrics knowledge graph database; train a machine learning (ML) model based on the at least one received task, the determined plurality of task contexts, and the retrieved set of evaluation metrics to estimate a weight for each metric in the retrieved set; dynamically select one or more evaluation metrics from the retrieved set of evaluation metrics based on the plurality of task contexts and one or more user preferences obtained from the received input prompt, using a predefined ensemble technique and a semantic analysis of the plurality of task contexts; assign the estimated weight to each of the one or more dynamically selected evaluation metrics using the trained ML model; aggregate results of the one or more dynamically selected evaluation metrics based on the assigned weights using a context performance evaluation model (CPEM); calculate an LLM response quality score for each of the plurality of task contexts by computing the one or more selected evaluation metrics and the aggregated results of the one or more dynamically selected evaluation metrics; identify a maturity gap for each of the plurality of task contexts by comparing the calculated LLM response quality score with a predefined expected response score to detect one or more problems in the LLM output using a data quality analysis, a contextual analysis, and a question analysis technique; perform a root cause analysis using a decision tree-based technique to identify the root cause of the identified maturity gap for each of the plurality of task contexts and of the one or more problems detected in the LLM output; evaluate a potential impact of the identified maturity gap for each of the plurality of task contexts and of the one or more problems detected in the LLM output, to address the potential impact of the detected problems and the maturity gap; and recursively monitor the maturity gap identified for each of the plurality of task contexts and the one or more problems detected in the output obtained from the LLM, to recommend improvements to the LLM response quality score.
- 7. The system (100) of claim 6, characterized in that a rule-based technique is used to perform the root cause analysis to detect the root cause of the maturity gap identified for each of the plurality of task contexts and of the one or more problems detected in the output obtained from the LLM.
- 8. The system (100) of claim 6, characterized in that a certificate of conformity is generated based on a predefined threshold LLM response quality score.
- 9. The system (100) of claim 6, characterized in that a non-fungible token (NFT) is generated to represent the generated certificate of conformity and to integrate the generated certificate of conformity into a smart contract.
- 10. The system (100) of claim 6, characterized in that a rule-based technique considers the at least one task, the plurality of task contexts, and the LLM response quality score associated with each of the determined evaluation metrics to detect the one or more problems.
- 11. One or more non-transitory machine-readable storage media, characterized by comprising one or more instructions that, when executed by one or more hardware processors, cause: receiving, via an input/output (I/O) interface, at least one task, metadata associated with a large language model (LLM), an input prompt provided by a user, and an output associated with the LLM input prompt; determining a plurality of task contexts corresponding to the output obtained from the LLM using a contextual task analysis module, wherein the plurality of task contexts is determined based on information available in the at least one task and in a predefined domain knowledge base; retrieving a set of evaluation metrics associated with the determined plurality of task contexts from a predefined task-metrics knowledge graph database; training a machine learning (ML) model based on the at least one received task, the determined plurality of task contexts, and the retrieved set of evaluation metrics to estimate a weight for each metric in the retrieved set; dynamically selecting one or more evaluation metrics from the retrieved set of evaluation metrics based on the plurality of task contexts and one or more user preferences obtained from the received input prompt, using a predefined ensemble technique and a semantic analysis of the plurality of task contexts; assigning the estimated weight to each of the one or more dynamically selected evaluation metrics using the trained ML model; aggregating results of the one or more dynamically selected evaluation metrics based on the assigned weights using a context performance evaluation model (CPEM); calculating an LLM response quality score for each of the plurality of task contexts by computing the one or more selected evaluation metrics and the aggregated results of the one or more dynamically selected evaluation metrics; identifying a maturity gap for each of the plurality of task contexts by comparing the calculated LLM response quality score with a predefined expected response score to detect one or more problems in the output obtained from the LLM using a data quality analysis, a contextual analysis, and a question analysis technique; performing a root cause analysis using a decision tree-based technique to identify the root cause of the maturity gap identified for each of the plurality of task contexts and of the one or more problems detected in the output obtained from the LLM; evaluating a potential impact of the maturity gap identified for each of the plurality of task contexts and of the one or more problems detected in the output obtained from the LLM, to address the potential impact of the detected problems and the maturity gap; and recursively monitoring the maturity gap identified for each of the plurality of task contexts and the one or more problems detected in the output obtained from the LLM, to recommend improvements to the LLM response quality score.
- 12. The one or more non-transitory machine-readable storage media of claim 11, characterized in that a rule-based technique is used to perform the root cause analysis to detect the root cause of the maturity gap identified for each of the plurality of task contexts and of the one or more problems detected in the output obtained from the LLM.
- 13. The one or more non-transitory machine-readable storage media of claim 11, characterized in that a certificate of conformity is generated based on a predefined threshold LLM response quality score.
- 14. The one or more non-transitory machine-readable storage media of claim 11, characterized in that a non-fungible token (NFT) is generated to represent the generated certificate of conformity and to integrate the generated certificate of conformity into a smart contract.
- 15. The one or more non-transitory machine-readable storage media of claim 11, characterized in that a rule-based technique considers the at least one task, the plurality of task contexts, and the LLM response quality score associated with each of the determined evaluation metrics to detect the one or more problems.
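For illustration only, the certificate-of-conformity and tokenization steps recited in claims 3-4, 8-9, and 13-14 might be sketched as follows. The certificate fields, the example threshold value, and the use of a SHA-256 digest as a stand-in token identifier are assumptions of this sketch, not details from the claims; an actual deployment would mint the NFT through a real smart contract on a blockchain rather than merely computing a digest.

```python
import hashlib
import json


def issue_certificate(model_metadata: dict, quality_score: float,
                      threshold: float = 0.8):
    """Generate a certificate of conformity only when the LLM response
    quality score meets a predefined threshold (hypothetical fields)."""
    if quality_score < threshold:
        return None  # no certificate below the threshold score
    return {
        "model": model_metadata.get("name"),
        "version": model_metadata.get("version"),
        "score": quality_score,
        "threshold": threshold,
    }


def mint_token_id(certificate: dict) -> str:
    """Derive a deterministic identifier standing in for the NFT that
    would represent the certificate; a real system would pass such a
    digest to a smart contract instead of stopping here."""
    payload = json.dumps(certificate, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


cert = issue_certificate({"name": "demo-llm", "version": "1.2"}, 0.91)
token = mint_token_id(cert)
```

Binding the certificate's serialized metadata into the token identifier keeps the on-chain record verifiable against the off-chain evaluation results, which is one plausible reading of the claimed smart-contract integration.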
Description
Cross-Reference to Related Applications and Priority. [001] This application claims priority from Indian application no. 202421046440, filed on June 17, 2024. Field of the Invention [002] The present disclosure generally relates to the field of Large Language Model (LLM) evaluation and, more particularly, to a method and system for evaluating and tokenizing LLMs based on dynamically weighted metrics. Background of the Invention [003] Large Language Models (LLMs) generate responses using large-scale neural network architectures trained on vast amounts of text data. These models, such as OpenAI's Generative Pre-trained Transformer (GPT) series or Google's Bidirectional Encoder Representations from Transformers (BERT), employ techniques such as self-attention mechanisms and transformer architectures to understand and generate human-like textual responses. However, despite their impressive capabilities, LLMs face several challenges in generating responses: I. Lack of contextual understanding: LLMs may have difficulty understanding the contextual nuances of a given prompt or query, leading to irrelevant or semantically inconsistent responses. II. Bias and fairness: LLMs may exhibit biases present in their training data, leading to biased or unfair responses. III. Robustness and adversarial attacks: LLMs are vulnerable to adversarial attacks, where small modifications to the input can result in drastically different or undesirable outputs. This vulnerability raises concerns about the robustness and reliability of the responses generated by LLMs. [004] Given these challenges, there is an urgent need for robust validation of LLM responses and models. Validation of LLM responses involves assessing the quality, relevance, and ethical implications of the generated text. Validation ensures that LLMs produce accurate, consistent, and unbiased responses, aligned with user expectations and ethical standards.
Validation of the LLM model encompasses assessment of the overall performance, generalization capabilities, and adherence to ethical guidelines of the underlying models. Model validation helps identify weaknesses, biases, or vulnerabilities in LLMs and guides improvements that increase their reliability, fairness, and dependability. Brief Description of the Invention [005] The embodiments of the present disclosure offer technological improvements as solutions to one or more of the technical problems mentioned above, recognized by the inventors in conventional systems. For example, in one embodiment, a method is provided for evaluating and tokenizing Large Language Models (LLMs) based on dynamically weighted metrics. The processor-implemented method includes receiving, via an Input/Output (I/O) interface, at least one task, metadata associated with a Large Language Model (LLM), an input prompt provided by a user, and an output associated with the LLM's input prompt, and determining a plurality of task contexts corresponding to the output obtained from the LLM using a contextual task analysis module. The plurality of task contexts is determined based on the information available in the at least one task and in a predefined domain knowledge base. [006] Furthermore, the processor-implemented method includes retrieving a set of evaluation metrics associated with the determined plurality of task contexts from a predefined task-metrics knowledge graph database, and training a machine learning (ML) model based on the at least one received task, the determined plurality of task contexts, and the retrieved set of evaluation metrics, in order to estimate a weight for each evaluation metric in that set.
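The weight-estimation step of paragraph [006] can be illustrated with a minimal sketch, assuming the task-metrics knowledge graph has been reduced to per-metric relevance scores for a given task context and the "training" is a simple normalization of those scores. The metric names and relevance values below are hypothetical; the disclosure leaves the actual ML model architecture open, so this is an illustration of the idea, not the claimed implementation.

```python
def estimate_weights(relevance: dict) -> dict:
    """Turn per-metric relevance scores (e.g., co-occurrence counts
    drawn from a task-metrics knowledge graph) into weights that sum
    to 1, one weight per evaluation metric."""
    total = sum(relevance.values())
    if total == 0:
        # fall back to uniform weights when nothing is known yet
        n = len(relevance)
        return {metric: 1.0 / n for metric in relevance}
    return {metric: value / total for metric, value in relevance.items()}


# Hypothetical relevance scores for a "summarization" task context.
weights = estimate_weights({"rouge_l": 6.0, "faithfulness": 3.0, "fluency": 1.0})
```

Normalizing to a unit sum keeps the downstream weighted aggregation on a fixed scale regardless of how many metrics the dynamic selection step retains.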
Moreover, the processor-implemented method includes dynamically selecting one or more evaluation metrics from the retrieved set of evaluation metrics based on the plurality of task contexts and one or more user preferences obtained from the received input prompt, using a predefined ensemble technique and a semantic analysis of the plurality of task contexts, and assigning the estimated weight to each of the dynamically selected evaluation metrics using the trained ML model. [007] In addition, the processor-implemented method includes aggregating results from the one or more dynamically selected evaluation metrics based on the assigned weights using a context performance evaluation model (CPEM), calculating an LLM response quality score for each of the plurality of task contexts by computing the one or more selected evaluation metrics and the aggregated results of the one or more dynamically selected evaluation metrics, and identifying a maturity gap for each of the plurality of task contexts by comparing the calculated LLM response quality score with a predefined expected response score to detect one or more problems in the output obtained from the LLM using a data quality analysis, a contextual analysis, and a question analysis technique. [008] In addition, the processor-implemented method
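The aggregation and maturity-gap steps of paragraph [007] admit a simple sketch, assuming the CPEM reduces to a plain weighted sum and the gap is the shortfall against the expected score. The metric values, weights, and expected score below are hypothetical, and the disclosed CPEM may aggregate differently; this only illustrates the arithmetic.

```python
def aggregate_score(metric_values: dict, weights: dict) -> float:
    """Weighted aggregation of the selected metrics' results, in the
    spirit of the context performance evaluation model (CPEM)."""
    return sum(weights[metric] * value
               for metric, value in metric_values.items())


def maturity_gap(score: float, expected: float) -> float:
    """A positive gap means the LLM response quality score falls
    short of the predefined expected response score."""
    return max(0.0, expected - score)


# Hypothetical per-context metric results and trained weights.
score = aggregate_score({"relevance": 0.9, "coherence": 0.7},
                        {"relevance": 0.6, "coherence": 0.4})
gap = maturity_gap(score, expected=0.9)
```

A nonzero gap would then feed the claimed root cause analysis and recursive monitoring steps, which target the contexts where the score trails expectations.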