US-20260127408-A1 - COMPUTING SYSTEMS AND METHODS FOR AUTOMATICALLY COMPUTING ACCURACY OF A LARGE LANGUAGE MODEL
Abstract
An artificial intelligence computing tool is provided for automatically evaluating an operating large language model (LLM) against a benchmark LLM for integration into an application. The benchmark LLM is used to compute a benchmark question and a benchmark answer per portion of text data from amongst a plurality of portions of text data. The plurality of benchmark questions and the plurality of portions of text data are inputted into the operating LLM to compute a plurality of comparative answers that respectively correspond to the plurality of benchmark questions and respectively correspond to the plurality of portions of text data. Benchmark answers are compared with respective comparative answers to output correctness values. The correctness values associated with the plurality of benchmark questions are used to compute an accuracy score of the operating LLM. In some cases, the operating LLM is smaller than the benchmark LLM.
Inventors
- Marc MAHE
- Dino VITALE
- Behrooz Heshmaty
Assignees
- THE TORONTO-DOMINION BANK
Dates
- Publication Date: 2026-05-07
- Application Date: 2024-11-01
Claims (20)
- 1. A server system for evaluating an operating large language model (LLM), the server system comprising: a memory storing at least a benchmark LLM and the operating LLM, a network interface, and a processor, the processor operably coupled to the memory and the network interface, the processor configured to: obtain a plurality of portions of text data; use the benchmark LLM to compute at least one benchmark question and one benchmark answer per portion of text data from amongst the plurality of portions of text data, and store a plurality of benchmark questions and a plurality of benchmark answers respectively in association with the plurality of portions of text data; input the plurality of benchmark questions and the plurality of portions of text data into the operating LLM to compute a plurality of comparative answers that respectively correspond to the plurality of benchmark questions and respectively correspond to the plurality of portions of text data; for each one of the plurality of benchmark questions, compare a respective benchmark answer from amongst the plurality of benchmark answers and a respective comparative answer from amongst the plurality of comparative answers to output a correctness value; and compute and output an accuracy score of the operating LLM based on a combination of a plurality of correctness values associated with the plurality of benchmark questions.
- 2. The server system of claim 1, wherein the plurality of portions of text data are from a group of documents, and the group of documents is associated with an interactive chat knowledge application.
- 3. The server system of claim 2, wherein, after determining that the accuracy score of the operating LLM is above a threshold score, automatically integrating the operating LLM in the interactive chat knowledge application; and wherein the interactive chat knowledge application comprises: a chatbot user interface, the operating LLM, and a database comprising the group of documents.
- 4. The server system of claim 3, wherein, when the operating LLM has been automatically integrated into the interactive chat knowledge application, the processor is further configured to at least: receive a user-inputted question via the chatbot user interface; process the user-inputted question using the operating LLM to output a response derived from one or more documents from the group of documents; and display, via the chatbot interface, the response and one or more citations corresponding to the one or more documents.
- 5. The server system of claim 2, wherein a plurality of operating LLMs are automatically evaluated against the benchmark LLM, and the processor is further configured to at least: identify a given operating LLM with a highest accuracy score from amongst the plurality of operating LLMs, and automatically integrate the given operating LLM into the interactive chat knowledge application; and wherein the interactive chat knowledge application comprises: a chatbot user interface, the operating LLM, and a database comprising the group of documents.
- 6. The server system of claim 1, wherein the benchmark LLM has a higher number of parameters than the operating LLM.
- 7. The server system of claim 1, wherein a comparator LLM is used to compare the respective benchmark answer from amongst the plurality of benchmark answers and the respective comparative answer from amongst the plurality of comparative answers to output the plurality of correctness values.
- 8. The server system of claim 7, wherein the comparator LLM is the benchmark LLM.
- 9. The server system of claim 7, wherein the comparator LLM is a secondary benchmark LLM that is more accurate than the operating LLM.
- 10. The server system of claim 1, wherein the correctness value is a correct value or an incorrect value, and the accuracy score is computed by: a number of correct values divided by a number of the plurality of benchmark questions.
- 11. A method for evaluating an operating large language model (LLM), the method executed in a computing environment comprising one or more processors and memory, wherein the memory stores at least a benchmark LLM and the operating LLM, and the method comprising: obtaining a plurality of portions of text data; using the benchmark LLM to compute at least one benchmark question and one benchmark answer per portion of text data from amongst the plurality of portions of text data, and storing a plurality of benchmark questions and a plurality of benchmark answers respectively in association with the plurality of portions of text data; inputting the plurality of benchmark questions and the plurality of portions of text data into the operating LLM to compute a plurality of comparative answers that respectively correspond to the plurality of benchmark questions and respectively correspond to the plurality of portions of text data; for each one of the plurality of benchmark questions, comparing a respective benchmark answer from amongst the plurality of benchmark answers and a respective comparative answer from amongst the plurality of comparative answers to output a correctness value; and computing and outputting an accuracy score of the operating LLM based on a combination of a plurality of correctness values associated with the plurality of benchmark questions.
- 12. The method of claim 11, wherein the plurality of portions of text data are from a group of documents, and the group of documents is associated with an interactive chat knowledge application.
- 13. The method of claim 12, wherein, after determining that the accuracy score of the operating LLM is above a threshold score, automatically integrating the operating LLM in the interactive chat knowledge application; and wherein the interactive chat knowledge application comprises: a chatbot user interface, the operating LLM, and a database comprising the group of documents.
- 14. The method of claim 13, wherein, when the operating LLM has been automatically integrated into the interactive chat knowledge application, the method further comprising: receiving a user-inputted question via the chatbot user interface; processing the user-inputted question using the operating LLM to output a response derived from one or more documents from the group of documents; and displaying, via the chatbot interface, the response and one or more citations corresponding to the one or more documents.
- 15. The method of claim 12, wherein a plurality of operating LLMs are automatically evaluated against the benchmark LLM, and the method further comprising: identifying a given operating LLM with a highest accuracy score from amongst the plurality of operating LLMs, and automatically integrating the given operating LLM into the interactive chat knowledge application; and wherein the interactive chat knowledge application comprises: a chatbot user interface, the operating LLM, and a database comprising the group of documents.
- 16. The method of claim 11, wherein the benchmark LLM has a higher number of parameters than the operating LLM.
- 17. The method of claim 11, wherein a comparator LLM is used to compare the respective benchmark answer from amongst the plurality of benchmark answers and the respective comparative answer from amongst the plurality of comparative answers to output the plurality of correctness values.
- 18. The method of claim 17, wherein the comparator LLM is a secondary benchmark LLM that is more accurate than the operating LLM.
- 19. The method of claim 11, wherein the correctness value is a correct value or an incorrect value, and the accuracy score is computed by: a number of correct values divided by a number of the plurality of benchmark questions.
- 20. A non-transitory computer readable medium storing computer executable instructions which, when executed by at least one computer processor, cause the at least one computer processor to carry out a method for evaluating an operating large language model (LLM), the non-transitory computer readable medium further comprising at least a benchmark LLM and the operating LLM, and the method comprising: obtaining a plurality of portions of text data; using the benchmark LLM to compute at least one benchmark question and one benchmark answer per portion of text data from amongst the plurality of portions of text data, and storing a plurality of benchmark questions and a plurality of benchmark answers respectively in association with the plurality of portions of text data; inputting the plurality of benchmark questions and the plurality of portions of text data into the operating LLM to compute a plurality of comparative answers that respectively correspond to the plurality of benchmark questions and respectively correspond to the plurality of portions of text data; for each one of the plurality of benchmark questions, comparing a respective benchmark answer from amongst the plurality of benchmark answers and a respective comparative answer from amongst the plurality of comparative answers to output a correctness value; and computing and outputting an accuracy score of the operating LLM based on a combination of a plurality of correctness values associated with the plurality of benchmark questions.
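Illustrative note (not part of the claims): the accuracy formula recited in claims 10 and 19, the threshold test of claims 3 and 13, and the highest-score selection of claims 5 and 15 might be sketched in Python as follows. The function names, the model names, and the choice to combine the threshold test with the highest-score selection in a single function are assumptions made for this example only; the claims recite them separately.

```python
from typing import Dict, List, Optional


def accuracy_score(correctness_values: List[bool]) -> float:
    """Number of correct values divided by the number of benchmark questions."""
    if not correctness_values:
        raise ValueError("at least one correctness value is required")
    return sum(correctness_values) / len(correctness_values)


def select_operating_llm(
    results: Dict[str, List[bool]], threshold: float = 0.0
) -> Optional[str]:
    """Return the name of the operating LLM with the highest accuracy score.

    `results` maps a hypothetical model name to its per-question correctness
    values. Returns None when no model scores strictly above `threshold`
    (an assumed way of combining the two tests for illustration).
    """
    if not results:
        return None
    scores = {name: accuracy_score(values) for name, values in results.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None


# Hypothetical correctness values for three candidate operating LLMs.
results = {
    "small-a": [True, False, True, True],
    "small-b": [True, True, True, True],
    "small-c": [False, False, False, False],
}
chosen = select_operating_llm(results, threshold=0.8)
print(chosen)
```

With these made-up values, `small-b` is the only candidate above the 0.8 threshold, so it is the one that would be handed to whatever integration step follows.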
Description
TECHNICAL FIELD
The disclosed exemplary embodiments relate to computer-implemented systems and methods for automatically evaluating accuracies of large language models (LLMs).
BACKGROUND
Large Language Models (LLMs) are becoming more commonly used for interactive chatbots. It is recognized that there are many different types of LLMs. Some LLMs require more computational resources (e.g., processing time, processing capability, and memory), while some LLMs require fewer computational resources. In some cases, smaller LLMs that require fewer computational resources are less accurate than larger LLMs that require more computational resources. Smaller LLMs are sometimes desired, but may come with the trade-off of lower accuracy.
SUMMARY
The following summary is intended to introduce the reader to various aspects of the detailed description, but not to define or delimit any invention.
In at least one broad aspect, there is provided a server system for evaluating an operating large language model (LLM). The server system comprises: a memory storing at least a benchmark LLM and the operating LLM, a network interface, and a processor. The processor is operably coupled to the memory and the network interface.
The processor is configured to at least: obtain a plurality of portions of text data; use the benchmark LLM to compute at least one benchmark question and one benchmark answer per portion of text data from amongst the plurality of portions of text data, and store a plurality of benchmark questions and a plurality of benchmark answers respectively in association with the plurality of portions of text data; input the plurality of benchmark questions and the plurality of portions of text data into the operating LLM to compute a plurality of comparative answers that respectively correspond to the plurality of benchmark questions and respectively correspond to the plurality of portions of text data; for each one of the plurality of benchmark questions, compare a respective benchmark answer from amongst the plurality of benchmark answers and a respective comparative answer from amongst the plurality of comparative answers to output a correctness value; and compute and output an accuracy score of the operating LLM based on a combination of a plurality of correctness values associated with the plurality of benchmark questions. In some cases, the plurality of portions of text data are from a group of documents, and the group of documents is associated with an interactive chat knowledge application. In some cases, after determining that the accuracy score of the operating LLM is above a threshold score, the processor automatically integrates the operating LLM into the interactive chat knowledge application. The interactive chat knowledge application comprises: a chatbot user interface, the operating LLM, and a database comprising the group of documents.
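The processor operations described above can be pictured with a minimal Python sketch. It assumes, purely for illustration, that each LLM is represented by a plain callable; the type aliases, `BenchmarkItem`, `build_benchmark`, `evaluate`, and the toy stand-in functions are hypothetical names that do not appear in the disclosure, and the exact-match comparator is a simplification of whatever comparison an implementation would use.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical stand-ins: each LLM is modeled as a plain callable.
BenchmarkLLM = Callable[[str], Tuple[str, str]]  # portion -> (question, answer)
OperatingLLM = Callable[[str, str], str]         # (question, portion) -> answer
Comparator = Callable[[str, str], bool]          # (benchmark, comparative) -> correct?


@dataclass
class BenchmarkItem:
    """A benchmark question and answer stored with its portion of text data."""
    portion: str
    question: str
    benchmark_answer: str


def build_benchmark(portions: List[str], benchmark_llm: BenchmarkLLM) -> List[BenchmarkItem]:
    """Compute one benchmark question and answer per portion of text data."""
    items = []
    for portion in portions:
        question, answer = benchmark_llm(portion)
        items.append(BenchmarkItem(portion, question, answer))
    return items


def evaluate(items: List[BenchmarkItem], operating_llm: OperatingLLM,
             comparator: Comparator) -> float:
    """Return correct answers divided by the number of benchmark questions."""
    if not items:
        raise ValueError("at least one benchmark question is required")
    correct = 0
    for item in items:
        # The operating LLM receives both the question and the portion.
        comparative_answer = operating_llm(item.question, item.portion)
        if comparator(item.benchmark_answer, comparative_answer):
            correct += 1
    return correct / len(items)


# Toy stand-ins so the sketch runs without any real model.
def toy_benchmark_llm(portion: str) -> Tuple[str, str]:
    return ("What is the first word of the passage?", portion.split()[0])


def toy_operating_llm(question: str, portion: str) -> str:
    # Deliberately wrong on passages that start with "Gamma".
    first = portion.split()[0]
    return "unknown" if first == "Gamma" else first.lower()


def toy_comparator(benchmark_answer: str, comparative_answer: str) -> bool:
    # Simplification: case-insensitive exact match instead of an LLM judge.
    return benchmark_answer.strip().lower() == comparative_answer.strip().lower()


portions = ["Alpha beta", "Gamma delta", "Epsilon zeta", "Eta theta"]
items = build_benchmark(portions, toy_benchmark_llm)
accuracy = evaluate(items, toy_operating_llm, toy_comparator)
print(accuracy)
```

In this toy run the operating stand-in answers three of the four benchmark questions correctly, giving an accuracy score of 0.75. Passing the comparator in as a callable leaves room for an implementation to supply an LLM-based comparison instead of the string match used here.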
In some cases, when the operating LLM has been automatically integrated into the interactive chat knowledge application, the processor is further configured to at least: receive a user-inputted question via the chatbot interface; process the user-inputted question using the operating LLM to output a response derived from one or more documents from the group of documents; and display, via the chatbot interface, the response and one or more citations corresponding to the one or more documents. In some cases, a plurality of operating LLMs are automatically evaluated against the benchmark LLM, and the processor is further configured to at least: identify a given operating LLM with a highest accuracy score from amongst the plurality of operating LLMs, and automatically integrate the given operating LLM into the interactive chat knowledge application. The interactive chat knowledge application comprises: a chatbot user interface, the operating LLM, and a database comprising the group of documents. In some cases, the benchmark LLM is larger than the operating LLM. In some cases, a comparator LLM is used to compare the respective benchmark answer from amongst the plurality of benchmark answers and the respective comparative answer from amongst the plurality of comparative answers to output the plurality of correctness values. In some cases, the comparator LLM is the benchmark LLM. In some cases, the comparator LLM is a secondary benchmark LLM that is more accurate than the operating LLM. In some cases, the correctness value is one of a correct value or an incorrect value, and the accuracy score is computed by: a number of correct values divided by a number of the plurality of benchmark questions.
In at least another broad aspect, a method for evaluating an operating large language model (LLM) is provided. The method is executed in a computing environment comprising one or more processors and memory, wherein the memory stores at least a benchmark LLM and the operating LLM.
The method comprising: obtaining a plurality of portions of text data; usi