
KR-20260063134-A - METHOD FOR EVALUATING A RESPONSE OF QUESTION ANSWERING USING TABULAR-DATA-BASED RETRIEVAL-AUGMENTED GENERATION WITH A LARGE LANGUAGE MODEL, AND DEVICE USING THE SAME

KR 20260063134 A

Abstract

A method and device for evaluating a response are presented. The response evaluation method comprises the steps of: transmitting a second prompt to a second large language model to evaluate a response that a first large language model-based question answering device generated using a first prompt, and outputting an error type classified from an error in the response; recording a change start checkpoint and a change end checkpoint in the second prompt based on the error type; calculating a number of changes based on a message received from the second large language model; changing the positions of the change start checkpoint and the change end checkpoint within the range of the number of changes, and modifying the second prompt in consideration of the changed positions of the change start checkpoint and the change end checkpoint; and transmitting the modified second prompt to the second large language model to output an evaluation score that evaluates whether the error type has changed and the degree of the error.
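The evaluation flow in the abstract can be sketched as a loop between two models. This is a minimal illustration, not the patented implementation: the stub functions `first_llm` and `second_llm` and the string-slicing "checkpoints" are hypothetical stand-ins for the actual models and prompt-revision logic.

```python
# Hypothetical sketch of the two-model evaluation loop from the abstract.
# `first_llm` and `second_llm` stand in for the two large language models
# (stubbed here so the control flow can run).

def first_llm(prompt: str) -> str:
    # Stub for the question-answering model (first LLM).
    return "answer to: " + prompt

def second_llm(prompt: str) -> dict:
    # Stub for the evaluator model (second LLM). A real evaluator would
    # classify the error type and report how many prompt revisions to try.
    return {"error_type": "factual", "num_changes": 2, "score": 0.5}

def evaluate_response(question: str) -> dict:
    answer = first_llm(question)                     # response via the first prompt
    eval_prompt = f"Evaluate this answer: {answer}"  # the second prompt
    result = second_llm(eval_prompt)

    # Record change checkpoints in the second prompt based on the error type.
    start, end = 0, len(eval_prompt)

    # Move the checkpoints and revise the prompt, staying within the number
    # of changes reported by the evaluator, then re-evaluate each revision.
    for _ in range(result["num_changes"]):
        start, end = start + 1, end - 1
        revised = eval_prompt[start:end]  # segment between the checkpoints
        result = second_llm(revised)
    return result

print(evaluate_response("What is the capital of France?"))
```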

Inventors

  • 이정훈

Assignees

  • (주)모아소프트

Dates

Publication Date
2026-05-07
Application Date
2024-10-30

Claims (8)

  1. A response evaluation method performed by a response evaluation device, comprising: transmitting a second prompt to a second large language model to evaluate a response that a first large language model-based question answering device generated using a first prompt, and outputting an error type classified from an error in the response; recording a change start checkpoint and a change end checkpoint in the second prompt based on the error type; calculating a number of changes based on a message received from the second large language model; changing positions of the change start checkpoint and the change end checkpoint within the range of the number of changes, and modifying the second prompt in consideration of the changed positions of the change start checkpoint and the change end checkpoint; and transmitting the modified second prompt to the second large language model to output an evaluation score that evaluates whether the error type has changed and the degree of the error.
  2. The response evaluation method of claim 1, wherein calculating the number of changes comprises: connecting an output of the second large language model to an input of the first large language model; and connecting the output of the second large language model to an input of a retrieval-augmented generation model.
  3. The response evaluation method of claim 2, wherein calculating the number of changes comprises: transmitting a message from the second large language model to the first large language model; and calculating the number of changes based on a message received from the first large language model.
  4. The response evaluation method of claim 2, wherein calculating the number of changes comprises: transmitting a message from the second large language model to the retrieval-augmented generation model; and calculating the number of changes based on a message received from the retrieval-augmented generation model.
  5. The response evaluation method of claim 1, wherein outputting the evaluation score comprises transmitting the evaluation score to the question answering device.
  6. A response evaluation device comprising: an error classification unit that transmits a second prompt to a second large language model to evaluate a response that a first large language model-based question answering device generated using a first prompt, and outputs an error type classified from an error in the response; a prompt recording unit that records a change start checkpoint and a change end checkpoint in the second prompt based on the error type; a change count control unit that calculates a number of changes based on a message received from the second large language model; a prompt changing unit that changes positions of the change start checkpoint and the change end checkpoint within the range of the number of changes, and modifies the second prompt in consideration of the changed positions of the change start checkpoint and the change end checkpoint; and a response evaluation unit that transmits the modified second prompt to the second large language model and outputs an evaluation score that evaluates whether the error type has changed and the degree of the error.
  7. A computer-readable recording medium storing a program for performing the response evaluation method of claim 1.
  8. A computer program stored on a computer-readable recording medium to perform the response evaluation method of claim 1.

Description

The present invention relates to a method and apparatus for evaluating responses. More specifically, it relates to a method for evaluating responses to a query produced by tabular-data-based retrieval-augmented generation with a large language model, and an apparatus using the same. It provides a method and apparatus for evaluating, with another large language model, a response obtained for a query from a large language model that uses a retrieval-augmented generation model. The content in this section merely provides background information for the present embodiment and does not constitute prior art.

The emergence of language models stemmed from the need to overcome various limitations in natural language processing. Early language processing systems relied primarily on keyword-based search and rule-based approaches; these systems failed to understand context and were limited to processing individual words. Consequently, a need arose for language models capable of understanding meaning in context and handling diverse expressions, much as humans do. In particular, as the development of the Internet generated vast amounts of text data, efficiently extracting meaningful information from this data became crucial, and language models evolved to grasp accurate meaning and extract information even from unstructured data. Alongside the advancement of conversational AI, the demand for natural human-computer interaction further heightened the need for language models that enhance natural language processing in areas such as question-answering systems, virtual assistants, and chatbots.
Furthermore, as multilingual processing grew in importance, powerful language models capable of understanding and translating various languages were required. In this context, neural network-based models, particularly those built on the Transformer architecture, were developed. Models such as BERT, capable of understanding context bidirectionally, then emerged and significantly improved the precision of natural language processing. Ultimately, large language models have opened new horizons in natural language processing by learning from large volumes of data with many parameters, providing natural language understanding and generation capabilities approaching human levels.

Large language models (LLMs) are deep learning models that perform natural language processing tasks. They are trained on large datasets and can perform text recognition, classification, question answering, and translation; the Transformer model is a representative example. Large language models primarily receive prompts as input.

Recently, the phenomenon of hallucination, or confabulation, occurring during the training and generation of deep learning models has become a significant issue. Hallucination refers to a model generating information about a topic that is unrelated to reality or differs from the facts. Factors such as insufficient training data, contradictory information, and difficulty understanding context can cause it.

Retrieval-augmented generation (RAG) can be applied to mitigate hallucination. RAG combines retrieval-based models with generative language models to improve the quality, accuracy, and variety of generated text, producing accurate responses by drawing on information retrieved from external sources.
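The hallucination-mitigation idea behind RAG can be illustrated with a toy example over tabular data, matching the patent's tabular-data setting. The keyword-match retriever and the `TABLE` contents below are assumptions for illustration only; a real system would use vector search and an actual language model.

```python
# Toy sketch of retrieval-augmented generation over tabular data: rows
# matching the question are retrieved and prepended to the prompt so the
# model answers from retrieved facts rather than hallucinating.

TABLE = [
    {"country": "France", "capital": "Paris"},
    {"country": "Japan", "capital": "Tokyo"},
]

def retrieve(question: str) -> list:
    # Naive retriever: return table rows whose values appear in the question.
    q = question.lower()
    return [row for row in TABLE if any(v.lower() in q for v in row.values())]

def augment_prompt(question: str) -> str:
    # Ground the prompt in the retrieved rows (the "augmentation" step);
    # the augmented prompt would then be sent to a language model.
    facts = "; ".join(str(row) for row in retrieve(question))
    return f"Context: {facts}\nQuestion: {question}"

print(augment_prompt("What is the capital of France?"))
```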
LangChain is a representative framework that implements a retrieval-augmented generation model. LangChain is a framework for developing language-model-based applications; it provides interfaces that connect large language models to data and enable interaction with the environment. With LangChain, documents can be structured, and question answering, summarization, and analysis can be performed on the structured data. LangChain provides data loaders that allow access to and searching of data from external sources, generates embeddings that convert retrieved data into vectors, and offers a vector store to hold and manage the generated embedding vectors. In particular, through RAG (retrieval-augmented generation) technology, which integrates information retrieval with generation, LangChain can locate relevant documents and construct answers based on the found documents.

Regarding applications of retrieval-augmented generation models, Korean Patent No. 10-2648139 provides a server for providing an AI chatbot tutor for personalized learning support based on complete learning, and Korean Patent No. 10-2669422 provides a kiosk system utilizing
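The embedding and vector-store workflow described above can be illustrated without the framework itself. The bag-of-words "embedding" below is a hypothetical stand-in for a real embedding model, and the list-based store stands in for a vector repository; LangChain's actual APIs are not used here.

```python
# Toy illustration of the embedding / vector-store workflow: documents are
# embedded as vectors, stored, and retrieved by cosine similarity.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in embedding: a bag-of-words term-frequency vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = ["Paris is the capital of France",
        "Tokyo is the capital of Japan"]
store = [(d, embed(d)) for d in docs]  # the "vector repository"

def search(query: str) -> str:
    # Retrieve the stored document most similar to the query vector.
    qv = embed(query)
    return max(store, key=lambda item: cosine(qv, item[1]))[0]

print(search("capital of France"))
```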