
KR-20260066932-A - APPARATUS, METHOD AND PROGRAM FOR EVALUATING RESPONSE TEXT BASED ON PROMPT-CHAIN

KR 20260066932 A

Abstract

A prompt chain-based response text evaluation device includes: an evaluation request receiving unit that receives a user request to evaluate response text previously output by a large-scale language model; an evaluation criterion generating unit that generates an evaluation criterion prompt including task-specific evaluation items and evaluation criteria based on the user request, using the large-scale language model; a prompt chaining performing unit that performs prompt chaining by connecting the evaluation criterion prompt to an evaluation instruction prompt containing the response text; and an evaluation performing unit that inputs the evaluation instruction prompt into the large-scale language model to produce an evaluation result for the response text.
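The four units above form a two-step prompt chain: the model first writes its own rubric, and that rubric is then chained into the prompt that performs the scoring. A minimal sketch of that flow, assuming a hypothetical complete(system, user) wrapper around any chat-completion LLM API; the function name and prompt wording are illustrative, not taken from the patent:

    # Minimal sketch of the two-step prompt chain described in the abstract.
    # `complete` is a hypothetical wrapper around any chat-completion LLM API.
    def complete(system: str, user: str) -> str:
        raise NotImplementedError("plug an actual LLM client in here")

    def evaluate_response(user_request: str, response_text: str) -> str:
        # Step 1 (evaluation criterion generating unit): have the LLM derive
        # task-specific evaluation items and criteria from the user request.
        criteria = complete(
            system="You design evaluation rubrics for LLM outputs.",
            user=f"Write evaluation items and criteria for this task:\n{user_request}",
        )
        # Step 2 (prompt chaining + evaluation performing units): connect the
        # generated criteria to an evaluation instruction prompt containing
        # the response text, then feed it back into the same LLM.
        evaluation_instruction = (
            f"Evaluation criteria:\n{criteria}\n\n"
            f"Evaluate the following response against these criteria:\n{response_text}"
        )
        return complete(
            system="You are an evaluator of LLM-generated text.",
            user=evaluation_instruction,
        )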

Inventors

  • Baek Ji-su (백지수)

Assignees

  • KT Corporation (주식회사 케이티)

Dates

Publication Date
2026-05-12
Application Date
2024-11-05

Claims (20)

  1. A prompt chain-based response text evaluation device comprising: an evaluation request receiving unit that receives a user request to evaluate response text previously output by a large-scale language model; an evaluation criterion generating unit that generates an evaluation criterion prompt including task-specific evaluation items and evaluation criteria based on the user request, using the large-scale language model; a prompt chaining performing unit that performs prompt chaining by connecting the evaluation criterion prompt to an evaluation instruction prompt containing the response text; and an evaluation performing unit that inputs the evaluation instruction prompt into the large-scale language model to produce an evaluation result for the response text.
  2. The device of claim 1, wherein the evaluation request receiving unit receives, as the user request, an input from a user specifying a function of the large-scale language model to be evaluated.
  3. The device of claim 1, wherein the response text is generated in a first language, and the evaluation criterion generating unit generates the evaluation criterion prompt in a second language different from the first language.
  4. The device of claim 3, wherein the first language is Korean and the second language is English.
  5. The device of claim 1, wherein the evaluation performing unit assigns, to the evaluation criterion prompt, a native speaker role corresponding to the target language of the response text, thereby instructing the model to infer the evaluation criteria from the considerations of a native speaker.
  6. The device of claim 5, wherein the evaluation instruction prompt comprises: a system prompt including a directive defining the native speaker role and a directive describing the evaluation task; and a user prompt including vocabulary related to the target language of the evaluation.
  7. The device of claim 6, wherein the prompt chaining performing unit performs prompt chaining by connecting the evaluation criterion prompt to the user prompt.
  8. The device of claim 6, wherein the user prompt includes chain-of-thought information instructing the evaluation performing unit to carry out the process of producing the evaluation result for the response text step by step.
  9. The device of claim 6, wherein the user prompt includes pre-specified output format information for the format in which the evaluation result is output.
  10. The device of claim 1, wherein the evaluation performing unit produces the evaluation result by calculating the degree of agreement between vocabulary items included in the response text using a multi-head attention operation (see the sketch after these claims).
  11. The device of claim 1, wherein the evaluation performing unit produces the evaluation result by further considering a query and a reference document corresponding to the response text.
  12. The device of claim 1, wherein the evaluation performing unit calculates, based on the evaluation result, a score for each evaluation item, a scoring basis for each evaluation item, and a final score.
  13. A prompt chain-based response text evaluation method comprising: receiving a user request to evaluate response text previously output by a large-scale language model; generating an evaluation criterion prompt including task-specific evaluation items and evaluation criteria based on the user request, using the large-scale language model; performing prompt chaining by connecting the evaluation criterion prompt to an evaluation instruction prompt containing the response text; and inputting the evaluation instruction prompt into the large-scale language model to produce an evaluation result for the response text.
  14. The method of claim 13, wherein receiving the user request comprises receiving, as the user request, an input from a user specifying a function of the large-scale language model to be evaluated.
  15. The method of claim 13, wherein producing the evaluation result comprises assigning, to the evaluation criterion prompt, a native speaker role corresponding to the target language of the response text, thereby instructing the model to infer the evaluation criteria from the considerations of a native speaker.
  16. The method of claim 15, wherein the evaluation instruction prompt comprises: a system prompt including a directive defining the native speaker role and a directive describing the evaluation task; and a user prompt including vocabulary related to the target language of the evaluation.
  17. The method of claim 16, wherein performing the prompt chaining comprises connecting the evaluation criterion prompt to the user prompt.
  18. The method of claim 13, wherein producing the evaluation result comprises calculating the degree of agreement between vocabulary items included in the response text using a multi-head attention operation.
  19. The method of claim 13, wherein producing the evaluation result comprises further considering a query and a reference document corresponding to the response text.
  20. A computer program stored on a computer-readable recording medium, comprising a sequence of instructions that, when executed, provide a prompt chain-based response text evaluation method by: receiving a user request to evaluate response text previously output by a large-scale language model; generating, using the large-scale language model, an evaluation criterion prompt including task-specific evaluation items and evaluation criteria based on the user request; performing prompt chaining to connect the evaluation criterion prompt to an evaluation instruction prompt containing the response text; and inputting the evaluation instruction prompt into the large-scale language model to produce an evaluation result for the response text.
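The multi-head attention operation invoked in claims 10 and 18 is not spelled out in the claims themselves. The sketch below assumes it is the standard scaled dot-product attention over token embeddings, averaged across heads; the random projection weights, names, and dimensions are illustrative stand-ins, not taken from the patent.

    import numpy as np

    def multi_head_agreement(tokens: np.ndarray, n_heads: int = 4) -> np.ndarray:
        """Pairwise agreement between token embeddings via scaled dot-product
        attention, averaged over heads. `tokens` has shape (seq_len, d_model);
        the random projections stand in for learned weights (illustrative only)."""
        seq_len, d_model = tokens.shape
        d_head = d_model // n_heads
        rng = np.random.default_rng(0)
        agreement = np.zeros((seq_len, seq_len))
        for _ in range(n_heads):
            w_q = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
            w_k = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
            q, k = tokens @ w_q, tokens @ w_k
            scores = q @ k.T / np.sqrt(d_head)            # scaled dot product
            scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
            weights = np.exp(scores)
            weights /= weights.sum(axis=-1, keepdims=True)
            agreement += weights                          # row i: how much token i attends to each token j
        return agreement / n_heads                        # average over heads

    # Example: agreement matrix for 5 tokens with 32-dimensional embeddings.
    attn = multi_head_agreement(np.random.default_rng(1).standard_normal((5, 32)))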

Description

Apparatus, Method and Program for Evaluating Response Text Based on Prompt-Chain

The present invention relates to a prompt chain-based response text evaluation device, method, and program for a large-scale language model.

Automated evaluation technology, in which AI language models assess the quality of AI-generated text, is becoming increasingly important as the use of Large Language Models (LLMs) grows. In particular, technology that uses LLMs themselves as evaluators of generated text is in high demand, because it allows various aspects of the text under evaluation, such as creativity, to be scored freely without model answers written by humans. These automated evaluation technologies fall broadly into prompt-based evaluation, which operates an existing LLM as the evaluation model, and approaches that build dedicated evaluation models by constructing large-scale instruction datasets and fine-tuning open-source LLMs. Prompt-based evaluation depends for its performance on the strong natural-language prompt comprehension of models such as GPT-4, and has the advantage of being usable even in low-resource settings, without constructing large-scale instruction datasets or fine-tuning a model. It divides into two branches: the LLM-as-a-Judge method, which directly evaluates individual responses of various forms depending on how the prompt is configured, and the LLM-Evaluation-Harness method, which evaluates models on multiple-choice benchmarks for LLM leaderboards.

A major limitation of existing prompt-based automatic evaluation technologies is degraded evaluation performance on multilingual text, such as Korean. Although most established LLMs used as evaluation models were developed as multilingual models, they show degraded instruction-following performance across various tasks in languages other than English. Transformer-decoder language models such as GPT-4, when used as prompt-based automatic evaluation models, predict the next token according to a conditional probability given the input context string; because the model both learns and infers through this next-token prediction, the probability of generating an accurate response is lower when instructions are input in a language other than English. In fact, most conventional automatic evaluation technologies have been researched for the purpose of automatically evaluating English text generated by language models. Although various Korean generation models have emerged in Korea, there has been little interest in prompt engineering to improve the performance of Korean text evaluation; it remains common practice to use Korean prompts such as "Evaluate the next sentence according to the evaluation criteria" simply for the convenience of Korean users. Furthermore, owing to the characteristics of decoder-structured language models, conventional automatic evaluation technologies are not free from the limitation of relying on evaluation items and criteria descriptions (rubrics) defined by humans. In automatic text quality evaluation, if there is a discrepancy between the internal knowledge pre-learned by the evaluation model and a human-defined rubric, instruction comprehension performance may deteriorate.
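In conventional notation (the formula below is the standard autoregressive factorization, supplied here for clarity rather than reproduced from the patent), such a decoder model assigns a response W = (w_1, ..., w_T) the probability

    P(W) = \prod_{t=1}^{T} P\left(w_t \mid w_1, \dots, w_{t-1}\right)

so every token is drawn from a conditional distribution over the preceding context; when that context is phrased in a language underrepresented in training, the chance of an accurate continuation falls accordingly.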
FIG. 1 is a configuration diagram illustrating a prompt chain-based response text evaluation device according to one embodiment of the present invention. FIG. 2 is a flowchart illustrating a prompt chain-based response text evaluation method performed in the prompt chain-based response text evaluation device according to an embodiment of the present invention. FIG. 3 is a flowchart illustrating a prompt chain-based response text evaluation method for evaluating Korean response text according to an embodiment of the present invention. FIG. 4 is a diagram illustrating an example configuration in which a Korean-language expert role is defined in a system prompt and a user prompt for evaluating Korean response text according to an embodiment of the present invention. FIG. 5 is a diagram illustrating an example of an evaluation criterion prompt connected to a user prompt for evaluating Korean response text according to an embodiment of the present invention. FIG. 6 is a diagram illustrating instructions for an evaluation procedure, an evaluation target, and a scoring result output format defined in a user prompt for evaluating Korean response text according to an embodiment of the present invention.

Embodiments of the present invention are described below with reference to the attached drawings so that those skilled in the art can easily implement the invention. However, the present invention may be embodied in various different forms and is not limited to the embodiments described herein. Furthermore, in order to clearly explain the present invention, parts of the drawings not related to the description have been omitted.
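As a reading aid for FIGs. 4 through 6, the sketch below renders the described prompt layout in code: a system prompt defining the Korean-language expert role (FIG. 4), the chained evaluation criterion prompt inside the user prompt (FIG. 5), and the step-by-step procedure and scoring output format (FIG. 6). All wording and names are hypothetical reconstructions from the figure captions, not the patent's actual prompts.

    # Hypothetical rendering of the prompt layout in FIGs. 4-6 (wording illustrative).
    SYSTEM_PROMPT = (
        "You are a native Korean language expert who evaluates "
        "Korean text generated by a language model."
    )

    def build_user_prompt(criteria: str, response_text: str) -> str:
        # The evaluation criterion prompt (generated in English per claims 3-4)
        # is chained into the user prompt, followed by the chain-of-thought and
        # output-format instructions of claims 8-9.
        return (
            f"Evaluation criteria:\n{criteria}\n\n"
            f"Text to evaluate (Korean):\n{response_text}\n\n"
            "Evaluate step by step, then report:\n"
            "- a score for each evaluation item\n"
            "- the scoring basis for each item\n"
            "- a final score"
        )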