JP-2026074704-A - Reliability evaluation device, reliability evaluation method, and reliability evaluation program
Abstract
[Problem] To provide a reliability evaluation device that can accurately evaluate the reliability of retrieval-augmented generation (RAG) using appropriate indicators and parameters. [Solution] The reliability evaluation device 1 includes: a dataset acquisition unit 11 that acquires training data associating a set of candidate reference indicators, which quantitatively indicate information-retrieval reliability and generation reliability, with the correctness of the generation results; a model learning unit 12 that learns the parameters of a model taking the set of reference indicators as explanatory variables and the correctness of the generation results as the target variable, and records the prediction accuracy; a model update unit 13 that retrains the model using a new set of explanatory variables from which lower-importance reference indicators have been preferentially excluded; an explanatory variable selection unit 14 that selects the set of reference indicators maximizing the prediction accuracy as the optimal explanatory variables and constructs the trained model; and an evaluation value output unit 15 that outputs a reliability evaluation value obtained from the trained model for a new prompt and its generation results. [Selection Diagram] Figure 1
Inventors
- 長谷川 健人
- 披田野 清良
- 福島 和英
Assignees
- KDDI株式会社
Dates
- Publication Date
- 20260507
- Application Date
- 20241021
Claims (4)
- A reliability evaluation device comprising: a dataset acquisition unit that acquires training data associating a set of candidate reference indicators, which quantitatively indicate the information-retrieval reliability and generation reliability of generation results produced by a generative language model using retrieval-augmented generation in response to prompts, with the correctness of those generation results; a model learning unit that learns the parameters of a model taking the set of reference indicators as explanatory variables and the correctness of the generation results as the target variable, and records the prediction accuracy; a model update unit that calculates the importance of each reference indicator, takes as a new set of explanatory variables the set obtained by preferentially excluding reference indicators of lower importance, and causes the model learning unit to execute its processing again; an explanatory variable selection unit that repeatedly executes the processing of the model update unit, selects the set of reference indicators that maximizes the prediction accuracy as the optimal explanatory variables, and thereby constructs a trained model; and an evaluation value output unit that outputs a reliability evaluation value obtained by inputting into the trained model the values of the optimal explanatory variables for a new prompt and the generation results produced for that prompt by the generative language model using retrieval-augmented generation.
- The reliability evaluation device according to claim 1, wherein the model update unit retains, from among the reference indicators corresponding to information-retrieval reliability and generation reliability, a predetermined number of reference indicators with the highest importance, and excludes the reference indicator with the lowest importance, thereby creating the new set of explanatory variables and causing the model learning unit to execute its processing again.
- A reliability evaluation method executed by a computer, in which: a dataset acquisition unit acquires training data associating a set of candidate reference indicators, which quantitatively indicate the information-retrieval reliability and generation reliability of generation results produced by a generative language model using retrieval-augmented generation in response to prompts, with the correctness of those generation results; a model learning unit learns the parameters of a model taking the set of reference indicators as explanatory variables and the correctness of the generation results as the target variable, and records the prediction accuracy; a model update unit calculates the importance of each reference indicator, takes as a new set of explanatory variables the set obtained by preferentially excluding reference indicators of lower importance, and causes the model learning unit to execute its processing again; an explanatory variable selection unit repeatedly executes the processing of the model update unit and selects the set of reference indicators that maximizes the prediction accuracy as the optimal explanatory variables, thereby constructing a trained model; and an evaluation value output unit outputs a reliability evaluation value obtained by inputting into the trained model the values of the optimal explanatory variables for a new prompt and the generation results produced for that prompt by the generative language model using retrieval-augmented generation.
- A reliability evaluation program for causing a computer to function as a reliability evaluation device according to claim 1 or claim 2.
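The train-measure-exclude loop recited in the claims can be sketched as follows. This is a minimal illustration under assumptions the claims do not fix: a random-forest classifier supplies the importance scores, 3-fold cross-validation stands in for the recorded prediction accuracy, and all function and variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def select_explanatory_variables(X, y, feature_names, min_features=2):
    """Iteratively drop the least-important reference indicator and retrain,
    keeping the feature subset whose cross-validated accuracy is highest.

    X: (n_samples, n_features) matrix of reference-indicator values.
    y: 0/1 correctness labels for the generation results.
    """
    kept = list(feature_names)
    best_acc, best_set = -1.0, kept[:]
    while len(kept) >= min_features:
        cols = [feature_names.index(f) for f in kept]
        model = RandomForestClassifier(n_estimators=100, random_state=0)
        # Record the prediction accuracy for the current explanatory variables.
        acc = cross_val_score(model, X[:, cols], y, cv=3).mean()
        if acc > best_acc:
            best_acc, best_set = acc, kept[:]
        model.fit(X[:, cols], y)
        # Preferentially exclude the indicator with the lowest importance.
        drop = kept[int(np.argmin(model.feature_importances_))]
        kept.remove(drop)
    return best_set, best_acc
```

On synthetic data where only one indicator is informative, the loop keeps that indicator in the selected subset while noise indicators are pruned away.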
Description
This invention relates to a technique for evaluating the reliability of text generated in a retrieval-augmented generation (RAG) task using a generative language model.

A generative language model is a model that outputs text as a response to a text input called a prompt; an example is the Generative Pre-trained Transformer (GPT) described in Non-Patent Document 1. Internally, a generative language model processes text in minimal units called tokens and generates a response by predicting the tokens that immediately follow the prompt text. A generative language model trained on a large amount of text can output natural-sounding responses in dialogue and question-answering tasks, achieving high performance.

However, generative language models are known to exhibit a phenomenon called hallucination, in which false information is output as natural-sounding sentences. It is therefore difficult to judge the truthfulness of the generated text from the output alone, and a method for evaluating the validity of a generative language model's output is desired.

One method for quantifying the uncertainty of a generative language model's output is Semantic Uncertainty, described in Non-Patent Document 2. In this method, the evaluator first generates N responses to a single prompt and obtains the likelihood of each response sentence. The likelihood of a sentence can be calculated, for example, as the geometric mean of the likelihoods of its predicted tokens. Next, the evaluator groups the response sentences into categories according to their semantic differences. For example, for the prompt "Where is the capital of France?", "Paris" and "It's Paris." are placed in the same category because they have the same meaning, while "London" is placed in a different category.
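The sample-score-cluster procedure of Semantic Uncertainty described above can be sketched as follows, assuming the semantic grouping has already been done (e.g. by an entailment model, as in the cited method); the input format and function names are illustrative.

```python
import math
from collections import defaultdict

def sentence_likelihood(token_probs):
    """p(s|x) as the geometric mean of the per-token likelihoods,
    computed in log space to avoid underflow on long sentences."""
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(log_sum / len(token_probs))

def semantic_entropy(categorized_responses):
    """Semantic Entropy over N sampled responses.

    categorized_responses: list of (category_id, token_probs) pairs, where
    category_id groups responses judged semantically equivalent.
    """
    # p(c|x): total likelihood mass of each semantic category, normalised.
    mass = defaultdict(float)
    for category, token_probs in categorized_responses:
        mass[category] += sentence_likelihood(token_probs)
    total = sum(mass.values())
    # Entropy over the category distribution: higher means more uncertain.
    return -sum((m / total) * math.log(m / total) for m in mass.values())
```

For the capital-of-France example, responses grouped as "paris" versus "london" yield low entropy when the "paris" category carries most of the likelihood mass, and zero entropy when all responses agree.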
Here, the semantic likelihood p(c|x) of a category c for a prompt x to a generative language model is the total likelihood of the response sentences s belonging to that category:

p(c|x) = Σ_{s∈c} p(s|x)

The Semantic Entropy U_se(x), an indicator of uncertainty, is then given by:

U_se(x) = −Σ_{c∈C} p(c|x) log p(c|x)

where C is the set of categories.

[Patent Document 1] Japanese Patent Application No. 2024-64828 Specification
[Non-Patent Document 1] A. Radford et al., "Improving Language Understanding by Generative Pre-Training," 2018, [online], accessed March 29, 2024, Internet <https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf>
[Non-Patent Document 2] L. Kuhn et al., "Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation," International Conference on Learning Representations, 2023.
[Non-Patent Document 3] Z. Lin et al., "Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models," Transactions on Machine Learning Research, 2024.
[Non-Patent Document 4] L. Breiman, "Random Forests," Machine Learning, 45(1), 5-32, 2001.

[Figure 1] A block diagram showing the functional configuration of the reliability evaluation device in an embodiment.
[Figure 2] A flowchart illustrating the procedure of the reliability evaluation method in the embodiment.

An example embodiment of the present invention is described below. In RAG, external knowledge is divided into chunks, and each chunk is vectorized. Let R be the document containing the external knowledge and r a divided chunk, and denote the operation of vectorizing r as f(r). When a system using RAG receives a prompt x, it creates a new prompt x' in the following steps and outputs an answer y based on x'. (1) The vector v_x of the prompt x is calculated as v_x = f(x). (2) For each i-th chunk r_i, the similarity s_i = σ(v_x, f(r_i)) with the prompt x is calculated, where σ is a similarity function, for example cosine similarity.
(3) For the original prompt x, the chunk r_i with the largest s_i is referenced to create the new prompt x'.

In this embodiment, the certainty of the output of the generative language model using RAG is called RAG reliability. RAG reliability is a value between 0 and 1, where a larger value indicates a higher probability that the output is correct. Uncertainty, by contrast, is an index whose value is inversely related to certainty. Since the two are essentially interchangeable, certainty and uncertainty are here referred to collectively as reliability.

In RAG, the accuracy of the process of searching the database based on the prompt affects the quality of the response. Therefore, to quantify the reliability of a response by the generative language model, the search quality must be taken into account. In the reliability evaluation method of this embodiment, an index representing the search quality in RAG is accordingly introduced into the evaluation of the certainty or uncertainty of the output of the generative language model. The indicators that can serve as explanatory variables for calculating RAG reliability can be broadly divided into indicators of information-retrieval reliability and indicators of generation reliability.
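Steps (1)-(3) of the RAG retrieval procedure can be sketched as follows. This is a toy illustration: the character-histogram `embed` is a stand-in for the real embedding function f(·) (a sentence-embedding model in practice), and the prompt template and function names are assumptions, not part of the specification.

```python
import numpy as np

def embed(text):
    """Toy stand-in for the embedding function f(.): a character histogram.
    A real system would call a sentence-embedding model here."""
    vec = np.zeros(64)
    for ch in text:
        vec[ord(ch) % 64] += 1.0
    return vec

def cosine(u, v):
    """Cosine similarity, one choice for the similarity function sigma."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def build_rag_prompt(x, chunks):
    """Create the new prompt x' from prompt x and external-knowledge chunks."""
    vx = embed(x)                                   # (1) v_x = f(x)
    sims = [cosine(vx, embed(r)) for r in chunks]   # (2) s_i = sigma(v_x, f(r_i))
    best = chunks[int(np.argmax(sims))]             # (3) chunk with largest s_i
    return f"Refer to the following context:\n{best}\n\nQuestion: {x}"
```

The answer y is then obtained by passing x' to the generative language model; with the toy embedding, a chunk identical to the prompt always attains the maximal similarity of 1 and is selected.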