CN-121997112-A - Large language model hallucination detection method based on graph neural network

CN121997112A

Abstract

The invention discloses a large language model hallucination detection method based on a graph neural network, belonging to the technical field of classification detection. The method comprises: constructing a labeled sample set containing question-answer pairs, in which each answer is labeled as a fact or a hallucination; coupling the hidden states of a specific layer of the large language model with its attention matrix to establish a weighted directed graph; inputting the weighted directed graph into a hallucination detector based on a graph neural network for supervised learning and training; and deploying the trained hallucination detector to perform real-time hallucination detection on question-answer pairs while synchronously providing token-level interpretability analysis. The invention detects in real time whether content generated by a large language model contains hallucinations, and can improve the reliability of the large language model.

Inventors

  • WANG YONGJIE
  • KONG LINGGANG
  • ZHONG XIAOFENG
  • LIU JINGJU
  • ZHANG YUNLONG
  • FU HAORAN

Assignees

  • National University of Defense Technology, Chinese People's Liberation Army (中国人民解放军国防科技大学)

Dates

Publication Date
2026-05-08
Application Date
2025-12-06

Claims (10)

  1. A method for detecting hallucinations of a large language model based on a graph neural network, the method comprising: Step S1, constructing a labeled sample set containing question-answer pairs, and coupling the hidden states of a specific layer of the large language model with its attention matrix to establish a weighted directed graph, wherein the answers in the question-answer pairs are labeled as facts or hallucinations; Step S2, inputting the weighted directed graph into a hallucination detector based on a graph neural network for supervised learning and training; and Step S3, deploying the trained hallucination detector, performing real-time hallucination detection on question-answer pairs, and synchronously providing token-level interpretability analysis.
  2. The method for detecting hallucinations of a large language model based on a graph neural network according to claim 1, wherein in step S1 the weighted directed graph represents the coupled hidden states and attention matrix of the specific layer of the large language model: nodes in the graph represent the tokens in the question-answer reply together with their hidden-layer features, edges between nodes represent the dependency relations between tokens, and the edge weights are derived from the attention matrix.
  3. The method for detecting hallucinations of a large language model based on a graph neural network according to claim 2, wherein in step S1, for the weighted directed graph: for a large language model with an L-layer Transformer architecture, the input of layer j comes from the hidden embedding of layer j−1, denoted H^(j−1) = [t_1, t_2, …, t_n], where t_i ∈ R^d represents the hidden state of token i; after the multi-head self-attention mechanism of layer j, the linear projections are expressed as Q = H^(j−1)W_Q, K = H^(j−1)W_K, V = H^(j−1)W_V, wherein W_Q, W_K and W_V are parameter matrices, Q represents the query vectors, and K and V represent the key vectors and value vectors, respectively; the attention matrix is calculated by measuring the similarity between Q and K so as to capture the relations between tokens, formalized as A = softmax(QK^T/√d_k), wherein d_k is the dimension of the key vectors and √d_k is a scaling factor; the attention matrix A is a lower triangular matrix whose entry a_ij quantifies the directed dependency from token i to token j, i.e., the attention value.
  4. The method for detecting hallucinations of a large language model based on a graph neural network according to claim 3, wherein in step S2: the weighted directed graphs are classified by a graph convolutional neural network, which adopts the GraphConv architecture to extract a global graph characterization and iteratively aggregates node features with their neighbor information to realize graph-level classification; the weighted directed graph is G = (V, E, X, A), where V represents the node set, E represents the edge set, X ∈ R^{N×d} represents the node feature matrix, and A ∈ R^{N×N} represents the adjacency matrix, with a_ij = 0 if there is no edge from node i to node j; when the graph G is processed by the GNN, the feature of node i at layer l is expressed as h_i^(l) = ReLU(Σ_{j∈N(i)} (w_ji/√(D_ii·D_jj))·W·h_j^(l−1) + b), wherein D is a diagonal degree matrix, D_ii and D_jj represent the in-degrees of node i and node j, respectively, w_ji represents the edge weight from node j to node i, W and b represent trainable weight and bias parameters, respectively, and ReLU is a nonlinear activation function; the graph characterization is extracted by stacking multiple graph convolution layers, and the token-level features are aggregated into a graph-level representation by global averaging h_G = (1/N)Σ_{i=1}^{N} h_i, wherein N is the number of nodes and h_G represents the global feature of the weighted directed graph; h_G is input into a fully connected neural network to obtain the probabilities of the fact label and the hallucination label, expressed as ŷ = softmax(W_f·h_G + b_f), wherein W_f and b_f represent the weights and bias of the fully connected layer, and ŷ is a two-dimensional vector; the data set is divided into a training set, a validation set and a test set, the training process on the training set follows the standard supervised training procedure, using the Adam optimizer and the two-class cross-entropy as the objective function.
  5. The method for detecting hallucinations of a large language model based on a graph neural network according to claim 4, wherein in step S3: given the weighted directed graph G and its detection prediction Y, an explanation subgraph G_s is identified by maximizing the mutual information between G_s and Y: max_{G_s} I(Y; G_s) = H(Y) − H(Y | G = G_s), wherein H(Y) is the entropy of Y and H(Y | G = G_s) is the conditional entropy of Y given the subgraph G_s; since H(Y) is fixed for a trained detector, maximizing the mutual information is equivalent to minimizing the conditional entropy: min_{G_s} H(Y | G = G_s); the discrete graph optimization problem is converted into a continuous optimization scheme by introducing learnable mask parameters on the nodes and weighted edges: for each node v, a learnable scalar parameter θ_v is defined and normalized to the interval [0, 1] by a sigmoid function to obtain the node importance score m_v = σ(θ_v), wherein m_v represents the probability that node v is included in the explanation subgraph G_s; the original node features are scaled by the node mask to obtain the masked node feature matrix X_s, with x_s,v = m_v ⊙ x_v, where x_v represents the original feature of node v and ⊙ represents element-wise multiplication; the importance of edge (i, j) is scored as m_ij = σ(θ_ij), and the masked adjacency matrix A_s with entries m_ij·a_ij is constructed; the explanation subgraph G_s is defined by the masked feature matrix X_s and the masked adjacency matrix A_s as G_s = (X_s, A_s); the new masked graph contains continuous node and edge weights limited to the interval [0, 1]; by introducing the node mask parameters θ_v and the edge mask parameters θ_e, the optimization objective is redefined as a continuous optimization problem with respect to these parameters, with objective function L = −Σ_{c=1}^{C} P(Y = c | G) log P(Y = c | G_s) + λ_v‖M_v‖_1 + λ_e‖M_e‖_1, wherein C represents the number of categories, P(Y = c | G) is the predicted probability of category c given the original graph G, P(Y = c | G_s) is the predicted probability given the masked subgraph G_s, λ_v and λ_e are regularization coefficients controlling the sparsity of the node mask and the edge mask, respectively, and ‖·‖_1 denotes L1 norm regularization.
  6. A large language model hallucination detection system based on a graph neural network, the system comprising: a first processing unit configured to construct a labeled sample set containing question-answer pairs, and to couple the hidden states of a specific layer of the large language model with its attention matrix to establish a weighted directed graph, wherein the answers in the question-answer pairs are labeled as facts or hallucinations; a second processing unit configured to input the weighted directed graph into a hallucination detector based on a graph neural network for supervised learning and training; and a third processing unit configured to deploy the trained hallucination detector, to perform real-time hallucination detection on question-answer pairs, and to synchronously provide token-level interpretability analysis.
  7. The system of claim 6, wherein the weighted directed graph represents the coupled hidden states and attention matrix of the specific layer of the large language model: nodes in the graph represent the tokens in the question-answer reply together with their hidden-layer features, edges between nodes represent the dependency relations between tokens, and the edge weights are derived from the attention matrix.
  8. The large language model hallucination detection system based on a graph neural network according to claim 7, wherein for the weighted directed graph: for a large language model with an L-layer Transformer architecture, the input of layer j comes from the hidden embedding of layer j−1, denoted H^(j−1) = [t_1, t_2, …, t_n], where t_i ∈ R^d represents the hidden state of token i; after the multi-head self-attention mechanism of layer j, the linear projections are expressed as Q = H^(j−1)W_Q, K = H^(j−1)W_K, V = H^(j−1)W_V, wherein W_Q, W_K and W_V are parameter matrices, Q represents the query vectors, and K and V represent the key vectors and value vectors, respectively; the attention matrix is calculated by measuring the similarity between Q and K so as to capture the relations between tokens, formalized as A = softmax(QK^T/√d_k), wherein d_k is the dimension of the key vectors and √d_k is a scaling factor; the attention matrix A is a lower triangular matrix whose entry a_ij quantifies the directed dependency from token i to token j, i.e., the attention value.
  9. The large language model hallucination detection system based on a graph neural network according to claim 8, wherein the second processing unit is specifically configured to: classify the weighted directed graphs by a graph convolutional neural network, which adopts the GraphConv architecture to extract a global graph characterization and iteratively aggregates node features with their neighbor information to realize graph-level classification; the weighted directed graph is G = (V, E, X, A), where V represents the node set, E represents the edge set, X ∈ R^{N×d} represents the node feature matrix, and A ∈ R^{N×N} represents the adjacency matrix, with a_ij = 0 if there is no edge from node i to node j; when the graph G is processed by the GNN, the feature of node i at layer l is expressed as h_i^(l) = ReLU(Σ_{j∈N(i)} (w_ji/√(D_ii·D_jj))·W·h_j^(l−1) + b), wherein D is a diagonal degree matrix, D_ii and D_jj represent the in-degrees of node i and node j, respectively, w_ji represents the edge weight from node j to node i, W and b represent trainable weight and bias parameters, respectively, and ReLU is a nonlinear activation function; the graph characterization is extracted by stacking multiple graph convolution layers, and the token-level features are aggregated into a graph-level representation by global averaging h_G = (1/N)Σ_{i=1}^{N} h_i, wherein N is the number of nodes and h_G represents the global feature of the weighted directed graph; h_G is input into a fully connected neural network to obtain the probabilities of the fact label and the hallucination label, expressed as ŷ = softmax(W_f·h_G + b_f), wherein W_f and b_f represent the weights and bias of the fully connected layer, and ŷ is a two-dimensional vector; the data set is divided into a training set, a validation set and a test set, the training process on the training set follows the standard supervised training procedure, using the Adam optimizer and the two-class cross-entropy as the objective function.
  10. The large language model hallucination detection system based on a graph neural network according to claim 9, wherein the third processing unit is specifically configured to: given the weighted directed graph G and its detection prediction Y, identify an explanation subgraph G_s by maximizing the mutual information between G_s and Y: max_{G_s} I(Y; G_s) = H(Y) − H(Y | G = G_s), wherein H(Y) is the entropy of Y and H(Y | G = G_s) is the conditional entropy of Y given the subgraph G_s; since H(Y) is fixed for a trained detector, maximizing the mutual information is equivalent to minimizing the conditional entropy: min_{G_s} H(Y | G = G_s); the discrete graph optimization problem is converted into a continuous optimization scheme by introducing learnable mask parameters on the nodes and weighted edges: for each node v, a learnable scalar parameter θ_v is defined and normalized to the interval [0, 1] by a sigmoid function to obtain the node importance score m_v = σ(θ_v), wherein m_v represents the probability that node v is included in the explanation subgraph G_s; the original node features are scaled by the node mask to obtain the masked node feature matrix X_s, with x_s,v = m_v ⊙ x_v, where x_v represents the original feature of node v and ⊙ represents element-wise multiplication; the importance of edge (i, j) is scored as m_ij = σ(θ_ij), and the masked adjacency matrix A_s with entries m_ij·a_ij is constructed; the explanation subgraph G_s is defined by the masked feature matrix X_s and the masked adjacency matrix A_s as G_s = (X_s, A_s); the new masked graph contains continuous node and edge weights limited to the interval [0, 1]; by introducing the node mask parameters θ_v and the edge mask parameters θ_e, the optimization objective is redefined as a continuous optimization problem with respect to these parameters, with objective function L = −Σ_{c=1}^{C} P(Y = c | G) log P(Y = c | G_s) + λ_v‖M_v‖_1 + λ_e‖M_e‖_1, wherein C represents the number of categories, P(Y = c | G) is the predicted probability of category c given the original graph G, P(Y = c | G_s) is the predicted probability given the masked subgraph G_s, λ_v and λ_e are regularization coefficients controlling the sparsity of the node mask and the edge mask, respectively, and ‖·‖_1 denotes L1 norm regularization.
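The graph construction of claims 1-3 can be illustrated with a minimal NumPy sketch. This is not the patented implementation: the random hidden states and the projection matrices W_q and W_k stand in for values that a real system would read directly from one layer of the LLM's forward pass, and only a single attention head is shown.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def build_weighted_digraph(H, W_q, W_k, causal=True):
    """Couple hidden states H (n x d) of one layer with a self-attention
    matrix to form a weighted directed graph: node features are the
    hidden states, and the adjacency entry a_ij is the attention value,
    i.e. the directed dependency weight from token i to token j."""
    Q = H @ W_q                       # query vectors
    K = H @ W_k                       # key vectors
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # scaled dot-product similarity
    if causal:                        # decoder LLMs attend only to the past,
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores[mask] = -np.inf        # so A is lower triangular
    A = softmax(scores, axis=-1)
    return H, A

# toy example: random "hidden states" standing in for a real LLM layer
rng = np.random.default_rng(0)
n, d, d_k = 5, 16, 8
H = rng.standard_normal((n, d))
W_q = rng.standard_normal((d, d_k))
W_k = rng.standard_normal((d, d_k))
X, A = build_weighted_digraph(H, W_q, W_k)
```

The causal mask makes A lower triangular, matching the claim that a_ij quantifies the dependency of token i on an earlier token j, and each row of A is a probability distribution over the tokens attended to.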
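The graph-level classification of claim 4 can be sketched as follows. This is an illustrative NumPy version, not the trained detector: the degree-normalized aggregation approximates the GraphConv-style update in the claim, and all parameter shapes (W1, W2, Wf, etc.) are assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def graph_conv(Hx, A, W, b):
    """One degree-normalized graph convolution: node i aggregates its
    in-neighbors j with weight a_ji / sqrt(D_ii * D_jj), then applies a
    shared linear map and ReLU, as in the claim's layer update."""
    deg = A.sum(axis=0) + 1e-9             # weighted in-degree D_ii
    norm = A / np.sqrt(np.outer(deg, deg)) # norm[j, i] = a_ji / sqrt(D_jj D_ii)
    return relu(norm.T @ Hx @ W + b)       # transpose aggregates edges j -> i

def classify_graph(Hx, A, params):
    """Stack two conv layers, pool token features into a graph-level
    vector by global averaging, and map to the two label probabilities
    (fact vs. hallucination) with a fully connected layer."""
    h = graph_conv(Hx, A, params["W1"], params["b1"])
    h = graph_conv(h, A, params["W2"], params["b2"])
    h_g = h.mean(axis=0)                   # global average pooling over N nodes
    logits = params["Wf"] @ h_g + params["bf"]
    e = np.exp(logits - logits.max())
    return e / e.sum()                     # softmax over {fact, hallucination}

# toy usage with random features and a lower-triangular weighted adjacency
rng = np.random.default_rng(1)
n, d, h = 6, 8, 16
Hx = rng.standard_normal((n, d))
A = np.tril(rng.random((n, n)))
params = {"W1": 0.1 * rng.standard_normal((d, h)), "b1": np.zeros(h),
          "W2": 0.1 * rng.standard_normal((h, h)), "b2": np.zeros(h),
          "Wf": 0.1 * rng.standard_normal((2, h)), "bf": np.zeros(2)}
p = classify_graph(Hx, A, params)
```

In the patented method these parameters would be trained with Adam against a two-class cross-entropy loss on the labeled fact/hallucination sample set; here they are random for illustration only.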
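The continuous mask relaxation of claim 5 can also be sketched. This shows only the forward computation of the masked subgraph and the objective function; in practice the mask parameters θ_v and θ_e would be optimized by gradient descent against the trained detector, which is omitted here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def masked_graph(X, A, theta_v, theta_e):
    """Relax discrete subgraph selection: sigmoid-normalized node scores
    m_v scale the features and edge scores m_e scale the adjacency, so
    the explanation subgraph G_s = (X_s, A_s) has weights in [0, 1]."""
    m_v = sigmoid(theta_v)         # node importance scores
    m_e = sigmoid(theta_e)         # edge importance scores
    X_s = m_v[:, None] * X         # element-wise scaling of node features
    A_s = m_e * A                  # masked adjacency matrix
    return X_s, A_s, m_v, m_e

def explanation_loss(p_orig, p_masked, m_v, m_e, lam_v=0.1, lam_e=0.1):
    """Objective of claim 5: cross-entropy between the detector's
    prediction on the original graph and on the masked subgraph, plus
    L1 sparsity penalties on the node and edge masks."""
    ce = -np.sum(p_orig * np.log(p_masked + 1e-12))
    return ce + lam_v * np.abs(m_v).sum() + lam_e * np.abs(m_e).sum()

# toy usage: random graph, random mask parameters, fixed toy predictions
rng = np.random.default_rng(2)
n, d = 4, 8
X = rng.standard_normal((n, d))
A = np.tril(rng.random((n, n)))
theta_v = rng.standard_normal(n)
theta_e = rng.standard_normal((n, n))
X_s, A_s, m_v, m_e = masked_graph(X, A, theta_v, theta_e)
p_orig = np.array([0.7, 0.3])      # detector output on G (illustrative)
p_masked = np.array([0.6, 0.4])    # detector output on G_s (illustrative)
loss = explanation_loss(p_orig, p_masked, m_v, m_e)
```

After optimization, thresholding m_v yields the token-level importance scores that the method reports as its interpretability analysis.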

Description

Large language model hallucination detection method based on graph neural network

Technical Field

The invention belongs to the technical field of classification detection, and particularly relates to a large language model hallucination detection method based on a graph neural network.

Background

Large language models show remarkable capability in intent understanding and natural language generation; they can integrate multi-source information to directly generate replies, sparing users a tedious information-integration process. These advantages have driven the wide application of large language models in fields such as information retrieval and network security. However, due to lack of knowledge, large language models also generate fluent replies that nevertheless contradict reality, a phenomenon known as large language model hallucination. Notably, this problem is particularly pronounced in vertical domains, since large language models often require fine-tuning on expert knowledge to improve downstream-task performance, while fine-tuning samples containing new knowledge further exacerbate the hallucination tendency. The hallucination behavior of large language models not only misleads user decisions but also poses a great threat in high-risk application scenarios. Therefore, to ensure the reliability of large language models, it is necessary to design a hallucination detection mechanism that actively alerts the user or system when generated content deviates from reality. For the hallucination phenomenon of large language models, various works have proposed detection methods, which can be roughly classified into black-box and white-box methods depending on whether the internal state of the large language model is accessed.
The black-box hallucination detection methods are post-hoc evaluations that mainly assess the authenticity of the content from the natural-language perspective after the reply has been generated. For example, the SELFCHECKGPT method queries the large language model multiple times with the same question, samples multiple replies, and then computes the consistency among them to determine the model's confidence in its reply to the question. However, this method cannot avoid an inherent defect of large language models, namely overconfidence, and the repeated sampling incurs a large computational overhead. The HaluCheck method realizes automated hallucination detection through three sub-processes: fact decomposition, retrieval augmentation, and result comparison. Although secondary verification against external resources largely guarantees the authenticity of the detection result, retrieving external resources for every reply inevitably causes unnecessary resource and time loss; in addition, updating and managing the external resource library requires periodic maintenance by industry experts. The white-box hallucination detection methods are real-time assessments requiring no external resources: by analyzing and learning the hidden parameter space of a large language model, an efficient hallucination detector is developed and designed. White-box methods can strategically decide when to call external resources while realizing real-time detection, thereby optimizing latency and computational overhead in practical applications. The Perplexity method is a high-precision hallucination detection method using the last-layer output logits of a large language model; it performs excellently in summarization and translation tasks but cannot evaluate or quantify the uncertainty of natural language.
The Semantic Entropy method incorporates a measure of the linguistic invariance produced by shared meaning, and demonstrates that semantic entropy can improve the accuracy of the prediction model. The SAPLMA method achieves better performance by extracting the internal representation of the question-answer pair and then training a fact classifier to determine whether the answer contains a hallucination. This indicates that the internal states contain hidden information about whether a large language model is lying. Compared with black-box approaches, white-box approaches use the internal hidden space of a large language model to discriminate in real time, and more efficiently, whether a reply is true or hallucinated. However, existing hidden-space-based methods use either the hidden states alone or the attention matrix alone, and thus decouple the interrelationship between internal states, thereby weakening the overall semantic-space characteristics. Considering that a large language model acquires semantic features in the pre-training stage, the connections and weight relations between tokens are also learned. Thus, the decoupling approach of existing work cannot refer to the correct pattern learned by the p