CN-122020621-A - Large language model watermark embedding and detecting method based on attention head perception
Abstract
The invention discloses a large language model watermark embedding and detection method based on attention-head perception, and relates to the technical fields of natural language processing and digital watermarking. The method comprises four core steps: first, a 'pure-pollution-recovery' three-stage causal-intervention experiment screens the set of attention heads that play a key role in encoding factual information; second, a binary linear factuality detector is trained on this set to accurately distinguish factual tokens from non-factual tokens; third, during model inference a dynamic green list is generated for each non-factual token, the watermark is covertly embedded through a logits bias intervention, and the bias strength is adaptively adjusted with the factual density of the text; finally, in the detection stage the non-factual tokens are screened by repeating the embedding flow, and the watermark is accurately detected by counting the matching proportion and quantifying the degree of abnormality with a z-score. Through the attention-perception mechanism the invention avoids damaging the factual accuracy of the text, the dynamic green-list design improves the robustness and concealment of the watermark, and the method is applicable to scenarios such as copyright tracing and authenticity verification of text generated by large language models, with strong engineering feasibility.
Inventors
- LIU JIANYI
- SHEN WENSONG
- ZHANG RU
Assignees
- 北京邮电大学 (Beijing University of Posts and Telecommunications)
Dates
- Publication Date
- 20260512
- Application Date
- 20260123
Claims (5)
- 1. A large language model watermark embedding and detection method based on attention-head perception, characterized by comprising the following steps: A. screening factual key attention heads: constructing a factual-prompt dataset to guide the large language model to output verifiable factual answers, collecting the activation values of all attention heads of the model and the prediction probabilities by a 'pure-pollution-recovery' three-stage causal-intervention method, calculating the indirect effect of each attention head, and screening the factual key attention heads according to the indirect effect; B. training a factuality detector: extracting question-answer pairs from a question-answer dataset, splicing each question with its answer, extracting the activation values of the last token at the factual key attention heads, constructing a detection dataset, and training the factuality detector; C. watermark embedding: during model inference, capturing in real time the key-head activation values of each token before it is generated, inputting them into the factuality detector to obtain a factuality score, judging the token as factual or non-factual according to the score, generating a dynamic green list from the preceding contextual features for each non-factual token, and applying a bias intervention to the logits of the tokens in the dynamic green list to generate a target text with the watermark embedded; D. watermark detection: performing word segmentation on the text to be detected, feeding it into the original large language model with the random sampling module disabled, computing the activation values at the factual key attention heads, screening the non-factual tokens, generating a dynamic green list consistent with the embedding stage, counting the matching proportion at those positions, and detecting whether the text carries the watermark by quantifying the degree of abnormality with a z-score.
- 2. The method for embedding and detecting a large language model watermark based on attention-head perception as set forth in claim 1, wherein step A specifically includes: A1. constructing factual prompts that guide the model to output explicit, verifiable factual answers; A2. pure run: inputting a factual prompt into the uninterfered large language model, and during forward propagation collecting, through a hook mechanism, the activation values of all attention heads in the attention module of each layer of the model; the activation value of the i-th token of the input sequence at the h-th attention head of the l-th layer is denoted a_{l,h}^{(i)}, and the reference prediction probability p_clean with which the undisturbed model outputs the correct factual answer is recorded simultaneously; A3. pollution run: scrambling the embedding vectors of the tokens carrying the core factual information in the prompt by superimposing, after the embedding layer, random Gaussian noise with mean 0 and variance σ²; continuing forward propagation with the factual information disturbed, obtaining the activation value ã_{l,h}^{(i)} of each attention head in the disturbed state, and recording the prediction probability p_corrupt with which the model outputs the correct answer; A4. recovery run: keeping the disturbance of the factual information unchanged, and executing an independent recovery operation for each attention head on the tokens carrying the core factual information in the model: selecting the h-th attention head of the l-th layer as the target head, replacing its activation value in the disturbed state with its activation value from the pure run, keeping all other attention heads in the disturbed state, letting the model continue forward propagation after the activation replacement, and recording the prediction probability p_restore(l,h) with which the model outputs the correct factual answer; A5. calculating the indirect effect of the h-th attention head of the l-th layer from the prediction probabilities of the three phases A2, A3 and A4: IE(l,h) = p_restore(l,h) − p_corrupt, where p_clean is the reference prediction probability in the undisturbed state, p_corrupt is the prediction probability with the factual information disturbed, and p_restore(l,h) is the prediction probability after restoring only the target attention head; A6. key attention-head screening: repeating steps A1-A4 for a plurality of factual prompts in the factual-prompt dataset, statistically averaging the indirect effect of each attention head, and determining an attention head to be a factual key attention head when its average indirect effect is not less than a preset threshold.
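The three-stage intervention of steps A2-A6 can be sketched numerically on a toy stand-in model. This is a minimal illustration, not the patent's implementation: the linear "heads", the noise scale, and the 0.05 screening threshold are all assumptions chosen for demonstration; a real implementation would hook a transformer's attention outputs.

```python
import math, random

random.seed(0)

H = 4  # number of attention heads in the toy model (assumed)
W = [[random.gauss(0, 1) for _ in range(3)] for _ in range(H)]  # per-head readout weights

def head_activations(x):
    # toy "attention heads": each head is a fixed elementwise map of the input embedding
    return [[w_i * x_i for w_i, x_i in zip(W[h], x)] for h in range(H)]

def answer_prob(acts):
    # toy readout: probability of the correct factual answer
    s = sum(sum(a) for a in acts)
    return 1.0 / (1.0 + math.exp(-s))

fact_embedding = [1.0, 0.5, -0.3]                   # clean embedding of the fact token
noise = [random.gauss(0, 3.0) for _ in range(3)]    # A3: Gaussian noise, mean 0
corrupted = [x + n for x, n in zip(fact_embedding, noise)]

clean_acts = head_activations(fact_embedding)       # A2: pure run
corrupt_acts = head_activations(corrupted)          # A3: pollution run
p_clean = answer_prob(clean_acts)
p_corrupt = answer_prob(corrupt_acts)

# A4/A5: restore one head at a time and measure its indirect effect
IE = []
for h in range(H):
    restored = [clean_acts[i] if i == h else corrupt_acts[i] for i in range(H)]
    IE.append(answer_prob(restored) - p_corrupt)

# A6: heads whose indirect effect clears a preset threshold (0.05 is illustrative)
key_heads = [h for h in range(H) if IE[h] >= 0.05]
```

In a real setting the averaging of A6 would run over many prompts, and the replacement in A4 would be done with forward hooks that overwrite a single head's output mid-propagation.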
- 3. The method for embedding and detecting a large language model watermark based on attention-head perception as set forth in claim 1, wherein step B specifically includes: B1. screening high-quality question-answer pairs from public datasets and labeling each token with a binary factuality label y_i ∈ {0, 1}, where i is the token's index in the text sequence, y_i = 1 indicates that the i-th token is factual, and y_i = 0 indicates that the i-th token is non-factual; after splicing each question with its answer, extracting for each question-answer pair the activation values a_{l,h} in the factual key attention-head candidate set, where a_{l,h} denotes the activation value at the h-th candidate attention head of the l-th layer, l is the index of the layer where the attention head is located, and h is the index of the candidate attention head; B2. taking the activation value a_{l,h} as input, training a binary linear factuality detector whose output is p_{l,h} = σ(θ_{l,h}ᵀ a_{l,h}), where θ_{l,h} is the detector parameter vector corresponding to the h-th candidate attention head of the l-th layer; the training goal is to minimize the cross-entropy loss between the predicted and true factuality labels: L = −(1/N) Σ_{n=1}^{N} [ y_n log p_n + (1 − y_n) log(1 − p_n) ], where L is the cross-entropy loss value, N is the total number of samples in the detection dataset, and p_n is the detector's factuality prediction probability for the n-th sample, matched one-to-one with the label y_n; B3. after training is completed, performing L2 normalization on the detector parameters θ_{l,h} to obtain the factual projection vector v_{l,h} of the key head; the dot product of the projection vector v_{l,h} with a token's activation value is that token's factual projection score, and a 'factual key attention head – factual projection vector' lookup table is constructed as the basis for factuality judgment in the model inference stage.
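Steps B1-B3 amount to logistic regression on head activations followed by weight normalization. Below is a self-contained sketch on synthetic data; the dimensionality, class centers, learning rate, and iteration count are illustrative assumptions, not values from the patent.

```python
import math, random

random.seed(1)
D = 8  # activation dimensionality of one candidate attention head (assumed)

# B1: synthetic detection dataset — activations for factual (y=1) vs non-factual (y=0) tokens
def sample(y):
    center = 0.8 if y else -0.8
    return [random.gauss(center, 1.0) for _ in range(D)], y

data = [sample(i % 2) for i in range(200)]

# B2: train the binary linear detector by minimizing cross-entropy with gradient descent
theta = [0.0] * D
lr = 0.1
for _ in range(300):
    grad = [0.0] * D
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))
        for j in range(D):
            grad[j] += (p - y) * x[j]   # gradient of cross-entropy w.r.t. theta
    theta = [t - lr * g / len(data) for t, g in zip(theta, grad)]

# B3: L2-normalize the learned weights to get the factual projection vector
norm = math.sqrt(sum(t * t for t in theta))
v = [t / norm for t in theta]

def fact_score(activation):
    # dot product with the projection vector = the token's factual projection score
    return sum(vi * ai for vi, ai in zip(v, activation))
```

In the method proper, one such detector is trained per factual key attention head, and the (head, projection vector) pairs populate the lookup table used at inference time.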
- 4. The method for embedding and detecting a large language model watermark based on attention-head perception as set forth in claim 1, wherein step C specifically includes: C1. defining an activation projection state representation s_t = { v_{l,h}ᵀ a_{l,h}^{(&lt;t)} : (l,h) ∈ H* }, where H* is the factual key attention-head set of the generated text, a_{l,h}^{(&lt;t)} denotes the set of activation vectors of the history tokens at the attention-head set, and v_{l,h}ᵀ is the transpose of the factual projection vector corresponding to the h-th factual key attention head of the l-th layer; a preset hash function, seeded by the preceding contextual features, generates a dynamic candidate set G_t from the model vocabulary V satisfying |G_t| = γ|V|, where γ is a preset candidate-proportion parameter controlling the size ratio of the dynamic candidate set; C2. for a token position judged as non-factual, intervening in the generation of the large language model by offset-adjusting the original logits vector before it passes through softmax: for each token w in the dynamic green list G_t, the corresponding logits component receives the bias intervention l′_w = l_w + δ, where l_w is the original logits component corresponding to w, l′_w is the logits component after the bias adjustment, and δ is the bias strength parameter; the bias strength parameter δ is adaptively adjusted according to the fact density ρ of the currently generated text, the density ρ being defined as the ratio of the number of factual tokens in the text to the total number of tokens.
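The dynamic green list and bias intervention of step C can be sketched as follows. The vocabulary size, γ, the base bias, and in particular the `(1 − fact_density)` scaling rule are assumptions for illustration — the patent states only that δ adapts with fact density, not the exact functional form.

```python
import hashlib, random

VOCAB = 1000        # toy vocabulary size (assumed)
GAMMA = 0.25        # candidate-proportion parameter γ (assumed value)
DELTA_BASE = 2.0    # base bias strength (assumed value)

def green_list(context_ids, head_signature):
    # C1: hash the preamble context together with the key-head projection signature,
    # then use the digest to seed a deterministic sample of γ|V| vocabulary ids
    h = hashlib.sha256(repr((context_ids, head_signature)).encode()).digest()
    rng = random.Random(h)
    return set(rng.sample(range(VOCAB), int(GAMMA * VOCAB)))

def bias_logits(logits, green, fact_density):
    # C2: add δ to green-list logits for a non-factual position; the scaling
    # with (1 - fact_density) is an assumed adaptation rule
    delta = DELTA_BASE * (1.0 - fact_density)
    return [l + delta if i in green else l for i, l in enumerate(logits)]

context = [5, 42, 7]                       # hypothetical preceding token ids
g = green_list(context, head_signature=0.37)
logits = [0.0] * VOCAB
biased = bias_logits(logits, g, fact_density=0.4)
```

Because the list is derived deterministically from the context hash, the detector in step D can regenerate exactly the same G_t without access to any secret state beyond the hash function.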
- 5. The method for embedding and detecting a large language model watermark based on attention-head perception as set forth in claim 1, wherein step D specifically includes: D1. collecting at least M non-watermarked texts, generating dynamic green lists for them by the same flow as for the text to be detected, calculating the matching proportion R between the non-factual tokens and their corresponding dynamic green lists, and pre-computing, over the matching proportions of all non-watermarked texts, the statistical mean μ and standard deviation σ of the matching-proportion set; D2. non-factual token screening: after word segmentation, inputting the text to be detected into the original large language model, replaying the generation process with the random sampling module disabled, obtaining the activation features of each token at the factual key attention heads, and, based on the factuality detector of step B, performing a linear projection on the activation features of each token to obtain a factuality score; tokens whose score falls below the factuality decision threshold τ are judged non-factual, and their positions form the index set I; D3. for each position t in I, generating the corresponding dynamic green list G_t according to the rule of step C from the activation projections of the preceding tokens at the factual key attention heads, and calculating the matching proportion R = (1/|I|) Σ_{t∈I} 1[x_t ∈ G_t], where 1[·] is the indicator function characterizing whether the generated token x_t falls into its corresponding dynamic green list; D4. calculating the z-score z = (R − μ)/σ of the text to be detected to quantify the degree of abnormality, and setting the z-score critical value z_α corresponding to a preset significance level α; when the z-score is greater than z_α, the text to be detected is judged to be a text containing the watermark.
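The detection statistic of steps D3-D4 reduces to a matching ratio compared against a pre-computed non-watermark baseline. A minimal sketch, assuming the baseline mean sits near γ and using an illustrative standard deviation and the standard one-sided critical value for α = 0.05:

```python
GAMMA = 0.25   # green-list proportion (assumed, must match the embedding stage)

def matching_ratio(tokens, green_lists):
    # D3: fraction of screened non-factual tokens that fall in their dynamic green list
    hits = sum(1 for t, g in zip(tokens, green_lists) if t in g)
    return hits / len(tokens)

def is_watermarked(R, mu, sigma, z_alpha=1.645):
    # D4: z-score of the matching ratio against the non-watermark baseline from D1
    z = (R - mu) / sigma
    return z > z_alpha, z

# toy baseline (D1): non-watermarked ratios hover around GAMMA
mu, sigma = GAMMA, 0.05

r = matching_ratio([1, 2, 3], [{1}, {5}, {3}])        # 2 of 3 tokens hit their list
flag_wm, z_wm = is_watermarked(R=0.60, mu=mu, sigma=sigma)      # strongly elevated R
flag_plain, z_plain = is_watermarked(R=0.26, mu=mu, sigma=sigma)  # near baseline
```

The key robustness property is that both the non-factual screening and the green-list generation are replayed identically at detection time, so paraphrasing that leaves the non-factual positions intact leaves the statistic largely intact as well.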
Description
Large language model watermark embedding and detecting method based on attention head perception
Technical Field
The invention relates to the technical fields of natural language processing and digital watermarking, and in particular to a large language model watermark embedding and detection method based on attention-head perception.
Background
With the rapid development of large language model technology, generated text is widely used in fields such as news writing, academic research, intelligent customer service and creative design, greatly improving content-production efficiency. However, the easy replicability, anonymity and rapid propagation of text generated by large language models also bring a series of serious problems: pirates profit from unauthorized use of generated text, false information misleads the public through the diffusion of generated text, and the source of generated text is difficult to trace, making responsibility hard to assign. To solve these problems, large language model watermarking technology has emerged; its core aim is to embed a hidden, verifiable mark into the text without affecting the quality of the generated text, thereby enabling source authentication and tracking of the generated text. Traditional text watermarking methods mostly rely on adjusting the occurrence frequency or word order of specific words, or on adding redundant characters; they have poor robustness, are easily defeated by attacks such as text editing, restatement and translation, and have insufficient concealment, making them easy to identify manually. In recent years, deep-learning-based large language model watermarking has gradually become a research hotspot; most such methods intervene in the logits vector of the model output so that the token sequence is generated in a specific pattern, thereby embedding the watermark.
However, existing deep-learning-based methods have several defects. First, they lack accurate perception of text semantics and cannot distinguish factual tokens from non-factual tokens, so during watermark embedding they may interfere with tokens carrying core facts and damage the factual accuracy of the text. Second, the watermark embedding pattern is fixed and depends on a preset static candidate set, so it is easily broken by targeted attacks and its robustness is insufficient. Third, the detection process does not fully reproduce the core logic of the embedding stage, so the ability to resist semantic attacks such as text editing and restatement is weak, and detection accuracy needs improvement. The attention mechanism of a large language model is the core of its understanding of text semantics: different attention heads carry out different semantic-encoding tasks, and some attention heads play a key role in encoding textual factual information. However, existing watermarking techniques do not fully exploit the semantic perception capability of the attention mechanism and cannot accurately locate the non-factual token regions suitable for intervention, making it difficult to reconcile watermark embedding with text-quality assurance. Therefore, how to use the attention mechanism to accurately distinguish factual from non-factual tokens, and thereby realize deep, concealed and robust watermark embedding and detection, remains a technical problem to be solved in the field of large language model watermarking.
Disclosure of the Invention
The technical scheme adopted by the invention is a large language model watermark embedding and detection method based on attention-head perception, comprising the following steps:
A. screening factual key attention heads: constructing a factual-prompt dataset to guide the large language model to output verifiable factual answers, collecting the activation values of all attention heads of the model and the prediction probabilities by a 'pure-pollution-recovery' three-stage causal-intervention method, calculating the indirect effect of each attention head, and screening the factual key attention heads according to the indirect effect; B. training a factuality detector: extracting question-answer pairs from a question-answer dataset, splicing each question with its answer, extracting the activation values of the last token at the factual key attention heads, constructing a detection dataset, and training the factuality detector; C. watermark embedding: during model inference, capturing in real time the key-head activation values of each token before it is generated, inputting them into the factuality detector to obtain a factuality score, judging the token as factual or non-factual according to the score, generating a dynamic green list from the preceding contextual features for each non-factual token, and applying a bias intervention to the logits of the tokens in the dynamic green list to generate a target text with the watermark embedded.