
CN-121997374-A - Semantic embedding space-oriented differential privacy defense method, device, equipment and storage medium

CN121997374A

Abstract

The application provides a differential privacy defense method, device, equipment and storage medium oriented to the semantic embedding space, relating to the technical field of artificial intelligence security and privacy protection. The method comprises: obtaining an original text and cleaning it; encoding the cleaned text into a semantic vector and applying L2 normalization; calculating the cosine similarity between the normalized semantic vector and each harmful semantic embedding vector in a predefined harmful seed library to determine a risk tag; dynamically adjusting the perturbation strength according to the risk tag, performing L2 norm clipping on the normalized semantic vector, and superimposing isotropic Gaussian noise satisfying the differential privacy requirement; and, based on a preset pool of compliant candidate texts, calculating the cosine similarity between the perturbed semantic vector and each candidate text's semantic embedding as a utility value and probabilistically sampling a candidate text via an exponential mechanism to obtain a final safe response text. The application can effectively resist parameter-independent attacks such as jailbreak attacks, prompt injection, and membership inference.

Inventors

  • LI JUN
  • WANG TONG
  • MA GUANGCHI
  • CHEN FAQUAN
  • WU ZENAN
  • LI MINGJIE
  • LIU JIZHAO
  • ZHANG SHUQIN
  • SHAN FANGFANG

Assignees

  • 中原工学院 (Zhongyuan University of Technology)

Dates

Publication Date
2026-05-08
Application Date
2026-01-28

Claims (10)

  1. A semantic embedding space-oriented differential privacy defense method, the method comprising: acquiring an original text from a large language model or a user input interface, and sequentially performing system-mark removal, blank-content filtering, text normalization, and special-character filtering to obtain a cleaned text; encoding the cleaned text into a semantic vector using a pre-trained sentence encoding model, and applying L2 normalization to obtain a normalized semantic vector; calculating the cosine similarity between the normalized semantic vector and each harmful semantic embedding vector in a predefined harmful seed library, and determining a risk tag by comparing the similarity against a preset threshold; dynamically adjusting the perturbation strength according to the risk tag, performing L2 norm clipping on the normalized semantic vector, and superimposing isotropic Gaussian noise satisfying the differential privacy requirement to obtain a perturbed semantic vector; and, based on a preset pool of compliant candidate texts, calculating the cosine similarity between the perturbed semantic vector and each candidate text's semantic embedding as a utility value, and probabilistically sampling a candidate text via an exponential mechanism to obtain a final safe response text.
  2. The semantic embedding space-oriented differential privacy defense method according to claim 1, wherein the text normalization operation comprises uniformly converting the text to lower case and merging consecutive spaces into a single space, and the special-character filtering operation uses regular expressions to remove HTML tags, encoded entities, and non-natural-language symbols.
  3. The semantic embedding space-oriented differential privacy defense method according to claim 1, wherein the pre-trained sentence encoding model is a Sentence-BERT model or a SimCSE model; if the model does not provide pooled output, average pooling is applied to the hidden states of all tokens in the model's last hidden layer to obtain the semantic vector, the average pooling formula being s = (1/n) · Σ_{t=1}^{n} h_t, where s is the semantic vector, n is the number of tokens, and h_t is the hidden state of the t-th token; the semantic vector is then L2-normalized by ŝ = s / ‖s‖₂, where ŝ is the normalized semantic vector.
  4. The semantic embedding space-oriented differential privacy defense method according to claim 1, wherein the cosine similarity between the normalized semantic vector and each harmful semantic embedding vector in the predefined harmful seed library is calculated by sim_i = ŝᵀ e_i, where ŝ is the normalized semantic vector, ᵀ denotes the matrix transpose operation, e_i is the i-th harmful semantic embedding vector in the predefined harmful seed library, and sim_i is their cosine similarity (both vectors being L2-normalized); if the maximum similarity max_i sim_i is greater than a preset threshold τ, the text is judged to be high-risk, and otherwise it is judged harmless.
  5. The semantic embedding space-oriented differential privacy defense method according to claim 1, wherein the perturbation strength is dynamically adjusted according to the risk tag, the normalized semantic vector is subjected to L2 norm clipping, and isotropic Gaussian noise conforming to the differential privacy requirement is superimposed to obtain the perturbed semantic vector as follows: the normalized semantic vector ŝ is clipped by v = ŝ / max(1, ‖ŝ‖₂ / C), where v is the clipped vector, C is the preset clipping threshold, and max(·) is the maximum function; isotropic Gaussian noise is added to the clipped vector: ṽ = v + z, z ~ N(0, σ² I_d), where ṽ is the perturbed semantic vector, z is the Gaussian noise vector drawn from a multivariate Gaussian distribution with mean 0 and covariance matrix σ² I_d, I_d is the d-dimensional identity matrix, and σ is the noise standard deviation; the noise standard deviation is calculated as σ = λ · Δ · √(2 ln(1.25/δ)) / ε, where Δ is the global sensitivity, δ is the relaxation term, ε is the privacy budget, and λ is the noise amplification factor.
  6. The semantic embedding space-oriented differential privacy defense method according to claim 1, wherein, based on the preset pool of compliant candidate texts, the cosine similarity between the perturbed semantic vector and each candidate text's semantic embedding is calculated as the utility value by u_i = ṽᵀ c_i / (‖ṽ‖₂ · ‖c_i‖₂), where ṽ is the perturbed semantic vector and c_i is the embedding of the i-th candidate sentence; probabilistically sampling the candidate texts via the exponential mechanism to obtain the final safe response text comprises: sampling candidate sentences according to the probability distribution P(i) = exp(ε · u_i / (2Δu)) / Σ_{j=1}^{m} exp(ε · u_j / (2Δu)), where P(i) is the probability of selecting candidate sentence i, exp(·) is the exponential function with the natural constant as its base, ε is the privacy budget, Δu is the utility function sensitivity, i is the index of a candidate sentence, and m is the total number of candidate sentences; and taking the sampled result as the final safe response text.
  7. The semantic embedding space-oriented differential privacy defense method according to claim 6, wherein a context prior is introduced to construct an extended utility function, and an extended utility value calculated from the extended utility function replaces the utility value, the extended utility function being u′_i = α · u_i + β · P_LM(c_i | context), where u′_i is the extended utility value, α and β are weight parameters, and P_LM(c_i | context) is the context conditional probability output by the language model.
  8. A semantic embedding space-oriented differential privacy defense apparatus, the apparatus comprising: a text preprocessing module configured to acquire an original text from a large language model or a user input interface and sequentially perform system-mark removal, blank-content filtering, text normalization, and special-character filtering to obtain a cleaned text; a semantic vector encoding module configured to encode the cleaned text into a semantic vector using a pre-trained sentence encoding model and apply L2 normalization to obtain a normalized semantic vector; a risk judgment and perturbation decision module configured to calculate the cosine similarity between the normalized semantic vector and each harmful semantic embedding vector in the predefined harmful seed library and determine a risk tag by comparing the similarity against a preset threshold; a differential privacy perturbation module configured to dynamically adjust the perturbation strength according to the risk tag, perform L2 norm clipping on the normalized semantic vector, and superimpose isotropic Gaussian noise satisfying the differential privacy requirement to obtain a perturbed semantic vector; and a privacy text generation module configured to calculate, based on a preset pool of compliant candidate texts, the cosine similarity between the perturbed semantic vector and each candidate text's semantic embedding as a utility value, and to probabilistically sample a candidate text via an exponential mechanism to obtain a final safe response text.
  9. An electronic device, comprising a processor and a memory communicatively coupled to the processor; the memory stores computer-executable instructions; and the processor executes the computer-executable instructions stored in the memory to implement the semantic embedding space-oriented differential privacy defense method of any one of claims 1-7.
  10. A computer-readable storage medium having computer-executable instructions stored therein which, when executed by a processor, implement the semantic embedding space-oriented differential privacy defense method of any one of claims 1-7.
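The cleaning pipeline of claims 1 and 2 (lowercasing, removal of HTML tags and encoded entities, special-character filtering, whitespace merging) might be sketched as follows; the specific regular expressions and the `clean_text` name are illustrative assumptions, not patterns from the application:

```python
import re

def clean_text(raw: str) -> str:
    """Illustrative text cleaning: lowercase, strip HTML tags and coded
    entities, filter non-natural-language symbols, merge whitespace."""
    text = raw.lower()                             # uniform lower case
    text = re.sub(r"<[^>]+>", " ", text)           # remove HTML tags
    text = re.sub(r"&[a-z]+;|&#\d+;", " ", text)   # remove encoded entities
    text = re.sub(r"[^\w\s.,!?'\"-]", " ", text)   # filter special characters
    return re.sub(r"\s+", " ", text).strip()       # merge consecutive spaces
```

For example, `clean_text("<p>Hello&nbsp;  WORLD!</p>")` yields a single normalized sentence in lower case.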
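The average pooling and L2 normalization of claim 3 can be sketched in a few lines, assuming the token hidden states arrive as an (n_tokens, d) array:

```python
import numpy as np

def mean_pool_and_normalize(hidden_states: np.ndarray) -> np.ndarray:
    """Average the token hidden states into one semantic vector s,
    then L2-normalize: s_hat = s / ||s||_2."""
    s = hidden_states.mean(axis=0)   # s = (1/n) * sum_t h_t
    return s / np.linalg.norm(s)     # unit-length semantic vector
```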
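A minimal sketch of the risk judgment in claim 4, assuming the harmful seed library is stored as a matrix of L2-normalized row vectors so that the dot product equals the cosine similarity (the function and label names are illustrative):

```python
import numpy as np

def risk_label(s_hat: np.ndarray, seed_bank: np.ndarray, tau: float) -> str:
    """Label the text high-risk when the normalized embedding s_hat is
    more similar than threshold tau to any harmful seed embedding."""
    sims = seed_bank @ s_hat   # sim_i = e_i^T s_hat for each seed row e_i
    return "high-risk" if sims.max() > tau else "harmless"
```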
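The clipping and Gaussian perturbation of claim 5 might look as follows. The sketch assumes the global sensitivity Δ equals the clipping threshold C, a common calibration choice that the claim does not state explicitly:

```python
import numpy as np

def perturb(s_hat, C, eps, delta, lam=1.0, rng=None):
    """Clip to L2 norm C, then add isotropic Gaussian noise calibrated
    to (eps, delta)-differential privacy; lam is the amplification factor.
    Assumption: global sensitivity is taken to be the clipping bound C."""
    rng = np.random.default_rng() if rng is None else rng
    v = s_hat / max(1.0, np.linalg.norm(s_hat) / C)          # L2 clipping
    sigma = lam * C * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return v + rng.normal(0.0, sigma, size=v.shape)          # v_tilde = v + z
```

Setting `lam=0.0` disables the noise, which makes the clipping step easy to verify in isolation.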
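The exponential-mechanism sampling of claim 6 can be sketched as below; the max-subtraction for numerical stability is an implementation detail added here, not part of the claim, and the function name is illustrative:

```python
import numpy as np

def sample_response(v_tilde, cand_embs, cand_texts, eps, du=1.0, rng=None):
    """Pick a safe response from a compliant candidate pool via the
    exponential mechanism; du is the utility-function sensitivity."""
    rng = np.random.default_rng() if rng is None else rng
    # u_i: cosine similarity between perturbed vector and each candidate
    u = cand_embs @ v_tilde / (
        np.linalg.norm(cand_embs, axis=1) * np.linalg.norm(v_tilde))
    logits = eps * u / (2.0 * du)
    p = np.exp(logits - logits.max())   # subtract max for stability
    p /= p.sum()                        # P(i) = exp(eps*u_i/2du) / sum_j ...
    return cand_texts[rng.choice(len(cand_texts), p=p)]
```

With a very large privacy budget the mechanism concentrates on the most similar candidate; smaller budgets spread probability mass more evenly.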
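The extended utility of claim 7 blends the embedding similarity with a language-model context prior; in this sketch the default weights alpha and beta are illustrative assumptions, and the prior is passed in as precomputed probabilities:

```python
import numpy as np

def extended_utility(u, p_ctx, alpha=0.7, beta=0.3):
    """u'_i = alpha * u_i + beta * P_LM(c_i | context); the alpha/beta
    defaults are illustrative, not values from the application."""
    return alpha * np.asarray(u, dtype=float) + beta * np.asarray(p_ctx, dtype=float)
```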

Description

Semantic embedding space-oriented differential privacy defense method, device, equipment and storage medium

Technical Field

The application relates to the technical field of artificial intelligence security and privacy protection, and in particular to a differential privacy defense method, device, equipment and storage medium for the semantic embedding space.

Background

With the wide application of large language models (Large Language Models, LLMs) in fields such as intelligent dialogue and content generation, the security and privacy risks they face have become increasingly prominent. An attacker can induce a model to generate illegal, harmful, or privacy-revealing information by carefully constructing prompts; typical attacks include jailbreak attacks (jailbreaking), prompt injection, membership inference attacks, and attribute inference attacks. To address these threats, existing defense techniques mainly focus on rule filtering, output rewriting, adversarial training, and differential privacy mechanisms. The existing technical schemes closest to the present application mainly comprise the following:

1. Output filtering based on keyword matching and rule systems. This method places a rule engine after the model output, scans the generated text against a predefined sensitive-word library, regular expressions, or grammar patterns, and intercepts or replaces the response once a match is detected. For example, Google's Perspective API and OpenAI's content moderation APIs both employ such mechanisms, jointly determining output safety through blacklist keywords (e.g., violence, hate speech) and semantic classification models.
Its structural characteristics are as follows: the scheme comprises three layers, namely (1) an input text receiving module, (2) a rule matching engine (comprising a keyword library, a regular expression set, and a classification model), and (3) a response interception or replacement module. The system executes the matching operation immediately after the text is generated and returns a preset safe response if a rule is triggered.

Defect and cause analysis: the core drawback of this approach is that it relies on a surface-form matching mechanism and therefore cannot cope with adversarial inputs that are semantically equivalent but lexically different. For example, an attacker may bypass keyword detection through synonym substitution ("explosive" → "C4"), spelling variation ("b0mb"), or linguistic obfuscation ("how to make something go boom"). The structural defects stem from: (1) a discrete rule space, since a keyword library cannot cover all semantically equivalent expressions and leaves large semantic blind spots; (2) a lack of context awareness, since rule systems have difficulty distinguishing, for instance, "discussing violence" from "describing a violent scenario"; (3) a passive response mechanism, since filtering only after generation cannot prevent harmful semantic paths from forming inside the model.

2. Differential privacy training in gradient space or at the embedding layer. This method introduces differential privacy into the model training process: Gaussian noise is added during gradient updates (Differentially Private Stochastic Gradient Descent, DP-SGD) to prevent the model from memorizing individual records in the training data, thereby resisting membership inference attacks. Representative systems include Google's DP framework and Meta's Opacus library.
The method comprises (1) a gradient clipping module that clips the L2 norm of each batch's per-sample gradients, and (2) a noise injection module that adds Gaussian noise to the clipped gradients; the overall flow relies on access to model parameters and the back-propagation mechanism. The fundamental limitation of this approach is that its scope is confined to the training phase and the parameter space, so it cannot be used directly to defend outputs at the inference stage. Its structural defects are: (1) strong deployment dependence, since it must be integrated at training time and cannot be deployed independently as a post-processing module; (2) a mismatched level of action, since the perturbation is applied to gradients or the word embedding layer rather than to the semantic representation of the final output, making the semantic direction of generated content difficult to control; (3) DP-SGD mainly prevents memorization leakage of training data and cannot stop an attacker from steering output semantics through prompts; (4) high performance cost, since model utility drops noticeably, retraining is required, and already-deployed models are difficult to adapt. 3. Response replacement systems based on semantic similarity. Some systems attempt to implement secure resp