
CN-121998026-A - Emoji-perturbation-based method for generating and optimizing black-box adversarial examples against large language models

CN 121998026 A

Abstract

The invention belongs to the field of adversarial attack techniques in natural language processing, and relates to an emoji-perturbation-based method for generating and optimizing black-box adversarial examples against large language models. The method constructs a perturbation representation based on emoji insertion: using a mechanism that combines continuous encoding with discrete insertion, the insertion positions and emoji types are modeled as a combinatorial black-box optimization problem that can be solved in a continuous space. The decoded insertion instructions are applied to the original text, and the model input is built by combining the perturbed text with a pre-designed prompt template, thereby generating adversarial text that interferes with the target large language model. On this basis, a particle swarm optimization algorithm is introduced as the global search strategy, the search process is guided by an elite-solution retention mechanism, and adversarial examples that induce the model to make mispredictions are obtained through fitness evaluation and iterative optimization. The method can be used to evaluate the robustness and safety of large language models.

Inventors

  • WANG YONGKAI
  • HOU YAQING

Assignees

  • Dalian University of Technology (大连理工大学)

Dates

Publication Date
2026-05-08
Application Date
2026-01-20

Claims (5)

  1. An emoji-perturbation-based method for generating and optimizing black-box adversarial examples against large language models, characterized by comprising the following steps: Step (1): use the large language model under attack to perform initial inference on an input sentence, obtain its original predicted class label, and confirm that the prediction is correct, so as to ensure that the sentence is a valid attack target; Step (2): according to a preset emoji set and the text length, construct a continuous encoding vector describing the insertion positions and emoji types, and design corresponding decoding rules that map the continuous vector into executable "emoji-position" insertion combinations; Step (3): according to the "emoji-insertion position" instruction combination obtained in step (2), insert the corresponding emojis at the designated positions of the original text to construct a candidate adversarial text; Step (4): update and iteratively search the perturbation encoding vector with a particle swarm optimization algorithm, continuously adjusting the insertion positions and emoji types to find an adversarial example that induces the large language model to mispredict on the input text, thereby realizing an effective attack on the target model.
  2. The emoji-perturbation-based black-box adversarial example generation and optimization method for large language models according to claim 1, wherein step (1) comprises the following steps: (1.1) designing a prompt template for the sentiment-classification or logical-reasoning task according to the input-format requirements of the target large language model; (1.2) denoting the original input sentence as x and its ground-truth label as y, embedding x into the prompt template to form the final model input, and feeding the final text into the large language model under attack; the model outputs a set of candidate class labels and the corresponding class probability distribution; the outputs are sorted by probability from high to low, and the label with the highest probability is taken as the model's final predicted label ŷ; (1.3) if and only if the predicted label is consistent with the ground-truth label, i.e. ŷ = y, the input sentence x is deemed a valid target sentence for adversarial-example construction and enters the subsequent adversarial-example generation process; if the condition is not satisfied, the sentence is discarded and a new input sample is selected.
  3. The emoji-perturbation-based black-box adversarial example generation and optimization method for large language models according to claim 1, wherein step (2) comprises the following steps: (2.1) determining the number of insertions from the sentence length and a predefined insertion ratio as m = ⌈n·ρ⌉, where n is the number of tokens in the input sentence, ρ is the emoji insertion ratio relative to the sentence length, and ⌈·⌉ denotes rounding up; (2.2) constructing a continuous solution vector v of length 2m and dividing the coding space accordingly: the vector consists of two equal-length sub-vectors, v = [p₁,…,p_m, e₁,…,e_m], where the first part, the position-coding component p, encodes the insertion positions of the emojis, and the second part, the emoji-coding component e, encodes the emoji types; (2.3) decoding the continuous vector into a discrete set of "insertion instructions", each emoji insertion operation being described jointly by a pair of components (p_i, e_i), where p_i is the i-th component of the position-coding sub-vector and e_i is the i-th component of the emoji-coding sub-vector; the actual insertion position and the corresponding emoji are computed as follows: the insertion position is pos_i = ⌊p_i · n⌋, and the emoji type is emo_i = T[⌊e_i · K⌋], where ⌊·⌋ denotes rounding down, T is a lookup table mapping emoji indices to specific emojis, and K is the length of the emoji index table; this finally yields a set of insertion instructions {(pos_i, emo_i)}, i = 1,…,m, where m is the number of elements in the instruction set.
  4. The emoji-perturbation-based black-box adversarial example generation and optimization method for large language models according to claim 1, wherein step (3) comprises the following steps: (3.1) according to the insertion instruction set obtained in step (2), injecting each selected emoji into the original text sequence at its corresponding insertion position, generating the adversarial text x′, i.e. the perturbed input sentence; (3.2) filling the adversarial text x′ obtained in step (3.1) into the prompt template constructed in step (1.1), forming a complete prompt text that can be fed directly to the target large language model; (3.3) feeding the prompt text into the large language model under attack, obtaining the output, and extracting the model's predicted probability for the original label as in step (1.2); this probability is defined as the fitness value of the current adversarial example, measures how strongly the perturbation scheme affects the prediction, and serves as the fitness function in the subsequent particle swarm optimization.
  5. The emoji-perturbation-based black-box adversarial example generation and optimization method for large language models according to claim 1, wherein step (4) comprises the following steps: (4.1) randomly initializing a population: constructing an initial population of N solution vectors {v₁,…,v_N}, where N is the population size and each solution vector is generated by random real-number encoding; (4.2) executing the particle swarm algorithm to obtain its optimal solution: (4.2.1) computing the fitness value of each particle: (4.2.1.1) first decoding the solution vector of the particle into an adversarial sentence sample according to steps (2) and (3); (4.2.1.2) feeding the adversarial text x′ into the large language model under attack to obtain the model's probability values over all classes, and, combined with the ground-truth label y obtained in step (1), taking the model's probability for the true label under the adversarial sentence as the fitness value f of the current particle; the smaller the fitness value, the more the adversarial text weakens the model's confidence in the original class, i.e. the stronger the attack; (4.2.2) updating each particle's personal historical best position (pbest) and the population's global historical best position (gbest), with the specific update steps as follows: (4.2.2.1) in the first generation, each particle's pbest is initialized to its current position and the population's gbest to the best current position; otherwise, go to (4.2.2.2) and (4.2.2.3); (4.2.2.2) for each particle, if its fitness is better than that of its pbest, update pbest to the current position; otherwise do not update; (4.2.2.3) for each particle, if its fitness is better than that of gbest, update gbest to the current position; otherwise do not update; (4.2.3) sorting all particles in the swarm by the fitness values computed in step (4.2.1); (4.2.4) if a particle's fitness value ranks in the top 80%, performing a position update according to formula (5): in each dimension, the new position is determined from the particle's current position, the optimal position value recorded by its pbest over the historical iterations, and the position value of the swarm's gbest in that dimension; a random selection probability with value range 0-1 decides which guide is used, and the new position is sampled from a Gaussian distribution whose mean is the chosen guide value, with a given variance; (4.2.5) if the particle's fitness value is not in the top 80%, performing an elite-solution-guided structured position update on the particle: based on the fitness ranking of the current population, the particles with the best personal fitness form an elite solution set, one elite particle is randomly selected from this set as a teacher solution, and a segment-level crossover learning operation is performed on the particle to be updated, taking the paired "position-emoji" coding elements as the basic update unit; letting the particle's solution vector in generation t consist of the position-coding sub-vector and the emoji-coding sub-vector, with the teacher particle's solution vector partitioned correspondingly, the pairwise segment-based crossover rule compares a random number uniformly distributed on [0,1] against the crossover probability p_c and, when crossover occurs, selects a random continuous segment in the insertion-index dimension; for each selected segment, the "position-emoji" pair elements of the particle within that segment are copied simultaneously from the teacher particle into the corresponding solution vector of the particle being updated, ensuring structural consistency between the insertion-position ordering information and the corresponding emoji types; (4.2.6) if a termination condition is reached, outputting the optimal solution of the particle swarm algorithm; the termination conditions include reaching the maximum number of iterations or finding an adversarial text that changes the target model's predicted label, i.e. a successful attack; if no termination condition is met, steps (4.2.1)-(4.2.5) are executed repeatedly.
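As an illustration of the target-screening step (1) described in claim 2, the sketch below keeps a sentence as an attack target only when the model's top-1 label on the templated input matches the ground truth. The `predict_proba` callable and the prompt template are hypothetical stand-ins for the black-box LLM query, not part of the patent.

```python
def screen_sample(sentence, true_label, predict_proba,
                  template="Review: {} Sentiment:"):
    """Step (1) screening: keep a sentence as an attack target only if
    the model's top-1 label on the templated input matches the ground
    truth. `predict_proba` and the template are illustrative stand-ins
    for the black-box LLM query described in claim 2."""
    probs = predict_proba(template.format(sentence))  # label -> probability
    pred = max(probs, key=probs.get)                  # highest-probability label
    return pred == true_label

# Toy stand-in model that always leans "positive"
toy = lambda text: {"positive": 0.7, "negative": 0.3}
print(screen_sample("the film was great", "positive", toy))  # True: kept
print(screen_sample("the film was great", "negative", toy))  # False: discarded
```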
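The continuous-to-discrete decoding of step (2) in claim 3 can be sketched as follows. The floor-based mapping and the clamping of boundary values are assumptions made for illustration; the vector's first half encodes positions, the second half encodes emoji types.

```python
def decode_solution(vector, n_tokens, emoji_table):
    """Decode a continuous solution vector in [0, 1)^(2m) into a list of
    (position, emoji) insertion instructions, following the two-part
    encoding of claim 3. The exact floor mapping is an assumption."""
    m = len(vector) // 2        # number of emoji insertions
    positions = vector[:m]      # position-coding sub-vector p
    emojis = vector[m:]         # emoji-coding sub-vector e
    k = len(emoji_table)        # length K of the emoji lookup table
    instructions = []
    for p, e in zip(positions, emojis):
        pos = min(int(p * n_tokens), n_tokens)  # pos_i = floor(p_i * n), clamped
        idx = min(int(e * k), k - 1)            # index = floor(e_i * K), clamped
        instructions.append((pos, emoji_table[idx]))
    return instructions

# Example: 5 tokens, insertion ratio 0.4 -> m = ceil(5 * 0.4) = 2 insertions
table = ["😀", "😡", "🤔", "🙃"]
print(decode_solution([0.1, 0.9, 0.5, 0.99], n_tokens=5, emoji_table=table))
# → [(0, '🤔'), (4, '🙃')]
```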
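Steps (3.1)-(3.3) of claim 4, applying the decoded insertion instructions and scoring the perturbed text, might look like the following sketch. `predict_proba` again stands in for the black-box model query, and the toy model is invented purely for the example.

```python
def apply_insertions(tokens, instructions):
    """Insert emojis into a token sequence at the decoded positions
    (step 3.1). Positions refer to gaps in the ORIGINAL sequence, so we
    insert from the rightmost gap first to keep earlier indices valid."""
    out = list(tokens)
    for pos, emo in sorted(instructions, reverse=True):
        out.insert(pos, emo)
    return out

def fitness(tokens, instructions, true_label, predict_proba):
    """Fitness = model probability of the true label on the perturbed
    text (step 3.3); lower means a stronger attack. `predict_proba` is
    an assumed stand-in for the black-box LLM."""
    adv = " ".join(apply_insertions(tokens, instructions))
    return predict_proba(adv)[true_label]

# Toy stand-in "model": confidence in label 0 drops with each emoji seen
toy = lambda text: {0: max(0.0, 0.9 - 0.3 * sum(c in "😀😡" for c in text)), 1: 0.1}
print(apply_insertions(["the", "film", "was", "great"], [(1, "😀"), (4, "😡")]))
# → ['the', '😀', 'film', 'was', 'great', '😡']
print(fitness(["the", "film", "was", "great"], [(1, "😀"), (4, "😡")], 0, toy))
```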
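A minimal sketch of the particle swarm search of step (4) in claim 5, under stated assumptions: the top 80% of particles follow a conventional pbest/gbest velocity update with assumed coefficients (the patent's Gaussian-based formula (5) is not reproduced), while the bottom 20% copy a random contiguous segment from the personal best of a randomly chosen elite "teacher" particle. Fitness is minimised, matching the convention that lower true-label probability means a stronger attack.

```python
import random

def pso_attack(fitness_fn, dim, pop=20, iters=50, w=0.7, c1=1.5, c2=1.5,
               elite_frac=0.2, seed=0):
    """Minimise fitness_fn over [0, 1]^dim with PSO plus an elite-guided
    segment copy for the worst-ranked particles (a simplification of
    the structured update in step 4.2.5)."""
    rng = random.Random(seed)
    xs = [[rng.random() for _ in range(dim)] for _ in range(pop)]   # positions
    vs = [[0.0] * dim for _ in range(pop)]                          # velocities
    pbest = [x[:] for x in xs]                                      # personal bests
    pfit = [fitness_fn(x) for x in xs]                              # pbest fitness
    gi = min(range(pop), key=lambda i: pfit[i])
    gbest, gfit = pbest[gi][:], pfit[gi]                            # global best
    for _ in range(iters):
        order = sorted(range(pop), key=lambda i: pfit[i])           # rank by fitness
        elites = order[: max(1, int(elite_frac * pop))]             # elite solution set
        cut = int(0.8 * pop)
        for rank, i in enumerate(order):
            if rank < cut:  # top 80%: standard velocity/position update
                for d in range(dim):
                    vs[i][d] = (w * vs[i][d]
                                + c1 * rng.random() * (pbest[i][d] - xs[i][d])
                                + c2 * rng.random() * (gbest[d] - xs[i][d]))
                    xs[i][d] = min(1.0, max(0.0, xs[i][d] + vs[i][d]))
            else:  # bottom 20%: copy a random segment from an elite teacher
                teacher = pbest[rng.choice(elites)]
                a = rng.randrange(dim)
                b = rng.randrange(a, dim)
                xs[i][a:b + 1] = teacher[a:b + 1]
            f = fitness_fn(xs[i])
            if f < pfit[i]:                  # update pbest (step 4.2.2.2)
                pfit[i], pbest[i] = f, xs[i][:]
                if f < gfit:                 # update gbest (step 4.2.2.3)
                    gfit, gbest = f, xs[i][:]
    return gbest, gfit

# Usage: minimise squared distance to an (assumed) ideal perturbation vector
target = [0.2, 0.8, 0.5, 0.5]
best, best_fit = pso_attack(lambda v: sum((a - b) ** 2 for a, b in zip(v, target)),
                            dim=4)
print(round(best_fit, 4))
```

In the actual attack, `fitness_fn` would decode the vector into insertion instructions, perturb the sentence, and query the target model for the true-label probability.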

Description

Emoji-perturbation-based method for generating and optimizing black-box adversarial examples against large language models

Technical Field

The invention belongs to the technical field of adversarial attacks in natural language processing, and relates to an emoji-perturbation-based method for generating and optimizing black-box adversarial examples against large language models.

Background

In recent years, large language models based on the Transformer architecture, such as BERT, ChatGPT, Llama, and Qwen, have developed rapidly in the field of natural language processing. By pre-training on massive corpora and fine-tuning on downstream tasks, these models continue to achieve leading performance in application scenarios such as sentiment analysis, machine translation, logical reasoning, and open-domain dialogue. As the technology advances, many traditional natural language processing tasks have gradually been unified under the text-generation paradigm, allowing large language models to surpass the original task-specific models on multiple tasks. Although large language models have strong language understanding and generation capabilities, their reliability and robustness still have shortcomings; in particular, in high-risk scenarios such as healthcare, finance, and industrial control, sensitivity to input perturbations can cause serious consequences. Therefore, researching adversarial attack methods against large language models and evaluating their defensive capabilities is of great significance for identifying model vulnerabilities and improving model safety. Adversarial attacks are a widespread and well-studied security threat in machine learning systems. By carefully constructing input perturbations, an attacker can induce a model to produce erroneous or harmful outputs, and such perturbations are often imperceptible or acceptable to humans.
Natural language adversarial attacks mainly fall into two broad categories: perturbation-based methods and prompt-based attacks. Perturbation-based attacks can act at different text levels. At the sentence level, an attacker changes the semantics by rewriting or replacing an entire sentence. At the word level, attack methods typically interfere with the model's discrimination through synonym substitution, keyword deletion, and similar operations. At the character level, an attacker makes tiny edits to individual characters, thereby disrupting the encoding process of the tokenizer or the model and causing recognition errors. Beyond these text-perturbation-based attacks, prompt-level attacks design special inputs to bypass the model's security constraints or to induce the model to generate content that violates the security policy at inference time; such risks are particularly pronounced in large language models that must strictly follow instructions. However, existing adversarial attack methods still focus mainly on perturbations at standard text-unit levels, such as character-, word-, or sentence-level substitution and editing, and pay relatively little attention to perturbations in non-standard symbol forms. Emojis are a special symbol form widely used in modern digital communication, commonly used to convey emotional or tonal information; their form of expression lies between character-level noise and symbols with definite semantic meaning. Because existing natural language models model such symbols only to a limited extent during the training phase, a model may exhibit instability when processing inputs that contain emojis.
In addition, emojis are typically encoded as specific Unicode symbols during model processing and participate in inference as ordinary input units, which increases the difficulty for existing character- or vocabulary-rule-based detection and filtering methods when handling such inputs. At the same time, emojis can appear in different positions without significantly affecting the syntactic structure or readability of the text, so existing defense mechanisms face greater uncertainty when coping with such non-standard symbol perturbations. Although emojis thus provide a carrier for constructing robust, natural-looking adversarial perturbations, searching for an effective emoji insertion scheme in a large, discrete insertion space remains difficult. Evolutionary computation (EC) can perform global optimization over discontinuous, non-differentiable spaces that are otherwise hard to solve, and offers higher applicability than gradient-based optimization methods. By maintaining multiple candidate solutions at the population level and updating them iteratively with a stochastic search strategy, evolutionary computation methods can effectively explore such search spaces.