CN-122020175-A - Large language model safety evaluation data generation method based on semantic similarity variation
Abstract
The application relates to the technical field of artificial intelligence safety, in particular to a method for generating large language model safety evaluation data based on semantic similarity variation. The method comprises the steps of: collecting and preprocessing safety evaluation seed data; obtaining deep semantic representations through a Transformer model; performing knowledge-enhanced semantic analysis with an external knowledge base; executing multi-level semantic similarity mutation at the word, phrase, and sentence levels; realizing adversarial mutation with a generative adversarial network; enhancing semantic diversity with a variational autoencoder; performing multidimensional quality assessment and screening on the generated samples; and constructing and incrementally updating a safety evaluation data set. Through deep semantic analysis and a multi-level intelligent mutation strategy, the application realizes automated generation of large-scale, high-quality, and highly adversarial safety evaluation data, remarkably improving the coverage, effectiveness, and efficiency of large language model safety evaluation.
Inventors
- Gan Maozhao
Assignees
- 深圳艾钜思科技有限公司
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-02-04
Claims (8)
- 1. A method for generating safety evaluation data of a large language model based on semantic similarity variation, characterized by comprising the following steps: Step S1, collecting safety evaluation seed data and preprocessing the safety evaluation seed data to obtain a structured seed sample set; Step S2, based on the structured seed sample set obtained in step S1, adopting a pre-trained Transformer model to encode the sample texts in the structured seed sample set, first obtaining a contextualized word vector representation of each word, and then generating a sentence-level deep semantic representation through a CLS-token representation, average pooling, or attention-weighted pooling strategy; Step S3, based on the sentence-level deep semantic representation obtained in step S2, constructing an external knowledge base comprising a general knowledge graph, a domain-specific knowledge base, a synonym or paraphrase dictionary, and an entity relation database, linking and expanding the entities and concepts in the text through the external knowledge base, and performing weighted fusion of the expanded entity semantic vectors and the original word vectors to generate a knowledge-enhanced semantic representation; Step S4, based on the knowledge-enhanced semantic representation obtained in step S3, performing multi-level semantic similarity mutation to generate a multi-level mutation sample pool; Step S5, constructing a generative adversarial network with a generator-discriminator framework based on the multi-level mutation sample pool obtained in step S4, wherein the generator takes the seed sample semantic representation and random noise as input and generates adversarial candidate samples through an encoder-mutation module-decoder structure; Step S6, constructing a variational autoencoder based on the multi-level mutation sample pool obtained in step S4 and the adversarial mutation samples obtained in step S5, mapping the multi-level mutation sample pool and the adversarial mutation samples into a latent-space distribution through the encoder, reconstructing them from latent-space samples through a decoder, introducing attack-type and threat-level condition variables to realize controlled generation, and generating diversity samples with smooth semantic transitions through latent-space linear interpolation or spherical interpolation; Step S7, based on the adversarial mutation samples obtained in step S5 and the diversity samples obtained in step S6, performing multidimensional quality assessment and screening to obtain high-quality mutation samples; Step S8, based on the high-quality mutation samples obtained in step S7 and the original seed samples of step S1, organizing a data set by attack type, threat level, target behavior, and difficulty level, recording the quantity proportion and quality distribution of seed samples and mutation samples, establishing an incremental updating mechanism that periodically collects new attack cases as seed samples and re-executes steps S1-S7 to generate new mutation samples, and performing version control on the data set to form a dynamically updated safety evaluation data set; and Step S9, based on the safety evaluation data set obtained in step S8, performing experimental verification on a large language model, calculating the attack success rate, detection rate, false alarm rate, and safety risk coverage, and verifying the quality, diversity, evaluation effectiveness, and generation efficiency of the safety evaluation data set through comparison experiments with existing data generation methods.
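The latent-space interpolation named in step S6 can be sketched as follows. This is an illustrative NumPy implementation of linear and spherical interpolation between two latent codes, not the patent's own code; the two-dimensional vectors are toy stand-ins for VAE latent samples.

```python
import numpy as np

def lerp(z1, z2, t):
    """Linear interpolation between two latent vectors."""
    return (1.0 - t) * z1 + t * z2

def slerp(z1, z2, t, eps=1e-8):
    """Spherical interpolation: follows the great circle between z1 and z2,
    which tends to stay in higher-density regions of a Gaussian latent space."""
    z1n = z1 / (np.linalg.norm(z1) + eps)
    z2n = z2 / (np.linalg.norm(z2) + eps)
    omega = np.arccos(np.clip(np.dot(z1n, z2n), -1.0, 1.0))
    if omega < eps:  # nearly parallel vectors: fall back to linear interpolation
        return lerp(z1, z2, t)
    so = np.sin(omega)
    return (np.sin((1.0 - t) * omega) / so) * z1 + (np.sin(t * omega) / so) * z2

# Interpolate between the latent codes of two seed samples to obtain
# semantically smooth intermediate samples (here: 3 interior points).
z_a, z_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
path = [slerp(z_a, z_b, t) for t in (0.25, 0.5, 0.75)]
```

Decoding each interpolated code through the VAE decoder would then yield the "diversity samples of semantic smooth transition" described in step S6.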
- 2. The method of claim 1, wherein the seed data comprises public safety test data sets, libraries of known attack cases, attack samples from security research literature, and attack patterns found in actual security audits; and the preprocessing comprises data cleaning, attribute labeling, word segmentation, part-of-speech tagging, and dependency parsing.
- 3. The method according to claim 1, wherein the multi-level semantic similarity mutation in step S4 comprises: at the word level, performing candidate word replacement through semantic similarity screening, word-vector-space interpolation, and adversarial word replacement to generate word variants; based on the word mutation results, rewriting phrases through a Seq2Seq model and recombining clauses based on dependency relations to generate phrase variants or clause variants; and based on the phrase variants or clause variants, generating sentence variants through multilingual back-translation, syntactic transformation, and a semantic rewriting model, forming the multi-level mutation sample pool.
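As an illustration of the word-level mutation in claim 3, the following sketch screens replacement candidates by embedding cosine similarity and interpolates in word-vector space. The embedding table, the words, and the 0.95 threshold are all invented for this example; in practice the vectors would come from a pre-trained model.

```python
import numpy as np

# Toy embedding table standing in for pre-trained word vectors.
EMB = {
    "attack":  np.array([0.90, 0.10, 0.00]),
    "assault": np.array([0.85, 0.15, 0.05]),
    "strike":  np.array([0.80, 0.20, 0.10]),
    "banana":  np.array([0.00, 0.10, 0.95]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def candidate_replacements(word, threshold=0.95):
    """Semantic similarity screening: keep only words whose embedding
    cosine similarity to `word` exceeds the threshold."""
    v = EMB[word]
    return [w for w, u in EMB.items() if w != word and cosine(v, u) >= threshold]

def interpolate_embedding(w1, w2, alpha=0.5):
    """Word-vector-space interpolation between two words; the result can be
    mapped back to the nearest vocabulary word to produce a variant."""
    return alpha * EMB[w1] + (1 - alpha) * EMB[w2]
```

Here `candidate_replacements("attack")` keeps "assault" and "strike" but rejects the semantically distant "banana", mirroring the similarity screening described in the claim.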
- 4. The method according to claim 3, wherein step S7 comprises: calculating the semantic similarity of the adversarial mutation samples and the diversity samples to the seed samples; screening grammatically fluent samples from the adversarial mutation samples and the diversity samples through language-model perplexity; testing the attack success rate of the adversarial mutation samples and the diversity samples on a target language model, and calculating the internal diversity of the adversarial mutation sample and diversity sample set; and calculating a comprehensive quality score for the adversarial mutation samples and the diversity samples based on the semantic similarity, the grammatical fluency screening result, the attack success rate, and the within-set diversity, and screening to obtain high-quality mutation samples.
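The composite scoring and screening of claim 4 can be sketched as a weighted sum over the four assessment dimensions. The weights and the 0.6 threshold below are illustrative assumptions (the patent does not fix them), and all four metrics are assumed to be pre-normalized to [0, 1] (e.g. fluency derived from inverse perplexity).

```python
def quality_score(sem_sim, fluency, attack_sr, diversity,
                  weights=(0.3, 0.2, 0.3, 0.2)):
    """Weighted composite quality score over the four claim-4 dimensions:
    semantic similarity, grammatical fluency, attack success rate, and
    within-set diversity. Weights here are illustrative, not from the patent."""
    w1, w2, w3, w4 = weights
    return w1 * sem_sim + w2 * fluency + w3 * attack_sr + w4 * diversity

def screen(samples, threshold=0.6):
    """Keep only samples whose composite score clears the threshold."""
    return [s for s in samples if quality_score(**s["metrics"]) >= threshold]

pool = [
    {"text": "variant-a",
     "metrics": dict(sem_sim=0.9, fluency=0.8, attack_sr=0.7, diversity=0.6)},
    {"text": "variant-b",
     "metrics": dict(sem_sim=0.4, fluency=0.3, attack_sr=0.2, diversity=0.5)},
]
kept = screen(pool)  # only "variant-a" survives screening
```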
- 5. The method of claim 1, wherein the Transformer model in step S2 includes BERT, RoBERTa, and XLM-RoBERTa, and the strategy for generating the sentence-level deep semantic representation matches the selected model.
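The three sentence-representation strategies named in step S2 (CLS-token representation, average pooling, attention-weighted pooling) all reduce a (tokens × dim) matrix of contextual token vectors to a single sentence vector. A minimal NumPy sketch, with toy token vectors standing in for real Transformer outputs:

```python
import numpy as np

def sentence_embedding(token_vecs, strategy="mean", attn_scores=None):
    """Collapse per-token contextual vectors of shape (T, d) into one
    sentence vector, using one of the three claim-2 pooling strategies.
    `attn_scores` (shape (T,)) is only needed for attention-weighted pooling."""
    if strategy == "cls":
        # Take the first token's vector (the [CLS] position in BERT-style models).
        return token_vecs[0]
    if strategy == "mean":
        # Average over all token positions.
        return token_vecs.mean(axis=0)
    if strategy == "attn":
        # Softmax-normalize the scores, then take the weighted sum of tokens.
        w = np.exp(attn_scores - attn_scores.max())
        w = w / w.sum()
        return (w[:, None] * token_vecs).sum(axis=0)
    raise ValueError(f"unknown strategy: {strategy}")

# Toy input: 3 tokens, embedding dimension 2.
toks = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

With uniform attention scores, attention-weighted pooling reduces to average pooling, which is why the claim treats the strategies as interchangeable alternatives matched to the chosen model.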
- 6. The method according to claim 1, wherein the generative adversarial network in step S5 is trained with a generator loss of the form L_G = λ₁·L_adv + λ₂·L_sem + λ₃·L_att; wherein L_adv is the adversarial loss for distinguishing real samples from generated samples, L_sem is the MSE loss between the semantic representations of the generated samples and the seed samples, and L_att is the gap loss between the attack success rate of the generated samples and the target attack success rate; and the values of λ₁, λ₂, and λ₃ are dynamically adjusted based on the semantic similarity threshold and the attack success rate threshold of step S7, with higher thresholds yielding larger corresponding weights.
- 7. The method according to claim 1, characterized in that the method further comprises: based on the experimental verification results of step S9, obtaining a protection capability score of the target language model, and dynamically adjusting the mutation intensity of step S4, the number of adversarial training iterations of step S5, and the latent-space sampling range of step S6, so as to match the mutation difficulty to the protection capability of the model.
- 8. The method according to claim 1, characterized in that the method further comprises: expanding the sentence variants of step S4 and the diversity samples of step S6 to generate multimodal security evaluation data; wherein the expansion comprises associating text samples with image samples, generating malicious images under text semantic guidance, adding imperceptible adversarial perturbations to the images, calculating the cross-modal semantic similarity of the text and the images, and forming text-image joint attack samples.
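The cross-modal semantic similarity mentioned in claim 8 is commonly computed as the cosine similarity between a text embedding and an image embedding in a shared space (e.g. produced by a CLIP-style dual encoder). A minimal sketch, with the shared-space projection assumed to have already been applied:

```python
import numpy as np

def cross_modal_similarity(text_emb, image_emb):
    """Cosine similarity between a text embedding and an image embedding
    that already live in the same shared space; the encoders that produce
    these embeddings are out of scope for this sketch."""
    t = text_emb / np.linalg.norm(text_emb)
    i = image_emb / np.linalg.norm(image_emb)
    return float(np.dot(t, i))
```

A text-image pair with high similarity but an adversarially perturbed image would then constitute the "text-image joint attack sample" described in the claim.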
Description
Large language model safety evaluation data generation method based on semantic similarity variation

Technical Field

The invention relates to the technical field of artificial intelligence safety, in particular to a method for generating large language model safety evaluation data based on semantic similarity variation.

Background

With the wide application of large language models across industries, their safety problems have become increasingly prominent. According to 2024 statistics, the number of globally deployed AI models exceeds 10 million, yet 99.9% of deep learning models are vulnerable to attack, 73% of enterprise AI systems have unpatched security vulnerabilities, more than 1,000 adversarial attacks against AI systems occur daily, and the average loss from a single attack reaches $1.5 million. Major security threats faced by current large language models include prompt injection attacks, adversarial example attacks, model jailbreaking, sensitive information leakage, and hallucination generation, and the OWASP LLM Top 10 explicitly lists the ten major security risks faced by large models.
The safety evaluation methods in the prior art have the following defects. First, test data are insufficient: public safety test data sets are limited in scale; for example, the AdvBench data set contains only about 500 harmful-behavior samples and the HarmBench data set about 300 standardized test cases, which cannot meet the needs of comprehensive safety evaluation. Second, diversity is poor: manually constructed test samples are limited to specific patterns and lack deep semantic-level variation, so test coverage is narrow and samples are easily identified and filtered by a model's safety protection mechanism. Third, generation efficiency is low: traditional manual construction or simple rule-based transformation is time-consuming, with an expert taking 15-30 minutes on average to manually construct one high-quality adversarial sample. Fourth, semantic retention is weak: existing data augmentation methods such as simple synonym replacement and back-translation struggle to accurately preserve the semantics of the original attack intent, so variants easily lose their attack effect or drift semantically. Fifth, adversariality is insufficient: the generated test data lack enough adversarial strength to effectively evaluate a model's ability to withstand complex attacks.
The data augmentation techniques proposed so far in academia and industry, such as rule-based transformation, translation-based methods, and language-model-based rewriting, still have limitations: rule-based methods depend on manually designed rules, are inflexible, and struggle to capture complex semantic relationships; translation-based methods easily introduce translation errors, have limited semantic retention, and are computationally expensive; and language-model-based rewriting lacks a deep understanding of security attack semantics, making the attack effectiveness of the generated samples unstable. Therefore, a technical scheme that can automatically and efficiently generate large-scale, high-quality, and semantically diverse safety evaluation data is urgently needed to improve the comprehensiveness and effectiveness of large language model safety evaluation.

Disclosure of Invention

Therefore, the invention aims to provide a method for generating large language model safety evaluation data based on semantic similarity variation, so as to overcome the above problems in the prior art.
In order to achieve the above purpose, the invention adopts the following technical scheme. The application provides a method for generating large language model safety evaluation data based on semantic similarity variation, comprising the following steps: Step S1, collecting safety evaluation seed data and preprocessing the safety evaluation seed data to obtain a structured seed sample set; Step S2, based on the structured seed sample set obtained in step S1, adopting a pre-trained Transformer model to encode the sample texts in the structured seed sample set, first obtaining a contextualized word vector representation of each word, and then generating a sentence-level deep semantic representation through a CLS-token representation, average pooling, or attention-weighted pooling strategy; Step S3, based on the sentence-level deep semantic representation obtained in step S2, constructing an external knowledge base comprising a general knowledge graph, a domain-specific knowledge base, a synonym or paraphrase dictionary, and an entity relation database, linking and expanding the entities and concepts in the text through the external knowledge base, and performing weighted fusion of the expanded entity semantic vectors and the original word vectors to generate a knowledge-