CN-121996968-A - Large language model robustness visual diagnosis method, system and equipment based on multidimensional features and adversarial attacks


Abstract

The invention discloses a visual diagnosis method for large language model robustness based on multidimensional features and adversarial attacks. The method aims to move beyond traditional evaluation, which relies on a single aggregate metric. It constructs a multidimensional text feature system covering the lexical, syntactic, semantic and structural layers and combines it with a large-scale adversarial perturbation mechanism and a task-adaptive quantification strategy, generating structured feature / adversarial instruction / robustness diagnostic data for both the prompts and the corpus to be evaluated. On this basis, an interactive visual analysis system is constructed. Through bidirectional linkage between a feature statistics view and a semantic projection view, it supports progressive exploration from macroscopic feature screening to microscopic semantic attribution under the dual perspectives of prompts and corpus, so that the root causes of model weaknesses can be diagnosed in depth. The method can identify the key feature combinations that affect model stability, thereby providing a basis for targeted model optimization and increasing the diagnostic depth of robustness assessment.

Inventors

LI JIE
YANG SHUO

Assignees

Tianjin University (天津大学)

Dates

Publication Date: 20260508
Application Date: 20260128

Claims (10)

1. A large language model robustness visual diagnosis method based on multidimensional features and adversarial attacks, characterized by comprising the following steps:
Step 1, for the prompts and the corpus to be evaluated, respectively extracting high-dimensional feature vectors covering a lexical layer, a syntactic layer, a semantic layer and a structural layer by combining rule-based statistical algorithms with the semantic understanding capability of a large language model, thereby converting unstructured text into measurable feature data;
Step 2, introducing diversified text perturbation mechanisms that simulate the language noise and expression differences of real application scenarios, so as to perform a systematic robustness stress test on the large language model: constructing adversarial attackers that implement adversarial attack strategies for evaluating the anti-interference capability of the model, wherein the adversarial attack strategies comprise applying character-level swap perturbation, synonym replacement perturbation and random deletion perturbation to the original prompt instructions; providing one adversarial attacker for each perturbation, each adversarial attacker generating adversarial instructions that simulate the corresponding language noise; constructing a full-permutation cross input matrix arranged in the order of original prompt instruction, adversarial instruction and corpus; and inputting each group of samples in the matrix into the large language model under test to obtain paired outputs of the model under the original prompt instruction and under the perturbed state;
Step 3, selecting a suitable basic consistency metric function according to the downstream task type and calculating a basic consistency score; on this basis, further aggregating and calculating robustness scores at three levels: the sample level, the feature level and the overall model level;
Step 4, providing a user interface containing a feature statistics view that shows feature distributions and their association with robustness, a semantic projection view that shows semantic clusters and fragile clusters, and a dynamic control panel with which the user configures attack strategies, and mapping the high-dimensional data to visual encodings; and
Step 5, adopting forward feature screening, reverse semantic lasso attribution and a dynamic attack strategy recalculation mechanism to support bidirectional iterative interaction by the user between the feature space and the semantic space, completing a full diagnosis process from macroscopic anomaly discovery to microscopic root cause analysis.
2. The large language model robustness visual diagnosis method based on multidimensional features and adversarial attacks according to claim 1, wherein step 1 comprises the following steps:
Step 1-1, dividing the text features of the prompt texts and the corpus texts to be evaluated into four layers, namely a lexical layer, a syntactic layer, a semantic layer and a structural layer, wherein:
the lexical layer features reflect the selection, composition and statistical properties of the vocabulary in the text, affect robustness by influencing the tokenization accuracy and attention distribution of the model, and comprise vocabulary complexity, negation word density and length;
the syntactic layer features reflect the grammatical structure and logical relations within sentences, and comprise grammatical complexity, sentence pattern, modality and tense;
the semantic layer features reflect the deeper meaning and emotional coloring of the text, and comprise semantic vagueness, emotional intensity and ambiguity;
the structural layer features reflect the macroscopic organization and style of the text, and comprise the presence of a chain-of-thought structure, sentence-pattern diversity and example indicators;
Step 1-2, extracting the multidimensional features of the prompt texts and the corpus texts at the lexical, syntactic, semantic and structural layers as follows:
defining a general prompt engineering protocol for feature extraction, which configures a corresponding reasoning guidance strategy for each feature type of the prompt texts and the corpus texts;
for each feature extraction task, first assigning a role to the large language model and explicitly injecting the academic definition and judgment criteria of the feature into the instruction;
for continuous features that require deep understanding, namely semantic vagueness, emotional intensity and ambiguity, adopting a chain-of-thought reasoning strategy, that is, forcing the model to generate an intermediate reasoning process before outputting the final score, locating all text fragments that exhibit the target attribute in the original text and wrapping them in specific tags, and then performing arithmetic on the located fragments so as to convert qualitative text analysis into quantitative numerical output;
for discrete classification features, namely sentence-pattern distribution, style and example indicators, adopting a few-shot guidance strategy based on in-context learning, in which a group of examples, each comprising an input text, a classification result and the judgment criteria, is embedded in the input so as to cover all typical categories of the feature;
to enable automated processing, constraining the model output of every feature extraction instruction to a standard JSON format, and providing a validation module that performs type checking and range constraints on the key values returned by the large language model, thereby ensuring the data integrity of the generated multidimensional feature vectors of the prompt texts and the corpus texts.
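The validation module of claim 2 can be illustrated with a minimal sketch. The schema below (feature names, types, ranges and category sets) is a hypothetical example chosen for illustration and is not taken from the claims; only the idea of type checking and range constraints on JSON returned by the model comes from the text.

```python
import json

# Hypothetical schema: feature name -> (expected type, allowed range or category set).
SCHEMA = {
    "semantic_vagueness": (float, (0.0, 1.0)),
    "emotion_intensity": (int, (0, 4)),
    "sentence_pattern": (str, {"imperative", "interrogative", "declarative"}),
    "has_chain_of_thought": (bool, None),
}


def validate_features(raw: str) -> dict:
    """Parse an LLM reply and enforce type and range constraints on every key."""
    data = json.loads(raw)  # malformed JSON raises json.JSONDecodeError (a ValueError)
    checked = {}
    for key, (expected_type, constraint) in SCHEMA.items():
        if key not in data:
            raise ValueError(f"missing feature: {key}")
        value = data[key]
        # bool is a subclass of int in Python, so reject it for numeric fields.
        if isinstance(value, bool) and expected_type is not bool:
            raise ValueError(f"{key}: expected {expected_type.__name__}, got bool")
        # Accept integers where a float is expected (e.g. 1 instead of 1.0).
        if expected_type is float and isinstance(value, int):
            value = float(value)
        if not isinstance(value, expected_type):
            raise ValueError(f"{key}: expected {expected_type.__name__}")
        if isinstance(constraint, tuple):
            low, high = constraint
            if not low <= value <= high:
                raise ValueError(f"{key}: {value} outside [{low}, {high}]")
        elif isinstance(constraint, set) and value not in constraint:
            raise ValueError(f"{key}: unknown category {value!r}")
        checked[key] = value
    return checked
```

A caller would typically retry the extraction prompt when `validate_features` raises, so that only complete, in-range feature vectors reach the later steps.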
3. The large language model robustness visual diagnosis method based on multidimensional features and adversarial attacks according to claim 2, wherein in step 1-2 the method for extracting the multidimensional features of a prompt text comprises the following steps:
using the LLM to rewrite the original prompt instruction text into a compressed version that retains the core information, and calculating the redundancy $R$ of the prompt by a formula over $N_{orig}$, the number of characters in the original prompt text, and $N_{comp}$, the number of characters after compression;
using the LLM to identify negation words in the prompt text and calculating the proportion of the sentence length that they occupy;
constructing a prompt emotional intensity classifier, in which the prompt text is input into the LLM and scored on a five-level scale of 0 to 4, where 0 denotes extremely negative, 4 denotes extremely positive and 2 denotes neutral;
using the LLM to identify ambiguous words in the prompt text and calculating their proportion of the total word count to reflect the ambiguity of the text;
calculating complexity scores of the prompt with several vocabulary complexity algorithms and fusing them by weighting to obtain a final complexity index;
using the LLM to count the words with vague semantics in the prompt text and calculating their density to reflect the semantic vagueness of the prompt;
using the LLM, on the basis of dependency parsing, to calculate the grammatical complexity of the prompt text, where $\bar{l}$ is the average sentence length of the prompt text, $\bar{d}$ is the average depth of its dependency trees, $r_{sub}$ is its proportion of subordinate clauses and $r_{coord}$ is its proportion of coordinate structures; $\hat{l}$, $\hat{d}$, $\hat{r}_{sub}$ and $\hat{r}_{coord}$ are the corresponding values after Min-Max normalization, each in the range $[0, 1]$; and the grammatical complexity is
$G = w_1 \hat{l} + w_2 \hat{d} + w_3 \hat{r}_{sub} + w_4 \hat{r}_{coord}$,
where $G$ is the grammatical complexity of the prompt text and $w_1, w_2, w_3, w_4$ are the corresponding weight coefficients, satisfying $w_i \ge 0$ and $\sum_{i=1}^{4} w_i = 1$;
using the LLM to detect and count the grammatical errors in the prompt text to reflect grammatical fluency;
using the LLM to count the total number of tokens or the total number of characters in the prompt text;
using the LLM to identify the sentence-pattern category of the prompt text to reflect its sentence-pattern distribution, wherein the categories comprise imperative, interrogative and declarative sentences;
using the LLM to detect whether the prompt text contains a chain-of-thought structure or a format that guides the model to reason step by step;
using the LLM to count the frequency of use and the type distribution of modal verbs in the prompt text to reflect its modality;
constructing a prompt text classifier that uses the LLM to assign the prompt text to one of the following style categories: no obvious style, polite, rude/toxic, colloquial/informal, formal/written, and special semantics;
using the LLM to identify the tense of the prompt text, wherein the tenses comprise present, past and future;
using the LLM to identify whether the prompt text contains examples, and classifying it as containing no examples, a single example or multiple examples;
using the LLM to identify the grammatical person adopted in the prompt text, wherein the person is first person, second person or third person; and
converting the feature results obtained for each prompt text by the above prompt feature extraction method into a multidimensional feature vector.
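The grammatical complexity of claim 3 is a weighted fusion of four Min-Max-normalized quantities. The following is a minimal sketch of that arithmetic only; the dictionary keys, the normalization bounds and the equal weights are illustrative assumptions, and the dependency parsing that would produce the raw values is not shown.

```python
def min_max(value: float, low: float, high: float) -> float:
    """Min-Max normalize value into [0, 1]; low and high are dataset-wide bounds."""
    if high == low:
        return 0.0
    return min(1.0, max(0.0, (value - low) / (high - low)))


def grammar_complexity(raw: dict, bounds: dict, weights: dict) -> float:
    """Weighted sum of the normalized syntactic measurements.

    raw     -- measured values per dimension (hypothetical key names)
    bounds  -- (min, max) per dimension, used for normalization
    weights -- non-negative coefficients that must sum to 1
    """
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * min_max(raw[k], *bounds[k]) for k, w in weights.items())


# Illustrative values only.
raw = {"sentence_length": 20, "tree_depth": 4, "subordinate_ratio": 0.5, "coordinate_ratio": 0.2}
bounds = {"sentence_length": (0, 40), "tree_depth": (0, 8),
          "subordinate_ratio": (0, 1), "coordinate_ratio": (0, 1)}
weights = {k: 0.25 for k in raw}
score = grammar_complexity(raw, bounds, weights)
```

Because every normalized term lies in [0, 1] and the weights sum to 1, the fused score also stays in [0, 1], which keeps it comparable across prompts and corpus texts.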
4. The large language model robustness visual diagnosis method based on multidimensional features and adversarial attacks according to claim 2, wherein in step 1-2 the method for extracting the multidimensional features of a corpus text comprises the following steps:
using the LLM to rewrite the original corpus text into a compressed version that retains the core information, and calculating the corpus redundancy $R$ by a formula over $N_{orig}$, the number of characters in the original corpus text, and $N_{comp}$, the number of characters after compression;
using the LLM to calculate the proportion of polarity words with a clear emotional tendency in the corpus text;
using the LLM to count the number or proportion of punctuation marks in the corpus text to reflect punctuation density;
using the LLM to count the number or proportion of special characters and emoticons in the corpus text;
calculating corpus complexity scores with several vocabulary complexity algorithms and fusing them by weighting to obtain a final complexity index;
calculating corpus diversity scores with several vocabulary diversity algorithms and fusing them by weighting to obtain a final vocabulary diversity index;
using the LLM to calculate the density of words with vague semantics in the corpus text, giving the semantic vagueness of the corpus text;
using the LLM, on the basis of dependency parsing, to calculate the grammatical complexity of the corpus text, where $\bar{l}$ is the average sentence length of the corpus text, $\bar{d}$ is the average depth of its dependency trees, $r_{sub}$ is its proportion of subordinate clauses and $r_{coord}$ is its proportion of coordinate structures; $\hat{l}$, $\hat{d}$, $\hat{r}_{sub}$ and $\hat{r}_{coord}$ are the corresponding values after Min-Max normalization, each in the range $[0, 1]$; and the grammatical complexity is
$G = w_1 \hat{l} + w_2 \hat{d} + w_3 \hat{r}_{sub} + w_4 \hat{r}_{coord}$,
where $G$ is the grammatical complexity of the corpus text and $w_1, w_2, w_3, w_4$ are the corresponding weight coefficients, satisfying $w_i \ge 0$ and $\sum_{i=1}^{4} w_i = 1$;
using the LLM to count the total number of tokens or the total number of characters in the corpus text;
using the LLM to identify the sentence-pattern category of the corpus text to reflect its sentence-pattern distribution, wherein the categories comprise imperative, interrogative and declarative sentences;
constructing a corpus text classifier that uses the LLM to assign the corpus text to one of the following style categories, reflecting its style features: no obvious style, polite, rude/toxic, colloquial/informal, formal/written, and special semantics;
using the LLM to identify the tense of the corpus text, wherein the tenses comprise present, past and future;
using the LLM to identify the grammatical person adopted in the corpus text, wherein the person is first person, second person or third person;
using the LLM to identify the linguistic and cultural background of the corpus text, wherein the backgrounds comprise American culture, European culture, Latin culture and East Asian culture;
using the LLM to identify the specific domain of the corpus text, wherein the domains comprise finance, science and technology, education, sports and art; and
converting the feature results obtained for each corpus text by the above corpus feature extraction method into a multidimensional feature vector.
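The redundancy feature in claims 3 and 4 depends on the character counts before and after LLM compression, but the formula itself does not survive in this text. The sketch below therefore assumes one plausible definition, the relative reduction in character count; this choice is an assumption, not the claimed formula, and the compression step (an LLM call) is represented only by its output string.

```python
def redundancy(original: str, compressed: str) -> float:
    """Assumed redundancy: fraction of characters removed by compression.

    This relative-reduction form is an illustrative assumption; the patent's
    exact formula is not reproduced in the source text.
    """
    n_orig = len(original)
    n_comp = len(compressed)
    if n_orig == 0:
        return 0.0
    # Clamp at 0 in case the rewritten text is longer than the original.
    return max(0.0, (n_orig - n_comp) / n_orig)


# Illustrative strings only: 100 characters compressed to 60.
example = redundancy("x" * 100, "x" * 60)
```

Under this assumption a value near 0 means the text was already concise, while a value near 1 means most characters carried no core information.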
5. The large language model robustness visual diagnosis method based on multidimensional features and adversarial attacks according to claim 1, wherein step 2 comprises the following steps:
Step 2-1, constructing, on the basis of the TextAttack framework, a plurality of adversarial attackers for evaluating the anti-interference capability of the model, each adversarial attacker implementing a corresponding adversarial attack strategy, wherein:
the character-level swap adversarial attacker perturbs the characters of the input text, including randomly swapping adjacent characters and randomly inserting, deleting or replacing characters, simulating the mistyped keys or spelling errors that users produce during fast input, and tests the model's ability to correct and understand non-standard spelling by introducing character-level noise;
the random deletion adversarial attacker randomly removes tokens from the text according to a preset probability threshold, simulating incomplete instruction information, missing key modifiers or communication packet loss, and evaluates the model's ability to recover context and reason when part of the information is lost;
the easy data augmentation adversarial attacker adopts the EDA algorithm, combining the four operations of synonym replacement, random insertion, random swap and random deletion, and, by mixing several perturbation modes, greatly increases the sentence-pattern diversity of the samples while retaining the outline of the original semantics, so as to simulate an unstructured language noise environment;
the word-embedding-based substitution adversarial attacker constructs a high-dimensional semantic space from pre-trained word vectors, calculates the cosine similarity between words, and substitutes the words that are closest in the vector space, generating adversarial samples that are semantically highly similar but have a completely different word distribution, so as to test the model's generalization boundary on synonymous expressions;
the word-level swap adversarial attacker randomly swaps the positions of adjacent or non-adjacent words in the text without destroying the basic syntax tree structure of the sentence, simulating the word-order inversions common in spoken language or the grammatical inversions of non-native speakers, and tests the model's tolerance to word-order noise;
the general synonym replacement adversarial attacker looks up and replaces content words in the text on the basis of a general synonym dictionary, and evaluates the consistency of the model's responses to different vocabulary choices;
the semantic-knowledge-base-based substitution adversarial attacker performs systematic substitution using a structured English lexical database according to the synonym sets, hypernyms or hyponyms of words, relying on an expert-verified linguistic knowledge base to ensure that the generated adversarial samples are linguistically sound, so as to carry out a deeper semantic consistency test;
Step 2-2, for each original prompt instruction in the original prompt instruction set, automatically calling the adversarial attackers constructed in step 2-1 to generate adversarial instructions, and introducing a semantic retention mechanism during generation: if the semantic similarity between the perturbed text generated by an adversarial attacker and the original prompt instruction is lower than a preset threshold, or no perturbed text can be generated, the adversarial data is automatically regenerated until it is valid; each record contains an original prompt instruction and its corresponding adversarial instructions, thereby forming a complete adversarial instruction sample set;
Step 2-3, to comprehensively evaluate the performance of the model across the different attack dimensions, constructing the full-permutation cross input matrix, arranged in the order of original prompt instruction, adversarial instruction and corpus, as follows: letting the size of the original prompt instruction sample set be $U$ and the size of the corpus sample set be $V$, combining the original prompt instruction samples with the adversarial instructions generated by the $W$ adversarial attackers and with the corpus samples to construct the matrix; inputting each group of samples in the matrix into the large language model under test to obtain its output $y_\alpha$ under the $\alpha$-th original prompt instruction and its output $y'_\beta$ under the $\beta$-th adversarial instruction; and pairing the outputs to form $(y_\alpha, y'_\beta)$ as the basic data for calculating the robustness consistency index in the subsequent step, where $\alpha$ denotes the sequence number of the original prompt instruction, $\beta$ denotes the sequence number of the adversarial instruction, $y_\alpha$ denotes the output of the large language model under test under the $\alpha$-th original prompt instruction, and $y'_\beta$ denotes its output under the $\beta$-th adversarial instruction.
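Steps 2-1 to 2-3 of claim 5 can be sketched with two toy perturbations and a cross product over prompts, attackers and corpus texts. This is a simplified stand-in: it uses plain Python instead of the TextAttack framework named in the claim, omits the semantic retention check of step 2-2, and assumes one row per (prompt, attacker, corpus text) combination, giving $U \times W \times V$ rows. All function and field names are hypothetical.

```python
import itertools
import random


def swap_adjacent_chars(text: str, rng: random.Random) -> str:
    """Character-level swap: exchange one randomly chosen pair of adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def random_delete_words(text: str, rng: random.Random, p: float = 0.2) -> str:
    """Random deletion: drop each whitespace-separated token with probability p."""
    kept = [w for w in text.split() if rng.random() >= p]
    return " ".join(kept) if kept else text  # never return an empty instruction


def build_cross_matrix(prompts, corpus, attackers, seed=0):
    """Pair every original prompt with every attacker's output and every corpus text."""
    rng = random.Random(seed)
    rows = []
    for (alpha, prompt), (beta, attack) in itertools.product(
        enumerate(prompts), enumerate(attackers)
    ):
        adversarial = attack(prompt, rng)
        for v, passage in enumerate(corpus):
            rows.append({
                "alpha": alpha, "beta": beta, "corpus": v,
                "original": prompt, "adversarial": adversarial, "text": passage,
            })
    return rows


rows = build_cross_matrix(
    ["Classify the sentiment", "Summarize the text"],
    ["doc a", "doc b", "doc c"],
    [swap_adjacent_chars, random_delete_words],
)
```

Each row would then be sent to the model under test twice, once with `original` and once with `adversarial`, to obtain the paired outputs used by the consistency metrics.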
6. The large language model robustness visual diagnosis method based on multidimensional features and adversarial attacks according to claim 1, wherein in step 3 the method for selecting a suitable basic consistency metric function according to the downstream task type and calculating the basic consistency score comprises the following steps:
Step 3-1, to accommodate the differences in output form of large language models across application scenarios, providing several basic consistency indexes, comprising a label consistency index and a semantic similarity index;
for classification tasks with a fixed label set, using the label consistency index as the metric: the output $y_\alpha$ of the large language model under test under the $\alpha$-th original prompt instruction and its output $y'_{\alpha,\beta}$ under the adversarial instruction generated from the $\alpha$-th original prompt instruction by the $\beta$-th adversarial strategy are parsed to extract their predicted labels; if the two predicted labels are identical, the model is judged to remain robust under the perturbation and the result is recorded as 1, otherwise the model is judged fragile and the result is recorded as 0; the label consistency metric function is
$S_{label}(\alpha, \beta) = \mathbb{I}\big(\mathrm{Label}(y_\alpha) = \mathrm{Label}(y'_{\alpha,\beta})\big)$,
where $\alpha$ denotes the sequence number of the original prompt instruction; $\beta$ denotes the sequence number of the adversarial instruction; $y_\alpha$ denotes the output of the large language model under test under the $\alpha$-th original prompt instruction; $y'_{\alpha,\beta}$ denotes its output under the adversarial instruction corresponding to the $\alpha$-th original prompt instruction and the $\beta$-th adversarial strategy; $S_{label}$ denotes the label consistency score of the large language model under test; $\mathrm{Label}(\cdot)$ denotes the predicted label obtained by task-specific parsing of the output text; and $\mathbb{I}(\cdot)$ denotes the indicator function, which takes 1 when the condition in the brackets is true and 0 otherwise;
for open-text generation tasks, using the semantic similarity index as the metric: a pre-trained sentence embedding model maps the output $y_\alpha$ of the large language model under test under the $\alpha$-th original prompt instruction and its output $y'_{\alpha,\beta}$ under the adversarial instruction corresponding to the $\alpha$-th original prompt instruction and the $\beta$-th adversarial strategy to high-dimensional semantic vectors $\mathbf{v}_\alpha$ and $\mathbf{v}'_{\alpha,\beta}$, and the cosine similarity between the two vectors is calculated; the result is a continuous value in $[0, 1]$, and the closer it is to 1, the better the semantics of the model output are retained across the perturbation; the semantic similarity metric function is
$S_{sem}(\alpha, \beta) = \cos(\mathbf{v}_\alpha, \mathbf{v}'_{\alpha,\beta}) = \dfrac{\mathbf{v}_\alpha \cdot \mathbf{v}'_{\alpha,\beta}}{\lVert \mathbf{v}_\alpha \rVert_2 \, \lVert \mathbf{v}'_{\alpha,\beta} \rVert_2}$,
where $S_{sem}$ is the semantic similarity score; $\mathbf{v}_\alpha$ denotes the high-dimensional semantic vector obtained by mapping the output $y_\alpha$ through the sentence embedding model; $\mathbf{v}'_{\alpha,\beta}$ denotes the high-dimensional semantic vector obtained by mapping the output $y'_{\alpha,\beta}$ through the sentence embedding model; $\cos(\cdot,\cdot)$ denotes the cosine similarity function; and $\lVert \cdot \rVert_2$ denotes the 2-norm of a vector;
the label consistency score and the semantic similarity score obtained in this way are taken as the basic robustness scores.
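The two basic consistency metrics of claim 6 reduce to an indicator on parsed labels and a cosine similarity on embedding vectors. The sketch below shows both in plain Python. The `extract_label` helper is a hypothetical, deliberately naive parser, and the embedding vectors are passed in directly because the sentence embedding model is not specified in the text.

```python
import math


def extract_label(output: str, labels: list) -> str:
    """Hypothetical parser: return the first known label found in the output text."""
    lowered = output.lower()
    for label in labels:
        if label.lower() in lowered:
            return label
    return ""  # no label recognized


def label_consistency(original_output: str, adversarial_output: str, labels: list) -> int:
    """Indicator metric: 1 if both outputs parse to the same label, else 0."""
    return int(extract_label(original_output, labels)
               == extract_label(adversarial_output, labels))


def cosine_similarity(u: list, v: list) -> float:
    """Dot product of u and v divided by the product of their 2-norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        return 0.0  # convention for a zero vector, chosen for this sketch
    return dot / norm


labels = ["positive", "negative"]
same = label_consistency("The sentiment is Positive.", "positive", labels)
flipped = label_consistency("positive", "I would say negative", labels)
```

Cosine similarity is mathematically bounded by [-1, 1]; the [0, 1] range stated in the claim holds when the embeddings of the two outputs are not pointing in opposing directions, and this sketch does not clip the value.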
7. The large language model robustness visual diagnosis method based on multidimensional features and adversarial attacks according to claim 1, wherein in step 3 the method for aggregating and calculating the robustness scores at the sample level, the feature level and the overall model level comprises the following steps:
to support analysis needs ranging from macroscopic assessment to microscopic diagnosis, constructing a hierarchical robustness computation model as follows:
first, defining the variable sets: the prompt instruction set to be evaluated is $P = \{p_i\}_{i=1}^{N_P}$, the corpus is $C = \{c_j\}_{j=1}^{N_C}$, and the adversarial instruction set used is $A = \{a_k\}_{k=1}^{N_A}$, where $i$ is the prompt instruction sequence number, $j$ is the corpus sequence number, $k$ is the adversarial instruction sequence number, and $N_P$, $N_C$ and $N_A$ are the numbers of prompt instructions, corpus texts and adversarial instructions respectively; for any combination $(p_i, c_j, a_k)$, $s_{i,j,k}$ is the basic robustness score calculated in step 3-1;
on the basis of these definitions, setting robustness indexes at three levels in turn: sample-level robustness, feature-level robustness and overall model robustness;
the sample-level robustness index quantifies the average stability of a single sample in the face of diversified attacks, and is calculated for prompts and for corpus texts as
$R_{prompt}(p_i) = \dfrac{1}{N_C N_A} \sum_{j=1}^{N_C} \sum_{k=1}^{N_A} s_{i,j,k}$;
$R_{corpus}(c_j) = \dfrac{1}{N_P N_A} \sum_{i=1}^{N_P} \sum_{k=1}^{N_A} s_{i,j,k}$;
where $R_{prompt}(p_i)$ is the sample-level robustness score obtained, with the prompt instruction $p_i$ fixed, by averaging the basic robustness scores $s_{i,j,k}$ over all corpus texts $c_j$ and adversarial instructions $a_k$; and $R_{corpus}(c_j)$ is the sample-level robustness score obtained, with the corpus text $c_j$ fixed, by averaging the basic robustness scores $s_{i,j,k}$ over all prompt instructions $p_i$ and adversarial instructions $a_k$;
the feature-level robustness index diagnoses the influence of specific linguistic features on model robustness: letting $L$ be a specific feature dimension and $\delta$ a value or value interval of that feature, the set $I_{L,\delta}$ is defined as the set of indices of all samples whose feature $L$ takes the value $\delta$; the feature-level robustness index is calculated as
$R_{feat}(L, \delta) = \dfrac{1}{\lvert I_{L,\delta} \rvert} \sum_{x \in I_{L,\delta}} R_{sample}(x)$,
where $I_{L,\delta}$ denotes the set of sample indices whose feature dimension $L$ takes the value $\delta$, $\lvert I_{L,\delta} \rvert$ denotes the number of elements in that set, and $R_{sample}(x)$ denotes the sample-level robustness score of sample $x$; the feature-level robustness index calculates the average robustness of the subset of samples with a specific feature attribute, thereby revealing the correlation between the feature and model vulnerability;
the overall model robustness index evaluates the overall anti-interference capability of the model on the current test set, and is calculated as
$R_{model} = \dfrac{1}{N_P N_C N_A} \sum_{i=1}^{N_P} \sum_{j=1}^{N_C} \sum_{k=1}^{N_A} s_{i,j,k}$,
where $R_{model}$ is the arithmetic mean of the basic scores of all test cases; this index provides a macroscopic reference for the performance of the model.
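The three aggregation levels of claim 7 are all averages over different slices of the basic score tensor. The sketch below assumes the scores are stored as a nested list indexed `scores[i][j][k]` (prompt, corpus text, adversarial instruction) and that the feature-level index averages sample-level scores over the samples sharing a feature value; the data layout, the function names and the toy values are illustrative assumptions.

```python
from statistics import mean


def prompt_robustness(scores, i):
    """Sample-level score for prompt i: mean over all corpus texts and attacks."""
    return mean(s for per_corpus in scores[i] for s in per_corpus)


def corpus_robustness(scores, j):
    """Sample-level score for corpus text j: mean over all prompts and attacks."""
    return mean(s for per_prompt in scores for s in per_prompt[j])


def feature_robustness(sample_scores, feature_values, delta):
    """Feature-level score: mean sample-level score of samples whose feature equals delta."""
    selected = [score for score, value in zip(sample_scores, feature_values) if value == delta]
    return mean(selected) if selected else None  # None when no sample has this value


def model_robustness(scores):
    """Overall score: arithmetic mean of every basic score."""
    return mean(s for per_prompt in scores for per_corpus in per_prompt for s in per_corpus)


# Toy tensor: 2 prompts x 2 corpus texts x 2 adversarial instructions.
scores = [
    [[1, 0], [1, 1]],
    [[0, 0], [1, 0]],
]
prompt_scores = [prompt_robustness(scores, i) for i in range(2)]
by_pattern = feature_robustness(prompt_scores, ["imperative", "interrogative"], "imperative")
```

In this toy data the imperative prompt is noticeably more stable than the interrogative one, which is the kind of contrast the bidirectional bar chart in the feature statistics view is meant to surface.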
8. A multi-dimensional feature and attack-resistant large language model robustness visual diagnosis system for implementing the multi-dimensional feature and attack-resistant large language model robustness visual diagnosis method according to claims 1-7, characterized in that the system comprises a model robustness diagnosis system, a multi-dimensional diagnosis data storage system, an interactive visual analysis system and a display which are connected in sequence: The model robustness diagnosis system is used for evaluating the robustness of a large language model and comprises a high-dimensional feature vector extraction module, a text disturbance mechanism module and a robustness measurement module; The high-dimensional feature vector extraction module is used for respectively extracting high-dimensional feature vectors covering a vocabulary layer, a syntax layer, a semantic layer and a structural layer aiming at prompt words and corpus to be evaluated and combining a rule-based statistical algorithm with the semantic understanding capability of the large language model, and converting unstructured text into measurable feature data; The text disturbance mechanism module is used for introducing diversified text disturbance mechanisms, simulating language noise and expression differences in a real application scene, so as to carry out systematic robust pressure test on a large language model, constructing an anti-attack device for implementing an anti-attack strategy for evaluating the anti-interference capability of the model, wherein the anti-attack strategy comprises the following disturbance of character level exchange disturbance, synonym replacement disturbance and random deletion disturbance on an original prompt word instruction; The robustness measurement module is used for selecting an adaptive basic consistency measurement function according to the downstream task type and calculating basic consistency scores; The multidimensional diagnosis data 
storage system is used for structurally storing a multidimensional diagnosis data set containing prompt word text and corpus text to be evaluated, and feature data, countermeasure instruction data and robustness index data corresponding to the prompt word text and corpus text to be evaluated; the interactive visual analysis system is used for converting abstract robustness measurement into visual representation data through a multi-view collaboration mechanism and outputting the visual representation data to a display for display, supports dynamic interaction and exploration among a feature space, a semantic space and an countermeasure policy space by a user, and comprises one or a combination of a feature statistics view module, a semantic projection view module and a dynamic control panel module, wherein: The feature statistics view module is used for showing feature distribution and robustness association feature statistics view, provides a global overview and screening entrance from micro feature interval to macro distribution trend, and comprises one or a combination of a double-view feature distribution panel sub-module, a bidirectional robustness visual encoder and a feature interval screening and linkage sub-module: the double-view characteristic distribution panel submodule is used for dividing a user interface into an upper part and a lower part, wherein the upper part interface is a prompt word characteristic region, and the lower part interface is a corpus characteristic region; the features extracted from the prompt word text and the corpus text comprise continuous features and discrete features, wherein the continuous features are real-valued indexes which can be quantized and divided into boxes, and the discrete features are category-type or Boolean indexes; The bidirectional robust visual encoder is used for visually displaying each characteristic dimension by adopting a bidirectional bar graph, wherein the left side length of the bar graph maps the 
number of samples falling into the feature interval, reflecting the data distribution density, and the length of the right side maps the average robustness score of the samples in that interval; comparing the left and right bars visually conveys the positive or negative correlation between the feature value and model stability, and a feature interval with a large number of samples but an extremely short robustness bar indicates a fragile point; the feature interval screening and linkage submodule provides an interactive response interface: when a user clicks one or more feature bars and submits the selection, the clicked features are taken as filtering conditions and a linked update among the views is triggered, so that only sample data having those feature attributes are retained and highlighted, supporting an analysis logic of hypothesis and re-verification; the semantic projection view module is used for presenting a semantic projection view of semantic clusters and fragile clusters, displays the distribution structure of the samples in the semantic space based on a dimension reduction method, supports the user in discovering local fragile patterns and performing reverse attribution, and comprises one or more of a high-dimensional semantic space mapping submodule, a robustness heat-map coloring submodule, a lasso selection and reverse attribution submodule, a dynamic control panel submodule, an attack strategy dynamic configuration submodule, a real-time data recalculation and rendering submodule and an analysis state backtracking submodule; the high-dimensional semantic space mapping submodule is used for obtaining text embeddings with a pre-trained language model, mapping the text embeddings to a two-dimensional coordinate space with a dimension reduction algorithm, and drawing a prompt word scatter plot and a corpus scatter plot respectively on the left side and the right side
of the interface; the robustness heat-map coloring submodule is used for coloring the scatter points according to sample-level robustness, and comprises a color mapping module and a dynamic analysis module, wherein the color mapping module guides the user to visually identify groups of low-robustness samples concentrated under a specific semantic topic; the lasso selection and reverse attribution submodule is used for providing a lasso selection tool: after a user circles a sample cluster with abnormal color, the submodule extracts the IDs of all samples in the cluster and reversely triggers an update of the feature statistics view module; the dynamic control panel module is used for the user to configure and manage attack strategies and to backtrack analysis states, serves as the state center of the visual analysis system, and comprises one or more of the following submodules: the attack strategy dynamic configuration submodule is used for integrating the anti-attack devices in a central area and listing all available anti-attack modes; the real-time data recalculation and rendering submodule is used for recalculating the sample-level and feature-level robustness scores under the current strategy combination after the user changes and submits the selected anti-attack modes; the analysis state backtracking submodule is used for automatically recording the user's sequence of feature screening, projection circling and attack configuration operations, so that the user can backtrack to any previous analysis state through a history panel, ensuring the reproducibility and comparability of the exploration process.
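As a non-limiting illustration of the bidirectional bar encoding recited above, the following minimal Python sketch bins one continuous feature, then reports for each interval the sample count (left bar) and the average robustness score (right bar). The function names `bidirectional_bars` and `fragile_bins`, the equal-width binning, the thresholds `min_count` and `max_robustness`, and the example data are all assumptions made for illustration; the claims do not prescribe a binning scheme or a fragility threshold.

```python
from statistics import mean


def bidirectional_bars(feature_values, robustness_scores, n_bins=4):
    """Bin a continuous feature (equal-width bins, an assumed choice) and
    return, per interval, the sample count and the mean robustness score."""
    lo, hi = min(feature_values), max(feature_values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    bins = [[] for _ in range(n_bins)]
    for value, score in zip(feature_values, robustness_scores):
        idx = min(int((value - lo) / width), n_bins - 1)  # maximum goes in last bin
        bins[idx].append(score)
    return [
        {
            "interval": (lo + i * width, lo + (i + 1) * width),
            "count": len(scores),                                  # left bar length
            "mean_robustness": mean(scores) if scores else None,   # right bar length
        }
        for i, scores in enumerate(bins)
    ]


def fragile_bins(bars, min_count=3, max_robustness=0.5):
    """Flag intervals with many samples but a short robustness bar
    (both thresholds are hypothetical)."""
    return [
        b for b in bars
        if b["count"] >= min_count
        and b["mean_robustness"] is not None
        and b["mean_robustness"] <= max_robustness
    ]


# Hypothetical data: prompt word length per sample and its sample-level robustness.
prompt_lengths = [5, 8, 10, 22, 25, 28, 30, 40]
scores = [0.9, 0.8, 0.95, 0.3, 0.4, 0.2, 0.35, 0.7]

bars = bidirectional_bars(prompt_lengths, scores, n_bins=4)
weak = fragile_bins(bars)
```

In this sketch an interval is flagged only when it is both densely populated and weakly robust, which mirrors the reading of the bar chart described in the claim; a sparsely populated interval with a low score is left unflagged because its average rests on too few samples.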
9. The large language model robustness visual diagnosis system based on multidimensional features and attack resistance of claim 8, wherein a bidirectional data mapping and event response mechanism is established between the feature statistics view module and the semantic projection view module, supporting complementary feature-driven and semantic-driven progressive analysis flows, wherein: a forward feature verification path supports the user in verifying a diagnostic hypothesis starting from the macroscopic feature distribution; when the user identifies a feature interval with a low-robustness trend in the feature statistics view and submits the screening, global data filtering is triggered, the sample points having that feature attribute are highlighted in the semantic projection view, and the remaining irrelevant sample points are faded; this mechanism enables the user to observe intuitively how samples with a specific feature are distributed in the semantic space and to confirm whether they are concentrated on a particular topic or spread uniformly over the whole semantic space, thereby verifying whether the feature is a dominant factor in the failure on that semantic topic; a reverse semantic attribution path supports the user in diagnosing root causes starting from microscopic semantic anomalies: when the user visually finds a sample cluster with abnormal color in the semantic projection view, the user circles it with the lasso tool, the indexes of all samples in the cluster are extracted, and a local update of the feature statistics view is reversely triggered, which computes and displays only the feature distribution histograms of the circled sample cluster; by comparing, in each feature dimension, the distribution of the locally selected samples with that of the global samples, the user is assisted in quickly identifying the key feature combinations that cause the fragility of the semantic cluster, thereby completing in-depth attribution of the model's fragility.
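As a non-limiting illustration of the two analysis paths of claim 9, the following minimal Python sketch shows a forward filter that returns the IDs of samples to highlight, and a reverse attribution that compares the local and global share of each value of a discrete feature. The record layout, the feature names `has_negation` and `long_prompt`, and the use of a simple difference in proportions are assumptions made for illustration; the lasso geometry is omitted and the circled IDs are assumed to be given.

```python
from collections import Counter

# Hypothetical sample records with two Boolean features and a robustness score.
samples = [
    {"id": 0, "has_negation": True,  "long_prompt": True,  "robustness": 0.20},
    {"id": 1, "has_negation": True,  "long_prompt": False, "robustness": 0.30},
    {"id": 2, "has_negation": True,  "long_prompt": True,  "robustness": 0.25},
    {"id": 3, "has_negation": False, "long_prompt": True,  "robustness": 0.90},
    {"id": 4, "has_negation": False, "long_prompt": False, "robustness": 0.85},
    {"id": 5, "has_negation": False, "long_prompt": False, "robustness": 0.80},
]


def forward_filter(records, feature, value):
    """Forward path: IDs of samples to highlight in the projection view;
    all other points would be faded."""
    return {r["id"] for r in records if r[feature] == value}


def reverse_attribution(records, selected_ids, features):
    """Reverse path: for each feature value, local share minus global share.
    Large absolute differences point to candidate key features."""
    local = [r for r in records if r["id"] in selected_ids]
    if not local:
        return {}
    report = {}
    for feature in features:
        global_counts = Counter(r[feature] for r in records)
        local_counts = Counter(r[feature] for r in local)
        report[feature] = {
            value: local_counts[value] / len(local) - global_counts[value] / len(records)
            for value in global_counts
        }
    return report


highlighted = forward_filter(samples, "has_negation", True)

# IDs assumed to come from a lasso selection around a low-robustness cluster.
circled_ids = {0, 1, 2}
report = reverse_attribution(samples, circled_ids, ["has_negation", "long_prompt"])
top_feature = max(report, key=lambda f: max(abs(d) for d in report[f].values()))
```

Here the circled cluster consists entirely of samples containing negation, while only half of all samples do, so `has_negation` shows the largest local-versus-global difference and is surfaced as the leading candidate. A production system might replace the raw difference with a statistical test, which this sketch does not attempt.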
10. An apparatus for implementing a large language model robustness visual diagnosis method based on multidimensional features and attack resistance, comprising a memory and a processor, characterized in that the memory is used for storing a computer program, and the processor is used for executing the computer program and, when executing the computer program, implementing the steps of the large language model robustness visual diagnosis method based on multidimensional features and attack resistance according to any one of claims 1 to 7.

Description

Multi-dimensional feature and attack resistance based large language model robustness visual diagnosis method, system and equipment

Technical Field

The invention relates to the field of visual analysis for deep learning-based natural language processing, and in particular to a large language model robustness visual diagnosis method, system and equipment based on multidimensional features and attack resistance.

Background

The robustness of a large language model (LLM) refers to the model's ability to maintain output consistency or stable performance in the face of input perturbations (e.g., prompt changes or adversarial attacks). With the rapid deployment of large language models in various fields, their performance has become highly dependent on user-provided prompts; however, studies have shown that even small linguistic perturbations to a prompt can lead to significant fluctuations in model output. This lack of robustness has become a core bottleneck limiting the reliability and security of LLMs in critical applications. To ensure model reliability, the performance of a model under different features and attacks needs to be quantified through a systematic evaluation framework and visual analysis, so that weak points can be identified and optimization can be guided.

To evaluate model robustness systematically, the research community has developed a series of standardized benchmarks and frameworks. Early evaluation methods focused on the overall stability of a model on a particular task, for example by testing model stability through multi-level perturbation of instructions, or by building a multi-level capability assessment framework. In pursuit of more realistic assessment results, researchers have also proposed "one-time multi-set" assessment paradigms, or constructed large-scale, multi-dimensional predictive datasets.
As research has deepened, scholars have begun to pay attention to the sensitivity of models to specific attributes of prompts. Large-scale experiments have verified the influence of multiple dimensions of prompt design on evaluation, leading to calls for evaluation with diversified prompt configurations and to prompt sensitivity indexes for quantifying stability. Some studies have examined the impact of linguistic attributes in depth, for example by studying the linguistic attributes that make prompts successful, quantifying the sensitivity of models to "spurious features", and studying robustness to changes in format and style. In addition, evaluation in malicious scenarios is receiving increasing attention: researchers have evaluated vulnerability under "prompt injection" and have designed decomposition attacks that evade detection. However, most of these studies stop at reporting an aggregated overall score and lack a means of fine-grained exploration.

Visual analysis aims to help users understand and evaluate the behavior and performance of large language models by visual means. Existing related work falls mainly into three directions: visualization of internal model mechanisms, visualization for specific tasks, and prompt-based performance analysis. Work on visualizing internal model mechanisms aims to open the "black box" of the LLM. For example, exBERT and BertViz reveal dependencies between words by visualizing the self-attention mechanism, LIT provides an extensible platform that supports exploration of embeddings and saliency maps, and LayerFlow allows experts to explore the evolution of text embeddings layer by layer. These systems mainly serve interpretability, focusing on understanding "how" the model works. In visualization for specific tasks, work focuses on particular security scenarios.
AdversaFlow provides a visual "red teaming" system, JailbreakHunter and JailbreakLens provide analysis systems for "jailbreak" attacks, and LLM Comparator supports side-by-side comparison of multiple LLMs. However, the analytical dimensions of such work are tightly bound to specific tasks and are difficult to generalize to general robustness diagnosis. In prompt-based performance analysis, systems such as Prompterator and PromptAid aim to help users optimize instructions through visualization to improve task performance, but they often lack in-depth analysis of the effect of interaction with the corpus context.

With the wide application of large language models in critical fields, ensuring robustness against diversified inputs has become an urgent issue. Although existing robustness research provides various quantitative indexes and evaluation benchmarks, in practical applications mere "measurement" cannot satisfy developers' need for in-depth "diagnosis" of model vulnerability. The core challenge of current robustness studies is not the absence of tools, but the lack of a diagnostic theoretica