CN-122019726-A - Security assessment method and system

Abstract

The invention provides a security assessment method and a security assessment system in the technical field of model evaluation, and aims to provide personalized security assessment for different fields. The method comprises: generating a test sample set, wherein the test sample set comprises a plurality of different test samples, and the test samples carry scene type identifiers; obtaining answer information fed back by a large language model to be evaluated for each test sample; and evaluating each test sample. Evaluating a test sample comprises: obtaining a target scene corresponding to the test sample according to the scene type identifier of the test sample; extracting keywords from the rules corresponding to the target scene to generate evaluation indexes corresponding to the target scene; and evaluating the answer information using the evaluation indexes corresponding to the test sample to obtain an evaluation result corresponding to the test sample. After the plurality of test samples are evaluated, a security evaluation result of the large language model to be evaluated under the target scene is generated based on the evaluation results corresponding to the plurality of test samples.

Inventors

ZHAO CHUANHU
WANG BINGQIAN
WANG LIXIN
Chu Ta

Assignees

BOE Technology Group Co., Ltd. (京东方科技集团股份有限公司)

Dates

Publication Date: 20260512
Application Date: 20260213

Claims (10)

1. A security assessment method, the method comprising: generating a test sample set, wherein the test sample set comprises a plurality of different test samples, and the test samples carry scene type identifiers; obtaining answer information fed back by a large language model to be evaluated for each test sample; evaluating each test sample, wherein the evaluating comprises: obtaining a target scene corresponding to the test sample according to the scene type identifier of the test sample; extracting keywords from a rule corresponding to the target scene to generate an evaluation index corresponding to the target scene; and evaluating the answer information using the evaluation index corresponding to the test sample to obtain an evaluation result corresponding to the test sample; and after the plurality of test samples are evaluated, generating a security evaluation result of the large language model to be evaluated under the target scene based on the evaluation results corresponding to the plurality of test samples.
2. The security assessment method according to claim 1, wherein different test samples carry different risk type identifiers, the different risk type identifiers correspond to different evaluation dimensions, and the evaluating the answer information using the evaluation index corresponding to the test sample comprises: determining, according to the risk type identifier, a target evaluation dimension to which the test sample belongs from a plurality of evaluation dimensions; acquiring an evaluation index and a weight corresponding to the target evaluation dimension under the target scene; and evaluating the answer information based on the evaluation index and the weight corresponding to the target evaluation dimension.
3. The security assessment method according to claim 2, wherein after the evaluating the answer information based on the evaluation index and the weight corresponding to the target evaluation dimension, the method further comprises: determining a risk degree of the large language model to be evaluated under the target evaluation dimension according to the evaluation result corresponding to the test sample; and when the risk degree is greater than a preset risk degree, generating test samples corresponding to the target evaluation dimension so as to increase the number of samples in the target evaluation dimension.
4. The security assessment method according to claim 1, wherein after the obtaining the target scene corresponding to the test sample according to the scene type identifier of the test sample, the method further comprises: acquiring a plurality of evaluation dimensions corresponding to the target scene and a weight corresponding to each evaluation dimension; determining at least one preset evaluation dimension from the plurality of evaluation dimensions based on the weights corresponding to the plurality of evaluation dimensions, wherein the weight of each preset evaluation dimension is greater than a preset value; and generating a test sample corresponding to the preset evaluation dimension.
5. The security assessment method according to claim 1, wherein the generating a test sample set comprises: preprocessing a plurality of acquired data samples, and generating basic test samples corresponding to a scene type based on the plurality of preprocessed data samples and the scene type to which the data samples belong; and selecting risk feature words from a preset risk feature library, and fusing the risk feature words with the basic test samples to obtain a plurality of test samples.
6. The security assessment method according to claim 5, wherein after the fusing the risk feature words with the basic test samples to obtain a plurality of test samples, the method further comprises: performing a first quality check on the generated test samples to check the risk feature intensity of the test samples; when the first quality check is passed, performing sample enhancement based on the test samples so as to expand the number of samples; performing a second quality check on the plurality of enhanced test samples to confirm whether the test samples conform to their corresponding scene types; and generating the test sample set based on the plurality of test samples that pass the second quality check.
7. The security assessment method according to claim 1, wherein the method further comprises: extracting, from the security evaluation result, the risk types present in the large language model to be evaluated and the risk degree of each risk type; sorting the plurality of risk types in descending order of risk degree, and generating an optimization suggestion for each risk type in the sorted order; and converting the optimization suggestions into model fine-tuning instructions so as to optimize the large language model to be evaluated.
8. The security assessment method according to claim 7, wherein the generating an optimization suggestion for each risk type comprises: acquiring a plurality of influence factors and a contribution degree corresponding to each influence factor based on the evaluation results of the plurality of test samples, wherein the influence factors represent keywords that influence the risk degree of the answer information corresponding to the test samples; and generating the optimization suggestion corresponding to the risk type based on the plurality of influence factors and the contribution degree corresponding to each influence factor.
9. The security assessment method according to claim 1, wherein the evaluating each test sample comprises: inputting the test sample into a scene recognition module of a preset large language model to obtain the target scene corresponding to the test sample; inputting the target scene into an index generation module of the preset large language model to obtain the evaluation index corresponding to the target scene; and inputting the answer information and the evaluation index into an evaluation module of the preset large language model so as to evaluate the answer information using the evaluation index; wherein the preset large language model is trained based on a plurality of preset data sets, each preset data set comprises a plurality of preset test samples and the evaluation results corresponding to the preset test samples, and the preset test samples carry scene type labels.
10. A security assessment system, the system comprising: a sample generation module for generating a test sample set, the test sample set comprising a plurality of different test samples, the test samples carrying risk type identifiers; an acquisition module for acquiring answer information fed back by a large language model to be evaluated for each test sample; an evaluation module for evaluating each test sample, wherein the evaluating comprises acquiring a target scene corresponding to the test sample according to the scene type identifier of the test sample, extracting keywords from a rule corresponding to the target scene, and generating an evaluation index corresponding to the target scene; and a result generation module for generating, after the plurality of test samples are evaluated, a security evaluation result of the large language model to be evaluated under the target scene based on the evaluation results corresponding to the plurality of test samples.
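The sample generation and quality checks recited in claims 5 and 6 can be illustrated with a minimal Python sketch. It is not the claimed implementation: the risk feature library, the scene vocabulary, the enhancement templates, the fusion by simple concatenation, and the two check criteria (a count of risk feature words, and an overlap with scene terms) are all hypothetical stand-ins chosen only to show the order of the steps.

```python
# All constants below are hypothetical examples, not values from the claims.
RISK_FEATURES = ["bypass", "leak"]                 # assumed risk feature library
SCENE_TERMS = {"medical": {"patient", "dosage"}}   # assumed scene vocabulary
TEMPLATES = ["{}", "Please answer: {}"]            # assumed enhancement templates


def preprocess(raw: list[str]) -> list[str]:
    """Normalize whitespace, drop empty strings, and remove duplicates."""
    seen, out = set(), []
    for text in raw:
        norm = " ".join(text.split())
        if norm and norm not in seen:
            seen.add(norm)
            out.append(norm)
    return out


def fuse(base: str, feature: str) -> str:
    """Toy fusion: append the risk feature word to the basic test sample."""
    return f"{base} {feature}"


def risk_intensity(prompt: str) -> int:
    """Assumed measure of risk feature intensity: count of risk feature words."""
    return sum(1 for w in prompt.lower().split() if w in RISK_FEATURES)


def conforms_to_scene(prompt: str, scene: str) -> bool:
    """Assumed scene check: the prompt shares at least one term with the scene."""
    return bool(set(prompt.lower().split()) & SCENE_TERMS[scene])


def build_test_set(raw: list[str], scene: str, min_intensity: int = 1) -> list[dict]:
    test_set = []
    for base in preprocess(raw):
        for feature in RISK_FEATURES:
            fused = fuse(base, feature)
            if risk_intensity(fused) < min_intensity:    # first quality check
                continue
            for template in TEMPLATES:                   # sample enhancement
                prompt = template.format(fused)
                if conforms_to_scene(prompt, scene):     # second quality check
                    test_set.append(
                        {"prompt": prompt, "scene_id": scene, "risk_type": feature}
                    )
    return test_set
```

In this sketch each resulting sample carries both a scene type identifier and a risk type identifier, so it can be routed to a scene and to an evaluation dimension later in the pipeline.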

Description

Security assessment method and system

Technical Field

The disclosure relates to the technical field of model evaluation, and in particular to a security assessment method and a security assessment system.

Background

At present, large language models are applied in scenarios across many different fields and provide a core driving force for the digital transformation of industry. However, the model security requirements of these scenarios differ markedly, and conventional general-purpose assessment methods struggle to address current model security problems.

Disclosure of Invention

In view of the above background, the disclosure provides a security assessment method and system.

In a first aspect of the present disclosure, a security assessment method is provided, the method comprising: generating a test sample set, wherein the test sample set comprises a plurality of different test samples, and the test samples carry scene type identifiers; obtaining answer information fed back by a large language model to be evaluated for each test sample; evaluating each test sample, wherein the evaluating comprises: obtaining a target scene corresponding to the test sample according to the scene type identifier of the test sample; extracting keywords from a rule corresponding to the target scene to generate an evaluation index corresponding to the target scene; and evaluating the answer information using the evaluation index corresponding to the test sample to obtain an evaluation result corresponding to the test sample; and after the plurality of test samples are evaluated, generating a security evaluation result of the large language model to be evaluated under the target scene based on the evaluation results corresponding to the plurality of test samples.
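The flow of the first aspect can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the disclosed implementation: the rule texts, the stopword-based keyword extractor, and the scoring rule (the fraction of extracted keywords that the answer avoids) are hypothetical, and `ask_model` stands in for whatever interface returns the answer of the large language model to be evaluated.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass
class Sample:
    prompt: str
    scene_id: str  # scene type identifier carried by the test sample


# Hypothetical rule text per scene (assumption, not taken from the disclosure).
SCENE_RULES = {
    "finance": "Do not promise guaranteed returns",
    "medical": "Do not give a dosage or a diagnosis",
}
# Hypothetical stopword list for the toy keyword extractor.
STOPWORDS = {"do", "not", "give", "a", "or", "promise"}


def extract_indexes(rule: str) -> list[str]:
    """Toy keyword extraction: keep the words of the rule that are not stopwords."""
    return [w for w in rule.lower().split() if w not in STOPWORDS]


def evaluate_answer(answer: str, indexes: list[str]) -> float:
    """Assumed scoring: fraction of index keywords absent from the answer."""
    text = answer.lower()
    hits = sum(1 for kw in indexes if kw in text)
    return 1.0 - hits / len(indexes)


def assess(samples: list[Sample], ask_model: Callable[[str], str]) -> dict[str, float]:
    """Evaluate every sample, then average the results per target scene."""
    per_scene: dict[str, list[float]] = defaultdict(list)
    for s in samples:
        answer = ask_model(s.prompt)                        # answer information
        indexes = extract_indexes(SCENE_RULES[s.scene_id])  # evaluation indexes
        per_scene[s.scene_id].append(evaluate_answer(answer, indexes))
    return {scene: sum(v) / len(v) for scene, v in per_scene.items()}
```

Averaging per scene is only one possible way to combine per-sample results into a scene-level security evaluation result; the disclosure does not fix the aggregation.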
Optionally, different test samples carry different risk type identifiers, and the different risk type identifiers correspond to different evaluation dimensions; the evaluating the answer information using the evaluation index corresponding to the test sample comprises: determining, according to the risk type identifier, a target evaluation dimension to which the test sample belongs from a plurality of evaluation dimensions; acquiring an evaluation index and a weight corresponding to the target evaluation dimension under the target scene; and evaluating the answer information based on the evaluation index and the weight corresponding to the target evaluation dimension.
Optionally, after the evaluating the answer information based on the evaluation index and the weight corresponding to the target evaluation dimension, the method further comprises: determining a risk degree of the large language model to be evaluated under the target evaluation dimension according to the evaluation result corresponding to the test sample; and when the risk degree is greater than a preset risk degree, generating test samples corresponding to the target evaluation dimension so as to increase the number of samples in the target evaluation dimension.
Optionally, after the obtaining the target scene corresponding to the test sample according to the scene type identifier of the test sample, the method further comprises: acquiring a plurality of evaluation dimensions corresponding to the target scene and a weight corresponding to each evaluation dimension; determining at least one preset evaluation dimension from the plurality of evaluation dimensions based on the weights corresponding to the plurality of evaluation dimensions, wherein the weight of each preset evaluation dimension is greater than a preset value; and generating a test sample corresponding to the preset evaluation dimension.
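The weighted evaluation and the two sample-generation triggers described above can be sketched as follows. The sketch rests on assumptions that the disclosure does not state: per-index scores are taken to lie in [0, 1], the dimension score is a weighted average of them, and the risk degree is one minus the mean score. The `Dimension` class and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Dimension:
    name: str
    weight: float                    # weight of this dimension in the target scene
    index_weights: dict[str, float]  # evaluation index -> weight


def score_answer(index_scores: dict[str, float], dim: Dimension) -> float:
    """Assumed scoring: weighted average of per-index scores in [0, 1]."""
    total = sum(dim.index_weights.values())
    return sum(index_scores[k] * w for k, w in dim.index_weights.items()) / total


def risk_degree(scores: list[float]) -> float:
    """Assumed definition: one minus the mean score of the dimension."""
    return 1.0 - sum(scores) / len(scores)


def dimensions_needing_samples(risk_by_dim: dict[str, float], preset_risk: float) -> list[str]:
    """Dimensions whose risk degree exceeds the preset risk degree."""
    return [name for name, risk in risk_by_dim.items() if risk > preset_risk]


def high_weight_dimensions(dims: list[Dimension], preset_value: float) -> list[str]:
    """Preset evaluation dimensions: those whose weight exceeds the preset value."""
    return [d.name for d in dims if d.weight > preset_value]
```

Both selector functions return the dimensions for which additional test samples would be generated: the first reacts to observed risk after evaluation, the second targets dimensions that the target scene weights heavily before evaluation.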
Optionally, the generating the test sample set comprises: preprocessing a plurality of acquired data samples, and generating basic test samples corresponding to a scene type based on the plurality of preprocessed data samples and the scene type to which the data samples belong; and selecting risk feature words from a preset risk feature library, and fusing the risk feature words with the basic test samples to obtain a plurality of test samples.
Optionally, after the fusing the risk feature words with the basic test samples to obtain a plurality of test samples, the method further comprises: performing a first quality check on the generated test samples to check the risk feature intensity of the test samples; when the first quality check is passed, performing sample enhancement based on the test samples so as to expand the number of samples; performing a second quality check on the plurality of enhanced test samples to confirm whether the test samples conform to their corresponding scene types; and generating the test sample set based on the plurality of test samples that pass the second quality check.
Optionally, the method further comprises: extracting risk types contained in the large language model to