CN-122020646-A - Backdoor detection method based on code model input samples in a black-box scenario

Abstract

The invention discloses a backdoor detection method based on code model input samples in a black-box scenario. The method comprises: designing perturbations based on the input sample; performing syntactic and lexical analysis of the input sample with an abstract syntax tree parsing tool to identify perturbation nodes at which a perturbation can be inserted; designing a perturbation addition strategy based on the perturbation nodes; generating a perturbed sample list based on the perturbation nodes and the perturbation addition strategy; inputting the input sample and the perturbed sample list into the code model and calculating the corresponding perturbation trend deviation index according to the code task to be performed; calculating, for the same code task, a dynamic normalization threshold from perturbation trend deviation indices obtained in advance after adding perturbations; and judging the input sample to be malicious if the perturbation trend deviation index calculated for it is greater than the dynamic normalization threshold. The invention can detect malicious input samples without making additional assumptions about the backdoor trigger.

Inventors

HUANG HAIPING
YU ZHENQI
CHANG SHUYU
YU LE
SHA LETIAN
LIU FENGRUI
TANG JUN
WU MIN

Assignees

Nanjing University of Posts and Telecommunications (南京邮电大学)

Dates

Publication Date: 2026-05-12
Application Date: 2025-12-19

Claims (10)

1. A backdoor detection method based on code model input samples in a black-box scenario, characterized by comprising the following steps: acquiring an input sample of a code model, and designing perturbations based on the input sample; performing syntactic and lexical analysis on the input sample with an abstract syntax tree parsing tool to identify perturbation nodes at which a perturbation can be inserted; designing a perturbation addition strategy based on the perturbations and the perturbation nodes; based on the perturbation nodes and the perturbation addition strategy, gradually adding perturbations to the input sample to generate a perturbed sample list; inputting the input sample and the perturbed sample list into the code model respectively, and calculating the corresponding perturbation trend deviation index according to the code task to be performed; calculating, according to the code task to be performed, a dynamic normalization threshold for the perturbation trend deviation index from perturbation trend deviation indices obtained in advance after adding perturbations; and, if the perturbation trend deviation index calculated for the input sample of the corresponding code task is greater than the dynamic normalization threshold, judging the input sample to be malicious.
2. The backdoor detection method based on code model input samples in a black-box scenario according to claim 1, wherein designing perturbations based on the input sample comprises designing a code logic dimension perturbation based on the code logic dimension of the input sample, and designing a code readability dimension perturbation based on the code readability dimension of the input sample; the code logic dimension perturbation disrupts the syntactic structure and execution logic of the input sample code from the surface level to the deep level, and comprises a first retentive perturbation and a first destructive perturbation; the first retentive perturbation performs equivalent structure conversion within the input sample code segment while preserving the semantic and syntactic correctness of the input sample code, the equivalent structure conversion comprising loop equivalent conversion, branch equivalent conversion, calculation equivalent conversion and constant equivalent exchange; the first destructive perturbation destroys the basic logic executed by the input sample code, in ways comprising conditional expression inversion, loop control logic modification, variable assignment logic change and function behavior tampering; the code readability dimension perturbation involves identifier modification and affects the semantics and readability of the input sample code, and comprises a second retentive perturbation and a second destructive perturbation; the second retentive perturbation performs synonym replacement using the synonym sets in the WordNet dictionary; and the second destructive perturbation performs irrelevant word replacement, selecting words with the same part of speech from the WordNet dictionary for random replacement.
3. The backdoor detection method based on code model input samples in a black-box scenario according to claim 2, wherein designing a perturbation addition strategy based on the perturbations and the perturbation nodes comprises: generating a first addition mode based on the first retentive perturbation and the second retentive perturbation, and generating corresponding perturbed samples through the first addition mode; and generating a second addition mode based on the first destructive perturbation and the second retentive perturbation, and generating corresponding perturbed samples through the second addition mode; wherein the expression of the first addition mode is as follows: [expression missing in source], where j is the perturbation node sequence number, the expression is written in terms of the input sample segment on the j-th perturbation node, j ranges from 1 to m-1, and m is the total number of perturbation node sequence numbers to which the first addition mode is applied; and the expression of the second addition mode is as follows: [expression missing in source], where k is the perturbation node sequence number, the expression is written in terms of the input sample segment on the k-th perturbation node, k ranges from m to n, and n is the total number of perturbation nodes.
4. The backdoor detection method based on code model input samples in a black-box scenario according to claim 3, wherein generating perturbed samples comprises the following steps: defining the code model M as a model into which a backdoor was implanted during the training phase; using an abstract syntax tree parsing tool to perform lexical and syntactic analysis on the input sample of the code model M to generate an abstract syntax tree, traversing all nodes in the abstract syntax tree, identifying perturbation nodes suitable for inserting perturbations, and storing them as triples in a code perturbation position list, wherein each triple comprises a line number, an abstract syntax tree node type and a perturbation type, and the abstract syntax tree node types comprise expression, control flow, assignment nodes and identifier nodes; for the expression, control flow and assignment nodes in the code perturbation position list, adding the code logic dimension perturbation, and for the identifier nodes in the code perturbation position list, adding the code readability dimension perturbation, to generate the perturbed sample list, whose expression is: [expression missing in source], where i is the sequence number of the perturbed sample, i ranges from 1 to n, and n is the total number of perturbed samples; the expression of the i-th perturbed sample is as follows: [expression missing in source], written in terms of the function that adds the perturbation and the addition mode of the i-th perturbed sample.
5. The backdoor detection method based on code model input samples in a black-box scenario according to claim 4, wherein the code tasks comprise a code classification task and a code generation task, and calculating the corresponding perturbation trend deviation index according to the code task to be performed comprises calculating a first perturbation trend deviation index for the code classification task and calculating a second perturbation trend deviation index for the code generation task.
6. The backdoor detection method based on code model input samples in a black-box scenario according to claim 5, wherein calculating the first perturbation trend deviation index for the code classification task comprises: inputting the input sample and the perturbed sample list into the code model M respectively to obtain the original label and the perturbed label list output by the code model M; and calculating the first perturbation trend deviation index from the original label and the perturbed label list, wherein the expressions of the original label and the perturbed label list are as follows: [expressions missing in source].
7. The backdoor detection method based on code model input samples in a black-box scenario according to claim 6, wherein the first perturbation trend deviation index is calculated by the following formula: [formula missing in source], in which the terms are the perturbed label of the i-th perturbed sample, the jump intensity function, and the identification function, which judges whether two adjacent perturbed labels are equal and takes the following values: [formula missing in source]; the jump intensity function is calculated by the following formula: [formula missing in source], in which the term denotes the probability that the perturbed label jumps from one label to another.
8. The backdoor detection method based on code model input samples in a black-box scenario according to claim 7, wherein calculating the second perturbation trend deviation index for the code generation task comprises: inputting the input sample and the perturbed sample list into the code model M to obtain the original generated segment and the list of code segments generated after perturbation output by the code model M; and calculating the second perturbation trend deviation index from the original generated segment and the list of code segments generated after perturbation, wherein the expressions of the original generated segment and the list of code segments generated after perturbation are as follows: [expressions missing in source].
9. The backdoor detection method based on code model input samples in a black-box scenario according to claim 8, wherein the second perturbation trend deviation index is calculated by the following formula: [formula missing in source], in which the terms are the score of the code segment generated after perturbation for the i-th perturbed sample, and the identification function, which indicates whether the score of the code segments generated for two adjacent perturbations remains unchanged and takes the following values: [formula missing in source].
10. The backdoor detection method based on code model input samples in a black-box scenario according to claim 9, wherein the dynamic normalization threshold is calculated by the following formula: [formula missing in source], in which the terms are the dynamic normalization threshold, the mean of the perturbation trend deviation indices of pre-acquired clean samples after adding perturbations, the standard deviation of the perturbation trend deviation indices of pre-acquired clean samples after adding perturbations, and a threshold shift parameter, wherein the clean samples are non-malicious samples.
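
To illustrate the code perturbation position list described in claim 4, the following is a minimal Python sketch, not the patented implementation. It assumes the input sample is Python source code and uses the standard-library `ast` module as the abstract syntax tree parsing tool. The mapping `LOGIC_NODES`, the function name `find_perturbation_positions`, the string labels, and the choice to treat only assigned names as identifier nodes are hypothetical choices made for this example.

```python
import ast

# Hypothetical mapping from Python AST classes to the node types named in claim 4.
LOGIC_NODES = {
    ast.If: "control_flow",
    ast.For: "control_flow",
    ast.While: "control_flow",
    ast.Assign: "assignment",
    ast.AugAssign: "assignment",
    ast.BinOp: "expression",
    ast.Compare: "expression",
}


def find_perturbation_positions(source: str) -> list:
    """Return sorted (line number, node type, perturbation type) triples."""
    tree = ast.parse(source)
    positions = set()
    for node in ast.walk(tree):
        node_type = LOGIC_NODES.get(type(node))
        if node_type is not None:
            # Expression, control flow and assignment nodes take logic perturbations.
            positions.add((node.lineno, node_type, "logic"))
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            # Assumption: only names being assigned count as identifier nodes.
            positions.add((node.lineno, "identifier", "readability"))
    return sorted(positions)


SAMPLE = '''
def total(values):
    result = 0
    for v in values:
        if v > 0:
            result += v
    return result
'''

positions = find_perturbation_positions(SAMPLE)
for triple in positions:
    print(triple)
```

Each triple records where a perturbation could be applied and which of the two perturbation dimensions from claim 2 it belongs to. Applying the perturbations themselves (equivalent conversions, condition inversion, WordNet-based renaming) is outside the scope of this sketch.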

Description

Backdoor detection method based on code model input samples in a black-box scenario

Technical Field

The invention relates to the technical field of backdoor defense, and in particular to a backdoor detection method based on code model input samples in a black-box scenario.

Background

Over the past decade, deep-learning-based code models have advanced steadily in software engineering tasks and have shown remarkable performance, particularly in code understanding tasks such as defect detection, code clone detection and code search. This performance has driven the wide adoption of code models, and AI programming assistants based on neural code models (NCM), represented by Tencent Cloud's AI code assistant CodeBuddy in China and Amazon CodeWhisperer abroad, have penetrated deeply into every stage of code development. In practice, because developing and training a high-performing code model is very costly, most developers and users choose to use a code model pre-trained by a third party directly, that is, machine learning as a service. However, to strengthen code models on various code intelligence tasks, model trainers often acquire large-scale code datasets from the internet or from third-party data providers. Research shows that, during such training, an attacker may implant a hidden backdoor into the code model without the trainer noticing. The infected code model behaves normally on clean inputs, but when a malicious sample carrying a trigger is input, it outputs the target label expected by the attacker. Such attacks are difficult for ordinary users to identify and endanger machine learning as a service. One way to detect such attacks is to determine whether the test data contains a trigger, so that a triggered sample is detected and filtered out before it is input to the code model.
This form of defense can work alongside other backdoor defenses and can also supply prior knowledge of trigger samples within a comprehensive defense pipeline, helping downstream defense stages carry out statistical analysis of backdoor samples and weaken the impact of the backdoor more efficiently. Although many defense methods exist at present, most of them require access to, or even modification of, the code model weights under a white-box setting and therefore cannot be applied to machine-learning-as-a-service scenarios, while the few black-box defenses make implicit assumptions about the backdoor trigger and are easily bypassed by advanced backdoor attacks. A method is therefore needed that suits the black-box scenario and can detect malicious input samples using only the output of the code model, without making excessive assumptions about the trigger.

Disclosure of Invention

The invention aims to provide a backdoor detection method based on code model input samples in a black-box scenario. The method adds perturbations to the input sample of the code model and examines the difference in the code model's output before and after the perturbations are added to judge whether the input sample is malicious, so that malicious input samples can be detected using only the output of the code model and without making additional assumptions about the backdoor trigger. The invention is realized by the following technical solution.
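
The core idea stated above, querying the code model before and after perturbation and comparing its outputs against a threshold learned from clean samples, can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the formulas of the invention: `label_jump_rate` is a simplified stand-in for the perturbation trend deviation index, the threshold form (mean plus shift times standard deviation) is an assumed reading of the dynamic normalization threshold, and `toy_model`, the sample strings and the values in `CLEAN_INDICES` are hypothetical.

```python
from statistics import mean, pstdev
from typing import Sequence


def label_jump_rate(labels: Sequence[str]) -> float:
    """Fraction of adjacent label pairs that differ (simplified stand-in index)."""
    if len(labels) < 2:
        return 0.0
    jumps = sum(1 for prev, cur in zip(labels, labels[1:]) if prev != cur)
    return jumps / (len(labels) - 1)


def dynamic_threshold(clean_indices: Sequence[float], shift: float) -> float:
    """Assumed form: mean + shift * standard deviation of clean-sample indices."""
    return mean(clean_indices) + shift * pstdev(clean_indices)


def is_malicious(model, sample, perturbed, clean_indices, shift=2.0) -> bool:
    # Black-box access only: the model is queried for its output labels.
    labels = [model(sample)] + [model(p) for p in perturbed]
    return label_jump_rate(labels) > dynamic_threshold(clean_indices, shift)


def toy_model(code: str) -> str:
    """Hypothetical backdoored classifier: the token 'trig' forces the target label."""
    return "target" if "trig" in code else "normal"


# Hypothetical indices measured in advance on clean (non-malicious) samples.
CLEAN_INDICES = [0.0, 0.1, 0.0, 0.1]
# Hypothetical perturbed variants of a triggered sample and of a clean sample.
POISONED_PERTURBED = ["a trig c", "a xxxx c", "d trig c"]
CLEAN_PERTURBED = ["a c", "d c", "d e"]

print(is_malicious(toy_model, "a trig b", POISONED_PERTURBED, CLEAN_INDICES))
print(is_malicious(toy_model, "a b", CLEAN_PERTURBED, CLEAN_INDICES))
```

In this toy setting the triggered sample is flagged because its label flips as perturbations disturb the trigger, while the clean sample keeps a stable label and stays below the threshold. The detector never inspects model weights, which is what makes the approach usable in the black-box scenario.
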
The invention provides a backdoor detection method based on code model input samples in a black-box scenario, which comprises the following steps: acquiring an input sample of a code model, and designing perturbations based on the input sample; performing syntactic and lexical analysis on the input sample with an abstract syntax tree parsing tool to identify perturbation nodes at which a perturbation can be inserted; designing a perturbation addition strategy based on the perturbations and the perturbation nodes; based on the perturbation nodes and the perturbation addition strategy, gradually adding perturbations to the input sample to generate a perturbed sample list; inputting the input sample and the perturbed sample list into the code model respectively, and calculating the corresponding perturbation trend deviation index according to the code task to be performed; calculating, according to the code task to be performed, a dynamic normalization threshold for the perturbation trend deviation index from perturbation trend deviation indices obtained in advance after adding perturbations; and, if the perturbation trend deviation index calculated for the input sample of the corresponding code task is greater than the dynamic normalization threshold, judging the input sample to be malicious. Optionally, designing the perturbations based on the input sample includes designing a code logic dimension perturbation based on a code logic dimension of the inpu