
CN-121997329-A - Software vulnerability detection method based on feature fusion of pre-training model


Abstract

The invention discloses a software vulnerability detection method based on feature fusion of pre-trained models, comprising the following steps: 1) obtaining software code to generate a training set for the model, wherein the software code is the original source code of the software; 2) segmenting and preprocessing the software code to be detected; 3) after segmentation and preprocessing, converting the software code into token sequences using byte pair encoding (BPE); 4) feature fusion; 5) inputting the feature fusion vector representation into a deep-learning-based code vulnerability detection model to train the model; 6) testing the model, detecting vulnerabilities with the validated model, and determining from the detection result whether a code segment contains a vulnerability. The method maintains the detection accuracy of the model while also taking its detection performance into account.

Inventors

  • LI TAO

Assignees

  • 扬州慧鉴网安信息技术有限公司

Dates

Publication Date
2026-05-08
Application Date
2025-12-05

Claims (10)

  1. A software vulnerability detection method based on feature fusion of pre-trained models, characterized by comprising the following steps: 1) acquiring software code and generating a training set for the model, wherein the software code is the original source code of the software; 2) segmenting and preprocessing the acquired software code to be detected; 3) after the input data is segmented and preprocessed, converting the software code into token sequences using byte pair encoding (BPE); 4) feature fusion: extracting embedding vectors of the token sequences using an encoder-only pre-trained model and a decoder-only pre-trained model, respectively, and concatenating the embedding vectors from the two sources to obtain a feature fusion vector; 5) inputting the feature fusion vector representation into a deep-learning-based code vulnerability detection model and training the model; 6) testing the model, detecting vulnerabilities using the validated model, and determining from the detection result whether a code segment contains a vulnerability.
  2. The software vulnerability detection method based on feature fusion of pre-trained models according to claim 1, wherein in step 1) the training set is generated from training-set sample data synthesized with a large language model, specifically comprising the following steps: 1.1 sample expansion of the training dataset: based on the original code, guiding the large language model to generate data by zero-shot prompting and few-shot prompting, to obtain expanded samples; 1.2 sample filtering: filtering the samples to obtain expanded samples that conform to the specification; 1.3 sample mutation: 1.3.1 slicing the conforming expanded sample code and retaining the code statements related to the vulnerability; 1.3.2 after the code statements carrying the relevant vulnerability features are retained, mutating the statements unrelated to the vulnerability, thereby increasing the diversity of the dataset samples; 1.4 mixing the original samples, the conforming expanded samples, and the mutated samples to obtain class-balanced synthetic training-set sample data.
  3. The software vulnerability detection method based on feature fusion of pre-trained models according to claim 2, wherein the sample filtering comprises the following steps: 1.1.1 code similarity detection: checking the code data generated by the large model with the Simian tool and removing data samples whose similarity exceeds a set threshold, the threshold being set to 0.9 when configuring Simian so as to eliminate similar code; 1.1.2 sample filtering: pseudo-labeling the data samples to be checked based on the differences among three detection tools, then comparing against and correcting the original labels, as follows: (a) for a specific vulnerability type, running each of the three vulnerability detection tools on a small number of labeled dataset samples to obtain each tool's accuracy for that vulnerability type, the accuracy serving as the tool's prediction confidence for that vulnerability type on an unknown sample; the three vulnerability detection tools are the static analysis tool Fortify, the bounded model checking tool JBMC for Java code, and the AIDetectVul tool; (b) assigning weights to the three detection tools for the different vulnerability types; (c) after the weight for each category is obtained, running the three detection tools on the generated data samples, computing the weighted sum score of the predictions for each category, and taking the category with the highest score as the final detection result for the sample; (d) comparing the detection result with the original label category and correcting any erroneous annotations that may exist in the generated data samples, to obtain a dataset that meets the standard.
  4. The software vulnerability detection method based on feature fusion of pre-trained models according to claim 1, wherein in step 2) segmenting the acquired software code to be detected comprises: parsing the input original code to construct an abstract syntax tree (AST); parsing the software code to function-level granularity: traversing the AST and identifying function definition nodes, decomposing the whole code into a number of independent functions to obtain function-level code segments; 2.3 traversing the generated AST to obtain syntax tree nodes that may cause code vulnerabilities and, for data containing vulnerabilities, retaining the vulnerable function according to the vulnerability line numbers in the label.
  5. The software vulnerability detection method based on feature fusion of pre-trained models according to claim 1, wherein in step 3) the conversion proceeds as follows: each character in the software code is initially treated as a single token; in each iteration, the algorithm counts the frequencies of all adjacent character pairs and merges the most frequent pair into a new token; this process is repeated until a preset vocabulary size or number of merges is reached.
  6. The software vulnerability detection method based on feature fusion of pre-trained models according to claim 1, wherein in step 5) a Transformer model with a 6-layer encoder structure is adopted as the code vulnerability detection model.
  7. The software vulnerability detection method based on feature fusion of pre-trained models according to claim 6, wherein in step 5) the training process is as follows: 5.1 self-attention mechanism: 5.1.1 first, the input code feature vector matrix $X$ is multiplied by three different trainable weight matrices $W^Q$, $W^K$ and $W^V$ to obtain the Query matrix $Q$, the Key matrix $K$ and the Value matrix $V$: $Q = XW^Q$, $K = XW^K$, $V = XW^V$; 5.1.2 the dot product of the Query matrix $Q$ and the Key matrix $K$ is computed to obtain the attention score matrix $S$: $S = QK^{\top}$; 5.1.3 to stabilize training, the attention score matrix $S$ is divided by the square root of the Key vector dimension $d_k$ to obtain the normalized attention scores: $S' = S / \sqrt{d_k}$; 5.1.4 the normalized attention score matrix $S'$ is passed through a Softmax operation and then multiplied by the Value matrix $V$, giving the weighted Value matrix, which is the output of the self-attention mechanism: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(S')V$; 5.2 multi-head attention mechanism: the expressive power of the model is enhanced by running several attention mechanisms in parallel; the input sequence is linearly transformed to obtain several groups of Query, Key and Value matrices, attention weights are computed for each group and multiplied by the Value matrices to obtain weighted Value matrices, and the weighted matrices $\mathrm{head}_1, \dots, \mathrm{head}_h$ of all heads are concatenated and passed through a linear transformation to obtain the final multi-head attention output: $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O$, where $W^O$ is the output weight matrix; 5.3 residual connection and layer normalization are applied to the output of the multi-head attention layer; 5.4 the normalized result is fed into a feed-forward neural network for further processing, followed again by residual connection and layer normalization; 5.5 steps 5.1) to 5.4) constitute the processing of the input sequence by one encoder layer; the output of one encoder layer is passed to the next encoder layer, and the process is repeated to obtain the final encoder output; 5.6 after a pooling operation on the output, the feature dimension is mapped to a single output node through a fully connected layer for the final classification task.
  8. The software vulnerability detection method based on feature fusion of pre-trained models according to claim 1, wherein in step 2) preprocessing the acquired software code to be detected comprises reducing redundant information in the software code data through operations including removing comments and normalizing spaces and line breaks.
  9. An electronic device, characterized by comprising: one or more processors; and a storage device for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any one of claims 1 to 8.
  10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 8.
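
The merge procedure described in claim 5 can be sketched in a few lines of Python. This is a minimal illustration rather than the patent's implementation: the function name `learn_bpe_merges`, the `num_merges` parameter, and breaking frequency ties by first occurrence are assumptions made for the example. Production tokenizers typically work on bytes and apply a pre-tokenization step first.

```python
from collections import Counter


def learn_bpe_merges(text, num_merges):
    """Learn up to `num_merges` merge rules, starting from single characters.

    Hypothetical helper for illustration; not taken from the patent.
    """
    tokens = list(text)  # initially every character is its own token
    merges = []
    for _ in range(num_merges):
        # Count how often each pair of adjacent tokens occurs.
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break  # fewer than two tokens left, nothing to merge
        # Assumption: ties are resolved in favor of the pair seen first.
        (left, right), _count = pair_counts.most_common(1)[0]
        merges.append((left, right))
        # Rewrite the sequence, replacing each occurrence of the chosen pair.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges


# Example: learn three merges from a short code fragment.
example_tokens, example_merges = learn_bpe_merges("count = count + 1;", num_merges=3)
print(example_merges)
print(example_tokens)
```

Here the loop stops after a fixed number of merges; stopping at a preset vocabulary size, the other criterion named in the claim, would only change the loop condition.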

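Steps 5.1.1 to 5.1.4 of claim 7 describe what is commonly called scaled dot-product self-attention for a single head. The pure-Python sketch below illustrates that computation under simplifying assumptions: the helper names (`matmul`, `softmax_rows`, `self_attention`) and the list-of-rows matrix layout are hypothetical, and a real model would use a tensor library with learned weight matrices.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Apply a numerically stable softmax to each row of a matrix."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention; returns (output, attention weights)."""
    # 5.1.1: project the input into Query, Key and Value matrices.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    # 5.1.2: attention scores S = Q K^T.
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    # 5.1.3: scale by the square root of the key dimension d_k.
    d_k = len(k[0])
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    # 5.1.4: softmax over each row, then weight the Value matrix.
    weights = softmax_rows(scaled)
    return matmul(weights, v), weights


# Example with two 2-dimensional token vectors and identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
output, attn = self_attention(identity, identity, identity, identity)
print(attn)
```

Multi-head attention (step 5.2) would run this function once per head with separate weight matrices, concatenate the outputs column-wise, and multiply by the output matrix $W^O$.
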
Description

Software vulnerability detection method based on feature fusion of pre-training model

Technical Field

The invention relates to information security technology, and in particular to a software vulnerability detection method based on feature fusion of pre-trained models.

Background

The diversity and complexity of software leave existing rule-based and deep-learning-based models with insufficient generalization capability. Rule-based methods can hardly enumerate all vulnerability types and are costly to maintain. Data-driven deep learning models are limited by the scale and diversity of their training data and are easily affected by data bias. Vulnerability detection based on large models generalizes well, but its detection efficiency is low and its cost is high, so it does not meet the requirements of vulnerability detection in industrial settings. How to improve model generalization while maintaining performance has therefore become an urgent problem.

Disclosure of Invention

In view of the defects in the prior art, the invention provides a software vulnerability detection method based on feature fusion of pre-trained models.
The technical scheme adopted by the invention to solve the technical problem is a software vulnerability detection method based on feature fusion of pre-trained models, comprising the following steps:
1) Acquiring software code to form a training set, wherein the software code is the original source code of the software;
2) Code parsing and preprocessing;
2.1 For the input training set code, parsing the original code into abstract syntax tree form: the source code is parsed with Tree-sitter to construct an abstract syntax tree (AST);
2.2 Parsing the software code to function-level granularity: traversing the AST and identifying function definition nodes, decomposing the whole code into a number of independent functions to obtain function-level code segments;
2.3 Traversing the generated AST to obtain syntax tree nodes that may cause code vulnerabilities and, for data containing vulnerabilities, retaining the vulnerable function according to the vulnerability line numbers in the label;
2.4 Reducing redundant information in the software code data through operations including removing comments and normalizing spaces and line breaks;
3) Segmenting the software code data processed in step 2) and converting the software code into token sequences using byte pair encoding (BPE);
The conversion proceeds as follows: each character in the software code is initially treated as a single token; in each iteration, the algorithm counts the frequencies of all adjacent character pairs and merges the most frequent pair into a new token; this process is repeated until a preset vocabulary size or number of merges is reached;
4) Feature fusion: extracting embedding vectors of the token sequences using an encoder-only pre-trained model and a decoder-only pre-trained model, respectively, and concatenating the embedding vectors from the two sources to obtain a feature fusion vector;
5) Inputting the feature fusion vector representation into a deep-learning-based code vulnerability detection model and training the model;
6) Testing the model, detecting vulnerabilities using the validated model, and determining from the detection result whether a code segment contains a vulnerability.

According to the above scheme, in step 1) the training set is generated from training-set sample data synthesized with a large language model, specifically comprising the following steps:
1.1 Sample expansion of the training dataset: based on the original code, guiding the large language model to generate data by zero-shot prompting and few-shot prompting, to obtain expanded samples;
1.2 Sample filtering: filtering the samples to obtain expanded samples that conform to the specification;
1.3 Sample mutation;
1.3.1 Slicing the conforming expanded sample code and retaining the code statements related to the vulnerability;
1.3.2 After the code statements carrying the relevant vulnerability features are retained, mutating the statements unrelated to the vulnerability, thereby increasing the diversity of the dataset samples;
1.4 Mixing the original samples, the conforming expanded samples, and the mutated samples to obtain class-balanced synthetic training-set sample data.

According to the above scheme, the sample filtering comprises the following steps:
1.1.1 Code similarity detection: detecting code data generated by the large model using the Simian tool, removing data samples with similarity higher than a set threshold value, and setting the threshold value t