CN-121980572-A - Reentrant vulnerability detection method and device based on large language model combined generalization
Abstract
The disclosure relates to a method and a device for detecting reentrant vulnerabilities based on large language model combined generalization. The method comprises: acquiring an input program; inputting the program into a preset large language model, which outputs a hidden layer and model weights; freezing the model weights; and, based on the frozen model weights, inputting the hidden layer into a preset reentrant vulnerability detection model, which processes the hidden layer and outputs a classification label for the program. The processing performed by the preset reentrant vulnerability detection model comprises: decomposing the hidden layer according to preset factors and determining the feature vector of the branch corresponding to each preset factor; determining a gating weight for the feature vector of each branch; performing weighted fusion of the feature vectors of all branches according to the gating weights to obtain a comprehensive characterization vector; and performing classification calculation on the comprehensive characterization vector to determine the classification label of the input program. The method can improve the accuracy and reliability of reentrant vulnerability detection.
Inventors
- Wu Faguo
- ZHOU YING
- WEI JIACHENG
- YU QI
- ZHANG XIAO
- ZHENG ZHIMING
Assignees
- Beihang University (北京航空航天大学)
Dates
- Publication Date
- 20260505
- Application Date
- 20251223
Claims (10)
- 1. A reentrant vulnerability detection method based on large language model combined generalization, characterized by comprising the following steps: acquiring an input program; inputting the program into a preset large language model, outputting a hidden layer and model weights, and freezing the model weights; based on the frozen model weights, inputting the hidden layer into a preset reentrant vulnerability detection model for processing, and outputting a classification label of the program, wherein the classification label indicates whether a reentrant vulnerability exists in the program; wherein the processing performed by the preset reentrant vulnerability detection model comprises: decomposing the hidden layer according to preset factors, and determining the feature vector of the branch corresponding to each preset factor; determining the gating weight of the feature vector of each branch, and performing weighted fusion of the feature vectors of all branches according to the gating weights to obtain a comprehensive characterization vector; and performing classification calculation on the comprehensive characterization vector to determine the classification label of the program.
- 2. The reentrant vulnerability detection method based on large language model combined generalization according to claim 1, wherein the preset factors comprise four branches, namely external call, state update, data dependency, and order factor, and wherein decomposing the hidden layer according to the preset factors and determining the feature vector of the branch corresponding to each preset factor comprises: decomposing the input program according to the four branches of external call, state update, data dependency, and order factor to obtain a hidden state function corresponding to each preset factor; determining the feature vector of the branch corresponding to each preset factor by a formula whose terms are: the feature vector of the branch corresponding to each preset factor; the input program; and the hidden state function of a preset factor, the factor index corresponding respectively to external call, state update, data dependency, and order factor; and determining the gating weight of the branch corresponding to each preset factor by a formula whose terms are: the gating weight of a preset factor; the branch score corresponding to that preset factor; a temperature parameter; and a normalizing denominator that traverses the four factor branches and is the total score of the branches corresponding to all the preset factors.
- 3. The reentrant vulnerability detection method based on large language model combined generalization according to claim 2, wherein determining the gating weight of the feature vector of each branch and performing weighted fusion of the feature vectors of all branches according to the gating weights to obtain a comprehensive characterization vector comprises applying a formula whose result is the comprehensive characterization vector of the input program P.
- 4. The reentrant vulnerability detection method based on large language model combined generalization according to any one of claims 1 to 3, wherein performing classification calculation on the comprehensive characterization vector to determine the classification label of the program comprises: performing classification calculation on the comprehensive characterization vector by a formula that yields a composite function, the terms of the formula being a weight vector and a bias term, wherein the inner product of the weight and the score represents the classifier's evaluation of the fused representation; and transforming the composite function by a further formula to determine a prediction probability, so that the classification label of the program is determined from the prediction probability, wherein the transformation is the Sigmoid activation function, which performs normalization.
- 5. The reentrant vulnerability detection method based on large language model combined generalization according to claim 4, wherein performing classification calculation on the comprehensive characterization vector to determine the classification label of the program further comprises: determining the overall loss function of the classification calculation by a formula whose terms are: an expectation, which averages the loss over all samples in a training batch; a cross-entropy loss, which measures the difference between the vulnerability probability predicted by the model and the true label y; a Jacobian matrix alignment loss, which constrains the gating weights to be consistent with the actual contribution of each branch; and an adjustable hyperparameter, which controls the relative strength of the alignment term in the total loss; wherein the Jacobian matrix alignment loss is determined by a formula whose terms are: a divergence, which measures the difference between two distributions; the actual gating weight distribution; and a theoretical ideal weight distribution; and wherein the theoretical ideal weight distribution is determined by a formula whose terms are: the sensitivity of each branch; a gradient operator, which denotes differentiation with respect to the trainable parameters of an individual branch; the log probability corresponding to the true label of the input program; the L2 norm, which measures the magnitude of the gradient vector; and the sum of all branch sensitivities, which normalizes the sensitivity of each branch.
- 6. The reentrant vulnerability detection method based on large language model combined generalization according to claim 5, further comprising: training the preset reentrant vulnerability detection model on a dataset comprising at least one of a synthetic dataset, an external call dataset, and a data dependency dataset.
- 7. The reentrant vulnerability detection method based on large language model combined generalization according to claim 6, wherein the dataset is the synthetic dataset, and the synthetic dataset satisfies a condition given by a formula whose terms are: two input programs; two Lipschitz constants; the composite functions corresponding to the two input programs; the functions of the four preset factors that constitute the composite functions of the two input programs; the support set of the test distribution; the dimension of the latent space; the latent factors; a Jacobian matrix taken with respect to the latent factors; a reference point; and an integral.
- 8. The reentrant vulnerability detection method based on large language model combined generalization according to claim 6, wherein the dataset is the external call dataset, and the method for constructing the external call dataset comprises: acquiring a preset quality sample; expanding a dataset based on the preset quality sample; and performing consistency checking and preference refinement on the expanded dataset to determine the external call dataset.
- 9. The reentrant vulnerability detection method based on large language model combined generalization according to claim 6, wherein the dataset is the data dependency dataset, and the method for constructing the data dependency dataset comprises: determining a dependency rule table for the data dependency dataset, wherein the dependency rule table comprises a dependency direction, legal data connection points, and structural constraints; simultaneously generating positive and negative samples, namely a dependent variant and a non-dependent variant, based on the dependency rule table and a preset contract code; and performing semantic verification on the positive and negative samples, and obtaining the data dependency dataset from the verified samples.
- 10. A reentrant vulnerability detection device based on large language model combined generalization, characterized by comprising: an input module configured to acquire an input program; a preprocessing module configured to input the program into a preset large language model, output a hidden layer and model weights, and freeze the model weights; and a processing module configured to input the hidden layer, based on the frozen model weights, into a preset reentrant vulnerability detection model for processing, and to output a classification label of the program, wherein the classification label indicates whether a reentrant vulnerability exists in the program; wherein the preset reentrant vulnerability detection model comprises: a feature extraction unit configured to decompose the hidden layer according to preset factors and determine the feature vector of the branch corresponding to each preset factor; an adaptive fusion unit configured to determine the gating weight of the feature vector of each branch, and to perform weighted fusion of the feature vectors of all branches according to the gating weights to obtain a comprehensive characterization vector; and a classification unit configured to perform classification calculation on the comprehensive characterization vector and determine the classification label of the program, wherein the classification label indicates whether a reentrant vulnerability exists in the program.
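The training objective described in claim 5 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the KL direction (actual gating weights against ideal weights), the uniform fallback when all sensitivities are zero, and the default hyperparameter value `lam=0.1` are all assumptions made for illustration. The per-branch sensitivities, which the claim defines as L2 norms of gradients of the true-label log probability, are passed in as plain numbers here because computing them would require an automatic differentiation framework.

```python
import math


def cross_entropy(p, y, eps=1e-12):
    """Binary cross-entropy between predicted probability p and true label y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def ideal_weights(sensitivities):
    """Normalize per-branch sensitivities (gradient L2 norms) so they sum to 1."""
    total = sum(sensitivities)
    if total == 0:
        n = len(sensitivities)
        return [1.0 / n] * n  # assumed fallback: uniform when all gradients vanish
    return [s / total for s in sensitivities]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def sample_loss(p, y, gates, sensitivities, lam=0.1):
    """Cross-entropy plus lam times the alignment term (assumed direction)."""
    align = kl_divergence(gates, ideal_weights(sensitivities))
    return cross_entropy(p, y) + lam * align


def batch_loss(batch, lam=0.1):
    """Average the per-sample loss over a training batch."""
    return sum(sample_loss(*sample, lam=lam) for sample in batch) / len(batch)


# Each sample: (predicted probability, true label, gating weights, sensitivities).
batch = [
    (0.9, 1, [0.4, 0.3, 0.2, 0.1], [2.0, 1.5, 1.0, 0.5]),      # gates match ideal
    (0.2, 0, [0.25, 0.25, 0.25, 0.25], [1.0, 1.0, 1.0, 3.0]),  # gates misaligned
]
print(batch_loss(batch))
```

In this sketch the alignment term vanishes for the first sample, whose gating weights already equal the normalized sensitivities, and is positive for the second, so only the second sample is penalized for misaligned gating.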
Description
Reentrant vulnerability detection method and device based on large language model combined generalization
Technical Field
The disclosure relates to the technical field of large language models, and in particular to a method and a device for detecting reentrant vulnerabilities based on large language model combined generalization.
Background
Large language models (LLMs) acquire broad capabilities through pre-training on massive text corpora and, in the field of artificial intelligence in particular, enable better human-computer interaction. However, in the specialized vertical domain of reentrant vulnerability detection, update training of existing large language models does not meet users' requirements.
Disclosure of Invention
To solve the above technical problems, the disclosure provides a method and a device for detecting reentrant vulnerabilities based on large language model combined generalization, which aim to solve the problems in the prior art.
In a first aspect of the present disclosure, a reentrant vulnerability detection method based on large language model combined generalization is provided, the method comprising: acquiring an input program; inputting the program into a preset large language model, outputting a hidden layer and model weights, and freezing the model weights; based on the frozen model weights, inputting the hidden layer into a preset reentrant vulnerability detection model for processing, and outputting a classification label of the program, wherein the classification label indicates whether a reentrant vulnerability exists in the program; wherein the processing performed by the preset reentrant vulnerability detection model comprises: decomposing the hidden layer according to preset factors, and determining the feature vector of the branch corresponding to each preset factor; determining the gating weight of the feature vector of each branch, and performing weighted fusion of the feature vectors of all branches according to the gating weights to obtain a comprehensive characterization vector; and performing classification calculation on the comprehensive characterization vector to determine the classification label of the program.
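The processing steps of the first aspect (decompose, gate, fuse, classify) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the disclosed model: the class name `DetectionHead`, the use of a linear projection as each branch's hidden state function, the linear branch scoring, the random initialization, and the 0.5 decision threshold are all hypothetical. The `hidden_state` list stands in for the hidden layer that a frozen large language model would output.

```python
import math
import random

# Hypothetical identifiers for the four preset factors named in the text.
FACTORS = ("external_call", "state_update", "data_dependency", "order")


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores, temperature=1.0):
    """Temperature-scaled softmax; subtracting the max keeps exp() stable."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class DetectionHead:
    """Assumed trainable head applied to a frozen LLM hidden state."""

    def __init__(self, hidden_dim, branch_dim, temperature=1.0, seed=0):
        rng = random.Random(seed)

        def vec(n):
            return [rng.uniform(-0.1, 0.1) for _ in range(n)]

        # One projection matrix and one scoring vector per factor (assumption).
        self.proj = {k: [vec(hidden_dim) for _ in range(branch_dim)] for k in FACTORS}
        self.score = {k: vec(branch_dim) for k in FACTORS}
        self.w = vec(branch_dim)  # classifier weight vector
        self.b = 0.0              # classifier bias term
        self.temperature = temperature

    def forward(self, hidden):
        # 1. Decompose: one feature vector per preset factor.
        feats = {k: [dot(row, hidden) for row in self.proj[k]] for k in FACTORS}
        # 2. Gate: score each branch, then normalize across the four branches.
        gates = softmax([dot(self.score[k], feats[k]) for k in FACTORS],
                        self.temperature)
        # 3. Fuse: gate-weighted sum of the branch feature vectors.
        fused = [sum(g * feats[k][i] for g, k in zip(gates, FACTORS))
                 for i in range(len(self.w))]
        # 4. Classify: linear score followed by a sigmoid.
        prob = sigmoid(dot(self.w, fused) + self.b)
        return prob, dict(zip(FACTORS, gates))


hidden_state = [0.5, -0.2, 0.1, 0.3, -0.4, 0.0, 0.2, 0.1]  # stand-in for LLM output
head = DetectionHead(hidden_dim=8, branch_dim=4)
probability, gate_weights = head.forward(hidden_state)
label = int(probability >= 0.5)  # assumed decision threshold
```

Only the head's parameters would be trained in such a design; the language model that produces `hidden_state` stays frozen, which is what keeps the number of trainable parameters small.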
In some embodiments of the present disclosure, the preset factors comprise four branches, namely external call, state update, data dependency, and order factor, and decomposing the hidden layer according to the preset factors and determining the feature vector of the branch corresponding to each preset factor comprises: decomposing the input program according to the four branches of external call, state update, data dependency, and order factor to obtain a hidden state function corresponding to each preset factor; determining the feature vector of the branch corresponding to each preset factor by a formula whose terms are: the input program; and the hidden state function of a preset factor, the factor index corresponding respectively to external call, state update, data dependency, and order factor; and determining the gating weight of the branch corresponding to each preset factor by a formula whose terms are: the gating weight of a preset factor; the branch score corresponding to that preset factor; a temperature parameter; and a normalizing denominator that traverses the four factor branches.
In some embodiments of the present disclosure, determining the gating weight of the feature vector of each branch and performing weighted fusion of the feature vectors of all branches according to the gating weights to obtain a comprehensive characterization vector comprises applying a formula whose result is the comprehensive characterization vector of the input program P.
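The formulas for the feature vectors, gating weights, and fusion do not survive in this text. One plausible reading of the surrounding definitions is a temperature-scaled softmax over branch scores followed by a weighted sum, written below in LaTeX (using the `align` environment from `amsmath`). Apart from the input program $P$, which the text names, every symbol here ($h_k$, $f_k$, $\alpha_k$, $s_k$, $\tau$, $z$), the exponential form of the gate, and the constraint $\tau > 0$ are assumptions introduced for illustration.

```latex
% Hedged reconstruction; all symbol names other than P are assumptions.
\begin{align}
  h_k &= f_k(P),
      \qquad k \in \{\mathrm{ext}, \mathrm{state}, \mathrm{dep}, \mathrm{ord}\} \\
  \alpha_k &= \frac{\exp(s_k / \tau)}{\sum_{j} \exp(s_j / \tau)},
      \qquad \tau > 0 \\
  z(P) &= \sum_{k} \alpha_k \, h_k
\end{align}
```

Under this reading, a small $\tau$ concentrates the gating weight on the highest-scoring branch, while a large $\tau$ spreads it more evenly across the four branches.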
In some embodiments of the present disclosure, performing classification calculation on the comprehensive characterization vector to determine the classification label of the program comprises: performing classification calculation on the comprehensive characterization vector by a formula that yields a composite function, the terms of the formula being a weight vector and a bias term, wherein the inner product of the weight and the score represents the classifier's evaluation of the fused representation; and transforming the composite function by a further formula to determine a prediction probability, so that the classification label of the program is determined from the prediction probability, wherein the transformation is the Sigmoid activation function, which performs normalization. In some embodiments of the present disclosure, performing classification calculation on the comprehensive characterization vector to determine the classification label of the program further comprises: determining the overall loss function of the classification calculation by a formula whose terms are: an expectation, which averages the loss over all samples