CN-122020663-A - Software vulnerability mining method, apparatus, computer device and medium based on collaboration between static analysis and large and small models


Abstract

Embodiments of the invention provide a software vulnerability mining method, apparatus, computer device and medium based on collaboration between static analysis and large and small models. The method comprises: identifying and classifying attack surfaces to generate a code line annotation node set; performing first-stage rule-based filtering on the code line annotation node set; performing second-stage filtering on the once-filtered annotation node set based on a Siamese (twin) neural network to obtain a high-confidence attack surface set; generating a basic data flow graph by taking high-confidence nodes in the high-confidence attack surface set as tracking starting points; obtaining a frontier node set through static analysis; expanding the frontier node set through a large language model and a path expansion algorithm to generate a completed data flow graph; and detecting vulnerabilities based on the completed data flow graph to obtain a mining result of the software vulnerabilities. Through the collaboration of the large language model and the Siamese neural network, the scheme reduces the false positive rate and improves the coverage and accuracy of vulnerability mining.

Inventors

CHEN ZHIHAO
HONG RUIQI
SHI LIN
ZHANG LI

Assignees

Beihang University (北京航空航天大学)

Dates

Publication Date: 20260512
Application Date: 20260127

Claims (10)

1. A software vulnerability mining method based on collaboration between static analysis and large and small models, characterized by comprising the following steps:
identifying and classifying attack surfaces to generate a code line annotation node set, performing first-stage rule-based filtering on the code line annotation node set to generate a once-filtered annotation node set, and performing second-stage filtering on the once-filtered annotation node set based on a Siamese (twin) neural network to generate a high-confidence attack surface set;
taking high-confidence nodes in the high-confidence attack surface set as tracking starting points and performing data flow analysis to generate a basic data flow graph, obtaining a frontier node set through static analysis based on the basic data flow graph, and expanding the frontier node set through a large language model and a path expansion algorithm to generate a completed data flow graph; and
detecting vulnerabilities on the completed data flow graph using a multi-agent vulnerability verification method to obtain a mining result of the software vulnerabilities.
2. The software vulnerability mining method based on collaboration between static analysis and large and small models according to claim 1, wherein identifying and classifying attack surfaces to generate a code line annotation node set, and performing first-stage rule-based filtering on the code line annotation node set to generate a once-filtered annotation node set, comprises:
performing a static scan of a target software project, identifying all potential input points based on a preset attack surface classification system, and generating a corresponding code line annotation node for each potential input point;
encapsulating, in each code line annotation node, the identifier of the variable or function together with the complete source code line in which that variable or function appears, to generate the code line annotation node set;
setting a heuristic rule set, wherein the heuristic rule set comprises one or more of the following rules: eliminating literal constants and variables statically inferred to be of a safe basic type; eliminating variable definition nodes located within exception handling structures; and eliminating nodes that contain only built-in function calls with no system impact; and
performing first-stage risk filtering on the code line annotation node set based on the heuristic rule set to eliminate low-risk nodes and generate the once-filtered annotation node set.
3. The software vulnerability mining method based on collaboration between static analysis and large and small models according to claim 1, wherein performing second-stage filtering on the once-filtered annotation node set based on a Siamese (twin) neural network to generate a high-confidence attack surface set comprises:
vectorizing the identifier content and the complete source code of each node in the once-filtered annotation node set to generate a composite semantic feature vector for each node;
constructing a Siamese neural network based on two weight-sharing sub-networks;
inputting the composite semantic feature vectors into the trained Siamese neural network, performing inference on the nodes in the once-filtered annotation node set, and calculating the average semantic distance between the output vector of each node and a preset reference set of vulnerability data source points; and
determining nodes whose average semantic distance is smaller than a preset decision threshold to be high-risk nodes, and combining all high-risk nodes to generate the high-confidence attack surface set.
4. The software vulnerability mining method based on collaboration between static analysis and large and small models according to claim 3, wherein constructing a Siamese neural network based on two weight-sharing sub-networks comprises:
constructing two sub-networks with shared weights, wherein each sub-network uses a multi-layer bidirectional long short-term memory (BiLSTM) network as its core feature extractor, and the Siamese neural network is used to extract bidirectional long-range dependency semantic features from code sequences and to output high-dimensional dense vectors;
constructing training data consisting of positive sample pairs and negative sample pairs from vulnerability samples labeled as real attack surfaces and non-vulnerability benign noise samples, and training the Siamese neural network on the training data; and
optimizing the Siamese neural network with a contrastive loss function, so that the Euclidean distance between the members of a positive sample pair in the feature space is reduced and the distance between the members of a negative sample pair is increased beyond a preset margin threshold.
5. The software vulnerability mining method based on collaboration between static analysis and large and small models according to claim 1, wherein obtaining a frontier node set through static analysis based on the basic data flow graph, and expanding the frontier node set through a large language model and a path expansion algorithm to generate a completed data flow graph, comprises:
identifying, in the basic data flow graph, all nodes whose out-degree is zero and which are not sensitive operations, to form the frontier node set;
for each frontier node in the frontier node set, performing static analysis to extract the code context information of the frontier node, and inputting the code context information into a large language model for semantic reasoning and prediction, to generate an implicit successor node set;
merging the implicit successor node set into the basic data flow graph and establishing directed edges from each frontier node to its corresponding successor nodes, to obtain a preliminarily expanded data flow graph; and
based on the preliminarily expanded data flow graph, re-executing static analysis to mine the propagation paths of the newly added nodes, and iterating the expansion until a termination condition is met, to generate the completed data flow graph.
6. The software vulnerability mining method based on collaboration between static analysis and large and small models according to claim 5, wherein, for each frontier node in the frontier node set, performing static analysis to extract the code context information of the frontier node, and inputting the code context information into a large language model for semantic reasoning and prediction to generate an implicit successor node set, comprises:
for each frontier node in the frontier node set, executing the following steps to obtain implicit successor nodes, until all frontier nodes have been processed and the implicit successor node set is generated:
determining whether the current frontier node has already been expanded, and whether the number of nodes on the current propagation path has reached a preset threshold;
if the node has not been expanded and the preset threshold has not been reached, extracting the code line annotation node information corresponding to the current frontier node and the context semantics of the code segment in which the current frontier node is located, and constructing a structured prompt from the code line annotation node information and the context semantics;
inputting the prompt into the large language model, having the large language model analyze the possible flow directions of the data from multiple perspectives, and predicting one or more implicit successor nodes according to the possible flow directions of the data; and
eliminating duplicate nodes among the predicted implicit successor nodes.
7. The software vulnerability mining method based on collaboration between static analysis and large and small models according to any one of claims 1 to 6, wherein detecting vulnerabilities on the completed data flow graph using a multi-agent vulnerability verification method to obtain a mining result of the software vulnerabilities comprises:
configuring a first agent as a sensitive operation analyzer, and inputting the completed data flow graph to the first agent;
in the first agent, scanning the end nodes of all paths in the completed data flow graph, analyzing whether each end node constitutes a sensitive operation by means of a Common Weakness Enumeration (CWE) knowledge base, and outputting a first structured report, wherein the first structured report comprises the identifiers of the sensitive operation nodes and the weakness type corresponding to each sensitive operation node;
configuring a second agent as a vulnerability validator, and inputting the completed data flow graph and the first structured report to the second agent;
in the second agent, for each sensitive operation node identified in the first structured report, tracing backward from the sensitive operation node to the data flow source point, analyzing the context semantics of the complete propagation path, checking whether data sanitization or conditional constraints exist on the complete propagation path, and outputting a second structured report, wherein the second structured report comprises, for each analyzed complete propagation path, a determination result on vulnerability exploitability and the reason for the determination; and
obtaining the mining result of the software vulnerabilities based on the second structured report.
8. A software vulnerability mining apparatus based on collaboration between static analysis and large and small models, characterized in that the software vulnerability mining apparatus comprises:
a trusted node construction module, configured to identify and classify attack surfaces to generate a code line annotation node set, perform first-stage rule-based filtering on the code line annotation node set to generate a once-filtered annotation node set, and perform second-stage filtering on the once-filtered annotation node set based on a Siamese (twin) neural network to generate a high-confidence attack surface set;
a data flow reconstruction module, configured to take high-confidence nodes in the high-confidence attack surface set as tracking starting points and perform data flow analysis to generate a basic data flow graph, obtain a frontier node set through static analysis based on the basic data flow graph, and expand the frontier node set through a large language model and a path expansion algorithm to generate a completed data flow graph; and
a vulnerability verification module, configured to detect vulnerabilities on the completed data flow graph using a multi-agent vulnerability verification method to obtain a mining result of the software vulnerabilities.
9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the software vulnerability mining method based on collaboration between static analysis and large and small models according to any one of claims 1 to 7.
10. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program for executing the software vulnerability mining method based on collaboration between static analysis and large and small models according to any one of claims 1 to 7.
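
As an illustration of the frontier expansion recited in claims 5 and 6, the following is a minimal sketch and not the implementation of the invention. It assumes the data flow graph is a plain adjacency mapping. The names `predict_successors` (a stand-in for the large language model query), `static_successors` (a stand-in for re-running static analysis) and `max_path_nodes` (the preset path-length threshold) are hypothetical and do not appear in the source.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Set

Graph = Dict[str, Set[str]]  # node -> set of successor nodes


def frontier_nodes(graph: Graph, sinks: Set[str]) -> List[str]:
    """Nodes with out-degree zero that are not sensitive operations."""
    return sorted(n for n, succ in graph.items() if not succ and n not in sinks)


def path_lengths(graph: Graph, source: str) -> Dict[str, int]:
    """Number of nodes on the shortest path from the source to each node."""
    depth = {source: 1}
    queue = deque([source])
    while queue:
        cur = queue.popleft()
        for nxt in graph.get(cur, ()):
            if nxt not in depth:
                depth[nxt] = depth[cur] + 1
                queue.append(nxt)
    return depth


def complete_data_flow_graph(
    graph: Graph,
    source: str,
    sinks: Set[str],
    predict_successors: Callable[[str], Iterable[str]],  # hypothetical LLM stand-in
    static_successors: Callable[[str], Iterable[str]],   # hypothetical analyzer stand-in
    max_path_nodes: int = 8,                              # assumed threshold value
) -> Graph:
    # Work on a copy so the basic data flow graph is left untouched.
    completed: Graph = {n: set(s) for n, s in graph.items()}
    for succ in list(completed.values()):
        for n in succ:
            completed.setdefault(n, set())

    expanded: Set[str] = set()
    while True:
        depth = path_lengths(completed, source)
        # Skip nodes already expanded or whose path reached the threshold.
        todo = [
            n for n in frontier_nodes(completed, sinks)
            if n not in expanded and depth.get(n, max_path_nodes) < max_path_nodes
        ]
        if not todo:  # termination condition: nothing left to expand
            return completed
        for node in todo:
            expanded.add(node)
            for succ in predict_successors(node):
                if succ == node:
                    continue
                completed.setdefault(succ, set())
                completed[node].add(succ)  # the set removes duplicate predictions
                # Re-run "static analysis" from the newly added node.
                worklist = deque([succ])
                while worklist:
                    cur = worklist.popleft()
                    for nxt in static_successors(cur):
                        if nxt not in completed[cur]:
                            completed.setdefault(nxt, set())
                            completed[cur].add(nxt)
                            worklist.append(nxt)


if __name__ == "__main__":
    basic = {"request.args": {"name"}, "name": {"handler"}, "handler": set()}
    llm = {"handler": ["cmd", "cmd"]}   # toy predictions, with a duplicate
    static = {"cmd": ["os.system"]}     # toy statically resolvable edge
    print(complete_data_flow_graph(
        basic, "request.args", {"os.system"},
        lambda n: llm.get(n, []), lambda n: static.get(n, []),
    ))
```

In this toy run the broken chain at `handler` is bridged by a predicted successor, after which the stand-in static analysis reaches the sensitive operation. In a real system the predictor would build a structured prompt from the node's annotation and code context, and its outputs would still need the verification stage of claim 7.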

Description

Software vulnerability mining method, apparatus, computer device and medium based on collaboration between static analysis and large and small models

Technical Field

The present invention relates to the field of data mining technologies, and in particular to a software vulnerability mining method, apparatus, computer device, and medium based on collaboration between static analysis and large and small models.

Background

Software vulnerabilities seriously threaten system security and may lead to data leakage, service interruption, and huge economic loss. Among existing vulnerability mining techniques, static analysis is widely used because it can reason globally without executing code. Taint analysis is one of the core techniques of static analysis. Its basic principle is to track whether untrusted input data can propagate to and reach sensitive operations, so as to detect serious vulnerabilities such as code injection and SQL injection.

Vulnerability detection based on taint analysis typically involves three key steps: first, identifying externally controllable inputs as the attack surface; then, tracking the propagation paths of those inputs along the data flow; and finally, verifying whether the data flows to a sensitive operation and triggers a vulnerability. As software grows in scale and complexity, this classical workflow faces serious challenges in practice.

At present, industry and academia mainly adopt two types of methods for vulnerability mining. The first type is traditional static analysis tools based on rules and configuration. Representative tools include CodeQL, Semgrep, and FlowDroid. Such tools typically rely on extensible configuration or API models to automatically discover attack surfaces and track paths through control flow graphs and data flow graphs.
Some tools, such as PyCG and PyAnalyzer, also introduce pointer analysis or object modeling to strengthen support for specific language features.

The second type is assisted analysis methods based on large language models. With the development of deep learning, studies such as IRIS and LLift have recently appeared that use large language models (LLMs) for vulnerability detection. These methods use the semantic reasoning capability of large models to help identify high-risk sources and sinks and, to some extent, to infer missing propagation paths, in an attempt to compensate for the weakness of traditional static analysis in semantic understanding.

Despite the progress made by the prior art in vulnerability mining, the following significant drawbacks remain when facing modern large-scale, highly dynamic software code:

Attack surface identification produces a large number of false positives. When a traditional static analysis tool identifies program entry points, it often has difficulty distinguishing real attack surfaces, which can actually introduce malicious data, from harmless internal calls or constants. As project size increases, the number of potential entry points grows explosively, and existing automated tools lack efficient pruning mechanisms. This wastes analysis and computing resources and generates a large number of false positives, greatly increasing the cost of manual auditing.

Breaks in the data flow graph lead to missed vulnerabilities. Modern programming languages (e.g., Python) are highly dynamic, with extensive dynamic typing, aliasing, dynamic dispatch, and implicit information flows. Conventional static analyzers have difficulty resolving these complex semantic relationships accurately, so the data flow graphs they build are often fragmented and discontinuous.
When the data propagation chain breaks at an intermediate link, the analysis tool cannot track the data to the final sensitive operation, which causes serious missed vulnerability reports.

A single technical approach has inherent limitations. Relying solely on static analysis easily runs into path explosion, making it difficult to complete the analysis in a reasonable time. Relying solely on a large language model is limited by the size of the context window, which makes global reasoning over an entire code repository difficult, and it also suffers from hallucination, that is, the model may fabricate nonexistent vulnerabilities or paths. The prior art lacks a collaborative mechanism that effectively combines the precision of static analysis with the semantic reasoning capability of large models.

Disclosure of Invention

In view of the above, embodiments of the invention provide a software vulnerability mining method based on collaboration between static analysis and large and small models, so as to solve the technical problems in the prior art that, in large-scale software vulnerability detection, the number of false alarms