CN-121980569-A - Vulnerability restoration method and system integrating code graph representation and knowledge retrieval enhancement
Abstract
The invention relates to a vulnerability repair method and system integrating code graph representation and knowledge retrieval enhancement. The method comprises: obtaining a vulnerability data set containing a plurality of vulnerability instances; performing semantic extraction on each vulnerability instance to obtain the corresponding vulnerability cause and repair strategy, storing them in structured form, and constructing a vulnerability knowledge base; obtaining the target vulnerability code to be repaired and its vulnerability localization information; converting the target vulnerability code into a code property graph and pruning the graph according to the vulnerability localization information; extracting a semantic feature vector of the target vulnerability code, computing its similarity to each knowledge item in the vulnerability knowledge base, and retrieving the most relevant vulnerability knowledge; and forming enhancement information, fusing it with basic information to generate a prompt, and using the prompt to drive a large language model to generate candidate repair patches. Compared with the prior art, the invention achieves automated vulnerability repair of higher quality and higher reliability.
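The pruning step can be illustrated with a minimal Python sketch that follows the forward and backward traversal described in claim 4. Everything concrete here is an assumption made for illustration: the graph encoding (a dict mapping node id to source line, plus `(src, dst, label)` edge triples), the hypothetical helper names `reachable` and `prune_cpg`, and the choice to traverse transitively over all edges, since this text does not specify how vulnerability edges are marked. In practice the graph would come from a static analysis tool such as Joern rather than hand-written data.

```python
from collections import defaultdict, deque


def reachable(start_nodes, adjacency):
    """Breadth-first search: all nodes reachable from start_nodes (inclusive)."""
    seen = set(start_nodes)
    queue = deque(start_nodes)
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def prune_cpg(nodes, edges, vulnerable_lines):
    """Keep only nodes related to the vulnerability, and the edges among them.

    nodes: {node_id: source_line}            (assumed encoding)
    edges: iterable of (src, dst, label)     (assumed encoding)
    vulnerable_lines: set of source lines taken from the localization info
    """
    edges = list(edges)
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    for src, dst, _label in edges:
        successors[src].add(dst)
        predecessors[dst].add(src)

    # Nodes marked by the localization information.
    vuln_nodes = {n for n, line in nodes.items() if line in vulnerable_lines}
    # Forward and backward associated node sets (both include vuln_nodes).
    forward = reachable(vuln_nodes, successors)
    backward = reachable(vuln_nodes, predecessors)
    keep = forward | backward

    # Drop every other node together with the edges attached to it.
    kept_nodes = {n: line for n, line in nodes.items() if n in keep}
    kept_edges = [(s, d, l) for s, d, l in edges if s in keep and d in keep]
    return kept_nodes, kept_edges


if __name__ == "__main__":
    demo_nodes = {1: 10, 2: 11, 3: 12, 4: 13, 5: 14}
    demo_edges = [(1, 2, "CFG"), (2, 3, "DDG"), (4, 3, "CFG"), (4, 5, "CFG")]
    print(prune_cpg(demo_nodes, demo_edges, {11}))
```

In the demo, node 4 feeds into node 3 but is neither an ancestor nor a descendant of the vulnerable node 2, so it is removed along with node 5. A depth limit on the traversal would be a natural variation if full transitive reachability keeps too much of the graph.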
Inventors
- GAO YATING
- HU WEI
- Xiang Zhenglin
- LI GAOLEI
- CHEN XIUZHEN
Assignees
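- 国家电网有限公司信息通信中心 (Information and Communication Center, State Grid Corporation of China)
- 上海交通大学 (Shanghai Jiao Tong University)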
Dates
- Publication Date
- 20260505
- Application Date
- 20251127
Claims (10)
- 1. A vulnerability repair method integrating code graph representation and knowledge retrieval enhancement, characterized by comprising the following steps: obtaining a vulnerability data set comprising a plurality of vulnerability instances, wherein each vulnerability instance comprises vulnerability code and patch code; performing semantic extraction on each vulnerability instance in the vulnerability data set to obtain the corresponding vulnerability cause and repair strategy, storing them in structured form, and constructing a vulnerability knowledge base; obtaining target vulnerability code to be repaired and the corresponding vulnerability localization information, converting the target vulnerability code into a code property graph, and pruning the code property graph according to the vulnerability localization information to generate a pruned code structure graph; extracting a semantic feature vector of the target vulnerability code, computing the similarity between the semantic feature vector and each knowledge item in the vulnerability knowledge base, and retrieving the most relevant vulnerability knowledge; and forming enhancement information from the pruned code structure graph and the most relevant vulnerability knowledge, and fusing the enhancement information with basic information generated from the target vulnerability code and the corresponding vulnerability localization information to generate a prompt that drives a large language model to generate candidate repair patches.
- 2. The vulnerability repair method integrating code graph representation and knowledge retrieval enhancement according to claim 1, wherein obtaining the vulnerability cause and the repair strategy comprises: constructing a vulnerability cause prompt from the vulnerability code in the vulnerability instance and the corresponding vulnerability localization information, and inputting the vulnerability cause prompt into the large language model to obtain the vulnerability cause, wherein the vulnerability cause prompt instructs the large language model to extract the root cause of the vulnerability from the vulnerability code and the corresponding vulnerability localization information; and constructing a repair strategy prompt from the obtained vulnerability cause and the corresponding patch code, and inputting the repair strategy prompt into the large language model to obtain the repair strategy, wherein the repair strategy prompt instructs the large language model to extract knowledge related to the repair strategy from the vulnerability cause and the corresponding patch code.
- 3. The vulnerability repair method integrating code graph representation and knowledge retrieval enhancement according to claim 1, wherein generating the code property graph comprises: converting the target vulnerability code, using the Joern tool or an equivalent static analysis tool, into a unified graph structure containing an abstract syntax tree, a control flow graph, and a program dependence graph.
- 4. The vulnerability repair method integrating code graph representation and knowledge retrieval enhancement according to claim 1, wherein pruning the code property graph according to the vulnerability localization information comprises: marking, in the code property graph, the set of vulnerability-related nodes and the set of vulnerability-related edges according to the vulnerability localization information; starting from each vulnerability node in the node set, traversing forward and backward along the vulnerability edges in the edge set to identify a forward associated node set and a backward associated node set; and deleting from the code property graph every node that belongs to none of the vulnerability node set, the forward associated node set, and the backward associated node set, together with the edges connected to those nodes, to obtain a pruned code structure graph related to the vulnerability.
- 5. The vulnerability repair method integrating code graph representation and knowledge retrieval enhancement according to claim 1, wherein the similarity is a hybrid similarity between the target vulnerability code, represented by its semantic feature vector, and a knowledge item in the vulnerability knowledge base, the hybrid similarity combining, through a weight coefficient, the code semantic similarity between the target vulnerability code and the knowledge item and the vulnerability cause similarity between the target vulnerability code and the knowledge item.
- 6. The vulnerability repair method integrating code graph representation and knowledge retrieval enhancement according to claim 5, wherein a code embedding model is used to extract the semantic feature vector of the target vulnerability code, and the code semantic similarity between the target vulnerability code and a knowledge item is determined by computing the L2 distance and applying a k-nearest-neighbor search.
- 7. The vulnerability repair method integrating code graph representation and knowledge retrieval enhancement according to claim 5, wherein the vulnerability cause similarity between the target vulnerability code and a knowledge item is computed using the BM25 algorithm, and the computed result is normalized.
- 8. The vulnerability repair method integrating code graph representation and knowledge retrieval enhancement according to claim 5, wherein the weight coefficient is adjusted to balance the contributions of code semantics and vulnerability causes in the retrieval.
- 9. The vulnerability repair method integrating code graph representation and knowledge retrieval enhancement according to claim 1, wherein constructing the basic information comprises: setting a role identity for the model to guide it into the mode of reasoning required by the task, thereby activating the model's capabilities related to code security, and introducing the target vulnerability code to be repaired together with the corresponding vulnerability localization information.
- 10. A vulnerability repair system integrating code graph representation and knowledge retrieval enhancement, comprising a memory and a processor, wherein the memory stores a computer program, and the processor invokes the computer program to execute the steps of the method according to any one of claims 1-9.
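The hybrid retrieval of claims 5 to 8 can be illustrated with a minimal Python sketch. The similarity expression itself is not reproduced in this text, so the combination used below, alpha * S_code + (1 - alpha) * S_cause, is an assumption, as are the mapping from L2 distance to a similarity (1 / (1 + d)), min-max normalization of the BM25 scores, whitespace tokenization, the BM25 parameters `k1` and `b`, the knowledge item fields (`embedding`, `cause`, `strategy`), and all function names. The embeddings are hand-written stand-ins for the output of a code embedding model, and the query's cause text is assumed to be available.

```python
import math
from collections import Counter


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document (non-negative IDF variant)."""
    n_docs = len(docs_tokens)
    avg_len = (sum(len(d) for d in docs_tokens) / n_docs) or 1.0
    doc_freq = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def retrieve(query_vec, query_cause, knowledge_base, alpha=0.5, k=3, top_n=1):
    """Return the top_n knowledge items ranked by the assumed hybrid similarity."""
    if not knowledge_base:
        return []
    # k-nearest-neighbor search on L2 distance over code embeddings.
    dists = [l2_distance(query_vec, item["embedding"]) for item in knowledge_base]
    neighbours = sorted(range(len(knowledge_base)), key=lambda i: dists[i])[:k]
    code_sim = {i: 1.0 / (1.0 + dists[i]) for i in neighbours}

    # BM25 over vulnerability cause texts, normalized among the candidates.
    docs = [item["cause"].lower().split() for item in knowledge_base]
    raw = bm25_scores(query_cause.lower().split(), docs)
    cause_sim = dict(zip(neighbours, min_max([raw[i] for i in neighbours])))

    def hybrid(i):
        # Assumed form: weighted sum of the two similarities.
        return alpha * code_sim[i] + (1 - alpha) * cause_sim[i]

    ranked = sorted(neighbours, key=hybrid, reverse=True)
    return [knowledge_base[i] for i in ranked[:top_n]]


if __name__ == "__main__":
    kb = [
        {"id": "K1", "embedding": [0.0, 0.0],
         "cause": "missing bounds check before memcpy",
         "strategy": "validate length before copying"},
        {"id": "K2", "embedding": [0.1, 0.0],
         "cause": "use after free of session object",
         "strategy": "clear pointer after free"},
        {"id": "K3", "embedding": [5.0, 5.0],
         "cause": "missing bounds check on array index",
         "strategy": "check index range"},
    ]
    best = retrieve([0.05, 0.0], "missing bounds check on buffer length", kb, k=2)
    print(best[0]["id"], "->", best[0]["strategy"])
```

Setting `alpha` close to 1 makes the ranking depend almost entirely on the code embeddings, while a value close to 0 lets the cause text dominate, which is the balancing role claim 8 assigns to the weight coefficient. For a large knowledge base, the exhaustive distance computation would normally be replaced by a vector index.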
Description
Vulnerability restoration method and system integrating code graph representation and knowledge retrieval enhancement
Technical Field
The invention relates to the technical field of vulnerability repair, and in particular to a vulnerability repair method and system integrating code graph representation and knowledge retrieval enhancement.
Background
In recent years, methods based on large language models (LLMs) have shown significant potential in automated vulnerability repair. Typical methods include VRepair, which applies transfer learning to vulnerability repair data using a Transformer model; VulRepair, which improves the overall repair rate through large-scale pre-training and byte-pair encoding; and VUL-RAG, which strengthens the model's ability to understand and repair complex code by introducing a Retrieval-Augmented Generation (RAG) mechanism. Although these methods have advanced generality and automation, they generally rely on large-scale labeled vulnerability samples or external knowledge corpora, so training and retrieval costs are high. In addition, the models have insufficient semantic understanding of structured code and tend to generate patches that are syntactically correct but logically wrong.
Code representation is a key technique for converting source code into a suitable format; by accurately expressing the semantics and structure of code, it allows vulnerability patterns hidden in the code to be extracted and analyzed effectively. Graph-based code representation methods integrate more structural information into a graph in order to represent program semantics. Common graph structures include control flow graphs (CFGs), data flow graphs (DFGs), call graphs (CGs), and program dependence graphs (PDGs).
Works such as SCPG represent the syntactic and semantic information of code and use graph neural networks to extract features for vulnerability detection, while VulMaster uses abstract syntax trees to strengthen the understanding of syntax and semantics. However, existing methods focus on static structural modeling or take code as plain-text input, which makes deep integration with the knowledge understanding and reasoning capabilities of LLMs difficult; they are therefore limited when dealing with structurally complex vulnerabilities.
Knowledge-Augmented Generation (KAG) offers a new way to address a model's lack of knowledge. By introducing external domain knowledge during generation, the model can improve its understanding of and reasoning about the vulnerability context without retraining. RAGFix and SOSecure have validated this mechanism for code repair and secure code generation. However, existing KAG methods often struggle to retrieve highly relevant knowledge accurately in vulnerability scenarios, and their lack of structured code information limits further improvement of repair accuracy.
Therefore, there is a need for a novel vulnerability repair framework that combines graph-structured representation with a knowledge-augmented generation mechanism, improving the model's understanding of code structure while making full use of external knowledge to strengthen the large language model's understanding of vulnerability causes and repair strategies, so as to achieve automated vulnerability repair of higher quality and higher reliability.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing a vulnerability repair method and system integrating code graph representation and knowledge retrieval enhancement, which improve the model's understanding of code structure while making full use of external knowledge to strengthen the large language model's understanding of vulnerability causes and repair strategies, thereby achieving automated vulnerability repair of higher quality and higher reliability.
The aim of the invention can be achieved by the following technical scheme:
A vulnerability repair method integrating code graph representation and knowledge retrieval enhancement comprises the following steps: obtaining a vulnerability data set comprising a plurality of vulnerability instances, wherein each vulnerability instance comprises vulnerability code and patch code; performing semantic extraction on each vulnerability instance in the vulnerability data set to obtain the corresponding vulnerability cause and repair strategy, storing them in structured form, and constructing a vulnerability knowledge base; obtaining target vulnerability code to be repaired and the corresponding vulnerability localization information, converting the target vulnerability code into cod