CN-122020660-A - Code vulnerability semantic tracing and automatic repair method based on a multimodal large model

Abstract

The invention discloses a method for semantic tracing and automatic repair of code vulnerabilities based on a multimodal large model. The method first establishes a unified semantic representation space through multimodal code-vulnerability semantic representation learning. It then identifies the complete propagation path of a taint vulnerability through graph neural reasoning over taint propagation paths, generates high-quality secure repair code through semantically equivalent repair code generation, and determines the root cause and scope of impact through vulnerability root cause tracing and impact domain analysis. By fusing multimodal information (static code structure, dynamic execution traces, a vulnerability knowledge base, and developer intent), the method constructs a code-vulnerability semantic association graph and achieves vulnerability root cause tracing, taint propagation path tracking, and automatic generation of semantically equivalent repair code, which can significantly improve the efficiency of secure software development and reduce security risk.

Inventors

LI YIPENG
SHAO XINQING
WU HAO
ZHANG PING
ZHOU HONGWEI

Assignees

Jiangsu Hoperun Software Co., Ltd. (江苏润和软件股份有限公司)

Dates

Publication Date: 2026-05-12
Application Date: 2026-01-22

Claims (6)

1. A code vulnerability semantic tracing and automatic repair method based on a multimodal large model, characterized in that a unified semantic representation space is first established through multimodal code-vulnerability semantic representation learning; the complete propagation path of a taint vulnerability is then identified through graph neural reasoning over taint propagation paths; high-quality secure repair code is then generated through semantically equivalent repair code generation; and finally the root cause and scope of impact are determined through vulnerability root cause tracing and impact domain analysis; the method specifically comprising the following steps:
S1, multimodal code-vulnerability semantic representation learning: designing dedicated feature extractors for the four modalities of source code, execution traces, vulnerability knowledge base, and context information, and constructing a unified code-vulnerability semantic representation space that provides a unified semantic basis for subsequent taint propagation analysis, repair code generation, and root cause tracing;
S2, graph neural reasoning over taint propagation paths: based on the fused features obtained from the multimodal code-vulnerability semantic representation learning, and targeting taint vulnerabilities such as SQL injection, XSS, and command injection, constructing a fused representation of the program data flow graph and control flow graph, and tracking through a graph neural network the complete propagation path of user input from a source point to a dangerous function;
S3, semantically equivalent repair code generation: generating semantically equivalent repair code based on the vulnerability location and type identified by the graph neural reasoning over taint propagation paths;
S4, vulnerability root cause tracing and impact domain analysis: based on the vulnerability location identified by the semantically equivalent repair code generation module, tracing backward from the vulnerability trigger point to the root cause (a design defect or a coding error), thereby providing deeper guidance for the repair scheme;
S5, model training strategy: optimizing model performance by adopting a multi-stage training strategy.
2. The code vulnerability semantic tracing and automatic repair method based on a multimodal large model as claimed in claim 1, wherein step S1 comprises:
S11, static code structure encoding: for source code, multi-level code representation learning is used to extract features at four levels (lexical, syntactic, semantic, and architectural) together with the control flow graph $G_{cfg}$ and the data flow graph $G_{dfg}$. First, token-level lexical features are extracted using a code pre-training model. Let the source code token sequence be $X = (x_1, x_2, \dots, x_n)$, where $n$ is the sequence length and $x_i$ denotes the $i$-th token. The code encoder yields a context representation $H_{tok} = \mathrm{Encoder}(X) = [h_1, h_2, \dots, h_n] \in \mathbb{R}^{n \times d_c}$, where $h_i \in \mathbb{R}^{d_c}$ is the context embedding vector of the $i$-th token, $d_c$ is the code feature dimension, and $H_{tok}$ is the token-level feature matrix. For AST node-level syntactic features, the AST is encoded with a graph neural network. The AST contains $m$ nodes, with node feature matrix $X_{ast}$ and adjacency matrix $A$. Node neighborhood information is aggregated through a multi-layer graph convolutional network: $H^{(l+1)} = \sigma\left(D^{-1/2} A D^{-1/2} H^{(l)} W^{(l)}\right)$, with $H^{(0)} = X_{ast}$, where $H^{(l)}$ is the node representation at layer $l$, $D$ is the degree matrix, $W^{(l)}$ is a learnable weight matrix, and $\sigma$ is the activation function. After $L$ layers of graph convolution, the AST node features $H_{ast} = H^{(L)}$ are obtained. For function-level semantic features, all AST node features within a function are pooled and aggregated: $h_f = \mathrm{Pool}\left(\{H_{ast}[v] : v \in N_f\}\right)$, where $N_f$ is the set of AST nodes contained in function $f$ and $h_f$ is the function-level semantic feature vector. Finally, the multi-level code features are fused through an attention mechanism: $h_{code} = \alpha_1 \bar{h}_{tok} + \alpha_2 \bar{h}_{ast} + \alpha_3 h_f$, where $\bar{h}_{tok}$ is the average pooling of the token-level features, $\bar{h}_{ast}$ is the average pooling of the AST node features, $\alpha_1, \alpha_2, \alpha_3$ are learnable fusion weights satisfying $\alpha_1 + \alpha_2 + \alpha_3 = 1$, and $h_{code}$ is the fused static code feature vector;
S12, dynamic execution trace encoding: for the program execution trace, the call stack sequence, memory state changes, and I/O operation sequence are extracted. Let the execution trace contain $T$ time steps, where the call stack at time step $t$ is $S_t = (f_1, f_2, \dots, f_k)$, $k$ is the call stack depth, and $f_j$ is the function name at layer $j$. The temporal features of the execution trace are extracted using a bidirectional LSTM network. The call stack at each time step is first encoded as a fixed-dimension vector: $e_t = \mathrm{Pool}\left(\{\mathrm{Emb}(f_j)\}_{j=1}^{k}\right) \in \mathbb{R}^{d_e}$, where $\mathrm{Emb}(\cdot)$ is the function name embedding, $e_t$ is the feature of time step $t$, and $d_e$ is the execution feature dimension. The sequence is then processed by the bidirectional LSTM: $\overrightarrow{h}_t = \overrightarrow{\mathrm{LSTM}}(e_t, \overrightarrow{h}_{t-1})$, $\overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}(e_t, \overleftarrow{h}_{t+1})$, $h_{exec} = [\overrightarrow{h}_T ; \overleftarrow{h}_1] \in \mathbb{R}^{2 d_h}$, where $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ are the hidden states of the forward and backward LSTM at the $t$-th time step, $h_{exec}$ is the execution trace feature vector, and $d_h$ is the LSTM hidden layer dimension;
S13, vulnerability knowledge base encoding: for the vulnerability knowledge base, a knowledge graph containing CWE classifications, CVE vulnerability information, and attack patterns is constructed, comprising $N_e$ entities and $N_r$ relation edges. Vector representations of entities and relations are learned with the TransE graph embedding method, which models each triple $(h, r, t)$ in the knowledge graph as $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$, where $\mathbf{h}$, $\mathbf{r}$, and $\mathbf{t}$ are the embedding vectors of the head entity, the relation, and the tail entity respectively. For each triple, a distance function is defined: $d(h, r, t) = \lVert \mathbf{h} + \mathbf{r} - \mathbf{t} \rVert$. The training goal is to minimize the distance of positive triples and maximize the distance of negative triples, where negative samples are generated by randomly replacing the head entity or the tail entity. The loss function is: $\mathcal{L}_{KG} = \sum_{(h,r,t) \in S^{+}} \sum_{(h',r,t') \in S^{-}} \max\left(0,\ \gamma + d(h, r, t) - d(h', r, t')\right)$, where $S^{+}$ is the set of positive triples, $S^{-}$ is the set of negative triples, and $\gamma$ is the margin. The embedding vectors of entities and relations are obtained by optimizing the loss function. For a vulnerability type entity $v$, its embedding vector is $\mathbf{e}_v \in \mathbb{R}^{d_{kg}}$, where $d_{kg}$ is the knowledge embedding dimension. To associate code fragments with vulnerability types, a similarity is calculated: $\mathrm{sim}(h_{code}, v) = \dfrac{(W_p h_{code})^{\top} \mathbf{e}_v}{\lVert W_p h_{code} \rVert \, \lVert \mathbf{e}_v \rVert}$, where $W_p \in \mathbb{R}^{d_{kg} \times d_c}$ is a projection matrix that maps the code feature space to the knowledge space, and $\mathrm{sim}(h_{code}, v)$ is the semantic similarity between the code and the vulnerability type;
S14, cross-modal semantic alignment and fusion: to establish semantic associations among code, execution traces, and vulnerability knowledge, a cross-modal alignment mechanism projects the features of the different modalities into a unified code-vulnerability semantic space: $z_c = W_c h_{code}$, $z_e = W_e h_{exec}$, $z_k = W_k \mathbf{e}_v$, where $W_c$, $W_e$, and $W_k$ are projection matrices, $d$ is the dimension of the unified semantic space, and $z_c, z_e, z_k \in \mathbb{R}^{d}$ are the projected feature vectors. The code features serve as the Query, and the execution trace and vulnerability knowledge features serve as the Key and Value: $Q = W_Q z_c$, $K = W_K [z_e ; z_k]$, $V = W_V [z_e ; z_k]$, where $W_Q$, $W_K$, and $W_V$ are learnable query, key, and value projection matrices and $[\cdot\,;\cdot]$ denotes vector concatenation. For each attention head $i$ ($i = 1, \dots, N_h$), single-head attention is calculated: $\mathrm{head}_i = \mathrm{Softmax}\left(\dfrac{Q_i K_i^{\top}}{\sqrt{d_a}}\right) V_i$, where $Q_i = Q W_i^{Q}$, $K_i = K W_i^{K}$, $V_i = V W_i^{V}$, the matrices $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$ are the projection matrices of the $i$-th head, and $d_a$ is the dimension of each head. The Softmax function computes the attention weights of the code features over the execution trace and vulnerability knowledge; the larger the weight, the stronger the relevance between that modality's information and the code. The outputs of the $N_h$ attention heads are concatenated, and the fused representation is obtained through an output projection matrix: $z_{fused} = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_{N_h})\, W_O$, where $W_O$ is the output projection matrix, $N_h$ is the number of attention heads, and $z_{fused}$ is the fused multimodal code-vulnerability semantic representation, which integrates the three modalities of code structure, execution behavior, and vulnerability knowledge and provides a unified semantic basis for subsequent analysis.
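The TransE distance and margin-based loss described in step S13 can be illustrated with a minimal sketch. The entity names, the 2-dimensional embeddings, the Euclidean norm, and the margin value of 1.0 are all assumptions chosen for illustration; the claim does not fix them, and a real system would learn the embeddings by gradient descent rather than set them by hand.

```python
import math

def transe_distance(head, relation, tail):
    """Distance ||head + relation - tail|| (Euclidean norm assumed here)."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))

def margin_loss(positive, negative, margin=1.0):
    """Hinge loss for one (positive, negative) triple pair.

    The loss is zero once the negative triple is at least `margin`
    farther apart than the positive triple.
    """
    return max(0.0, margin + transe_distance(*positive) - transe_distance(*negative))

# Hypothetical hand-set embeddings (illustrative only, not learned).
sql_injection = [1.0, 0.0]     # head entity
is_a = [0.0, 1.0]              # relation
injection_class = [1.0, 1.0]   # correct tail entity
unrelated = [-2.0, -2.0]       # corrupted tail used for the negative sample

positive = (sql_injection, is_a, injection_class)
negative = (sql_injection, is_a, unrelated)  # tail entity randomly replaced

pos_d = transe_distance(*positive)
neg_d = transe_distance(*negative)
loss = margin_loss(positive, negative)
print(pos_d, neg_d, loss)
```

In this toy setup the positive triple satisfies head + relation = tail exactly, so its distance is zero and the pair contributes no loss; swapping the roles of the two triples would produce a positive loss that training would push down.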
3. The code vulnerability semantic tracing and automatic repair method based on a multimodal large model as claimed in claim 2, wherein step S2 comprises:
graph construction: the source code is first analyzed with a static analysis tool to construct a data flow graph and a control flow graph. In the data flow graph $G_{dfg} = (V_d, E_d)$, the nodes $V_d$ represent variables in the program and the edges $E_d$ represent data dependencies between variables. In the control flow graph $G_{cfg} = (V_c, E_c)$, the nodes $V_c$ represent basic blocks and the edges $E_c$ represent control flow transitions between basic blocks;
fusion graph construction: the data flow graph and the control flow graph are fused into a unified program graph $G = (V, E)$, where $V$ is the fused node set and $E$ is the fused edge set. The fusion method is as follows: for each variable node $v$ in the data flow graph, the basic blocks $b$ in the control flow graph that contain a definition or use of the variable are found; a variable node $v$ and a basic block node $b$ are created in the fusion graph, and an edge from the variable node to the basic block node and an edge from the basic block node to the variable node are established; for each control flow edge in the control flow graph, the corresponding control flow edge between basic blocks is retained in the fusion graph;
taint source and sink identification: the taint source points and dangerous sink points are identified first. The set of taint source points $S$ is obtained by matching a list of predefined dangerous input functions, and the set of dangerous sinks $K$ is obtained by matching a list of predefined dangerous output functions. For each taint source point $s \in S$, a taint label is initialized: $t_s^{(0)} = \mathbf{1} \in \mathbb{R}^{d_t}$, where $t_s^{(0)}$ is the initial taint vector, $d_t$ is the taint feature dimension, and $\mathbf{1}$ denotes the all-ones vector. For non-source nodes, the initial taint vector is the all-zero vector;
graph neural network propagation mechanism: taint information is propagated over the fused program graph by a multi-layer graph neural network that adopts a message passing mechanism, in which each node aggregates information from its neighbor nodes and updates its own taint representation. For a variable node $v$, the taint state is updated as: $t_v^{(l+1)} = \sigma\left(W_{self}\, t_v^{(l)} + W_{nbr} \sum_{u \in N(v)} t_u^{(l)}\right)$, where $t_v^{(l)} \in \mathbb{R}^{d_t}$ is the taint representation of variable $v$ at layer $l$, $d_t$ is the taint feature dimension, $N(v)$ is the set of neighbor nodes of variable $v$, $W_{self}$ is the weight matrix for the node itself, $W_{nbr}$ is the weight matrix for neighbor aggregation, $\sigma$ is the activation function, and $l$ is the layer index of the graph neural network. The update formula indicates that node $v$ combines its own taint state with the taint information aggregated from its neighbors. Through multi-layer propagation, the taint information spreads gradually from the source points across the whole program graph and finally reaches the dangerous sink points;
taint strength calculation: after $L$ layers of propagation, for a dangerous sink $k \in K$ the taint strength is calculated: $p(s \to k) = \mathrm{sigmoid}\left(w^{\top} t_k^{(L)} + b\right)$, where $w$ is a weight vector, $b$ is a bias scalar, $p(s \to k)$ represents the probability that a taint propagation path exists from source point $s$ to sink point $k$, and $L$ is the number of propagation layers of the graph neural network. When $p(s \to k)$ exceeds the decision threshold, a taint vulnerability is judged to exist and the complete propagation path is extracted.
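The layer-by-layer taint propagation in this claim can be sketched without any learned parameters. The example below is a deliberate simplification and not the claimed graph neural network: it replaces the learned weight matrices and activation with a fixed max-aggregation over predecessor nodes and uses scalar taint values, so it reduces to reachability within a bounded number of layers. The graph, node names, and function names are hypothetical.

```python
from collections import deque

def propagate_taint(edges, sources, num_layers):
    """Propagate scalar taint along directed edges for `num_layers` rounds.

    Each round, a node's taint becomes the maximum of its own taint and the
    taint of its predecessors (a fixed stand-in for learned message passing).
    """
    nodes = set(edges) | {dst for dsts in edges.values() for dst in dsts}
    preds = {node: [] for node in nodes}
    for src, dsts in edges.items():
        for dst in dsts:
            preds[dst].append(src)
    # Sources start fully tainted; every other node starts clean.
    taint = {node: 1.0 if node in sources else 0.0 for node in nodes}
    for _ in range(num_layers):
        taint = {n: max([taint[n]] + [taint[p] for p in preds[n]]) for n in nodes}
    return taint

def extract_path(edges, source, sink):
    """Return one shortest source-to-sink path via BFS, or None if unreachable."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == sink:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in edges.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

# Hypothetical fused program graph: user input flows into a SQL call.
graph = {
    "request.args": ["user_id"],
    "user_id": ["query"],
    "query": ["cursor.execute"],
    "config": ["timeout"],
}
taint_after_3 = propagate_taint(graph, {"request.args"}, num_layers=3)
taint_after_2 = propagate_taint(graph, {"request.args"}, num_layers=2)
path = extract_path(graph, "request.args", "cursor.execute")
print(taint_after_3["cursor.execute"], path)
```

The sink sits three edges from the source, so two layers are not enough for the taint to arrive. This mirrors why the number of propagation layers matters in the claimed scheme: it bounds the length of the propagation paths that can be detected.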
4. The code vulnerability semantic tracing and automatic repair method based on a multimodal large model as claimed in claim 3, wherein step S3 comprises:
repair code generation: a sequence-to-sequence model takes the multimodal fused features of the vulnerable code and the vulnerability type as input and generates the repaired secure code. Let the features of the input vulnerable code be $z_{fused}$ and the vulnerability type be $c$. The repair generator adopts a Transformer decoder architecture and generates the repair code token by token in an autoregressive manner. The Transformer decoder is a stack of decoder layers, each comprising three sub-modules:
a masked self-attention layer, which performs self-attention over the already generated prefix sequence so that each position can attend to all previously generated tokens; a masking mechanism ensures that when the $t$-th token is generated only the preceding $t-1$ tokens are visible, which guarantees the consistency of autoregressive generation;
a cross-attention layer, which takes the encoder output as Key and Value and the decoder hidden state as Query and computes cross attention, so that the generated tokens can attend to the relevant parts of the input vulnerable code;
a feed-forward neural network layer, which applies a nonlinear transformation to the attention output and enhances the expressive capacity of the model;
each sub-module includes a residual connection and layer normalization to improve training stability and gradient flow. After the multi-layer decoder, the decoder hidden state is obtained: $H_{dec} = \mathrm{Decoder}(z_{fused}, \mathbf{e}_c, C) \in \mathbb{R}^{T_r \times d_m}$, where $\mathbf{e}_c$ is the embedding vector of the vulnerability type, $C$ is the context information, $H_{dec}$ is the decoder hidden state matrix, $T_r$ is the repair code length, and $d_m$ is the hidden state dimension;
token generation mechanism: for each position $t$ of the decoder hidden state, the generation probability of each token in the vocabulary is calculated by a linear transformation and the Softmax function: $P(y_t \mid y_{<t}) = \mathrm{Softmax}(W_g h_t)$, where $y_t$ is the $t$-th generated token, $y_{<t}$ is the already generated prefix sequence, $h_t$ is the hidden state at the $t$-th position, $W_g \in \mathbb{R}^{|\mathcal{V}| \times d_m}$ is the generation weight matrix, and $|\mathcal{V}|$ is the vocabulary size. At generation time a greedy decoding or beam search strategy is adopted, and the token with the highest probability is selected as output;
semantic equivalence constraint: to ensure semantic equivalence, multiple constraint losses are introduced. The semantic equivalence loss is implemented through contrastive learning. Contrastive sample pairs are constructed first: the positive pair consists of the original code $c_o$ and the corresponding repair code $c_r$; each negative pair consists of the original code $c_o$ and another randomly sampled code fragment $c_j^{-}$, $j = 1, \dots, K$. The negative samples come from different vulnerable code or repair code, are semantically unrelated to the original code, and are obtained by random sampling from the training dataset, with the number $K$ set to 5 to 10. The semantic equivalence loss function is defined as: $\mathcal{L}_{sem} = -\log \dfrac{\exp\left(\mathrm{sim}(z_o, z_r)\right)}{\exp\left(\mathrm{sim}(z_o, z_r)\right) + \sum_{j=1}^{K} \exp\left(\mathrm{sim}(z_o, z_j^{-})\right)}$, where $z_o$ is the semantic representation of the original code, $z_r$ is the semantic representation of the repair code, $z_j^{-}$ is the representation of the $j$-th negative sample code, $\mathrm{sim}$ is the cosine similarity function $\mathrm{sim}(a, b) = \dfrac{a^{\top} b}{\lVert a \rVert \, \lVert b \rVert}$, and $K$ is the number of negative samples;
the total loss function is defined as: $\mathcal{L} = \lambda_1 \mathcal{L}_{gen} + \lambda_2 \mathcal{L}_{sem} + \lambda_3 \mathcal{L}_{sec} + \lambda_4 \mathcal{L}_{test}$, where $\mathcal{L}_{gen}$ is the generation loss, $\mathcal{L}_{sem}$ is the semantic equivalence loss, $\mathcal{L}_{sec}$ is the security loss, $\mathcal{L}_{test}$ is the test pass loss, and $\lambda_1, \dots, \lambda_4$ are weight coefficients.
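The contrastive semantic equivalence loss in this claim can be written in a few lines. The sketch below assumes the loss takes the usual softmax-over-similarities form with cosine similarity and no temperature term (the claim lists none). The 2-dimensional vectors are hypothetical stand-ins for learned code representations, and only two negatives are used for brevity where the claim calls for 5 to 10.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_equivalence_loss(z_orig, z_repair, z_negatives):
    """Negative log of the positive pair's share of the exponentiated similarities."""
    pos = math.exp(cosine(z_orig, z_repair))
    neg = sum(math.exp(cosine(z_orig, z)) for z in z_negatives)
    return -math.log(pos / (pos + neg))

# Hypothetical representations (illustrative only).
z_orig = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

# A repair that stays close to the original code's representation ...
loss_close = semantic_equivalence_loss(z_orig, [1.0, 0.0], negatives)
# ... versus one that drifts away from it.
loss_drifted = semantic_equivalence_loss(z_orig, [0.0, 1.0], negatives)
print(loss_close, loss_drifted)
```

The loss is lower when the repair's representation stays aligned with the original code and higher when it drifts toward unrelated fragments, which is the pressure the claim relies on to keep repairs semantically equivalent.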
5. The code vulnerability semantic tracing and automatic repair method based on a multimodal large model as claimed in claim 4, wherein step S4 comprises:
constructing a vulnerability propagation dependency graph using reverse data flow analysis and reverse control flow analysis;
reverse data flow analysis: starting from the vulnerability trigger point $v_t$, the edges of the data flow graph are traversed backward, tracking all variable definitions and uses that may affect that point. Specifically, a depth-first search (DFS) starts from $v_t$, traverses the data dependency edges in the reverse direction, and marks all visited variable nodes; for each variable node $v$, if a data dependency path exists from $v$ to $v_t$, then $v$ is added to the root cause candidate set;
reverse control flow analysis: starting from the basic block containing the vulnerability trigger point, the edges of the control flow graph are traversed in the reverse direction, tracking all control statements that may govern the execution of that point. Specifically, a DFS starts from the basic block containing the trigger point, traverses the control flow edges in the reverse direction, and marks all visited basic blocks; for each basic block $b$, if a control dependency path exists from $b$ to the basic block of the trigger point, then the control statements in $b$ are added to the root cause candidate set;
for the vulnerability trigger point $v_t$, the data and control dependencies are traced backward to identify all code locations that may affect that point: $R(v_t) = DD^{*}(v_t) \cup CD^{*}(v_t)$, where $DD^{*}$ denotes the transitive closure of data dependencies, $CD^{*}$ denotes the transitive closure of control dependencies, and $R(v_t)$ is the root cause candidate set;
the true root cause is determined by analyzing the code change history, design documents, and developer intent. The root cause scoring function takes several factors into account: $\mathrm{Score}(r) = \beta_1 S_{cx}(r) + \beta_2 S_{chg}(r) + \beta_3 S_{doc}(r, C) + \beta_4 S_{int}(r, C)$, where $S_{cx}$ is the code complexity score, $S_{chg}$ is the change frequency score, $S_{doc}$ is the design document consistency score, $S_{int}$ is the developer intent match score, $\beta_1, \dots, \beta_4$ are weight coefficients satisfying $\beta_1 + \beta_2 + \beta_3 + \beta_4 = 1$, and $C$ is the context information. The true root cause is determined as: $r^{*} = \arg\max_{r \in R(v_t)} \mathrm{Score}(r)$;
impact domain analysis: the range of code modules affected by the vulnerability is evaluated by identifying all affected code modules from the root cause location using a breadth-first search (BFS) graph traversal. Specifically, starting from the module $m_{r^{*}}$ containing the root cause location, the module dependency graph $G_m$ is traversed with BFS to visit all reachable module nodes; for each module $m$, if a path exists from module $m_{r^{*}}$ to module $m$, then $m$ is marked as an affected module: $M_{aff} = \{\, m \mid \text{a path from } m_{r^{*}} \text{ to } m \text{ exists in } G_m \,\}$, where $G_m$ is the code module dependency graph and $M_{aff}$ is the set of affected modules;
the risk level quantization formula is: $\mathrm{Risk} = \omega_1 |M_{aff}| + \omega_2 S_{exp} + \omega_3 S_{asset}$, where $|M_{aff}|$ is the number of affected modules, $S_{exp}$ is the vulnerability exploitability score, $S_{asset}$ is the data asset exposure surface score, $\omega_1, \omega_2, \omega_3$ are weight coefficients satisfying $\omega_1 + \omega_2 + \omega_3 = 1$, and $\mathrm{Risk}$ is the comprehensive risk level.
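The backward tracing, weighted root cause scoring, and BFS impact analysis in this claim can be sketched as follows. Everything concrete here is hypothetical: the dependency graphs, the node and module names, the four per-candidate factor scores, and the equal weights of 0.25. The sketch also covers only the data dependency side of the candidate set and reports the affected modules excluding the root module itself, which is one possible reading of the claim.

```python
from collections import deque

def backward_reachable(dep_edges, trigger):
    """Iterative DFS along reversed edges: every node with a path to `trigger`."""
    reverse = {}
    for src, dsts in dep_edges.items():
        for dst in dsts:
            reverse.setdefault(dst, []).append(src)
    seen, stack = set(), [trigger]
    while stack:
        node = stack.pop()
        for prev in reverse.get(node, []):
            if prev not in seen:
                seen.add(prev)
                stack.append(prev)
    return seen

def pick_root_cause(candidates, factor_scores, weights):
    """Return the candidate with the highest weighted sum of its factor scores."""
    def score(candidate):
        return sum(w * s for w, s in zip(weights, factor_scores[candidate]))
    return max(candidates, key=score)

def affected_modules(module_deps, root_module):
    """BFS over the module dependency graph; the root module itself is excluded."""
    seen, queue = {root_module}, deque([root_module])
    while queue:
        module = queue.popleft()
        for nxt in module_deps.get(module, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {root_module}

# Hypothetical data dependency edges (definition -> use).
data_deps = {
    "raw_input": ["name"],
    "name": ["sql"],
    "sql": ["execute_call"],
    "log_level": ["logger"],
}
candidates = backward_reachable(data_deps, "execute_call")

# Hypothetical factor scores: complexity, change frequency,
# design document consistency, developer intent match.
factor_scores = {
    "raw_input": [0.2, 0.9, 0.8, 0.7],
    "name": [0.1, 0.2, 0.1, 0.2],
    "sql": [0.5, 0.4, 0.3, 0.2],
}
root = pick_root_cause(candidates, factor_scores, weights=[0.25, 0.25, 0.25, 0.25])

# Hypothetical module dependency graph (module -> modules that depend on it).
module_deps = {
    "web.forms": ["web.views"],
    "web.views": ["db.queries"],
    "db.queries": [],
    "cli": ["db.queries"],
}
impacted = affected_modules(module_deps, "web.forms")
print(candidates, root, impacted)
```

Unrelated nodes such as `log_level` never enter the candidate set because no dependency path connects them to the trigger point, and `cli` is not marked as affected because nothing reachable from the root module leads to it.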
6. The code vulnerability semantic tracing and automatic repair method based on a multimodal large model, characterized in that step S5 comprises: pre-training the multimodal encoder on code understanding tasks to learn code semantic representations; fine-tuning on a vulnerability detection dataset to learn vulnerability pattern recognition; training on a repair code generation dataset to learn semantically equivalent repair; and performing end-to-end joint optimization to balance detection accuracy and repair quality.

Description

Code vulnerability semantic tracing and automatic repair method based on a multimodal large model

Technical Field

The invention relates to the technical field of software security and code analysis, and in particular to a code vulnerability semantic tracing and automatic repair method based on a multimodal large model. It is suitable for application scenarios such as enterprise code security audits, open source software supply chain security analysis, DevSecOps automation pipelines, smart contract security audits, and security hardening of legacy systems.

Background

As software systems continue to grow in scale and complexity, code security vulnerabilities have become a major threat to software system security. Traditional vulnerability detection methods rely mainly on rule-based static analysis tools and pattern matching techniques. Although these can quickly identify known vulnerability patterns, they have the following limitations:

1. Insufficient semantic understanding: traditional methods are based mainly on keyword matching, regular expressions, or simple syntax rules, and cannot deeply understand the semantic logic and execution flow of code. For complex logic vulnerabilities, combined vulnerabilities, or vulnerabilities that require cross-function and cross-module analysis, detection accuracy is low and the false-negative and false-positive rates are high.
2. Difficulty tracing the root cause of a vulnerability: existing tools can only locate the vulnerability trigger point and cannot trace its root cause. A vulnerability may originate from a design defect, a coding error, improper configuration, or a third-party dependency problem. Without root cause analysis, a repair treats the symptom rather than the cause, and similar vulnerabilities may recur.
3. Low repair quality: most vulnerability detection tools only report the location and type of a vulnerability and lack automatic repair capability. Even when some tools offer repair suggestions, they usually apply simple patch replacement or rule-based fixes without considering the semantic equivalence of the code, so they may introduce new bugs or break existing functionality, and repair quality is hard to guarantee.
4. Underuse of multimodal information: detecting and repairing code vulnerabilities requires combining multimodal data such as static code structure (AST, control flow graph, and data flow graph), dynamic execution behavior (call stack, memory state, and I/O operations), vulnerability knowledge bases (CWE classifications, CVE vulnerability information, and attack patterns), and context information (code comments, commit history, and development documents). Existing methods generally use only single-modality information and cannot exploit the synergy of multimodal data.
5. Incomplete taint propagation path tracking: for taint vulnerabilities such as SQL injection, XSS, and command injection, the complete propagation path of user input from a source point to a dangerous function must be tracked. Traditional methods use simple data flow analysis and cannot handle complex control flow transitions, function calls, and data structure operations, so the taint propagation chain breaks and vulnerabilities are missed.
6. Missing assessment of vulnerability impact scope: once a vulnerability is found, the affected code modules, the data asset exposure surface, the possible attack paths, and the risk level must be assessed to determine repair priority. Existing tools lack systematic impact domain analysis and struggle to support security decisions.
7. Lagging knowledge updates: vulnerability detection rules and pattern libraries must be maintained manually, so for newly disclosed vulnerability types and attack techniques the update cycle is long and timeliness is poor. The lack of automatic learning and knowledge updating mechanisms makes it difficult to cope with a rapidly evolving threat environment.
8. Difficulty migrating vulnerability patterns across languages: different programming languages, although they differ in syntax, share similar vulnerability patterns (e.g., buffer overflows, null pointer dereferences, race conditions). Existing methods are usually designed for a single language and cannot effectively transfer language-independent vulnerability knowledge.

In recent years, large-scale pre-trained models have shown strong capability in code understanding and generation tasks, but applying them directly to vulnerability detection and repair still faces challenges: general code models lack security-domain knowledge, understanding of