CN-121980571-A - Micro patch vulnerability matching quantization method, system, equipment and medium based on self-encoder
Abstract
The invention discloses a micro-patch vulnerability matching quantization method, system, equipment and medium based on a self-encoder (autoencoder), relating to the technical field of intelligent code analysis. The method comprises: obtaining multi-source semantic data and parsing it into a unified semantic representation with consistent dimensionality; inputting that representation into a self-encoder model, whose unsupervised feature learning and compression generate a low-dimensional semantic embedding vector; calculating the matching distance between a patch and a vulnerability with a distance metric function, thereby quantifying their degree of difference in semantic space; constructing multi-dimensional matching score indices from the matching distance and generating a comprehensive matching score through a weighted fusion mechanism; and ranking and screening the patch-vulnerability matching results according to the comprehensive matching score. The invention constructs a three-dimensional matching evaluation system that spans deep semantics to surface-level logic, and significantly improves the accuracy and reliability of matching micro-patches to vulnerabilities.
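The last two steps of the abstract (weighted fusion into a comprehensive score, then ranking and screening) can be illustrated with a short Python sketch. This is a minimal sketch under assumptions, not the patented implementation: the names `Candidate`, `composite_score` and `rank_candidates`, the decay rate, the weights (0.5, 0.3, 0.2) and the screening threshold are all hypothetical, and the fused matching distance and the structural and context scores are assumed to be precomputed.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    patch_id: str
    match_distance: float  # fused embedding distance, assumed >= 0
    structural: float      # structural consistency score, assumed in [0, 1]
    context: float         # context alignment score, assumed in [0, 1]


def composite_score(c, decay=1.0, weights=(0.5, 0.3, 0.2)):
    """Weighted fusion of three scores; decay and weights are illustrative."""
    base = math.exp(-decay * c.match_distance)  # distance 0 maps to score 1
    w_base, w_struct, w_ctx = weights
    return w_base * base + w_struct * c.structural + w_ctx * c.context


def rank_candidates(candidates, threshold=0.0, decay=1.0, weights=(0.5, 0.3, 0.2)):
    """Return (patch_id, score) pairs at or above threshold, best first."""
    scored = [(c.patch_id, composite_score(c, decay, weights)) for c in candidates]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical candidates for one vulnerability.
candidates = [
    Candidate("patch-A", match_distance=2.0, structural=0.2, context=0.1),
    Candidate("patch-B", match_distance=0.0, structural=1.0, context=1.0),
    Candidate("patch-C", match_distance=0.5, structural=0.6, context=0.7),
]
ranking = rank_candidates(candidates, threshold=0.3)
```

With these made-up inputs, `patch-A` falls below the threshold and is screened out, while the remaining candidates are returned in descending score order.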
Inventors
- ZHOU ZEYUAN
- LI KUN
- CAO GANG
- YAN BINYUAN
- FU JUN
- TAO JIAYE
- HAN JIAXUAN
- BAN QIUCHENG
- JIANG ZAINENG
- ZHOU LINYAN
Assignees
- Guizhou Power Grid Co., Ltd. (贵州电网有限责任公司)
Dates
- Publication Date
- 20260505
- Application Date
- 20251218
Claims (10)
- 1. A micro-patch vulnerability matching quantization method based on a self-encoder, characterized by comprising the following steps: obtaining multi-source semantic data, and parsing and uniformly representing the multi-source semantic data to generate a unified semantic representation with consistent dimensionality; inputting the unified semantic representation into a self-encoder model, generating a low-dimensional semantic embedding vector through the unsupervised feature learning and compression process of the self-encoder model, calculating the matching distance between a patch and a vulnerability using a distance metric function, and quantifying the degree of difference between the patch and the vulnerability in semantic space; and constructing multi-dimensional matching score indices from the matching distance, generating a comprehensive matching score through a weighted fusion mechanism, and ranking and screening the patch-vulnerability matching results according to the comprehensive matching score.
- 2. The self-encoder-based micro-patch vulnerability matching quantization method of claim 1, wherein the parsing and unified representation processing comprises: performing lexical analysis on the patch code and the vulnerability context code to generate a continuous sequence of lexical units (tokens), and converting the lexical unit sequence into a vector representation through an embedding layer; constructing an abstract syntax tree of the patch code, extracting the set of all paths from the root node to the leaf nodes, and converting the path set into a vector representation through an embedding layer; analyzing the control flow of the patch code, dividing it into basic blocks, extracting their execution sequence, and converting the basic blocks into a vector representation; and acquiring sentence-level semantic vectors, and concatenating the four vector representations to obtain a unified multi-dimensional semantic representation tensor; wherein, when a vulnerability report is received, the vulnerability description and the affected code fragments are extracted and processed together with the candidate patch code.
- 3. The self-encoder-based micro-patch vulnerability matching quantization method of claim 2, wherein generating the low-dimensional semantic embedding vector comprises: inputting the generated unified semantic representation into an encoder network, and mapping the unified semantic representation to a low-dimensional latent space to obtain the low-dimensional semantic embedding vector; constructing a decoder network matched with the encoder network, and reconstructing, through an inverse transformation process, an output whose dimension is consistent with that of the original unified semantic representation; and comparing the reconstructed output with the original unified semantic representation, constructing and optimizing a reconstruction loss function that aims to minimize their difference, introducing a regularization constraint term on the low-dimensional semantic embedding vector, and completing the self-encoder model training by iteratively optimizing the combined objective of the reconstruction loss function and the regularization constraint term.
- 4. The self-encoder-based micro-patch vulnerability matching quantization method of claim 3, wherein quantifying the degree of difference between the patch and the vulnerability in semantic space comprises: extracting the semantic embedding vector representing a patch sample and the semantic embedding vector representing a vulnerability sample, and calculating raw distance values between the two semantic embedding vectors under different criteria using basic distance metric algorithms; and designing a linear fusion strategy that assigns a different weight coefficient to each raw distance value, the weighted sum being defined as the matching distance.
- 5. The self-encoder-based micro-patch vulnerability matching quantization method of claim 4, wherein the self-encoder model comprises an encoder and a decoder; the encoder compresses the input into a low-dimensional vector through a multi-layer neural network with a ReLU activation function, thereby extracting features, and the decoder performs the inverse operation, decompressing the low-dimensional vector to restore the original input dimension; the encoder multi-layer neural network is formulated as: $f_{enc}(X_k) = \sigma\left(W^{enc}_{L} \cdots \sigma\left(W^{enc}_{1} X_k + b^{enc}\right) \cdots\right)$, wherein $f_{enc}$ is the encoder function, $X_k$ is the unified semantic input tensor of the $k$-th sample, $L$ is the total number of encoder layers, $W^{enc}_{L}$ is the weight matrix of encoder layer $L$, $W^{enc}_{1}$ is the weight matrix of encoder layer 1, $b^{enc}$ is the encoder bias vector, and $\sigma$ is the activation function, namely the ReLU activation function; the embedding vector output by the encoder is expressed as: $z_k = f_{enc}(X_k) \in \mathbb{R}^{d_z}$, wherein $z_k$ is the low-dimensional semantic embedding vector output by the encoder and $\mathbb{R}^{d_z}$ is the $d_z$-dimensional real vector space; the inverse operation performed by the decoder is formulated as: $\hat{X}_k = \sigma_{dec}\left(W^{dec}_{L} \cdots \sigma_{dec}\left(W^{dec}_{1} z_k + b^{dec}\right) \cdots\right)$, wherein $\hat{X}_k$ is the input tensor reconstructed by the decoder, $W^{dec}_{L}$ is the weight matrix of decoder layer $L$, $W^{dec}_{1}$ is the weight matrix of decoder layer 1, $\sigma_{dec}$ is the decoder activation function, $b^{dec}$ is the decoder bias vector, and the total number of decoder layers $L$ is the same as that of the encoder; in the training stage, the model is trained to minimize a total loss objective function comprising a reconstruction error and a regularization term, the relative weight of the two parts being adjusted through a hyperparameter; the objective function is expressed as: $\mathcal{L} = \frac{1}{N} \sum_{k=1}^{N} \left( \lVert X_k - \hat{X}_k \rVert_2^2 + \lambda \lVert z_k \rVert_2^2 \right)$, wherein $\mathcal{L}$ is the total loss function, $N$ is the total number of training samples, $k$ is the sample index, $\lVert \cdot \rVert_2^2$ is the square of the L2 norm, and $\lambda$ is the regularization coefficient.
- 6. The self-encoder-based micro-patch vulnerability matching quantization method of claim 5, wherein quantifying the degree of difference between the patch and the vulnerability in semantic space further comprises: inputting a patch sample and a vulnerability sample into the same encoder to obtain semantic embedding vectors that respectively represent the core features of the patch and of the vulnerability in semantic space; and calculating three distances in parallel, namely the Euclidean distance, the cosine distance and the Manhattan distance, and fusing them by weighting to calculate a fused matching distance value; the three distances are calculated as: $d_{euc}(i,j) = \lVert z_i - z_j \rVert_2$, $\cos(z_i, z_j) = \frac{\langle z_i, z_j \rangle}{\lVert z_i \rVert_2 \, \lVert z_j \rVert_2}$, $d_{cos}(i,j) = 1 - \cos(z_i, z_j)$, and $d_{man}(i,j) = \sum_{o=1}^{d_z} \lvert z_{i,o} - z_{j,o} \rvert$, wherein $d_{euc}$ is the Euclidean distance function, $i$ is the patch sample index, $j$ is the vulnerability sample index, $z_i$ is the low-dimensional semantic embedding vector of patch sample $i$, $z_j$ is the low-dimensional semantic embedding vector of vulnerability sample $j$, $\lVert \cdot \rVert_2$ is the L2 norm, $\langle \cdot , \cdot \rangle$ is the vector inner product, $\cos(z_i, z_j)$ is the cosine similarity, $d_{cos}$ is the cosine distance function, $d_{man}$ is the Manhattan distance function, $o$ is the dimension index, $d_z$ is the dimension of the embedding vectors, $z_{i,o}$ is the component of vector $z_i$ in the $o$-th dimension, and $z_{j,o}$ is the component of vector $z_j$ in the $o$-th dimension; the fused matching distance value is expressed as: $D(i,j) = \alpha \, d_{euc}(i,j) + \beta \, d_{cos}(i,j) + \gamma \, d_{man}(i,j)$, wherein $D(i,j)$ is the fused matching distance, $\alpha$ is the weight coefficient of the Euclidean distance, $\beta$ is the weight coefficient of the cosine distance, and $\gamma$ is the weight coefficient of the Manhattan distance.
- 7. The self-encoder-based micro-patch vulnerability matching quantization method of claim 6, wherein generating the comprehensive matching score comprises: converting the calculated matching distance into a base similarity score through an exponential decay function, introducing a structural consistency score and a context alignment score, and fusing the base similarity score, the structural consistency score and the context alignment score by weighting to generate the comprehensive matching score; the base similarity score is calculated as: $S_{base}(i,j) = \exp\left(-\mu \, D(i,j)\right)$, wherein $S_{base}$ is the base similarity score, $\exp(\cdot)$ is the exponential function, $D(i,j)$ is the fused matching distance, and $\mu$ is the decay rate parameter; the structural consistency score is calculated as: $S_{struct}(i,j) = \frac{\lvert P_i \cap P_j \rvert}{\lvert P_i \cup P_j \rvert}$, wherein $S_{struct}$ is the structural consistency score, $P_i$ is the abstract syntax tree path set of patch $i$, $P_j$ is the abstract syntax tree path set of vulnerability $j$, $\cap$ is the set intersection operator, and $\cup$ is the set union operator; the context alignment score $S_{ctx}(i,j)$ is calculated from $\mathrm{ED}(C_i, C_j)$, wherein $S_{ctx}$ is the context alignment score, $C_i$ is the control flow fragment sequence of patch $i$, $C_j$ is the control flow fragment sequence of vulnerability $j$, and $\mathrm{ED}(\cdot,\cdot)$ is the edit distance function; the comprehensive matching score is calculated as: $S(i,j) = w_1 \, S_{base}(i,j) + w_2 \, S_{struct}(i,j) + w_3 \, S_{ctx}(i,j)$, wherein $S(i,j)$ is the comprehensive matching score, $w_1$ is the weight coefficient of the base similarity score, $w_2$ is the weight coefficient of the structural consistency score, and $w_3$ is the weight coefficient of the context alignment score.
- 8. A micro-patch vulnerability matching quantization system based on a self-encoder, applied to the self-encoder-based micro-patch vulnerability matching quantization method of any one of claims 1 to 7, characterized by comprising a semantic data processing module, a semantic feature learning and compression module, a multi-dimensional matching distance quantization module and a comprehensive matching scoring module; the semantic data processing module is used for receiving raw data in three different modalities, converting code into token sequences through a lexical analyzer, extracting the abstract syntax tree path set through a syntax analyzer, obtaining the basic block execution sequence through control flow analysis, calling a pre-trained language model to semantically encode the text description, normalizing the dimensions of the features, and concatenating and fusing them to generate a semantic representation tensor; the semantic feature learning and compression module is used for generating a semantic embedding vector and reconstructing the original input with an encoder-decoder structure, taking minimum reconstruction error as the training objective and introducing an L2 regularization constraint on the embedding vector, so that, by balancing reconstruction accuracy against feature sparsity, the low-dimensional features represent the essential semantics of patches and vulnerabilities; the multi-dimensional matching distance quantization module is used for mapping patch and vulnerability samples into the same low-dimensional semantic space with the trained encoder, calculating the Euclidean distance, cosine distance and Manhattan distance under a multi-angle distance metric strategy, and fusing the three distances by weighting with preset weight coefficients into a unified matching distance value; the comprehensive matching scoring module is used for converting the matching distance into a base similarity score through an exponential decay function, extracting structural features from the raw data, calculating a structural consistency score based on the Jaccard similarity of AST paths and a context alignment score based on the edit distance of control flow sequences, fusing the scores of the three dimensions by weighting with adjustable weight coefficients into a comprehensive matching score, and, for each vulnerability, calculating the comprehensive scores of all candidate patches and sorting them in descending order to generate a priority recommendation list.
- 9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the self-encoder based micro-patch vulnerability matching quantification method of any one of claims 1 to 7 when the computer program is executed.
- 10. A computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the steps of the self-encoder based micro-patch vulnerability matching quantification method of any of claims 1 to 7.
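The encoder, decoder and combined objective described in claim 5 can be illustrated with a small pure-Python sketch. It is only a sketch under stated assumptions: one bias vector per layer, ReLU in the decoder as well as the encoder, layer sizes 8-5-3, random untrained weights and the coefficient `lam` are illustrative choices, and the names `build`, `forward` and `total_loss` are hypothetical. The code evaluates the loss once and does not perform training.

```python
import random


def init_layer(n_in, n_out, rng):
    """Weights as n_out rows of n_in values, plus one bias per output unit."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build(sizes, rng):
    """Build a stack of layers, e.g. sizes [8, 5, 3] gives 8->5 and 5->3."""
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]


def relu_layer(x, weights, bias):
    """Affine transform followed by ReLU (assumed for every layer)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def forward(x, layers):
    for weights, bias in layers:
        x = relu_layer(x, weights, bias)
    return x


def total_loss(samples, encoder, decoder, lam=1e-3):
    """Mean over samples of squared reconstruction error + lam * ||z||^2."""
    total = 0.0
    for x in samples:
        z = forward(x, encoder)        # low-dimensional embedding
        x_hat = forward(z, decoder)    # reconstruction at input dimension
        recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
        reg = sum(v * v for v in z)
        total += recon + lam * reg
    return total / len(samples)


rng = random.Random(0)
encoder = build([8, 5, 3], rng)   # compress 8 -> 3
decoder = build([3, 5, 8], rng)   # mirror of the encoder, 3 -> 8
samples = [[rng.random() for _ in range(8)] for _ in range(4)]
z = forward(samples[0], encoder)
x_hat = forward(z, decoder)
loss = total_loss(samples, encoder, decoder)
```

A larger `lam` penalizes large embedding norms more strongly, which is the trade-off between reconstruction accuracy and regularization that the claim assigns to the hyperparameter.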
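The structural consistency and context alignment scores of claims 7 and 8 can likewise be sketched in Python. The Jaccard similarity over AST path sets follows claim 8 directly. The claims state only that the context alignment score is based on the edit distance of control flow sequences, so the normalization used here (one minus the edit distance divided by the longer sequence length) is an assumption, as are the function names, the treatment of empty inputs and the example path and block labels.

```python
def jaccard(paths_a, paths_b):
    """Jaccard similarity of two AST path collections, in [0, 1]."""
    a, b = set(paths_a), set(paths_b)
    if not a and not b:
        return 1.0  # assumption: two empty path sets count as identical
    return len(a & b) / len(a | b)


def edit_distance(seq_a, seq_b):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(seq_b) + 1))
    for i, x in enumerate(seq_a, 1):
        cur = [i]
        for j, y in enumerate(seq_b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def context_alignment(seq_a, seq_b):
    """Assumed normalization: 1 - edit distance / longer sequence length."""
    longest = max(len(seq_a), len(seq_b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(seq_a, seq_b) / longest


# Hypothetical AST paths and basic-block sequences for one patch/vulnerability pair.
patch_paths = {"If>Cond>Name", "If>Body>Call", "Return>Name"}
vuln_paths = {"If>Body>Call", "Return>Name", "Assign>Subscript"}
patch_blocks = ["B1", "B2", "B3", "B4"]
vuln_blocks = ["B1", "B3", "B4"]

struct_score = jaccard(patch_paths, vuln_paths)
ctx_score = context_alignment(patch_blocks, vuln_blocks)
```

Here two of the four distinct paths are shared, and the block sequences differ by a single deletion out of four positions.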
Description
Micro patch vulnerability matching quantization method, system, equipment and medium based on self-encoder
Technical Field
The invention relates to the technical field of intelligent code analysis, and in particular to a micro-patch vulnerability matching quantization method, system, equipment and medium based on a self-encoder.
Background
Traditional vulnerability repair relies on manual analysis; it is inefficient and slow to respond, and struggles to cope with security threats that break out at large scale. Automatic patch recommendation technology emerged in response, aiming to quickly locate and recommend, from a massive patch library, a repair scheme suited to a specific vulnerability. Early research focused mainly on text-based matching, for example calculating the similarity between vulnerability reports and patch descriptions through keyword extraction or natural language processing. These methods, however, ignore the inherent structure and semantics of code, so their matching precision is limited. In recent years, with the development of deep learning, researchers have begun to use code representation learning to capture the deep association between patches and vulnerabilities: neural network models extract features and compute similarity from structured information such as the abstract syntax tree (AST) and control flow graph (CFG), or from serialized code, which has markedly improved the accuracy and degree of automation of matching.
Although the prior art has made significant progress in code feature extraction, it still has several shortcomings in the accuracy and comprehensiveness of quantitative matching. First, most methods rely on a single or limited distance criterion, using only cosine similarity or only Euclidean distance, and therefore cannot describe the relationship between semantic embedding vectors from multiple perspectives such as geometry, direction and absolute difference, which leaves the matching results insufficiently robust. Second, the prior art often depends too heavily on a single form of code representation and cannot effectively fuse multi-source heterogeneous information such as lexical, syntactic, structural and control flow features, so the model has difficulty capturing the hidden, deep logical correspondence between patches and complex or variant vulnerabilities. Third, the raw distance or similarity is used directly for ranking, and there is no scoring system that integrates multi-dimensional features and fuses them in a normalized way, so the final recommendation results lack intuitiveness and interpretability. Together, these limitations make it difficult for the prior art to meet the stringent requirements for high-accuracy, high-reliability matching in practical applications, particularly when handling complex patches and large-scale matching tasks.
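The first limitation, that a single distance criterion captures only one aspect of the relationship between two embedding vectors, can be seen in a toy example. The following Python sketch uses made-up two-dimensional vectors rather than real embeddings, and its function and variable names are illustrative: one vector points in the same direction as the query but is three times longer, and the other is close in position but points in a different direction.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


query = (1.0, 1.0)
same_direction = (3.0, 3.0)  # parallel to the query, larger magnitude
nearby = (1.0, 0.0)          # close in position, different direction

# Euclidean and Manhattan distance prefer `nearby`; cosine distance
# prefers `same_direction`, so the two criteria rank the pair differently.
closest_by_euclidean = min((same_direction, nearby), key=lambda v: euclidean(query, v))
closest_by_manhattan = min((same_direction, nearby), key=lambda v: manhattan(query, v))
closest_by_cosine = min((same_direction, nearby), key=lambda v: cosine_distance(query, v))
```

Because the criteria disagree on which vector is the better match, a ranking built on only one of them depends on which aspect of the embedding that criterion happens to measure.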
Disclosure of Invention
In view of the existing problems, the invention provides a micro-patch vulnerability matching quantization method, system, equipment and medium based on a self-encoder, to solve the problems in the prior art that matching results are insufficiently robust, that the hidden, deep logical correspondence between patches is difficult to capture, and that the final recommendation results lack intuitiveness and interpretability.
To solve the above technical problems, a micro-patch vulnerability matching quantization method based on a self-encoder is provided, comprising the following steps: obtaining multi-source semantic data, and parsing and uniformly representing it to generate a unified semantic representation with consistent dimensionality; inputting the unified semantic representation into a self-encoder model, and generating a low-dimensional semantic embedding vector through the unsupervised feature learning and compression process of the self-encoder model; calculating the matching distance between patches and vulnerabilities using a distance metric function, and quantifying their degree of difference in semantic space; constructing multi-dimensional matching score indices from the matching distance, and generating a comprehensive matching score through a weighted fusion mechanism; and ranking and screening the patch-vulnerability matching results according to the comprehensive matching score.
The parsing and unified representation processing comprises performing lexical analysis on the patch code and the vulnerability context code to generate continuous lexical unit sequences, and converting the continuous lexical unit se