CN-122020651-A - Method and system for detecting malicious package of software supply chain based on file dependency graph


Abstract

The invention discloses a method and a system for detecting malicious packages in a software supply chain based on a file dependency graph, belonging to the technical field of information security. The method comprises: constructing a reference data set containing malicious and benign samples; extracting initial security semantic features of file nodes; constructing a code heterogeneous topological dependency graph; hierarchically dividing the edges and quantifying them into edge attribute vectors through an edge attribute embedding module, and fusing these vectors with the graph topology to achieve semantic enhancement; using a two-layer graph attention network as the backbone network to extract graph features; screening key subgraph features with a graph information bottleneck mechanism; constructing a classifier with a cost-sensitive learning strategy to discriminate malicious code from benign code; and adding an incremental differential detection step to improve detection efficiency. The system comprises a user interaction layer, a gateway and service layer, a core computing layer, an infrastructure and data layer, and a matching incremental differential detection module. The method improves the accuracy, robustness and efficiency of detection, and adapts to the high-frequency iteration of the NPM ecosystem.

Inventors

ZHAO ZIHAN
LI JIAN

Assignees

Beijing University of Posts and Telecommunications (北京邮电大学)

Dates

Publication Date: 20260512
Application Date: 20260210

Claims (10)

1. A method for detecting malicious packages in a software supply chain based on a file dependency graph, characterized by comprising the following steps: S1, constructing a reference data set comprising malicious samples and benign samples, and extracting features from the source code of all samples in the reference data set to obtain initial security semantic features of the file nodes in each sample; S2, constructing a code heterogeneous topological dependency graph with the files in a sample as nodes and the dependency references among the files as edges, introducing an edge attribute embedding module to hierarchically divide the edges and quantify them into edge attribute vectors, and fusing the edge attribute vectors with the graph topology to complete semantic enhancement; S3, adopting a two-layer graph attention network as the backbone network, extracting features of the heterogeneous topological dependency graph, capturing multi-dimensional node features and higher-order topological association information through a multi-layer attention mechanism, and obtaining graph features after nonlinear activation and regularization; S4, introducing a graph information bottleneck mechanism, dynamically evaluating the importance of each node and each edge in the graph to the classification task through a differentiable mask generator, screening key features and removing obfuscation noise to obtain purified key subgraph features; S5, constructing a classifier with a cost-sensitive learning strategy, and inputting the key subgraph features into the classifier to discriminate malicious code from benign code.
2. The method for detecting malicious packages in a software supply chain based on a file dependency graph according to claim 1, wherein in step S1, the malicious data set is drawn from real attack samples collected by DataDog in cloud-native monitoring scenarios and from the OpenSSF Malicious Packages dataset, and the benign data set is drawn from mainstream packages ranked highest by downloads in the official NPM registry.
3. The method for detecting malicious packages in a software supply chain based on a file dependency graph according to claim 1, wherein step S2 specifically comprises: performing abstract syntax tree analysis on the source code of the samples in the reference data set, and identifying all files in each sample and the dependency reference relations among the files; taking each file as an independent node of the heterogeneous topological dependency graph and the dependency reference relations among the files as edges of the graph, thereby constructing a basic topological structure; starting the edge attribute embedding module, and performing type identification and hierarchical division on each dependency reference edge in the graph using regular-expression-based static analysis; quantifying the dependency reference edges of each hierarchy level to generate the corresponding edge attribute vectors, which carry the semantic information of the corresponding dependency type; and deeply fusing the edge attribute vectors with the constructed basic topological structure through an injection operation to obtain a semantically complete code heterogeneous topological dependency graph, wherein the heterogeneous dependency graph is expressed mathematically as $G = (V, E, \phi, \psi)$, with $\phi: V \to A$ and $\psi: E \to R$, where $V$ is the node set, $E$ is the edge set, $\phi$ is the node type mapping function that maps a node to the node type set $A$, $\psi$ is the edge type mapping function that maps an edge to the relationship type set $R$, and $|A| + |R| > 2$.
4. The method for detecting malicious packages in a software supply chain based on a file dependency graph according to claim 1, wherein step S3 specifically comprises: Step 3.1, combining the semantically complete code heterogeneous topological dependency graph obtained in step S2 with the initial security semantic features of the file nodes extracted in step S1 to form the graph data to be input to the backbone network, the graph data comprising a node feature matrix and an adjacency matrix; Step 3.2, starting the first layer (Layer 1) of the two-layer graph attention network, wherein Layer 1 adopts a four-head parallel attention mechanism, the four attention heads correspond to different semantic subspaces and simultaneously capture the code obfuscation features, graph topology features, dependency relation features and node security semantic features in the graph data, thereby extracting multi-dimensional features in parallel; Step 3.3, aggregating the four-head attention features output by Layer 1 to obtain the fused features of Layer 1, and then applying ELU nonlinear activation and Dropout regularization to the fused features in sequence, so as to suppress overfitting and improve the nonlinear fitting capacity of the model; Step 3.4, inputting the fused features processed in step 3.3 into the second layer (Layer 2) of the two-layer graph attention network, wherein Layer 2 adopts a single-head attention mechanism, focuses on aggregating the higher-order topological association information of the nodes in the graph, and integrates local first-order neighbor features into a global code logic representation; Step 3.5, normalizing the global code logic representation output by Layer 2 to obtain the final graph features, wherein the graph features fully retain the multi-dimensional node semantics and the higher-order association information of the graph topology, and are used for the subsequent denoising and classification tasks.
5. The method for detecting malicious packages in a software supply chain based on a file dependency graph according to claim 1, wherein step S4 specifically comprises: Step 4.1, inputting the graph features obtained in step S3 into the graph information bottleneck mechanism and starting the differentiable mask generator, wherein the mask generator dynamically calculates an importance score for each node and each edge in the graph based on the graph features and the classification objective, the importance score being positively correlated with the contribution of the feature to the classification task, and the optimization objective of the graph information bottleneck mechanism is $\min_{G_{sub}} \; -I(G_{sub}, Y) + \beta \cdot I(G_{sub}, G)$, wherein $G_{sub}$ is the key subgraph to be mined, $G$ is the original heterogeneous dependency graph, $Y$ is the target label, $I(\cdot,\cdot)$ denotes mutual information, and $\beta$ is an adjustment coefficient; Step 4.2, setting an importance score threshold and screening the nodes and edges in the graph, retaining the nodes and edges whose importance scores are above the threshold and eliminating the redundant nodes, redundant edges and obfuscation noise whose importance scores are below the threshold; Step 4.3, optimizing the mask generator under the mutual information constraint, minimizing the mutual information between the screened subgraph and the original graph while maximizing the mutual information between the subgraph and the sample label, so that the screening process eliminates obfuscation noise without losing key classification features; Step 4.4, performing feature aggregation on the screened subgraph to obtain the purified key subgraph features, which are free of obfuscation interference and focus on the core semantic and topological information relevant to malicious code detection.
6. The method for detecting malicious packages in a software supply chain based on a file dependency graph according to claim 1, wherein step S5 specifically comprises: Step 5.1, constructing a classifier with a cost-sensitive learning strategy, based on the imbalance between the numbers of malicious and benign samples in the reference data set, wherein the classifier introduces an asymmetric weighting mechanism that assigns a higher penalty cost to misjudging a malicious sample than to misjudging a benign sample, and the loss function of the cost-sensitive learning strategy is as follows: [formula missing in the source], wherein $C_{1,0}$ represents the cost of misjudging a positive sample as a negative sample, $C_{0,1}$ represents the false-positive cost, [symbol missing in the source] is the base loss function, $y_i$ is the true label of a sample, and $f(x_i)$ is the model prediction; Step 5.2, compressing the purified key subgraph features obtained in step S4 into a fixed-dimension graph-level vector through a global pooling operation, and taking the graph-level vector as the input feature of the classifier; Step 5.3, training the classifier by minimizing the loss function with asymmetric penalty costs through a gradient descent algorithm, forcing the model to attend to the feature distribution of the minority-class malicious samples during training and optimizing the decision boundary of the model; Step 5.4, processing the NPM package to be detected through steps S1 to S4 to obtain its key subgraph features, and inputting the key subgraph features into the trained classifier, which outputs a judgment of whether the NPM package is malicious or benign, thereby achieving highly robust malicious code detection and effectively reducing the false-negative rate.
7. The method for detecting malicious packages in a software supply chain based on a file dependency graph, characterized by further comprising an incremental differential detection step, in which fingerprints of the software package files are calculated and the historical heterogeneous dependency graph is cached; when a new version of a software package is detected, the changed files are identified, only the changed parts and their affected neighborhoods are recalculated, and the resulting local graph structure is fused with the historical graph through subgraph fusion to achieve incremental detection.
8. A system for detecting malicious packages in a software supply chain based on a file dependency graph, characterized by comprising a user interaction layer, a gateway and service layer, a core computing layer, and an infrastructure and data layer, wherein: the core computing layer holds a pre-trained detection model corresponding to the detection method of any one of claims 1-7 and performs full-pipeline inference covering heterogeneous dependency graph construction, feature extraction, key subgraph mining and cost-sensitive classification; the user interaction layer provides detection task operation, threat intelligence presentation and attack path visualization; the gateway and service layer provides access, authentication, load balancing and task scheduling for detection requests; the core computing layer adopts a producer-consumer pattern and comprises static analysis nodes and inference nodes, which respectively perform syntax tree extraction and intermediate representation conversion of the source code, and inference computation of the detection model; and the infrastructure and data layer provides a containerized runtime environment and persistent data storage and retrieval for the system.
9. The system for detecting malicious packages in a software supply chain based on a file dependency graph according to claim 8, wherein the system further comprises an incremental differential detection module, which uniquely identifies software package files through a fingerprint calculation mechanism, identifies the files changed by a version update, recalculates only the changed part and its affected neighborhood, and splices the local graph structure onto the historical graph through subgraph fusion.
10. The system for detecting malicious packages in a software supply chain based on a file dependency graph according to claim 8, characterized by further comprising a front-end interaction module, a graph construction module, a core inference module and a state management module, wherein the front-end interaction module supports multi-source upload and visualization of detection results; the graph construction module performs syntax tree extraction, edge attribute embedding projection and heterogeneous graph tensor conversion; the core inference module encapsulates the detection model and executes full-pipeline algorithm inference; and the state management module decouples detection tasks asynchronously through a task queue, monitors the task life cycle in real time, and supports dynamic configuration and updating of sensitive API rules.
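The graph construction described in claim 3 can be illustrated with a small sketch. It is a simplification under stated assumptions: the claim identifies dependencies through abstract syntax tree analysis, whereas this sketch uses regular expressions for both discovery and edge typing; the three edge levels in `EDGE_TYPES`, the one-hot encoding, and all function names are hypothetical, since the claims do not list the concrete hierarchy or the quantization scheme.

```python
import posixpath
import re

# Hypothetical edge hierarchy; the claims do not enumerate the actual levels.
EDGE_TYPES = ["static_import", "commonjs_require", "dynamic_import"]

# Only relative references ("./x", "../x") are treated as file-to-file edges.
PATTERNS = {
    "static_import": re.compile(
        r"""^\s*import\s+(?:[\w*{}\s,]+\s+from\s+)?['"](\.{1,2}/[^'"]+)['"]""", re.M),
    "commonjs_require": re.compile(
        r"""\brequire\(\s*['"](\.{1,2}/[^'"]+)['"]\s*\)"""),
    "dynamic_import": re.compile(
        r"""\bimport\(\s*['"](\.{1,2}/[^'"]+)['"]\s*\)"""),
}

def one_hot(edge_type):
    """Quantify an edge type as a one-hot edge attribute vector (an assumption)."""
    return [1.0 if t == edge_type else 0.0 for t in EDGE_TYPES]

def resolve(src, target, files):
    """Resolve a relative reference to a file in the package, or None."""
    base = posixpath.normpath(posixpath.join(posixpath.dirname(src), target))
    for candidate in (base, base + ".js", posixpath.join(base, "index.js")):
        if candidate in files:
            return candidate
    return None

def build_graph(files):
    """files maps path -> source text. Returns (nodes, typed edges with attributes)."""
    nodes = sorted(files)
    edges = {}
    for src, code in files.items():
        for edge_type, pattern in PATTERNS.items():
            for match in pattern.finditer(code):
                dst = resolve(src, match.group(1), files)
                if dst is not None:
                    edges[(src, dst, edge_type)] = one_hot(edge_type)
    return nodes, edges

# Toy package used only for illustration.
package = {
    "index.js": "const a = require('./lib/a');\nimport('./lib/b.js');\n",
    "lib/a.js": "import helper from './b';\n",
    "lib/b.js": "module.exports = 1;\n",
}
nodes, edges = build_graph(package)
```

Each key of `edges` is a `(source file, target file, edge type)` triple, so the same pair of files can be linked by edges of different types, which is what makes the graph heterogeneous on the edge side.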
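The forward pass of the two-layer attention backbone in claim 4 can be sketched in plain Python. This is a minimal illustration, not the patented model: the weights are random and untrained, the four heads of Layer 1 are fused by concatenation (the claim only says "aggregating"), the attention score follows the common graph-attention form with a LeakyReLU, the final normalization is a per-feature z-score, and Dropout is omitted because it only applies during training. All names and dimensions are assumptions.

```python
import math
import random

def elu(x):
    return x if x > 0 else math.exp(x) - 1.0

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def matvec(W, x):
    # W has shape (out_dim, in_dim); returns the product W x.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def attention_head(h, neighbors, W, a):
    """One attention head: each node aggregates itself and its neighbors."""
    z = [matvec(W, x) for x in h]
    out = []
    for i, nbrs in enumerate(neighbors):
        idx = sorted(set(nbrs) | {i})  # add a self-loop
        # Score e_ij = LeakyReLU(a . [z_i || z_j]); list "+" concatenates.
        scores = [leaky_relu(sum(p * q for p, q in zip(a, z[i] + z[j]))) for j in idx]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        out.append([sum(e / total * z[j][k] for e, j in zip(exps, idx))
                    for k in range(len(z[i]))])
    return out

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def standardize(h, eps=1e-9):
    """Per-feature z-score across nodes (assumed form of the normalization)."""
    cols = list(zip(*h))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) for c, m in zip(cols, means)]
    return [[(v - m) / (s + eps) for v, m, s in zip(row, means, stds)] for row in h]

def two_layer_gat(h, neighbors, hidden=4, out_dim=2, heads=4, seed=0):
    rng = random.Random(seed)
    in_dim = len(h[0])
    # Layer 1: four heads in parallel, concatenated, then ELU.
    head_outs = []
    for _ in range(heads):
        W = rand_matrix(rng, hidden, in_dim)
        a = [rng.uniform(-0.5, 0.5) for _ in range(2 * hidden)]
        head_outs.append(attention_head(h, neighbors, W, a))
    h1 = [[elu(v) for head in head_outs for v in head[i]] for i in range(len(h))]
    # Layer 2: a single head over the fused features, then normalization.
    W2 = rand_matrix(rng, out_dim, hidden * heads)
    a2 = [rng.uniform(-0.5, 0.5) for _ in range(2 * out_dim)]
    return standardize(attention_head(h1, neighbors, W2, a2))

# Three file nodes with toy 3-dimensional features; neighbors[i] lists the
# nodes that node i aggregates from.
features = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.2], [0.3, 0.3, 0.9]]
neighbors = [[1, 2], [2], []]
graph_features = two_layer_gat(features, neighbors)
```

Because Layer 2 runs on the output of Layer 1, a node's final representation can depend on nodes two hops away, which is one way to read the claim's "higher-order topological association information".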
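Step 4.2 of claim 5 (threshold screening) is simple enough to sketch directly. The sketch assumes the importance scores have already been produced by a trained mask generator and lie in [0, 1]; training that generator against the mutual-information objective is not shown. The function name, the threshold value, and the rule that an edge survives only if both endpoints survive are assumptions made for illustration.

```python
def screen_subgraph(node_scores, edge_scores, threshold=0.5):
    """Keep nodes and edges whose importance score exceeds the threshold.

    An edge is additionally dropped when either endpoint was dropped, so the
    result is always a well-formed subgraph (an assumption of this sketch).
    """
    kept_nodes = {n for n, s in node_scores.items() if s > threshold}
    kept_edges = {
        (u, v)
        for (u, v), s in edge_scores.items()
        if s > threshold and u in kept_nodes and v in kept_nodes
    }
    return kept_nodes, kept_edges

# Toy scores, as if emitted by the mask generator.
node_scores = {"index.js": 0.9, "lib/a.js": 0.2, "lib/b.js": 0.7}
edge_scores = {
    ("index.js", "lib/a.js"): 0.8,
    ("index.js", "lib/b.js"): 0.6,
    ("lib/a.js", "lib/b.js"): 0.9,
}
kept_nodes, kept_edges = screen_subgraph(node_scores, edge_scores)
```

In this toy input the edge from `lib/a.js` to `lib/b.js` has a high score but is still removed, because `lib/a.js` itself falls below the threshold.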
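The loss formula of claim 6 is not legible in this text, so the following sketch shows one common cost-sensitive form rather than the patent's own expression: a binary cross-entropy whose two terms are weighted by the misclassification costs. The parameter names `c_fn` (playing the role of $C_{1,0}$) and `c_fp` (playing the role of $C_{0,1}$), the default values, and the choice of cross-entropy as the base loss are all assumptions.

```python
import math

def cost_sensitive_bce(y_true, y_prob, c_fn=5.0, c_fp=1.0, eps=1e-12):
    """Weighted binary cross-entropy, averaged over samples.

    y_true: labels, 1 = malicious (positive), 0 = benign.
    y_prob: predicted probability of the malicious class.
    c_fn:   cost of predicting a malicious sample as benign.
    c_fp:   cost of predicting a benign sample as malicious.
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(c_fn * y * math.log(p) + c_fp * (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# A missed malicious package versus an equally confident false alarm.
missed_malicious = cost_sensitive_bce([1], [0.1])
false_alarm = cost_sensitive_bce([0], [0.9])
```

With `c_fn` five times `c_fp`, the missed malicious sample contributes five times the loss of the equally confident false alarm, which pushes gradient descent toward a decision boundary that favors recall on the minority class.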
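The incremental differential detection of claims 7 and 9 can be sketched as three small helpers. The sketch assumes SHA-256 content hashes as file fingerprints, a one-hop neighborhood as the "affected neighborhood", and a fusion rule that replaces the outgoing edges of changed files while keeping the cached remainder; none of these specifics is stated in the claims, and all function names are hypothetical.

```python
import hashlib

def fingerprints(files):
    """Map each file path to a SHA-256 digest of its content."""
    return {p: hashlib.sha256(src.encode("utf-8")).hexdigest() for p, src in files.items()}

def changed_files(old_fp, new_fp):
    """Files that were added, removed, or modified between two versions."""
    return {p for p in set(old_fp) | set(new_fp) if old_fp.get(p) != new_fp.get(p)}

def affected_region(changed, edges, hops=1):
    """Changed files plus every file within `hops` edges of them (either direction)."""
    region, frontier = set(changed), set(changed)
    for _ in range(hops):
        reached = set()
        for u, v in edges:
            if u in frontier:
                reached.add(v)
            if v in frontier:
                reached.add(u)
        frontier = reached - region
        region |= reached
    return region

def fuse(cached_edges, local_edges, changed, current_files):
    """Replace the outgoing edges of changed files; keep the rest of the cached graph."""
    kept = {(u, v) for u, v in cached_edges
            if u not in changed and u in current_files and v in current_files}
    return kept | set(local_edges)

# Version 1 and its cached graph.
v1 = {"index.js": "a", "lib/a.js": "b", "lib/b.js": "c", "lib/c.js": "d"}
cached_edges = {("index.js", "lib/a.js"), ("lib/a.js", "lib/b.js"),
                ("index.js", "lib/b.js"), ("lib/b.js", "lib/c.js")}

# Version 2 modifies one file, which now references lib/c.js instead of lib/b.js.
v2 = dict(v1, **{"lib/a.js": "b2"})
changed = changed_files(fingerprints(v1), fingerprints(v2))
region = affected_region(changed, cached_edges)
local_edges = {("lib/a.js", "lib/c.js")}  # result of re-analysing only lib/a.js
fused = fuse(cached_edges, local_edges, changed, set(v2))
```

Only `lib/a.js` is re-parsed here; features would be recomputed for the files in `region`, and every other file reuses the cached graph and features.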

Description

Method and system for detecting malicious packages in a software supply chain based on a file dependency graph

Technical Field

The invention relates to the technical field of information security, in particular to a method and a system for detecting malicious packages in a software supply chain based on a file dependency graph.

Background

With the rapid expansion of open source software ecosystems, software supply chain security has become a key security issue for global digital infrastructure. NPM (Node Package Manager), the world's largest software registry, hosts millions of software packages and underpins the core architecture of modern Web development. Its openness and interconnectivity improve development efficiency, but also make it a prime target for network attacks. In recent years, attackers have gradually shifted from traditional vulnerability exploitation to deliberately implanting malicious code, and attack techniques have become more complex and covert. For example, malicious scripts injected into popular packages through phishing attacks can affect millions of downloads in a short time, posing a serious threat to software supply chain security.

Current malicious code detection methods fall into two main types: dynamic detection and static detection. Dynamic detection runs the code in a controlled environment such as a sandbox or virtual machine and identifies malicious attacks from the extracted execution behavior features. Static detection does not need to run the code; it either extracts static features and matches them against known malicious code features, or builds a deep learning detection model on structures such as abstract syntax trees and control flow graphs.
Traditional detection methods have obvious shortcomings. Static detection is easily evaded by obfuscation techniques such as string encryption and control flow flattening. Dynamic detection is often defeated by malware countermeasures such as sandbox environment detection, which makes the analysis results unreliable. Traditional signature-based static scanning and text keyword matching are increasingly ineffective against advanced obfuscation. Especially in cloud-native monitoring scenarios, attackers break the syntactic structure of the code or disguise its textual semantics, so that detection models relying on syntax tree traversal and natural language processing struggle to extract effective features; malicious payloads are wrapped in large amounts of benign code or noise logic, which leads to an extremely high false-negative rate. Although some research has attempted to convert code into graph structures and detect them with graph neural networks, a single graph structure can hardly describe the multi-dimensional interactions among software entities completely, and the graph data suffers from noise redundancy, missing node features, and class imbalance between malicious and benign samples. As a result, detection accuracy and robustness are insufficient and cannot meet the practical requirements of malicious code detection in the NPM supply chain.

Disclosure of Invention

In view of the shortcomings of the prior art, the invention discloses a method and a system for detecting malicious packages in a software supply chain based on a file dependency graph, which solve the problems described in the Background.
To achieve the above purpose, the invention provides a method for detecting malicious packages in a software supply chain based on a file dependency graph, comprising the following steps: S1, constructing a reference data set comprising malicious samples and benign samples, and extracting features from the source code of all samples in the reference data set with a static security analysis tool adapted to the JavaScript/Node.js environment to obtain initial security semantic features of the file nodes in each sample, wherein, because heterogeneous graph modeling is adopted, the amount of data actually processed grows by multiples at the node level; S2, constructing a code heterogeneous topological dependency graph with the files in a sample as nodes and the dependency references among the files as edges, introducing an edge attribute embedding module to hierarchically divide the edges and quantify them into edge attribute vectors, and fusing the edge attribute vectors with the graph topology to complete semantic enhancement; S3, adopting a two-layer graph attention network as the backbone network, extracting features of the semantically enhanced heterogeneous topological dependency graph obtained in step S2, capturing multi-dimensional node features and higher-order topological association information through a multi-layer attention mechanis