CN-122020674-A - Software supply chain risk identification method, system and computer readable storage medium

Abstract

The invention relates to a software supply chain risk identification method, system and computer readable storage medium. Component data in a software supply chain is collected, and features are extracted from it to form graph structural features, time sequence features and text semantic features respectively. A first score is output by a graph attention network that integrates a multi-head attention mechanism and a time attenuation factor. A time sequence embedding and a structure embedding are spliced to generate a behavior embedding, and contrastive learning on the behavior corresponding to the behavior embedding generates a second score. License text is encoded and classified, conflict rules are applied to obtain an inference result, and this result is combined with the conflict probability output by a graph neural network to generate a third score. The first, second and third scores are weighted and fused into a comprehensive score, and risks are graded according to that score. The method offers comprehensive detection, high accuracy, strong interpretability, support for privacy protection, and integration into CI/CD pipelines.

Inventors

XIE YIJIA
GUO YUXIN
ZHANG LIANG
ZHU CHONG
LIN JIANHONG
YU WENFEI
SHEN MURONG

Assignees

浙江鹏信信息科技股份有限公司

Dates

Publication Date: 20260512
Application Date: 20260410

Claims (10)

1. A software supply chain risk identification method, comprising the steps of: S1, collecting component data in a software supply chain; S2, extracting features of the component data to form graph structural features, time sequence features and text semantic features respectively; S3, generating a relationship map to be identified based on fusion of the graph structural features, time sequence features and text semantic features, inputting the relationship map to be identified into a graph attention network fused with a multi-head attention mechanism and a time attenuation factor, and outputting a risk score as a first score; generating a time sequence embedding and a structure embedding based on the time sequence features and the graph structural features respectively, generating a behavior embedding after splicing the time sequence embedding and the structure embedding, and performing contrastive learning on the behavior corresponding to the behavior embedding to generate a second score; determining license categories based on license text in the text semantic features, and carrying out rule reasoning by matching the license categories against the corresponding conflict rule base to obtain a conflict reasoning result; S4, carrying out weighted fusion of the first score, the second score and the third score to obtain a comprehensive score; and S5, classifying risk levels according to the comprehensive score.
2. The method of claim 1, wherein, in the graph attention network that fuses the multi-head attention mechanism and the time attenuation factor, components are the nodes of the graph structure; if the update time of component j is earlier than that of component i, a time attenuation factor is introduced to update the attention coefficient: [formula missing from the source text]; where the formula is expressed in terms of the attenuation coefficient, the update timestamps of component i and component j, and the attention coefficient of the node corresponding to component i with respect to the node corresponding to component j.
3. The method according to claim 1, wherein in step S3: the time sequence features are input into a Transformer encoder, which outputs the time sequence embedding; the graph structural features are input into a graph convolutional network, which outputs the structure embedding; the time sequence embedding and the structure embedding are spliced and input into a fully connected layer, which outputs the behavior embedding; and contrastive learning is performed on the behavior corresponding to the behavior embedding, the distance from the behavior embedding to the normal behavior center is calculated, and it is judged whether the distance is greater than the similarity value corresponding to maximizing positive samples and minimizing negative samples; if so, the distance is output as the second score, and otherwise the second score is output as zero.
4. The software supply chain risk identification method according to claim 1, wherein in step S3: the license text in the text semantic features is input into a BERT model for encoding and license type classification, which outputs the license type; the license knowledge graph in the dependency graph is input into a graph neural network, which outputs a conflict probability; and if only the graph neural network judges that a conflict exists, the product of the conflict probability and a preset weight is output as the third score.
5. The software supply chain risk identification method of claim 1, wherein the composite score is: $S = w_1 S_1 + w_2 S_2 + w_3 S_3$; where $S_1$ is the first score, $S_2$ is the second score, $S_3$ is the third score, and $w_1$, $w_2$, $w_3$ are the corresponding weight coefficients, subject to a constraint [missing from the source text].
6. The software supply chain risk identification method of claim 1, wherein classifying risk levels according to the composite score comprises: if the composite score is greater than 0.8, the risk level is classified as high risk; if the composite score is greater than or equal to 0.5 and less than or equal to 0.8, the risk level is classified as medium risk; and if the composite score is less than 0.5, the risk level is classified as low risk.
7. The software supply chain risk identification method of claim 6, further comprising pushing reminder information according to the risk level: if the risk level is high risk, the reminder information is a recommendation to prohibit use; if the risk level is medium risk, the reminder information is a recommendation to review and replace; and if the risk level is low risk, the reminder information is that use is acceptable.
8. A software supply chain risk identification system applying the software supply chain risk identification method of any one of claims 1-7, wherein the software supply chain risk identification system comprises: a data acquisition module, used for collecting component data in a software supply chain; a feature extraction module, used for extracting features of the component data to form graph structural features, time sequence features and text semantic features respectively; a multidimensional risk control algorithm module, used for generating a relationship map to be identified based on fusion of the graph structural features, time sequence features and text semantic features, inputting the relationship map to be identified into a graph attention network fused with a multi-head attention mechanism and a time attenuation factor, and outputting a risk score as a first score; the multidimensional risk control algorithm module being further used for generating a time sequence embedding and a structure embedding based on the time sequence features and the graph structural features respectively, generating a behavior embedding after splicing the time sequence embedding and the structure embedding, and performing contrastive learning on the behavior corresponding to the behavior embedding to generate a second score; the multidimensional risk control algorithm module being further used for determining license categories based on license text in the text semantic features, and carrying out rule reasoning by matching the license categories against the corresponding conflict rule base to obtain a conflict reasoning result; and a risk fusion and decision module, used for carrying out weighted fusion of the first score, the second score and the third score and outputting a risk level.
9. The software supply chain risk identification system of claim 8, further comprising: a visualization and alarm module, used for displaying the reminder information pushed according to the risk level and triggering an alarm.
10. A computer readable storage medium having instructions stored therein, which when run on a computer, cause the computer to perform the software supply chain risk identification method of any one of claims 1-7.
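The fusion, grading and reminder steps of claims 5 to 7 can be sketched roughly as follows. This is only an illustrative sketch: the function names, the `REMINDERS` table and the default weights (0.5, 0.25, 0.25) are assumptions introduced here, since the claims state only that the three scores are combined with weight coefficients. The thresholds 0.8 and 0.5 are the ones given in claim 6.

```python
# Hypothetical default weights; the claims do not give concrete values.
DEFAULT_WEIGHTS = (0.5, 0.25, 0.25)

# Reminder text per risk level, paraphrasing claim 7.
REMINDERS = {
    "high": "recommend prohibiting use",
    "medium": "recommend review and replacement",
    "low": "acceptable to use",
}


def composite_score(first, second, third, weights=DEFAULT_WEIGHTS):
    """Weighted fusion of the three scores (claim 5), assumed to be a plain weighted sum."""
    w1, w2, w3 = weights
    return w1 * first + w2 * second + w3 * third


def risk_level(score):
    """Map a composite score to a risk level using the thresholds of claim 6."""
    if score > 0.8:
        return "high"
    if score >= 0.5:  # 0.5 <= score <= 0.8
        return "medium"
    return "low"


def reminder_for(score):
    """Return the reminder text pushed for a composite score (claim 7)."""
    return REMINDERS[risk_level(score)]
```

Note that a composite score of exactly 0.8 falls in the medium band, because claim 6 requires a score strictly greater than 0.8 for high risk.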

Description

Software supply chain risk identification method, system and computer readable storage medium

Technical Field

The invention belongs to the technical field of information security and software supply chain management, and particularly relates to a method and a system for identifying risks of a software supply chain based on a multidimensional algorithm, and to a computer readable storage medium.

Background

With the increasing complexity of software development models, particularly the widespread use of open source components, third party libraries and cloud services, the software supply chain faces growing security and compliance risks. For example, malicious code injection, tampering with dependency libraries, license conflicts and vulnerability propagation occur frequently, bringing serious security threats and legal risks to enterprises.

In the prior art, some systems exist for scanning and detecting components in a software supply chain, but the following problems are common:

The detection dimension is single: most systems rely only on static scanning or vulnerability database matching, and lack comprehensive evaluation across multiple dimensions such as behavior characteristics, dependency relationships and source credibility.

The risk assessment mechanism lags: it lacks dynamic assessment capability and is difficult to adapt to a rapidly changing supply chain environment.

The false alarm rate is high: traditional rule matching mechanisms tend to generate a large number of false alarms, which reduces detection efficiency.

The algorithms are insufficiently intelligent: machine learning or deep learning is not used, so risk patterns are difficult to learn from historical data.

Thus, there is a need in the art for intelligent, dynamic, multi-dimensional risk identification and assessment of components in a software supply chain.
Disclosure of Invention

In view of the foregoing drawbacks and deficiencies of the prior art, the object of the present invention is to address at least one of those problems; in other words, to provide a software supply chain risk identification method, system and computer readable storage medium that meets one or more of the aforementioned needs.

In order to achieve this object, the invention adopts the following technical scheme:

A software supply chain risk identification method comprising the steps of:

S1, collecting component data in a software supply chain;

S2, extracting features of the component data to form graph structural features, time sequence features and text semantic features respectively;

S3, generating a relationship map to be identified based on fusion of the graph structural features, time sequence features and text semantic features, inputting the relationship map to be identified into a graph attention network fused with a multi-head attention mechanism and a time attenuation factor, and outputting a risk score as a first score; generating a time sequence embedding and a structure embedding based on the time sequence features and the graph structural features respectively, generating a behavior embedding after splicing the time sequence embedding and the structure embedding, and performing contrastive learning on the behavior corresponding to the behavior embedding to generate a second score; determining license categories based on license text in the text semantic features, and carrying out rule reasoning by matching the license categories against the corresponding conflict rule base to obtain a conflict reasoning result;

S4, carrying out weighted fusion of the first score, the second score and the third score to obtain a comprehensive score;

S5, classifying risk levels according to the comprehensive score.
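Two pieces of step S3 lend themselves to a small illustration: the time attenuation factor applied to an attention coefficient, and the thresholded distance used as the second score. The sketch below is hedged throughout. The exponential decay form, the Euclidean distance, and the names `decayed_attention`, `second_score`, `decay` and `threshold` are all assumptions made for illustration; the text available here does not reproduce the actual decay formula. What is taken from the claims is only that the decay applies when component j was updated earlier than component i (claim 2), and that the distance is output as the second score when it exceeds the threshold and zero otherwise (claim 3).

```python
import math


def decayed_attention(alpha_ij, t_i, t_j, decay=0.01):
    """Attenuate an attention coefficient when component j is older than component i.

    ASSUMPTION: exponential decay in the timestamp gap; the source formula is not available.
    """
    if t_j < t_i:
        return alpha_ij * math.exp(-decay * (t_i - t_j))
    return alpha_ij


def second_score(behavior_embedding, normal_center, threshold):
    """Return the distance to the normal behavior center if it exceeds the threshold, else zero.

    ASSUMPTION: Euclidean distance; `threshold` stands in for the similarity value
    obtained from contrastive learning.
    """
    distance = math.dist(behavior_embedding, normal_center)
    return distance if distance > threshold else 0.0
```

With this form, a larger gap between the two update timestamps gives a smaller attention coefficient, so stale dependencies contribute less to the first score. Other monotonically decreasing forms would serve the same purpose.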
Preferably, the component data includes a component name, a version number, a dependency graph, a behavior log, license text, and vulnerability information.

In a preferred scheme, in the graph attention network integrating the multi-head attention mechanism and the time attenuation factor, components are used as the nodes of the graph structure; if the update time of component j is earlier than that of component i, a time attenuation factor is introduced to update the attention coefficient: [formula missing from the source text]; where the formula is expressed in terms of the attenuation coefficient, the update timestamps of component i and component j, and the attention coefficient of the node corresponding to component i with respect to the node corresponding to component j.

In a preferred embodiment, in step S3, the time sequence features are input into a Transformer encoder, which outputs the time sequence embedding; the graph structural features are input into a graph convolutional network, which outputs the structure embedding; the time sequence embedding and the structure embedding are spliced and input into a fully connected layer, which outputs the behavior embedding; and performing cont