CN-117688188-B - Construction method of intelligent power grid network security knowledge graph

Abstract

The invention discloses a method for constructing a smart grid network security knowledge graph. First, data comprising structured, semi-structured and unstructured data are crawled, an ontology model is constructed, and an initial schema layer (mode layer) of the knowledge graph is built from the top down by analyzing the information types contained in the structured data in combination with existing ontology models and expert experience. Next, according to the characteristics of the different semi-structured and unstructured texts, a rule-based or deep-learning-based knowledge extraction method is used to complete data annotation, entity extraction and relation construction. After knowledge fusion, a data layer is constructed under the guidance of the initial schema layer, and knowledge updating is then carried out, covering both the schema layer and the data layer. By training on a manually annotated small-scale mixed-language data set, the invention achieves performance superior to that of the baseline model and realizes entity extraction from power grid network security texts, thereby serving the construction of the knowledge graph.

Inventors

Du Ye
Peng Zhen
Zheng Tianshuai
Chen Qifang
Lu Siyang
Li Meihong

Assignees

Beijing Jiaotong University (北京交通大学)

Dates

Publication Date: 20260508
Application Date: 20231208

Claims (8)

1. A method for constructing a smart grid network security knowledge graph, characterized by comprising the following steps:
Step 1: crawling data, including structured data, semi-structured data and unstructured data; constructing an ontology model; and constructing an initial schema layer (mode layer) of the knowledge graph from the top down by analyzing the information types contained in the structured data in combination with existing ontology models and expert experience;
Step 2: according to the characteristics of the different semi-structured and unstructured texts, adopting a rule-based or deep-learning-based knowledge extraction method to complete data annotation, entity extraction and relation construction;
Step 3: after knowledge fusion, constructing a data layer under the guidance of the initial schema layer;
wherein in Step 2, the rule-based knowledge extraction method is used to process the semi-structured data and the deep learning method is used to process the unstructured data, and a DA-XLMR-BiLSTM-FC-CRF model with a five-layer architecture is constructed, the model comprising five parts: a data augmentation (DA) layer, an XLMR layer, a BiLSTM layer, a feature concatenation (FC) layer and a CRF layer;
in the data augmentation DA layer, the data augmentation method is divided into a training stage and a generation stage; in the training stage, label information is inserted before and after each entity word to mark its position and type, the entity words are then randomly masked using a whole-word masking strategy, and the result is fed into a pre-trained masked language model (MLM) for fine-tuning, so that the fine-tuned MLM can predict entity words that fit the context; in the generation stage, the same label insertion and random masking as in the training stage are applied to the original annotated corpus, and the result is fed into the fine-tuned MLM to obtain sentences in which the entity words have been replaced;
in the XLMR layer, the text is first segmented by a Tokenizer tool; after the text has been split into sub-word tokens, the special symbols "<s>" and "</s>" identifying the beginning and end of the sentence are added, and a sub-word token sequence T is formed through dictionary mapping; each sub-word token in the sequence T is then embedded by the XLMR model to obtain a word vector sequence E = {E_1, E_2, E_3, …, E_n}, wherein E_i is the vector representation of the i-th sub-word token and each word vector has 768 dimensions.
2. The method for constructing a smart grid network security knowledge graph according to claim 1, wherein, when the ontology model is constructed, the ontology is divided into a network security domain ontology and an electric power domain ontology, and the smart grid network security knowledge graph ontology model is constructed using the seven-step method.
3. The method for constructing a smart grid network security knowledge graph according to claim 1, wherein the data augmentation algorithm adopted by the data augmentation DA layer is replaced by a synonym replacement, same-label word replacement or no-label word replacement algorithm.
4. The method for constructing a smart grid network security knowledge graph according to claim 1, wherein after the word vector sequence E is input to the BiLSTM layer, a hidden vector sequence h_L = {h_L1, h_L2, h_L3, …, h_Ln} is obtained by the forward LSTM and a hidden vector sequence h_R = {h_R1, h_R2, h_R3, …, h_Rn} is obtained by the backward LSTM; finally, h_L and h_R are concatenated to obtain the hidden layer sequence {[h_L1, h_R1], [h_L2, h_R2], [h_L3, h_R3], …, [h_Ln, h_Rn]}, namely the BiLSTM layer output h = {h_1, h_2, h_3, …, h_n}.
5. The method for constructing a smart grid network security knowledge graph according to claim 1, wherein the feature concatenation FC layer concatenates the XLMR layer output E with the BiLSTM layer output h to obtain an output vector sequence H = {[E_1, h_1], [E_2, h_2], [E_3, h_3], …, [E_n, h_n]}, and then converts H into a score sequence P = {P_1, P_2, P_3, …, P_n}, wherein the dimension of P_i equals the number of entity label types and P_ij is the score of the i-th sub-word token being classified as the j-th entity label.
6. The method for constructing a smart grid network security knowledge graph according to claim 1, wherein the score sequence P is input into the CRF layer as the emission scores; the CRF layer trains a transition matrix M, in which the element M_ij is the transition score of the current label type being j given that the previous label type is i; the CRF layer computes a loss function from the emission scores and the transition scores so as to continuously update the transition matrix M; and finally the CRF layer solves for the optimal entity label sequence O = {O_1, O_2, O_3, …, O_n}, wherein O_i is the entity label type of the i-th sub-word token.
7. The method for constructing a smart grid network security knowledge graph according to claim 1, wherein during knowledge fusion the text co-reference problem is solved, which has two aspects: mixed use of abbreviations and letter case in names, and inconsistent expressions across different data sources; the mixed use of abbreviations and letter case appears in "company" type entities, and is resolved by constructing a dictionary of electric power enterprises and matching "company" type entities against the dictionary; the inconsistent expressions appear between the attack type enumeration of CAPEC and the attack techniques of ATT&CK, and are resolved by computing the similarity of the attack description texts with the SBERT algorithm and merging attack methods whose descriptions are similar.
8. The method for constructing a smart grid network security knowledge graph according to claim 1, wherein after the data layer is constructed, knowledge updating is performed, including updating of the schema layer (mode layer) and updating of the data layer, with incremental updating used to reduce resource consumption; the schema layer is updated manually, by adding the new entity types that appear in newly added data to the schema layer and defining the relations between the new entity types and the existing entity types; the data layer is updated by applying the original knowledge extraction method under the guidance of the schema layer and then adding the extracted entities and relations to the knowledge graph.
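As an illustration of the label insertion and whole-word masking that claim 1 describes for the data augmentation DA layer, the following is a minimal Python sketch. It assumes BIO-style tags, angle-bracket type markers and the mask token "<mask>"; none of these details are specified in the claims, and the function names are hypothetical. The fine-tuned MLM that would fill the masked positions is not shown.

```python
import random

def insert_labels(tokens, tags):
    """Wrap each entity span in <TYPE> ... </TYPE> markers.

    Assumes BIO tags (B-TYPE, I-TYPE, O); the marker format is illustrative.
    Returns the marked token list and the (start, end) index of each entity
    span inside that list.
    """
    out, spans = [], []
    i = 0
    while i < len(tokens):
        if tags[i].startswith("B-"):
            etype = tags[i][2:]
            j = i + 1
            while j < len(tokens) and tags[j] == f"I-{etype}":
                j += 1
            out.append(f"<{etype}>")
            start = len(out)
            out.extend(tokens[i:j])
            spans.append((start, len(out)))
            out.append(f"</{etype}>")
            i = j
        else:
            out.append(tokens[i])
            i += 1
    return out, spans

def mask_entities(marked, spans, prob, rng, mask_token="<mask>"):
    """Whole-word masking: each entity span is masked entirely or not at all."""
    masked = list(marked)
    for start, end in spans:
        if rng.random() < prob:
            for k in range(start, end):
                masked[k] = mask_token
    return masked

# Hypothetical example sentence and tag set.
tokens = ["Stuxnet", "attacked", "Siemens", "S7", "controllers"]
tags = ["B-MAL", "O", "B-PROD", "I-PROD", "O"]
marked, spans = insert_labels(tokens, tags)
masked = mask_entities(marked, spans, prob=0.5, rng=random.Random(0))
```

Because the type markers stay in place around the masked positions, a masked language model fine-tuned on such input can be asked for a replacement entity of the same type, which is the effect the generation stage relies on.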
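The decoding step of claim 6, in which the CRF layer combines the emission scores P with the transition matrix M to find the optimal label sequence O, can be sketched with the standard Viterbi algorithm. The claims do not name the decoding algorithm, so Viterbi is an assumption here, as are the function name and the toy scores.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label index sequence.

    emissions[t][j]   -- score of token t taking label j (the sequence P)
    transitions[i][j] -- score of moving from label i to label j (matrix M)
    """
    n_tags = len(transitions)
    score = list(emissions[0])   # best score of any path ending in each label
    backptr = []                 # backptr[t][j]: best previous label for j
    for emit in emissions[1:]:
        new_score, ptr = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags),
                         key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            ptr.append(best_i)
        score = new_score
        backptr.append(ptr)
    # Trace back from the best final label.
    path = [max(range(n_tags), key=lambda j: score[j])]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path

# Toy example with labels 0 = O, 1 = B, 2 = I; the O -> I move is penalized.
emissions = [[2.0, 1.5, 0.0], [0.0, 0.0, 2.0]]
transitions = [[0.0, 0.0, -10.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
best_path = viterbi_decode(emissions, transitions)
```

In the toy example, picking the best label per token independently would give O followed by I, which the transition penalty makes implausible; taking the transition scores into account yields B followed by I instead.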
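The two co-reference strategies of claim 7 can be sketched as follows. This is an illustrative sketch only: the dictionary entries and identifiers are hypothetical, the short vectors stand in for SBERT sentence embeddings of the attack descriptions, and the use of cosine similarity with a fixed threshold is an assumption, since the claim does not state how the similarity is computed or thresholded.

```python
import math

def normalize_company(name, dictionary):
    """Map abbreviations and case variants to a canonical name by lookup.

    Names that are not in the dictionary are returned unchanged.
    """
    return dictionary.get(name.strip().lower(), name)

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def merge_similar(capec, attack, threshold):
    """Pair each CAPEC entry with its most similar ATT&CK entry.

    capec / attack map an identifier to the embedding of its description.
    Only pairs whose similarity reaches the threshold are returned.
    """
    pairs = []
    for cid, cvec in capec.items():
        best = max(attack, key=lambda aid: cosine(cvec, attack[aid]),
                   default=None)
        if best is not None and cosine(cvec, attack[best]) >= threshold:
            pairs.append((cid, best))
    return pairs

# Hypothetical dictionary and toy embeddings.
company_dict = {"siemens": "Siemens AG", "siemens ag": "Siemens AG"}
canonical = normalize_company("SIEMENS", company_dict)
capec_vecs = {"capec-example": [1.0, 0.0]}
attack_vecs = {"attack-a": [0.9, 0.1], "attack-b": [0.0, 1.0]}
merged = merge_similar(capec_vecs, attack_vecs, threshold=0.8)
```

Dictionary matching suits the "company" entities because the set of electric power enterprises is small and closed, whereas the attack descriptions are free text, so an embedding-based similarity is the more natural fit there.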

Description

Construction method of intelligent power grid network security knowledge graph

Technical Field

The invention belongs to the technical field of smart grids within power technology, and in particular relates to a method for constructing a smart grid network security knowledge graph.

Background

Modern information and communication technology facilitates the intelligent development of the power grid, but it also introduces network security risks. To effectively respond to and prevent the damage and impact that network attacks may cause, the network security vulnerabilities in the power system must be fully discovered and the attack methods that attackers may adopt must be identified. However, network security knowledge in the power field is usually found in vulnerability databases, security knowledge bases and technical forums related to industrial control systems (Industrial Control System, ICS), and suffers from scattered sources, large structural differences and mixed Chinese and English text. There is therefore a need to extract and refine network security knowledge related to the power system from massive multi-source heterogeneous data by intelligent techniques, and to organize that knowledge into a structured, visual representation.

The knowledge graph is a knowledge representation method proposed by Google that represents entities in the objective world and their interrelations in the form of a graph. By integrating and extracting multi-source heterogeneous data, a knowledge graph captures richer semantic associations among entities and is often used to build knowledge bases. Knowledge graphs are widely applied in fields such as finance and medicine, and their application in the electric power field has also been widely explored.
At present, researchers mainly use knowledge graph technology as a knowledge management method for health management of power equipment, fault location in power systems, heterogeneous data management of power systems and the like (see: She Xinzhi, Shang Lei, Dong Xuzhu, et al. Knowledge graph research and application for power distribution network fault treatment [J]. Power Grid Technology, 2022, 46(10): 3739-3749). However, no research exists on a knowledge graph construction method for smart grid network security.

Existing knowledge graph construction schemes generally comprise schema layer (mode layer) construction and data layer construction, wherein the data layer is constructed through three stages: knowledge extraction, knowledge fusion and knowledge updating. The knowledge extraction stage in turn comprises three steps, namely entity extraction, relation extraction and attribute extraction, of which entity extraction and attribute extraction can be implemented by a named entity recognition (Named Entity Recognition, NER) algorithm. NER algorithms have gone through three stages of development: dictionary- and rule-based methods, machine-learning-based methods, and deep-learning-based methods. Since the advent of the BERT pre-trained model, the baseline model for the NER task has evolved from BiLSTM-CRF to the three-layer BERT-BiLSTM-CRF model. BERT is a language representation model based on the Transformer architecture that generates an embedded representation of each word of a text from contextual semantic information. BERT supersedes the earlier Word2vec model and serves as the embedding layer of the NER model, generating the word vector sequence of the input text. A bidirectional long short-term memory network (Bi-directional Long Short-Term Memory, BiLSTM) can capture both the forward and the backward information of a sequence, thereby learning contextual semantics.
After the word vector sequence is input, the BiLSTM layer outputs, for each word, a score for each label. The conditional random field (Conditional Random Fields, CRF) layer learns the dependencies between labels, constrains the label assigned to each word and corrects the output of the BiLSTM layer, thereby ensuring that the predicted labels are plausible. However, the BERT-BiLSTM-CRF model cannot solve the problem of small data sets caused by the high cost of data annotation for NER tasks; moreover, it is applicable only to specific languages and cannot process mixed multilingual text.

Object of the Invention

The invention aims to solve the problem of multilingual entity extraction with only a small amount of annotated data, and to provide strong support for identifying network security risks in electric power systems.

Disclosure of Invention

The invention provides a construction method of a smart grid network security knowledge graph, which comprises the following steps: Step 1, crawling data, including structured data, semi-structured data and unstructured data, constructing an ontolo