CN-119561737-B - Network data asset security identification method and system based on entity alignment
Abstract
The invention provides a network data asset security identification method and system based on entity alignment, relating to the technical field of network security. The method comprises: acquiring network data assets together with their historical asset states and corresponding state marks; establishing an association relation among the network data assets; generating a first embedded representation of a concept entity and a second embedded representation of an instance entity; integrating the first embedded representation and the second embedded representation to obtain a target entity; extracting entity characteristics and entity labels of the target entity and constructing a random forest model; acquiring real-time entity characteristics, namely the real-time asset state, of a target entity to be identified in the network data assets; inputting the real-time asset state into the random forest model and outputting an identification result of the real-time asset state; and outputting the target entity to be identified as abnormal if the identification result is abnormal, and as normal otherwise. The method enables rapid joint defense across data sources and improves data security.
Inventors
- ZHANG YUQI
- CHENG LI
- LU XINGXING
- QI WENYU
- Ming Youwei
Assignees
- 金祺创(北京)技术有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20241120
Claims (9)
- 1. A method for securely identifying network data assets based on entity alignment, comprising: S1, acquiring a network data asset, and a historical asset state and a corresponding state mark of the network data asset, wherein the state mark comprises an abnormal state mark and a normal state mark; S2, establishing an association relation of the network data assets, and extracting the network data assets serving as concept entities and the network data assets serving as instance entities; S3, generating a first embedded representation of the concept entity by using a BERT module, and generating a second embedded representation of the instance entity by using a GAN module; S4, integrating the first embedded representation and the second embedded representation by utilizing a GCN module to obtain a target entity; S5, extracting entity characteristics and entity labels of the target entities, and establishing a target entity training set, wherein the entity characteristics are historical asset states corresponding to the target entities, and the entity labels are state marks corresponding to the historical asset states; S6, combining information entropy, and constructing a random forest model based on the target entity training set; S7, acquiring real-time entity characteristics about a target entity to be identified in the network data asset, wherein the real-time entity characteristics represent real-time asset states corresponding to the target entity to be identified; S8, inputting the real-time asset state into the random forest model, and outputting a recognition result of the real-time asset state, wherein the recognition result comprises a normal state and an abnormal state; S9, outputting the target entity to be identified in the network data asset as abnormal under the condition that the recognition result is an abnormal state, and otherwise outputting the target entity to be identified in the network data asset as normal; wherein the association relation is a knowledge graph, and step S2 specifically comprises the following steps: S201, establishing a knowledge graph of the network data asset through Neo4j software; S202, clustering knowledge graph nodes based on a graph Laplace matrix to obtain a plurality of node clusters; S203, calculating the comprehensive centrality of each node cluster by combining degree centrality, closeness centrality and betweenness centrality: $C(v) = \alpha C_D(v) + \beta C_C(v) + \gamma C_B(v)$, $C_D(v) = \frac{\deg(v)}{N-1}$, $C_C(v) = \frac{N-1}{\sum_{u \in V, u \neq v} d(v,u)}$, $C_B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}$; wherein $C(S)$ represents the comprehensive centrality of the node cluster $S$ and is obtained from the comprehensive centrality $C(v)$ of the knowledge graph nodes $v$ in the node cluster $S$; $\alpha$, $\beta$ and $\gamma$ respectively represent the weights of the degree centrality $C_D(v)$, the closeness centrality $C_C(v)$ and the betweenness centrality $C_B(v)$ of the knowledge graph node $v$; $\deg(v)$ represents the degree of the knowledge graph node $v$, wherein the degree represents the number of connections with other knowledge graph nodes; $d(v,u)$ represents the shortest path distance between the knowledge graph node $v$ and the knowledge graph node $u$; $\sigma_{st}$ represents the total number of shortest paths from the knowledge graph node $s$ to the knowledge graph node $t$; $\sigma_{st}(v)$ represents the number of those shortest paths passing through the node $v$; $N$ represents the total number of knowledge graph nodes; and $V$ represents the knowledge graph node set; S204, outputting the knowledge graph nodes in the node clusters whose comprehensive centrality is larger than a preset comprehensive centrality as the concept entities, and outputting the knowledge graph nodes in the node clusters whose comprehensive centrality is smaller than or equal to the preset comprehensive centrality as the instance entities.
- 2. The entity alignment-based network data asset security identification method of claim 1, wherein the network data asset comprises a Web server, a database server, a file server, a router, a switch, a firewall, an application, an employee account, an administrator account, and a customer account; the historical asset state comprises the number of requests, the number of accesses, the number of network connections, the number of access devices and the number of active users within a preset time period.
- 3. The entity alignment-based network data asset security identification method of claim 1, wherein S202 specifically comprises: S2021, calculating an adjacency matrix and a degree matrix of the knowledge graph nodes, wherein the elements of the adjacency matrix represent the connection relation between every two knowledge graph nodes, the corresponding element value being one if a connection exists and zero otherwise, and the elements of the degree matrix represent the number of connections between each knowledge graph node and the other knowledge graph nodes; S2022, determining a graph Laplace matrix of the knowledge graph nodes based on the adjacency matrix and the degree matrix: $L = D - A$; wherein $L$ represents the graph Laplace matrix derived from the adjacency matrix $A$ and the degree matrix $D$; S2023, calculating a normalized graph Laplace matrix of the graph Laplace matrix: $L_{norm} = D^{-1/2} L D^{-1/2}$; wherein $L_{norm}$ represents the normalized graph Laplace matrix; S2024, carrying out eigenvalue decomposition on the normalized graph Laplace matrix, retaining a preset number of eigenvalues smaller than a preset eigenvalue, and calculating the eigenvector of each retained eigenvalue; S2025, taking the eigenvectors as the embedded representations of the corresponding knowledge graph nodes; and S2026, clustering the embedded representations by a k-means algorithm to obtain a plurality of node clusters.
- 4. The entity alignment-based network data asset security identification method of claim 1, wherein the generating the first embedded representation of the concept entity using the BERT module comprises: S301, acquiring external data about the concept entity, wherein the external data comprises a general knowledge graph and webpage documents; S302, extracting descriptive text about the concept entity from the external data; S303, generating a topic distribution of the descriptive text by combining an LDA topic model: $P(w \mid d; K^*) = \sum_{k=1}^{K^*} P(w \mid z = k)\, \theta_{d,k}$, $K^* = \arg\min_K \mathrm{Perplexity}(D; K)$, $\mathrm{Perplexity}(D; K) = \exp\left(-\frac{\sum_{d \in D} \sum_{w \in d} \log P(w \mid d; K)}{\sum_{d \in D} N_d}\right)$; wherein $P(w \mid d; K^*)$ represents the probability of word $w$ in descriptive text $d$ when the number of topics is $K^*$; $P(w \mid d; K)$ represents the probability of word $w$ in descriptive text $d$ when the number of topics is $K$; $K^*$ represents the total number of topics that the LDA topic model generates for the descriptive text set $D$ based on perplexity; $\mathrm{Perplexity}(D; K)$ represents the perplexity of $D$ when the number of topics is $K$; $\arg\min_K$ indicates that the number of topics $K$ that minimizes the perplexity is taken as the optimal number of topics $K^*$; $P(w \mid z = k)$ represents the probability distribution of word $w$ when the current topic $z$ is the $k$-th topic; $\theta_d$ represents the topic distribution of $d$, with $k$-th component $\theta_{d,k}$; $\log$ represents the logarithmic function; $\exp$ represents the natural exponential function; and $N_d$ represents the number of words in $d$, that is, the length of the descriptive text $d$; S304, generating a low-dimensional semantic embedding of the topic distribution of the concept entity through a principal component analysis algorithm: $z_i = \mathrm{PCA}(\theta_{c_i})$; wherein $z_i$ represents the low-dimensional semantic embedding of the intermediate variable $\theta_{c_i}$ obtained using the principal component analysis algorithm PCA; S305, generating a BERT semantic embedding of the low-dimensional semantic embedding by using the BERT module: $e_i = \mathrm{BERT}(z_i)$; wherein $e_i$ represents the BERT semantic embedding of the $i$-th concept entity $c_i$, and $\mathrm{BERT}(\cdot)$ represents the BERT module; S306, fusing the low-dimensional semantic embedding $z_i$ and the BERT semantic embedding $e_i$ according to a fusion weight coefficient $\lambda$ to obtain a first embedded representation $h_i$ about the concept entities; wherein $h_i$ represents the first embedded representation of $c_i$, and $\lambda$ represents the fusion weight coefficient.
- 5. The method for securely identifying network data assets based on entity alignment according to either of claims 1 and 4, wherein said generating a second embedded representation of said instance entity using a GAN module comprises: S307, training the GAN module under the supervision of an adversarial loss function, with the aim of generating entity embeddings that conform to real instances, wherein the adversarial loss function is: $\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{s \sim p_s(s)}[\log(1 - D(G(s)))]$; wherein $\min_G \max_D V(D, G)$ represents the adversarial objective of the generator $G$ and the discriminator $D$, namely the $G$ that minimizes the function and the $D$ that maximizes the function, respectively; $p_{data}(x)$ represents the real instance entity distribution with respect to a real instance entity $x$; $p_s(s)$ represents the noise distribution with respect to the noise $s$; $D(x)$ represents the probability, output by the discriminator $D$, that $x$ is a real instance entity; $G(s)$ represents the instance entity generated by the generator from the noise $s$; $D(G(s))$ represents the probability, output by the discriminator $D$, that $G(s)$ is a real instance entity; $\mathbb{E}_{x \sim p_{data}(x)}$ represents the average over all real instance entities $x$ sampled from the real instance entity distribution $p_{data}(x)$; and $\mathbb{E}_{s \sim p_s(s)}$ represents the average over all noise $s$ sampled from the noise distribution $p_s(s)$; S308, outputting a plurality of generated instance entity embeddings by using the trained GAN module, wherein the generated instance entity embeddings constitute the second embedded representation, which is expressed as: $\hat{x}_i = G(s)$; wherein $\hat{x}_i$ represents the $i$-th generated instance entity embedding generated based on the noise $s$.
- 6. The entity alignment-based network data asset security identification method of claim 4, wherein S4 specifically comprises: S401, assigning the first embedded representation to the concept entity nodes in the knowledge graph, and assigning the second embedded representation to the instance entity nodes in the knowledge graph; S402, calculating an adjacency matrix and a weight matrix of the assigned knowledge graph; S403, integrating the knowledge graph in the GCN module based on the adjacency matrix and the weight matrix: $H^{(l+1)} = \sigma\left(A H^{(l)} W^{(l)}\right)$; wherein $H^{(l)}$ and $H^{(l+1)}$ represent the node embedding matrices of the $l$-th layer and the $(l+1)$-th layer of the GCN module, which comprise the first embedded representation and the second embedded representation; $A$ represents the adjacency matrix of the assigned knowledge graph; $W^{(l)}$ represents a learnable weight matrix for converting the node embedding of the $l$-th layer into the node embedding of the $(l+1)$-th layer; $\sigma$ represents the ReLU activation function; $H^{(0)}$ represents the initial node embedding matrix, formed from $\hat{x}_i$, the $i$-th generated instance entity embedding generated based on noise $s$, and $h_i$, the first embedded representation of the concept entity $c_i$; and S404, outputting each knowledge graph node in the integrated knowledge graph as the target entity.
- 7. The entity alignment-based network data asset security identification method according to claim 1, wherein S6 specifically comprises: S601, initializing a random forest with a plurality of decision trees $T_1, \dots, T_M$, wherein the decision tree parameters of the decision trees are different; wherein $T_j$ represents the $j$-th decision tree of the random forest; $d_j$ and $\tau_j$ respectively represent the depth and the segmentation threshold of $T_j$, the depth and the segmentation threshold being the decision tree parameters; $M$ represents the number of decision trees; and $X_j$ and $Y_j$ respectively represent the entity characteristics and the entity labels in the training subset used to construct $T_j$; S602, dividing the training set according to the number of decision trees to obtain a plurality of training subsets; S603, numbering the samples in each training subset; S604, calculating the information gain obtained when the sample with each number is used as the split node: $IG(q) = H(Y) - \sum_{i} H\left(Y_i^{(q)}\right)$; wherein $H(Y)$ represents the information entropy of the training subset $Y$, and $\sum_{i} H\left(Y_i^{(q)}\right)$ represents the sum of the information entropy of the respective subsets $Y_i^{(q)}$ obtained with the sample corresponding to the number $q$ as the split node; S605, selecting the sample corresponding to a number whose information gain is larger than a preset information gain value as the splitting criterion of the decision tree to which it belongs, training that decision tree, and obtaining the random forest model.
- 8. The entity alignment-based network data asset security identification method of claim 1, further comprising, after said S9: and under the condition that the target entity to be identified is abnormal, sending early warning information for starting a defense mode to a source entity aligned to the target entity to be identified.
- 9. A network data asset security identification system based on entity alignment, comprising: A processor; A memory having stored thereon computer readable instructions which, when executed by the processor, implement the entity alignment based network data asset security identification method of any of claims 1 to 8.
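Steps S203 and S204 of claim 1 can be illustrated with a minimal Python sketch. It is only a sketch under stated assumptions, not the patented implementation: the graph is assumed to be unweighted, undirected and connected and is passed as an adjacency dictionary; the function names `bfs`, `comprehensive_centrality` and `split_clusters` are hypothetical; betweenness is counted over unordered node pairs without normalization; the cluster score is taken as the mean of its node scores, which is an assumption because the claim does not state the aggregation; and the asset names, equal weights and threshold are illustrative only.

```python
from collections import deque
from itertools import combinations


def bfs(adj, source):
    """Return (dist, sigma): hop distances and shortest-path counts from source."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:              # first time u is reached
                dist[u] = dist[v] + 1
                sigma[u] = 0
                queue.append(u)
            if dist[u] == dist[v] + 1:     # v lies on a shortest path to u
                sigma[u] += sigma[v]
    return dist, sigma


def comprehensive_centrality(adj, alpha=1 / 3, beta=1 / 3, gamma=1 / 3):
    """Weighted sum of degree, closeness and betweenness centrality per node."""
    nodes = list(adj)
    n = len(nodes)
    info = {v: bfs(adj, v) for v in nodes}
    scores = {}
    for v in nodes:
        dist_v, sigma_v = info[v]
        degree = len(adj[v]) / (n - 1)
        closeness = (n - 1) / sum(dist_v[u] for u in nodes if u != v)
        between = 0.0
        for s, t in combinations(nodes, 2):   # unordered pairs (assumption)
            if v in (s, t):
                continue
            dist_s, sigma_s = info[s]
            if dist_s[v] + dist_v[t] == dist_s[t]:
                between += sigma_s[v] * sigma_v[t] / sigma_s[t]
        scores[v] = alpha * degree + beta * closeness + gamma * between
    return scores


def split_clusters(adj, clusters, threshold):
    """Label clusters above the threshold as concept entities, the rest as instances."""
    scores = comprehensive_centrality(adj)
    concept, instance = [], []
    for cluster in clusters:
        # Mean over the cluster is an assumed aggregation, not taken from the claim.
        cluster_score = sum(scores[v] for v in cluster) / len(cluster)
        (concept if cluster_score > threshold else instance).extend(cluster)
    return concept, instance


# Illustrative star-shaped graph: one hub asset connected to three other assets.
adj = {"web": ["db", "fs", "fw"], "db": ["web"], "fs": ["web"], "fw": ["web"]}
concept, instance = split_clusters(adj, [["web"], ["db", "fs", "fw"]], threshold=1.0)
print(concept, instance)
```

In this toy graph the hub cluster exceeds the threshold and is labeled as concept entities, while the leaf cluster is labeled as instance entities. Because the betweenness term is not normalized here, it can dominate the other two terms on larger graphs, so the weights would need to be chosen accordingly.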
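The entropy-based split selection of step S604 in claim 7 can be sketched in a similar way. The sketch below makes several assumptions that are not in the claims: each sample has a single numeric feature, the feature value of the sample numbered q is used as the split threshold, and the gain uses the common size-weighted form of the subset entropies rather than a plain sum. The function names `entropy`, `information_gain` and `best_split` and the sample values are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(features, labels, threshold):
    """Gain from splitting the samples at `threshold` on a single numeric feature."""
    left = [y for x, y in zip(features, labels) if x <= threshold]
    right = [y for x, y in zip(features, labels) if x > threshold]
    if not left or not right:          # degenerate split carries no information
        return 0.0
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


def best_split(features, labels):
    """Return (q, gain): the sample number whose feature value gives the largest gain."""
    gains = [information_gain(features, labels, x) for x in features]
    q = max(range(len(features)), key=gains.__getitem__)
    return q, gains[q]


# Illustrative historical asset states: number of requests in a time window.
requests = [10, 12, 11, 95, 102]
marks = ["normal", "normal", "normal", "abnormal", "abnormal"]
q, gain = best_split(requests, marks)
print(q, round(gain, 4))
```

Here the split at the sample numbered 1 (12 requests) separates the normal and abnormal marks completely, so its gain equals the entropy of the whole subset. In the claimed method, a sample is selected when its gain exceeds a preset information gain value; taking the maximum is a simplification.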
Description
Network data asset security identification method and system based on entity alignment
Technical Field
The invention relates to the technical field of network security, in particular to a network data asset security identification method and system based on entity alignment.
Background
Entity alignment refers to identifying and matching different entities that represent the same object in different data sources and integrating them into a unified representation. It is widely applied in scenarios such as knowledge graphs and data integration, so that data from different sources can be analyzed and used in a unified way.
Network data assets refer to data resources of value and significance in a network environment, including device information (e.g., servers and routers), log information and user account information, which provide critical support for the daily operations and decisions of an enterprise.
In today's complex network environments, network data assets face a number of potential security risks, such as data leakage, unauthorized access and malicious attacks. Network data asset security identification can detect abnormal data activity and thereby protect the security of the network data assets.
However, network data contains a large amount of homologous and non-homologous heterogeneous data. In cross-organization collaboration, especially when threat information is shared, different organizations may record threat sources and events using different naming conventions or formats. Existing asset security identification schemes do not consider this heterogeneity among data sources, which hinders the sharing of data security problems and threat retrieval across organizations and reduces the joint defense capability of network data assets. The same threat source or security event may be treated as several different events, causing false alarms or missed alarms and greatly increasing the security risk of the network data assets.
Disclosure of Invention
To solve the technical problem that heterogeneity among data sources hinders cross-organization sharing of data security problems and threat retrieval, reduces the joint defense capability of network data assets, and causes the same threat source or security event to be treated as several different events, resulting in false alarms or missed alarms and a greatly increased security risk, the invention provides a network data asset security identification method and system based on entity alignment.
The technical scheme provided by the embodiment of the invention is as follows:
First aspect:
The embodiment of the invention provides a network data asset security identification method based on entity alignment, which comprises the following steps:
S1, acquiring a network data asset, and a historical asset state and a corresponding state mark of the network data asset, wherein the state mark comprises an abnormal state mark and a normal state mark;
S2, establishing an association relation of the network data assets, and extracting the network data assets serving as concept entities and the network data assets serving as instance entities;
S3, generating a first embedded representation of the concept entity by using the BERT module, and generating a second embedded representation of the instance entity by using the GAN module;
S4, integrating the first embedded representation and the second embedded representation by utilizing a GCN module to obtain a target entity;
S5, extracting entity characteristics and entity labels of the target entities, and establishing a target entity training set, wherein the entity characteristics are historical asset states corresponding to the target entities, and the entity labels are state marks corresponding to the historical asset states;
S6, combining the information entropy, and constructing a random forest model based on the target entity training set;
S7, acquiring real-time entity characteristics about a target entity to be identified in the network data asset, namely the real-time asset state corresponding to the target entity to be identified;
S8, inputting the real-time asset state into the random forest model, and outputting a recognition result of the real-time asset state, wherein the recognition result comprises a normal state and an abnormal state;
S9, outputting the target entity to be identified in the network data asset as abnormal under the condition that the recognition result is an abnormal state, and otherwise outputting the target entity to be identified in the network data asset as normal.
Second aspect: the embodim