CN-116684127-B - Network security-oriented interpretable network data marking method, system and computing equipment

Abstract

The invention discloses a network security-oriented interpretable network data marking method, system and computing device. In the method, a simulator simulates each network attack, the corresponding network data packets are obtained through packet capture, and the data are then clustered to obtain a final data set. An anomaly detector uniformly models the network flow feature information of the final data set together with a part of the interpretation results provided by an interpreter, and determines one suspicious flow in each interaction with a network analyst. The interpreter interprets the currently detected suspicious flow based on maximum linear separation and queries the network analyst to judge whether the suspicious flow is abnormal. The advantages of the invention are that the computing resources of the interpreter are fully utilized and that the anomaly detector can interact with network analysts, with the interpreter ensuring the quality of the interaction, so that the anomaly detector model is ultimately able to adapt to a dynamic network environment.

Inventors

SHI LEI
LIN CHUNGANG
DUAN RONGCHANG
YU CUILING
ZHANG YUJUN
HOU WEI
AI ZHENGYANG
DUAN DONGSHENG

Assignees

国家计算机网络与信息安全管理中心

Dates

Publication Date: 20260512
Application Date: 20230523

Claims (4)

1. A network security-oriented interpretable network data tagging method, comprising:
a simulator simulates each network attack, obtains the corresponding network data packets through a packet capture operation, and on that basis performs a clustering operation on the data to obtain a final data set;
an anomaly detector uniformly models the network flow feature information of the final data set together with a part of the interpretation results provided by an interpreter; in each interaction with a network analyst, a suspicious flow is determined, the feedback of the network analyst is integrated back into the anomaly detector, the parameters of the anomaly detector are updated through an update strategy, and the interaction process is iterated until the number of interactions is used up;
the interpreter interprets the currently detected suspicious flow based on maximum linear separation and queries the network analyst to judge whether the suspicious flow is an abnormal flow;
wherein the clustering operation clusters N flows into a group of K different clusters, the flows in the same cluster sharing similar network flow features or conforming to the same traffic pattern;
wherein the anomaly detector uniformly modeling the network flow feature information of the final data set and the part of the interpretation results provided by the interpreter comprises: adopting a bidirectional data flow mode, in which the anomaly detector transmits the currently detected flow data to the interpreter and, after the interpreter interprets the flow data, a part of the interpretation results is transmitted back to the anomaly detector;
wherein determining a suspicious flow comprises selecting the suspicious flow with the highest degree of anomaly based on a tight upper confidence bound on the expected reward of selecting a particular arm;
wherein updating the parameters of the anomaly detector through the update strategy comprises updating the model parameters according to ridge regression;
wherein determining a suspicious flow further comprises obtaining, from the parameters of each linear hyperplane, the degree to which any attribute contributes to the abnormal node in that linear hyperplane;
wherein the interpreter interpreting the currently detected suspicious flow based on maximum linear separation comprises: for the flow to be interpreted, giving its anomaly score; giving the two attributes of the flow with the highest attribute anomaly scores, together with those scores; and showing the distribution of the flow and its context with the two given attributes as the X and Y axes;
wherein the anomaly detector is based on a multi-armed bandit method and takes each flow cluster as one arm, so that the arm a(i) corresponding to the i-th flow is the cluster in which the i-th flow is located;
wherein the expected reward of selecting the i-th flow is: [formula not reproduced in the source text], where x_i is the context feature vector of the i-th flow, y_i is the interpretation result of the interpreter for the i-th flow, θ_a(i) is the coefficient vector of the a(i)-th arm to which the i-th flow belongs, and ρ is an adjustable parameter controlling the influence of the interpretation result provided by the interpreter on the expected reward function;
wherein, at each interaction t, the suspicious flow i_t with the highest degree of anomaly is selected based on the tight upper confidence bound on the expected reward of selecting a particular arm, according to: [formula not reproduced in the source text], in which the bound is the tight upper confidence bound on the expected reward of selecting the i-th flow;
wherein the anomaly detector employs the following algorithm: the inputs include α, β and λ ∈ [0,1], together with further inputs declared over the set of positive real numbers, the set of positive integers and the N × d real matrices [input symbols not legible in the source text]; initialization: P is initialized from the identity matrix I and q from 0 [exact expressions not legible in the source text]; in the formulas, two terms [not legible in the source text] respectively represent the uncertainty of the estimate of θ_a(i) and of a second estimated parameter, P and q are model weights, t is the number of interactions, I is the identity matrix, and P^{-1} is the inverse matrix of P; for each cluster a ∈ {a_1, ..., a_K}, A_a and b_a are model parameters representing weights.
2. A network security-oriented interpretable network data tagging system employing the method of claim 1, comprising: a simulator, configured to simulate each network attack, obtain the corresponding network data packets through a packet capture operation, and on that basis cluster the data to obtain a final data set; an anomaly detector, configured to uniformly model the network flow feature information of the final data set together with the part of the interpretation results provided by the interpreter, wherein a suspicious flow is determined in each interaction with a network analyst, the feedback of the network analyst is integrated back into the anomaly detector, the parameters of the anomaly detector are updated through an update strategy, and the interaction process is iterated until the number of interactions is used up; and an interpreter, configured to interpret the currently detected suspicious flow based on maximum linear separation and to query the network analyst to judge whether the suspicious flow is an abnormal flow.
3. A computing device comprising a processor and a memory storing computer program instructions that are read and executed by the processor to implement the network security-oriented interpretable network data tagging method of claim 1.
4. A computer storage medium having stored therein at least one executable instruction for causing a processor to perform operations corresponding to the network security-oriented interpretable network data tagging method of claim 1.
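
The selection and update steps recited in claim 1 (one bandit arm per flow cluster, selection by an upper confidence bound, ridge-regression update from analyst feedback) can be illustrated with a minimal LinUCB-style sketch. This is not the patented algorithm: the claim's own formulas are not reproduced in this text, so the score used here, theta_a(i) · x_i + rho * y_i + alpha * sqrt(x_i^T A^{-1} x_i), is an assumed form, y_i is assumed to be a scalar interpretation score, and the names ClusterArm, select_flow, run_interaction and the oracle callback are hypothetical. The regularizer lam must be strictly positive in this sketch.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mat_vec(m, v):
    return [dot(row, v) for row in m]


class ClusterArm:
    """Ridge-regression state for one flow cluster (one bandit arm)."""

    def __init__(self, dim, lam=1.0):
        # A = lam * I, so we store A^{-1} = I / lam directly; b = 0.
        self.a_inv = [[1.0 / lam if i == j else 0.0 for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim

    def theta(self):
        # Ridge-regression estimate theta = A^{-1} b.
        return mat_vec(self.a_inv, self.b)

    def ucb(self, x, y, alpha, rho):
        # ASSUMED score: linear reward estimate, plus rho times the
        # interpreter's score y, plus an exploration width.
        mean = dot(self.theta(), x) + rho * y
        width = alpha * math.sqrt(max(dot(x, mat_vec(self.a_inv, x)), 0.0))
        return mean + width

    def update(self, x, reward):
        # Sherman-Morrison rank-one update of A^{-1} for A <- A + x x^T.
        ax = mat_vec(self.a_inv, x)
        denom = 1.0 + dot(x, ax)
        n = len(x)
        for i in range(n):
            for j in range(n):
                self.a_inv[i][j] -= ax[i] * ax[j] / denom
        for i in range(n):
            self.b[i] += reward * x[i]


def select_flow(flows, arms, alpha, rho, queried):
    """Return the index of the unqueried flow with the highest bound."""
    best, best_score = None, -math.inf
    for i, (x, y, cluster) in enumerate(flows):
        if i in queried:
            continue
        score = arms[cluster].ucb(x, y, alpha, rho)
        if score > best_score:
            best, best_score = i, score
    return best


def run_interaction(flows, n_clusters, oracle, budget,
                    alpha=0.5, rho=0.1, lam=1.0):
    """flows: list of (feature vector, interpreter score, cluster id).
    oracle(i) stands in for the analyst: True if flow i is abnormal."""
    dim = len(flows[0][0])
    arms = [ClusterArm(dim, lam) for _ in range(n_clusters)]
    queried, labels = set(), {}
    for _ in range(min(budget, len(flows))):
        i = select_flow(flows, arms, alpha, rho, queried)
        x, _, cluster = flows[i]
        reward = 1.0 if oracle(i) else 0.0
        arms[cluster].update(x, reward)  # fold feedback into the arm
        queried.add(i)
        labels[i] = reward
    return labels, arms
```

Calling run_interaction with a list of (features, interpreter score, cluster) tuples and a labeling callback returns the labels collected within the interaction budget and the updated per-cluster models. Maintaining the inverse matrix incrementally avoids a full matrix inversion at every interaction, which is a design choice of this sketch rather than something stated in the claims.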

Description

Network security-oriented interpretable network data marking method, system and computing equipment

Technical Field

The invention relates to the technical field of computer networks, and in particular to a network security-oriented interpretable network data marking method, system and computing equipment.

Background

With the rapid development of artificial intelligence and machine learning, the demand for high-quality labeled data is increasing. High-quality labeled data helps a model capture patterns and relationships in the data more accurately, thereby improving its predictive performance on unlabeled data. However, data tagging typically requires significant effort, time and resources, especially where complex tasks and large-scale data sets are involved. To improve the efficiency and quality of data tagging, researchers and engineers are exploring techniques such as automated tagging, semi-supervised learning, transfer learning and the use of pre-trained models. In addition, there are specialized data tagging platforms and tools, such as Amazon Mechanical Turk, Figure Eight and Prodigy, that can help speed up the tagging process and improve tagging quality. In the field of communication networks, data tagging is commonly used to build supervised machine learning models for network security, network management and network performance analysis. These models require a large amount of tagged data for training in order to accurately detect anomalies, identify attacks or predict network performance in real scenarios. In the field of network security in particular, researchers and engineers use machine learning models to monitor network data in real time in order to prevent malicious attacks and abnormal behaviors.
In this scenario, data tagging involves using an anomaly detector to assign a label, such as "normal" or "anomaly", to each data packet or flow in order to train models for intrusion detection and anomaly detection. Given the sparsity of abnormal traffic and the abundance of normal traffic, it is difficult to learn abnormal patterns from the limited abnormal traffic available, and because of the imbalance between abnormal and normal traffic, it is difficult to obtain an unbiased classifier that can pick out a small amount of abnormal traffic from a large amount of normal traffic. To address these problems, most existing research works in an unsupervised or supervised manner, mainly using statistical probability-based methods, proximity-based methods and shallow machine learning-based methods. Statistical probability-based methods model the traffic distribution and detect abnormal traffic according to how far network traffic deviates from the normal traffic in the model; however, these traditional methods lack effective statistical features of network traffic and cannot adapt to the current dynamically changing network environment. With the development of deep learning, a class of deep learning-based methods has been proposed. These methods mainly extract deep features of the network traffic with a deep learning model and then perform anomaly detection using the extracted features. However, such methods lack a reasonable interpretation of abnormal traffic and likewise cannot adapt to the current dynamically changing network environment. In fact, network analysts can often contribute valuable information to model training by indicating whether the detected abnormal traffic is valid, enabling the model to adapt to the dynamic network environment.
Human-in-the-loop methods exploit this by letting network analysts interact with the model during training, so that the model can adapt to a dynamically changing network environment. However, because network analysts interact with the model far less frequently than network traffic is generated, such methods are constrained by limited human resources. Moreover, the quality of the interaction between the network analyst and the model depends to some extent on the model's ability to interpret the abnormal traffic it predicts; human-in-the-loop methods cannot provide a reasonable interpretation of the detected abnormal traffic, which lowers the interaction quality and in turn degrades the model's performance. To combine anomaly detection with anomaly interpretation, the prior art uses a unidirectional data flow mode, in which the detection results provided by the anomaly detection model are interpreted to some extent and the interpretation results are then passed to network analysts for further judgment. However, in such a method, the data flow relationship between the anomaly detector and the interpreter is unidirectional, and the anom