CN-122001675-A - Open-set encrypted traffic identification method and system based on hierarchical contrastive learning
Abstract
The invention belongs to the field of network security and provides an open-set encrypted traffic identification method and system based on hierarchical contrastive learning. It mainly addresses two problems of existing deep-learning-based encrypted traffic detection methods: poor ability to identify attacks of unknown classes, and insufficient generalization caused by the traditional closed-world assumption. In the main scheme, network traffic data are collected and preprocessed, traffic features are extracted, and a training data set is constructed. Specifically, the method designs an autoencoder feature embedding network based on hierarchical contrastive learning; constructs a hierarchical contrastive loss function so that benign samples cluster tightly and malicious samples are organized hierarchically; designs a joint loss function that fuses the classification loss with the hierarchical contrastive loss for model optimization; and adopts a MAD-based adaptive threshold detection mechanism that accurately identifies unknown-class traffic by analyzing how far a sample deviates from the distribution centers of the known classes in the embedding space.
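The detection idea in the last sentence of the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it uses the mean-plus-standard-deviation threshold that claims 1 and 5 describe (the MAD-based mechanism named in the abstract is not detailed in this record), and the function names `fit_class_thresholds` and `predict_open_set`, the `UNKNOWN` sentinel, and the toy 2-D embeddings are all hypothetical.

```python
import numpy as np

UNKNOWN = -1  # hypothetical sentinel for "unknown class"


def fit_class_thresholds(embeddings, labels):
    """Return {class: (centroid, threshold)} computed from training embeddings."""
    stats = {}
    for c in np.unique(labels):
        z = embeddings[labels == c]
        centroid = z.mean(axis=0)
        # Euclidean distance of every training sample of class c to its centroid.
        dists = np.linalg.norm(z - centroid, axis=1)
        # Threshold = mean + standard deviation of the in-class distances.
        # Assumption: population standard deviation (NumPy default, ddof=0).
        stats[int(c)] = (centroid, dists.mean() + dists.std())
    return stats


def predict_open_set(z, stats):
    """Pick the nearest centroid, then accept or reject using that class's threshold."""
    classes = list(stats)
    dists = [np.linalg.norm(z - stats[c][0]) for c in classes]
    k = int(np.argmin(dists))
    candidate = classes[k]
    return candidate if dists[k] <= stats[candidate][1] else UNKNOWN


# Toy embeddings: class 0 clustered near (0, 0), class 1 near (5, 5).
rng = np.random.default_rng(0)
train_z = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
train_y = np.array([0] * 50 + [1] * 50)
stats = fit_class_thresholds(train_z, train_y)

print(predict_open_set(np.array([0.0, 0.0]), stats))     # close to class 0
print(predict_open_set(np.array([5.0, 5.0]), stats))     # close to class 1
print(predict_open_set(np.array([20.0, -20.0]), stats))  # far from every centroid
```

Because each class carries its own threshold, a tight cluster rejects outliers at a smaller distance than a loose one, which is the sense in which the threshold adapts to the class.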
Inventors
- NIU WEINA
- BI YANG
- LI HONGKAI
- DING XUYANG
- ZHANG XIAOSONG
- LI XUEXING
- Jiang Muqi
Assignees
- University of Electronic Science and Technology of China (电子科技大学)
- Sichuan Police College (四川警察学院)
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-03-20
Claims (10)
- 1. An open-set encrypted traffic identification method based on hierarchical contrastive learning, characterized by comprising the following steps: Step 1, collecting and preprocessing network traffic data, extracting statistical features of the traffic, and constructing a training data set containing benign traffic samples and known malicious traffic categories; Step 2, constructing an autoencoder feature embedding network based on hierarchical contrastive learning, in which an encoder maps the preprocessed traffic features to a low-dimensional embedding space and a MAD detection algorithm then performs classification prediction on the embedding vectors; Step 3, designing a hierarchical contrastive loss function that gives the embedding space a hierarchical structure by constructing four sample-pair relation mask matrices, covering benign sample pairs, same-malicious-family sample pairs, different-malicious-family sample pairs, and benign-malicious sample pairs, and applying a different distance constraint to each type of sample pair; Step 4, constructing a joint loss function that fuses the hierarchical contrastive loss with a binary cross-entropy classification loss, and training and optimizing the autoencoder feature embedding network constructed in Step 2 based on the joint loss function; Step 5, using the trained feature embedding network, calculating the centroid vector of each known class in the low-dimensional embedding space, and setting an adaptive decision threshold for each known class as the mean plus the standard deviation of the distribution of distances from that class's samples to its centroid vector; Step 6, in the test stage, obtaining the embedded representation of an input traffic sample to be detected through the encoder, calculating the distance from the embedded representation to the centroid of each known class, and determining a candidate class by the minimum-distance criterion; Step 7, comparing the distance corresponding to the candidate class with the adaptive threshold of that class, judging the sample to belong to the known class if the distance is less than or equal to the threshold and to an unknown class if the distance is greater than the threshold, thereby accurately identifying unknown attack traffic in an open-world environment.
- 2. The open-set encrypted traffic identification method based on hierarchical contrastive learning according to claim 1, wherein the specific steps of Step 2 are as follows: the encoder module adopts a multi-layer fully connected neural network structure and maps an input traffic feature vector $x$ to the low-dimensional embedding space to obtain the embedded representation $z = \mathrm{enc}(x)$; the MAD detection algorithm receives the embedding vector $z$ as input and performs classification prediction on it.
- 3. The open-set encrypted traffic identification method based on hierarchical contrastive learning according to claim 1, wherein the specific steps of Step 3 are as follows: Step 3.1, let a batch contain $n$ samples, each with a binary label $y_i \in \{0,1\}$, where 0 denotes benign and 1 denotes malicious; for all sample pairs $(i,j)$, $i \neq j$, in the batch, define a binary similarity mask matrix $M_{bin} \in \{0,1\}^{n \times n}$ whose $(i,j)$-th element is $M_{bin}[i,j] = \mathbb{1}[y_i = y_j]$, where $\mathbb{1}[\cdot]$ is the indicator function, which takes the value 1 when the condition in brackets is satisfied and 0 otherwise, and the diagonal elements are set to 0 to exclude pairing a sample with itself; $M_{bin}[i,j] = 1$ means that the two samples are both benign or both malicious, $M_{bin}[i,j] = 0$ means a benign-malicious pair, $y_i$ is the label of the $i$-th sample, and $y_j$ is the label of the $j$-th sample; the binary similarity mask matrix is thus constructed from the binary labels of the samples and identifies whether two samples are both benign or both malicious; Step 3.2, each sample has a multi-class label $c_i \in \{0,1,\ldots,K\}$, where 0 denotes benign and 1 to $K$ correspond to the $K$ known malicious families; define a multi-class similarity mask matrix $M_{multi} \in \{0,1\}^{n \times n}$ whose elements are $M_{multi}[i,j] = \mathbb{1}[c_i = c_j]$ for $i \neq j$; $M_{multi}[i,j] = 1$ means that sample $i$ and sample $j$ have exactly the same fine-grained label, i.e. they belong to the same malicious family or are both benign; $M_{multi}[i,j] = 0$ means that their classes differ; $c_i$ is the multi-class label of the $i$-th sample and $c_j$ is the multi-class label of the $j$-th sample; the multi-class similarity mask matrix is thus constructed from the multi-class labels of the samples and identifies whether two samples belong to the same malicious family or are both benign; Step 3.3, performing a difference operation on the matrices obtained in Step 3.1 and Step 3.2 to extract the two types of malicious sample-pair relations: (1) the same-malicious-family sample pair mask $M_{same}$, whose element is 1 when both samples are malicious and belong to the same family; (2) the different-malicious-family sample pair mask $M_{diff}$, whose element is 1 when both samples are malicious and belong to different families, obtained as the difference between the binary mask and the multi-class mask: $M_{diff} = M_{bin} - M_{multi}$; $M_{same}$ corresponds to the strong positive sample pair set $P_z(i)$ in the hierarchical contrastive loss and drives malicious samples of the same family to aggregate tightly in the embedding space; the different-malicious-family mask and the same-malicious-family mask are thus obtained through the difference operation on the binary mask and the multi-class mask; Step 3.4, constructing the negative sample mask matrix for benign-malicious sample pairs, where the negative sample mask matrix $M_{neg}$ identifies cross-class pairings between benign and malicious samples and is obtained directly as the complement of the binary mask: $M_{neg}[i,j] = 1 - M_{bin}[i,j]$ for $i \neq j$; $M_{neg}[i,j] = 1$ if and only if one of samples $i$ and $j$ is benign ($y = 0$) and the other is malicious ($y = 1$); this mask corresponds to the set of negative sample pairs in the hierarchical contrastive loss; by maximizing the embedding distance of such pairs, a clear separation boundary between benign and malicious samples is formed in the feature space, providing an effective basis for the subsequent MAD-based adaptive threshold detection; Step 3.5, calculating the Euclidean distance between sample pairs based on the four mask matrices: applying a boundary constraint to benign sample pairs so that their distance is smaller than a preset margin $m$; applying a strong aggregation constraint to same-malicious-family sample pairs so that their distance is minimized; applying a weak similarity constraint to different-malicious-family sample pairs so that their distance is smaller than the margin $m$; and applying a separation constraint to benign-malicious sample pairs so that their distance is larger than $2m$.
- 4. The open-set encrypted traffic identification method based on hierarchical contrastive learning according to claim 1, wherein the joint loss function in Step 4 combines the hierarchical contrastive loss and the binary cross-entropy classification loss, weighted by a balance coefficient $\lambda$; the binary cross-entropy loss is defined as $L_{bce} = -[y \log \hat{y} + (1 - y) \log(1 - \hat{y})]$, where $y \in \{0,1\}$ is the true binary label of the sample (0 for a benign traffic sample, 1 for a malicious traffic sample) and $\hat{y}$ is the predicted probability that the input belongs to the malicious category; the hierarchical contrastive loss is defined in terms of the Euclidean distance between two sample embeddings, the set of weak positive sample pairs, the set of strong positive sample pairs, and the set of negative sample pairs, where the distance is the embedding distance between samples $i$ and $j$.
- 5. The open-set encrypted traffic identification method based on hierarchical contrastive learning according to claim 1, wherein the specific steps of Step 5 are as follows: Step 5.1, for each known class $c$, extracting the embedded representations of all training samples of the class using the trained encoder; Step 5.2, calculating the centroid vector $\mu_c$ of class $c$ in the embedding space as the prototype representation of the class; Step 5.3, calculating the Euclidean distance from every training sample of the class to the centroid to obtain the distance distribution $D_c$; Step 5.4, based on the statistical properties of the distance distribution, setting the discrimination threshold of class $c$ as $\tau_c = \bar{d}_c + \sigma_c$, where $\bar{d}_c$ and $\sigma_c$ denote the sample mean and standard deviation of the distance set, respectively.
- 6. The open-set encrypted traffic identification method based on hierarchical contrastive learning according to claim 1, wherein the specific discrimination rules in Step 7 are as follows: when the distance $d_c$ from the sample to the nearest centroid is less than or equal to the threshold $\tau_c$ of that class, the sample is judged to belong to class $c$; when the minimum distance from the sample to the centroids of all known classes is still greater than the corresponding threshold, the sample is judged to be of an unknown class.
- 7. An open-set encrypted traffic identification system based on hierarchical contrastive learning, characterized by comprising the following modules: a data preprocessing module, configured to collect and preprocess network traffic data, extract statistical traffic features, and construct a training data set and a test data set containing benign traffic samples and known malicious traffic categories; a model training module, configured to train and optimize the feature embedding network through a joint loss function that fuses a hierarchical contrastive loss with a binary cross-entropy classification loss, wherein the hierarchical contrastive loss gives the embedding space a hierarchical structure by constructing four sample-pair relation mask matrices, covering benign sample pairs, same-malicious-family sample pairs, different-malicious-family sample pairs, and benign-malicious sample pairs, and applying a different distance constraint to each type of sample pair; a feature embedding module, configured to construct an autoencoder feature embedding network based on hierarchical contrastive learning, in which the encoder adopts a multi-layer fully connected neural network structure and maps the preprocessed traffic feature vectors to a low-dimensional embedding space to obtain embedded representations, and a MAD detection algorithm performs classification prediction on the embedding vectors, yielding an embedding space in which benign samples cluster compactly and malicious samples are organized hierarchically; and a threshold calculation and unknown-class detection module, configured to, after training is completed, extract the embedded representations of all training samples of each known class, calculate the centroid vector of each known class in the low-dimensional embedding space, and set an adaptive decision threshold for each known class as the mean plus the standard deviation of the distribution of distances from that class's samples to its centroid; and further configured to obtain the embedded representation of a sample to be detected through the encoder, calculate the distances from the embedded representation to the centroids of the known classes, determine a candidate class by the minimum-distance criterion, compare the distance with the adaptive threshold of the candidate class, and judge the sample to belong to the known class if the distance is less than or equal to the threshold and to an unknown class if the distance is greater than the threshold, thereby accurately identifying unknown attack traffic in an open-world environment.
- 8. The open-set encrypted traffic identification system based on hierarchical contrastive learning according to claim 7, wherein the data preprocessing module is specifically implemented as follows: collecting raw traffic data from the network environment; extracting statistical features including connection duration, protocol type, byte count, and packet count, among others; normalizing the features; and dividing the data into a benign category and a plurality of malicious categories according to the traffic labels.
- 9. The open-set encrypted traffic identification system based on hierarchical contrastive learning according to claim 7, wherein the feature embedding module is specifically implemented by the following steps: constructing an encoder network that maps an input high-dimensional traffic feature vector to a low-dimensional embedding space; performing binary classification prediction on the embedding vector using a MAD detection algorithm; and constraining the structure of the embedding space through the hierarchical contrastive loss function so that benign samples form a compact cluster, samples of the same malicious family aggregate strongly, different malicious families remain weakly similar, and benign and malicious samples are clearly separated.
- 10. An open-set encrypted traffic identification system based on hierarchical contrastive learning, comprising a processor that implements the method of any one of claims 1-6 at run time.
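The pair masks of claim 3 and the distance constraints of Step 3.5 can be sketched in NumPy as follows. This is an illustrative forward computation only, under several assumptions that the claims do not fix: the weak positive set is taken to be benign pairs plus different-family malicious pairs (both are constrained to stay within the margin), the "smaller than m" and "larger than 2m" constraints are written as hinge terms, and each term is averaged over its pairs. The names `pair_masks`, `hierarchical_contrastive_loss`, and `margin` are hypothetical.

```python
import numpy as np


def pair_masks(y_multi):
    """Build the four pair masks from multi-class labels (0 = benign, 1..K = families)."""
    y_multi = np.asarray(y_multi)
    y_bin = (y_multi > 0).astype(int)               # 0 = benign, 1 = malicious
    off_diag = 1 - np.eye(len(y_multi), dtype=int)  # excludes self-pairs
    m_bin = (y_bin[:, None] == y_bin[None, :]).astype(int) * off_diag
    m_multi = (y_multi[:, None] == y_multi[None, :]).astype(int) * off_diag
    both_malicious = np.outer(y_bin, y_bin)
    m_same = m_multi * both_malicious   # malicious, same family
    m_diff = m_bin - m_multi            # malicious, different families
    m_benign = m_multi - m_same         # both benign
    m_neg = (1 - m_bin) * off_diag      # one benign, one malicious
    return m_benign, m_same, m_diff, m_neg


def hierarchical_contrastive_loss(z, y_multi, margin=1.0):
    """Assumed form: hinge(d - m) on weak positives, d on strong positives,
    hinge(2m - d) on negatives, each averaged over its pairs."""
    m_benign, m_same, m_diff, m_neg = pair_masks(y_multi)
    diff = z[:, None, :] - z[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))  # pairwise Euclidean distances
    weak = m_benign + m_diff               # assumption: weak positive set

    def masked_mean(values, mask):
        return (values * mask).sum() / max(mask.sum(), 1)

    return (masked_mean(np.maximum(0.0, d - margin), weak)
            + masked_mean(d, m_same)
            + masked_mean(np.maximum(0.0, 2 * margin - d), m_neg))


# Toy batch: two benign samples, two from family 1, one from family 2.
labels = [0, 0, 1, 1, 2]
z_structured = np.array([[0, 0], [0, 0], [3, 0], [3, 0], [3, 0.5]], dtype=float)
z_collapsed = np.zeros((5, 2))
print(hierarchical_contrastive_loss(z_structured, labels))
print(hierarchical_contrastive_loss(z_collapsed, labels))
```

The structured embedding satisfies every constraint, while the collapsed one violates only the benign-malicious separation, so the loss rewards the hierarchy the claims describe. Actual training would need an automatic-differentiation framework and the joint loss of claim 4; this sketch covers the contrastive term alone.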
Description
Open-set encrypted traffic identification method and system based on hierarchical contrastive learning
Technical Field
The invention belongs to the field of network security and relates to an open-set encrypted traffic identification method and system based on hierarchical contrastive learning.
Background
With the rapid development of network technology and the growing sophistication of network attacks, network security has become a major challenge of the digital era. In particular, with the widespread adoption of encrypted communication, conventional network traffic analysis and intrusion detection methods face unprecedented difficulties. While encryption protects data privacy, it also makes it easier for malicious traffic to spread covertly, so that network security systems struggle to identify and block potential threats. The core challenges of current network security protection are twofold: how to accurately identify unknown malicious traffic in an encrypted environment, and how to let a detection system keep adapting to a continuously evolving threat landscape. Although traditional deep-learning-based traffic classification methods perform well at detecting known attack types, they show clear limitations when facing unknown threats such as zero-day attacks. These methods typically adopt a closed-world assumption, i.e. they assume that every sample class appearing in the test phase has already appeared in the training phase. When the network environment changes or new attack types emerge, the model may misclassify malicious traffic of unknown classes as a known benign class or a known attack type, creating serious security risks.
To detect novel zero-day attacks effectively, anomaly detection is currently a common protection strategy. Unlike traditional supervised malicious traffic classification, anomaly detection trains the model only on normal network traffic: it builds a reference model of normal behavior and flags all traffic that deviates significantly from that distribution as potentially anomalous. In the prior art, Mirsky et al. proposed an autoencoder-based network intrusion detection system that trains an autoencoder only on verified normal network traffic and quantifies how anomalous a flow is by the reconstruction error between the input traffic data and its reconstructed output. However, although anomaly detection has many theoretical advantages, in practice it generally suffers from a high false alarm rate. The root cause is that, in a real network environment, it is difficult to collect representative samples that cover the entire distribution of benign behavior, which leaves the trained anomaly detection classifier with a blurred and inaccurate decision boundary. More critically, anomaly-detection-based methods can only divide network traffic into two basic categories, normal and anomalous, and cannot perform finer-grained classification or feature analysis of the detected attack traffic. To address these inherent limitations of traditional anomaly detection, detection methods based on open-set recognition have gradually attracted wide attention in academia and industry. The core idea of open-set recognition is to add the ability to recognize unknown-class samples to the traditional closed-set recognition problem.
The most representative of these are methods built on the OpenMax algorithm framework. Bendale et al. proposed the OpenMax method, which replaces the traditional SoftMax classification layer of a deep neural network with an OpenMax layer and uses a confidence score computed from the input sample's activation vector at the last layer of the network to determine whether the sample belongs to an unknown class not seen during training. However, these OpenMax-based open-set recognition methods rely mainly on the confidence score as the core criterion for judging whether a sample belongs to an unknown class, and their detection performance and recognition accuracy remain inadequate. More critically, OpenMax-type methods depend heavily on manually preset cut-off threshold parameters: a threshold set too low produces a large number of false positives, a threshold set too high produces false negatives, and the generalization capability of the model in the face of different network env