
CN-121984701-A - Industrial Internet unknown flow identification method and system based on self-supervision learning and prototype network


Abstract

The invention discloses a method and system for identifying unknown industrial Internet traffic based on self-supervised learning and a prototype network, and belongs to the technical field of network security and traffic identification. The method comprises: processing raw industrial Internet traffic to obtain standardized feature vectors; constructing a self-supervised pre-training network with an improved InfoNCE contrastive learning mechanism and pre-training it on unlabeled data; taking the semantic prototypes maintained during the self-supervised stage as the initial prototype set of the system; performing semantic classification and open-set recognition on new samples with the trained network; expanding the semantic prototypes for samples identified as unknown; and applying an exponential moving average update for samples of known categories to correct the semantic prototypes in real time. Without requiring manual labeling, the invention can achieve efficient feature learning, unknown protocol identification, and continuous incremental model optimization for industrial Internet traffic.

Inventors

  • CHEN XUEJIAO
  • WANG PAN
  • JIANG MINMIN

Assignees

  • Nanjing Vocational College of Information Technology (南京信息职业技术学院)

Dates

Publication Date
20260505
Application Date
20251215

Claims (10)

  1. An industrial Internet unknown traffic identification method based on self-supervised learning and a prototype network, characterized by comprising the following steps: Step 1, performing session-level aggregation on raw industrial Internet traffic and forming a session sequence in combination with a timeout segmentation strategy; Step 2, performing feature extraction on the obtained session sequence to construct feature vectors, and processing them to obtain standardized feature vectors; Step 3, constructing a self-supervised pre-training network with an improved InfoNCE contrastive learning mechanism, wherein the network learns a general traffic representation from the standardized feature vectors by contrastive learning without labeled data, positive and negative sample pairs are generated by data augmentation, deep features are extracted by a lightweight encoder, and a semantic prototype constraint is introduced on top of the conventional InfoNCE: a set of dynamic semantic prototypes is maintained in the feature space during training, and a similarity constraint between each sample feature and its most similar semantic prototype is added to the loss function, so that sample features within the same semantic cluster aggregate more tightly; Step 4, on the basis of the semantic prototypes maintained during training of the self-supervised pre-training network, generating and maintaining initial semantic prototypes and dynamically updating them in the online stage; Step 5, performing semantic classification and open-set recognition on new samples with the trained self-supervised pre-training network; and Step 6, performing semantic prototype expansion for samples identified as unknown, and performing an exponential moving average update for samples of known categories to correct the semantic prototypes in real time.
  2. The industrial Internet unknown traffic identification method based on self-supervised learning and a prototype network according to claim 1, wherein the loss function of the self-supervised pre-training network with the improved InfoNCE contrastive learning mechanism is as follows: [formula not reproduced in the source text], wherein the symbols of the formula denote, respectively: the total loss; the number of samples in a batch; the feature vector of a sample; the positive semantic prototype matched to that sample; the adaptive temperature coefficient of each semantic prototype, which is determined by a base temperature hyperparameter and the standard deviation of the feature distribution within the prototype cluster; an adaptive penalty weight for hard negative samples, whose value depends on whether the similarity between the sample and a non-matching prototype is higher than a preset threshold; the total number of semantic prototype categories; the individual semantic prototypes; a balance coefficient; and any two different semantic prototype vectors in the prototype set.
  3. The industrial Internet unknown traffic identification method based on self-supervised learning and a prototype network according to claim 2, wherein Step 5 comprises: Step 51, when a new sample x is input, extracting its feature representation with the encoder and computing the similarity distance to each semantic prototype using the cosine distance [formulas not reproduced in the source text], wherein x denotes the new sample and the remaining symbols denote the normalized feature vector of the sample extracted by the encoder, the feature extraction function of the lightweight encoder, the cosine distance between the sample feature and each semantic prototype, and each semantic prototype vector; and Step 52, selecting the minimum similarity distance, and when the minimum similarity distance is smaller than the set similarity threshold, determining that the sample matches the corresponding prototype, and otherwise determining that the sample is an unknown or abnormal sample.
  4. The industrial Internet unknown traffic identification method based on self-supervised learning and a prototype network according to claim 3, wherein the similarity threshold is computed by an adaptive threshold mechanism: based on the set of maximum similarities of samples in the training phase or in a recent window, the mean and the standard deviation of that set are calculated, and the similarity threshold is given by [formula not reproduced in the source text] in terms of that mean, that standard deviation, and an adjustment coefficient k.
  5. The industrial Internet unknown traffic identification method based on self-supervised learning and a prototype network according to claim 4, wherein in Step 2, statistical features, distribution features, time-series features and behavior features are extracted for each session and assembled into a feature vector, and the feature vector is standardized by a standardization module to obtain the standardized feature vector.
  6. The industrial Internet unknown traffic identification method based on self-supervised learning and a prototype network according to claim 5, wherein the statistical features comprise session duration, total number of packets, total number of bytes, average packet length, packet length variance, number of uplink packets, number of downlink packets and uplink/downlink ratio; the distribution features comprise packet-length histogram proportions; the time-series features comprise the inter-arrival times of the first 10 data packets, wherein sessions with fewer than 10 packets are padded with 0 or filled by interpolation, and missing intervals are filled with the mean value or by interpolation, or replaced with the mean or the variance; and the behavior features comprise the number of packet direction switches, the maximum number of consecutive packets in one direction, and the average duration of unidirectional runs.
  7. The industrial Internet unknown traffic identification method based on self-supervised learning and a prototype network according to claim 6, wherein Step 6 comprises: for a sample judged to be unknown, adaptively generating a new semantic prototype according to unsupervised clustering or statistical distribution results, or verifying the validity of the new cluster in combination with manual review information, thereby achieving continuous prototype expansion and incremental model adaptation; and for a sample of a known category, performing an exponential moving average update [formula not reproduced in the source text] to correct the semantic prototype in real time, wherein the symbols of the formula denote the semantic prototype vector of the corresponding known category to be updated, the momentum coefficient of the prototype update, and the feature vector of the new sample determined to belong to that known category.
  8. The industrial Internet unknown traffic identification method based on self-supervised learning and a prototype network according to claim 7, wherein Step 1 comprises: Step 11, capturing raw network data packets in real time from an industrial network interface or a mirror port, the raw network data packets comprising Ethernet frames and the IP datagrams in their payloads, and applying protocol filtering to the captured packets so that only valid IPv4 and IPv6 messages are retained and broadcast packets and non-IP-layer control messages are discarded; Step 12, performing session-level aggregation on the filtered packets, wherein messages are first grouped using the five-tuple of the two communicating parties, namely source IP address, destination IP address, source port number, destination port number and protocol number, as the basic aggregation key; for TCP traffic, the start and end boundaries of a session are determined by detecting the SYN and FIN flag bits to ensure the integrity of session division; for connectionless or short-interaction protocols, a time window mechanism is introduced on top of the five-tuple, with a threshold on the interval between adjacent messages, and a new session is cut automatically when the threshold is exceeded, to avoid mixing across sessions; and for industrial control protocols with a request-response structure, a "request-response pair" is preferentially used as the session division unit to preserve the behavioral semantic consistency of the communication; and Step 13, after session division is completed, extracting the basic meta information of each session and writing it into a session index table to form the session sequence, wherein the basic meta information comprises the total number of packets, the numbers of uplink and downlink packets, the start and end times, the duration, the number of bytes, the average packet length and the protocol type.
  9. An industrial Internet unknown traffic identification system based on self-supervised learning and a prototype network, characterized by comprising a session aggregation module, a feature extraction and processing module, a self-supervised pre-training module, a prototype learning module, an open-set recognition and incremental update module, and an output module, wherein the system is used to execute the industrial Internet unknown traffic identification method based on self-supervised learning and a prototype network according to claim 1, and wherein: the session aggregation module is configured to perform session-level aggregation on raw industrial Internet traffic and to form a session sequence in combination with a timeout segmentation strategy; the feature extraction and processing module is configured to perform feature extraction on the obtained session sequence to construct feature vectors, and to process them to obtain standardized feature vectors; the self-supervised pre-training module is configured to construct a self-supervised pre-training network with an improved InfoNCE contrastive learning mechanism, wherein the network learns a general traffic representation from the standardized feature vectors by contrastive learning without labeled data, positive and negative sample pairs are generated by data augmentation, deep features are extracted by a lightweight encoder, a semantic prototype constraint is introduced on top of the conventional InfoNCE, a set of dynamic semantic prototypes is maintained in the feature space during training, and a similarity constraint between each sample feature and its most similar semantic prototype is added to the loss function, so that sample features within the same semantic cluster aggregate more tightly; the prototype learning module is configured to generate and maintain initial semantic prototypes, on the basis of the semantic prototypes maintained during training of the self-supervised pre-training network, and to update them dynamically in the online stage; the open-set recognition and incremental update module is configured to perform semantic classification and open-set recognition on new samples with the trained self-supervised pre-training network; and the output module is configured to output the identified semantic classification.
  10. An electronic device comprising at least one processor, at least one memory and a communication interface, wherein the processor, the memory and the communication interface communicate with one another, the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform the industrial Internet unknown traffic identification method based on self-supervised learning and a prototype network according to any one of claims 1 to 8.
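The session aggregation of claim 8 (five-tuple grouping, timeout segmentation, and the session index entries of Step 13) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the patent: the `Packet` record, the direction-independent key, the 5-second `idle_timeout`, and the convention that "uplink" means the direction of the first packet. TCP SYN/FIN boundary detection and request-response pairing are deliberately left out.

```python
from collections import namedtuple

# Hypothetical packet record; field names are illustrative only.
Packet = namedtuple("Packet", "ts src_ip dst_ip src_port dst_port proto length")

def flow_key(p):
    """Five-tuple key that is the same for both directions (an assumption)."""
    a, b = (p.src_ip, p.src_port), (p.dst_ip, p.dst_port)
    return (min(a, b), max(a, b), p.proto)

def aggregate_sessions(packets, idle_timeout=5.0):
    """Group packets by five-tuple and cut a new session whenever the gap
    between adjacent packets of a flow exceeds idle_timeout seconds."""
    open_sessions = {}   # key -> packets of the currently open session
    finished = []
    for p in sorted(packets, key=lambda q: q.ts):
        key = flow_key(p)
        current = open_sessions.get(key)
        if current is not None and p.ts - current[-1].ts > idle_timeout:
            finished.append(current)   # gap too large: close the old session
            current = None
        if current is None:
            current = []
            open_sessions[key] = current
        current.append(p)
    finished.extend(open_sessions.values())
    return finished

def session_meta(session):
    """Basic meta information for one session index entry."""
    first = session[0]
    initiator = (first.src_ip, first.src_port)   # assumed uplink direction
    up = sum(1 for p in session if (p.src_ip, p.src_port) == initiator)
    total_bytes = sum(p.length for p in session)
    return {
        "packets": len(session),
        "uplink_packets": up,
        "downlink_packets": len(session) - up,
        "start": first.ts,
        "end": session[-1].ts,
        "duration": session[-1].ts - first.ts,
        "bytes": total_bytes,
        "mean_packet_length": total_bytes / len(session),
        "protocol": first.proto,
    }

# Usage: two packets close together, then one after a long idle gap.
packets = [
    Packet(0.0, "10.0.0.1", "10.0.0.2", 5000, 502, 6, 60),
    Packet(0.1, "10.0.0.2", "10.0.0.1", 502, 5000, 6, 80),
    Packet(10.0, "10.0.0.1", "10.0.0.2", 5000, 502, 6, 60),
]
sessions = aggregate_sessions(packets, idle_timeout=5.0)
index_table = [session_meta(s) for s in sessions]
```

Sorting by timestamp first keeps the gap test simple; a streaming capture would instead expire idle flows with a timer.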
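The open-set decision and prototype update of claims 3, 4 and 7 can be sketched in Python with NumPy as follows. The formulas themselves are not reproduced in the text, so the concrete forms below are assumptions, not the patented formulas: cosine distance is taken as 1 minus cosine similarity, the adaptive threshold as mean minus k times standard deviation of recent maximum similarities, and the exponential moving average as momentum-weighted mixing followed by re-normalization. All function and parameter names are hypothetical.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """Scale vectors (last axis) to unit length."""
    v = np.asarray(v, dtype=float)
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def adaptive_threshold(max_similarities, k=2.0):
    """Assumed form: theta = mean - k * std over recent maximum similarities."""
    s = np.asarray(max_similarities, dtype=float)
    return float(s.mean() - k * s.std())

def classify(z, prototypes, theta):
    """Return (prototype index, distance); index is -1 for an unknown sample."""
    z = l2_normalize(z)
    P = l2_normalize(prototypes)
    dist = 1.0 - P @ z              # cosine distance to every prototype
    j = int(np.argmin(dist))
    # dist < 1 - theta is equivalent to cosine similarity > theta
    if dist[j] < 1.0 - theta:
        return j, float(dist[j])
    return -1, float(dist[j])

def ema_update(prototype, z, momentum=0.9):
    """Assumed EMA: p <- m * p + (1 - m) * z, then re-normalize."""
    mixed = momentum * l2_normalize(prototype) + (1.0 - momentum) * l2_normalize(z)
    return l2_normalize(mixed)

# Usage with two toy prototypes in a 2-D feature space.
prototypes = np.array([[1.0, 0.0], [0.0, 1.0]])
theta = adaptive_threshold([0.95, 0.97, 0.99, 0.93], k=2.0)

label, d = classify([0.9, 0.1], prototypes, theta)      # close to prototype 0
unknown_label, _ = classify([0.7, 0.7], prototypes, theta)  # far from both

if label >= 0:
    prototypes[label] = ema_update(prototypes[label], [0.9, 0.1])
```

A sample that returns -1 would be passed to prototype expansion (for example, clustering of accumulated unknown samples), which this sketch does not cover.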

Description

Industrial Internet unknown flow identification method and system based on self-supervision learning and prototype network

Technical Field

The invention belongs to the technical field of network security and traffic identification, and in particular relates to a method and system for identifying unknown industrial Internet traffic based on self-supervised learning and a prototype network, used for protocol identification and abnormal traffic detection in industrial Internet environments.

Background

With the development of the industrial Internet, more and more industrial equipment and control systems are connected to networks, and production and manufacturing systems are highly interconnected, forming a complex heterogeneous communication environment. The communication protocols used by these devices are diverse, including standardized industrial protocols such as Modbus, DNP3 and IEC104, as well as a large number of proprietary or custom protocols. Traffic in this environment is diverse and changes dynamically: alongside a large volume of normal traffic, the industrial Internet carries continuously emerging security-threat traffic, such as novel malware or abnormal communication behavior. Traditional network traffic identification methods rely mainly on supervised learning or rule bases. These methods need a large amount of manually annotated data and can only identify known traffic types, so new unknown traffic types cannot be identified in time, which poses a significant security risk. In addition, traffic data collection is restricted and labeling costs are high in industrial scenarios, which further increases the difficulty of model training. At present, no traffic identification solution combines open-set recognition and incremental learning with only a small amount of labeled data, or with no labeled data at all.
Therefore, a new method is needed that can effectively identify unknown traffic without labeling large numbers of samples and that has self-learning capability, so as to improve traffic security management and anomaly detection in the industrial Internet environment.

Disclosure of Invention

The invention aims to provide an industrial Internet unknown traffic identification method based on self-supervised learning and a prototype network, which can identify unknown industrial Internet protocols and abnormal traffic when labels are scarce.

The technical scheme adopted by the invention is as follows:

An industrial Internet unknown traffic identification method based on self-supervised learning and a prototype network comprises the following steps:

Step 1, performing session-level aggregation on raw industrial Internet traffic, and forming a session sequence in combination with a timeout segmentation strategy.

Step 2, performing feature extraction on the obtained session sequence to construct feature vectors, and processing them to obtain standardized feature vectors.

Step 3, constructing a self-supervised pre-training network with an improved InfoNCE contrastive learning mechanism. The network learns a general traffic representation from the standardized feature vectors by contrastive learning without labeled data; positive and negative sample pairs are generated by data augmentation, and deep features are extracted by a lightweight encoder.
On top of the conventional InfoNCE, a semantic prototype constraint is introduced: a set of dynamic semantic prototypes is maintained in the feature space during training, and a similarity constraint between each sample feature and its most similar semantic prototype is added to the loss function, so that sample features within the same semantic cluster aggregate more tightly.

Step 4, on the basis of the semantic prototypes maintained during training of the self-supervised pre-training network with the improved InfoNCE contrastive learning mechanism, generating and maintaining initial semantic prototypes and dynamically updating them in the online stage.

Step 5, performing semantic classification and open-set recognition on new samples with the trained self-supervised pre-training network.

Step 6, performing semantic prototype expansion for samples identified as unknown, and performing an exponential moving average update for samples of known categories to correct the semantic prototypes in real time.

The loss function of the self-supervised pre-training network with the improved InfoNCE contrastive learning mechanism is: [formula not reproduced in the source text], wherein the symbols of the formula denote, in order: the total loss; the number of samples in a batch; the feature vector of a sample; the positive s