CN-122027714-A - Network unknown protocol reverse analysis method based on deep learning

Abstract

The specification provides a deep-learning-based method for reverse analysis of unknown network protocols and relates to the technical field of network protocols. The method designs an adaptive protocol framing scheme that combines the K-means++ unsupervised clustering algorithm with feature engineering and autonomously frames the raw, variable-length byte stream of an unknown protocol. By combining the strengths of CNN, LSTM, Transformer, and other models, it automatically captures deep features of the unknown protocol such as local field boundaries, long-sequence dependencies, and global semantic associations. These deep features are taken as input to an ensemble of algorithms such as BiLSTM-CRF, LSTM+HMM, and attention+classifier, which perform format extraction, state recovery, and semantic inference for the unknown protocol. The method markedly improves the degree of automation and the accuracy of unknown-protocol analysis and addresses the problem that unknown network protocols cannot be reverse-analyzed accurately and efficiently when prior knowledge is lacking.

Inventors

YANG WENTAO
YANG SONGLIN
YANG CONG
HE MAOHENG
JIANG YONGKANG
WEI ZIQIANG
ZHOU XIA

Assignees

贵州航天计量测试技术研究所

Dates

Publication Date: 20260512
Application Date: 20251203

Claims (10)

1. A network unknown protocol reverse analysis method based on deep learning, characterized by comprising the following steps: S1, performing data framing and cleaning preprocessing on a byte stream using the K-means++ unsupervised clustering method to obtain a denoised, framed byte stream with known message types; S2, constructing an unknown-protocol semantic feature extraction model that fuses a convolutional neural network (CNN), a long short-term memory network (LSTM), and a Transformer model, and automatically capturing deep features of the protocol, namely local field boundaries, long-sequence dependencies, and global semantic associations; S3, taking the deep features as input, performing high-precision format extraction, state recovery, and semantic inference on the unknown protocol with an unknown-protocol semantic inference model, and outputting field semantic tags, wherein the unknown-protocol semantic inference model comprises a bidirectional long short-term memory network with conditional random field model (BiLSTM-CRF), a long short-term memory network with hidden Markov model (LSTM+HMM), an attention mechanism, and a classifier.
2. The method according to claim 1, wherein the processing in step S1 comprises: S11, designing a sliding window for automatic feature-vector extraction, traversing the byte stream with the independent sliding window, and computing for each position the center byte value, Shannon entropy, context information, and second-order features; S12, performing cluster analysis on the feature vectors with the K-means++ unsupervised clustering method to distinguish boundary points from non-boundary points, obtaining two clusters once clustering is complete; S13, analyzing the centroid features of the two clusters, determining the frame-header cluster, and segmenting the byte stream according to the position indexes in the frame-header cluster; S14, cleaning the segmented frame messages by filtering out duplicate data packets, discarding frames that fail the CRC check, and removing link-layer/network-layer header data.
3. The method according to claim 1, wherein the processing in step S2 comprises: S21, extracting short-range field dependencies with a convolutional neural network algorithm, capturing local field features of the protocol message by sliding convolution kernels; S22, extracting the time-sequence dependency features of the protocol with the gating mechanism of a long short-term memory network algorithm; S23, capturing global semantic associations with a Transformer algorithm, directly computing the dependency between any two positions in the sequence through a self-attention mechanism, and extracting the global semantic association features of the protocol.
4. The method according to claim 1, wherein the processing in step S3 comprises: S31, extracting the protocol format with the BiLSTM-CRF algorithm and outputting a field segmentation result; S32, combining the event feature extraction capability of the long short-term memory network (LSTM) with the state transition modeling capability of the hidden Markov model (HMM) to automatically recover the state machine; S33, adopting a semantic inference mechanism that combines attention with a classifier, focusing on key features through the attention mechanism, performing semantic classification with the classifier, and outputting field semantic tags.
5. The method according to claim 2, wherein the processing in step S11 comprises: S111, performing feature extraction based on a sliding window and feature engineering, computing the core features for a position $i$ from the sliding window to form a feature vector $X_i$, wherein the core features comprise a center byte value, N-gram statistical features, position context differences, and second-order features; wherein the center byte value targets a fixed frame start marker; wherein the N-gram statistical features comprise unigram entropy, byte value frequency, and the number of unique bytes; wherein the position context differences comprise the difference from the previous byte and the difference from the next byte; and wherein the second-order features comprise the mean and variance of the byte values over the whole window.
6. The method according to claim 2, wherein the processing in step S13 comprises: S131, determining, through multi-feature fusion, which of the two clusters with centroids $c_1$ and $c_2$ is the frame-start cluster: examining the center-byte-value feature of $c_1$ and $c_2$, if the first feature value of one centroid is highly concentrated while that of the other is highly dispersed, the former cluster is the frame-start cluster; examining the mean diff feature, whose value is usually higher at frame start positions; and judging the frame-start cluster by fusing the center byte, statistical, and context-difference features; S132, the set $S = \{\, i \mid X_i \in \mathrm{Cluster}_{\mathrm{start}} \,\}$ of positions marked as belonging to the frame-start cluster gives the starting points of the frames; the original byte stream is cut at the frame starting points, $\mathrm{Frame}_k = B[s_k : s_{k+1}]$, where $s_k$ and $s_{k+1}$ are two consecutive points in the set $S$.
7. The method according to claim 3, wherein the processing in step S22 comprises: S221, setting up the LSTM gating mechanism; S222, determining which cell state information to discard; S223, determining which new information to store in the cell state; S224, updating the long-term information of the cell state according to the discarded cell state information and the newly stored cell state information; and S225, determining the output of the current hidden state $h_t$, such that the LSTM finally outputs the hidden-state sequence $H = [h_1, h_2, \ldots, h_t, \ldots, h_L]$, $h_t \in \mathbb{R}^h$, wherein each $h_t$ contains the time-sequence dependency features of the first $t$ steps of the sequence.
8. The method of claim 4, wherein the processing in step S31 comprises: S311, BiLSTM feature encoding; S312, mapping the bidirectional hidden states to the tag space; S313, adding tag transition constraints with the CRF on top of the tag scores output by the BiLSTM; S314, minimizing the negative log-likelihood loss during training.
9. The method of claim 4, wherein the processing in step S32 comprises: S321, event feature extraction, namely processing the data packet sequence with an LSTM and outputting the event features of each data packet; S322, HMM state modeling; S323, parameter learning and state inference, namely optimizing the HMM parameters $\pi$, $A$, and $B$ with the Baum-Welch algorithm to maximize the likelihood $P(H \mid \pi, A, B)$ of the observation sequence $H$.
10. The method of claim 4, wherein the processing in step S33 comprises: S331, based on the global feature A from the Transformer, computing attention weights for each position to focus on semantically relevant field features; S332, summing the weighted features to obtain the global semantic feature of the whole sequence; S333, performing semantic classification with a fully connected layer followed by softmax.
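
The framing steps of claims 2, 5 and 6 (S11 to S13) can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the function names (`window_features`, `kmeans_pp`, `start_cluster`, `split_frames`), the window half-width of 2, the unscaled Euclidean distance, and the use of only the center-byte concentration criterion to pick the frame-start cluster are all choices made for this example.

```python
import math
import random
from collections import Counter


def window_features(data: bytes, i: int, half: int = 2) -> list:
    """Feature vector X_i for position i, from a window of up to 2*half+1 bytes."""
    win = data[max(0, i - half): i + half + 1]
    n = len(win)
    counts = Counter(win)
    # Unigram (Shannon) entropy of the byte values inside the window.
    entropy = -sum(c / n * math.log2(c / n) for c in counts.values())
    mean = sum(win) / n
    var = sum((b - mean) ** 2 for b in win) / n
    prev_diff = abs(data[i] - data[i - 1]) if i > 0 else 0
    next_diff = abs(data[i + 1] - data[i]) if i + 1 < len(data) else 0
    # Order: center byte, entropy, center-byte frequency, unique bytes,
    # previous/next differences, window mean, window variance.
    return [data[i], entropy, counts[data[i]] / n, len(counts),
            prev_diff, next_diff, mean, var]


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp(points, k=2, iters=20, seed=0):
    """Plain k-means with k-means++ seeding; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        # k-means++: pick the next seed with probability proportional to D(x)^2.
        d2 = [min(sq_dist(p, c) for c in centroids) for p in points]
        pick = rng.choices(points, weights=d2)[0] if sum(d2) > 0 else rng.choice(points)
        centroids.append(list(pick))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def start_cluster(features, labels):
    """Assumed rule: the cluster whose center-byte values (feature 0) vary least."""
    def spread(j):
        vals = [f[0] for f, lab in zip(features, labels) if lab == j]
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals) / len(vals)
    return min(set(labels), key=spread)


def split_frames(data: bytes, starts):
    """Frame_k = B[s_k : s_{k+1}]; the last frame runs to the end of the stream."""
    s = sorted(starts)
    return [data[a:b] for a, b in zip(s, s[1:] + [len(data)])]
```

For a byte stream `B`, one would compute `feats = [window_features(B, i) for i in range(len(B))]`, cluster with `kmeans_pp(feats, k=2)`, collect the positions whose label equals `start_cluster(feats, labels)`, and pass them to `split_frames`. The features would likely need scaling before clustering, since the raw byte value and the window variance otherwise dominate the distance.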
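
Steps S221 to S225 of claim 7 follow the usual LSTM gate structure, which in the standard textbook formulation reads as below. The input $x_t$, the weight matrices $W_f, W_i, W_C, W_o$, the biases $b_f, b_i, b_C, b_o$, and the symbols $f_t, i_t, \tilde{C}_t, C_t, o_t$ are notation assumed here for illustration; the claims themselves name only $h_t$ and $H$.

```latex
\begin{aligned}
f_t &= \sigma\!\left(W_f\,[h_{t-1}, x_t] + b_f\right)
  && \text{(S222: what to discard from the cell state)} \\
i_t &= \sigma\!\left(W_i\,[h_{t-1}, x_t] + b_i\right), \quad
\tilde{C}_t = \tanh\!\left(W_C\,[h_{t-1}, x_t] + b_C\right)
  && \text{(S223: what new information to store)} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
  && \text{(S224: update of the long-term cell state)} \\
o_t &= \sigma\!\left(W_o\,[h_{t-1}, x_t] + b_o\right), \quad
h_t = o_t \odot \tanh(C_t)
  && \text{(S225: current hidden state)}
\end{aligned}
```

Here $\sigma$ is the logistic sigmoid and $\odot$ is the element-wise product; stacking the $h_t$ for $t = 1, \ldots, L$ gives the sequence $H$ of claim 7.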

Description

Network unknown protocol reverse analysis method based on deep learning

Technical Field

The document relates to the technical field of network protocols, and in particular to a network unknown protocol reverse analysis method based on deep learning.

Background

Reverse parsing of an unknown network protocol is the technique of inferring, without a protocol specification document, the protocol's format structure (field division, data types), state transition rules (e.g., the TCP three-way handshake), and semantics (field functions, e.g., length/type fields) by analyzing network traffic or program behavior. Its basic goal is to infer the format, semantics, and behavior of the unknown protocol. Protocol messages typically consist of a header and a payload, and the payload often contains dynamically structured fields (e.g., variable-length arrays) that differ even between messages of the same type; this can significantly interfere with accurate measurement of message similarity and increases uncertainty in message type identification.

Traditional methods have clear shortcomings: rule-based and similar algorithms depend heavily on manual experience or explicit features, struggle with proprietary protocols and variable-length or nested fields, are not robust to noisy data, cannot automatically capture the implicit sequence dependencies of a protocol, and scale poorly, since the rules must be redesigned whenever a new protocol type is added.

Semantic inference is another important task of protocol reverse parsing, and existing methods mainly include type matching and heuristic mining. Type matching infers semantics by identifying common data types (e.g., integer, floating point, timestamp) in protocol fields, but is limited to common, reusable field types.
Heuristic mining finds fields such as length and checksum by computing relations among fields, but it requires purpose-built algorithms and is time-consuming.

In recent years, deep learning models have been increasingly applied to protocol reverse parsing. Deep-learning-based reverse analysis of unknown network protocols uses a deep learning model to extract features and recognize patterns in network data flows and then recovers the protocol specification. Through automatic feature extraction and end-to-end learning, deep learning addresses the core pain points of traditional methods: it can process unstructured binary data without manually defined features, can capture a protocol's local field correlations, long-sequence dependencies, and state transition rules, and supports unsupervised/semi-supervised learning, which suits unknown-protocol scenarios that lack labeled data.

Therefore, a network unknown protocol reverse analysis method is needed to solve problems in prior-art protocol reverse analysis such as strong data dependence, insufficient adaptability, and limited depth of semantic analysis.

Disclosure of Invention

The specification provides a network unknown protocol reverse analysis method based on deep learning to solve problems in prior-art protocol reverse analysis such as strong data dependence, insufficient adaptability, and limited depth of semantic analysis.
In a first aspect, the present disclosure provides a network unknown protocol reverse parsing method based on deep learning, characterized by comprising the following steps:

performing data framing and cleaning preprocessing on the byte stream using the K-means++ unsupervised clustering method to obtain a denoised, framed byte stream with known message types;

constructing an unknown-protocol semantic feature extraction model that fuses a convolutional neural network (CNN), a long short-term memory network (LSTM), and a Transformer model, and automatically capturing deep features of the protocol, namely local field boundaries, long-sequence dependencies, and global semantic associations;

taking the deep features as input, performing high-precision format extraction, state recovery, and semantic inference on the unknown protocol with the unknown-protocol semantic inference model, and outputting field semantic tags, wherein the unknown-protocol semantic inference model comprises a bidirectional long short-term memory network with conditional random field model (BiLSTM-CRF), a long short-term memory network with hidden Markov model (LSTM+HMM), an attention mechanism, and a classifier.

The beneficial effects of the invention are as follows:

The application provides a network unknown protocol reverse analysis method based on deep learning, which comprises designing an adaptive protocol framing method that combines the K-means++ unsupervised clustering algorithm with feature engineering, autonomously framing the original byte stream of an unknown protocol, automatically capturing deep features such as local field