CN-121985060-A - Private protocol message segmentation method and system based on self-supervision learning
Abstract
A private protocol message segmentation method and system based on self-supervised learning. The method comprises: splitting private protocol traffic messages into flows according to their five-tuple and converting the bytes into a sequence of a specified format with an appended offset; constructing a self-supervised learning model that receives the message sequence and, at byte granularity, maps the bytes into a high-dimensional vector space from the two aspects of word sequence and word sense; fusing the vectors learned at the two levels to convert the bytes into computable byte relation embedded vectors; applying cosine similarity to the byte relation embedded vectors of each two adjacent bytes to generate an independent byte relation curve for each protocol message; and deciding, based on the byte relation curve, whether two adjacent bytes can be segmented, thereby achieving refined segmentation of private protocol messages. The scheme removes the dependence on byte-level label knowledge of datagrams and can effectively achieve fine-grained segmentation of private protocol messages for which prior knowledge is lacking.
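The preprocessing described in the abstract (flow splitting by five-tuple, then conversion of each message into byte values with offsets) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `flow_key`, `format_message` and `split_flows`, the packet tuple layout, and the choice of a direction-independent key are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical packet layout: (src_ip, dst_ip, src_port, dst_port, proto, payload)
Packet = Tuple[str, str, int, int, str, bytes]
FlowKey = Tuple[str, str, int, int, str]


def flow_key(src_ip: str, dst_ip: str, src_port: int, dst_port: int, proto: str) -> FlowKey:
    """Build a direction-independent five-tuple key.

    Sorting the two endpoints makes both directions of an exchange
    map to the same flow (an assumption about how flows are merged).
    """
    first, second = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return (first[0], second[0], first[1], second[1], proto)


def format_message(payload: bytes) -> List[Tuple[str, int]]:
    """Convert a payload into (hexadecimal byte value, offset) pairs."""
    return [(f"{value:02x}", offset) for offset, value in enumerate(payload)]


def split_flows(packets: Iterable[Packet]) -> Dict[FlowKey, List[List[Tuple[str, int]]]]:
    """Group formatted messages by their five-tuple flow key."""
    flows: Dict[FlowKey, List[List[Tuple[str, int]]]] = defaultdict(list)
    for src_ip, dst_ip, src_port, dst_port, proto, payload in packets:
        key = flow_key(src_ip, dst_ip, src_port, dst_port, proto)
        flows[key].append(format_message(payload))
    return dict(flows)
```

With this sketch, a request and its reply between the same two endpoints land in one flow, and each message becomes a list such as `("00", 0), ("01", 1), ...` that a downstream embedding model could consume.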
Inventors
- LI JUNCHEN
- Lv zhuo
- Chen cen
- LAN JINGHONG
- ZHANG ZHENG
- LI NUANNUAN
- CAI JUNFEI
- LI MINGYAN
Assignees
- State Grid Henan Electric Power Company Electric Power Research Institute (国网河南省电力公司电力科学研究院)
Dates
- Publication Date
- 20260505
- Application Date
- 20260123
Claims (10)
- 1. A private protocol message segmentation method based on self-supervised learning, characterized by comprising the following steps: Step 1, splitting private protocol traffic messages into flows according to five-tuple, and converting message sections into formatted message sequences with an appended offset; Step 2, inputting the formatted message sequence into a self-supervised learning model, mapping bytes into a high-dimensional vector space at the two levels of word sequence and word sense respectively, fusing the vectors learned at the two levels, and converting the bytes into computable byte relation embedded vectors; Step 3, performing cosine similarity calculation on the byte relation embedded vectors of each two adjacent bytes, generating an independent byte relation curve for each protocol message, and performing refined message segmentation by combined longitudinal and transverse analysis of the byte relation curve to obtain a segmentation result for each two adjacent bytes.
- 2. The self-supervised learning based private protocol message segmentation method as set forth in claim 1, wherein the step 1 further includes: According to the format of five-tuple < source IP, destination IP, source port, destination port, transport layer protocol >, dividing the protocol messages with the same information into the same stream, and dividing the messages exchanged between the source information and the destination information into the same stream; and converting the protocol message into a formatted message sequence by taking bytes as granularity, wherein the formatted message sequence comprises byte values and offsets, the byte values are hexadecimal values of the bytes, and the offsets are offsets of the bytes in the current message.
- 3. The self-supervised learning based private protocol message segmentation method as set forth in claim 2, wherein the step 2 further includes: constructing an inter-byte relation understanding model based on self-supervised learning, adopting an LSTM model with a self-attention mechanism to learn the word sequence relations among the bytes and converting the word sequence relations into word sequence vectors, and adopting a GloVe model to learn the word sense relations among the bytes and converting the word sense relations into word sense vectors; and fusing the word sequence vector and the word sense vector to convert the bytes into computable byte relation embedded vectors.
- 4. The method for segmenting a private protocol message based on self-supervised learning as recited in claim 3, wherein training the inter-byte relation understanding model in a self-supervised manner further comprises: when training the LSTM model with a self-attention mechanism, training with the model output targeted at predicting the word sequence of each byte in the formatted message sequence, using a loss function that minimizes the cross-entropy loss, and finally outputting a 512-dimensional word sequence vector; and when training the GloVe model, training with reconstruction of the word co-occurrence matrix of the formatted message sequence as the target, using a loss function designed to minimize the squared difference between the dot product of two word vectors and the number of co-occurrences of the two words in the word co-occurrence matrix, and finally outputting a 512-dimensional word sense vector.
- 5. The method for segmenting a private protocol message based on self-supervised learning as recited in claim 4, wherein fusing the word sequence vector and the word sense vector to convert bytes into computable byte relation embedded vectors further comprises: fusing the word sequence vector and the word sense vector of each byte according to the following formula to generate a byte embedded vector BRE: [formula not reproduced in this text], wherein the symbols of the formula denote, respectively, the word sequence vector, the word sense vector and the fused representation vector of a byte; each component is the value of a given dimension of the vector of a given byte, with the indices running over the vector dimensions and the number of bytes; and the fusion parameter takes the value 0.85.
- 6. The method for segmenting a private protocol message based on self-supervised learning as recited in claim 5, wherein the step 3 further comprises: Sliding on the formatted message sequence by adopting a sliding window mechanism with a window of 2 and a step length of 1, and calculating the relation between two adjacent bytes by adopting cosine similarity to generate a byte relation curve BRC; based on the byte relation curve BRC, a refined message segmentation algorithm HVM-Seg based on longitudinal and transverse combination analysis is adopted, and a decision is made on whether two adjacent bytes can be segmented through transverse reasoning and longitudinal correction, so that message segmentation is realized.
- 7. The self-supervised learning based private protocol message segmentation method as set forth in claim 6, wherein the transverse reasoning is implemented based on the byte relation curve BRC and includes the following constraints: Constraint one: when the values of consecutive elements in the BRC rise continuously, i.e. the relation between the corresponding bytes gradually strengthens, the consecutive bytes are not segmented from one another; instead, a segment is generated at the byte following the one corresponding to the rising edge point; Constraint two: when the values of consecutive elements in the BRC fall continuously, i.e. the relation between the corresponding byte and the preceding byte gradually weakens, a segment is generated at the byte preceding the one corresponding to the falling edge point; Constraint three: a segmentation point is generated at the last byte to mark the end of the protocol message.
- 8. A self-supervised learning based private protocol message segmentation system, comprising: a protocol flow preprocessing module, used for splitting private protocol traffic messages into flows according to five-tuple and converting message sections into formatted message sequences with an appended offset; a vector fusion module, used for inputting the formatted message sequence into a self-supervised learning model, mapping bytes into a high-dimensional vector space at the two levels of word sequence and word sense respectively, fusing the vectors learned at the two levels, and converting the bytes into computable byte relation embedded vectors; and a protocol message segmentation module, used for performing cosine similarity calculation on the byte relation embedded vectors of each two adjacent bytes, generating an independent byte relation curve for each protocol message, and performing refined message segmentation by combined longitudinal and transverse analysis of the byte relation curve to obtain a segmentation result for each two adjacent bytes.
- 9. A terminal comprising a processor and a storage medium, characterized in that: the storage medium is used for storing instructions; and the processor is configured to operate in accordance with the instructions to perform the steps of the self-supervised learning based private protocol message segmentation method as set forth in any one of claims 1-7.
- 10. A computer readable storage medium having stored thereon a computer program, characterized in that the program when executed by a processor implements the steps of the self-supervised learning based private protocol message segmentation method as set forth in any one of claims 1-7.
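The byte relation curve (BRC) of claims 1 and 6 (cosine similarity over adjacent byte embeddings, window 2, step 1) can be illustrated with a short sketch. The function names `cosine_similarity` and `byte_relation_curve` are hypothetical, the embeddings are assumed to be already fused per-byte vectors, and returning 0.0 for a zero-length vector is a choice made for this example; the claims do not specify it.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors.

    Returns 0.0 when either vector has zero norm (an assumption made
    here to avoid division by zero).
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def byte_relation_curve(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Slide a window of 2 with step 1 over per-byte embeddings.

    Element i of the result relates byte i to byte i + 1, so a message
    of n bytes yields a curve of n - 1 values.
    """
    return [
        cosine_similarity(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
```

Each message gets its own curve; the segmentation decision (the transverse reasoning and longitudinal correction of HVM-Seg) would then operate on the rises and falls of these values and is not reproduced here.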
Description
Private protocol message segmentation method and system based on self-supervision learning
Technical Field
The invention belongs to the field of network security protocol analysis, and particularly relates to a private protocol message segmentation method and system based on self-supervised learning.
Background
As internet technology gradually penetrates into various fields, it brings great convenience to many industries in production and daily life. These emerging networks do not need to break away from the existing network framework for information interaction; it is only necessary to ensure that the emerging networks are stably integrated with the various industries. As a result, network traffic is becoming more diverse and complex, and the continuous emergence of unknown traffic further increases the difficulty of monitoring network traffic. Protocol Reverse Engineering (PRE) provides a way to address the problem of unknown traffic: it clarifies the attribution of unknown traffic at the private protocol level, reveals the message structure of the unknown traffic, and parses its private protocol format.
Private protocol message segmentation is an indispensable preliminary step in processing unknown traffic and is the first analysis stage that touches the traffic data directly. A message transmitted in the network is a variable-length set of contiguous bits/bytes, which is parsed layer by layer through each protocol layer so that the meaning of each bit/byte combination can be understood. This requires that field segmentation points be explicitly specified in the protocol specification: after receiving a protocol message, the receiver interprets, according to the protocol specification, the byte combination between any two adjacent segmentation points as the information being exchanged. That is, protocol fields carry semantics, and a protocol message is composed of protocol fields, so segmentation points exist between fields. Protocol message segmentation is the process of determining the segmentation points of a message sequence. Taking the message sequence as the reference, it uses finer-grained fields as the object of analysis in place of the whole message sequence of the unknown traffic, deconstructing the protocol message in a finer manner so that private protocol specification information can be inferred more accurately. Protocol message segmentation therefore directly determines the accuracy of the results inferred by subsequent algorithms.
Traditional traffic-based protocol reverse engineering often overlooks the importance of message segmentation. Methods based on sequence alignment take the whole message sequence as the input of reverse analysis to extract a message skeleton, while methods based on probability statistics or frequent-item mining typically use an N-gram algorithm to segment protocol messages in support of subsequent inference; refined message segmentation therefore still needs improvement. In addition, traditional message segmentation methods that evaluate byte structure or information entropy have certain shortcomings: they ignore the relation information that exists between the bytes of a protocol message when it is constructed, so the segmentation results are poor. Moreover, the private protocol specification followed by unknown traffic is unavailable, so fine-grained, byte-level message segmentation labels cannot in principle be obtained, and supervised learning is essentially not applicable to protocol reverse analysis.
Some techniques use the message segmentation points of known protocols as labels to train a model for segmenting known-protocol messages, and then apply that model to the segmentation of private protocol traffic. This approach has some feasibility because protocols share certain commonalities, but protocols also differ from one another. A private protocol in particular is constructed with strong randomness that reflects the preferences of its designer. Since the specification of a private protocol is unknown, message segmentation point information should not be used as known information to train a model but should instead be obtained by inference. Message segmentation is nevertheless very important: only accurate segmentation can ensure that the downstream tasks of protocol reverse engineering (extraction of protocol keywords, inference of protocol format, etc.) yield an ideal protocol specification, whereas a segmentation result with larger errors produces an error propagation effect, i.e. the error can be accumulated and amplified graduall