CN-121997084-A - Binary protocol message clustering method and system based on region division keyword positioning

Abstract

The invention relates to the technical field of binary protocol reverse analysis and provides a binary protocol message clustering method and system based on region division keyword positioning. The method comprises: step 1, extracting application layer data from network traffic and preprocessing it to obtain a binary protocol message set; step 2, transversely dividing the structure of each binary protocol message in the set based on semantic detection rules to determine a fixed offset region set and a non-fixed offset region set; step 3, extracting candidate keyword fields from the fixed offset region set and the non-fixed offset region set, using a different strategy for each, to obtain a keyword field candidate set; step 4, performing two-stage inference on the keyword field candidate set by combining clustering constraints and self-constraints to obtain the real keyword field; and step 5, clustering the binary protocol message set according to the value of the real keyword field.
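As an illustration of step 5 only, the following minimal sketch groups messages by the value of a keyword field once its position is known. The function name `cluster_by_keyword`, the toy message layout (2-byte magic value, 1-byte function code, 2-byte sequence number, payload) and the chosen offset are assumptions made for this example and are not taken from the invention.

```python
from collections import defaultdict
from typing import Dict, List


def cluster_by_keyword(messages: List[bytes], offset: int, length: int = 1) -> Dict[bytes, List[bytes]]:
    """Group messages by the bytes found at [offset, offset + length)."""
    clusters: Dict[bytes, List[bytes]] = defaultdict(list)
    for msg in messages:
        if len(msg) < offset + length:
            continue  # message too short to contain the keyword field
        clusters[msg[offset:offset + length]].append(msg)
    return dict(clusters)


# Hypothetical messages: magic (2 bytes), function code (1), sequence number (2), payload.
messages = [
    bytes.fromhex("cafe010001") + b"hello",
    bytes.fromhex("cafe020002") + b"hi",
    bytes.fromhex("cafe010003") + b"status",
]

# Assume steps 1 to 4 located the keyword field at offset 2 with length 1.
clusters = cluster_by_keyword(messages, offset=2, length=1)
for value, members in sorted(clusters.items()):
    print(value.hex(), len(members))
```

Each distinct keyword value becomes one cluster, so the quality of the result depends entirely on how accurately steps 2 to 4 locate the field.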

Inventors

YANG QICHAO
ZHAO FANGFANG
XIE YAOBIN
LI LUKAI
ZHU XIAOYA
SUN JIAJIA
YIN XIAOKANG
CAI RUIJIE
LIU SHENGLI

Assignees

Information Engineering University of the Chinese People's Liberation Army Cyberspace Force (中国人民解放军网络空间部队信息工程大学)

Dates

Publication Date: 20260508
Application Date: 20251210

Claims (10)

1. A binary protocol message clustering method based on region division keyword positioning, characterized by comprising the following steps: step 1, extracting application layer data from network traffic and preprocessing it to obtain a binary protocol message set; step 2, transversely dividing the structure of each binary protocol message in the binary protocol message set based on semantic detection rules, and determining a fixed offset region set and a non-fixed offset region set; step 3, extracting candidate keyword fields from the fixed offset region set and the non-fixed offset region set, using a different strategy for each, to obtain a keyword field candidate set; step 4, performing two-stage inference on the keyword field candidate set by combining clustering constraints and self-constraints to obtain a real keyword field; and step 5, clustering the binary protocol message set according to the value of the real keyword field.
2. The binary protocol message clustering method based on region division keyword positioning according to claim 1, wherein in step 2, the semantic detection rules include a constant field detection rule, a sequence number detection rule, a timestamp detection rule, a sparse value detection rule, an address detection rule, and a checksum detection rule.
3. The binary protocol message clustering method based on region division keyword positioning according to claim 2, wherein step 2 specifically comprises: calculating the minimum message length of the binary protocol message set as a scanning range; traversing the offset positions within the scanning range of each binary protocol message; collecting the message fragments at the current offset position of each binary protocol message; if any message fragment conforms to one of the semantic detection rules, adding the right boundary of the message fragment to a hit-record offset set and proceeding to detect the next offset position; if the message fragments at the current offset position of each binary protocol message conform to none of the semantic detection rules, proceeding to detect the next offset position; after the offset positions within the scanning range of each binary protocol message have been traversed, taking the maximum value in the hit-record offset set as a final boundary; and dividing each binary protocol message into a fixed offset region and a non-fixed offset region according to the final boundary to obtain the fixed offset region set and the non-fixed offset region set.
4. The binary protocol message clustering method based on region division keyword positioning according to claim 2, wherein step 3 specifically comprises: for the fixed offset region, extracting candidate keyword fields by excluding fields whose semantic features do not match those of a keyword field; and for the non-fixed offset region, searching for TLV structure patterns by a pattern matching method and generating candidate keyword fields.
5. The binary protocol message clustering method based on region division keyword positioning according to claim 4, wherein, for the fixed offset region, extracting candidate keyword fields by excluding fields whose semantic features do not match those of a keyword field specifically comprises: identifying the semantic regions in the fixed offset region that conform to the semantic detection rules; deleting the sparse value field regions from the semantic regions to obtain high-variation field regions; deleting the high-variation field regions from the fixed offset region to obtain undetected regions; combining the undetected regions with the sparse value field regions to generate a candidate scanning sequence; and traversing the offset positions in the candidate scanning sequence, screening by a modulo operation the offset positions that satisfy a preset candidate length, verifying the continuity of the screened offset positions, and, after the verification passes, taking the field corresponding to the current offset position as a candidate keyword field.
6. The binary protocol message clustering method based on region division keyword positioning according to claim 4, wherein, for the non-fixed offset region, searching for TLV structure patterns by a pattern matching method to generate candidate keyword fields specifically comprises: taking the minimum message length in the non-fixed offset region set as the maximum compatible processing length; for each message in the non-fixed offset region set, scanning step by step with a sliding window within the maximum compatible processing length of the message and extracting the type (T) field and the length (L) field of the TLV structure; and verifying, based on the T field and the L field, whether the TLV structure is complete in data and valid in coding, and marking the T field as a candidate keyword field after the verification passes.
7. The method according to claim 1, wherein in step 4, the first-stage inference includes calculating a posterior probability of each candidate keyword field based on a clustering constraint, and ordering each candidate keyword field from high to low according to the posterior probability, wherein the clustering constraint includes a message similarity constraint, a remote coupling constraint, a structural consistency constraint, and a dimension constraint.
8. The binary protocol message clustering method based on region division keyword positioning according to claim 7, wherein in step 4, the self-constraints include a bit-use constraint and a position constraint; the keyword field probability of each of the first N candidate keyword fields in the ranking produced by the first-stage inference is calculated based on the bit-use constraint and the position constraint, and the candidate keyword field with the highest keyword field probability is selected as the real keyword field; the keyword field probability is calculated by a formula [formula not reproduced in the source text] whose symbols represent, respectively: a candidate keyword field; the event that the candidate keyword field is the real keyword field; the bit-use constraint probability; the position constraint probability; and the posterior probability [the definitions of two further symbols are not recoverable from the source text].
9. The binary protocol message clustering method based on region division keyword positioning according to claim 8, wherein the bit-use constraint probability is calculated by a formula [formula not reproduced in the source text] whose symbols represent, respectively: the Euclidean distance; the maximum distance; the observed bit-use probability of the candidate keyword field; the theoretical bit-use probability of the candidate keyword field; the most significant bit of the candidate keyword field; and a significant bit with a probability of 1; and the position constraint probability is calculated by a formula [formula not reproduced in the source text] in which the symbol represents the offset of the candidate keyword field.
10. A binary protocol message clustering system based on region division keyword positioning, comprising: a preprocessing module, used for extracting application layer data from network traffic and preprocessing it to obtain a binary protocol message set; a region division module, used for transversely dividing the structure of each binary protocol message in the binary protocol message set based on semantic detection rules and determining a fixed offset region set and a non-fixed offset region set; a candidate keyword field generation module, used for extracting candidate keyword fields from the fixed offset region set and the non-fixed offset region set, using a different strategy for each, to obtain a keyword field candidate set; a real keyword field generation module, used for performing two-stage inference on the keyword field candidate set by combining clustering constraints and self-constraints to obtain a real keyword field; and a binary protocol message clustering module, used for clustering the binary protocol message set according to the value of the real keyword field.
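The boundary search recited in claim 3 can be sketched as follows. This is a simplified illustration under stated assumptions, not the claimed implementation: only two stand-in rules are used (a constant field rule and a strictly incrementing sequence number rule), each rule is applied to the column of fragments taken at the same offset across all messages, the fragment width is fixed at 2 bytes, and the names `find_boundary`, `split_regions`, `is_constant` and `is_sequence_number` as well as the toy message layout are hypothetical.

```python
from typing import Callable, List, Tuple

Rule = Callable[[List[bytes]], bool]


def is_constant(fragments: List[bytes]) -> bool:
    """Stand-in constant field rule: every message has the same bytes here."""
    return len(set(fragments)) == 1


def is_sequence_number(fragments: List[bytes]) -> bool:
    """Stand-in sequence number rule: values increase by exactly 1 per message."""
    values = [int.from_bytes(f, "big") for f in fragments]
    return len(values) > 1 and all(b - a == 1 for a, b in zip(values, values[1:]))


def find_boundary(messages: List[bytes], rules: List[Rule], width: int = 2) -> int:
    """Return the largest right boundary of any fragment column that hits a rule."""
    scan_range = min(len(m) for m in messages)  # minimum message length
    hits = set()                                # hit-record offset set
    for offset in range(scan_range - width + 1):
        fragments = [m[offset:offset + width] for m in messages]
        if any(rule(fragments) for rule in rules):
            hits.add(offset + width)            # right boundary of the fragment
    return max(hits, default=0)


def split_regions(messages: List[bytes], boundary: int) -> Tuple[List[bytes], List[bytes]]:
    """Split every message into its fixed offset part and non-fixed offset part."""
    return [m[:boundary] for m in messages], [m[boundary:] for m in messages]


# Hypothetical messages: magic (2 bytes), function code (1), sequence number (2), payload.
messages = [
    bytes.fromhex("cafe010001") + b"hello",
    bytes.fromhex("cafe020002") + b"hi",
    bytes.fromhex("cafe010003") + b"status",
]

boundary = find_boundary(messages, [is_constant, is_sequence_number])
fixed_regions, non_fixed_regions = split_regions(messages, boundary)
print(boundary, non_fixed_regions)
```

In this toy data the magic value hits the constant rule and the sequence number hits the incrementing rule, so the final boundary falls after the sequence number and the variable-length payloads form the non-fixed offset regions. The function code lies inside the fixed offset region without matching either rule, which is the kind of undetected area that claim 5 then scans for candidate keyword fields.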

Description

Binary protocol message clustering method and system based on region division keyword positioning

Technical Field

The invention relates to the technical field of binary protocol reverse analysis, in particular to a binary protocol message clustering method and system based on region division keyword positioning.

Background

In existing protocol reverse-engineering methods based on network traffic, message type clustering is a core foundational task. Correct clustering provides the precondition for subsequent message structure analysis and semantic inference. Currently, mainstream clustering methods fall into the following three categories:

(1) Alignment-based clustering methods. These methods use the multiple sequence alignment (MSA) technique from bioinformatics to calculate similarity scores between messages through byte-level or character-level comparison, then construct a similarity matrix and complete the clustering. Typical representatives include PI (Beddoe, 2004), PEXT (Shevertalov et al., 2007), Netzob (Bossert et al., 2014), and ScriptGen (Leita et al., 2005). The basic assumption of these approaches is that messages of the same type should have highly similar field values. However, this assumption does not always hold in binary protocols, especially when field offsets are not fixed or variable-length fields are present. In addition, MSA algorithms introduce redundant gaps, which causes loss of critical field position information in fixed offset regions (e.g., message headers) and thereby affects clustering accuracy. Furthermore, MSA has a high time complexity (e.g., the optimized progressive alignment algorithm has a complexity of O(K^2 * n^2)), so the computational overhead is significant when processing large-scale datasets.

(2) Token-based clustering methods. Such methods rely on predefined separators or n-grams to segment messages into tokens and cluster based on frequently occurring token values.
Typical representatives include Discoverer (Cui et al., 2007), ProDecoder (Wang et al., 2012), and Veritas (Wang et al., 2011). While such methods perform well on text protocols, binary protocols lack explicit syntactic structure and have a highly flexible value space, which poses greater challenges to the accuracy and time complexity of these methods. In particular, the ambiguity of token partitioning further degrades clustering when processing binary data without delimiters.

(3) Keyword-based clustering methods. These methods generate keyword field candidates by aligning messages, infer the keyword using probability constraints to realize clustering, and reveal the root cause of message format differences. Representative work includes NetPlier (Ye et al., 2021) and MDIplier (Liang et al., 2024). Although such an approach is theoretically advantageous, it still relies heavily on global or local MSA alignment and does not take into account the structural differences between the fixed offset region (FOR) and the non-fixed offset region (NFOR) present in binary messages. For example, NetPlier and MDIplier may destroy the fixed-offset property of the keyword field by inserting redundant gaps during alignment, resulting in incorrect clustering. As shown in Fig. 1, MSA destroys the fixed-offset property of the "function code" field during alignment, so the field is not correctly identified, which causes incorrect clustering (as shown in Fig. 2(b)). In addition, existing methods such as ProInfer (Guo et al., 2025) attempt to infer keywords by statistical probability, but their field partitioning strategy is uncertain, and they assume that all keywords have a fixed offset, so they cannot effectively handle scenarios in which the keyword is located in the NFOR (such as the DHCP protocol).
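The DHCP case can be made concrete with a short sketch. In DHCP, the message type is carried as option 53 inside a type-length-value option list, so its byte offset depends on which options precede it. The code below walks two hand-built option lists and reports where option 53 is found. The helper name `find_tlv` and the sample option bytes are assumptions for illustration; the walk is a plain TLV parse with a known layout, not the sliding-window candidate search of the invention.

```python
from typing import Optional, Tuple

PAD, END = 0, 255  # DHCP pad and end options consist of a single byte


def find_tlv(options: bytes, wanted_type: int) -> Optional[Tuple[int, bytes]]:
    """Walk a TLV region and return (offset, value) of the first matching type."""
    pos = 0
    while pos < len(options):
        t = options[pos]
        if t == END:
            break
        if t == PAD:
            pos += 1
            continue
        if pos + 1 >= len(options):
            return None                # truncated: length byte missing
        length = options[pos + 1]
        end = pos + 2 + length
        if end > len(options):
            return None                # truncated: value runs past the data
        if t == wanted_type:
            return pos, options[pos + 2:end]
        pos = end                      # jump to the next TLV element
    return None


# Option 53 (message type = 1) first, then option 55, then end.
options_a = bytes([53, 1, 1, 55, 3, 1, 3, 6, 255])
# Option 61 (7-byte client identifier) first, then option 53 (message type = 3).
options_b = bytes([61, 7, 1, 0xAA, 0xBB, 0xCC, 0xDD, 0xEE, 0xFF,
                   53, 1, 3, 50, 4, 192, 168, 1, 10, 255])

print(find_tlv(options_a, 53))  # option 53 at offset 0
print(find_tlv(options_b, 53))  # option 53 at offset 9
```

The same field sits at offset 0 in one message and offset 9 in the other, so a method that assumes a fixed keyword offset would compare unrelated bytes across the two messages.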
On the other hand, although deep learning methods (such as CNNPre and LSTM-FCN) perform well in some scenarios, their applicability to the reverse engineering of unknown protocols is limited by problems such as the high flexibility of binary protocol design and large computational overhead.

Disclosure of Invention

Existing protocol reverse-engineering clustering methods depend on specific structural assumptions, so they lack generality and lose accuracy when handling the variable-length fields, non-fixed offsets, and flexible value spaces that are common in binary protocols. To address these problems, the invention provides a binary protocol message clustering method and system based on region division keyword positioning. The method generates candidate keyword fields by identifying the fixed offset regions (FOR) and non-fixed offset regions (NFOR) in messages, and combines dual probability constraint inference to achieve high-precision binary protocol message clustering.

In a first aspect, the present invention provides a binary protocol message clustering method based on region division keyword positioning, including: step 1, extracting application layer data from network traffic and preprocessing it to obtain a binary protocol messa