CN-122027323-A - Diversity-oriented traffic generation method for industrial protocol format inference

Abstract

The invention belongs to the technical field of traffic generation and discloses a diversity-oriented traffic generation method for industrial protocol format inference. A format-inference-oriented adversarial traffic generation model is designed with a decoupled generation architecture: learning of the flexible payload distribution is forcibly isolated from the rigid syntax, and the generation process is separated from deterministic rules. The traffic generation model can therefore synthesize high-entropy payload patterns that exceed the limited diversity of the original sparse dataset. By forcibly isolating flexible distribution learning from rigid syntax, FIGAN resolves the inherent conflict between syntactic validity and semantic diversity, providing a new paradigm for industrial traffic augmentation.

Inventors

YAO YU
HE PING
LI XU
YANG WEI

Assignees

Northeastern University (东北大学)

Dates

Publication Date: 20260512
Application Date: 20260309

Claims (7)

1. A diversity-oriented traffic generation method for industrial protocol format inference, characterized by designing a format-inference-oriented adversarial traffic generation model that adopts a decoupled generation architecture, forcibly isolating the learning of the flexible payload distribution from the rigid syntax and separating the generation process from deterministic rules, so that the traffic generation model can synthesize high-entropy payload patterns exceeding the limited diversity of the original sparse dataset, the method specifically comprising the following steps:
Step one, mining the global protocol identifier and the length field and storing them in a grammar template library, wherein the original traffic dataset is denoted $\mathcal{D} = \{p_1, p_2, \dots, p_N\}$, $p_i$ being the $i$-th data packet, and the global protocol identifier information is captured using a sliding window algorithm;
Step two, preprocessing the message data; constructing a Wasserstein generative adversarial network architecture with gradient penalty, comprising a generator $G$ and a discriminator $D$; and adversarially training this architecture with a training strategy that combines differentiable sampling and gradient constraints, so as to obtain the traffic generation model and generate traffic;
Step three, closed-loop verification and sample generation, which eliminates the gap between the probabilistic output of the traffic generation model and the strict protocol specification of the industrial control system and comprises sequence reconstruction, grammar calibration, and interaction-based physical verification; all samples that pass the physical verification are collected to construct the final high-quality, diversified industrial protocol test corpus.
2. The diversity-oriented traffic generation method for industrial protocol format inference according to claim 1, wherein the mining of the global protocol identifier specifically comprises:
Step 1.1.1, defining the protocol identifier as a byte sequence of constant, rigid value observed in the packet header, and restricting the search space to the prefix of each traffic packet; for the $i$-th data packet $p_i$, the candidate byte set consists of the byte values $b_{i,j}$ within that prefix, where $b_{i,j}$ denotes the value of the $i$-th data packet at the $j$-th byte offset;
Step 1.1.2, based on the complete dataset containing $N$ data packets, calculating the empirical probability for each byte offset position $j$; the frequency of a value $v$ at offset $j$ is calculated as $P_j(v) = \frac{\mathrm{count}_j(v)}{N}$, where $\mathrm{count}_j(v)$ is the number of packets, among all $N$ data packets, in which the value $v$ occurs at byte offset $j$; if the frequency of the mode at a byte offset position satisfies the static threshold, that byte offset position is marked as static;
Step 1.1.3, detecting the spatial adjacency of the bytes marked as static; when consecutive byte offsets are all identified as static, they are spliced into a single rigid constraint, and the result is recorded as a tuple $T = ([s, e], V)$ containing the position range $[s, e]$ and the fixed value $V$;
Step 1.1.4, performing a global consistency check: for each data packet $p_i$, the actual byte values in the index interval $[s, e]$ are extracted and compared with the fixed value $V$ in $T$; when the proportion of successfully matched messages reaches 95% or more, the tuple $T$ is judged to be the global protocol identifier of the current protocol and the method proceeds directly to the length field mining; otherwise, the destination port number in the metadata of each data packet is extracted and used as a filtering operator to divide the original dataset into several port-based homogeneous subsets, and Steps 1.1.1 to 1.1.3 are re-executed on the divided subsets until a global protocol identifier meeting the 95% consistency requirement is mined within a single-port context.
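As an illustration only (not part of the claims), the static-byte mining of Steps 1.1.1 to 1.1.4 can be sketched in a few lines of Python. The function names, the prefix length `prefix_len`, and the per-offset `static_threshold` are assumptions made for this sketch; the claim text available here does not preserve the threshold value used to mark an offset as static.

```python
from collections import Counter

def mine_static_runs(packets, prefix_len=16, static_threshold=0.95):
    """Return [(start, end, fixed_bytes)] for runs of adjacent static offsets.

    prefix_len and static_threshold are illustrative assumptions.
    """
    # Only offsets present in every packet can be compared across the dataset.
    limit = min(prefix_len, min(len(p) for p in packets))
    static = {}
    for j in range(limit):
        # Empirical frequency of the most common value (the mode) at offset j.
        value, count = Counter(p[j] for p in packets).most_common(1)[0]
        if count / len(packets) >= static_threshold:
            static[j] = value
    # Splice adjacent static offsets into single rigid constraints.
    runs, j = [], 0
    while j < limit:
        if j in static:
            start = j
            while j < limit and j in static:
                j += 1
            runs.append((start, j - 1, bytes(static[k] for k in range(start, j))))
        else:
            j += 1
    return runs

def consistency(packets, run):
    """Fraction of packets whose bytes in [start, end] equal the fixed value."""
    start, end, fixed = run
    return sum(1 for p in packets if p[start:end + 1] == fixed) / len(packets)

# Toy packets: offset 1 varies (a counter), the other prefix bytes are constant.
packets = [bytes([0, i, 0, 0, 0, 6, 1, 3, 0, i, 0, 2]) for i in range(1, 21)]
runs = mine_static_runs(packets, prefix_len=8)
```

A run whose `consistency` value reaches 0.95 would be accepted as a global protocol identifier; otherwise the claim falls back to splitting the dataset by destination port, which this sketch does not cover.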
3. The diversity-oriented traffic generation method for industrial protocol format inference according to claim 1, wherein the mining of the length field specifically comprises:
Step 1.2.1, dividing the traffic dataset according to the physical length attribute of the data packets; a cluster $C_l$ is defined as the subset of data packets whose physical length equals a particular value $l$, so that the traffic dataset is reorganized into homogeneous subsets $\{C_l\}$ based on physical length;
Step 1.2.2, defining a scanning range bounded by $L_{\min}$, the minimum physical packet length observed in the dataset; within each cluster $C_l$, the byte positions are traversed, the bytes satisfying intra-group stability and inter-group variability are extracted as candidates, and adjacent candidates are merged to construct a set of candidate fields of 1-byte or 2-byte width; intra-group stability means that, within any single cluster $C_l$, the value at the byte position remains constant; inter-group variability means that the value at the byte position differs between clusters of different lengths; each candidate field $f$ consists of $w \in \{1, 2\}$ consecutive bytes;
Step 1.2.3, based on the message length, extracting pairs of clusters with different lengths and constructing a set of difference samples $(\Delta L_k, \Delta V_k)$, where $\Delta L_k$ denotes the physical length difference between the two clusters and $\Delta V_k$ denotes the difference between the values taken by the candidate field $f$ in the corresponding clusters; for each candidate field, the Pearson correlation coefficient between its value change and the physical length change is calculated as $\rho = \frac{\sum_k (\Delta L_k - \overline{\Delta L})(\Delta V_k - \overline{\Delta V})}{\sqrt{\sum_k (\Delta L_k - \overline{\Delta L})^2}\,\sqrt{\sum_k (\Delta V_k - \overline{\Delta V})^2}}$, where $\overline{\Delta L}$ and $\overline{\Delta V}$ are the respective mean values; if $\rho$ is greater than 0.95, the candidate field $f$ is considered to be a length field;
Step 1.2.4, calculating the offset between the physical packet length and the length field value, where $v_i$ denotes the actual value of the length field of the $i$-th data packet;
Step 1.2.5, applying the mined linear rule to the traffic dataset and calculating the proportion of data packets that satisfy the linear rule; if this proportion exceeds the consistency threshold of 95%, the candidate field is confirmed as the protocol length field.
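The length-field mining of Steps 1.2.1 to 1.2.5 can be sketched as follows, again as an illustration rather than the claimed implementation. Several points are assumptions of this sketch: the field is read big-endian with a fixed `width` (the claim merges 1-byte candidates into 1- or 2-byte fields instead), the linear rule is taken to be "physical length = field value + offset", and the helper `make_packet` builds toy data only.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0

def mine_length_field(packets, width=2, threshold=0.95):
    """Return (position, width, offset) of the first field that tracks packet length."""
    clusters = {}
    for p in packets:                      # Step 1.2.1: cluster by physical length
        clusters.setdefault(len(p), []).append(p)
    lengths = sorted(clusters)
    for pos in range(lengths[0] - width + 1):
        values = {}
        for length, group in clusters.items():
            seen = {int.from_bytes(p[pos:pos + width], "big") for p in group}
            if len(seen) != 1:
                break                      # violates intra-group stability
            values[length] = seen.pop()
        else:
            if len(set(values.values())) < 2:
                continue                   # no inter-group variability
            pairs = list(combinations(lengths, 2))
            d_len = [b - a for a, b in pairs]
            d_val = [values[b] - values[a] for a, b in pairs]
            if pearson(d_len, d_val) > threshold:
                # Assumed linear rule: physical length = field value + offset.
                offset = lengths[0] - values[lengths[0]]
                ok = sum(len(p) == int.from_bytes(p[pos:pos + width], "big") + offset
                         for p in packets)
                if ok / len(packets) > threshold:
                    return pos, width, offset
    return None

def make_packet(tid, n):
    """Toy packet: 2-byte id, 2 constant bytes, 2-byte payload length, payload."""
    return bytes([0, tid, 0, 0]) + n.to_bytes(2, "big") + b"\xaa" * n

packets = [make_packet(t, n) for n in (2, 4, 7) for t in (1, 2, 3)]
result = mine_length_field(packets)
```

At least three distinct packet lengths are needed here, because a single cluster pair gives the difference samples zero variance and the correlation is then undefined.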
4. The diversity-oriented traffic generation method for industrial protocol format inference according to claim 1, wherein the data preprocessing in step two specifically comprises:
Step 2.1.1, for each data packet $p_i$ in the traffic dataset, performing nibble decomposition: each 8-bit byte is divided into two 4-bit nibbles, corresponding to the upper 4 bits and the lower 4 bits of the byte respectively, so that the original 256-dimensional byte space is mapped onto a compact vocabulary $\mathcal{V}$ containing 17 discrete symbols; the nibble sequence produced by this division is denoted $s_i$;
Step 2.1.2, setting the unified sequence length of the traffic generation model input to $L$, and performing the following alignment on the decomposed nibble sequence $s_i$: if the length of $s_i$ is less than $L$, padding symbols are appended at the end of the sequence until the total length reaches $L$; if the length of $s_i$ is greater than $L$, the first $L$ nibbles are retained and the remainder is discarded; in this way all samples in the dataset are unified into standard fixed-length sequences of length $L$;
Step 2.1.3, performing one-hot encoding on each nibble symbol in the standard fixed-length sequence, mapping each index value of the compact vocabulary $\mathcal{V}$ to a 17-dimensional sparse vector; finally, each original data packet $p_i$ is converted into a numerical matrix of dimension $L \times 17$, yielding an orthogonal feature input space that serves as the input to the traffic generation model.
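A minimal sketch of the preprocessing in Steps 2.1.1 to 2.1.3 follows. It assumes that vocabulary indices 0 to 15 are the nibble values and that index 16 is the padding symbol; the claim states only that the vocabulary has 17 symbols, so this index assignment is an assumption.

```python
PAD = 16          # assumed index of the padding symbol; 0..15 are nibble values
VOCAB_SIZE = 17

def to_nibbles(packet):
    """Split every byte into its upper and lower 4-bit nibble."""
    out = []
    for b in packet:
        out.extend((b >> 4, b & 0x0F))
    return out

def align(seq, length):
    """Pad with PAD at the tail or truncate so the sequence has exactly `length` symbols."""
    return (seq + [PAD] * (length - len(seq)))[:length]

def one_hot(seq):
    """Map each symbol index to a 17-dimensional one-hot row."""
    return [[1 if k == s else 0 for k in range(VOCAB_SIZE)] for s in seq]

nibbles = to_nibbles(b"\x01\xaf")   # two bytes -> four nibbles
aligned = align(nibbles, 6)         # padded to a fixed length of 6
matrix = one_hot(aligned)           # 6 x 17 input matrix
```

Working on nibbles shrinks the per-step output space from 256 classes to 17, at the cost of doubling the sequence length.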
5. The diversity-oriented traffic generation method for industrial protocol format inference according to claim 1, wherein:
Step 2.2.1, the generator $G$ is configured to map low-dimensional latent noise into a feature tensor with the semantic distribution characteristics of the target protocol, the feature tensor representing the probability distribution of the original data packet sequence; in the global expansion stage, a random noise vector $z$ of dimension 100 is received, and fully connected layers project the low-dimensional noise into a high-dimensional intermediate feature space, establishing the global context association of the data packet; in the local refinement stage, the high-dimensional intermediate features are reshaped into a time-series format, and one-dimensional convolution layers process the sequence to refine local details and capture the dependencies between adjacent nibbles; in the operation of the generator, the output is the unnormalized logits over the vocabulary $\mathcal{V}$, which serve as the input to the subsequent differentiable sampling process; $W_1$ denotes the weight matrix of the first fully connected layer, $z$ denotes the latent feature vector, i.e. the randomly sampled initial noise vector, $b_1$ denotes the bias vector of the first fully connected layer, $W_2$ denotes the weight matrix of the second fully connected layer, and $b_2$ denotes the bias vector of the second fully connected layer;
Step 2.2.2, the discriminator $D$ is a regression network used to evaluate the authenticity of an input data packet sequence; mirroring the generator, the discriminator adopts a "local extraction, global scoring" strategy; in the local extraction stage, stacked one-dimensional convolution layers extract local features from the input sequence layer by layer and capture the spatial correlation between bytes; in the global scoring stage, the extracted local feature map is flattened and projected by a linear layer into a single scalar validity score; the discriminator removes the batch normalization layers and uses the LeakyReLU activation function throughout; in the operation of the discriminator, the output represents an unconstrained Wasserstein distance estimate, $W_{out}$ denotes the weight matrix of the output layer, $\mathrm{LeakyReLU}$ denotes the leaky rectified linear function, and $b_{out}$ is the bias term of the output layer.
6. The diversity-oriented traffic generation method for industrial protocol format inference according to claim 5, wherein the training strategy combining differentiable sampling and gradient constraints specifically comprises:
Step 2.3.1, Gumbel-Softmax continuous relaxation: let $l_{t,k}$ be the unnormalized log-probability that the generator outputs for the $k$-th vocabulary token at time step $t$; during training, Gumbel noise is introduced to construct a differentiable soft token sample $y_{t,k} = \frac{\exp\left((l_{t,k} + g_k)/\tau\right)}{\sum_{j} \exp\left((l_{t,j} + g_j)/\tau\right)}$, where $g_k$ is an independent noise term sampled from the Gumbel(0, 1) distribution, used to enhance the random exploration ability of the generation process, and $\tau$ is a temperature parameter controlling the smoothness of the sampling distribution;
Step 2.3.2, the loss function of the discriminator is $L_D$; the optimization objective of the discriminator is to maximize the distribution distance between real samples and generated samples while satisfying the 1-Lipschitz continuity constraint, and the discriminator loss function is defined as $L_D = \mathbb{E}[D(\tilde{x})] - \mathbb{E}[D(x)] + \lambda\, \mathbb{E}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right]$, where $\tilde{x}$ denotes the feature tensor output by the generator $G$ after receiving the latent noise $z$ and applying the Gumbel-Softmax continuous relaxation, $\mathbb{E}[D(\tilde{x})]$ is the expected score of generated samples, $\mathbb{E}[D(x)]$ is the expected score of real samples, and $\lambda$ is the penalty coefficient; $\hat{x}$ is a randomly interpolated sample, defined as $\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}$ with $\epsilon \sim U[0, 1]$;
Step 2.3.3, the loss function of the generator is $L_G$; the optimization objective of the generator is to maximize the discriminator's score for generated samples, and the generator loss function is defined as $L_G = -\mathbb{E}[D(\tilde{x})]$; through the adversarial training of the generator $G$ and the discriminator $D$, the parameters of the traffic generation model are continuously updated so that the generator implicitly learns and fits the high-dimensional probability constraints of the protocol fields; the trained generator $G$ is then used to generate traffic data.
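To make the Gumbel-Softmax relaxation of Step 2.3.1 concrete, the sketch below computes one soft token sample for a single time step in plain Python. It illustrates the standard Gumbel-Softmax formula only; an actual training loop would use an automatic-differentiation framework so that gradients can flow through the sample, and the function name and the `rng` parameter are assumptions of this sketch.

```python
import math
import random

def gumbel_softmax(logits, tau=1.0, rng=random):
    """Soft sample over the vocabulary for one time step.

    logits: unnormalized log-probabilities, one per vocabulary token.
    tau:    temperature; smaller values give samples closer to one-hot.
    """
    noisy = []
    for l in logits:
        u = rng.random()
        while u == 0.0:                    # avoid log(0)
            u = rng.random()
        g = -math.log(-math.log(u))        # g ~ Gumbel(0, 1)
        noisy.append((l + g) / tau)
    m = max(noisy)                         # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
sample = gumbel_softmax([0.0] * 17, tau=0.5, rng=rng)
```

The result is a probability vector rather than a hard index, which is what allows the discriminator's gradient to reach the generator despite the discrete vocabulary.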
7. The diversity-oriented traffic generation method for industrial protocol format inference according to claim 1, wherein step three specifically comprises:
Step 3.1, denoting the continuous logit sequence output by the generator as $\{o_1, o_2, \dots, o_L\}$, where $o_t$ is the vector at time step $t$, and extracting the discrete nibble index with a deterministic operator: $n_t = \arg\max_k o_{t,k}$; every two consecutive nibble indices are arithmetically combined into one byte, so that in the reconstructed byte sequence $B$ the $m$-th byte is calculated as $B_m = 16 \cdot n_{2m-1} + n_{2m}$, where $n_{2m-1}$ is the high-order nibble of the byte and $n_{2m}$ is the low-order nibble; the payload boundary of the data packet is determined by detecting the padding symbol index in the reconstructed sequence, and the reconstructed message is truncated at the position where the padding symbol first occurs;
Step 3.2, forcibly calibrating the reconstructed message according to the grammar template library: according to the rigid constraint set of the template in the grammar template library, the byte values at the corresponding index positions of the reconstructed sequence are forcibly replaced with the fixed values of the protocol identifier; the length field value is read from the generated sequence and the expected physical message length $L_{exp}$ is calculated; if the physical length of the current reconstructed sequence is less than $L_{exp}$, zero bytes are appended at the tail of the reconstructed sequence until its physical length equals $L_{exp}$, correcting the discrepancy between the header definition and the payload length; if the calculated expected length $L_{exp}$ is less than the minimum legal packet length $L_{\min}$ observed during the preprocessing stage, the sample is judged semantically illegal and discarded directly;
Step 3.3, verifying the semantic validity of the data packet: an automated traffic replay script packages the calibrated byte sequence into a network data packet, sends it to the target device, and monitors the network feedback of the target device in real time; if the target device returns a valid application-layer response, the generated request packet is judged to have passed the parsing and verification of the device's internal protocol stack and is a valid sample; if the target device does not respond or returns an error code, the generated sample is marked as invalid and filtered out.
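The sequence reconstruction and grammar calibration of Steps 3.1 and 3.2 can be sketched as below. This is an illustration under stated assumptions: padding is vocabulary index 16, the length field is big-endian, the expected length is "field value + offset", and the `template` dictionary layout is hypothetical. The physical verification of Step 3.3 requires a live target device and is not shown.

```python
PAD = 16  # assumed index of the padding symbol

def reconstruct(logits_seq):
    """Argmax each time step, pair nibbles into bytes, stop at the first padding symbol."""
    idx = [max(range(len(v)), key=v.__getitem__) for v in logits_seq]
    out = bytearray()
    for hi, lo in zip(idx[0::2], idx[1::2]):
        if hi == PAD or lo == PAD:
            break
        out.append(16 * hi + lo)           # byte = 16 * high nibble + low nibble
    return bytes(out)

def calibrate(msg, template, l_min):
    """Apply rigid constraints and the length rule; return None for discarded samples.

    template (hypothetical layout): {"fixed": [(start, bytes)], "len_pos": int,
                                     "len_width": int, "offset": int}
    """
    buf = bytearray(msg)
    for start, fixed in template["fixed"]:
        buf[start:start + len(fixed)] = fixed          # force the protocol identifier
    pos, width = template["len_pos"], template["len_width"]
    expected = int.from_bytes(buf[pos:pos + width], "big") + template["offset"]
    if expected < l_min:
        return None                                    # shorter than any legal packet
    if len(buf) < expected:
        buf.extend(b"\x00" * (expected - len(buf)))    # zero-pad to the declared length
    return bytes(buf)                                  # longer messages are left unchanged

def as_logits(indices):
    """Helper for the example: turn nibble indices into one-hot 'logits'."""
    return [[1.0 if k == i else 0.0 for k in range(17)] for i in indices]

template = {"fixed": [(2, b"\x00\x00")], "len_pos": 4, "len_width": 2, "offset": 6}
raw = reconstruct(as_logits([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 16, 16]))
packet = calibrate(raw, template, l_min=8)
```

In this toy run the header declares a 4-byte payload that the generator did not emit, so calibration appends four zero bytes; a declared length below `l_min` causes the sample to be dropped instead.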

Description

Diversity-oriented traffic generation method for industrial protocol format inference

Technical Field

The invention relates to the technical field of traffic generation, and in particular to a diversity-oriented traffic generation method for industrial protocol format inference.

Background

Industrial Control Systems (ICS) form the backbone of critical infrastructure and use the Industrial Internet of Things (IIoT) to improve operating efficiency. However, the convergence of the physical and digital domains inevitably expands the attack surface, exposing proprietary control environments to sophisticated network threats. To cope with these risks, security mechanisms such as intrusion detection and fuzz testing are critical, but their effectiveness depends fundamentally on a comprehensive understanding of the protocol specifications. Unfortunately, many manufacturers keep these specifications unpublished for reasons of commercial confidentiality. This opacity necessitates protocol reverse engineering, in which Protocol Format Inference (PFI) is the key first step and determines the accuracy and depth of subsequent security analysis.

However, the inherent operating characteristics of ICS pose a fundamental challenge to PFI: data scarcity. Unlike a general IT network, an ICS environment operates on a strict schedule to ensure physical stability. The traffic is therefore highly deterministic and periodic, which typically manifests as insufficient function code coverage and sparse dynamic field distributions. This lack of semantic variation is catastrophic for PFI algorithms, because they rely on statistical variability to identify field boundaries. Without a diverse data basis, standard inference methods are prone to misinterpreting the structure.
To address this diversity bottleneck, methods such as static analysis, binary execution tracing, and diversified seed corpus generation have been proposed in an effort to extract protocol logic without relying on large amounts of data. These approaches rely mainly on "white-box" preconditions (firmware images, binary executables, or detailed documentation for static analysis), which are rarely available for proprietary ICS protocols because of vendor lock-in. Existing black-box generation models, in contrast, are aimed mainly at fuzz testing or Intrusion Detection Systems (IDS). Because these methods preferentially synthesize malformed inputs to induce crashes, or synthesize attack-specific patterns for anomaly detection, their optimization goals deviate fundamentally from the requirements of PFI. They therefore cannot guarantee the syntactic validity and functional legitimacy needed to reverse engineer complex protocols. A key gap remains in reconciling strict syntactic compliance with semantic diversity within a unified framework, especially for protocol inference in black-box industrial scenarios.

No existing research uses traffic generation to solve the data diversity scarcity problem in protocol format inference. Known work either addresses data diversity without using traffic generation, or generates traffic for downstream tasks other than protocol format inference.

Paper "LUO Z, et al. Enhancing protocol fuzzing via diverse seed corpus generation[J]. IEEE Transactions on Software Engineering. IEEE, 2025." parses the protocol specification using a large language model and generates high-quality test seeds by iterative optimization. The method designs a grammar-free generation mechanism based on a structured knowledge base, and aims to solve the problem of the uneven distribution of real traffic samples.
However, although the method can expand the structural diversity of a single message, it lacks collaborative modeling of complex state transitions and long-range temporal context in dynamic interaction. As a result, in complex test scenarios for protocol format inference it has difficulty precisely generating continuous traffic sequences that trigger deep logic interaction, and it cannot fundamentally solve the problem of insufficient data diversity caused by the long-tail distribution of real network traffic.

Paper "JIANG J, et al. BinPRE: Enhancing Field Inference in Binary Analysis Based Protocol Reverse Engineering[C]. Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. ACM, 2024." leverages instruction-based semantic similarity analysis to enhance field inference in binary protocol reverse engineering. The method designs a clustering and refining paradigm, and aims to overcome logic differences between protocol implementations in order to accurately extract field formats and semantics. However, although the method can accurately analyze static field structures and local semantics, it lacks collaborative modeling of a mutation generation mechanism driven by the global interaction state, so that when the method is applied to protocol format inference