CN-122027530-A - Analysis method and device for application program encryption communication protocol based on large model


Abstract

The invention relates to the field of network security and provides a method and device for analyzing an application program's encrypted communication protocol based on a large model. The method comprises: dividing a byte sequence in an original network traffic packet into a plurality of traffic segments, and performing feature extraction on each traffic segment to obtain a high-level feature matrix of each traffic segment; clustering the plurality of high-level feature matrices to obtain a plurality of target traffic clusters, wherein each target traffic cluster is a set of traffic segments with similar features; for the plurality of traffic segments corresponding to each target traffic cluster, dividing the traffic segments in the target traffic cluster into a plurality of fixed fields and a plurality of variable fields according to the probability that bytes appear at the same position in the plurality of traffic segments; and determining the field types of the plurality of fixed fields and the plurality of variable fields according to a preset classification model to obtain an analysis result characterizing the traffic segments. The technical scheme provided by the invention is intended to improve the accuracy with which an APP's encrypted communication protocol is analyzed.

Inventors

ZHOU ZIQIANG
LU SHAN
LI TAOCHEN
WANG YAO

Assignees

State Grid Shanxi Electric Power Co., Ltd. Electric Power Research Institute (国网山西省电力有限公司电力科学研究院)

Dates

Publication Date: 2026-05-12
Application Date: 2026-01-16

Claims (10)

1. A method for analyzing an encrypted communication protocol of an application program based on a large model, the method comprising: dividing a byte sequence in an original network traffic packet into a plurality of traffic segments, and performing feature extraction on each traffic segment to obtain a high-level feature matrix of each traffic segment, wherein the byte sequences of two adjacent traffic segments at least partially overlap; clustering the plurality of high-level feature matrices to obtain a plurality of target traffic clusters, wherein each target traffic cluster represents a set of traffic segments with similar features; for the plurality of traffic segments corresponding to each target traffic cluster, dividing the traffic segments in the target traffic cluster into a plurality of fixed fields and a plurality of variable fields according to the probability that bytes appear at the same position in the plurality of traffic segments, wherein each fixed field or variable field comprises one or more bytes; and determining field types of the plurality of fixed fields and the plurality of variable fields according to a preset classification model to obtain an analysis result characterizing the traffic segments.
2. The method of claim 1, wherein dividing the byte sequence in the original network traffic packet into a plurality of traffic segments and performing feature extraction on each of the traffic segments to obtain a high-level feature matrix of each of the traffic segments comprises: dividing the byte sequence in the original network traffic packet into a plurality of traffic segments according to a plurality of preset sliding windows of different sizes, wherein sliding windows of different sizes correspond to different sliding step sizes; and inputting each traffic segment into a preset encoder model to obtain the high-level feature matrix of each traffic segment, wherein the preset encoder model comprises a plurality of layers of sub-encoding blocks, and the output of the sub-encoding block of a preceding layer is used as the input of the sub-encoding block of the following layer.
3. The method of claim 2, wherein dividing the byte sequence in the original network traffic packet into a plurality of traffic segments according to a plurality of preset sliding windows of different sizes comprises: dividing the byte sequence in the original network traffic packet into a plurality of initial traffic segments according to the plurality of preset sliding windows of different sizes; normalizing the byte values in each initial traffic segment to obtain the traffic segments; and, when the length of a traffic segment is smaller than the size of the sliding window corresponding to that traffic segment, appending zero bytes to the end of the traffic segment so that its length equals the size of the sliding window.
4. The method according to claim 2 or 3, wherein inputting each of the traffic segments into the preset encoder model to obtain the high-level feature matrix of each of the traffic segments comprises: mapping each byte in the traffic segment to a high-dimensional vector to obtain an embedding matrix representing the traffic segment, wherein the size of the embedding matrix is w × d, where w is the window size and d is the dimension of the high-dimensional vector; adding a first position code and a second position code to each element in the embedding matrix to obtain a first coding matrix and a second coding matrix, wherein the first position code is expressed as […] and the second position code is expressed as […], where pos represents a position index and i represents a dimension index; adding the embedding matrix, the first coding matrix and the second coding matrix to obtain a target embedding matrix; and inputting the target embedding matrix into a coding layer comprising a plurality of layers of sub-encoding blocks for encoding, so as to obtain the high-level feature matrix corresponding to the target embedding matrix, wherein the output of the sub-encoding block of a preceding layer is used as the input of the sub-encoding block of the following layer, and each layer of sub-encoding block comprises a multi-head attention mechanism, a feedforward neural network, residual connections and layer normalization.
5. The method of claim 1, wherein clustering the plurality of high-level feature matrices to obtain a plurality of target traffic clusters comprises: performing dimension reduction on each high-level feature matrix by principal component analysis to obtain a low-dimensional feature matrix; calculating the Euclidean distance between any two low-dimensional feature matrices; grouping the low-dimensional feature matrices whose Euclidean distance is less than or equal to a preset distance into a candidate traffic cluster; and determining the candidate traffic cluster as a target traffic cluster when the number of low-dimensional feature matrices in the candidate traffic cluster is greater than or equal to a preset threshold.
6. The method of claim 1, wherein, for the plurality of traffic segments corresponding to each target traffic cluster, dividing the traffic segments in the target traffic cluster into a plurality of fixed fields and a plurality of variable fields according to the probability that bytes appear at the same position in the plurality of traffic segments comprises: calculating the Shannon entropy of each byte position across the plurality of traffic segments in the same target traffic cluster; when the Shannon entropy of one byte position or of a plurality of consecutive byte positions is less than or equal to a first preset threshold, determining the field consisting of the bytes at those byte positions as a fixed field; when the Shannon entropy of one byte position or of a plurality of consecutive byte positions is greater than the first preset threshold and less than or equal to a second preset threshold, determining the field consisting of the bytes at those byte positions as a low-entropy variable field, wherein the second preset threshold is greater than the first preset threshold; and when the Shannon entropy of one byte position or of a plurality of consecutive byte positions is greater than the second preset threshold, determining the field consisting of the bytes at those byte positions as a high-entropy variable field.
7. The method according to claim 1, wherein the method further comprises: passing each of the plurality of high-level feature matrices through a preset generative large model to generate a plurality of traffic reconstruction segments, wherein the total loss function of the preset generative large model in the training stage is obtained by adding a reconstruction loss and an auxiliary task loss according to a preset weight coefficient; correspondingly, the step of dividing, for the plurality of traffic segments corresponding to each target traffic cluster, the traffic segments in the target traffic cluster into a plurality of fixed fields and a plurality of variable fields according to the probability that bytes appear at the same position in the plurality of traffic segments comprises: for the plurality of traffic reconstruction segments corresponding to each target traffic cluster, dividing the traffic reconstruction segments in the target traffic cluster into a plurality of fixed fields and a plurality of variable fields according to the probability that bytes appear at the same position in the plurality of traffic reconstruction segments.
8. The method of claim 7, wherein, after the step of dividing the traffic reconstruction segments in the target traffic cluster into a plurality of fixed fields and a plurality of variable fields according to the probability that bytes appear at the same position in the plurality of traffic reconstruction segments, the method further comprises: performing global sequence alignment on the plurality of traffic reconstruction segments using a dynamic programming algorithm, and identifying the boundary and the position of a target field, wherein the objective of the global sequence alignment is to minimize the edit distance between any two traffic reconstruction segments […] and […], which can be expressed as: […], where […] represents the substitution cost of bytes […] and […].
9. The method according to claim 1, wherein the method further comprises: constructing a hierarchical tree structure according to the analysis result of the traffic segments, wherein the root node of the hierarchical tree structure is a protocol, the root node comprises one or more child nodes, and the child nodes comprise one or more leaf nodes; calculating the dependency relationships among fields in the plurality of traffic segments using an association rule mining algorithm based on the Apriori property, wherein the dependency relationships comprise the support and the confidence between the fields; and constructing a hidden Markov model based on the hierarchical tree structure and the dependency relationships, and iteratively training the parameters of the hidden Markov model using the Baum-Welch algorithm until the hidden Markov model converges, wherein the trained hidden Markov model is adapted to analyze, using the Viterbi algorithm, the protocol content of a traffic segment when that traffic segment is input.
10. A device for analyzing an encrypted communication protocol of an application program based on a large model, characterized in that the device comprises: a traffic dividing module, configured to divide a byte sequence in an original network traffic packet into a plurality of traffic segments and to perform feature extraction on each traffic segment to obtain a high-level feature matrix of each traffic segment, wherein the byte sequences of two adjacent traffic segments at least partially overlap; a traffic clustering module, configured to cluster the plurality of high-level feature matrices to obtain a plurality of target traffic clusters, wherein each target traffic cluster represents a set of traffic segments with similar features; a field dividing module, configured to divide, for the plurality of traffic segments corresponding to each target traffic cluster, the traffic segments in the target traffic cluster into a plurality of fixed fields and a plurality of variable fields according to the probability that bytes appear at the same position in the plurality of traffic segments, wherein each fixed field or variable field comprises one or more bytes; and a traffic analysis module, configured to determine field types of the plurality of fixed fields and the plurality of variable fields according to a preset classification model to obtain an analysis result characterizing the traffic segments.
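
The entropy-based field division recited in claim 6 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the function names `position_entropy` and `split_fields` are hypothetical, the threshold values `t1` and `t2` are placeholders (the claims do not state numeric values), and the segments of a cluster are assumed to be already aligned byte by byte.

```python
import math
from collections import Counter


def position_entropy(segments, pos):
    """Shannon entropy (in bits) of the byte values seen at one position."""
    counts = Counter(seg[pos] for seg in segments)
    total = len(segments)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def split_fields(segments, t1=0.1, t2=4.0):
    """Label each byte position and merge consecutive equal labels into fields.

    t1 and t2 are illustrative thresholds (assumptions), with t1 < t2.
    Returns a list of (label, start, end) tuples, end exclusive.
    """
    def label(entropy):
        if entropy <= t1:
            return "fixed"
        if entropy <= t2:
            return "low_entropy_variable"
        return "high_entropy_variable"

    # Only positions present in every segment can be compared.
    length = min(len(seg) for seg in segments)
    fields = []
    for pos in range(length):
        kind = label(position_entropy(segments, pos))
        if fields and fields[-1][0] == kind:
            # Extend the current field by one byte.
            fields[-1] = (kind, fields[-1][1], pos + 1)
        else:
            fields.append((kind, pos, pos + 1))
    return fields


if __name__ == "__main__":
    # Synthetic cluster: constant magic byte, a 1-bit flag, a counter byte.
    cluster = [bytes([0xAA, i % 2, i]) for i in range(32)]
    for field in split_fields(cluster):
        print(field)
```

In this synthetic cluster the first byte never changes, the second takes two values, and the third takes 32 distinct values, so the three positions fall into the fixed, low-entropy variable, and high-entropy variable categories respectively under the placeholder thresholds.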

Description

Analysis method and device for application program encryption communication protocol based on large model

Technical Field

The invention relates to the technical field of computer network security, and in particular to a method and a device for analyzing an application program's encrypted communication protocol based on a large model.

Background

With the rapid development of the mobile internet and the internet of things, many types of APP use custom encrypted communication protocols to protect user privacy and data security. However, these unknown encrypted communication protocols may be misused by malware or attackers to conceal illegal actions such as data leakage and remote control.

Traditional protocol reverse-analysis methods rely primarily on static code analysis: the encryption logic and communication pattern are extracted by disassembling and reverse engineering the binary file or source code of the target program. However, modern applications commonly employ code obfuscation techniques such as control flow flattening and virtual machine protection, which make it difficult for static analysis to accurately restore the original logic. In addition, many applications dynamically load cryptographic modules or communication logic at runtime, so static analysis fails to obtain complete protocol information.

The prior art therefore suffers from the technical problem of inaccurate analysis of APP encrypted communication protocols.

Disclosure of Invention

The analysis method, analysis device, electronic equipment, storage medium and computer program product for an application program's encrypted communication protocol based on a large model, provided by the embodiments of the invention, are intended to improve, to a certain extent, the accuracy with which APP encrypted communication protocols are analyzed.
In a first aspect, the present invention provides a method for analyzing an encrypted communication protocol of an application program based on a large model, the method comprising: dividing a byte sequence in an original network traffic packet into a plurality of traffic segments, and performing feature extraction on each traffic segment to obtain a high-level feature matrix of each traffic segment, wherein the byte sequences of two adjacent traffic segments at least partially overlap; clustering the plurality of high-level feature matrices to obtain a plurality of target traffic clusters, wherein each target traffic cluster represents a set of traffic segments with similar features; for the plurality of traffic segments corresponding to each target traffic cluster, dividing the traffic segments in the target traffic cluster into a plurality of fixed fields and a plurality of variable fields according to the probability that bytes appear at the same position in the plurality of traffic segments, wherein each fixed field or variable field comprises one or more bytes; and determining field types of the plurality of fixed fields and the plurality of variable fields according to a preset classification model to obtain an analysis result characterizing the traffic segments.
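
The overlapping segmentation step of the method can be sketched as follows. This is a minimal illustration under stated assumptions rather than the patented implementation: the function name `segment_bytes` is hypothetical, the window sizes and step sizes are arbitrary examples, normalization is assumed to mean scaling each byte to the range [0, 1], and the zero-padding of a short final segment follows claim 3.

```python
def segment_bytes(payload, windows=((8, 4), (16, 8))):
    """Cut a byte sequence into overlapping, normalized, zero-padded segments.

    windows is a sequence of (window_size, step) pairs; a step smaller than
    the window size makes adjacent segments share bytes. The values used
    here are illustrative assumptions.
    Returns a list of (window_size, start_offset, values) tuples.
    """
    segments = []
    for size, step in windows:
        for start in range(0, len(payload), step):
            chunk = payload[start:start + size]
            # Assumed normalization: scale byte values 0..255 to 0.0..1.0.
            values = [b / 255.0 for b in chunk]
            # Pad the tail with zeros up to the window size.
            values += [0.0] * (size - len(values))
            segments.append((size, start, values))
            if start + size >= len(payload):
                break  # this window already reached the end of the payload
    return segments


if __name__ == "__main__":
    packet = bytes(range(10))
    for size, start, values in segment_bytes(packet, windows=((8, 4),)):
        print(size, start, [round(v, 3) for v in values])
```

With a 10-byte payload, a window of 8 and a step of 4, the sketch yields two segments starting at offsets 0 and 4; they share bytes 4 to 7, and the second segment is padded with two zeros.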
In a second aspect, the present invention provides a device for analyzing an encrypted communication protocol of an application program based on a large model, the device comprising: a traffic dividing module, configured to divide a byte sequence in an original network traffic packet into a plurality of traffic segments and to perform feature extraction on each traffic segment to obtain a high-level feature matrix of each traffic segment, wherein the byte sequences of two adjacent traffic segments at least partially overlap; a traffic clustering module, configured to cluster the plurality of high-level feature matrices to obtain a plurality of target traffic clusters, wherein each target traffic cluster represents a set of traffic segments with similar features; a field dividing module, configured to divide, for the plurality of traffic segments corresponding to each target traffic cluster, the traffic segments in the target traffic cluster into a plurality of fixed fields and a plurality of variable fields according to the probability that bytes appear at the same position in the plurality of traffic segments, wherein each fixed field or variable field comprises one or more bytes; and a traffic analysis module, configured to determine field types of the plurality of fixed fields and the plurality of variable fields according to a preset classification model to obtain an analysis result characterizing the traffic segments.
In a third aspect, the present invention provides an electronic device, including a memory and a processor, where the memory stores a computer program, and the processor implements the method for analyzing a large model-based application encryption communication protocol according to any one of the above when executing the computer prog