CN-122027497-A - Method and device for deducing data transmission quantity based on encrypted flow


Abstract

The embodiments of the application provide a method and a device for inferring data transmission volume based on encrypted traffic. The method comprises: obtaining encrypted traffic; extracting data-volume-related features from the encrypted traffic; screening the data-volume-related features and retaining key features; constructing a training sample set based on the key features; and training a neural network model with the training sample set to obtain a data volume inference model for inferring the data volume from encrypted traffic to be detected. Using the data volume inference model, the application can accurately infer the actually transmitted data volume from the encrypted traffic to be detected without decryption, and provides a basis for monitoring data transmission volume.

Inventors

SHI JINQIAO
WANG MINGYU
ZHANG KAI
DIAO YIGANG
WEN XIN

Assignees

Beijing University of Posts and Telecommunications (北京邮电大学)

Dates

Publication Date: 2026-05-12
Application Date: 2026-01-04

Claims (10)

1. A method for inferring data transmission volume based on encrypted traffic, comprising: obtaining encrypted traffic; extracting data-volume-related features from the encrypted traffic; screening the data-volume-related features and retaining key features; and constructing a training sample set based on the key features, and training a neural network model with the training sample set to obtain a data volume inference model for inferring the data volume from encrypted traffic to be detected.
2. The method of claim 1, wherein the data-volume-related features comprise: basic features that characterize the basic patterns of data transmission in terms of traffic statistics and transmission control; encryption-protocol features specific to different encryption protocols; and general statistical features that capture the statistical patterns of packet payload content and byte-level distribution.
3. The method of claim 2, wherein screening the data-volume-related features and retaining key features comprises: calculating the correlation coefficient between each of the data-volume-related features and the data volume, comparing the correlation coefficient of each feature with a preset correlation threshold, and retaining the features whose correlation coefficient is greater than the correlation threshold; evaluating the distribution difference of each feature under different data volumes by the Kolmogorov-Smirnov test, comparing the distribution difference of each feature with a preset difference threshold, and retaining the features whose distribution difference is greater than the difference threshold; and calculating, based on the information entropy principle, the contribution degree of each feature to determining the data volume, and retaining the features whose contribution degree is greater than a preset influence threshold.
4. The method of claim 3, wherein evaluating the distribution difference of each feature under different data volumes by the Kolmogorov-Smirnov test, comparing the distribution difference of each feature with a preset difference threshold, and retaining the features whose distribution difference is greater than the difference threshold comprises: constructing a feature data set comprising the data-volume-related features extracted from all data packets and the actually transmitted data volume; sorting the feature data set in ascending order of data volume, taking the portion from the minimum to the median as one sample and the portion from the median to the maximum as the other sample; computing the KS statistic of the two samples; and determining the distribution difference of each feature under different data volumes according to the KS statistic, and retaining the features whose distribution difference is greater than the difference threshold.
5. The method of claim 4, wherein calculating the contribution degree of each feature to determining the data volume based on the information entropy principle, and retaining the features whose contribution degree is greater than a preset influence threshold, comprises: calculating, based on the information entropy principle, the information gain of each feature with respect to determining the data volume; and calculating the contribution degree of each feature to determining the data volume according to the correlation coefficient between the data volume and each feature, the KS statistic of each feature, and the information gain of each feature, and retaining the features whose contribution degree is greater than the preset influence threshold.
6. The method of claim 5, wherein the contribution degree is calculated by formula (3), where W is a preset weight, C is the correlation coefficient, K is the KS statistic, and G is the information gain.
7. The method of claim 1, wherein the training sample set includes the size of an original data file, the data amount of the data file, and the key features extracted from the encrypted traffic corresponding to the data file.
8. The method of claim 7, wherein the data amount of the data file is a number of rows and the output of the data amount inference model is an inferred number of rows.
9. The method of claim 1, wherein the neural network model is implemented based on a Mamba model.
10. An apparatus for inferring data transmission volume based on encrypted traffic, comprising: an acquisition module for obtaining encrypted traffic; an extraction module for extracting data-volume-related features from the encrypted traffic; a screening module for screening the data-volume-related features and retaining key features; and a training module for constructing a training sample set based on the key features, and training a neural network model with the training sample set to obtain a data volume inference model for inferring the data transmission volume from encrypted traffic to be detected.
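
The median-split Kolmogorov-Smirnov screening recited in claim 4 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the function names, the feature names `total_bytes` and `noise`, the threshold value 0.5, and the choice to place the median record in the upper half are all assumptions made for illustration.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def ks_by_median_split(feature_values, data_volumes):
    """Sort records by data volume, split at the median, compare the two halves."""
    pairs = sorted(zip(data_volumes, feature_values), key=lambda p: p[0])
    ordered = [feature for _, feature in pairs]
    mid = len(ordered) // 2  # assumption: the median record goes to the upper half
    return ks_statistic(ordered[:mid], ordered[mid:])


def screen_by_ks(features, data_volumes, threshold=0.5):
    """Keep feature names whose KS statistic exceeds the (hypothetical) threshold."""
    return [
        name
        for name, values in features.items()
        if ks_by_median_split(values, data_volumes) > threshold
    ]
```

For example, calling `screen_by_ks({"total_bytes": [1, 2, 3, 4, 5, 6], "noise": [1, 2, 1, 2, 1, 2]}, [10, 20, 30, 40, 50, 60])` retains only `total_bytes`, because its values differ completely between the low-volume and high-volume halves while those of `noise` largely overlap.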

Description

Method and device for deducing data transmission quantity based on encrypted flow

Technical Field

The embodiments of the application relate to the technical field of data security, and in particular to a method and a device for inferring data transmission volume based on encrypted traffic.

Background

Accurate monitoring of the volume of data transmitted over a network is of great importance. Relevant regulations strictly limit the quantity of data in large-scale transmissions; supervisory authorities monitor large-scale data transmission behavior to judge whether data leakage or other security threats exist; and enterprises and service providers, by knowing the scale of data transmission, can optimize data processing flows and ensure the integrity and service quality of data transmission.
In the current internet environment, encrypted traffic has become dominant, especially in cross-border transmission scenarios. Although encryption effectively protects the confidentiality and integrity of data, it poses great challenges to monitoring and managing data flows. How to accurately infer the data transmission volume by analyzing encrypted traffic, without decryption, is a key problem to be solved.

Disclosure of Invention

Accordingly, an objective of the embodiments of the present application is to provide a method and a device for inferring data transmission volume based on encrypted traffic.
Based on the above objective, an embodiment of the present application provides a method for inferring data transmission volume based on encrypted traffic, including: obtaining encrypted traffic; extracting data-volume-related features from the encrypted traffic; screening the data-volume-related features and retaining key features; and constructing a training sample set based on the key features, and training a neural network model with the training sample set to obtain a data volume inference model for inferring the data volume from encrypted traffic to be detected.
Optionally, the data-volume-related features include: basic features that characterize the basic patterns of data transmission in terms of traffic statistics and transmission control; encryption-protocol features specific to different encryption protocols; and general statistical features that capture the statistical patterns of packet payload content and byte-level distribution.
Optionally, screening the data-volume-related features and retaining key features includes: calculating the correlation coefficient between each of the data-volume-related features and the data volume, comparing the correlation coefficient of each feature with a preset correlation threshold, and retaining the features whose correlation coefficient is greater than the correlation threshold; evaluating the distribution difference of each feature under different data volumes by the Kolmogorov-Smirnov test, comparing the distribution difference of each feature with a preset difference threshold, and retaining the features whose distribution difference is greater than the difference threshold; and calculating, based on the information entropy principle, the contribution degree of each feature to determining the data volume, and retaining the features whose contribution degree is greater than a preset influence threshold.
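As a hedged illustration of the correlation check and of an entropy-based quantity, the sketch below computes a Pearson correlation coefficient and an information gain for each feature. The text does not specify the type of correlation coefficient, whether its sign is considered, or how continuous values are discretized, so the use of Pearson correlation, the absolute value, the pre-binned inputs, the threshold 0.6, and all function and feature names are assumptions.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_bins, volume_bins):
    """H(volume) - H(volume | feature) for inputs that are already discretized."""
    n = len(volume_bins)
    groups = {}
    for f, v in zip(feature_bins, volume_bins):
        groups.setdefault(f, []).append(v)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(volume_bins) - conditional


def keep_by_correlation(features, volumes, threshold=0.6):
    """Keep feature names whose |correlation| with the data volume exceeds the
    threshold (absolute value and threshold are illustrative assumptions)."""
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, volumes)) > threshold
    ]
```

With `volumes = [100, 200, 300, 400]`, a feature such as `[11, 19, 32, 41]` passes the correlation check, while a constant feature such as `[64, 64, 64, 64]` is dropped. How the correlation coefficient, KS statistic, and information gain are combined into a single contribution degree is left to the weighted formula referenced in the claims and is not reproduced here.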
Optionally, evaluating the distribution difference of each feature under different data volumes by the Kolmogorov-Smirnov test, comparing the distribution difference of each feature with a preset difference threshold, and retaining the features whose distribution difference is greater than the difference threshold includes: constructing a feature data set comprising the data-volume-related features extracted from all data packets and the actually transmitted data volume; sorting the feature data set in ascending order of data volume, taking the portion from the minimum to the median as one sample and the portion from the median to the maximum as the other sample; computing the KS statistic of the two samples; and determining the distribution difference of each feature under different data volumes according to the KS statistic, and retaining the features whose distribution difference is greater than the difference threshold.
Optionally, calculating the contribution degree of each feature to determining the data volume based on the information entropy principle, and retaining the features whose contribution degree is greater than a preset influence threshold, includes: calculating, based on the information entropy principle, the information gain of each feature with respect to determining the data volume; and calculating the contribution degree of each feature to determining the data volume according to the correlation coefficient between the data volume and each feature, KS statistics of each fea