CN-122001623-A - Online encrypted traffic identification fine-tuning system for multi-source traffic drift scenarios

Abstract

The invention discloses an online encrypted traffic identification fine-tuning system for multi-source traffic drift scenarios. The system comprises a drift traffic feature distribution alignment module, a confidence optimization module, a drift traffic confidence distribution modeling module, and a drift sample adaptive pseudo-label module. The drift traffic feature distribution alignment module maps the original traffic and the multi-source drift traffic into a reproducing kernel Hilbert space and aligns their feature distributions using the maximum mean discrepancy (MMD) combined with an intra-domain feature diversity constraint. The confidence optimization module models per-sample uncertainty with Shannon entropy and introduces a diversity constraint on the batch prediction distribution, improving both the confidence of the model output and class balance. The drift traffic confidence distribution modeling module models the distribution of prediction entropy with a Gaussian mixture model having two Gaussian components and adaptively distinguishes correctly predicted samples from uncertain samples. The drift sample adaptive pseudo-label module screens high-confidence pseudo-labels according to posterior probability, fuses the pseudo-labeled samples with the original samples to construct a mixed training set, and fine-tunes the model parameters online with a weighted joint loss function.
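The alignment step in the abstract can be illustrated with a minimal sketch. It assumes a Gaussian (RBF) kernel as the implicit mapping into the kernel Hilbert space, a biased estimate of the squared maximum mean discrepancy, and a diversity term equal to the mean pairwise distance within each domain. The function names and the parameters `gamma` and `beta` are hypothetical and do not come from the patent text, which does not fix a kernel or a weighting.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(source, target, gamma=1.0):
    """Biased estimate of the squared MMD between two sets of embeddings."""
    return (mean_kernel(source, source, gamma)
            + mean_kernel(target, target, gamma)
            - 2.0 * mean_kernel(source, target, gamma))

def mean_pairwise_distance(xs):
    """Average Euclidean distance between distinct embeddings of one domain."""
    n = len(xs)
    total = sum(math.dist(xs[i], xs[j])
                for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)

def alignment_loss(source, target, gamma=1.0, beta=0.1):
    """MMD term minus a weighted intra-domain diversity term.

    Assumption: subtracting the diversity term means that minimizing the
    loss pulls the two domains together while keeping features spread out
    within each domain. beta is a hypothetical weight.
    """
    diversity = mean_pairwise_distance(source) + mean_pairwise_distance(target)
    return mmd_squared(source, target, gamma) - beta * diversity

if __name__ == "__main__":
    original = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # toy embeddings
    drifted = [[3.0, 3.0], [4.0, 3.0], [3.0, 4.0]]    # shifted copy
    print("MMD^2:", mmd_squared(original, drifted))
    print("alignment loss:", alignment_loss(original, drifted))
```

In a real system the embeddings would come from an intermediate layer of the traffic classifier and the loss would be minimized by gradient descent; the pure-Python version here only shows how the quantities are computed.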

Inventors

  • LI QI
  • DENG XINHAO
  • ZHANG YIXIANG
  • XU KE

Assignees

  • Tsinghua University (清华大学)

Dates

Publication Date
2026-05-08
Application Date
2026-01-06

Claims (10)

  1. An online encrypted traffic identification fine-tuning system for a multi-source traffic drift scenario, comprising: a drift traffic feature distribution alignment module, used for aligning the feature distributions of the multi-source drift traffic and the original traffic; a confidence optimization module, used for optimizing the distribution characteristics of the model output confidence based on the aligned feature distribution; a drift traffic confidence distribution modeling module, used for distinguishing correctly predicted samples from uncertain samples with a Gaussian mixture model and for generating high-confidence pseudo-labels by screening according to posterior probability; and a drift sample adaptive pseudo-label module, used for incorporating the screened pseudo-labeled samples into a training set to update the model parameters, continuously adapting to changes in traffic distribution through online fine-tuning, and maintaining the long-term accuracy and reliability of the encrypted traffic identification model.
  2. The system of claim 1, wherein the drift traffic feature distribution alignment module comprises a feature mapping unit, a difference measurement unit, and a feature alignment unit, wherein: the feature mapping unit generates an embedded vector representation of the original traffic and maps the embedded vectors of the original traffic and the drift traffic into a linearly separable high-dimensional reproducing kernel Hilbert space through a feature mapping function, the embedded vectors being obtained by extracting the intermediate-layer output of an existing traffic identification model, and the intermediate-layer output comprising the output vector of the convolution module that is fed to the fully connected module; the difference measurement unit calculates the distribution difference between the traffic before and after drift in the high-dimensional space based on a distribution difference measurement function, using the maximum mean discrepancy as the distribution difference metric; and the feature alignment unit adds, on top of the distribution difference loss, an optimization objective that maximizes the average distance between features within the original traffic and within the drift traffic, improving the consistency of the feature distributions before and after drift while maintaining intra-domain feature diversity.
  3. The system of claim 1, wherein the confidence optimization module comprises an uncertainty measurement unit, a diversity constraint unit, and a confidence optimization unit, wherein: the uncertainty measurement unit uses Shannon entropy to model the uncertainty of the prediction probability distribution output by the model; the diversity constraint unit calculates the average distribution of the prediction results of a batch of samples and uses the entropy of that average distribution as a class balance constraint, preventing the model from becoming excessively biased toward a few classes; and the confidence optimization unit constructs a loss function and, by minimizing it, reduces single-sample prediction uncertainty and improves the output confidence of the model.
  4. The system of claim 1, wherein the drift traffic confidence distribution modeling module comprises a confidence distribution modeling unit, wherein the confidence distribution modeling unit normalizes the entropy of the prediction probability distributions of a batch of samples to ensure consistency among the traffic samples, models the normalized entropy distribution with a Gaussian mixture model comprising two Gaussian components, and selects the Gaussian component whose mean is smaller than a preset threshold as the confidence distribution of the correctly predicted samples.
  5. The system of claim 1, wherein the drift sample adaptive pseudo-label module comprises a model update unit and a pseudo-label generation unit, wherein: the pseudo-label generation unit calculates, based on the modeling result of the Gaussian mixture model, the posterior probability that each unlabeled sample is correctly predicted, and assigns a pseudo-label to each sample whose posterior probability exceeds a preset threshold; the model update unit incorporates the high-confidence pseudo-labeled samples generated by the drift traffic confidence distribution modeling module into the training set and combines them with the original training samples to form a mixed training set; the parameters of the encrypted traffic identification model are fine-tuned online on the mixed training set by gradient descent, and the model parameters are continuously updated by iteratively minimizing a weighted sum of the feature distribution alignment loss and the confidence optimization loss; and the weight coefficients of the weighted loss are dynamically adjusted according to the actual drift scenario, so that the model continuously adapts to changes in traffic distribution.
  6. The system of claim 1, wherein the system further comprises: a standardized interface module, used for interfacing with existing deep-learning website fingerprinting frameworks to achieve modular integration.
  7. The system of claim 1, wherein, prior to the feature distribution alignment module, the system further comprises: a traffic data preprocessing module, used for performing data cleaning, format standardization, and feature extraction on the original traffic and the multi-source drift traffic, wherein the preprocessing comprises removing abnormal traffic data, unifying the time granularity and dimensions of the traffic data, and providing standardized traffic feature data to the feature distribution alignment module.
  8. An online encrypted traffic identification fine-tuning method for a multi-source traffic drift scenario, characterized by comprising the following steps: S1, aligning the feature distributions of the multi-source drift traffic and the original traffic; S2, optimizing the distribution characteristics of the model output confidence based on the aligned feature distribution; S3, distinguishing correctly predicted samples from uncertain samples with a Gaussian mixture model, and generating high-confidence pseudo-labels by screening according to posterior probability; and S4, incorporating the screened pseudo-labeled samples into a training set to update the model parameters, continuously adapting to changes in traffic distribution through online fine-tuning, and maintaining the long-term accuracy and reliability of the encrypted traffic identification model.
  9. A computer device comprising a processor and a memory, wherein the processor reads executable program code stored in the memory and runs a program corresponding to the executable program code, so as to implement the system according to any one of claims 1-7.
  10. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the system according to any one of claims 1-7.
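The confidence optimization and pseudo-label screening described in claims 3 to 5 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation. The loss is taken to be mean per-sample Shannon entropy minus a weighted entropy of the batch-average prediction. The two-component mixture is fitted with a small hand-written EM loop. The lower-mean component is treated as the "correctly predicted" one, which simplifies the preset mean threshold of claim 4. All function names and the parameters `lam`, `threshold`, `iters`, and `min_var` are hypothetical.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (natural log) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def confidence_loss(batch_probs, lam=1.0):
    """Mean per-sample entropy minus lam * entropy of the batch-average prediction."""
    n, k = len(batch_probs), len(batch_probs[0])
    mean_entropy = sum(shannon_entropy(p) for p in batch_probs) / n
    avg_dist = [sum(p[c] for p in batch_probs) / n for c in range(k)]
    return mean_entropy - lam * shannon_entropy(avg_dist)

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit_two_gaussians(values, iters=50, min_var=1e-4):
    """EM for a 1-D mixture of two Gaussians; returns (weights, means, variances)."""
    lo, hi = min(values), max(values)
    means = [lo, hi]
    variances = [max(((hi - lo) / 4.0) ** 2, min_var)] * 2
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            dens = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(2)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and variance of each component.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue
            weights[j] = nj / len(values)
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            variances[j] = max(var, min_var)
    return weights, means, variances

def select_pseudo_labels(batch_probs, threshold=0.9):
    """Return (sample index, predicted class) for samples likely predicted correctly."""
    k = len(batch_probs[0])
    # Normalize entropy to [0, 1] by dividing by its maximum, log(k).
    norm_ent = [shannon_entropy(p) / math.log(k) for p in batch_probs]
    weights, means, variances = fit_two_gaussians(norm_ent)
    low = 0 if means[0] <= means[1] else 1  # low-entropy component
    selected = []
    for i, (p, h) in enumerate(zip(batch_probs, norm_ent)):
        dens = [weights[j] * gaussian_pdf(h, means[j], variances[j]) for j in range(2)]
        total = sum(dens)
        posterior = dens[low] / total if total > 0 else 0.0
        if posterior > threshold:
            selected.append((i, p.index(max(p))))
    return selected
```

The selected `(index, class)` pairs would then be merged with the original labeled samples to form the mixed training set, and the model fine-tuned on a weighted sum of the alignment loss and the confidence loss.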

Description

Online encrypted traffic identification fine-tuning system for multi-source traffic drift scenarios

Technical Field

The invention relates to the technical field of communication, and in particular to an online encrypted traffic identification fine-tuning system for multi-source traffic drift scenarios.

Background

Encrypted traffic identification is a core means of network traffic analysis and is widely applied in network security auditing, dark web monitoring, and communication behavior analysis. With the spread of encrypted protocols such as HTTPS and Tor, traditional traffic identification methods based on plaintext analysis have gradually become ineffective. Deep-learning-driven website fingerprinting instead extracts metadata features such as packet size, transmission direction sequence, and inter-packet time interval, and builds a complete pipeline from traffic collection through feature extraction to classification decision. Specifically, convolutional layers extract local temporal patterns, and fully connected layers integrate the features and pass them to a softmax layer for probability output, forming an encrypted traffic identification framework based on deep neural networks. However, existing techniques face the serious challenge of traffic drift in long-term deployment: the feature distribution shifts under the combined effect of multiple dynamic sources such as the client environment, server configuration, and network links. Existing anti-drift schemes have significant limitations.
Retraining methods rely on manually labeled drift traffic samples, but obtaining target-domain labels in real scenarios is costly and unsustainable. Domain adaptation techniques reduce the distribution difference through feature transformation or adversarial training, but they assume that target-domain data is controllable and can be collected in batches, so they struggle with the dynamic nature of online, multi-source drift. Data augmentation methods introduce diverse perturbations in the training stage but cannot cover the complex drift patterns of real environments. Two major technical bottlenecks therefore exist in long-term deployment: first, drift patterns are unpredictable, and different time periods and network paths can exhibit completely heterogeneous feature distributions; second, target-domain samples lack ground-truth labels, so traditional supervised learning mechanisms fail. The prior art provides no online adaptive solution to multi-source drift, and a systematic technical framework that responds dynamically to changes in traffic distribution without manual labeling is needed.

Disclosure of Invention

The present invention aims to solve, at least to some extent, at least one of the technical problems in the related art. To this end, a first object of the invention is to provide an online encrypted traffic identification fine-tuning system for multi-source traffic drift scenarios. A second object of the invention is to provide an online encrypted traffic identification fine-tuning method for multi-source traffic drift scenarios. A third object of the invention is to provide a computer device. A fourth object of the invention is to provide a non-transitory computer-readable storage medium.
To achieve the above objects, an embodiment of a first aspect of the invention provides an online encrypted traffic identification fine-tuning system for multi-source traffic drift scenarios, comprising: a drift traffic feature distribution alignment module, used for aligning the feature distributions of the multi-source drift traffic and the original traffic; a confidence optimization module, used for optimizing the distribution characteristics of the model output confidence based on the aligned feature distribution; a drift traffic confidence distribution modeling module, used for distinguishing correctly predicted samples from uncertain samples with a Gaussian mixture model and for generating high-confidence pseudo-labels by screening according to posterior probability; and a drift sample adaptive pseudo-label module, used for incorporating the screened pseudo-labeled samples into a training set to update the model parameters, continuously adapting to changes in traffic distribution through online fine-tuning, and maintaining the long-term accuracy and reliability of the encrypted traffic identification model.

In one embodiment of the invention, the drift traffic feature distribution alignment module comprises a feature mapping unit, a difference measurement unit, and a feature alignment unit, wherein the feature mapping unit generates an embedded vector representation of the original traffic, maps the embedded vectors of the origin