
CN-122001676-A - Malicious flow detection model robustness enhancement method and system based on reinforcement learning and incremental learning

CN 122001676 A

Abstract

The invention belongs to the technical field of malicious traffic detection and network security, and provides a method and system for enhancing the robustness of a malicious traffic detection model based on reinforcement learning and incremental learning, mainly solving the problems of existing malicious traffic detection models such as insufficient robustness, poor compliance of adversarial samples, susceptibility to catastrophic forgetting in dynamic threat environments, and weak adaptability. Reinforcement learning is combined with a Transformer policy network to construct joint state-action space modeling, a three-term composite reward function and a compliance action mask mechanism, generating adversarial samples with high concealment and strong compliance; through this process, an intelligent malicious traffic defense framework with high robustness against attacks and continuous environmental adaptability is constructed, and the performance of the model can be verified through the related mechanisms, ensuring the defense effect.

Inventors

  • Niu Weina
  • Jiang Muqi
  • Li Hongkai
  • Ding Xuyang
  • Zhang Xiaosong
  • Li Xuexing
  • Bi Yang

Assignees

  • University of Electronic Science and Technology of China
  • Sichuan Police College

Dates

Publication Date
2026-05-08
Application Date
2026-03-20

Claims (10)

  1. A method for enhancing the robustness of a malicious traffic detection model based on reinforcement learning and incremental learning, characterized by comprising the following steps: Step 1, constructing a malicious traffic training data set and a test data set: collecting benign and malicious traffic samples, preprocessing the data, and extracting the historical attack features, context features and traffic statistics features of the traffic samples, providing data support for the subsequent steps; Step 2, constructing a compliant adversarial sample generation model based on reinforcement learning combined with a Transformer policy network, and generating highly concealed, highly compliant adversarial samples through joint state-action space modeling, a three-term composite reward function and an action mask mechanism; Step 3, constructing an MMD-based concept drift detection mechanism: performing drift detection on the adversarial samples generated in Step 2 and the malicious traffic samples collected in real time, quantifying the distribution deviation between the current traffic data and the historical training data, and starting an incremental training process when the MMD value exceeds a preset threshold; Step 4, performing collaborative incremental training on the malicious traffic detection model based on incremental learning combined with a layered EWC parameter protection strategy, applying differentiated protection strength to the different layers of the model through parameter importance evaluation, so as to achieve the dual goals of stably retaining historical attack knowledge and adapting to novel threats; and Step 5, using the malicious traffic detection model trained and updated in Step 4 to detect malicious traffic samples to be examined, outputting classification results, and feeding the detection results back to the compliant adversarial sample generation model of Step 2, thereby realizing a closed-loop iteration of generation, detection, update and optimization that continuously improves the robustness and adaptability of the model.
  2. The method for enhancing the robustness of a malicious traffic detection model based on reinforcement learning and incremental learning according to claim 1, wherein Step 1 specifically comprises: Step 1.1, collecting samples: collecting benign and malicious traffic samples from public network-security data sets and real network environments, and constructing an initial data set with a balanced ratio of benign to malicious samples; Step 1.2, preprocessing the data: denoising, deduplicating and standardizing the collected traffic samples, removing invalid samples, extracting the historical attack features, context features and traffic statistics features of the samples, and converting the feature data into a format the model can consume; and Step 1.3, partitioning the data set: dividing the preprocessed data set into a training set and a test set at a preset ratio, the training set being used for initial training and subsequent incremental training of the model and the test set for verifying model performance.
  3. The method for enhancing the robustness of a malicious traffic detection model based on reinforcement learning and incremental learning according to claim 1, wherein Step 2 specifically comprises: Step 2.1, joint state-action space modeling: the state space s_t integrates historical attack features, context features and traffic statistics features, where the historical attack features include recent success rate, alarm type and interception frequency, the context features include protocol type, source/destination information and duration, and the traffic statistics features include packet size distribution, throughput fluctuation and connection duration; Step 2.2, constructing a Transformer policy network: the traffic feature sequence is fed into a Transformer encoder composed of multiple layers of multi-head self-attention and feed-forward networks; the multi-head self-attention captures the correlations of individual traffic features across different dimensions; the output of each encoder layer undergoes residual connection and layer normalization; the encoder performs deep modeling of high-dimensional, long-sequence traffic state features, capturing complex temporal and spatial dependencies among packets; the final encoder output is fed into a fully connected layer and mapped through an activation function into a vector matching the dimension of the action space, producing a preliminary traffic-modification action instruction; Step 2.3, implementing the action mask mechanism: a network-protocol compliance rule base is constructed from mainstream transport protocol specifications such as TCP/UDP/IP, and illegal-action criteria are formulated covering packet-size violations, tampering with key protocol fields, illegal timing adjustments and illegal modification of source/destination information; at the same time, the high-dimensional traffic features of the state space s_t of Step 2.1 are associated with the traffic-modification actions of the action space a_t, the executable action types and modification ranges under different traffic states are defined, and a state-to-action mapping rule base is built; at each time step t, the protocol compliance rule base and the state-to-action mapping rule base are combined to dynamically generate a mask vector Mask_t, each element Mask_t[i] of which takes the value 0 or 1, where 0 denotes an illegal action that cannot be executed at time step t and 1 a legal action that can be executed, so that actions violating the protocol specification are forcibly filtered out and the compliance of action instructions is guaranteed; Step 2.4, designing a three-term composite reward function comprising a core attack reward, a stealth reward and a resource-cost penalty: R_t = α·R_attack + β·R_stealth − γ·C_cost, where R_attack is the core attack reward, namely the bypass success rate of the generated adversarial sample against the detection model; R_stealth is the stealth (disguise) reward, namely the similarity between the generated adversarial sample and the original traffic sample; C_cost is the resource-cost penalty, expressed as the time and resources consumed by the agent in generating the adversarial sample; and α, β, γ are the corresponding weights that balance the composite reward function, driving the agent to an optimal trade-off among attack effectiveness, stealthy survival and resource cost and quantifying the benefit of each of the agent's actions; and Step 2.5, generating and optimizing adversarial samples: the agent combines the action-modification instruction produced by the policy network of Step 2.2 with the action mask mechanism of Step 2.3 to obtain the mask-filtered legal actions, modifies the features of the original malicious traffic sample to generate an adversarial sample, feeds the adversarial sample into the current malicious traffic detection model, computes the reward value from the detection result, back-propagates to update the Transformer policy network parameters, and iteratively improves the quality of the adversarial samples.
  4. The method for enhancing the robustness of a malicious traffic detection model based on reinforcement learning and incremental learning according to claim 1, wherein Step 3 specifically comprises: Step 3.1, extracting features: extracting the core features of the adversarial samples generated in Step 2 and of the malicious traffic samples collected in real time, and aligning them with the historical traffic sample features used during model training; Step 3.2, computing the MMD value using the maximum mean discrepancy (MMD) algorithm according to the formula MMD(X_cur, X_hist) = sqrt( Σ_{j=1..d} (μ_cur,j − μ_hist,j)² ), computing the distribution difference between the current and historical traffic sample features and quantifying their degree of deviation, where μ_cur,j is the mean of sample set X_cur over the j-th feature dimension, μ_hist,j is the mean of sample set X_hist over the j-th feature dimension, and d is the total number of feature dimensions; and Step 3.3, drift judgment and alarm: an MMD threshold is preset; when the computed MMD value exceeds the threshold, concept drift is judged to have occurred, a sliding-window-based drift alarm is raised and the incremental training process of Step 4 is triggered; if the threshold is not exceeded, the current detection model is kept and traffic detection continues.
  5. The method for enhancing the robustness of a malicious traffic detection model based on reinforcement learning and incremental learning according to claim 1, wherein Step 4 specifically comprises: Step 4.1, evaluating parameter importance: using the elastic weight consolidation (EWC) algorithm, the importance of each model parameter to the historical attack-detection tasks is computed with the Fisher information matrix, and a protection weight is introduced for the key parameters of each historical task; Step 4.2, designing the layered parameter protection mechanism: based on layer importance, differentiated EWC regularization coefficients, i.e. protection weights, are set for the bottom feature-extraction layers, the middle semantic-understanding layers and the top decision/classification layers of the malicious traffic detection model; these coefficients directly determine the constraint strength on parameter updates during incremental training, giving higher protection weights to the more general-purpose feature-extraction layers while allowing a larger adjustment space for adapting to new attack types; the layered EWC regularization terms are introduced into the incremental-training loss function, and the total loss is designed as L_total = L_inc + Σ_l λ_l · F̄_l · (θ_l − θ*_l)², where L_inc is the incremental-task loss, l indexes the model layers, F̄_l is the mean of the Fisher information matrix over the parameters of layer l, θ_l are the current parameters during incremental training, θ*_l are the original parameters of the historical tasks, and λ_l is the layer-wise protection weight through which differentiated protection of the different layers is realized; Step 4.3, collaborative incremental training: the drifting traffic samples screened out in Step 3 are used as incremental training data and, combined with the layered EWC parameter protection strategy of Step 4.2, the current malicious traffic detection model is incrementally trained; the layered differentiated constraints resolve the contradiction between protection strength and update amplitude found in traditional single-constraint methods, providing moderate protection of the key historical parameters while learning new threat features, thereby avoiding 'catastrophic forgetting'; and Step 4.4, verifying and adjusting the model: the incrementally trained model is evaluated on the test data set to verify its detection accuracy, robustness and adaptability; if the preset performance indicators are not met, the parameter protection strength or training parameters are adjusted and incremental training is repeated until the requirements are satisfied.
  6. The method for enhancing the robustness of a malicious traffic detection model based on reinforcement learning and incremental learning according to claim 1, wherein Step 5 specifically comprises: Step 5.1, model deployment and detection: the malicious traffic detection model that meets the preset performance indicators after the incremental training of Step 4 is deployed to detect malicious traffic samples to be examined; Step 5.2, outputting and feeding back detection results: the model classifies each sample to be examined, outputs a benign or malicious classification result and records key detection data; the classification results and key detection data are fed back to the compliant adversarial sample generation model of Step 2, providing data support for optimizing the sample generation strategy; and Step 5.3, closed-loop iterative optimization: the adversarial sample generation model adjusts its sample generation strategy according to the fed-back detection data, and the procedure of Steps 2-5 is re-executed to realize a closed-loop iteration of generation, detection, update and optimization, continuously improving the robustness of the model and its adaptability to novel malicious traffic.
  7. A system for enhancing the robustness of a malicious traffic detection model based on reinforcement learning and incremental learning, characterized by comprising the following modules: a data preprocessing module for constructing malicious traffic training and test data sets, collecting benign and malicious traffic samples, preprocessing the data, and extracting the historical attack features, context features and traffic statistics features of the traffic samples to provide data support for the subsequent modules; an adversarial sample generation module for constructing a compliant adversarial sample generation model based on reinforcement learning combined with a Transformer policy network, and generating highly concealed, highly compliant adversarial samples through joint state-action space modeling, a three-term composite reward function and an action mask mechanism; a concept drift detection module for constructing an MMD-based concept drift detection mechanism, performing drift detection on the adversarial samples generated by the adversarial sample generation module and the malicious traffic samples collected in real time, quantifying the distribution deviation between the current traffic data and the historical training data, and starting an incremental training process when the MMD value exceeds a preset threshold; an incremental training module for performing collaborative incremental training on the malicious traffic detection model based on incremental learning combined with a layered EWC parameter protection strategy, applying differentiated protection strength to the different layers of the model through parameter importance evaluation so as to achieve the dual goals of stably retaining historical attack knowledge and adapting to novel threats; and a traffic detection and closed-loop optimization module for applying the malicious traffic detection model trained and updated by the incremental training module to the detection of malicious traffic samples to be examined, outputting classification results, and feeding the detection results back to the adversarial sample generation module, thereby realizing a closed-loop iteration of generation, detection, update and optimization that continuously improves the robustness and adaptability of the model.
  8. The system for enhancing the robustness of a malicious traffic detection model based on reinforcement learning and incremental learning according to claim 7, wherein the data preprocessing module is specifically implemented as follows: benign and malicious traffic samples are collected from public network-security data sets and real network environments to construct an initial data set with a balanced ratio of benign to malicious samples; the collected traffic samples are denoised, deduplicated and standardized, invalid samples are removed, the historical attack features, context features and traffic statistics features of the samples are extracted, and the feature data are converted into a format the model can consume; and the preprocessed data set is divided into a training set and a test set at a preset ratio, the training set being used for initial training and subsequent incremental training of the model and the test set for verifying model performance.
  9. The system for enhancing the robustness of a malicious traffic detection model based on reinforcement learning and incremental learning according to claim 7, wherein the adversarial sample generation module is specifically implemented as follows: joint state-action space modeling is constructed, the state space s_t integrating historical attack features, context features and traffic statistics features, where the historical attack features include recent success rate, alarm type and interception frequency, the context features include protocol type, source/destination information and duration, and the traffic statistics features include packet size distribution, throughput fluctuation and connection duration; a Transformer policy network is constructed: the traffic feature sequence is fed into a Transformer encoder composed of multiple layers of multi-head self-attention and feed-forward networks, the multi-head self-attention captures the correlations of individual traffic features across different dimensions, the output of each encoder layer undergoes residual connection and layer normalization, the encoder performs deep modeling of high-dimensional, long-sequence traffic state features and captures complex temporal and spatial dependencies among packets, and the final encoder output is fed into a fully connected layer and mapped through an activation function into a vector matching the dimension of the action space, producing a preliminary traffic-modification action instruction; the action mask mechanism is implemented: a network-protocol compliance rule base is constructed from mainstream transport protocol specifications such as TCP/UDP/IP, and illegal-action criteria are formulated covering packet-size violations, tampering with key protocol fields, illegal timing adjustments and illegal modification of source/destination information; at the same time, the high-dimensional traffic features of the state space s_t are associated with the traffic-modification actions of the action space a_t, the executable action types and modification ranges under different traffic states are defined, and a state-to-action mapping rule base is built; at each time step t, the protocol compliance rule base and the state-to-action mapping rule base are combined to dynamically generate a mask vector Mask_t, each element Mask_t[i] of which takes the value 0 or 1, where 0 denotes an illegal action that cannot be executed at time step t and 1 a legal action that can be executed, so that actions violating the protocol specification are forcibly filtered out and the compliance of action instructions is guaranteed; a three-term composite reward function comprising a core attack reward, a stealth reward and a resource-cost penalty is designed as R_t = α·R_attack + β·R_stealth − γ·C_cost, where R_attack is the core attack reward, namely the bypass success rate of the generated adversarial sample against the detection model, R_stealth is the stealth (disguise) reward, namely the similarity between the generated adversarial sample and the original traffic sample, C_cost is the resource-cost penalty, expressed as the time and resources consumed by the agent in generating the adversarial sample, and α, β, γ are the corresponding weights that balance the composite reward function, driving the agent to an optimal trade-off among 'attack effectiveness', 'stealthy survival' and 'resource cost' and quantifying the benefit of each of the agent's actions; and adversarial samples are generated and optimized: the agent combines the action-modification instruction produced by the policy network with the action mask mechanism, obtains the mask-filtered legal actions, modifies the features of the original malicious traffic sample to generate an adversarial sample, feeds it into the current malicious traffic detection model, computes the reward value from the detection result, back-propagates to update the Transformer policy network parameters, and iteratively improves the quality of the adversarial samples.
  10. The system for enhancing the robustness of a malicious traffic detection model based on reinforcement learning and incremental learning according to claim 7, wherein the concept drift detection module is specifically implemented as follows: the core features of the adversarial samples generated by the adversarial sample generation module and of the malicious traffic samples collected in real time are extracted and aligned with the historical traffic sample features used during model training; the MMD value is computed using the maximum mean discrepancy (MMD) algorithm according to the formula MMD(X_cur, X_hist) = sqrt( Σ_{j=1..d} (μ_cur,j − μ_hist,j)² ), quantifying the distribution difference between the current and historical traffic sample features, where μ_cur,j is the mean of sample set X_cur over the j-th feature dimension, μ_hist,j is the mean of sample set X_hist over the j-th feature dimension, and d is the total number of feature dimensions; and drift judgment and alarm are performed: an MMD threshold is preset, and when the computed MMD value exceeds the threshold, concept drift is judged to have occurred, a sliding-window-based drift alarm is raised and the incremental training process is triggered; if the threshold is not exceeded, the current detection model is kept and traffic detection continues.
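The action mask mechanism and three-term composite reward described in claims 3 and 9 can be sketched as follows. This is an illustrative reconstruction, not the patent's implementation: the action names, the two example compliance rules, and the weight values α=1.0, β=0.5, γ=0.1 are hypothetical placeholders.

```python
import numpy as np

# Hypothetical action space of candidate traffic-modification actions.
ACTION_NAMES = ["pad_packet", "split_packet", "delay_send", "change_ttl",
                "tamper_checksum", "spoof_src_ip"]
NUM_ACTIONS = len(ACTION_NAMES)

def build_mask(state):
    """Build a 0/1 compliance mask over the action space for one time step.

    0 marks an action that would violate the protocol rule base in the current
    traffic state; 1 marks a legal action. The rules below are illustrative
    placeholders, not the patent's actual rule base.
    """
    mask = np.ones(NUM_ACTIONS, dtype=np.int8)
    # Example rule: never tamper with checksums or source IPs.
    mask[ACTION_NAMES.index("tamper_checksum")] = 0
    mask[ACTION_NAMES.index("spoof_src_ip")] = 0
    # Example rule: padding is illegal once the packet is near the MTU.
    if state["pkt_size"] > 1400:
        mask[ACTION_NAMES.index("pad_packet")] = 0
    return mask

def masked_action_probs(logits, mask):
    """Apply the mask before softmax so illegal actions get zero probability."""
    masked = np.where(mask == 1, logits, -1e9)
    exp = np.exp(masked - masked.max())
    return exp / exp.sum()

def composite_reward(bypass_rate, similarity, time_cost,
                     alpha=1.0, beta=0.5, gamma=0.1):
    """Three-term composite reward: attack reward + stealth reward - cost."""
    return alpha * bypass_rate + beta * similarity - gamma * time_cost

# Uniform logits with a near-MTU packet: only the legal actions get mass.
probs = masked_action_probs(np.zeros(NUM_ACTIONS),
                            build_mask({"pkt_size": 1500}))
```

In this toy state, probability mass is spread evenly over the three legal actions and the filtered actions receive none, which is the property the mask mechanism relies on.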
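The simplified per-dimension mean-difference MMD of claims 4 and 10 can be sketched directly from its definition. The threshold value 0.5 and the synthetic data below are hypothetical; a full MMD would use a kernel two-sample statistic rather than the mean difference alone.

```python
import numpy as np

def mmd_mean_diff(x_cur, x_hist):
    """Per-dimension mean-difference MMD as described in the claim.

    x_cur, x_hist: arrays of shape (n_samples, d). Returns the Euclidean
    distance between the per-feature means of the two sample sets.
    """
    mu_cur = x_cur.mean(axis=0)
    mu_hist = x_hist.mean(axis=0)
    return float(np.sqrt(((mu_cur - mu_hist) ** 2).sum()))

def drift_detected(x_cur, x_hist, threshold=0.5):
    """Trigger incremental training when the MMD value exceeds the threshold."""
    return mmd_mean_diff(x_cur, x_hist) > threshold

# Synthetic 8-dimensional traffic features: one batch from the historical
# distribution (no drift) and one with shifted means (drift).
rng = np.random.default_rng(0)
hist = rng.normal(0.0, 1.0, size=(500, 8))
same = rng.normal(0.0, 1.0, size=(500, 8))
shifted = rng.normal(3.0, 1.0, size=(500, 8))
```

The in-distribution batch stays well under the threshold while the shifted batch far exceeds it, which is exactly the separation the drift alarm depends on.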
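The layered EWC total loss of claim 5, L_total = L_inc + Σ_l λ_l·F̄_l·(θ_l − θ*_l)², can be sketched as below. The layer names and weight values are hypothetical; a real implementation would iterate over the detection network's actual parameter tensors and per-layer Fisher estimates.

```python
import numpy as np

def layered_ewc_loss(task_loss, params, star_params, fisher_mean, lambdas):
    """Total incremental-training loss with per-layer EWC regularization.

    task_loss:   scalar incremental-task loss L_inc
    params:      dict layer_name -> current parameter array theta_l
    star_params: dict layer_name -> parameters frozen after the old task
    fisher_mean: dict layer_name -> scalar mean Fisher information of the layer
    lambdas:     dict layer_name -> layer-specific protection weight lambda_l
    """
    penalty = 0.0
    for name, theta in params.items():
        diff = theta - star_params[name]
        penalty += lambdas[name] * fisher_mean[name] * float((diff ** 2).sum())
    return task_loss + penalty

# Toy two-layer model: the 'feature' layer drifted, the 'decision' layer
# did not, so only the feature layer incurs a protection penalty.
total = layered_ewc_loss(
    task_loss=1.0,
    params={"feature": np.array([1.0, 1.0]), "decision": np.array([2.0])},
    star_params={"feature": np.zeros(2), "decision": np.array([2.0])},
    fisher_mean={"feature": 0.5, "decision": 1.0},
    lambdas={"feature": 2.0, "decision": 0.1},
)
```

Giving the feature layer a larger λ makes deviations from θ* there costlier than in the decision layer, which is the layered-protection idea in one line.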
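The closed loop of claims 1 and 6 (generate, detect, drift-check, incrementally update, repeat) can be sketched as a skeleton. Every component below is a deliberately trivial stub with hypothetical names and thresholds, standing in for the policy network, detector, MMD test and EWC update.

```python
import random

random.seed(0)

def generate_adversarial(sample):
    """Step 2 stub: perturb a scalar 'traffic feature' slightly."""
    return sample + random.uniform(-0.1, 0.1)

def detect(model_bias, sample):
    """Step 5 stub: a one-parameter threshold detector."""
    return "malicious" if sample > model_bias else "benign"

def mmd(cur, hist):
    """Step 3 stub: 1-D mean-difference MMD."""
    return abs(sum(cur) / len(cur) - sum(hist) / len(hist))

def incremental_update(model_bias, cur):
    """Step 4 stub: nudge the detector toward the drifted batch mean."""
    return 0.9 * model_bias + 0.1 * (sum(cur) / len(cur))

history = [1.0] * 50     # historical training features
model_bias = 0.5         # initial detector parameter
for _ in range(3):       # three closed-loop iterations
    batch = [generate_adversarial(1.5) for _ in range(20)]
    results = [detect(model_bias, s) for s in batch]
    if mmd(batch, history) > 0.2:   # drift detected: incremental training
        model_bias = incremental_update(model_bias, batch)
```

The point of the skeleton is the control flow, not the stubs: detection results feed drift detection, and drift gates the incremental update, closing the loop the claims describe.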

Description

Method and system for enhancing the robustness of a malicious traffic detection model based on reinforcement learning and incremental learning

Technical Field

The invention belongs to the technical field of malicious traffic detection and network security, and in particular relates to a method and system for enhancing the robustness of a malicious traffic detection model based on reinforcement learning and incremental learning.

Background

With the acceleration of global digitization, the scale and complexity of cyberspace have grown exponentially, and malicious traffic detection systems have become a core line of defense for safeguarding critical information infrastructure and data assets. However, the limitations of existing malicious traffic detection systems are increasingly prominent when dealing with ever more intelligent and diverse Advanced Persistent Threats (APT), especially in the face of adaptive attacks driven by frontier generative techniques such as generative adversarial networks (GAN). By mimicking the encrypted traffic characteristics of legitimate collaboration software, attackers have successfully mounted 'AI bypass' attacks, further demonstrating that the robustness of traditional detection models faces a severe test against highly obfuscated malicious traffic.
At present, research in the field of malicious traffic detection falls mainly into two categories. The first comprises traditional machine learning and deep learning methods that improve the detection model's own performance, such as SVM, KNN, CNN and LSTM; these are in essence static defense models whose parameters are fixed after training, so when facing continuously evolving and mutating malicious attacks their generalization ability and robustness are insufficient and they are easily defeated by feature-space perturbation attacks. The second comprises adversarial learning methods, represented by GAN, that explore the dynamic game of attack and defense; however, when generating adversarial samples these methods often suffer from mode collapse and loss of compliance: the generated samples may violate network protocol specifications and cannot be effectively transmitted or mounted as attacks in a real network environment, limiting their practicality. To cope with the 'catastrophic forgetting' problem in defense-model updating under a dynamic threat environment, incremental learning has gradually been applied to the field of malicious traffic detection.
Existing incremental learning methods mainly comprise basic incremental learning, replay-based methods, regularization-based methods and knowledge-distillation-based methods, but when applied to malicious traffic detection scenarios they still show clear shortcomings: they fail to resolve the contradiction between retaining historical knowledge and rapidly adapting to new threats, either over-protecting historical parameters so that adaptation to new threats lags, or over-updating so that forgetting occurs; drift detection and incremental updating are decoupled, so evolving attack patterns cannot be responded to dynamically; and the protocol characteristics and attack rules of malicious traffic are not incorporated, limiting suitability and generalization. In the document 'IoT-Based Android Malware Detection Using Graph Neural Network With Adversarial Defense', related research uses a graph neural network to generate API graph embeddings based on centrality measures and to detect comprehensive permissions and intent, but this method involves neither the generation of adversarial malicious-traffic samples nor incremental learning, and cannot solve the model-adaptability problem in a dynamic environment. In the literature on interpretability-based Android malware detection, researchers achieve interpretable detection using a multi-layer perceptron and an attention mechanism, but this method targets only malware, is not applied to malicious traffic detection, cannot mitigate 'catastrophic forgetting', and has insufficient robustness.
In addition, some research attempts to apply reinforcement learning to adversarial malicious-traffic generation, but most of it depends on auxiliary models such as GAN, so the risk of mode collapse cannot be fully eliminated; most of these methods also lack protocol-compliance constraints on the generated samples, limiting their practicality; and most of the research focuses on the success rate of a single attack while ignoring the need for the detection model to keep learning and evolving in a dynamic threat environment, leaving unsolved the problem that a defense model may 'catastrophically forget' while continuously learning new threats, i.e. the model forgets how to recognize old attacks while learning new knowledge. In summar