CN-114826632-B - Network attack classification method based on network security data cleaning fusion
Abstract
The invention discloses a network attack classification method based on network security data cleaning and fusion, relating to the field of network security. The method comprises: cleaning multi-source heterogeneous network security data; calculating the information entropy of each sub-attribute space and taking it as the weight of that sub-attribute space; constructing and training a hidden Markov model for each sub-attribute space; taking the output results of the k trained Markov models as k evidence bodies; inputting the attribute sequence values of the network attack data to be tested into the trained Markov models to obtain the probability of occurrence of each attack result; performing a weighted calculation on these probabilities; fusing the k evidence bodies using D-S evidence theory; judging the fused data using the trust (belief) function of D-S evidence theory; and classifying the network attack according to the judgment result.
Inventors
- ZHANG JING
- XIN ZHENG
- WU JIANYING
- ZHANG HAIXIA
- HUANG KEZHEN
- LIAN YIFENG
- LI YITING
Assignees
- Institute of Software, Chinese Academy of Sciences (中国科学院软件研究所)
Dates
- Publication Date
- 20260512
- Application Date
- 20210127
Claims (10)
- 1. A network attack classification method based on network security data cleaning fusion, characterized by comprising the following steps: 1) collecting multi-source heterogeneous network security data, and cleaning, combining, and transforming the data; 2) dividing the attribute space of the transformed data into k sub-attribute spaces according to the source and the properties of the data, calculating the information entropy of each sub-attribute space, and taking the information entropy as the weight of that sub-attribute space; 3) constructing a hidden Markov model for each sub-attribute space, and training the Markov model of each sub-attribute space to obtain k trained Markov models; 4) taking the output results of the k trained Markov models as k evidence bodies; 5) inputting the attribute sequence values of the network attack data to be tested into the k trained Markov models to obtain the probability of occurrence of each attack result; 6) performing a weighted calculation on the probability of each attack result using the weights of the sub-attribute spaces, and fusing the k evidence bodies using D-S evidence theory; 7) judging the fused data based on the trust (belief) function of D-S evidence theory, and classifying the network attack according to the judgment result.
- 2. The method of claim 1, wherein the data cleaning comprises the steps of: processing incomplete data by filling in the blank values; de-duplicating repeated data; processing erroneous data, wherein the processing method comprises one or more of binning, clustering to remove isolated points, and fitting a regression function to smooth the data; integrating the table structures and field types of the multi-source heterogeneous data into a unified format; and deleting data that is not required.
- 3. The method of claim 1, wherein the data transformation comprises one or more of data normalization, reduction, switching, rotation, and projection.
- 4. The method of claim 1, wherein the attribute sequence of data is the sequence of attributes of one piece of data, and the attribute space is a matrix formed from the attribute sequences of a plurality of pieces of data, the matrix being expressed as [formula missing], wherein [symbol missing] denotes an attribute sequence and m represents the number of pieces of network security source data.
- 5. The method of claim 1, wherein each sub-attribute space is a power set made up of a number of attributes that affect classification of network security events.
- 6. The method of claim 1, wherein the information entropy is calculated according to the following information entropy calculation formula: $H(X) = -\sum_{i} p(x_i) \log_b p(x_i)$; wherein $H(X)$ represents the information entropy of the random event $X$, $p(x_i)$ represents the probability that the random event takes the value $x_i$, $i$ represents an index over the possible values of the random event, and $b$ is the logarithmic base.
- 7. The method of claim 1, wherein each Markov model is trained by obtaining training samples for that Markov model from a different source, based on the definitions of the respective parameters within the respective sub-attribute space.
- 8. The method of claim 7, wherein the training yields a Markov model with optimal parameters, which is taken as a trained Markov model after being tested on the test set.
- 9. The method of claim 1, wherein the D-S evidence theory framework is expressed as $\Theta = \{F_1, F_2, \ldots, F_k\}$, where $k$ represents the number of sub-attribute spaces and $F_k$ represents the attack class corresponding to the $k$-th sub-attribute space.
- 10. The method of claim 1, wherein the probability of occurrence of the attack result is obtained in step 5) according to a formula [formula missing] whose terms are: the probability of occurrence of the attack result; the attack result $F_j$; the probability of assigning evidence $H_i$ to attack result $F_j$; and the sum of the k probabilities of assigning evidence $H_i$ to attack result $F_j$.
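The entropy weighting and D-S fusion steps recited in claims 1, 6, and 9 can be illustrated with a minimal Python sketch. It is not the patented implementation. The claims do not state how the entropy weights enter the fusion, so the sketch assumes Shafer discounting, with each weight scaled by the largest weight. It also assumes that every evidence body assigns mass only to single attack classes, which makes the belief of a class equal to its fused mass. All function names, the `THETA` marker, and the sample values are hypothetical.

```python
import math
from collections import Counter

THETA = "THETA"  # hypothetical marker for the whole frame of discernment (ignorance)

def shannon_entropy(values, base=2):
    """Estimate H(X) = -sum_i p(x_i) * log_b p(x_i) from observed values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log(c / total, base) for c in counts.values())

def discount(probs, alpha):
    """Shafer discounting (an assumption): keep alpha of each mass, move the rest to THETA."""
    mass = {label: alpha * p for label, p in probs.items()}
    mass[THETA] = max(0.0, 1.0 - alpha * sum(probs.values()))
    return mass

def dempster_combine(m1, m2):
    """Dempster's rule for masses on single attack classes plus THETA."""
    combined, conflict = {}, 0.0
    for a, pa in m1.items():
        for b, pb in m2.items():
            if a == THETA:
                target = b
            elif b == THETA or a == b:
                target = a
            else:
                conflict += pa * pb  # two different classes: empty intersection
                continue
            combined[target] = combined.get(target, 0.0) + pa * pb
    if conflict >= 1.0:
        raise ValueError("total conflict: the evidence bodies cannot be combined")
    return {h: v / (1.0 - conflict) for h, v in combined.items()}

def classify(evidence, weights):
    """Fuse k evidence bodies and return (class with highest belief, fused masses)."""
    top = max(weights)
    alphas = [w / top if top > 0 else 1.0 for w in weights]
    fused = None
    for probs, alpha in zip(evidence, alphas):
        m = discount(probs, alpha)
        fused = m if fused is None else dempster_combine(fused, m)
    beliefs = {h: v for h, v in fused.items() if h != THETA}
    return max(beliefs, key=beliefs.get), fused

if __name__ == "__main__":
    # Hypothetical sub-attribute columns; their entropies serve as weights.
    columns = [["tcp", "tcp", "udp", "icmp"], ["low", "low", "low", "high"]]
    weights = [shannon_entropy(col) for col in columns]
    # Hypothetical per-model class probabilities (the k evidence bodies).
    evidence = [{"dos": 0.7, "probe": 0.3}, {"dos": 0.6, "probe": 0.4}]
    label, fused = classify(evidence, weights)
    print(label, fused)
```

In this sketch a sub-attribute space with lower entropy is trusted less, because more of its mass is moved to `THETA` before combination. Other ways of applying the weights are equally compatible with the wording of claim 1.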
Description
Network attack classification method based on network security data cleaning fusion
Technical Field
The invention relates to the field of network security, and in particular to a method for identifying network attack categories based on cleaning and fusion decisions over multi-source heterogeneous data.
Background
In recent years, with the rapid development of technologies such as mobile internet, cloud computing, information security, and machine learning, hundreds of millions of users generate a large amount of data every day. Through data mining and other technologies, this large-scale data can be applied in many different fields and brings convenience to daily life. However, massive data comes with the problem of multi-source heterogeneity, which degrades data quality. Data quality is a bottleneck restricting the use of data: without high-quality data, there can be no high-quality data mining results. As tools for processing multi-source heterogeneous data, data cleaning and data fusion are important technologies for improving data quality and have significant value. Traditional data cleaning and fusion methods no longer meet the requirements of modern technological development; in particular, data in current large-scale network security projects increasingly comes from multiple sources, so data cleaning and fusion technology is in need of updating. The multi-source data cleaning and fusion process mainly comprises three steps: (1) multi-source data acquisition, (2) multi-source data preprocessing, and (3) multi-source data fusion calculation.
Multi-source data in the network security field differs from multi-source data in general fields. In a network security project, data generally must be obtained from the security data, log data, and traffic data of different vendors. Because vendors use different network security devices, and even identical devices are configured with different software settings, the multi-source data lacks uniformity and has its own distinctive characteristics. To make use of network security data, preprocessing must therefore be tailored to the characteristics of each data source. The quality of network security data underpins research projects across the network security field. Data preprocessing is the first step performed after multi-source heterogeneous network data is obtained; because the data volume is huge and the data comes from many different types of sources, the likelihood of anomalies in the data is greatly increased. High-quality data is the basis for better results and performance, so data cleaning and fusion is increasingly important and is a basic step in building any project or platform. Network security data that has been cleaned and fused can be applied in fields such as statistical analysis, machine learning, and artificial intelligence to analyze the current network security situation and to improve the network security protection capability of individuals, enterprises, and countries. During data collection, various factors may degrade data quality, such as the accuracy of collection, the integrity of the data itself, and the consistency of the data when multi-source data is aggregated.
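As an illustration of the kind of preprocessing discussed above, the following minimal Python sketch shows three of the cleaning operations named in claim 2: filling blank values, de-duplication, and smoothing by binning. The record layout, the field names `src` and `bytes`, the choice of the mean as the fill value, and the bin size are all assumptions made for the example, not details taken from the patent.

```python
from statistics import mean

def fill_blanks(records, field):
    """Replace missing values (None) in `field` with the mean of the known values."""
    known = [r[field] for r in records if r[field] is not None]
    default = mean(known) if known else 0.0
    return [{**r, field: default if r[field] is None else r[field]} for r in records]

def deduplicate(records):
    """Drop exact duplicate records, keeping the first occurrence."""
    seen, unique = set(), []
    for r in records:
        key = tuple(sorted(r.items()))  # field names are unique, so this sorts by name
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

def smooth_by_bin_means(values, bin_size):
    """Equal-frequency binning: sort, split into bins, replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]
        smoothed.extend([mean(chunk)] * len(chunk))
    return smoothed

if __name__ == "__main__":
    # Hypothetical traffic records from two sources, with one duplicate and one blank.
    records = [
        {"src": "10.0.0.1", "bytes": 120},
        {"src": "10.0.0.1", "bytes": 120},
        {"src": "10.0.0.2", "bytes": None},
        {"src": "10.0.0.3", "bytes": 80},
    ]
    cleaned = fill_blanks(deduplicate(records), "bytes")
    print(cleaned)
    print(smooth_by_bin_means([4, 8, 15, 21, 21, 24], bin_size=3))
```

De-duplicating before filling keeps repeated records from skewing the mean used as the fill value. Whether that ordering matches the staged cleaning of the invention is not stated in this part of the text.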
Existing data cleaning and fusion techniques include supplementing missing values, processing interference data, and integrating multi-source heterogeneous data. Because network security data has many sources and a wide range of applications, reasonably fusing it in order to classify network security events is an important task: network security state data from multi-source heterogeneous sources must be collected and fused for network security diagnosis so that the overall network security situation can be grasped.
Disclosure of Invention
The invention aims to provide a network attack classification method based on network security data cleaning fusion. First, data cleaning is carried out in stages according to the characteristics of network security data; then the data in the network security field is cleaned and fused using the traditional D-S evidence theory combined with a Markov model, and the class of each network attack is identified, thereby realizing awareness of the overall network security situation. The technical scheme adopted by the invention is as follows: a network attack classification method based on network security data cleaning fusion comprises the following steps: 1) Collecting multi-source heterogeneous network security data, cleaning the data, combining the data