CN-122027342-A - Vehicle-mounted CAN intrusion detection model training data quality evaluation method and device
Abstract
The invention relates to the technical field of data processing, and in particular to a method and device for evaluating the quality of training data for a vehicle-mounted CAN intrusion detection model. The method comprises: acquiring a training reference data set for the vehicle-mounted CAN network intrusion detection model and dividing it into a variable CAN ID data set and a fixed-value CAN ID data set; extracting a reference reconstruction error for the variable CAN ID data set; inputting the training data to be evaluated into a change-rule feature learning model to obtain the reconstruction error to be evaluated, and calculating a variable-type overall quality evaluation index; constructing a reference library for the fixed-value CAN ID data set, calculating the statistical deviation of the fixed-value CAN ID data set relative to the reference library, and calculating a fixed-value-type overall quality evaluation index; and performing product aggregation on the variable-type and fixed-value-type overall quality evaluation indexes to obtain a comprehensive quality evaluation index, from which the data quality is analyzed. The invention can improve the accuracy and efficiency of data quality evaluation.
Inventors
- XIA XIAOFENG
- LI QIMIN
- LI GUANGYI
- WEN YUXUAN
- WANG PENGCHENG
- SANG JUN
- CAI BIN
- HU HAIBO
- XIANG HONG
Assignees
- Chongqing University (重庆大学)
Dates
- Publication Date
- 20260512
- Application Date
- 20260325
Claims (10)
- 1. A method for evaluating the quality of training data for a vehicle-mounted CAN intrusion detection model, characterized by comprising the following steps: acquiring a training reference data set for the vehicle-mounted CAN network intrusion detection model, and dividing the training reference data set into a variable CAN ID data set and a fixed-value CAN ID data set according to the data change characteristics of each CAN identifier (ID) over consecutive periods; for the variable CAN ID data set, extracting a reference reconstruction error of the variable CAN ID data set by using a change-rule feature learning model based on a long short-term memory (LSTM) network autoencoder; acquiring the training data of the vehicle-mounted CAN intrusion detection model to be evaluated, inputting the variable CAN ID data set to be evaluated in that training data into the change-rule feature learning model to obtain a reconstruction error to be evaluated, calculating the relative ratio of the reconstruction error to be evaluated to the reference reconstruction error to obtain a quality score, assigning a weight to each item of variable CAN ID data in the variable CAN ID data set to be evaluated according to a preset information entropy strategy, and performing weighted aggregation to obtain a variable-type overall quality evaluation index; for the fixed-value CAN ID data set, constructing a reference library of the fixed-value CAN ID data set, calculating the statistical deviation, relative to the reference library, of the fixed-value CAN ID data set in the training data of the vehicle-mounted CAN intrusion detection model to be evaluated, and calculating a fixed-value-type overall quality evaluation index by using binarization judgment and cumulative multiplication logic; and performing product aggregation on the variable-type overall quality evaluation index and the fixed-value-type overall quality evaluation index to obtain a comprehensive quality evaluation index, and analyzing the data quality of the training data of the vehicle-mounted CAN intrusion detection model to be evaluated according to the comprehensive quality evaluation index.
- 2. The method for evaluating the quality of training data for a vehicle-mounted CAN intrusion detection model according to claim 1, further comprising, before dividing the training reference data set into a variable CAN ID data set and a fixed-value CAN ID data set according to the data change characteristics of each CAN identifier (ID) over consecutive periods, preprocessing the training reference data set: reading the training reference data set of the vehicle-mounted CAN network intrusion detection model, and deleting redundant RTR (remote transmission request) data from it; traversing each CAN data message and, when the valid data segment indicated by the data length code of a CAN ID is shorter than eight bytes, filling the empty fields of the message's data segment with hexadecimal zeros, and outputting a primary processed message set containing complete eight-byte hexadecimal data segments; uniformly converting the complete eight-byte hexadecimal data segments in the primary processed message set into decimal integers, and converting the corresponding timestamp fields into floating-point numbers; for each distinct CAN ID contained in the primary processed message set, stacking the decimal integer messages belonging to the same CAN ID in order of their floating-point timestamps; removing the timestamp field after stacking so that only the message order is retained, and obtaining a common time sequence for each specific CAN ID; analyzing how the data field of each CAN ID changes over all consecutive periods in the common time sequence: if the decimal integers of the data field of a CAN ID are identical in all consecutive periods, judging the CAN ID to be a fixed-value CAN ID, and otherwise judging the CAN ID to be a variable CAN ID; for a fixed-value CAN ID, directly extracting the eight fixed decimal data segments from the corresponding common time sequence to construct a fixed-value reference library in the form of a two-dimensional matrix; for the variable CAN IDs, constructing a set of positive integers as sequence identifiers and outputting the corresponding common time sequences to a normalization stage; obtaining the common time sequence of each variable CAN ID, calculating the maximum and minimum values of the current sequence on each of the eight data segments, introducing a small constant to avoid division by zero, linearly mapping each decimal integer in the sequence to the closed interval from zero to one by min-max normalization, and outputting the normalized CAN time sequence; slicing the normalized CAN time sequence in time order using a sliding window with a length of twenty communication periods and a step of one to obtain a CAN sequence sample set composed of numerical matrices of twenty rows and eight columns; and dividing the CAN sequence sample set into a training set and a test set in time order.
- 3. The method for evaluating the quality of training data for a vehicle-mounted CAN intrusion detection model according to claim 1, wherein extracting the reference reconstruction error of the variable CAN ID data set by using the change-rule feature learning model based on the long short-term memory (LSTM) network autoencoder comprises: inputting the training set into the change-rule feature learning model for feature learning, training the model parameters of the change-rule feature learning model with the objective of minimizing the mean squared error between the input sequence and the reconstructed sequence, and obtaining a trained change-rule feature learning model after training is completed; inputting the test set into the trained change-rule feature learning model, and calculating a reconstruction error for each sequence sample in the test set; and extracting the mean values of the reconstruction errors of the test set, sorting them in ascending order, taking the 95th percentile of the sorted values as the reference reconstruction error, and storing the trained change-rule feature learning model and the corresponding reference reconstruction error in a feature learning model library.
- 4. The method for evaluating the quality of training data for a vehicle-mounted CAN intrusion detection model according to claim 1 or 3, wherein inputting the variable CAN ID data set to be evaluated in the training data of the vehicle-mounted CAN intrusion detection model to be evaluated into the change-rule feature learning model to obtain a reconstruction error to be evaluated, and calculating the relative ratio of the reconstruction error to be evaluated to the reference reconstruction error to obtain a quality score, comprises the following steps: extracting the variable CAN ID data set to be evaluated from the training data of the vehicle-mounted CAN intrusion detection model to be evaluated, and matching, in the feature learning model library, the trained change-rule feature learning model corresponding to each CAN ID; inputting the variable CAN ID data set to be evaluated into the matched change-rule feature learning model to reconstruct the sequence, and calculating the average mean squared error between the input sequence and the reconstructed sequence as the reconstruction error to be evaluated; dividing the reconstruction error to be evaluated by the mean value in the corresponding reference reconstruction error to obtain a relative error; and mapping the relative error to the closed interval from 0 to 1 with a nonlinear mapping function to obtain a mapped value, and taking the mapped value as the quality score of the corresponding variable CAN ID to be evaluated.
- 5. The method for evaluating the quality of training data for a vehicle-mounted CAN intrusion detection model according to claim 1, wherein assigning a weight to each item of variable CAN ID data in the variable CAN ID data set to be evaluated according to a preset information entropy strategy comprises: extracting the quality scores of all variable CAN IDs in the variable CAN ID data set to be evaluated, and normalizing the quality scores to obtain normalized quality scores; dividing the normalized quality scores into a preset number of equal-width intervals, counting how often the quality score of each variable CAN ID falls into each equal-width interval, and calculating the probability distribution over the equal-width intervals from these frequencies; calculating the actual information entropy of each variable CAN ID from the probability distribution over the equal-width intervals, and calculating the maximum possible information entropy, attained when the probability distribution is completely uniform over all equal-width intervals; calculating the relative deviation between the actual information entropy and the maximum possible information entropy of each variable CAN ID, and taking this deviation as a difference coefficient; and normalizing the difference coefficients of all variable CAN IDs to obtain the weight of each item of variable CAN ID data.
- 6. The method for evaluating the quality of training data for a vehicle-mounted CAN intrusion detection model according to claim 1, wherein calculating the statistical deviation, relative to the reference library, of the fixed-value CAN ID data set in the training data of the vehicle-mounted CAN intrusion detection model to be evaluated, and calculating the fixed-value-type overall quality evaluation index by using binarization judgment and cumulative multiplication logic, comprises: comparing the data sequence of a fixed-value CAN ID in the training data of the vehicle-mounted CAN intrusion detection model to be evaluated with the reference data in the reference library, and calculating the mean standard deviation between the data sequence and the reference data as the statistical deviation; when the mean standard deviation is zero, that is, the data field of the CAN ID in the training data is consistent with the reference data, assigning the fixed-value CAN ID a quality score of 1; when the mean standard deviation is greater than zero, that is, the training data of the vehicle-mounted CAN intrusion detection model has changed, assigning the fixed-value CAN ID a quality score of 0; and extracting the quality scores of all fixed-value CAN IDs, multiplying them together, and outputting the product as the fixed-value-type overall quality evaluation index.
- 7. The method for evaluating the quality of training data for a vehicle-mounted CAN intrusion detection model according to claim 1, wherein performing product aggregation on the variable-type overall quality evaluation index and the fixed-value-type overall quality evaluation index to obtain the comprehensive quality evaluation index comprises: multiplying the value of the variable-type overall quality evaluation index by the value of the fixed-value-type overall quality evaluation index to obtain the comprehensive quality evaluation index; when the fixed-value-type overall quality evaluation index is 1, the value of the comprehensive quality evaluation index equals the value of the variable-type overall quality evaluation index; and when the fixed-value-type overall quality evaluation index is 0, a one-vote veto mechanism is triggered and the value of the comprehensive quality evaluation index is forced to 0.
- 8. A device for evaluating the quality of training data for a vehicle-mounted CAN intrusion detection model, characterized in that the device is used to implement the method for evaluating the quality of training data for a vehicle-mounted CAN intrusion detection model according to any one of claims 1 to 7, the device comprising: a data preprocessing module, used for acquiring a training reference data set for the vehicle-mounted CAN network intrusion detection model, and dividing the training reference data set into a variable CAN ID data set and a fixed-value CAN ID data set according to the data change characteristics of each CAN identifier (ID) over consecutive periods; a quality evaluation index calculation module, used for extracting, for the variable CAN ID data set, a reference reconstruction error of the variable CAN ID data set by using a change-rule feature learning model based on a long short-term memory (LSTM) network autoencoder, acquiring the training data of the vehicle-mounted CAN intrusion detection model to be evaluated, inputting the variable CAN ID data set to be evaluated in that training data into the change-rule feature learning model to obtain the reconstruction error to be evaluated, calculating the relative ratio of the reconstruction error to be evaluated to the reference reconstruction error to obtain a quality score, assigning a weight to each item of variable CAN ID data in the variable CAN ID data set to be evaluated according to a preset information entropy strategy, and performing weighted aggregation to obtain a variable-type overall quality evaluation index; and a quality evaluation module, used for performing product aggregation on the variable-type overall quality evaluation index and the fixed-value-type overall quality evaluation index to obtain a comprehensive quality evaluation index, and analyzing the data quality of the training data of the vehicle-mounted CAN intrusion detection model to be evaluated according to the comprehensive quality evaluation index.
- 9. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method for evaluating the quality of training data for a vehicle-mounted CAN intrusion detection model according to any one of claims 1 to 7.
- 10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the vehicle-mounted CAN intrusion detection model training data quality evaluation method according to any one of claims 1 to 7.
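The preprocessing recited in claim 2 can be sketched in Python with NumPy as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names, the value of `EPS`, and the 80/20 split ratio are hypothetical. The eight-byte zero padding, the fixed/variable test, the min-max normalization, and the window of twenty periods with a step of one come from the claim.

```python
import numpy as np

WINDOW = 20   # communication periods per sample (claim 2)
STRIDE = 1    # sliding step (claim 2)
EPS = 1e-8    # small constant against a zero denominator; the value is an assumption


def pad_payload(hex_bytes):
    """Convert hex byte strings to integers and zero-pad the payload to eight bytes."""
    values = [int(b, 16) for b in hex_bytes]
    return values + [0] * (8 - len(values))


def is_fixed_value(sequence):
    """A CAN ID is fixed-value when every frame carries the same eight bytes."""
    return bool((sequence == sequence[0]).all())


def min_max_normalize(sequence):
    """Map each of the eight byte columns linearly into [0, 1]."""
    lo = sequence.min(axis=0)
    hi = sequence.max(axis=0)
    return (sequence - lo) / (hi - lo + EPS)


def sliding_windows(sequence):
    """Cut a (T, 8) sequence into (N, 20, 8) samples, preserving time order."""
    count = (len(sequence) - WINDOW) // STRIDE + 1
    return np.stack([sequence[i * STRIDE:i * STRIDE + WINDOW] for i in range(count)])


def chronological_split(samples, train_fraction=0.8):
    """Split in time order; the 80/20 ratio is an assumption, not from the patent."""
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]


# Usage on a synthetic 25-frame sequence for one variable CAN ID.
frames = (np.arange(25 * 8).reshape(25, 8) % 256).astype(float)
if not is_fixed_value(frames):
    samples = sliding_windows(min_max_normalize(frames))
    train_set, test_set = chronological_split(samples)
    print(samples.shape, train_set.shape, test_set.shape)
```

Per-column normalization is used here because the claim computes the maximum and minimum "on eight data segments"; a single global range would be a different reading.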
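The scoring stage of claims 4 to 7 can be sketched as below. Several details are assumptions rather than disclosed facts: the claims do not name the nonlinear mapping, so `exp(-max(r - 1, 0))` is a hypothetical choice; the entropy of each CAN ID is computed here over per-window quality scores, which is one reading of claim 5; the relative deviation is taken as `1 - H / H_max`; and the "mean standard deviation" of claim 6 is read as the column-wise standard deviation of the observed frames stacked with the reference row. All function names and the sample values are illustrative.

```python
import numpy as np


def quality_score(error, reference):
    """Claim 4: relative error mapped into [0, 1]; the mapping itself is assumed."""
    relative = np.asarray(error, dtype=float) / reference
    return np.exp(-np.maximum(relative - 1.0, 0.0))


def entropy_weights(window_scores, bins=10):
    """Claim 5: weight each CAN ID by how far its score entropy is from the maximum."""
    pooled = np.concatenate(list(window_scores.values()))
    lo, hi = pooled.min(), pooled.max()
    span = (hi - lo) or 1.0                 # guard against identical scores
    h_max = np.log(bins)                    # entropy of a uniform distribution
    coeffs = {}
    for can_id, scores in window_scores.items():
        normalized = (scores - lo) / span
        counts, _ = np.histogram(normalized, bins=bins, range=(0.0, 1.0))
        p = counts / counts.sum()
        p = p[p > 0]                        # 0 * log(0) is taken as 0
        coeffs[can_id] = 1.0 - float(-(p * np.log(p)).sum()) / h_max
    total = sum(coeffs.values())
    if total < 1e-12:                       # all entropies maximal: fall back to uniform
        return {can_id: 1.0 / len(coeffs) for can_id in coeffs}
    return {can_id: c / total for can_id, c in coeffs.items()}


def variable_index(window_errors, references, bins=10):
    """Weighted aggregation of per-ID quality scores (claim 1)."""
    window_scores = {c: quality_score(e, references[c]) for c, e in window_errors.items()}
    weights = entropy_weights(window_scores, bins)
    id_scores = {c: float(quality_score(e.mean(), references[c]))
                 for c, e in window_errors.items()}
    return sum(weights[c] * id_scores[c] for c in id_scores)


def fixed_value_index(observed, reference_library):
    """Claim 6: binary score per fixed-value CAN ID, then cumulative multiplication."""
    index = 1
    for can_id, reference in reference_library.items():
        stacked = np.vstack([reference, observed[can_id]])
        deviation = stacked.std(axis=0).mean()
        index *= 1 if deviation == 0 else 0
    return index


def composite_index(variable, fixed):
    """Claim 7: product aggregation; a fixed index of 0 vetoes the whole batch."""
    return variable * fixed


# Usage with made-up per-window reconstruction errors for two variable CAN IDs.
errors = {"0x1A0": np.array([0.010, 0.012, 0.011, 0.030]),
          "0x2B4": np.array([0.020, 0.021, 0.019, 0.020])}
references = {"0x1A0": 0.015, "0x2B4": 0.025}
library = {"0x3C0": np.array([0, 0, 16, 0, 0, 0, 255, 1])}
observed = {"0x3C0": np.tile(library["0x3C0"], (5, 1))}
print(composite_index(variable_index(errors, references),
                      fixed_value_index(observed, library)))
```

Because the fixed-value index is a product of 0/1 scores, a single deviating fixed-value CAN ID drives the composite index to 0 regardless of how well the variable CAN IDs score, which matches the veto behavior described in claim 7.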
Description
Vehicle-mounted CAN intrusion detection model training data quality evaluation method and device
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a method and device for evaluating the quality of training data for a vehicle-mounted CAN intrusion detection model.
Background
With the rapid development of intelligent connected vehicles, the security of the vehicle-mounted CAN (Controller Area Network) bus is increasingly important. Training an effective in-vehicle network intrusion detection model typically requires a very large amount of benign vehicle-mounted CAN communication data. However, because collecting data in a real vehicle environment is costly and covers only a limited range of scenarios, generating benign vehicle-mounted CAN data with a generative model has become a mainstream trend in the industry. In this process, automatically screening and evaluating the quality of the generated vehicle-mounted CAN data becomes a key step in guaranteeing the training effect of the intrusion detection model.
At present, the prior art for processing and evaluating the quality of vehicle-mounted CAN intrusion detection model training data focuses mainly on two levels: basic data preprocessing and traditional data quality evaluation. For data preprocessing, mainstream schemes usually rely on tools such as Python or MATLAB and can only complete shallow operations such as data cleaning, CAN ID grouping, timestamp sorting, and basic field extraction. For data quality evaluation, conventional methods mostly depend on manual rule checking and simple basic statistical analysis.
However, when the prior art is applied to the complex and changeable communication patterns of a real vehicle-mounted CAN bus, its overall implementation logic exposes several significant shortcomings, and its feasibility in practical engineering is extremely low. First, vehicle-mounted CAN messages include both "fixed-value CAN IDs", which carry the basic communication protocol, and "variable CAN IDs", which carry dynamic physical states. The prior art generally evaluates both with a single set of standards and does not distinguish them in its evaluation logic, so the strictness of the evaluation standards is seriously unbalanced and both the evaluation accuracy and the ability to discriminate quality are low. Second, in the prior art the stages of data preprocessing, feature extraction and learning, and quality inference are independent of one another and lack an end-to-end design. The stages are often connected by manual intervention, which easily introduces human error and makes the evaluation of a single batch time-consuming and therefore costly.
Disclosure of Invention
The invention provides a method and device for evaluating the quality of training data for a vehicle-mounted CAN intrusion detection model, which can improve the accuracy and efficiency of that evaluation.
To achieve the above purpose, the invention provides a method for evaluating the quality of training data for a vehicle-mounted CAN intrusion detection model, comprising the following steps: acquiring a training reference data set for the vehicle-mounted CAN network intrusion detection model, and dividing the training reference data set into a variable CAN ID data set and a fixed-value CAN ID data set according to the data change characteristics of each CAN identifier (ID) over consecutive periods; for the variable CAN ID data set, extracting a reference reconstruction error of the variable CAN ID data set by using a change-rule feature learning model based on a long short-term memory (LSTM) network autoencoder; acquiring the training data of the vehicle-mounted CAN intrusion detection model to be evaluated, inputting the variable CAN ID data set to be evaluated in that training data into the change-rule feature learning model to obtain a reconstruction error to be evaluated, calculating the relative ratio of the reconstruction error to be evaluated to the reference reconstruction error to obtain a quality score, assigning a weight to each item of variable CAN ID data in the variable CAN ID data set to be evaluated according to a preset information entropy strategy, and performing weighted aggregation to obtain a variable-type overall quality evaluation index; for the fixed-value CAN ID data set, constructing a reference library of the fixed-value CAN ID data set, calculating the statistical deviation of the fixed value CAN ID data set relative to the reference library in the training data of the vehicle-mounted CAN intrusio