CN-121996525-A - Software defect prediction method and system based on deep ensemble learning

Abstract

The invention discloses a software defect prediction method and system based on deep ensemble learning, belonging to the fields of machine learning and software engineering. The method comprises: normalizing and partitioning a software defect data set; using principal component analysis to expand the feature dimensions so that graph-shaped data can be constructed; balancing the training set with the SMOTE technique; constructing a three-layer convolutional neural network containing residual blocks and an attention mechanism to extract deep features; introducing random forests as auxiliary predictors that make a prediction from the features extracted at each layer; dynamically calculating a weight for each random forest according to its accuracy and weighting its prediction accordingly; and finally using XGBoost as a meta learner to integrate the weighted predictions and output the defect prediction. By converting one-dimensional features into graph-shaped data, mining latent relationships among the features with deep learning, and combining an adaptive weighted ensemble strategy, the method improves the accuracy of software defect prediction and the generalization capability of the model.
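
The accuracy-based weighting summarized in the abstract can be illustrated with a minimal sketch. The weight formula itself is not preserved in this text, so the code assumes the simplest reading: each weight is the predictor's accuracy divided by the sum of all accuracies. It also assumes that "weighted fusion" means scaling each predictor's defect probability by its weight and keeping one column per predictor as input for the meta learner. The function names and all numbers are hypothetical.

```python
from typing import List, Sequence


def accuracy_weights(accuracies: Sequence[float]) -> List[float]:
    """Turn per-predictor accuracies into weights that sum to 1.

    ASSUMPTION: weight_i = acc_i / sum(acc). The source text does not
    preserve the actual formula.
    """
    total = sum(accuracies)
    if total <= 0:
        raise ValueError("at least one accuracy must be positive")
    return [a / total for a in accuracies]


def weighted_meta_features(probas: Sequence[Sequence[float]],
                           weights: Sequence[float]) -> List[List[float]]:
    """Scale each auxiliary predictor's defect probabilities by its weight.

    probas[k][i] is predictor k's probability that sample i is defective.
    Returns one row per sample and one column per predictor, a table that
    a meta learner could be trained on (ASSUMED fusion layout).
    """
    n_samples = len(probas[0])
    return [[weights[k] * probas[k][i] for k in range(len(probas))]
            for i in range(n_samples)]


# Hypothetical accuracies of three auxiliary predictors (one per CNN layer).
accs = [0.80, 0.85, 0.90]
w = accuracy_weights(accs)

# Hypothetical defect probabilities for two samples from each predictor.
probas = [[0.2, 0.9], [0.3, 0.8], [0.1, 0.7]]
meta_X = weighted_meta_features(probas, w)
```

Under this assumption a more accurate auxiliary predictor always contributes a larger share to the meta learner's input, which is the behavior the abstract describes; other normalizations (for example a softmax over accuracies) would fit the same description.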

Inventors

  • DU YE
  • TANG YU
  • XIE PENG
  • FAN HUAFEI
  • HE YONGZHONG
  • GAO JIANBO
  • LI MEIHONG
  • ZHENG TIANSHUAI

Assignees

  • 北京交通大学唐山研究院
  • 中电科发展规划研究院有限公司
  • 北京交通大学

Dates

Publication Date
2026-05-08
Application Date
2026-01-13

Claims (10)

  1. A software defect prediction method based on deep ensemble learning, characterized by comprising the following steps:
     S1, normalizing an original software defect data set;
     S2, dividing the data set processed in S1 into a training set for model training and a test set for performance evaluation;
     S3, performing feature extraction and expansion on the training set and the test set obtained in S2, respectively, so as to reach the feature dimension required for constructing graph-shaped data;
     S4, oversampling the training set processed in S3 so that the numbers of defective and non-defective samples in the training set are equal;
     S5, converting each sample in the training set processed in S4 and in the test set processed in S3 into a two-dimensional matrix of a preset size to serve as graph-shaped data;
     S6, initializing convolutional neural network model parameters, including the number of input channels and the convolution kernel parameters;
     S7, constructing a three-layer convolutional neural network based on the parameters initialized in S6;
     S8, performing layer-by-layer feature extraction on the graph-shaped data obtained in S5 using the three-layer convolutional neural network constructed in S7, and, after the features of each layer are extracted, inputting them into a corresponding random forest auxiliary predictor for prediction, so as to obtain a plurality of auxiliary prediction results and the corresponding prediction accuracies; and
     S9, calculating the weight of each auxiliary predictor according to its prediction accuracy obtained in S8, performing weighted fusion of the corresponding auxiliary prediction results, inputting the weighted fusion result into a meta learner, and outputting the final software defect prediction result from the meta learner.
  2. The method according to claim 1, wherein in S1 the original software defect data set is processed by min-max normalization, with the specific formula: $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$; wherein $x$ is the value of the original data, $x_{\min}$ is the minimum value of the feature, and $x_{\max}$ is the maximum value of the feature.
  3. The method of claim 1, wherein in S2 the data set is partitioned using five-fold cross-validation, and in each fold the training set accounts for 80% of the data and the test set for 20%.
  4. The method according to claim 1, wherein in S3 the feature extraction and expansion specifically comprises extracting the required number of principal component features by principal component analysis and concatenating the extracted principal component features with the original features, so that the total number of features of each sample reaches the number required for constructing a two-dimensional matrix of the preset size.
  5. The method of claim 1, wherein in S4 the oversampling specifically comprises applying the synthetic minority oversampling technique (SMOTE) to the defective samples in the training set to generate new synthetic samples until the number of defective samples in the training set equals the number of non-defective samples.
  6. The method according to claim 1, wherein in S5 the two-dimensional matrix of preset size is an n × n square matrix, where n is an integer greater than 1.
  7. The method of claim 1, wherein in S7 two residual blocks are added after the first layer and the second layer of the three-layer convolutional neural network, an attention mechanism module is integrated in each residual block, and the attention mechanism module calculates attention weights through a 1×1 convolution kernel and weights the output of the residual block.
  8. The method according to claim 1, wherein in S8 a random forest is selected as the auxiliary predictor; after the convolutional neural network extracts the features of each layer, the extracted features are predicted using the random forest, and the accuracy of each auxiliary predictor is then calculated and a weight is obtained from it. The weight of each auxiliary predictor is calculated by a formula (not preserved in this text) whose terms are the weights of the three auxiliary predictors and the prediction accuracies of the three auxiliary predictors, respectively.
  9. The method of claim 1, wherein in S9 the meta learner is an XGBoost model.
  10. A software defect prediction system, comprising: a data preprocessing module for executing steps S1 to S5 of any one of claims 1 to 9, receiving an original software defect data set, and outputting a training set and a test set in graph-shaped data format; a model construction and training module for executing steps S6 to S8 of any one of claims 1 to 9, constructing and training, based on the graph-shaped training set data, a deep ensemble learning model comprising a three-layer convolutional neural network and a plurality of random forest auxiliary predictors; and a prediction execution module for executing step S9 of any one of claims 1 to 9, processing the graph-shaped test set data using the trained deep ensemble learning model, and outputting a software defect prediction result.
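
Part of the data preprocessing recited in the claims above (min-max normalization in S1, oversampling in S4, conversion to an n × n matrix in S5) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the patented implementation: the oversampler is a simplified SMOTE-style interpolation that uses only the single nearest minority neighbour, the PCA-based feature expansion of S3 is skipped by assuming the feature count is already a perfect square, and all function names and sample values are hypothetical.

```python
import math
import random
from typing import List

Row = List[float]


def min_max_normalize(rows: List[Row]) -> List[Row]:
    """Scale every feature column to [0, 1] (step S1)."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def smote_like_oversample(minority: List[Row], n_new: int,
                          rng: random.Random) -> List[Row]:
    """Create n_new synthetic rows by interpolating between a minority
    sample and its nearest minority neighbour (simplified SMOTE, k = 1)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbour = min((r for r in minority if r is not base),
                        key=lambda r: math.dist(r, base))
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (nb - b)
                          for b, nb in zip(base, neighbour)])
    return synthetic


def to_square_matrix(features: Row) -> List[Row]:
    """Reshape a length n*n feature vector into an n x n matrix (step S5)."""
    n = math.isqrt(len(features))
    if n < 2 or n * n != len(features):
        raise ValueError("feature count must be a perfect square >= 4")
    return [features[i * n:(i + 1) * n] for i in range(n)]


# Hypothetical data set: three modules, four metrics each.
rng = random.Random(0)
data = [[1.0, 10.0, 3.0, 0.0],
        [2.0, 20.0, 5.0, 1.0],
        [3.0, 30.0, 7.0, 2.0]]
norm = min_max_normalize(data)
extra = smote_like_oversample(norm[:2], n_new=3, rng=rng)
grid = to_square_matrix(norm[1])
```

In a setting that follows claim 5 more closely, `n_new` would be chosen as the difference between the non-defective and defective sample counts in the training set, so that both classes end up the same size.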

Description

Software defect prediction method and system based on deep ensemble learning

Technical Field

The invention relates to the technical field of machine learning, in particular to a software defect prediction method and system based on deep ensemble learning.

Background

As software systems grow in scale and structural complexity, hidden defects have become a key factor affecting their quality and reliability. Software defects can lead to serious security vulnerabilities, economic losses and even catastrophic consequences. Actively and accurately identifying potentially defective modules before the testing stage or before release is therefore of great significance for allocating limited testing resources sensibly, controlling development cost and improving software quality.

Software defect prediction aims to build a prediction model from historical metric data of software modules (such as code complexity and the number of code changes) in order to judge whether a new module is defect-prone. Traditional methods rely mainly on shallow machine learning models such as logistic regression, support vector machines and decision trees. These methods have significant limitations. First, complex nonlinear relationships and latent correlations often exist between software metric features, and shallow models struggle to mine and exploit this deep information. Second, software defect data often suffers from severe class imbalance (non-defective samples far outnumber defective ones), which biases model predictions. Finally, the generalization capability of a single model is limited, and its performance is unstable across different projects or data sets.

In recent years, deep learning methods, particularly convolutional neural networks (CNNs), have achieved great success in fields such as image and speech processing thanks to their strong automatic feature extraction and representation learning capabilities. Some studies have begun to introduce them into software defect prediction, reshaping one-dimensional feature vectors into a two-dimensional matrix (for example, in the form of an image) so that a CNN can learn high-level abstract patterns among the features. However, applying a CNN directly to such structured data still faces problems of vanishing gradients, overfitting, and how to effectively fuse feature information from different levels. Meanwhile, ensemble learning methods (such as Stacking) have been shown to improve the generalization and robustness of a model by combining the predictions of multiple base learners, but traditional ensemble strategies usually apply simple averaging or voting to the outputs of the base learners and cannot fully account for how performance differences among the base learners should contribute to the final decision.

The prior art therefore still suffers from insufficient mining of deep relationships among features, sensitivity to class imbalance, and limited model generalization. There is a need for a new software defect prediction scheme that deeply integrates the advantages of deep feature extraction and adaptive ensemble learning.

Disclosure of Invention

The invention aims to overcome the shortcomings of existing software defect prediction methods, namely insufficient feature mining capability, weak generalization capability and sensitivity to imbalanced data, and provides a software defect prediction method and system based on deep ensemble learning.

In a first aspect, the present invention provides a software defect prediction method based on deep ensemble learning, including the following steps: S1, normalizing an original software defect data set; S2, dividing the data set processed in S1 into a training set for model training and a test set for performance evaluation; S3, performing feature extraction and expansion on the training set and the test set obtained in S2, respectively, so as to reach the feature dimension required for constructing graph-shaped data; S4, oversampling the training set processed in S3 so that the numbers of defective and non-defective samples in the training set are equal; S5, converting each sample in the training set processed in S4 and in the test set processed in S3 into a two-dimensional matrix of a preset size to serve as graph-shaped data; S6, initializing convolutional neural network model parameters, including the number of input channels and the convolution kernel parameters; S7, constructing a three-layer convolutional neural network based on the parameters initialized in S6; S