CN-121980264-A - Undersampling and software defect prediction method based on risk perception layering


Abstract

The invention discloses an undersampling and software defect prediction method based on risk perception layering. The method first uses a base classification model together with SHAP values to quantify the contribution of each feature in the original dataset to defect prediction, establishing global feature importance weights. It then bins the sample features and combines local defect rates with feature importance to compute, for each majority-class sample, a continuous risk score representing its information value. The majority-class samples are then divided into several risk levels according to their risk scores, and random sampling is performed within each level to construct a balanced dataset. Training a prediction model on the balanced dataset yields a more comprehensive and effective software defect prediction model that can predict defects in unseen samples, significantly improving the accuracy of defect prediction and supporting software quality assurance and maintenance efficiency.

Inventors

WEI DAN
WANG AOYING
WANG XINGQI
CHEN BIN

Assignees

Hangzhou Dianzi University (杭州电子科技大学)

Dates

Publication Date: 20260505
Application Date: 20260119

Claims (9)

1. An undersampling method based on risk perception layering, characterized by: constructing a base classification model; quantifying, using SHAP values, the global importance of each feature in the original software defect prediction dataset to the prediction result; removing noise features according to the importance; performing adaptive binning on the retained features based on feature values; combining the global importance of each feature with the proportion of minority-class samples in each bin to generate the weight of each bin; dividing all majority-class samples, according to their continuous risk scores, into a plurality of risk levels containing the same number of samples by quantile binning; independently and randomly sampling the same number of majority-class samples within each risk level according to a set balance ratio; merging the sampling result with the minority-class samples of the original software defect prediction dataset; and outputting a balanced software defect prediction dataset.
2. The undersampling method based on risk perception layering according to claim 1, wherein the base classification model is a tree ensemble model.
3. The undersampling method based on risk perception layering according to claim 1, characterized in that the global importance of each feature is computed as [formula omitted in source], where n denotes the total number of samples in the original software defect prediction dataset and the remaining term denotes the SHAP value of the feature in the i-th sample.
4. The undersampling method based on risk perception layering according to claim 3, characterized in that the weight of the k-th bin of a feature, obtained by binning on feature values, is computed as [formula omitted in source], where two of the terms denote the actual defect rate and the relative defect rate of the bin, the actual defect rate being the ratio of the number of minority-class samples in the bin to the total number of samples in the bin; the other terms denote the global defect rate and the total number of minority-class samples in the original software defect prediction dataset.
5. The undersampling method based on risk perception layering according to claim 1, characterized in that the continuous risk score of a majority-class sample is computed as [formula omitted in source], where the terms denote, respectively, the number of features remaining after the noise features are removed, the value of a given feature in the sample, the index of the bin to which that value belongs, and the weight corresponding to that bin.
6. The undersampling method based on risk perception layering according to claim 5, characterized in that the continuous risk score is normalized to the interval [0, 1] by min-max scaling.
7. The undersampling method based on risk perception layering according to claim 1, characterized in that the number of samples drawn within each risk level is computed as [formula omitted in source], where the formula uses a rounding operation, one term denotes the number of samples in the risk level, and max() takes the maximum value; the balance ratio is expressed in terms of the number of minority-class samples and the number of majority-class samples in the original software defect prediction dataset.
8. A computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any of claims 1 to 7.
9. A software defect prediction method based on risk perception layering undersampling, characterized in that the undersampling method based on risk perception layering is used to undersample the majority-class samples in a software defect prediction dataset; the sampled majority-class samples are then combined with the original minority-class sample set to form a balanced training dataset; the balanced training dataset is used to train a software defect prediction model; and the trained software defect prediction model is used to predict new samples.
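As an illustration of the stratified sampling recited in claims 1, 6 and 7, the following is a minimal Python sketch, not the patented implementation. It assumes the continuous risk scores have already been computed. The function names, the `seed` parameter, the reading of the balance ratio as "majority samples kept per minority sample", and the per-level quota `max(1, ceil(ratio * N_min / L))` are all assumptions, since the exact formula of claim 7 is not reproduced in the source.

```python
import math
import random
from typing import List, Sequence


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Scale risk scores to [0, 1] (claim 6); constant input maps to 0.0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def quantile_levels(scores: Sequence[float], num_levels: int) -> List[List[int]]:
    """Equal-frequency split: lists of sample indices, lowest risk level first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    bounds = [(k * n) // num_levels for k in range(num_levels + 1)]
    return [order[bounds[k]:bounds[k + 1]] for k in range(num_levels)]


def stratified_undersample(majority, minority, scores,
                           num_levels=5, balance_ratio=1.0, seed=0):
    """Draw the same number of majority samples from every risk level,
    then merge them with all minority samples."""
    rng = random.Random(seed)
    levels = quantile_levels(min_max_normalize(scores), num_levels)
    # Assumed quota: spread the target majority count evenly over the levels.
    target = balance_ratio * len(minority)
    quota = max(1, math.ceil(target / num_levels))
    kept = []
    for level in levels:
        # Never request more samples than the level contains.
        kept.extend(rng.sample(level, min(quota, len(level))))
    return [majority[i] for i in sorted(kept)] + list(minority)


# Toy data: 20 majority samples whose risk score equals their index.
majority = [f"maj{i}" for i in range(20)]
minority = [f"min{i}" for i in range(4)]
scores = [float(i) for i in range(20)]
balanced = stratified_undersample(majority, minority, scores,
                                  num_levels=4, balance_ratio=1.0, seed=1)
```

With these toy inputs each of the four risk levels contributes one majority sample, so the balanced set holds four majority and four minority samples. Which sample is drawn from a level depends on the seed.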

Description

Undersampling and software defect prediction method based on risk perception layering

Technical Field

The invention belongs to the technical field of software engineering and artificial intelligence, and particularly relates to an undersampling and software defect prediction method based on risk perception layering.

Background

Software defect prediction (SDP) is a core task in software engineering, aimed at automatically identifying potentially defective modules by analyzing historical data with machine learning models. SDP is not only a key means of guaranteeing software quality, but also an important tool for improving development efficiency, supporting decision making, and making software engineering more intelligent. However, in actual software development, defective samples are naturally rare while non-defective samples dominate, so the datasets exhibit significant class imbalance. This imbalanced distribution biases model training towards the majority class, reducing predictive performance.
Existing methods for mitigating class imbalance mainly comprise data resampling, cost-sensitive learning, and ensemble learning. Cost-sensitive learning penalizes misclassification of the minority class by adjusting the weights of the loss function, but these weights are difficult to quantify precisely under complex distributions. Ensemble learning improves classification performance by combining multiple base learners, but its high computational overhead limits its application to large-scale data. Data resampling techniques are further divided into oversampling and undersampling. Oversampling increases the proportion of the minority class by synthesizing new samples, but it often induces overfitting by introducing artificial noise.
Undersampling balances the dataset by deleting majority-class samples; it can significantly reduce computational cost, introduces no noise into the minority class, and therefore has distinct practical value. However, conventional undersampling methods may distort the original data distribution by removing samples blindly, shifting the decision boundary and reducing generalization ability. Although improved methods attempt to optimize undersampling through ranking, they still suffer from distortion of the majority-class distribution, high computational complexity, and poor interpretability. Thus, there remains a need for an undersampling method that effectively balances the class distribution while maintaining data integrity.

Disclosure of Invention

To address the shortcomings of the prior art, the invention provides an undersampling and software defect prediction method based on risk perception layering, which uses SHAP values to analyze feature importance, combines them with the defect distribution pattern to quantify the potential cost of erroneously removing each majority-class sample, and performs stratified sampling according to risk scores, thereby avoiding blind deletion and distribution distortion.
An undersampling method based on risk perception layering specifically comprises the following steps:
Step 1, acquiring a software defect prediction dataset comprising feature vectors and class labels; according to the class labels, assigning non-defective samples to the majority-class sample set and defective samples to the minority-class sample set; and preprocessing the data.
Step 2, constructing a base classification model, quantifying the global importance of each feature in the whole dataset to the prediction result using SHAP (SHapley Additive exPlanations) values, and removing noise features according to the importance.
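Steps 1 and 2 can be illustrated with a minimal Python sketch under stated assumptions. The SHAP values are taken as an already computed matrix (one row per sample, one column per feature), as an explainer for the base model might return. Aggregating them by the mean absolute value is a common convention and is assumed here, since the source does not reproduce its formula. The function names and the fixed `threshold` for noise features are hypothetical.

```python
from typing import List, Sequence, Tuple


def split_by_label(samples: Sequence, labels: Sequence[int]) -> Tuple[list, list]:
    """Step 1: non-defective (label 0) samples form the majority set,
    defective (label 1) samples form the minority set."""
    majority = [x for x, y in zip(samples, labels) if y == 0]
    minority = [x for x, y in zip(samples, labels) if y == 1]
    return majority, minority


def global_importance(shap_values: Sequence[Sequence[float]]) -> List[float]:
    """Step 2 (assumed aggregation): mean absolute SHAP value per feature."""
    n = len(shap_values)
    m = len(shap_values[0])
    return [sum(abs(row[j]) for row in shap_values) / n for j in range(m)]


def retained_features(importance: Sequence[float], threshold: float) -> List[int]:
    """Indices of features whose importance reaches the hypothetical threshold;
    the rest are treated as noise features and dropped."""
    return [j for j, imp in enumerate(importance) if imp >= threshold]


# Toy SHAP matrix: 4 samples x 3 features.
shap_values = [
    [0.40, -0.01, 0.10],
    [-0.30, 0.02, -0.20],
    [0.50, 0.00, 0.10],
    [-0.40, -0.01, -0.20],
]
importance = global_importance(shap_values)
kept = retained_features(importance, threshold=0.05)
```

In this toy matrix the second feature has a mean absolute SHAP value of about 0.01, so it falls below the threshold and only the first and third features are retained. How the threshold is chosen is not stated in this part of the source.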
Step 3, performing adaptive binning on the retained features based on feature values, calculating the relative defect rate of each bin, generating bin weights by combining the relative defect rate with the feature importance quantified by the SHAP values, and aggregating the weights of the bins to which the feature values of each majority-class sample belong, so as to compute a continuous risk score for each majority-class sample.
Step 4, dividing all majority-class samples into a plurality of risk levels by quantile binning according to the computed continuous risk scores, each risk level containing approximately the same number of samples so as to preserve the spatial distribution of the samples.
Step 5, calculating the number of samples to be retained in each risk level according to the set balance ratio, sampling independently and randomly within each level, merging the sampling result with the minority-class sample set, and outputting a balanced software defect prediction dataset.
A software defect prediction method based on risk perception layering undersampling uses the undersampling method to sample the majority-class sample set in a software defect prediction dataset, then combines the sampled majority-class sample set with an origin