CN-121996563-A - Defect severity prediction method based on developer modification history


Abstract

The invention discloses a defect severity prediction method based on developer modification history, belonging to the technical field of software development. The method comprises: extracting modification history data and defect report data to construct an original data set; preprocessing the original data set and establishing a standard data set according to the file-defect-developer mapping relation; constructing a developer-file weighted directed network from the standard data set; obtaining a file-file weighted directed network from the developer-file weighted directed network by weighted projection; calculating a structural entropy score for each file node of the file-file weighted directed network; extracting software metrics for each source code file; constructing a comprehensive feature set by combining the structural entropy scores, the software metrics, and the defect labels; and training and validating a machine learning model on the comprehensive feature set to predict the defect severity of files. By drawing on developer modification history, the invention improves the accuracy and stability of defect severity prediction.

Inventors

Mao Mo
Lang Zhuowei
Jiang Bo
Zhu Erluan
Wang Jiale
Pan Weifeng

Assignees

Zhejiang Gongshang University (浙江工商大学)

Dates

Publication Date: 20260508
Application Date: 20260409

Claims (10)

1. A defect severity prediction method based on developer modification history, comprising: Step 1, extracting modification history data and defect report data of project developers, and constructing an original data set; Step 2, preprocessing the original data set, and establishing a standard data set according to a file-defect-developer mapping relation; Step 3, based on the file-defect-developer mapping relation, taking developers and files as network nodes, establishing weighted directed edges between each developer and the files that developer modified, calculating the edge weights of the weighted directed edges, and generating a weighted directed network based on the edge weights; Step 4, constructing a file-developer adjacency matrix based on the weighted directed network, performing bipartite network projection based on a weighted resource allocation strategy, calculating the resources of the developer nodes, distributing the resources of the developer nodes equally to the associated file nodes, calculating the resources of the file nodes, calculating the file-file collaboration strength based on the resources of the developer nodes and the resources of the file nodes, and constructing a file-file weighted directed network; Step 5, calculating the out-degree and in-degree of the file nodes based on the file-file collaboration strength of the file-file weighted directed network, calculating a local probability for each file node, calculating the initial entropy of the file nodes based on the local probability, and propagating and superposing the initial entropy to obtain the final structural entropy score of the file nodes; Step 6, extracting software metrics of each source code file; and Step 7, constructing a comprehensive feature set by combining the structural entropy score, the software metrics, and the defect label, and training and validating a machine learning model on the comprehensive feature set to predict the defect severity of files.
2. The defect severity prediction method based on developer modification history according to claim 1, wherein in step 1, the modification history data of the project developers is extracted as follows: all commit records are extracted from the Git repository of the version control system, and the commit hash, committer name, committer email, commit time, commit message, and modified file paths of each commit are parsed.
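The commit extraction of claim 2 could be sketched as follows. This is only an illustration under stated assumptions: the separators, the `Commit` record, and the function name `parse_git_log` are hypothetical, and the text to parse is assumed to come from `git log --name-only --pretty=format:<GIT_FORMAT>`.

```python
from dataclasses import dataclass, field

# Hypothetical separators: control characters that should not occur in commit text.
RECORD_SEP = "\x1e"
FIELD_SEP = "\x1f"

# Format string for: git log --name-only --pretty=format:<GIT_FORMAT>
# %H hash, %cn committer name, %ce committer email, %cI ISO 8601 date, %s subject.
GIT_FORMAT = "%x1e" + "%x1f".join(["%H", "%cn", "%ce", "%cI", "%s"])


@dataclass
class Commit:
    commit_hash: str
    committer_name: str
    committer_email: str
    commit_time: str
    message: str
    files: list = field(default_factory=list)


def parse_git_log(raw: str) -> list:
    """Split the raw log text into Commit records with their modified file paths."""
    commits = []
    for record in raw.split(RECORD_SEP):
        if not record.strip():
            continue
        # The first line carries the header fields; the remaining lines are file paths.
        header, _, rest = record.partition("\n")
        commit_hash, name, email, time, message = header.split(FIELD_SEP)
        files = [line.strip() for line in rest.splitlines() if line.strip()]
        commits.append(Commit(commit_hash, name, email, time, message, files))
    return commits
```

Using control characters as separators keeps the parser simple even when commit subjects contain commas, pipes, or quotes.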
3. The defect severity prediction method based on developer modification history according to claim 2, wherein in step 1, the defect report data is extracted as follows: the defect number, priority, creation time, resolution time, affected version, and fix version are extracted from the JIRA platform of the defect management system.
4. The defect severity prediction method based on developer modification history according to claim 3, wherein in step 2, the original data set is preprocessed and a standard data set is established according to the file-defect-developer mapping relation as follows: Step 201, for the modification history data, retaining only .java source code files whose status is modified, and removing test, configuration, and documentation files; Step 202, for the defect report data, retaining only defect records whose type is defect, whose status is closed or resolved, and whose resolution is fixed, and removing duplicate, invalid, and unfixed items; Step 203, unifying the defect numbers into the format project name-number, removing abnormal characters from committer names, aligning the columns of the affected version and fix version fields, separating multi-version values with semicolons, and retaining only the defect number, priority, creation time, resolution time, committer name, and modified file path, so as to ensure consistent data formats for cross-source matching; and Step 204, performing an inner join of the modification history data and the defect report data with the defect number as the join key, retaining only matched records, and generating the file-defect-developer mapping relation together with the commit hash and file path to support subsequent network construction and weight calculation, thereby producing the standard data set.
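Steps 201 to 204 can be illustrated with a minimal sketch. The dictionary keys, the JIRA field values (`"Bug"`, `"Closed"`, `"Resolved"`, `"Fixed"`), the `/test/` path rule, and the assumption that the defect number appears in the commit message are all hypothetical choices made for the example, not details given in the claim.

```python
import re

KEPT_STATUSES = {"Closed", "Resolved"}


def normalize_defect_id(raw_id: str, project: str):
    """Step 203: unify defect numbers into the 'PROJECT-number' format."""
    match = re.search(r"\d+", raw_id)
    return f"{project.upper()}-{int(match.group())}" if match else None


def is_source_java(path: str) -> bool:
    """Step 201: keep .java files and drop anything under a test directory."""
    lowered = "/" + path.lower()
    return lowered.endswith(".java") and "/test/" not in lowered


def keep_defect(defect: dict) -> bool:
    """Step 202: keep fixed bugs that are closed or resolved."""
    return (defect["type"] == "Bug"
            and defect["status"] in KEPT_STATUSES
            and defect["resolution"] == "Fixed")


def build_mapping(commits, defects, project):
    """Step 204: inner-join commits and defect reports on the defect number."""
    defects_by_id = {}
    for defect in defects:
        if keep_defect(defect):
            defects_by_id[normalize_defect_id(defect["key"], project)] = defect
    # Assumption: the commit message mentions the defect number it fixes.
    pattern = re.compile(rf"{re.escape(project)}-(\d+)", re.IGNORECASE)
    rows = []
    for commit in commits:
        found = pattern.search(commit["message"])
        if not found:
            continue
        defect_id = f"{project.upper()}-{int(found.group(1))}"
        defect = defects_by_id.get(defect_id)
        if defect is None:
            continue  # inner join: unmatched commits are dropped
        for path in commit["files"]:
            if is_source_java(path):
                rows.append({
                    "defect_id": defect_id,
                    "priority": defect["priority"],
                    "developer": commit["developer"].strip(),
                    "commit_hash": commit["hash"],
                    "file_path": path,
                })
    return rows
```

Each resulting row is one file-defect-developer triple, which is the unit the later network construction works from.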
5. The defect severity prediction method based on developer modification history according to claim 4, wherein in step 3, the edge weight of the weighted directed edge is calculated from the number of modified lines and a time-decay factor, and the calculation formula is: $w_{ij} = \sum_{k \in C_{ij}} \frac{l_{jk}}{l_j^{\max}} \, \theta(\lambda, \Delta t_k)$ (1) where $w_{ij}$ is the edge weight of developer node $d_i$ pointing to file node $f_j$; $t_0$ is the reference time point; $C_{ij}$ is the set of all commits made by developer node $d_i$ to file node $f_j$ before $t_0$; $l_{jk}$ is the number of modified lines of file node $f_j$ in the $k$-th commit; $l_j^{\max}$ is the maximum number of modified lines of file node $f_j$ over all commits before the reference time point $t_0$; $\theta(\lambda, \Delta t_k)$ is the time-decay factor; $\lambda$ is the time-decay coefficient; and $\Delta t_k$ is the time difference between the $k$-th commit and the reference time $t_0$.
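The edge weight of claim 5 can be sketched as follows. The claim names a time-decay factor and a decay coefficient but the form of the factor is not recoverable here, so the exponential decay `exp(-decay * dt)`, the day-based time unit, and the function name `edge_weight` are assumptions made for illustration.

```python
import math


def edge_weight(commits, max_lines, reference_time, decay=0.01):
    """Edge weight of one developer-file pair.

    commits: iterable of (commit_time, modified_lines) for this developer and file.
    max_lines: largest number of modified lines of the file before reference_time.
    Times are plain numbers, e.g. days since an arbitrary origin (assumption).
    """
    weight = 0.0
    if max_lines <= 0:
        return weight
    for commit_time, lines in commits:
        if commit_time > reference_time:
            continue  # commits after the reference time are excluded
        dt = reference_time - commit_time
        # Normalized size of the change, damped by the assumed exponential decay.
        weight += (lines / max_lines) * math.exp(-decay * dt)
    return weight
```

With this choice, a recent large change contributes close to 1, while an old or small change contributes close to 0.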
6. The method of claim 5, wherein in step 3, the weighted directed network comprises developer nodes, file nodes, and edge sets.
7. The defect severity prediction method based on developer modification history according to claim 6, wherein in step 4, a file-developer adjacency matrix is constructed based on the weighted directed network, bipartite network projection is performed based on a weighted resource allocation strategy, the resources of the developer nodes are calculated, the resources of the developer nodes are distributed equally to the associated file nodes, the resources of the file nodes are calculated, the file-file collaboration strength is calculated based on the resources of the developer nodes and the resources of the file nodes, and a file-file weighted directed network is constructed, specifically: based on the developer-file weighted directed network, the set of file nodes is denoted $F = \{f_1, \dots, f_n\}$ and the set of developers is denoted $D = \{d_1, \dots, d_m\}$, thereby forming a bipartite graph $G = (F, D, E)$, and the file-developer adjacency matrix is expressed as: $A = [a_{il}]_{n \times m}$ (2) where $a_{il}$ represents the modification weight of developer node $d_l$ on file node $f_i$, whose value is the edge weight calculated in step 3; if developer node $d_l$ has never modified file node $f_i$, then $a_{il} = 0$; a weighted resource allocation strategy is adopted for the bipartite network projection so as to preserve developer-level collaborative modification information; resources are first allocated from the file nodes to the developer nodes, each file node dividing its initial resource equally among all of its associated developer nodes, and the resource of a developer node is calculated as: $r(d_l) = \sum_{i=1}^{n} \frac{a_{il} \, r_0(f_i)}{k(f_i)}$ (3) where $k(f_i)$ is the weighted degree of file node $f_i$ and $r_0(f_i)$ is the initial resource of file node $f_i$; each developer node then distributes the resource it obtained equally back to its associated file nodes, and the resource of a file node is calculated as: $r'(f_i) = \sum_{l=1}^{m} \frac{a_{il} \, r(d_l)}{k(d_l)}$ (4) where $k(d_l)$ is the weighted degree of developer node $d_l$; combining the resource calculation formula of the developer nodes with that of the file nodes gives: $r'(f_i) = \sum_{j=1}^{n} w_{ij} \, r_0(f_j)$ (5) $w_{ij} = \frac{1}{k(f_j)} \sum_{l=1}^{m} \frac{a_{il} \, a_{jl}}{k(d_l)}$ (6) where $w_{ij}$ represents the collaboration strength between file node $f_i$ and file node $f_j$ based on the developers' historical modification behavior.
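The weighted projection of claim 7 can be illustrated with a short sketch of the standard weighted resource-allocation projection, in which the strength from file j to file i is (1 / k(f_j)) times the sum over developers l of a[i][l] * a[j][l] / k(d_l). The function name `project_files`, the use of weighted degrees as row and column sums, and the decision to drop self-loops are assumptions made for this example.

```python
def project_files(a):
    """Project a file-developer weight matrix onto files.

    a[i][l] is the weight of developer l on file i (0 if never modified).
    Returns w, where w[i][j] is the share of file j's resource that reaches file i.
    """
    n_files = len(a)
    n_devs = len(a[0]) if a else 0
    # Weighted degrees: row sums for files, column sums for developers (assumption).
    file_deg = [sum(row) for row in a]
    dev_deg = [sum(a[i][l] for i in range(n_files)) for l in range(n_devs)]

    w = [[0.0] * n_files for _ in range(n_files)]
    for i in range(n_files):
        for j in range(n_files):
            if i == j or file_deg[j] == 0:
                continue  # self-loops are dropped in this sketch
            shared = sum(
                a[i][l] * a[j][l] / dev_deg[l]
                for l in range(n_devs)
                if dev_deg[l] > 0
            )
            w[i][j] = shared / file_deg[j]
    return w
```

The result is generally asymmetric (`w[i][j] != w[j][i]`), which is why the projected file-file network is directed.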
8. The defect severity prediction method based on developer modification history according to claim 7, wherein in step 5, the out-degree and in-degree of the file nodes are calculated based on the file-file collaboration strength of the file-file weighted directed network, a local probability is calculated for each file node, the initial entropy of the file nodes is calculated based on the local probability, and the initial entropy is propagated and superposed to obtain the final structural entropy score of the file nodes, specifically: a corresponding adjacency matrix $W = [w_{ij}]$ is constructed from the file-file collaboration strength, where $w_{ij}$ represents the collaboration strength from file node $f_i$ to file node $f_j$; the out-degree and in-degree of each file node are calculated from the adjacency matrix: $k_i^{out} = \sum_{j} w_{ij}, \quad k_i^{in} = \sum_{j} w_{ji}$ (7) where $k_i^{out}$ represents the out-degree of file node $f_i$, i.e., the sum of all outgoing edge weights of the node, and $k_i^{in}$ represents the in-degree of file node $f_i$, i.e., the sum of all incoming edge weights of the node; for each file node $f_i$, its local probability $p_i$ is calculated by formula (8), in which $k_i^{out}$ is the out-degree of node $f_i$, $k_k^{in}$ is the in-degree of node $k$, and the normalizing term is the sum of the out-degrees and in-degrees of node $f_i$ itself and of all its neighbor nodes; the initial entropy of the file node is calculated from the local probabilities as: $E_i = -\sum_{j \in \Gamma_i} p_j \log p_j$ (9) where $\Gamma_i$ represents the set of neighbors of node $f_i$ and $p_j$ is the local probability of neighbor node $f_j$; the initial entropy of the file nodes is propagated and superposed to obtain the final structural entropy score of the node: $SE_i = E_i + \sum_{j \in \Gamma_i} \frac{w_{ji}}{k_j^{out}} E_j$ (10) where $E_i$ is the initial entropy of node $f_i$; $\Gamma_i$ is the set of neighbors of node $f_i$; $E_j$ is the entropy value of node $f_j$; $w_{ji}$ represents the weight of the edge pointing from node $f_j$ to node $f_i$; and $k_j^{out}$ is the out-degree of node $f_j$.
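The structural entropy computation of claim 8 can be sketched as below. Several details are assumptions rather than statements of the claim: the local probability is taken as a node's total degree (out plus in) divided by the total degree of the node and its neighbors, the entropy uses the natural logarithm, neighbors are nodes joined by an edge in either direction, and each neighbor passes on its initial entropy in proportion to the share of its out-degree that points at the node.

```python
import math


def structure_entropy(w):
    """Final structural entropy score per file node for adjacency matrix w[i][j]."""
    n = len(w)
    out_deg = [sum(w[i]) for i in range(n)]
    in_deg = [sum(w[j][i] for j in range(n)) for i in range(n)]
    total = [out_deg[i] + in_deg[i] for i in range(n)]
    neighbors = [
        [j for j in range(n) if j != i and (w[i][j] > 0 or w[j][i] > 0)]
        for i in range(n)
    ]

    # Local probability: own total degree over that of the node plus its neighbors.
    prob = []
    for i in range(n):
        denom = total[i] + sum(total[j] for j in neighbors[i])
        prob.append(total[i] / denom if denom > 0 else 0.0)

    # Initial entropy: Shannon-style sum over the neighbors' local probabilities.
    initial = [
        -sum(prob[j] * math.log(prob[j]) for j in neighbors[i] if prob[j] > 0)
        for i in range(n)
    ]

    # Propagation: add each neighbor's initial entropy, weighted by w[j][i] / out_deg[j].
    final = []
    for i in range(n):
        spread = sum(
            initial[j] * w[j][i] / out_deg[j]
            for j in neighbors[i]
            if out_deg[j] > 0
        )
        final.append(initial[i] + spread)
    return final
```

Under these assumptions, a file scores higher when it sits among well-connected neighbors whose collaboration strength points mostly at it.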
9. The defect severity prediction method based on developer modification history according to claim 8, wherein in step 6, the software metrics of each source code file are extracted as follows: Step 601, extracting class-level structural complexity metrics from the source code of each version using the static analysis tool CK; and Step 602, aggregating the class-level structural complexity metrics into file-level metrics to obtain the software metrics.
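Step 602 can be illustrated with a small aggregation sketch. The claim does not state how class-level values are combined, so the sum, maximum, and mean used here are assumptions, as are the column names (`file`, `wmc`, `cbo`, `rfc`, `loc`) and the function name `aggregate_to_file_level`.

```python
from collections import defaultdict
from statistics import mean


def aggregate_to_file_level(class_rows, metrics=("wmc", "cbo", "rfc", "loc")):
    """Aggregate class-level metric rows into one feature dict per file.

    class_rows: iterable of dicts, one per class, each with a 'file' key and
    one numeric (or numeric-string) value per metric name.
    """
    grouped = defaultdict(list)
    for row in class_rows:
        grouped[row["file"]].append(row)

    result = {}
    for path, rows in grouped.items():
        features = {}
        for metric in metrics:
            values = [float(r[metric]) for r in rows]
            # Assumed aggregations: total, worst case, and average per file.
            features[f"{metric}_sum"] = sum(values)
            features[f"{metric}_max"] = max(values)
            features[f"{metric}_mean"] = mean(values)
        result[path] = features
    return result
```

Rows read with `csv.DictReader` from a class-level metrics file can be passed in directly, since values are converted with `float`.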
10. The defect severity prediction method based on developer modification history according to claim 9, wherein in step 7, the machine learning model is an XGBoost classifier, and the model is evaluated using repeated stratified K-fold cross-validation.
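The evaluation scheme of claim 10 can be illustrated with a dependency-free sketch of repeated stratified K-fold splitting. The generator name `repeated_stratified_kfold`, its parameters, and the round-robin fold assignment are assumptions; in practice a library splitter such as scikit-learn's `RepeatedStratifiedKFold` would typically be paired with the classifier.

```python
import random
from collections import defaultdict


def repeated_stratified_kfold(labels, n_splits=5, n_repeats=3, seed=0):
    """Yield (train_indices, test_indices) pairs, n_splits per repetition.

    Each fold keeps roughly the same proportion of every severity label.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)  # a fresh shuffle per repetition
            # Deal the samples of each class round-robin across the folds.
            for position, index in enumerate(shuffled):
                folds[position % n_splits].append(index)
        for k in range(n_splits):
            test = sorted(folds[k])
            train = sorted(
                index
                for j, fold in enumerate(folds)
                if j != k
                for index in fold
            )
            yield train, test
```

Stratification matters here because severity labels are usually imbalanced: without it, a fold could contain no examples of a rare severity level.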

Description

Defect severity prediction method based on developer modification history

Technical Field

The invention relates to the technical field of software development, and in particular to a defect severity prediction method based on developer modification history.

Background

As software systems continue to grow in scale and functional complexity, the number and severity of the defects that arise during development and evolution become increasingly diverse and unpredictable. Defects not only affect the reliability and stability of the software, but also directly increase maintenance costs and iteration risks. In actual development, the modification behavior and change history of developers have an important influence on how defects arise. A developer's patterns of continuous modification, collaborative modification behavior, and the structural dependencies between files may significantly affect the probability that defects occur and their severity. How to make full use of developers' historical modification data and, combined with the relations among files, effectively predict defect severity has therefore become a core problem to be solved in software engineering. An effective way to understand the defects of a large software system is to start from the file modification history and developer behavior characteristics (such as the number of modifications a developer makes to a file, the number of modified lines, and the collaborative modification relations among files) and gradually extend to the evolution of the whole software system.
Prior studies have made various attempts at defect prediction. Rajni et al. extracted relevant data from defect reports using text mining techniques and built prediction models with multiple logistic regression, multi-layer perceptrons, and decision trees ("Prediction of defect severity by mining software project reports"). Umamaheswara et al. proposed a Software Defect Severity Prediction (SDSP) model based on a mix of unlabeled and labeled defect severity data ("Towards Developing and Analysing Metric-Based Software Defect Severity Prediction Model"). Yang Fengyu et al. proposed a defect prediction method that combines traditional metrics with semantic structural metrics ("Research progress in software defect prediction combining multiple metrics"). Liu Wenjie et al. improved defect severity prediction performance by applying feature optimization, weight adjustment, and dimension reduction to raw software defect report data and then training and testing with a multinomial Bayes algorithm ("Feature-based severity assessment of software defect reports").

However, the existing work still has the following disadvantages:

(1) Most existing work is based on static code features and lacks dynamic modeling from the perspective of developer behavior. Existing defect prediction methods rely mainly on static code metrics (such as the CK metric suite, McCabe complexity, and lines of code) and cannot dynamically model developer behavior. These methods ignore dynamic information in the development process, such as developers' commit behavior and the collaborative modification relations between files, even though such dynamic behavior often directly reflects the system's evolution and the potential propagation paths of defects.
Based on Git commit logs and JIRA defect data, the invention fuses developer behavior with file modification history to construct a developer-file weighted directed network and, by weighted projection, a file-file weighted directed network, thereby introducing into defect prediction a modeling perspective in which development behavior and files co-evolve.

(2) Existing defect prediction models do not sufficiently mine structural features, so the key files of a system are difficult to identify. Traditional methods typically rely on static metrics to evaluate the importance of a file, but cannot characterize the file's structural position and information-flow characteristics in the global collaboration network. The method of the invention calculates the structural entropy of the file-file weighted directed network, jointly considering node in-degree, out-degree, and the weight distribution of neighbors, to obtain a file node importance score that reflects global dependency relations and assists accurate prediction of defect severity.

(3) Existing methods tend to focus only on the "presence" of defects and lack systematic modeling of their "severity". Most models only judge whether a file is defective and do not distinguish defect levels, and the defect severity is a ke