CN-122019333-A - Software aging prediction method and device based on feature reconstruction and label guidance

Abstract

The application relates to a software aging prediction method and device based on feature reconstruction and label guidance. The method comprises: obtaining the source code of target software and aging label samples corresponding to the source code; extracting metric features and graph structure features from the source code and reconstructing them into feature images; obtaining aging feature representations of the source code from the feature images through self-training optimized by contrastive learning; generating pseudo labels for unlabeled code units in the source code using buffer-assisted label propagation and the aging label samples; constructing a dataset based on the source code, the aging label samples, and the pseudo labels; training a prediction model based on the aging feature representations and the dataset; and performing aging prediction on the target software. Through feature reconstruction and buffer-assisted label propagation, the method improves the stability and reliability of the pseudo labels and addresses the poor pseudo-label quality that traditional self-training methods exhibit in the early stage of training.

Inventors

ZHANG CHEN
TIAN HAO
DI YI
LI RUIHENG

Assignees

Hubei University of Economics (湖北经济学院)

Dates

Publication Date: 2026-05-12
Application Date: 2025-12-19

Claims (10)

1. A software aging prediction method based on feature reconstruction and label guidance, characterized by comprising the following steps: acquiring source code of target software, and acquiring aging label samples of a plurality of code units corresponding to the source code; extracting metric features and graph structure features from the source code, and fusing and reconstructing the metric features and the graph structure features into feature images; based on the feature images, performing feature learning by a self-training method optimized by contrastive learning to obtain an aging feature representation of each code unit of the source code; generating pseudo labels for unlabeled code units in the source code by using a buffer-assisted label propagation method and the aging label samples, wherein the buffer is used for temporarily storing label information to be propagated; training a prediction model based on the aging feature representation and the dataset; and performing aging prediction on the target software based on the trained prediction model.
2. The software aging prediction method based on feature reconstruction and label guidance according to claim 1, wherein performing feature learning by a self-training method optimized by contrastive learning based on the feature images, to obtain an aging feature representation of each code unit of the source code, comprises: constructing a first loss function based on a supervised cross-entropy loss on the labeled data, an unsupervised loss on the unlabeled data, and a class-aware contrastive loss; determining weights for the unsupervised loss and the class-aware contrastive loss, respectively; constructing a second loss function based on the first loss function and the weights; and obtaining the aging feature representation of each code unit of the source code from the source code and the aging label samples through self-supervised learning with the loss function.
3. The software aging prediction method based on feature reconstruction and label guidance according to claim 2, wherein determining the weights of the unsupervised loss and the class-aware contrastive loss, respectively, comprises: dynamically adjusting the weight of the class-aware contrastive loss based on the total number of training epochs and the number of training epochs over which the loss function value falls from its highest point to its lowest point.
4. The method of claim 1, wherein generating pseudo labels for unlabeled code units in the source code by using the buffer-assisted label propagation method and the aging label samples comprises: dividing the data into labeled data and unlabeled data based on a preset confidence threshold; in each iteration, sampling the labeled data with replacement and updating its feature representation based on the sampling result; constructing a symmetric adjacency matrix based on the similarity between the updated labeled-data features and the unlabeled-data features; using the symmetric adjacency matrix, combined with the current features of the labeled data or the historical features stored in the buffer, to perform label propagation on the unlabeled data and obtain a plurality of candidate pseudo labels for each unlabeled data item; and determining the candidate pseudo labels whose confidence is not lower than the confidence threshold as the pseudo labels of the corresponding unlabeled data.
5. The software aging prediction method based on feature reconstruction and label guidance according to claim 4, further comprising: smoothing the generated pseudo labels through distribution alignment.
6. The software aging prediction method based on feature reconstruction and label guidance according to claim 1, wherein fusing and reconstructing the metric features and the graph structure features into the feature images comprises: mapping the metric features and the graph structure features into a first two-dimensional image and a second two-dimensional image, respectively; normalizing the first two-dimensional image and the second two-dimensional image; extracting, through a Graph Transformer model, two-dimensional features having the same dimensions as the first two-dimensional image from the normalized second two-dimensional image to obtain a third two-dimensional image; and expanding the first two-dimensional image and the third two-dimensional image into a three-channel image.
7. A software aging prediction device based on feature reconstruction and label guidance, comprising: an acquisition module for acquiring source code of target software and aging label samples of a plurality of code units corresponding to the source code; a learning module for fusing metric features and graph structure features and reconstructing them into feature images, and for performing feature learning by a self-training method optimized by contrastive learning based on the feature images to obtain an aging feature representation of each code unit of the source code; a construction module for generating pseudo labels for unlabeled code units in the source code by using a buffer-assisted label propagation method and the aging label samples, wherein the buffer is used for temporarily storing label information to be propagated; a training module for training a prediction model based on the aging feature representation and the dataset; and a prediction module for performing aging prediction on the target software based on the trained prediction model.
8. The software aging prediction device based on feature reconstruction and label guidance according to claim 7, wherein the learning module comprises: a first construction unit for constructing a first loss function based on a supervised cross-entropy loss on the labeled data, an unsupervised loss on the unlabeled data, and a class-aware contrastive loss; a determining unit for determining weights for the unsupervised loss and the class-aware contrastive loss, respectively; a second construction unit for constructing a second loss function based on the first loss function and the weights; and a learning unit for obtaining the aging feature representation of each code unit of the source code from the source code and the aging label samples through self-supervised learning with the loss function.
9. An electronic device comprising one or more processors and a storage device for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the software aging prediction method based on feature reconstruction and label guidance according to any one of claims 1 to 6.
10. A computer-readable medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the software aging prediction method based on feature reconstruction and label guidance according to any one of claims 1 to 6.
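
The buffer-assisted label propagation recited in claim 4 can be illustrated with a minimal NumPy sketch. This is an illustration under stated assumptions, not the applicant's implementation. The function names (`bootstrap_per_class`, `propagate_once`, `buffered_pseudo_labels`) and the parameters `alpha`, `temp`, and `buffer_size` are hypothetical. The claims do not specify the similarity measure, the propagation rule, or how candidates are combined, so the sketch assumes a cosine-similarity kernel, the standard closed-form propagation F = (I - alpha W)^-1 Y used in graph-based semi-supervised learning, a per-class bootstrap, and selection of the most confident candidate per sample. Features are static here; in a training loop the buffer would hold features produced by earlier model states.

```python
from collections import deque

import numpy as np


def bootstrap_per_class(y, rng):
    """Sample indices with replacement inside each class (assumed stratified bootstrap)."""
    parts = []
    for c in np.unique(y):
        members = np.flatnonzero(y == c)
        parts.append(rng.choice(members, size=len(members), replace=True))
    return np.concatenate(parts)


def propagate_once(anchor_x, anchor_y, unl_x, n_classes, alpha=0.8, temp=0.1):
    """Run one round of graph label propagation; return class probabilities for unl_x."""
    x = np.vstack([anchor_x, unl_x])
    x = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)
    s = np.exp((x @ x.T - 1.0) / temp)        # symmetric cosine-similarity kernel
    np.fill_diagonal(s, 0.0)                  # no self-loops
    d = s.sum(axis=1)
    w = s / np.sqrt(np.outer(d, d))           # symmetric normalized adjacency matrix
    y0 = np.zeros((len(x), n_classes))
    y0[np.arange(len(anchor_y)), anchor_y] = 1.0   # one-hot labels on anchor nodes
    f = np.linalg.solve(np.eye(len(x)) - alpha * w, y0)
    p = np.clip(f[len(anchor_x):], 0.0, None)
    return p / (p.sum(axis=1, keepdims=True) + 1e-12)


def buffered_pseudo_labels(lab_x, lab_y, unl_x, n_classes,
                           threshold=0.9, n_iter=5, buffer_size=2, seed=0):
    """Return (pseudo_labels, confidence); rejected samples get label -1."""
    rng = np.random.default_rng(seed)
    buffer = deque(maxlen=buffer_size)        # historical labeled features and labels
    best_conf = np.zeros(len(unl_x))
    best_label = np.full(len(unl_x), -1)
    for _ in range(n_iter):
        idx = bootstrap_per_class(lab_y, rng)
        cur_x, cur_y = lab_x[idx], lab_y[idx]
        # Propagate from the current sample plus whatever the buffer still holds.
        anchor_x = np.vstack([cur_x] + [bx for bx, _ in buffer])
        anchor_y = np.concatenate([cur_y] + [by for _, by in buffer])
        p = propagate_once(anchor_x, anchor_y, unl_x, n_classes)
        conf, label = p.max(axis=1), p.argmax(axis=1)
        better = conf > best_conf             # keep the most confident candidate so far
        best_conf[better] = conf[better]
        best_label[better] = label[better]
        buffer.append((cur_x, cur_y))
    accepted = best_conf >= threshold         # confidence threshold from claim 4
    return np.where(accepted, best_label, -1), best_conf
```

Calling `buffered_pseudo_labels(lab_x, lab_y, unl_x, n_classes=2)` with integer class labels in `lab_y` returns one pseudo label per unlabeled row, with `-1` marking rows whose best candidate falls below the threshold. Each round produces one candidate per unlabeled sample, so the candidates across rounds play the role of the "plurality of candidate pseudo labels" in the claim.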

Description

Software aging prediction method and device based on feature reconstruction and label guidance

Technical Field

The application belongs to the technical field of deep learning and software operation and maintenance, and particularly relates to a software aging prediction method and device based on feature reconstruction and label guidance.

Background

In recent years, with the rapid development of machine learning techniques, data-driven aging defect prediction has achieved good results. By learning the characteristics of aging defects from a large number of code samples, related research can effectively predict whether a new code sample contains an aging defect. Machine learning methods can predict whether a test file contains an aging defect without relying on specific expert knowledge. However, such methods generally require a large amount of labeled data as a training set, and the high labeling cost and time overhead greatly restrict the deployment and adoption of prediction models in practice. Although cross-project prediction can address the lack of labeled data, it requires a suitable feature migration method to obtain good prediction results on the target project. How to train a prediction model effectively on a sample set with few or no labels has therefore become a pressing problem in software testing. Although software aging has long attracted attention from researchers and practitioners, software aging defect prediction is still a relatively new research direction in software reliability engineering, and the data resources available for it lag significantly behind those for mature, traditional software defect research.

Currently, the mainstream software defect benchmark datasets (such as NASA, PROMISE, ReLink, and SOFTLAB) are mainly constructed for conventional defect types, while labeled datasets dedicated to aging-related defects remain relatively scarce, which limits the progress of empirical research in this field to some extent. Labeling data is costly and the labeling process is tedious, and traditional manual labeling makes it difficult to guarantee labeling quality and consistency. In recent years, researchers have proposed using cross-project defect prediction (CPAP) to solve this problem. Cross-project defect prediction learns the feature distribution in a labeled source project and then tests on an unlabeled target project, with the aim of obtaining good prediction results. Generally, the source project and the target project belong to different software development projects.

The research group of Qin et al. was the first to systematically study the cross-project prediction problem on aging datasets. As early as 2015 they proposed a cross-project prediction approach based on transfer learning, which uses transfer component analysis (TCA) to map the source domain and the target domain into the same feature space through feature transformation, so that the distributions of the two domains are as similar as possible, thereby improving prediction performance on the target task. In subsequent studies, the authors extended the work to 9 subsystems in 3 datasets and reported baseline cross-project prediction performance on existing datasets to guide later research.
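
As an illustration of the TCA idea described above, the following is a minimal NumPy sketch, not the implementation used by Qin et al. It follows the commonly used formulation of TCA, in which the transformation is built from the leading eigenvectors of (KLK + mu I)^-1 KHK, where K is a kernel matrix over source and target samples, L encodes the difference between the two domain means, and H is the centering matrix. The linear kernel, the function name `tca`, and the parameters `dim` and `mu` are assumptions made for this example.

```python
import numpy as np


def tca(Xs, Xt, dim=2, mu=1.0):
    """Map source and target samples into a shared dim-dimensional space."""
    ns, nt = len(Xs), len(Xt)
    n = ns + nt
    X = np.vstack([Xs, Xt])
    K = X @ X.T                                   # linear kernel (assumption)
    # L = e e^T, so trace(K L) is the squared distance between the domain means.
    e = np.concatenate([np.full(ns, 1.0 / ns), np.full(nt, -1.0 / nt)])
    L = np.outer(e, e)
    H = np.eye(n) - np.full((n, n), 1.0 / n)      # centering matrix
    # Leading eigenvectors of (K L K + mu I)^-1 K H K keep variance
    # while shrinking the gap between the two domains.
    A = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    eigvals, eigvecs = np.linalg.eig(A)
    top = np.argsort(-eigvals.real)[:dim]
    W = eigvecs[:, top].real
    Z = K @ W                                     # embedded samples, source rows first
    return Z[:ns], Z[ns:]
```

A classifier trained on the embedded source samples can then be applied to the embedded target samples. The kernel choice matters in practice, which is why a real experiment would treat it as a tunable parameter rather than fix it as this sketch does.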

Finally, in their 2020 work, Qin et al. comprehensively analyzed the influence of parameters in the TCA method, including the data normalization method, the kernel function used for feature migration, and the choice of machine learning classifier. The experimental results show that the kernel function and the classifier have a larger influence on the prediction results. Building on this idea, research has diversified in recent years. For example, Wan et al. argue that the key to cross-project prediction is learning better feature representations, and they propose a representation learning method based on an autoencoder with two encoding layers. Xie et al. propose a CPAP method based on kernel principal component analysis and a double marginalized denoising autoencoder. Xu et al. propose a cross-project defect prediction method based on joint distribution adaptation (JDA). The method simultaneously minimizes the marginal distribution difference and the conditional distribution difference between the source project and the target project through a two-stage distribution alignment mechanism, thereby establishing a more generalizable distribution representation in the feature space. Although cross-project defect prediction can alleviate the problem of expensive labeled data, such methods still face two major limitations: (1) the performance is greatly affected by the feature migration me