CN-121979757-A - Code smell detection method based on deep semantics and complex structure
Abstract
The invention discloses a code smell detection method based on deep semantics and complex structure, comprising: step 1, parsing code with the encoder of the CodeBERT code large language model to obtain deep semantic information; step 2, parsing code with the Joern framework to obtain complex structure information; step 3, training separate models on the deep semantic information and the complex structure information; and step 4, fusing the prediction probabilities of the separate models to obtain the final result. The invention uses static code analysis, data preprocessing, fine-tuning, and related techniques to build a systematic and efficient code smell detection model that detects smells from both the semantic and the structural perspective. This helps developers improve code quality and supports further research in the field of code smell detection.
Inventors
- ZHANG JINGXUAN
- ZHANG JUN
- CHEN JUNHAO
- LIANG XINYUE
- LI LIN
Assignees
- Nanjing University of Aeronautics and Astronautics (南京航空航天大学)
Dates
- Publication Date: 20260505
- Application Date: 20251124
Claims (5)
- 1. A code smell detection method based on deep semantics and complex structure, characterized by comprising the following steps: step 1, parsing code with the encoder of the CodeBERT code large language model to obtain deep semantic information; step 2, parsing code with the Joern framework to obtain complex structure information; step 3, training separate models on the deep semantic information and the complex structure information respectively; and step 4, fusing the prediction probabilities of the separate models to obtain a final result.
- 2. The code smell detection method based on deep semantics and complex structure according to claim 1, wherein step 1 obtains deep semantic information by parsing code with the encoder of the CodeBERT code large language model and constructs a preliminary deep semantic feature representation, comprising: first, a large-scale code corpus containing multiple smell types is collected and organized as the object of deep semantic encoding, and a preliminary feature vector design is proposed by analyzing the contextual semantics and abstract syntax structures of representative code fragments in the corpus; next, a high-quality, expert-annotated code smell benchmark dataset is collected for model pre-training and fine-tuning; systematic analysis of this dataset shows that most code smells are closely tied to specific semantic patterns in their context; therefore, starting from the token sequence and abstract syntax tree (AST) of the code, the deep semantic indicators of a smell are derived in reverse by focusing on the vectors generated by the CodeBERT encoder; this process requires substantial computing resources for model training and parameter tuning, and a feature extraction model capable of representing the deep semantics of code is initially established through repeated iterative optimization of the model's encoding layers and attention weights.
- 3. The code smell detection method based on deep semantics and complex structure according to claim 1, wherein step 2 obtains complex structure information by parsing code with the Joern framework and constructs a code property graph (CPG) feature representation, comprising: first, the system uses the Joern analysis tool to build graphs in batches for the same code corpus used in step 1, which serves as the object of structural analysis, and a preliminary graph feature engineering design is proposed by analyzing the abstract syntax tree (AST), control flow graph (CFG), and program dependence graph (PDG) of the code; next, Joern's query language is used to extract and analyze the data flow graph (DFG); systematic analysis of the differences in graph structure between smelly and non-smelly code shows that most code smells appear on the graph as a specific structural pattern or an abnormal data flow path; therefore, considering the node types, edge relations, and topological features of the CPG, the structural indicators of a smell are derived in reverse by focusing on the subgraph structures that represent code dependency, call, and control relations and by analyzing how they correspond to smell patterns; this process incurs substantial graph computation and traversal cost, and a feature vector space capable of representing the complex structural relations of code is initially established through repeated experiments that refine the extraction method and dimensionality of the graph features.
- 4. The code smell detection method based on deep semantics and complex structure according to claim 1, wherein step 3 trains separate models on the deep semantic information and the complex structure information and constructs heterogeneous smell classifiers, comprising: first, based on the deep semantic feature vectors obtained in step 1, the system selects a sequence-sensitive model such as a Transformer or a multi-layer perceptron (MLP) as the classifier of the semantic branch, and designs a targeted loss function by analyzing the distribution of the semantic features in vector space so as to capture fine semantic differences; next, based on the code property graph (CPG) obtained in step 2 or features derived from it, a graph-sensitive model such as a graph neural network (GNN) or graph convolutional network (GCN) is selected as the classifier of the structural branch; systematic analysis of the graph topology and smell labels shows that different types of GNN layers, such as GAT and GIN, differ in their ability to capture different smells; therefore, the model adopts a GAT_JK_pool architecture, which follows the graph attention (GAT) layers with Jumping Knowledge connections and a dual Mean/Max pooling mechanism, so as to extract and fuse features from graph convolutions of different depths.
- 5. The code smell detection method based on deep semantics and complex structure according to claim 1, wherein step 4 fuses the prediction probabilities of the separate models to obtain a final result and achieves decision-level fusion of multi-modal information, comprising: first, the system takes the prediction probability vectors output by the semantic model and the structural model in step 3 for the same code sample as the input of decision fusion, and proposes a preliminary probability-weighted fusion design by analyzing the differences in preference and confidence of the two models across smell categories; next, more complex fusion strategies are explored on the validation set; the probability-weighted fusion strategy combines the predictions of the CodeBERT model, which is based on deep semantic information of the code, with those of the GCN model, which is based on complex structural information of the code; the aim is to use the strong semantic understanding of the text sequence model together with the graph model's insight into structural dependencies of the code, so as to obtain a final decision that is more robust and accurate than that of either single model; the core of the fusion process is a weighted average: the two models' prediction probabilities for the smell class, i.e. the positive class, are assigned the weights weight and 1-weight to obtain a weighted average probability; to determine the best fusion, a grid search is performed on the validation set, systematically trying different combinations of weight and final classification threshold, and the parameter pair that maximizes the F1-score is selected; the optimized fusion mechanism thereby balances semantic and structural information more precisely and improves the accuracy and recall of code smell recognition.
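The weighted-average fusion and grid search described in claim 5 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the toy validation data, the 21-point grid, the `>=` threshold rule, and the choice to give `weight` to the semantic branch (and `1 - weight` to the structural branch) are all assumptions made for the example.

```python
from itertools import product


def f1_score(labels, preds):
    """F1 for the positive (smelly) class; returns 0.0 when undefined."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def fuse(p_semantic, p_structural, weight):
    """Weighted average of the two models' positive-class probabilities.

    Assumption: `weight` goes to the semantic branch, `1 - weight` to the
    structural branch; the source does not say which branch gets which.
    """
    return [weight * ps + (1.0 - weight) * pg
            for ps, pg in zip(p_semantic, p_structural)]


def grid_search(p_semantic, p_structural, labels, steps=21):
    """Try every (weight, threshold) pair on the grid; keep the best F1."""
    grid = [i / (steps - 1) for i in range(steps)]  # 0.00, 0.05, ..., 1.00
    best = (-1.0, None, None)
    for weight, threshold in product(grid, grid):
        fused = fuse(p_semantic, p_structural, weight)
        preds = [1 if p >= threshold else 0 for p in fused]
        score = f1_score(labels, preds)
        if score > best[0]:
            best = (score, weight, threshold)
    return best


# Toy validation data, made up for illustration only.
labels = [1, 0, 1, 0, 1, 0]
p_sem = [0.9, 0.6, 0.4, 0.2, 0.8, 0.3]  # semantic branch probabilities
p_str = [0.7, 0.2, 0.8, 0.4, 0.3, 0.1]  # structural branch probabilities

best_f1, best_weight, best_threshold = grid_search(p_sem, p_str, labels)
print(best_f1, best_weight, best_threshold)
```

In this toy data neither branch separates the two classes on its own, while an intermediate weight does, which is the situation the fusion step is meant to exploit. In practice the selected pair would be fixed on the validation set and then applied unchanged to the test set.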
The method automatically identifies code smells by building on the CodeBERT model and the GCN graph convolutional network model, using semantic information and structural information respectively to detect the types of code smell. It makes full use of multi-faceted, deep feature information of the code and improves the accuracy of code smell detection. Finally, the trained binary classification models are combined into an overall model, which makes prediction more flexible across different code smells.
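The readout stage of the GAT_JK_pool architecture named in claim 4 can be illustrated with a small pure-Python sketch. It assumes the per-layer node embeddings have already been produced by the GAT layers, and it assumes the "concat" variant of Jumping Knowledge, which the source does not specify. The function names and the example values are hypothetical.

```python
def jumping_knowledge_concat(layer_outputs):
    """Concatenate each node's embedding from every layer (JK 'concat' mode).

    layer_outputs: list over layers; each entry is a list of node vectors.
    """
    num_nodes = len(layer_outputs[0])
    return [
        [x for layer in layer_outputs for x in layer[node]]
        for node in range(num_nodes)
    ]


def dual_pool(node_vectors):
    """Graph-level readout: element-wise mean and max over nodes, concatenated."""
    dims = range(len(node_vectors[0]))
    mean = [sum(v[d] for v in node_vectors) / len(node_vectors) for d in dims]
    peak = [max(v[d] for v in node_vectors) for d in dims]
    return mean + peak


# Hypothetical outputs of two GAT layers for a 3-node graph (2 features each).
layer1 = [[1.0, 0.0], [0.0, 2.0], [2.0, 1.0]]
layer2 = [[0.5, 0.5], [1.5, 0.5], [1.0, 2.0]]

node_repr = jumping_knowledge_concat([layer1, layer2])
graph_repr = dual_pool(node_repr)
print(graph_repr)  # [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 1.5, 2.0]
```

The resulting graph vector has twice the concatenated node dimension (mean half plus max half) and would feed the structural branch's classification head. Keeping every layer's output lets the classifier draw on both shallow and deep neighborhoods.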
Description
Code smell detection method based on deep semantics and complex structure
Technical Field
The invention belongs to the field of software engineering, in particular code smell detection and static software analysis, and relates to a code smell detection method based on deep semantics and complex structure.
Background
A code smell is a part of a program's source code that may reflect a deeper problem in the system or may cause the program to report errors. Code smells often arise during programming because of poor system design, heavy workload, and similar factors, and they indicate that an error may exist somewhere in the code, which can lead to larger problems during later development, evolution, and maintenance. Only some code smells are explicitly marked by programmers, for example through code comments; most have no obvious indicator. A code smell marks source code that may need refactoring, so detecting code smells helps developers refactor a program. Manual identification, however, is difficult and inaccurate, so an automated code smell detection method is needed. Such detection helps programmers find potential code defects and improve code quality. Many methods have therefore been proposed to identify code smells automatically or semi-automatically. Most of them rely on manually designed heuristics that map manually selected source code metrics to predictions. What is lacking is a model that fuses the deep semantic information and complex structure information of code, together with a method that detects code smells from different perspectives and fuses the results.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a code smell detection method based on deep semantics and complex structure. It obtains a dataset by encoding code with a code large language model and parsing code structure with the Joern framework; builds a systematic and efficient code smell detection model using static code analysis, data preprocessing, fine-tuning, and related techniques; and fuses the different models with a weight-threshold mechanism so that smells are detected from both the semantic and the structural perspective. This helps developers improve code quality and supports in-depth research in the field of code smell detection.
To achieve this technical purpose, the invention adopts the following technical scheme: a code smell detection method based on deep semantics and complex structure, comprising the following steps: step 1, parsing code with the encoder of the CodeBERT code large language model to obtain deep semantic information; step 2, parsing code with the Joern framework to obtain complex structure information; step 3, training separate models on the deep semantic information and the complex structure information respectively; and step 4, fusing the prediction probabilities of the separate models to obtain a final result.
To further optimize the technical scheme, the following specific measures are also adopted:
Step 1, which obtains deep semantic information by parsing code with the encoder of the CodeBERT code large language model and constructs a preliminary deep semantic feature representation, comprises:
First, the system collects and organizes a large-scale code corpus containing multiple smell types as the object of deep semantic encoding. A preliminary feature vector design is proposed by analyzing the contextual semantics and abstract syntax structures of representative code fragments in this corpus.
Next, a high-quality, expert-annotated code smell benchmark dataset is collected for model pre-training and fine-tuning. Systematic analysis of this dataset shows that the vast majority of code smells are closely tied to specific semantic patterns in their context. Therefore, starting from the token sequence and abstract syntax tree (AST) of the code, the deep semantic indicators of a smell are derived in reverse by focusing on the vectors generated by the CodeBERT encoder and analyzing their distribution and associations in high-dimensional space. This process requires substantial computing resources for model training and parameter tuning, and a feature extraction model capable of representing the deep semantics of code is initially established through repeated iterative optimization of the model's encoding layers and attention weights.
Step 2 of parsing the code through the Joern framework to obtain complex structure information, and constructing a code property graph (CPG) feature representation, i