CN-121980065-A - Code searching method based on double-order feature optimization mechanism
Abstract
The invention discloses a code searching method based on a double-order feature optimization mechanism. The method comprises the steps of obtaining a data sample set, constructing a code searching model based on the double-order feature optimization mechanism, and training the constructed model on the data sample set to obtain an optimal code searching model, thereby realizing code searching based on the double-order feature optimization mechanism. The invention addresses the problem that existing methods cannot adequately capture the deep control dependency and data dependency relationships among the various code features, so that the semantic information of the code is insufficiently represented after the feature vectors are fused and the accuracy of code search results is low.
Inventors
- Gong Yuqi
- Shao Xinxin
- Luo Ximeng
- Zhu Zhengbo
- Yu Lingwei
- Zhao Chengshuo
- Qin Yiyang
- Nong Wusheng
Assignees
- 大连东软信息学院 (Dalian Neusoft University of Information)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-30
Claims (8)
- 1. A code searching method based on a double-order feature optimization mechanism, characterized by comprising the following steps: S1, acquiring a data sample set, wherein the data sample set comprises a plurality of code segments and the natural language descriptions corresponding to the code segments; S2, constructing a code search model based on the double-order feature optimization mechanism, wherein the code search model comprises a preprocessing module, a feature extraction module, a first-stage feature enhancement optimization module, a second-stage feature fusion optimization module, a query feature vector generation module, a feature space mapping module and a code search module; the preprocessing module is used for acquiring a method name sequence of each code segment based on camel-case (hump) naming, performing NLP word segmentation on each code segment to obtain a code Token sequence, and acquiring a program dependency graph corresponding to each code segment based on the tool; the query feature vector generation module is used for extracting the semantic feature information in the natural language description based on a preset first bidirectional LSTM model to obtain a natural language feature vector; the feature extraction module is used for extracting the semantic feature information in the method name sequence based on a second bidirectional LSTM model to obtain a method name feature vector, screening the Token sequence feature information in the code Token sequence based on a multi-layer perceptron model to obtain a code Token sequence feature vector, and extracting the graph structure feature information in the program dependency graph based on a third bidirectional LSTM model to obtain a program dependency graph feature vector; the first-stage feature enhancement optimization module is used for carrying out feature enhancement on the method name feature vector, the code Token sequence feature vector and the natural language feature vector by adopting a multi-head attention mechanism, obtaining an optimized method name feature vector, an optimized Token sequence feature vector and the query feature vector corresponding to the natural language feature vector; the second-stage feature fusion optimization module is used for carrying out multi-mode feature vector fusion on the optimized method name feature vector, the optimized Token sequence feature vector and the code graph structure sequence feature vector to obtain a feature vector fusion vector; the feature space mapping module is used for mapping the feature vector fusion vector to a preset high-dimensional vector space to obtain a final code feature vector; the code search module is used for calculating the cosine similarity between the query feature vector and each final code feature vector, arranging the final code feature vectors in descending order of cosine similarity to obtain a sequence table, and taking the code fragments corresponding to the first k final code feature vectors in the sequence table as the code search results; and S3, training the constructed code search model based on the double-order feature optimization mechanism on the data sample set to obtain an optimal code search model, and realizing code search based on the double-order feature optimization mechanism according to the optimal code search model.
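The code search module's ranking step, cosine similarity followed by a descending sort and a top-k cut, can be sketched as follows. This is a minimal NumPy illustration under assumed shapes, not the patented implementation; the function names are hypothetical:

```python
import numpy as np

def cosine_similarity(q, c):
    # Cosine similarity between the query vector and one final code vector.
    return float(np.dot(q, c) / (np.linalg.norm(q) * np.linalg.norm(c)))

def search_top_k(query_vec, code_vecs, snippets, k=2):
    """Rank code snippets by cosine similarity to the query, descending,
    and return the fragments for the first k vectors in the sequence table."""
    scores = [cosine_similarity(query_vec, v) for v in code_vecs]
    order = np.argsort(scores)[::-1][:k]   # descending order arrangement
    return [snippets[i] for i in order]
```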
- 2. The code searching method based on the double-order feature optimization mechanism as claimed in claim 1, wherein the method for obtaining the method name feature vector in S2 is as follows: word segmentation is carried out on the code method names in the method name sequence to obtain a method name sequence M = {m_1, m_2, ..., m_n} of length n, wherein m_i represents the i-th word unit of the method name sequence; the forward LSTM extractor and the backward LSTM extractor in the preset second bidirectional LSTM model respectively extract the forward dependency characteristics and the backward dependency characteristics of the method name sequence M to obtain the hidden layer state sequence vectors at all times; the forward and backward hidden layer sequence vectors at each time are spliced to obtain a joint feature matrix H; a preset maximum pooling layer carries out feature screening and aggregation on the bidirectional hidden states (i.e. the hidden layer state sequence vectors) at all moments in the joint feature matrix, namely the element-wise maximum of the bidirectional hidden states over all moments is selected, to fuse and obtain the final method name feature vector v_m = maxpool(H).
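As a rough illustration of the bidirectional extraction, splicing and max-pooling steps above, the sketch below substitutes a plain tanh recurrence for the LSTM extractors (a deliberate simplification: a real LSTM adds gating). All weight matrices and names are hypothetical:

```python
import numpy as np

def simple_rnn(embeds, W, U, reverse=False):
    """Stand-in for one LSTM extractor: a tanh recurrence over the word
    embeddings, optionally run backward over the sequence."""
    seq = embeds[::-1] if reverse else embeds
    h = np.zeros(W.shape[0])
    states = []
    for x in seq:
        h = np.tanh(W @ x + U @ h)
        states.append(h)
    if reverse:
        states = states[::-1]        # re-align backward states with time steps
    return np.stack(states)          # (T, hidden)

def bilstm_maxpool(embeds, Wf, Uf, Wb, Ub):
    """Splice forward/backward hidden states into a joint feature matrix,
    then take the element-wise maximum over all time steps."""
    fwd = simple_rnn(embeds, Wf, Uf)
    bwd = simple_rnn(embeds, Wb, Ub, reverse=True)
    joint = np.concatenate([fwd, bwd], axis=1)   # joint feature matrix (T, 2h)
    return joint.max(axis=0)                     # max pooling over time
```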
- 3. The method for searching codes based on the double-order feature optimization mechanism as claimed in claim 2, wherein the method for obtaining the code Token sequence feature vector in S2 is as follows: the code Token sequence is defined with length s and expressed as T = {t_1, t_2, ..., t_s}, wherein t_j represents the j-th word of the code Token sequence; after the j-th word in the code Token sequence is mapped into a vector representation by a preset linear layer, the vector representation of the j-th word at the hidden layer is obtained through a preset multi-layer perceptron; a preset max pooling layer performs a maximum pooling operation on the vector representations at the hidden layer to obtain the code Token sequence feature vector v_t.
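The linear mapping, multi-layer perceptron and max-pooling pipeline of this claim can be illustrated roughly as follows; the embedding table, layer sizes and function name are assumptions for illustration only:

```python
import numpy as np

def mlp_token_features(token_ids, embed, W1, b1, W2, b2):
    """Map each Token to a vector via a linear embedding lookup, pass it
    through a two-layer perceptron, then max-pool over the sequence."""
    X = embed[token_ids]             # (T, d) vector representation per word
    H = np.maximum(0, X @ W1 + b1)   # perceptron hidden layer with ReLU
    H = H @ W2 + b2                  # (T, h) hidden-layer representations
    return H.max(axis=0)             # maximum pooling over the Token sequence
```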
- 4. The method for searching codes based on the double-order feature optimization mechanism according to claim 3, wherein the method in S2 for feature enhancement of the method name feature vector, the code Token sequence feature vector and the natural language feature vector by the multi-head attention mechanism comprises the following steps: the method name feature vector, the code Token sequence feature vector and the natural language feature vector are taken as input vectors, and each input vector is respectively subjected to linear transformation to obtain the three vectors used by the multi-head attention mechanism, namely the query vector Q, the key vector K and the value vector V; based on the multi-head attention mechanism, a plurality of attention vectors are acquired as head_i = Attention(Q_i, K_i, V_i), wherein head_i represents the output of the i-th attention head, and Q_i, K_i and V_i respectively represent the i-th query vector, the i-th key vector and the i-th value vector; all attention vectors are spliced and then subjected to linear transformation to obtain the feature-enhanced input vectors, which comprise the optimized method name feature vector, the optimized Token sequence feature vector and the query feature vector corresponding to the natural language feature vector.
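A minimal sketch of the multi-head attention computation this claim appears to describe, assuming the usual scaled dot-product formulation; the weight matrices and even head split are illustrative, not the patent's exact parameterization:

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Multi-head self-attention over input vectors X of shape (T, d_model):
    linear transforms to Q/K/V, per-head scaled dot-product attention,
    then splicing of all heads followed by an output linear transform."""
    T, d = X.shape
    dh = d // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        q = Q[:, h * dh:(h + 1) * dh]
        k = K[:, h * dh:(h + 1) * dh]
        v = V[:, h * dh:(h + 1) * dh]
        scores = q @ k.T / np.sqrt(dh)
        w = np.exp(scores - scores.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)       # softmax attention weights
        heads.append(w @ v)                     # output of the h-th head
    return np.concatenate(heads, axis=1) @ Wo   # splice heads, transform
```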
- 5. The method for code search based on the double-order feature optimization mechanism as claimed in claim 4, wherein the method in S2 for acquiring the program dependency graph corresponding to each code segment based on the tool comprises the following steps: the tool is called to generate a dot graph file corresponding to each code segment; the information between the nodes and edges of the code in the dot graph file is traversed by either the depth-first traversal or the breadth-first traversal method to obtain the corresponding program dependency graph, wherein the nodes in the program dependency graph comprise the elements corresponding to variables or functions in the code, and the edges are used for representing the dependency relations among the elements.
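The dot-file reading and breadth-first traversal of this claim can be sketched with a toy edge parser. A real pipeline would use an actual PDG generation tool and the full DOT grammar, so this regex-based reader is only an assumption-laden illustration:

```python
import re
from collections import deque

def parse_dot_edges(dot_text):
    """Extract 'src -> dst' edges from a Graphviz dot file body and build
    an adjacency-list program dependency graph."""
    edges = re.findall(r'(\w+)\s*->\s*(\w+)', dot_text)
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    return graph

def bfs_order(graph, start):
    """Breadth-first traversal over the dependency graph's nodes."""
    seen, order, q = {start}, [], deque([start])
    while q:
        node = q.popleft()
        order.append(node)
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                q.append(nxt)
    return order
```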
- 6. The method for searching codes based on the double-order feature optimization mechanism as claimed in claim 5, wherein the method for feature enhancement of the program dependency graph feature vector based on the Agent Attention mechanism in S2 is as follows: the program dependency graph feature vector is defined as G = {g_1, g_2, ..., g_N}, wherein g_i represents the embedding vector of the i-th node in the program dependency graph and N represents the number of nodes of the graph sequence; a first set of agent vectors A = {a_1, a_2, ..., a_m} is defined according to the program dependency graph feature vector, wherein a_j represents the j-th agent vector; each agent vector is taken as a query vector in the Agent Attention mechanism, and the nodes in the program dependency graph are simultaneously taken as the key-value pairs of the Agent Attention mechanism, so that the attention weight between agent vector a_j and each node is obtained as alpha_{j,i} = softmax(a_j (g_i W_K)^T / sqrt(d)), wherein W_K represents a learnable parameter matrix, sqrt(d) represents a scaling factor, ^T represents a transpose, g_i represents the i-th node of the program dependency graph, and alpha_{j,i} represents the attention weight between the j-th agent vector and the i-th node in the program dependency graph; according to the attention weight alpha_{j,i}, the contribution degree c_{j,i} of the j-th agent vector a_j to node g_i is obtained as c_{j,i} = alpha_{j,i} (a_j W_V), wherein W_V represents a learnable parameter matrix; according to the contribution degrees, the feature-enhanced representation of all agent vectors for each node is acquired as z_i = sum_j c_{j,i}; a maximum pooling process is executed on the feature-enhanced representations z_i to aggregate the feature-enhanced representations of all nodes, and the code graph structure sequence feature vector is acquired as v_g = maxpool(z_1, ..., z_N).
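A rough NumPy rendering of the Agent Attention step as reconstructed above: agents as queries, graph nodes as key-value pairs, a per-node enhanced representation summed over agents, then max-pooling over nodes. The softmax normalization axis and the weight shapes are assumptions:

```python
import numpy as np

def agent_attention(nodes, agents, Wk, Wv):
    """Enhance each node embedding by a set of agent (proxy) vectors:
    attention weights over agents per node, agent contributions summed
    into per-node representations, then max-pooled into one vector."""
    K = nodes @ Wk                                  # node keys
    scores = agents @ K.T / np.sqrt(K.shape[1])     # (n_agents, n_nodes)
    w = np.exp(scores - scores.max(axis=0, keepdims=True))
    w /= w.sum(axis=0, keepdims=True)               # normalise over agents
    contrib = w[:, :, None] * (agents @ Wv)[:, None, :]  # c_{j,i}
    Z = contrib.sum(axis=0)                         # enhanced node reps (N, d)
    return Z.max(axis=0)                            # max-pool over nodes
```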
- 7. The method for searching codes based on the double-order feature optimization mechanism as claimed in claim 6, wherein the method for obtaining the feature vector fusion vector through the second-stage feature fusion optimization module in S2 is as follows: a second set of agent vectors B = {b_1, b_2, ..., b_p} is defined, wherein b_j represents the j-th agent vector; a final feature matrix F is obtained by performing a splicing operation on the optimized method name feature vector, the optimized Token sequence feature vector and the code graph structure sequence feature vector; each agent vector is taken as a query vector of the Agent Attention mechanism, and the final feature matrix is taken as the key-value pairs of the Agent Attention mechanism, so that the contribution degree of the agent vector to each feature vector in the final feature matrix is obtained as c_{j,k} = beta_{j,k} (f_k W_V), wherein beta_{j,k} represents the attention weight between the j-th agent vector in the second set of agent vectors B and the k-th feature vector of the final feature matrix F, f_k represents the k-th feature vector of the final feature matrix, and W_V represents a learnable parameter matrix; according to the contribution degrees, the feature-enhanced representation of all agent vectors in the second set B for each feature vector is acquired as z_k = sum_j c_{j,k}; the feature-enhanced representations z_k of the feature vectors are spliced to obtain the feature vector fusion vector.
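Under the same assumptions as the previous sketch, the second-stage fusion, a second agent set attending over the spliced three-row feature matrix followed by splicing of the enhanced representations, might look like this (the pooling-free splice at the end is one plausible reading of the claim):

```python
import numpy as np

def fuse_features(name_vec, token_vec, graph_vec, agents, Wv):
    """Second-stage fusion: splice the three optimized feature vectors into
    a final feature matrix, let the second agent set attend over its rows,
    then splice the enhanced agent representations into one fusion vector."""
    F = np.stack([name_vec, token_vec, graph_vec])   # final feature matrix (3, d)
    scores = agents @ F.T / np.sqrt(F.shape[1])      # (n_agents, 3)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                # weights over feature vectors
    Z = w @ (F @ Wv)                                 # enhanced representations
    return Z.reshape(-1)                             # splice into fusion vector
```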
- 8. The code search method based on the double-order feature optimization mechanism as claimed in claim 7, wherein the method for obtaining the optimal code search model in S3 is as follows: S31, the data sample set is randomly divided into a training set and a verification set according to a preset proportion; S32, the constructed code search model based on the double-order feature optimization mechanism is trained on the training set to obtain a trained code search model; S33, model verification is performed on the trained code search model through the verification set based on a loss function, wherein the loss function comprises either a cross entropy loss or a contrastive learning loss function; whether the output of the trained code search model converges is judged; if the output converges, the trained code search model is confirmed to be the optimal code search model; otherwise, the weight parameters of the trained code search model are adaptively adjusted based on the back propagation method, and step S32 is repeatedly executed until the weight parameters of the trained code search model whose output has converged are confirmed to be the optimal weight parameters, with which the code search model is reconstructed to obtain the optimal code search model.
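The S31-S33 train/validate/converge loop can be sketched generically; the step function, validation function and tolerance are placeholders, not the patent's actual training procedure:

```python
def train_until_converged(step_fn, params, val_fn, tol=1e-4, max_epochs=100):
    """Repeat the training step (back-propagation assumed inside step_fn),
    check the validation loss after each epoch, and stop once consecutive
    losses differ by less than tol (a simple convergence test)."""
    prev = float('inf')
    for epoch in range(max_epochs):
        params = step_fn(params)     # S32: one training pass
        loss = val_fn(params)        # S33: validation loss
        if abs(prev - loss) < tol:   # output has converged
            return params, epoch
        prev = loss
    return params, max_epochs
```

A toy run with a shrinking parameter (halving each step against a quadratic loss) shows the loop terminating well before the epoch cap.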
Description
Code searching method based on double-order feature optimization mechanism
Technical Field
The invention relates to the technical field of code searching, in particular to a code searching method based on a double-order feature optimization mechanism.
Background
Code search is defined as returning, from a massive corpus of code fragments, the code fragments that are semantically similar to a natural language query sentence; it is a hot research direction in the field of software engineering. According to research, in the software development process developers spend about 19% of their time searching the network for code similar to their development purpose as a reference for reuse. The purpose of code reuse is to modify existing code into usable code suitable for the developer's own project according to specific requirements; reusing searched high-quality project development code can greatly improve the production efficiency of software development and the development quality of projects, and code search is the means of finding such usable code, thereby assisting software developers in realizing code reuse. The initial stage of code search research performed retrieval by comparing the textual word similarity between code segments and natural language queries; however, since a programming language is usually a highly structured language whose semantic representation differs greatly from natural language, code search methods based on text similarity matching have low search efficiency. With the continuous development of information retrieval technology, some researchers proposed adding information retrieval technologies such as BM25 on top of text similarity comparison to improve the accuracy of code search results, but the accuracy of this approach remains low when retrieving semantically rich code fragments.
With the continuous development of deep learning technology in the field of natural language processing, more and more researchers use deep learning algorithms to learn the feature information between codes and natural language descriptions for code search. In recent years, the more classical multi-mode-based code search algorithms in the code search research field include an early deep-learning algorithm together with MMAN and G2SC; the three code search algorithms are specifically described below. The early algorithm was the first to apply deep learning to the code search scene: it uses a deep learning feature learning model to learn the feature information of codes and natural language descriptions, including the code Token sequence, the method name and the code API sequence, generates the high-dimensional feature vector corresponding to the code and the feature sequence corresponding to the natural language description, and performs code search by comparing the cosine similarity between the vectors. Compared with code searching based on text matching and on information retrieval, code searching based on deep learning greatly improves the accuracy of the search results. MMAN learns the semantic feature information of the code while also learning the rich structural feature information in the code: the MMAN model extracts the code control flow graph and uses a GNN (Graph Neural Networks) model to learn the control flow structure information among the code nodes in the control flow graph, so that the model can capture the rich structural feature information in the code during feature learning.
The Chinese patent CN 115268869 proposes a graph-sequence conversion mechanism, G2SC, which can convert a code graph structure into a sequence representation and uses a bidirectional LSTM (Long Short-Term Memory) model to learn the forward and backward dependency relationships of the code graph structure sequence, so as to learn richer code semantic feature information and structural feature information. However, the current code search algorithms have the following problems: the early deep-learning method only considers the three kinds of semantic feature information of the code Token sequence, the code API sequence and the code method name when learning code features, and does not consider the unique structural feature information in the code. MMAN, on this basis, extracts the code control flow graph and learns the control flow relations within it using a graph neural network model, so that both semantic feature information and code structure feature information are considered. However, the graph neural network does not capture the code control flow relations well, so the improvement in the accuracy of code search results is not obvious.