CN-114296792-B - Code annotation generation method and device
Abstract
The application discloses a code annotation generation method and device. Source code of a first project is obtained and compiled into bytecode, a control flow graph is generated from the bytecode, and the bytecode and the control flow graph are input into an annotation translation model to obtain the annotation of the source code output by the annotation translation model. In this scheme, the source code of a second project is likewise converted into bytecode and a control flow graph, the annotation translation model is trained on the bytecode and control flow graph of the second project with the annotations of the second project's source code as labels, and the model then outputs annotations for the source code.
Inventors
- He Kunning
- Chen Xiangping
- Huang Yuan
- Zhou Xiaocong
- Zheng Zibin
Assignees
- Sun Yat-sen University (中山大学)
Dates
- Publication Date: 2026-05-08
- Application Date: 2021-12-30
Claims (8)
- 1. A code annotation generation method, comprising: obtaining source code of a first project, wherein the source code of the first project is source code of a project written in a language that can run on a virtual machine; compiling the source code into bytecode, wherein the bytecode is code that can be read and executed by a virtual machine and comprises a code region, a local variable table, an exception table, a code-line offset mapping table, and a constant pool; generating a control flow graph from the bytecode, wherein the control flow graph represents, in graph form, the possible flow of execution among all basic blocks of a program; and inputting the bytecode and the control flow graph into an annotation translation model to obtain the annotation of the source code output by the annotation translation model; wherein the annotation translation model is trained with the bytecode of a second project and the control flow graph of that bytecode as training samples and the annotations of the source code of the second project as training labels; before the training of the annotation translation model, the method further comprises: preprocessing the sample bytecode and the annotations to obtain new sample bytecode and new annotations, and taking the new sample bytecode and the control flow graph as training samples and the new annotations as training labels; wherein preprocessing the sample bytecode and the annotations to obtain the new sample bytecode and the new annotations comprises: obtaining a local variable table, wherein the local variable table is a locally stored table recording the user's annotation habits; performing noise removal on the code region of the sample bytecode to obtain a cleaned code region; combining the cleaned code region with the local variable table to obtain the new sample bytecode; splitting each annotation into a plurality of words according to a preset separation rule; and converting the form of each word into a preset form to obtain words in the preset form, which are composed into a new annotation.
- 2. The method of claim 1, wherein the annotation translation model comprises a position encoding layer, an encoder, a decoder, a fully connected layer, and a GRU encoding layer; and the training of the annotation translation model comprises: position-encoding the acquired sample bytecode and sample control flow graph through the position encoding layer to obtain a language sequence; processing the sample bytecode through the encoder to obtain encoding features; generating, through the decoder, a decoding result corresponding to the encoding features according to the language sequence; processing the sample control flow graph through the GRU encoding layer to obtain a feature vector; generating the annotation from the decoding result and the feature vector through the fully connected layer; calculating, with a loss function, the difference between the generated annotation and the label corresponding to the sample bytecode and sample control flow graph to obtain a translation loss value; and updating the parameters of the annotation translation model according to the translation loss value.
- 3. The method of claim 2, wherein generating the annotation from the decoding result and the feature vector comprises: based on a preset vocabulary, predicting and generating each word in the annotation by the following formula: P({y_i | y_1, …, y_{i-1}}, x_c, x_v) = softmax(W · Concat(Transformer({y_1, …, y_{i-1}}, x_c), GRU(x_v)) + b), where P represents the probability that the predicted next word is y_i, y_i represents the i-th word in the generated decoding result, {y_1, …, y_{i-1}} represents the decoding result, x_c represents the bytecode, x_v represents the control flow graph, W represents the weight matrix of the fully connected layer, b represents the bias matrix of the fully connected layer, GRU(x_v) represents the feature vector of the control flow graph, softmax is the activation function, P_i' represents P with the label smoothing parameter added, K represents the total number of words in the preset vocabulary, and ε is a preset parameter.
- 4. The method according to claim 3, wherein the loss function comprises: where L represents the loss function.
- 5. The method of claim 1, wherein generating the control flow graph from the bytecode comprises: decompiling the bytecode into three-address code; and dividing the three-address code into a plurality of basic blocks and connecting the basic blocks according to the writing order of the bytecode to generate the control flow graph.
- 6. The method according to claim 1, further comprising, after converting the form of each word into the preset form to obtain the words in the preset form and composing the new annotation: de-duplicating the new sample bytecode and the new annotations to obtain de-duplicated new sample bytecode and new annotations.
- 7. A code annotation generation apparatus, comprising: a source code acquisition unit configured to acquire source code of a first project, wherein the source code of the first project is source code of a project written in a language that can run on a virtual machine; a compiling unit configured to compile the source code into bytecode, wherein the bytecode is code that can be read and executed by a virtual machine and comprises a code region, a local variable table, an exception table, a code-line offset mapping table, and a constant pool; a control flow graph generation unit configured to generate a control flow graph from the bytecode, wherein the control flow graph represents, in graph form, the possible flow of execution among all basic blocks of a program; and a model output unit configured to input the bytecode and the control flow graph into an annotation translation model to obtain the annotation of the source code output by the annotation translation model; wherein the annotation translation model is trained with the bytecode of a second project and the control flow graph of that bytecode as training samples and the annotations of the source code of the second project as training labels; before the training of the annotation translation model, the sample bytecode and the annotations are preprocessed to obtain new sample bytecode and new annotations, and the new sample bytecode and the control flow graph are taken as training samples and the new annotations as training labels; wherein preprocessing the sample bytecode and the annotations to obtain the new sample bytecode and the new annotations comprises: obtaining a local variable table, wherein the local variable table is a locally stored table recording the user's annotation habits; performing noise removal on the code region of the sample bytecode to obtain a cleaned code region; combining the cleaned code region with the local variable table to obtain the new sample bytecode; splitting each annotation into a plurality of words according to a preset separation rule; and converting the form of each word into a preset form to obtain words in the preset form, which are composed into a new annotation.
- 8. The apparatus of claim 7, wherein the control flow graph generation unit comprises: a first control flow graph generation subunit configured to decompile the bytecode into three-address code; and a second control flow graph generation subunit configured to divide the three-address code into a plurality of basic blocks and connect the basic blocks according to the writing order of the bytecode to generate the control flow graph (a minimal sketch of this step follows the claims).
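As a minimal sketch of the control-flow-graph step described in claims 5 and 8 (not the patented implementation), the following Python code splits a list of three-address instructions into basic blocks at jump targets and jump instructions, then connects the blocks in write order and along branch edges. The instruction tuple format and the BasicBlock helper are illustrative assumptions, not part of the original text.

```python
# Illustrative sketch: build a control flow graph from three-address code.
# The (op, target) instruction format is an assumption made for this example.
from dataclasses import dataclass, field

@dataclass
class BasicBlock:
    instructions: list = field(default_factory=list)
    successors: list = field(default_factory=list)   # indices of successor blocks

def build_cfg(three_address_code):
    """three_address_code: list of (op, target) tuples; target is a jump index or None."""
    # 1. Find leaders: the first instruction, every jump target, every instruction after a jump.
    leaders = {0}
    for i, (op, target) in enumerate(three_address_code):
        if op in ("goto", "if"):                      # branching instructions
            if target is not None:
                leaders.add(target)
            if i + 1 < len(three_address_code):
                leaders.add(i + 1)
    leaders = sorted(leaders)

    # 2. Cut the code into basic blocks at each leader, keeping the writing order.
    blocks, start_to_block = [], {}
    for n, start in enumerate(leaders):
        end = leaders[n + 1] if n + 1 < len(leaders) else len(three_address_code)
        start_to_block[start] = n
        blocks.append(BasicBlock(instructions=three_address_code[start:end]))

    # 3. Connect blocks: fall-through edges follow write order, branch edges follow jump targets.
    for n, start in enumerate(leaders):
        end = leaders[n + 1] if n + 1 < len(leaders) else len(three_address_code)
        op, target = three_address_code[end - 1]
        if op != "goto" and n + 1 < len(blocks):      # fall through to the next block in write order
            blocks[n].successors.append(n + 1)
        if op in ("goto", "if") and target is not None:
            blocks[n].successors.append(start_to_block[target])
    return blocks
```

In the scheme of the claims, the resulting graph would then be encoded by the GRU encoding layer of claim 2 to produce the control-flow feature vector.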
Description
Code annotation generation method and device

Technical Field
The application relates to the field of code translation, and in particular to a code annotation generation method and device.

Background
With the development of science and technology, many computer programming languages have appeared. Code annotation is one of the important means of helping programmers understand code: annotations usually describe the code in natural language, express the programmer's intent and the implementation details of the code, and are of great significance for program comprehension, software maintenance, and the like. Conventional code annotation generation methods mainly use information retrieval techniques to select appropriate terms from the original code segment as a summary, or use code cloning techniques to retrieve similar code segments from a code library and generate new annotations based on the annotations of those similar segments. These methods rely heavily on semantic information such as identifiers, and their performance degrades greatly when facing projects written in different languages, so that source code in each language must be analyzed separately and a dedicated model trained for it. The learning cost is therefore high, and how to reduce this learning cost is a concern.

Disclosure of Invention
In view of the above, the present application provides a code annotation generation method and apparatus for reducing the learning cost. To achieve the above object, the following solutions are proposed:

A code annotation generation method, comprising: acquiring source code of a first project; compiling the source code into bytecode; generating a control flow graph from the bytecode; and inputting the bytecode and the control flow graph into an annotation translation model to obtain the annotation of the source code output by the annotation translation model; wherein the annotation translation model is trained with the bytecode of a second project and the control flow graph of that bytecode as training samples and the annotations of the source code of the second project as training labels.

Optionally, the annotation translation model includes a position encoding layer, an encoder, a decoder, a fully connected layer, and a GRU encoding layer, and the training of the annotation translation model comprises: position-encoding the acquired sample bytecode and sample control flow graph through the position encoding layer to obtain a language sequence; processing the sample bytecode through the encoder to obtain encoding features; generating, through the decoder, a decoding result corresponding to the encoding features according to the language sequence; processing the sample control flow graph through the GRU encoding layer to obtain a feature vector; generating the annotation from the decoding result and the feature vector through the fully connected layer; calculating, with a loss function, the difference between the generated annotation and the label corresponding to the sample bytecode and sample control flow graph to obtain a translation loss value; and updating the parameters of the annotation translation model according to the translation loss value.
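The following is a minimal PyTorch sketch of the kind of model the training process above describes: a Transformer encoder-decoder over the bytecode token sequence, a GRU over a serialized control flow graph, and a fully connected layer that fuses the two before the softmax. All layer sizes, the shared vocabulary, and the way the control flow graph is serialized into tokens are assumptions made for illustration; they are not specified by the text above.

```python
# Minimal sketch (assumed hyperparameters, shared vocabulary) of a bytecode-to-annotation model:
# Transformer encoder-decoder for bytecode tokens + GRU encoding layer for the control flow graph.
import torch
import torch.nn as nn

class AnnotationTranslationModel(nn.Module):
    def __init__(self, vocab_size, d_model=256, nhead=8, num_layers=4, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)        # learned position encoding layer
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.gru = nn.GRU(d_model, d_model, batch_first=True)   # GRU encoding layer for the CFG
        self.fc = nn.Linear(2 * d_model, vocab_size)             # fully connected layer

    def forward(self, bytecode_tokens, cfg_tokens, decoded_prefix):
        # Position-encode the bytecode and the already-generated annotation prefix.
        def encode(x):
            positions = torch.arange(x.size(1), device=x.device)
            return self.embed(x) + self.pos(positions)
        memory_in = encode(bytecode_tokens)                 # sample bytecode -> encoder input
        target_in = encode(decoded_prefix)                  # previously generated words -> decoder input
        decoded = self.transformer(memory_in, target_in)    # decoding result (causal mask omitted for brevity)
        _, cfg_feature = self.gru(encode(cfg_tokens))       # CFG feature vector = last GRU hidden state
        cfg_feature = cfg_feature[-1].unsqueeze(1).expand(-1, decoded.size(1), -1)
        fused = torch.cat([decoded, cfg_feature], dim=-1)   # Concat(decoding result, CFG feature vector)
        return self.fc(fused)                               # logits; softmax/cross-entropy applied in the loss
```

During training, the cross-entropy between these logits and the reference annotation, with label smoothing as in claims 3 and 4, would provide the translation loss used to update the model parameters.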
Optionally, generating the annotation from the decoding result and the feature vector includes predicting and generating each word of the annotation, based on a preset vocabulary, by the following formula:

P({y_i | y_1, …, y_{i-1}}, x_c, x_v) = softmax(W · Concat(Transformer({y_1, …, y_{i-1}}, x_c), GRU(x_v)) + b)

where P represents the probability that the predicted next word is y_i, y_i represents the i-th word in the generated decoding result, {y_1, …, y_{i-1}} represents the decoding result, x_c represents the bytecode, x_v represents the control flow graph, W represents the weight matrix of the fully connected layer, b represents the bias matrix of the fully connected layer, GRU(x_v) represents the feature vector of the control flow graph, softmax is the activation function, P_i' represents P with the label smoothing parameter added, K represents the total number of words in the preset vocabulary, and ε is a preset parameter.

Optionally, the loss function includes: where L represents the loss function.

Optionally, generating the control flow graph from the bytecode includes: decompiling the bytecode into three-address code; and dividing the three-address code into a plurality of basic blocks and connecting the basic blocks according to the writing order of the bytecode to generate the control flow graph.

Optionally, before the training of the annotation translation model, the method further includes: preprocessing the sample bytecode and the annotations to obtain new sample bytecode and new annotations; and taking the new sample bytecode and the control flow graph as training samples and the new annotations as training labels.

Optionally, preprocessing the sample bytecode and
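To make the word-prediction formula above concrete, here is a small Python sketch that fuses a decoder state and a control-flow-graph feature vector exactly as the formula does and then applies label smoothing. The dimensions, the weight matrix W and bias b, and the smoothing form P_i' = (1 - ε)·P_i + ε/K are assumptions for illustration; the original text names the smoothing parameter but does not spell out the smoothing formula.

```python
# Illustrative sketch of the word-prediction formula and of label smoothing.
# W, b, the feature sizes, and the smoothing form are assumptions for this example.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def predict_next_word(decoder_state, cfg_feature, W, b):
    """P({y_i | y_1..y_{i-1}}, x_c, x_v) = softmax(W . Concat(decoder_state, cfg_feature) + b)."""
    fused = np.concatenate([decoder_state, cfg_feature])   # Concat(Transformer(...), GRU(x_v))
    return softmax(W @ fused + b)                           # probability over the preset vocabulary

def smooth_distribution(p, epsilon=0.1):
    """Assumed standard label smoothing: P_i' = (1 - eps) * P_i + eps / K, K = vocabulary size."""
    K = p.shape[0]
    return (1 - epsilon) * p + epsilon / K

# Usage with toy dimensions (vocabulary of 5 words, 4-dim decoder state, 3-dim CFG feature vector).
rng = np.random.default_rng(0)
W, b = rng.normal(size=(5, 7)), np.zeros(5)
p = predict_next_word(rng.normal(size=4), rng.normal(size=3), W, b)
print(smooth_distribution(p))   # smoothed probabilities P_i' over the 5-word vocabulary
```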