CN-121996290-A - Cross-language clone code detection method for pre-training model fine tuning and triplet learning
Abstract
The invention discloses a cross-language clone code detection method that combines pre-trained model fine-tuning with triplet learning. Traditional code clone detection methods are often limited by the grammatical and structural differences between languages and struggle to detect cross-language clones effectively. The method first uses a pre-trained model to produce a preliminary high-dimensional vector representation of source code written in different programming languages, and optimizes the pre-trained model through a fine-tuning step so that it adapts to the data distribution and characteristics of the specific task. A triplet learning strategy is then applied: triplets containing an anchor, a positive, and a negative sample are constructed, and the model is further trained to capture similarities and differences between source code fragments. The method improves the precision and recall of cross-language code clone detection, realizes a unified code representation across languages, and provides an efficient and reliable solution to the cross-language code clone detection problem.
Inventors
- FANG YONG
- DENG HUAXIN
- ZHANG QIANG
- ZHOU FANGZHENG
- YUAN LISHA
Assignees
- Sichuan University (四川大学)
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2024-11-05
Claims (4)
- 1. A cross-language clone code detection method for pre-trained model fine-tuning and triplet learning, characterized by comprising the following steps: A. a data preprocessing part, in which source code is normalized to reduce the influence of cross-language code differences and a data set is constructed; B. a code vector representation part, in which fine-tuning is applied to the source code obtained in step A and a fixed-length vector is output as the feature representation of the corresponding source code; C. a triplet learning part, in which anchor, positive, and negative sample triplets are selected from the vectors obtained in step B, the triplet data set is input into the model, and the model is trained to distinguish the similarities and differences between different code fragments.
- 2. The cross-language clone code detection method for pre-trained model fine-tuning and triplet learning according to claim 1, wherein the data preprocessing part in step A comprises the following steps: A1, code formatting: a unified formatting operation is applied to code in different languages to ensure consistency in line breaks, indentation, and spacing; formatting reduces diversity in code representation and makes similarity detection more accurate; A2, comment removal: comments and documentation, including single-line comments, multi-line comments, and documentation comments, are removed; comments are generally irrelevant to code logic, so their removal reduces unnecessary textual interference and improves the accuracy of code similarity detection; A3, identifier normalization: identifiers in the code are normalized by unified renaming, hash replacement, masking, or similar methods, so that the logical consistency of the code is preserved without being affected by specific identifier names.
- 3. The cross-language clone code detection method for pre-trained model fine-tuning and triplet learning according to claim 1, wherein the code representation part in step B comprises the following steps: B1, token processing: a source code fragment is decomposed into minimal semantic units, i.e., the code is divided into basic elements such as keywords, variable names, operators, and literals; this step helps standardize the code structure of different programming languages and ensures that basic syntactic units can be compared and analyzed across languages; B2, batch division: batches containing aggregated sequences are generated and the sequences are combined into several mini-batches, where each batch contains sequence pairs belonging to the same task but written in different programming languages, as well as sequence pairs belonging to different tasks and written in different programming languages; B3, pre-trained model embedding: token sequences or AST representations are converted into high-dimensional vectors using a pre-trained code embedding model (for example, UniXcoder); through learning from large-scale code data, the pre-trained model can capture the deep syntactic and semantic information of code and thus generate vector representations with rich semantic features.
- 4. The cross-language clone code detection method for pre-trained model fine-tuning and triplet learning according to claim 1, wherein the triplet learning part in step C comprises the following steps: C1, triplet construction: anchor, positive, and negative sample triplets are selected for triplet learning, where the positive sample and the anchor sample are similar code fragments and the negative sample is a dissimilar code fragment, which strengthens the discriminative ability of the model; C2, triplet loss calculation: the similarities among the anchor, positive, and negative samples are computed using a cosine-margin triplet loss function (Triplet Loss); C3, code clone detection: the cross-language code fragments to be detected are input, vector representations are generated with the optimized model, similarity measures between code fragments are computed with a similarity metric, and cross-language code clone instances are identified.
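The preprocessing steps A1–A3 of claim 2 can be sketched as follows. This is a minimal, regex-based Python sketch (the patent does not specify an implementation); the keyword list is a small hypothetical subset standing in for full per-language lexing, and a production system would use a language-aware parser so string literals are not mangled.

```python
import re


def remove_comments(code: str) -> str:
    """Step A2: strip common comment styles (/* ... */, // ..., # ...)."""
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)  # C-style block comments
    code = re.sub(r"//[^\n]*", "", code)                    # C-style line comments
    code = re.sub(r"#[^\n]*", "", code)                     # Python-style comments
    return code


def normalize_identifiers(code: str) -> str:
    """Step A3: replace user-defined identifiers with placeholders ID0, ID1, ..."""
    # Hypothetical keyword subset; a real system would use per-language keyword sets.
    keywords = {"def", "return", "if", "else", "for", "while",
                "int", "public", "static", "void"}
    mapping: dict[str, str] = {}

    def repl(m: re.Match) -> str:
        name = m.group(0)
        if name in keywords:
            return name
        if name not in mapping:
            mapping[name] = f"ID{len(mapping)}"
        return mapping[name]

    return re.sub(r"[A-Za-z_][A-Za-z0-9_]*", repl, code)


def preprocess(code: str) -> str:
    """Steps A1-A3 combined: remove comments, normalize identifiers,
    and collapse whitespace for formatting consistency."""
    return " ".join(normalize_identifiers(remove_comments(code)).split())
```

After preprocessing, functionally similar Java and Python fragments share the same placeholder identifiers, which reduces surface-level divergence before embedding.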
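Claim 4 names a cosine-margin triplet loss (step C2) but does not give it in closed form. A common formulation, sketched here in plain Python with an assumed margin of 0.4, penalizes triplets in which the anchor is not at least `margin` more cosine-similar to the positive than to the negative:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv)


def cosine_margin_triplet_loss(anchor, positive, negative, margin=0.4):
    """One common cosine-margin triplet loss (assumed form, not quoted
    from the patent): zero when cos(a, p) exceeds cos(a, n) by at least
    `margin`, positive otherwise."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))
```

Minimizing this loss over many triplets pushes anchor–positive similarity up and anchor–negative similarity down, which is what step C3 relies on at detection time.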
Description
Cross-language clone code detection method for pre-training model fine tuning and triplet learning
Technical Field
The invention belongs to the fields of computer science and software engineering, and in particular relates to cross-language code clone detection technology. With the popularity of multi-language collaboration in modern software development, reuse and conversion of source code between different programming languages is a common requirement. Developing a technology that can effectively identify and detect cross-language code clones is therefore of great significance for improving software development efficiency, maintaining software quality, and preventing code repetition and redundancy. The invention achieves efficient detection of cross-language code clones by combining a pre-trained model with triplet learning, and provides advanced technical support for the field of software engineering.
Background
Against the background of today's rapid development of information technology, software development is becoming increasingly complex and diverse. Source code reuse and cross-language collaboration have become commonplace in modern software engineering. Different programming languages have unique grammatical, structural, and functional advantages, so mixing multiple programming languages within one project can greatly improve development efficiency and flexibility. However, this also presents a significant challenge for cross-language source code similarity detection. Identifying and detecting code clones between different programming languages is an important task for improving software quality, optimizing code reuse, and preventing code redundancy. Most traditional code clone detection methods target a single programming language and have difficulty handling code similarity in a multi-language environment.
As software projects continue to grow in scale, the complexity and diversity of code keeps increasing, and single-language clone detection tools struggle when faced with multilingual codebases. Cross-language code clone detection technology was developed to address code similarity between different programming languages and to ensure the consistency and reusability of code across language environments. In recent years, the rapid development of deep learning has provided new ideas and methods for cross-language code clone detection. Pre-trained models have achieved great success in the field of Natural Language Processing (NLP); they can capture complex semantic and structural information in language and provide high-quality semantic representations. Applied to the code domain, a pre-trained model can effectively extract the syntactic and semantic information of source code, laying a solid foundation for cross-language code clone detection. The invention provides a cross-language code clone detection method based on pre-trained model fine-tuning and triplet learning, and aims to overcome the shortcomings of traditional methods in cross-language code detection. By training on a large-scale code dataset, the pre-trained model can capture commonalities and differences between programming languages and generate a unified code representation. On this basis, the model is further optimized through fine-tuning, so that it better adapts to the code characteristics of specific languages and its generalization ability in a cross-language environment is enhanced.
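The embed-then-match pipeline described above (tokenize, embed, compare by similarity) can be sketched with a deliberately simplified stand-in encoder. The bag-of-tokens embedding below is a hypothetical simplification standing in for a pre-trained transformer such as UniXcoder, which would instead produce contextual vectors; the 0.8 clone threshold is likewise an assumed illustrative value.

```python
import math
import re
from collections import Counter


def tokenize(code: str) -> list[str]:
    """Step B1 analogue: split code into identifiers, numbers, and symbols."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*|\d+|\S", code)


def embed(code: str) -> Counter:
    """Step B3 stand-in: a sparse bag-of-tokens vector. A real system
    would mean-pool the contextual outputs of a pre-trained encoder
    such as UniXcoder instead of counting tokens."""
    return Counter(tokenize(code))


def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def is_clone(code_a: str, code_b: str, threshold: float = 0.8) -> bool:
    """Step C3 analogue: flag a clone when embedding similarity
    exceeds an (assumed) threshold."""
    return similarity(embed(code_a), embed(code_b)) >= threshold
```

In the patent's pipeline the fine-tuned, triplet-trained encoder replaces `embed`, so that functionally equivalent fragments in different languages land close together even when their token sets differ.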
Triplet learning is an effective similarity learning method: by constructing triplets comprising an anchor, a positive, and a negative sample, the model can learn the similarities and differences between code fragments, thereby improving the accuracy of code clone detection. In cross-language code clone detection, triplet learning helps the model better understand and identify code similarities between programming languages. Although much research in cross-language source code similarity detection has made progress, many problems and challenges remain. (1) Complexity of cross-language differences: significant grammatical and structural differences exist between programming languages, which vary in design, including naming rules, statement structure, data types, control flow, and so on. These differences complicate code similarity detection between languages. For example, Java and Python differ fundamentally in syntactic structure and code style, so effectively building a similarity mapping between the two is a challenge. (2) Difficulty of code semantic understanding: cross-language code similarity detection not only needs to consider the grammar of the code, but also has to understand the semantics of the code