
CN-122018915-A - Source code similarity detection method and system based on deep learning


Abstract

The invention discloses a deep-learning-based method and system for detecting source code similarity, relating to the technical field of similarity detection. Two sections of source code to be detected are obtained, and a code tree is constructed for each. Each code tree is decomposed into a plurality of sub-tree sequences based on statement identifiers, and a transmission path channel is constructed for each sub-tree sequence. A data operation pool is established, and the transmission path channel of each sub-tree sequence is connected to the pool through a port. The data operation pool processes each sub-tree sequence into a corresponding sequence image, the sequence images are compared with one another, and the degree of similarity between the two sections of source code is obtained from the comparison results of all sequence images corresponding to the same code tree. The method thereby helps to markedly improve the efficiency, precision, and visual intuitiveness of source code similarity detection.
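
As a rough illustration of the decomposition step described in the abstract, the sketch below groups statement-level subtrees of two code samples by statement type. It is not the patented procedure: Python's standard `ast` module stands in for the code tree, the AST node class name stands in for a statement identifier, and the function name `subtree_sequences` and the two sample snippets are hypothetical.

```python
import ast
from collections import defaultdict


def subtree_sequences(source_a: str, source_b: str) -> dict:
    """Group statement-level subtrees of two sources by statement type.

    Assumption: the Python AST stands in for the code tree, and the AST
    node class name (e.g. "Assign", "If") stands in for a statement identifier.
    """
    sequences = defaultdict(lambda: {"a": [], "b": []})
    for label, source in (("a", source_a), ("b", source_b)):
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.stmt):
                # One subtree per statement, stored as a textual dump.
                sequences[type(node).__name__][label].append(ast.dump(node))
    return dict(sequences)


# Hypothetical sample inputs.
code_a = "x = 1\nif x:\n    y = x + 1\n"
code_b = "total = 0\nfor i in range(3):\n    total = total + i\n"

sequences = subtree_sequences(code_a, code_b)
for identifier, pair in sorted(sequences.items()):
    print(identifier, len(pair["a"]), len(pair["b"]))
```

Each key of the result plays the role of one sub-tree sequence: it holds the subtrees of that statement type from both target codes, so every sequence could then be handed to its own channel for further processing.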

Inventors

  • ZHANG WEIGUO
  • YANG XIAOBIN
  • WANG HONGYAN
  • QIN QIAN

Assignees

  • 宁夏凯信特信息科技有限公司

Dates

Publication Date: 2026-05-12
Application Date: 2026-01-19

Claims (9)

  1. A deep-learning-based method for detecting source code similarity, characterized by comprising the following steps: step S1, acquiring two sections of source code to be detected, constructing a code tree corresponding to each section of source code, decomposing the code trees into a plurality of sub-tree sequences based on statement identifiers, and constructing a respective transmission path channel for each sub-tree sequence; step S2, establishing a data operation pool, connecting the transmission path channel corresponding to each sub-tree sequence to the data operation pool through a port, and processing, by the data operation pool, each sub-tree sequence into a corresponding sequence image; and step S3, comparing the different sequence images with one another, and obtaining the degree of similarity between the two sections of source code based on the comparison results of all sequence images corresponding to the same code tree.
  2. The deep-learning-based source code similarity detection method according to claim 1, wherein the process of acquiring two sections of source code to be detected and constructing a code tree corresponding to each section of source code comprises: selecting two sections of source code as the target codes to undergo similarity detection; setting a start reading point and a termination reading point for each target code, setting each code segment of the target code as a code item, setting code extraction characters, constructing a root node for each of the two target codes, mapping the code item at the start reading point to the root node, and traversing with the code extraction characters from the root node along the code segments of the target code until every code extraction character has been traversed, thereby obtaining a corresponding tree path; and splitting the tree path at each position corresponding to a code extraction character, marking the split position as a secondary root node, and processing the current tree path into tree paths under two branches that share the same root, until the code extraction characters have traversed to the termination reading point of the target code, thereby constructing a code tree consisting of one root node, a plurality of secondary root nodes, and a plurality of tree paths.
  3. The deep-learning-based source code similarity detection method according to claim 2, wherein the process of decomposing the code trees into a plurality of sub-tree sequences based on statement identifiers and constructing a respective transmission path channel for each sub-tree sequence comprises: using statement identifiers to represent statement fragments of different types, and processing the code tree of each target code with statement identifiers of a plurality of types, thereby obtaining, for each code tree, the code subtrees represented by each statement identifier; classifying the code subtrees that share the same type of statement identifier in the two target codes into one sub-tree sequence, obtaining a plurality of sub-tree sequences from the classification over the plurality of statement identifier types, and setting a corresponding number of transmission channels based on the sub-tree sequences; setting a plurality of routing points and analysis points on the transmission channel of each sub-tree sequence, capturing, at each routing point, behavior events on the transmission channel of the corresponding sub-tree sequence, analyzing each behavior event at the analysis point in the corresponding position, judging whether the behavior event is an abnormal event, and obtaining a safe path based on the judgment result; and, when each routing point and each transmission channel segment formed between routing points has been marked as a safe path, integrating all safe paths belonging to the same transmission channel and converting that transmission channel into a transmission path channel, each transmission path channel being used to perform the secure transmission of its corresponding sub-tree sequence.
  4. The deep-learning-based source code similarity detection method according to claim 3, wherein the process of judging whether the behavior event is an abnormal event and obtaining a safe path based on the judgment result comprises: if the behavior event is an abnormal event, determining a defensive measure for the abnormal event, and broadcasting the abnormal event and the corresponding defensive measure from the current routing point to the other transmission channels, whether identical or different, where they are received by the routing points on those channels that have not yet been processed; when a routing point that has received the broadcast captures a behavior event, comparing the captured behavior event with the previously received abnormal event, and, if they are consistent, directly invoking the defensive measure corresponding to the matched event, or, if they are inconsistent, having the analysis point located at the same position as the current routing point perform data analysis of the behavior event and determine the corresponding defensive measure; and, if the behavior event is not an abnormal event, marking the transmission channel segment formed between the current routing point and the next adjacent routing point as a safe path.
  5. The deep-learning-based source code similarity detection method according to claim 4, wherein the process of establishing the data operation pool comprises: deploying a resource pool and connecting it to a cloud server, creating a plurality of operation partitions in the resource pool, and arranging a data operation point and a data index point in each operation partition, wherein the data operation point is used to obtain one type of code-imaging processing architecture and to perform imaging processing on the code data stored in the current operation partition based on the obtained code-imaging processing architecture, and the data index point is used to build an architecture multiplexing grid among different data operation points; the architecture multiplexing grid comprises a complete multiplexing sub-grid and a partial iteration sub-grid, wherein the complete multiplexing sub-grid is used to invoke the corresponding imaging processing procedure across different operation partitions whose architectures are completely identical, and the partial iteration sub-grid is used to invoke the imaging processing procedure across different operation partitions whose architectures are partially identical and to iterate on the differing parts of the architecture based on the current code-imaging processing architecture; and, when the data operation point of every operation partition in the resource pool has obtained its corresponding code-imaging processing architecture, taking the resource pool as a whole as the data operation pool.
  6. The deep-learning-based source code similarity detection method according to claim 5, wherein the process of connecting the transmission path channel corresponding to each sub-tree sequence to the data operation pool through a port comprises: setting a port receiving address for each operation partition, obtaining the channel address of the transmission path channel corresponding to each sub-tree sequence as a port uploading address, and having each transmission path channel initiate a data upload request to the data operation pool; and, after the data operation pool receives a data upload request, associating the port receiving address of an operation partition in an idle state with the port uploading address that initiated the request, establishing a communication association between the corresponding transmission path channel and the data operation pool, and sending the sub-tree sequence through the transmission path channel to the associated operation partition for storage.
  7. The deep-learning-based source code similarity detection method according to claim 6, wherein the process of the data operation pool processing each sub-tree sequence into a corresponding sequence image comprises: in each operation partition, decomposing the sub-tree sequence into a plurality of sub-sequence objects; and providing a compiling unit, a notification unit, and a scheduling unit at each sub-sequence object, wherein the compiling unit is used to compile and analyze the corresponding sub-sequence object under the code-imaging processing architecture and to convert the partial code fragment corresponding to the sub-sequence object into a sequence image; the notification unit is used to notify the other sub-sequence objects under the same sub-tree sequence that have not yet been compiled and analyzed of the result data obtained from compiling and analyzing any one sub-sequence object; and the scheduling unit is used to obtain the result data for the sub-sequence objects that have not been compiled and analyzed and to copy the corresponding sequence images.
  8. The deep-learning-based source code similarity detection method according to claim 7, wherein the process of comparing the different sequence images and obtaining the degree of similarity between the two sections of source code based on the comparison results of all sequence images corresponding to the same code tree comprises: grouping the sequence images by the code tree to which each belongs to obtain one image set for each of the two code trees, each image set comprising a plurality of sequence images arranged chronologically in code compilation order; setting a processing layer for the two sequence images at the same position in the code compilation order, stacking the two sequence images in the processing layer, setting a minimum segmentation area, and processing the stacked sequence images into a plurality of sub-stacked objects; obtaining similar sub-objects and dissimilar sub-objects based on the sub-stacked objects and a preset first overlap threshold, obtaining a layer overlap rate of the processing layer based on the similar and dissimilar sub-objects, and obtaining similar layers and dissimilar layers based on the layer overlap rate and a preset second overlap threshold; and obtaining a tree similarity ratio between the two code trees based on the similar layers and dissimilar layers, taking the tree similarity ratio as the code similarity between the two corresponding target codes, and determining from the code similarity the degree of similarity of the source codes represented by the two target codes.
  9. A deep-learning-based source code similarity detection system for implementing the source code similarity detection method according to any one of claims 1 to 8, characterized in that the system comprises: a code tree module, used to acquire two sections of source code to be detected, construct a code tree corresponding to each section of source code, decompose the code trees into a plurality of sub-tree sequences based on statement identifiers, and construct a respective transmission path channel for each sub-tree sequence; a sequence image module, used to establish a data operation pool, connect the transmission path channel corresponding to each sub-tree sequence to the data operation pool through a port, and have the data operation pool process each sub-tree sequence into a corresponding sequence image; and a similarity comparison module, used to compare the different sequence images and obtain the degree of similarity between the two sections of source code based on the comparison results of all sequence images corresponding to the same code tree.
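
The layered comparison recited in claim 8 can be illustrated with a minimal sketch. Several details here are assumptions, since the claims do not fix them: sequence images are modeled as equally sized 2D lists of pixel values, the overlap of a sub-stacked object is taken as the fraction of equal pixels in a square tile, the layer overlap rate as the share of similar tiles, and the tree similarity ratio as the share of similar layers. All function names, threshold values, and sample images are hypothetical.

```python
def tile_overlap(img_a, img_b, r0, c0, size):
    """Fraction of pixels in one tile that are equal in both images."""
    rows = range(r0, min(r0 + size, len(img_a)))
    cols = range(c0, min(c0 + size, len(img_a[0])))
    same = sum(img_a[r][c] == img_b[r][c] for r in rows for c in cols)
    return same / (len(rows) * len(cols))


def layer_overlap_rate(img_a, img_b, size, first_threshold):
    """Share of tiles (sub-stacked objects) judged similar in one layer."""
    flags = [
        tile_overlap(img_a, img_b, r, c, size) >= first_threshold
        for r in range(0, len(img_a), size)
        for c in range(0, len(img_a[0]), size)
    ]
    return sum(flags) / len(flags)


def tree_similarity(images_a, images_b, size=2,
                    first_threshold=0.75, second_threshold=0.5):
    """Share of layers judged similar across paired sequence images."""
    layers = [
        layer_overlap_rate(a, b, size, first_threshold) >= second_threshold
        for a, b in zip(images_a, images_b)
    ]
    return sum(layers) / len(layers)


# Hypothetical 4x4 sequence images: the first pair differs in one tile only,
# the second pair differs everywhere.
a1 = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
b1 = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
a2 = [[1] * 4 for _ in range(4)]
b2 = [[0] * 4 for _ in range(4)]

rate = layer_overlap_rate(a1, b1, size=2, first_threshold=0.75)
similarity = tree_similarity([a1, a2], [b1, b2])
print(rate, similarity)
```

In this toy setting the first layer has three of four tiles matching and counts as similar, while the second layer has none, so one of the two layers is similar. The two thresholds separate a local decision (per tile) from a global one (per layer), which is the structure the claim describes.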

Description

Source code similarity detection method and system based on deep learning
Technical Field
The invention relates to the technical field of similarity detection, and in particular to a method and a system for detecting source code similarity based on deep learning.
Background
In the field of software engineering, code similarity detection is important for identifying code clones, detecting software plagiarism, assisting program understanding, and promoting code reuse. Conventional code similarity detection methods are usually based on text comparison or token sequence matching. They can reflect the similarity characteristics of code to a certain extent, but they often struggle to achieve efficiency and precision at the same time. In addition, existing methods mostly rely on directly matching code symbols and lack the ability to aggregate and abstract code logic at a high level, which makes the comparison process cumbersome and makes it difficult to present the overall similarity relationship between two sections of code intuitively.
Disclosure of Invention
The invention aims to provide a source code similarity detection method and system based on deep learning, so as to remedy the deficiencies described in the background above.
To achieve the above purpose, the invention provides a source code similarity detection method based on deep learning, which comprises the following steps: step S1, acquiring two sections of source code to be detected, constructing a code tree corresponding to each section of source code, decomposing the code trees into a plurality of sub-tree sequences based on statement identifiers, and constructing a respective transmission path channel for each sub-tree sequence; step S2, establishing a data operation pool, connecting the transmission path channel corresponding to each sub-tree sequence to the data operation pool through a port, and processing, by the data operation pool, each sub-tree sequence into a corresponding sequence image; and step S3, comparing the different sequence images with one another, and obtaining the degree of similarity between the two sections of source code based on the comparison results of all sequence images corresponding to the same code tree.
In a preferred embodiment, the process of acquiring two sections of source code to be detected and constructing a code tree corresponding to each section of source code comprises: selecting two sections of source code as the target codes to undergo similarity detection; setting a start reading point and a termination reading point for each target code, setting each code segment of the target code as a code item, setting code extraction characters, constructing a root node for each of the two target codes, mapping the code item at the start reading point to the root node, and traversing with the code extraction characters from the root node along the code segments of the target code until every code extraction character has been traversed, thereby obtaining a corresponding tree path; and splitting the tree path at each position corresponding to a code extraction character, marking the split position as a secondary root node, and processing the current tree path into tree paths under two branches that share the same root, until the code extraction characters have traversed to the termination reading point of the target code, thereby constructing a code tree consisting of one root node, a plurality of secondary root nodes, and a plurality of tree paths.
In a preferred embodiment, the process of decomposing the code trees into sub-tree sequences based on statement identifiers and constructing a respective transmission path channel for each sub-tree sequence comprises: using statement identifiers to represent statement fragments of different types, and processing the code tree of each target code with statement identifiers of a plurality of types, thereby obtaining, for each code tree, the code subtrees represented by each statement identifier; classifying the code subtrees that share the same type of statement identifier in the two target codes into one sub-tree sequence, obtaining a plurality of sub-tree sequences from the classification over the plurality of statement identifier types, and setting a corresponding number of transmission channels based on the sub-tree sequences; setting a plurality of routing points and analysis points on the transmission channel of each sub-tree sequence, capturing, at each routing point, behavior events on the transmission channel of the corresponding sub-tree sequence, analyzing each behavior event at the analysis point in the corresponding position, judging whether the behavior event is an abnormal event, and obtaining a safe path based on the judgment result; and, when each routing point and each transmission channel segment formed between routing points has been marked as a safe path, integrating all the safe paths correspo