CN-122018958-A - Code difference dividing technology based on data flow analysis

CN122018958ACN 122018958 ACN122018958 ACN 122018958ACN-122018958-A

Abstract

The invention relates to a code difference dividing technology based on data flow analysis, which is characterized in that a static analysis framework with abstract syntax tree as granularity is designed for Java language, data dependency among syntax tree nodes is extracted by utilizing a data flow analysis algorithm, meanwhile, code difference blocks in Git are selected as basic units, the relation among the syntax tree nodes is improved to be among code fragments, and finally, the code difference fragments are divided by a graph algorithm to form different code change clusters. The invention can assist the developer to understand the code difference, and reduce the time cost consumed in the code examination process.

Inventors

XU LEI
ZHU YIHANG

Assignees

南京大学

Dates

Publication Date: 20260512
Application Date: 20241108

Claims (5)

1. A code difference dividing technique based on data flow analysis is characterized in that a static analysis framework with abstract syntax tree as granularity is designed for Java language, data dependency among syntax tree nodes is extracted by utilizing a data flow analysis algorithm, code difference blocks in Git are selected as basic units, the relation among the syntax trees is improved to be among code fragments, and finally code difference sets are classified based on a graph dividing algorithm, so that a developer is helped to understand code differences.
2. A code differential partitioning technique based on data stream analysis as described in claim 1, wherein the method comprises the steps of: 1) Analyzing the code warehouse to obtain code metadata information before and after Git submission; 2) Analyzing text content into an abstract syntax tree, analyzing syntax tree nodes intersected with the difference fragments, designing corresponding hierarchical analysis and data flow analysis algorithms from the structure and the data dependency relationship, and defining the dependency relationship among the syntax tree nodes; 3) And (3) selecting the difference fragments in Git as a dividing basic unit, converting the dependency relationship in the step (2) into a relationship graph among the difference fragments, and processing the difference fragments with strong relevance by a design graph dividing algorithm to obtain a final dividing result set.
3. The code differential partitioning technique based on data stream analysis as set forth in claim 2, wherein in step 1), the Git repository is parsed to extract metadata information before and after code submission: Since the subsequent analysis step depends on the existence of corresponding physical files locally, the information of the file tree submitted by the new and old needs to be cloned to the local; the analysis process involves the conversion of binary data in the Git warehouse and the calculation of the mapping relation of the file, determines the new addition, deletion, modification and renaming change of the file, generates metadata of code difference fragments after the analysis is finished, is used for recording text differences between two codes and comprises line position information of file paths and fragments, and all the data obtained after the analysis needs to be globally cached and managed to avoid repeated calculation.
4. The code difference dividing technique based on data stream analysis according to claim 2, wherein in step 2), text content is parsed into abstract syntax tree, position information is filtered, corresponding hierarchical parsing and data stream analysis algorithm is designed from the structure and data dependency relationship, and dependency relationship among syntax tree nodes is defined: accessing a grammar tree in an order traversal mode, judging whether the source code position corresponding to the current access node has an intersection with a segment with code difference or not, and if the intersection does not exist, eliminating a complete subtree corresponding to the node; The structural relationship comprises Import sentences and a reference relationship between introduced information, an inheritance relationship between child and parent classes, an implementation relationship between interfaces and classes and a rewriting relationship of child and parent class methods; the control flow graph adopts a reverse construction mode based on sentences and nodes in the tree, and designs 7 edge types to represent paths of program execution, wherein the paths are respectively sequential execution, whether If conditions are met, whether Switch conditions are met or not, and capturing and throwing out of abnormality; The data flow analysis focuses mainly on all DefUse relationships and reference types that expressions may produce, and iterative-based forward analysis is used to propagate and update along the control flow graph until the algorithm converges.
5. The code difference dividing technique based on data flow analysis as claimed in claim 2, wherein in step 3), get the Git difference segment as the dividing basic unit, transform the dependency between the grammar tree nodes into the relation graph between the difference segments, and design the relevance in the graph dividing algorithm processing the directed graph, generate the final grouping result: Before the partitioning algorithm starts, the original dependency graph needs to be expanded, the reachability and the distance between the nodes are recalculated, reconstruction, formatting and annotation detection are introduced, and the original graph information is converted into code difference fragments; The gravity center of the attention of the dividing algorithm is the connectivity between points, namely strong connectivity and weak connectivity, the algorithm preferentially utilizes Targan algorithm to find out all strong connectivity components, which respectively correspond to a group of more closely related difference fragments, then the strong connectivity components are used and searched, a plurality of groups of weak connectivity components are formed by combining the residual directed edges, and finally all the strong connectivity components jointly form a code difference dividing result.

Description

Code difference dividing technology based on data flow analysis Technical Field The invention belongs to the category of computer field, and mainly aims at the technical field of software engineering. The invention provides a code difference dividing technology based on data flow analysis, which converts the relevance between nodes into code difference fragments by applying the data flow analysis on an abstract syntax tree, groups the different types of difference fragments and improves the efficiency of a developer in understanding the code difference. Background With the continuous expansion of the software scale and the increase of the complexity, the version control system Git has become one of the indispensable tools in the software development process. The Git not only can maintain the code submitting history, but also can support the functions of parallel development, branch management, code merging and the like, thereby greatly improving the efficiency of collaborative development. Code examination is an important link in the parallel development process and is a main means for improving the quality and stability of codes. By analyzing the submitted code changes, potential problems can be found in time, and the error codes are prevented from continuously polluting downstream. However, most of the existing code examination methods are manually performed by a developer, and the difference fragments before and after code submission are compared by using Git to analyze the rationality of code modification and determine whether the modification meets the expectations. With the gradual expansion of code library size, one commit often involves multiple modules and file changes, and the cost of the audit will rise rapidly. Thus, how to help developers quickly understand code differences is becoming a research and practice hotspot in the soft engineering field. Git diff is a command for calculating text difference in Git, and by using a Histogram difference algorithm to construct a frequency Histogram for text lines, lines with lower occurrence frequency are preferentially matched, so that more accurate code difference presentation can be realized for files containing a large number of repeated lines, code block movement or code rearrangement. In order to solve the problem of coarser granularity of text lines, the existing code difference analysis is mostly developed based on an abstract syntax tree. Abstract syntax tree AST (Abstract Syntax Tree) is an intermediate tree structure of compilers and interpreters in analyzing and processing code, representing the syntax structure of source code, and organizing the various parts of code (e.g., classes, methods, expression statements, etc.) into a tree form according to their syntax relationships. After converting the source codes before and after the Git submission into abstract syntax trees, the change information of the syntax trees can be obtained by calculating the matching relation among the tree nodes. A matching algorithm in a general sense is to solve for the largest common sub-structure between two data structures, such as the well-known Longest Common Subsequence (LCS), which can represent the degree of matching between two text strings. However, solving the largest common sub-structure in a tree or a large part of graph theory data structures is very complex, for example, the largest common embedded sub-tree and the largest common sub-tree are both NP-Hard problems, and the approximate solution can be solved within polynomial time complexity with the help of some constraints and heuristic rules. Currently, many code difference analysis algorithms (e.g., gumTree) use a top-down and bottom-up combination of matching to find the approximate solution of the largest common substructures. The method makes some assumptions according to the grammar rules of the programming language, and can quickly obtain the maximum common substructure of two codes in the codes represented by the graph. Briefly, top-down matching calculates the overall framework of the syntax tree structure, and since two codes to be matched generally have a certain similarity, the similarity is reflected in possibly having the same class name and function name, so that the elements are matched first. Bottom-up matching is to match the remaining program elements as much as possible in the case of most classes and functions and matches. The combination of the two matching strategies, plus some heuristic rules, allows existing algorithms to solve for the maximum match at the complexity of O (n 2). TSANTALIS et al, on the basis of the tree difference algorithm of GumTree, custom define the mode of modification of code reconstruction at the node level of the syntax tree, and realize a more accurate code reconstruction detection framework RMiner. When faced with large-scale code differences, git or grammar tree-based difference algorithms can only show more fragmented difference information, whil