CN-121996289-A - Automatic code merge conflict resolution method based on edit script identification and supervised fine-tuning of a pre-trained model

Abstract

The invention discloses a method for automatically resolving code merge conflicts based on edit script identification and supervised fine-tuning of a pre-trained model, addressing the problem of code merge conflicts in software development. First, all merge commits are extracted from the commit history of screened open-source Git repositories and replayed, and the content, context and other information of every conflict block is extracted to construct features. An algorithm then extracts the edit scripts and locates the developer's resolution. The edit scripts are fed into a pre-trained model, which outputs embeddings of the code changes; these embeddings are passed to a downstream recurrent neural network that predicts whether each script is accepted, while the pre-trained model is fine-tuned at the same time. Finally, the prediction results are processed to generate a recommended resolution. The method is highly interpretable, adapts non-invasively to production environments and is readily extensible; it resolves code conflicts automatically and improves developers' efficiency in resolving code merge conflicts.

Inventors

XU LEI
WANG CHANGXIN

Assignees

Nanjing University (南京大学)

Dates

Publication Date: 2026-05-08
Application Date: 2024-11-08

Claims (6)

1. A method for automatically resolving code merge conflicts based on edit script identification and supervised fine-tuning of a pre-trained model, characterized in that: the edit scripts within a code conflict block are identified by a code differencing technique, so that the specific change positions within the conflict block are determined at a fine granularity; the conflict resolution problem is formulated as an edit script acceptance prediction problem; a pre-trained model is used to embed the semantics of the changes and is fine-tuned under supervision together with a downstream model that predicts whether each edit script is accepted; and an automatic conflict resolution is then generated.
2. The method for automatically resolving code merge conflicts based on edit script identification and supervised fine-tuning of a pre-trained model according to claim 1, comprising the following steps: 1) in the screened open-source Git repositories, finding every merge commit with two parent nodes on each branch, using an automated script to check out the target branch into the working directory with Git and merge the source branch into it, and thereby obtaining merge conflict scenarios from real development environments by replaying the conflicts; 2) scanning the merged files obtained after replay and extracting the specific conflicting parts wrapped by conflict markers, each such part in a merged file being called a conflict block (conflict chunk), collecting the context code of the conflict blocks, identifying with a diff algorithm feature information such as the edit scripts and edit positions changed by the two sides, and then collecting, with an algorithm that combines the diff algorithm with heuristic rules, the resolution that the developer provided in the history of the real repository as the ground-truth result, so as to construct a data set; 3) taking the content of the code conflict block, the conflict position, the context code and other information as processed features and feeding them into a pre-trained model that undergoes supervised fine-tuning (Supervised Fine-Tuning), using the code change embeddings that it outputs, and classifying each edit script downstream with a bidirectional LSTM or another RNN model to predict whether the edit scripts of the two sides are accepted; 4) generating a conflict resolution according to the model's predictions and applying it to the code.
3. The method for automatically resolving code merge conflicts based on edit script identification and supervised fine-tuning of a pre-trained model according to claim 2, wherein in step 1), the GitHub repositories to be collected are screened by a plurality of rules to ensure a high-quality data set; scripts automatically clone the repositories, detect merged branches and replay the merge scenarios; the code files involved in conflicts are then scanned, and the range of each conflict block is identified by parsing the conflict markers; the data set ultimately used for supervised training is collected; and the whole data set is tokenized with BPE (Byte-Pair Encoding) to serve as the model training corpus.
4. The method for automatically resolving code merge conflicts based on edit script identification and supervised fine-tuning of a pre-trained model according to claim 2, wherein in step 2), after the conflict ranges (i.e., conflict blocks) in a conflicting file are determined, a custom diff algorithm is used to derive, for each conflict block, the specific edit scripts of each conflicting version relative to the common parent commit, so as to analyze the modifications made by the two sides; information such as the range, content, position and context code of the edit scripts is recorded as metadata of the conflict block; the most recent modification of the conflicting file after the conflicting commit is then looked up in the repository, and the diff algorithm is combined with heuristic rules to locate the code with which the developer finally resolved the conflict block, which is collected automatically as the expected result in the training data set.
5. The method for automatically resolving code merge conflicts based on edit script identification and supervised fine-tuning of a pre-trained model according to claim 2, wherein in step 3), the problem is modeled as an acceptance classification task for each edit script rather than as a merge decision over the whole code block; that is, the edit scripts in each conflict block are split into independent classification tasks, and the pre-trained model predicts whether each modification from the current branch or from the branch to be merged should be accepted; after receiving the features of each edit script, the model combines the context information and semantics of the conflict region and outputs a classification result for each edit script; and fine-tuning a pre-trained language model (such as CodeBERT) allows the context and code semantics of the conflict block to be better understood, improving the accuracy of edit script classification.
6. The method for automatically resolving code merge conflicts based on edit script identification and supervised fine-tuning of a pre-trained model according to claim 2, wherein in step 4), the edit scripts are applied selectively based on the predictions of the pre-trained model: every edit script predicted to be accepted is retained, and every edit script predicted to be rejected is skipped; the conflict resolution model can be invoked automatically through hooks (Hooks) after the conventional Git merge process to recommend the final merge result, so that the model integrates seamlessly into developers' daily workflow and resolves code conflicts automatically.
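
As a rough illustration of steps 2) and 4) of claim 2, the Python sketch below derives line-level edit scripts for each side relative to the common ancestor and then applies only the scripts marked as accepted. It is a minimal sketch under stated assumptions, not the claimed implementation: Python's standard difflib stands in for the custom diff algorithm, the names EditScript, edit_scripts and apply_accepted are hypothetical, the accepted flags are hard-coded stand-ins for the classifier's predictions, and accepted scripts are assumed not to overlap.

```python
import difflib
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class EditScript:
    """One contiguous change against the base lines (hypothetical structure)."""
    side: str                   # "ours" (current branch) or "theirs" (branch to merge)
    start: int                  # first affected base line, inclusive
    end: int                    # last affected base line, exclusive
    new_lines: Tuple[str, ...]  # lines that replace base[start:end]


def edit_scripts(base: Sequence[str], changed: Sequence[str], side: str) -> List[EditScript]:
    """Turn every non-equal difflib opcode into an edit script against base."""
    # difflib is only a stand-in here; the source describes a custom diff algorithm.
    matcher = difflib.SequenceMatcher(a=list(base), b=list(changed), autojunk=False)
    return [
        EditScript(side, i1, i2, tuple(changed[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def apply_accepted(base: Sequence[str], scripts: Sequence[EditScript],
                   accepted: Sequence[bool]) -> List[str]:
    """Apply the accepted scripts to base and skip the rejected ones."""
    chosen = sorted(
        (s for s, keep in zip(scripts, accepted) if keep),
        key=lambda s: (s.start, s.end),
    )
    # Assumption of this sketch: accepted scripts touch disjoint base ranges.
    for prev, nxt in zip(chosen, chosen[1:]):
        if nxt.start < prev.end:
            raise ValueError("accepted edit scripts overlap")
    result = list(base)
    # Apply from the bottom up so earlier line indices stay valid.
    for s in reversed(chosen):
        result[s.start:s.end] = s.new_lines
    return result


base = ["a = 1", "b = 2", "c = 3"]
ours = ["a = 10", "b = 2", "c = 3"]
theirs = ["a = 1", "b = 2", "c = 30", "d = 4"]

scripts = edit_scripts(base, ours, "ours") + edit_scripts(base, theirs, "theirs")
accepted = [True, True]  # stand-in for the model's per-script predictions
merged = apply_accepted(base, scripts, accepted)
print(merged)
```

With both scripts accepted, the result keeps the current branch's change to the first line and the other branch's change at the end of the file; rejecting a script leaves the corresponding base lines untouched. Framing the output as per-script decisions is what lets a resolution mix changes from both sides instead of picking one side wholesale.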

Description

Automatic code merge conflict resolution method based on edit script identification and supervised fine-tuning of a pre-trained model

Technical Field

The invention belongs to the field of computers, in particular to the technical field of software. It provides a method for automatically resolving code merge conflicts based on edit script identification and supervised fine-tuning of a pre-trained model. The method mines the edit scripts of the code changes involved in a merge conflict, recommends a resolution using semantic embedding techniques, and improves developers' efficiency in resolving code merge conflicts.

Background

As modern software development grows more complex, distributed software development has become a popular approach; it promotes collaboration among developers and greatly improves development efficiency. In distributed workflows, version control systems (VCS, such as Git and SVN) have evolved to make collaboration between developers simpler and more efficient. However, code on different branches may conflict during merge or rebase operations, mainly because developers have modified the same or adjacent locations of the same file on different branches. When these modifications cannot be merged automatically, the version control system prompts the developer to resolve the conflict manually. Manual resolution is often cumbersome and time consuming, especially in large projects: as the complexity of the code and the frequency of code changes increase, so do the number of conflicts and the difficulty of resolving them.

In the code merging flow, Git first determines the most recent common ancestor commit of the current branch and the branch to be merged, which is the starting point of the code changes.
Git then compares the code changes using a merge algorithm: it applies a diff algorithm to the code of the current branch and of the branch to be merged, each relative to the common ancestor, and determines whether the changes can be merged automatically. When the two branches are found to modify the same or adjacent locations of the same file differently, Git reports a merge conflict for the part that cannot be merged automatically and inserts conflict markers that wrap the conflicting content. Mainstream version control tools (e.g., Git, Mercurial) typically use a line-based textual three-way merge algorithm (Three-way Merge) for code merging. A three-way merge conflict block looks as follows:

    <<<<<<< HEAD
    Code part A (modification on the current branch)
    ||||||| base
    Code part B (code of the common ancestor)
    =======
    Code part C (modification on the branch to be merged)
    >>>>>>> branch

Symbols such as <<<<<<<, |||||||, ======= and >>>>>>> tag the different versions of the content in the conflict region and are commonly called conflict markers.

Existing merge tools rely on text-level differencing and cannot understand the semantics or functional logic of the code. This limitation is particularly evident for complex code block modifications, where text differences may not accurately express the actual semantic changes. In such cases, the developer typically has to spend considerable effort analyzing each conflict point manually to judge whether the modifications are reasonable and to merge the different versions correctly. To address this problem, recent research has been devoted to developing automated tools that assist developers in handling code conflicts.
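
The conflict-marker layout described above can be read back mechanically. The Python sketch below is a minimal, hedged illustration that splits a merged file into conflict chunks. The names ConflictChunk and parse_conflicts and the sample input are hypothetical, and the sketch assumes that markers begin at the start of a line. The common-ancestor section is treated as optional, since Git only writes the ||||||| marker when merge.conflictStyle is set to diff3.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ConflictChunk:
    """Lines of one conflict block, grouped by side (hypothetical structure)."""
    ours: List[str] = field(default_factory=list)    # current-branch version
    base: List[str] = field(default_factory=list)    # common-ancestor version
    theirs: List[str] = field(default_factory=list)  # branch-to-merge version


def parse_conflicts(text: str) -> List[ConflictChunk]:
    """Collect every conflict chunk wrapped by conflict markers in a merged file."""
    chunks: List[ConflictChunk] = []
    current: Optional[ConflictChunk] = None  # chunk being filled, None outside a conflict
    section = ""                             # side that the next content line belongs to
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            current, section = ConflictChunk(), "ours"
        elif current is not None and line.startswith("|||||||"):
            section = "base"
        elif current is not None and line.startswith("======="):
            section = "theirs"
        elif current is not None and line.startswith(">>>>>>>"):
            chunks.append(current)
            current, section = None, ""
        elif current is not None:
            getattr(current, section).append(line)
        # Lines outside any conflict are context code and are ignored here.
    return chunks


sample = "\n".join([
    "def area(r):",
    "<<<<<<< HEAD",
    "    return 3.14159 * r * r",
    "||||||| base",
    "    return 3.14 * r * r",
    "=======",
    "    return math.pi * r ** 2",
    ">>>>>>> feature",
    "",
])
chunks = parse_conflicts(sample)
print(chunks[0].ours, chunks[0].base, chunks[0].theirs)
```

Keeping the three versions separate is what makes a later per-side comparison against the common ancestor possible; a real collector would also record the surrounding context lines and the chunk's position in the file.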
One line of work is based on the abstract syntax tree (AST): the code is parsed into its syntax tree and merged at the tree level. AST-based merging captures the syntactic and structural information of the code rather than relying on text differences alone. However, AST-based methods still face limitations such as complex tree matching algorithms, difficulty in handling semantic conflicts, and limited performance on complex structures. Other methods recast conflict resolution as a multi-class classification problem, treating resolution as the selection of a strategy. The strategies include keeping the current branch's changes, keeping the changes of the branch to be merged, merging code from both sides (requiring manual intervention), concatenating the two branches' changes, or writing new code (requiring manual intervention). Although this approach has been applied in practice, it essentially predicts from the outcomes of past resolutions; it generalizes over results rather than starting from the root cause of the conflict. In addition, the method has limitations such as coarse classification categories, limited effect on different repository dis