MergeGen is a generation-based approach to merge conflict resolution. It first builds a structural representation of each merge conflict and then uses an encoder-decoder generative model to produce the resolution. This repository contains our replication package.
We provide a file named packages.txt that lists all the packages required for replication. To create a conda environment and install the necessary packages, execute the following commands:
$ conda create -n ase-merge python=3.8
$ conda activate ase-merge
$ python -m pip install -r packages.txt
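After installation, it can be useful to confirm that the environment matches packages.txt. The following is a minimal sketch, not part of the replication package: the helper names parse_requirements and find_mismatches are hypothetical, and the parser only handles simple pip requirement lines such as name==version (comments, blank lines, and option lines are skipped).

```python
import re
from importlib import metadata
from typing import Dict, List, Optional


def parse_requirements(text: str) -> Dict[str, Optional[str]]:
    """Map each package name to its '==' pinned version (None if not pinned)."""
    reqs: Dict[str, Optional[str]] = {}
    for raw in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        # Skip blank lines and pip option lines such as "-r other.txt".
        if not line or line.startswith("-"):
            continue
        match = re.match(r"^([A-Za-z0-9][A-Za-z0-9._-]*)\s*(?:==\s*([^\s;]+))?", line)
        if match:
            reqs[match.group(1)] = match.group(2)
    return reqs


def find_mismatches(reqs: Dict[str, Optional[str]]) -> List[str]:
    """Return one message per package that is missing or has a different version."""
    problems: List[str] = []
    for name, wanted in reqs.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed")
            continue
        if wanted is not None and installed != wanted:
            problems.append(f"{name}: installed {installed}, expected {wanted}")
    return problems
```

Calling find_mismatches(parse_requirements(open("packages.txt").read())) inside the activated ase-merge environment returns an empty list when every listed package is installed at its pinned version.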
Before running the model, the dataset must be pre-processed. Since pre-processing is time-consuming, we use Python's subprocess module to prepare the dataset in parallel. In our experiment, at most 100 processes run simultaneously, and each subprocess handles 1,000 samples of the dataset. The raw data comes from previous work [1] and is available in the Zenodo repository provided by its authors. Download the raw data and put it in the RAW_DATA folder. Then execute the following command to obtain the processed dataset:
$ python runtotal_dataset.py dataset_parallel.py
The file runtotal_dataset.py starts the parent process, which assigns the sub-tasks to different subprocesses so that the dataset is prepared in parallel. The file dataset_parallel.py is the program executed in each subprocess to prepare its part of the dataset. The processed split dataset will be stored in the PROCESSED folder.
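The parent/worker scheme described above can be sketched as follows. This is a minimal illustration and not the code in runtotal_dataset.py: the helper names chunk_ranges, build_commands, and run_in_parallel are hypothetical, and passing each chunk to the worker script as start and end indices on the command line is an assumption made only for this example.

```python
import subprocess
import sys
from typing import List, Tuple


def chunk_ranges(total: int, chunk_size: int = 1000) -> List[Tuple[int, int]]:
    """Split [0, total) into half-open (start, end) ranges of at most chunk_size items."""
    return [(start, min(start + chunk_size, total))
            for start in range(0, total, chunk_size)]


def build_commands(worker_script: str, total: int,
                   chunk_size: int = 1000) -> List[List[str]]:
    """Build one worker command per chunk (the argument layout is an assumption)."""
    return [[sys.executable, worker_script, str(start), str(end)]
            for start, end in chunk_ranges(total, chunk_size)]


def run_in_parallel(commands: List[List[str]], max_procs: int = 100) -> List[int]:
    """Run commands as subprocesses, at most max_procs at a time.

    Returns the exit codes in the same order as the commands.
    """
    exit_codes = [0] * len(commands)
    running = []  # (command index, Popen handle) pairs, oldest first
    next_idx = 0
    while next_idx < len(commands) or running:
        # Start new workers until the concurrency limit is reached.
        while next_idx < len(commands) and len(running) < max_procs:
            running.append((next_idx, subprocess.Popen(commands[next_idx])))
            next_idx += 1
        # Wait for the oldest running worker, which frees one slot.
        idx, proc = running.pop(0)
        exit_codes[idx] = proc.wait()
    return exit_codes
```

Under these assumptions, run_in_parallel(build_commands("dataset_parallel.py", total)) would launch one worker per 1,000 samples while never exceeding 100 concurrent processes. Waiting on the oldest worker keeps the loop simple at the cost of occasionally leaving a slot idle.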
Our model is built upon CodeT5 and is defined in the run_mergegen.py file. To fine-tune the model, execute the following command:
$ python run_mergegen.py train
The resulting model will be saved as best_model.pt, and a record of the training process will be stored in the OUTPUT folder.
To test the model on the testing set, use the following command:
$ python run_mergegen.py test
The generated resolutions and a record of the testing process will be stored in the OUTPUT folder.
When you train or test the model for the first time, the dataset.py file will be executed, and the split dataset will be combined to obtain the final prepared dataset.
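The combining step can be pictured with the sketch below. It is not the code in dataset.py: the on-disk format of the split dataset is not documented here, so the example assumes, purely for illustration, that each shard is a JSON file containing a list of samples and that the file name carries a numeric shard index. The names shard_index and combine_shards are hypothetical.

```python
import json
import re
from pathlib import Path
from typing import Any, List


def shard_index(path: Path) -> int:
    """Extract the first number in the file name so shards sort numerically."""
    match = re.search(r"(\d+)", path.stem)
    return int(match.group(1)) if match else -1


def combine_shards(folder: str, pattern: str = "*.json") -> List[Any]:
    """Concatenate the sample lists of all shards in folder, in shard order."""
    combined: List[Any] = []
    for shard in sorted(Path(folder).glob(pattern), key=shard_index):
        with shard.open(encoding="utf-8") as handle:
            combined.extend(json.load(handle))
    return combined
```

Sorting by the numeric index rather than by file name keeps a shard such as part_10 after part_2, so the combined dataset preserves the original sample order.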
The OUTPUT directory holds the conflict resolutions produced by MergeGen.
[1] Svyatkovskiy A, Fakhoury S, Ghorbani N, et al. Program merge conflict resolution via neural transformers. In: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 2022: 822-833.