Welcome to the repository for "Code Comment Classification with Data Augmentation and Transformer-Based Models", submitted as a solution for the NLBSE'25 Code Comment Classification Tool Competition. This repository hosts all relevant scripts, datasets, and model configurations used in the project.
The full paper is available here: Code Comment Classification with Data Augmentation and Transformer-Based Models.
Code comment classification is vital for software comprehension and maintenance. This repository demonstrates a multi-step solution that achieves a 6.7% accuracy improvement over baseline models by combining synthetic dataset generation and fine-tuned transformer-based models.
Key Points:
- Translation-retranslation (back-translation) for linguistic diversity in data augmentation (see the sketch below).
- Transformer architectures (BERT, RoBERTa, CodeBERT, DistilBERT) fine-tuned for multi-label classification (a configuration sketch follows the directory overview).
- Tailored frameworks for the Java, Python, and Pharo datasets.
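The augmentation scripts themselves live in `Dataset Generation/`; purely as an illustration of the translation-retranslation idea, the sketch below back-translates a comment through French with MarianMT checkpoints. The pivot language, the model names, and the `back_translate` helper are assumptions for this example, not the repository's actual pipeline.

```python
from transformers import pipeline

# Illustrative back-translation: English -> French -> English.
# Pivot language and checkpoints are assumptions; the repository's pipeline may differ.
to_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

def back_translate(comment: str) -> str:
    """Return a paraphrased code comment via translation-retranslation."""
    pivot = to_fr(comment, max_length=128)[0]["translation_text"]
    return to_en(pivot, max_length=128)[0]["translation_text"]

print(back_translate("Returns the index of the first matching element, or -1 if none is found."))
```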
```
.
├── Dataset Generation/                        # Scripts for data augmentation (translation-retranslation pipelines).
├── Datasets/                                  # Original, augmented, and filtered datasets.
├── HyperParameter tuning Python/              # Optuna-based hyperparameter optimization scripts, run on the Python dataset to select the best model.
├── Model-Saving/                              # Fine-tuned transformer models saved to Hugging Face.
├── roBERTa-large-hyperparameter-java-pharo/   # Scripts for RoBERTa tuning on the Java and Pharo datasets.
└── prediction.ipynb                           # Results and evaluation metrics.
```
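For orientation, the sketch below shows how a Hugging Face Transformers checkpoint can be configured for multi-label comment classification of the kind used here. The `roberta-large` checkpoint, the Java-style label names, and the 0.5 decision threshold are assumptions for illustration; the actual configurations are in the notebooks under `Model-Saving/` and the hyperparameter-tuning folders.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed label set and checkpoint, for illustration only.
CATEGORIES = ["summary", "Ownership", "Expand", "usage", "Pointer", "deprecation", "rational"]

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-large",
    num_labels=len(CATEGORIES),
    problem_type="multi_label_classification",  # BCE loss, one sigmoid per label
)

comment = "Returns the number of elements in this list."
inputs = tokenizer(comment, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    probs = torch.sigmoid(model(**inputs).logits)[0]

# Each category is decided independently; 0.5 is an assumed threshold.
print([c for c, p in zip(CATEGORIES, probs) if p > 0.5])
```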
All notebooks and scripts in this repository were executed in the Kaggle environment using two T4 GPUs. Kaggle provides an ideal environment for running the notebooks, with pre-installed dependencies and powerful GPUs, so minimal setup is required. `prediction.ipynb` was run in Google Colab, and all the training and fine-tuning notebooks can also be run in Google Colab.
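To confirm that both T4 GPUs are visible to a notebook before training (PyTorch is pre-installed on Kaggle), a quick check looks like this:

```python
import torch

# A Kaggle "GPU T4 x2" runtime should report two devices.
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}:", torch.cuda.get_device_name(i))
```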
- Download the repository:
  - `git clone https://github.com/Mushfiqur6087/NLBSE-25.git`
  - `cd NLBSE-25`
- Go to Kaggle:
  - Log in to your Kaggle account and navigate to the Code section.
- Create a New Notebook:
  - Click the New Notebook button in the top-right corner.
- Upload Your Notebook:
  - In the new notebook interface, click the File dropdown menu in the top-left corner.
  - Select Upload Notebook from the dropdown.
  - Choose the `.ipynb` file from your local computer and upload it.
- In the `Datasets` folder, there are three datasets. Go to Kaggle and use the Create Dataset option to upload them as Kaggle datasets.
- Once uploaded, open any of the notebooks you want to run in Kaggle (e.g., `final-score.ipynb`).
- In the Kaggle environment:
  - Navigate to the Add Dataset option on the right panel of the notebook interface.
  - Search for and add the dataset you uploaded earlier.
- Ensure that the notebook is configured to use a Kaggle runtime with two T4 GPUs:
  - Go to Settings in the notebook interface.
  - Enable the Accelerator and select GPU T4 ×2.
- Running the Notebook:
  - Execute the notebook cells sequentially to run the experiments.
- Setting Up Secrets for W&B and Hugging Face (see the sketch after this list):
  - Generate API keys:
    - Weights & Biases (W&B): Go to your W&B account settings > API Keys.
    - Hugging Face: Go to your Hugging Face account settings > Access Tokens.
  - In your Kaggle notebook:
    - Navigate to the Add-ons menu > Secrets.
    - Add the secrets for W&B and Hugging Face using the names specified in the notebook (e.g., `WANDB_API_KEY` and `HUGGINGFACE_API_KEY`).
- Adjust File Paths:
  - Update the dataset file paths in the notebook. For example, in `pd.read_csv('/kaggle/input/your-dataset-name/filename.csv')`, replace `/kaggle/input/your-dataset-name/filename.csv` with the actual path of your uploaded dataset in Kaggle.
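For reference, reading those secrets inside a Kaggle notebook typically looks like the sketch below. The secret names match the examples above, but they must agree with whatever the notebook you run actually expects.

```python
# Minimal sketch: pull API keys from Kaggle Secrets and log in.
from kaggle_secrets import UserSecretsClient
import wandb
from huggingface_hub import login

secrets = UserSecretsClient()
wandb.login(key=secrets.get_secret("WANDB_API_KEY"))
login(token=secrets.get_secret("HUGGINGFACE_API_KEY"))
```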
Note: The Kaggle environment comes pre-installed with most dependencies. However, if you need additional packages, install them using the `!pip install` command in a notebook cell.
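For example, a cell like the following installs extra packages (the package names here are only illustrative):

```python
# Run in a notebook cell; -q keeps the log short.
!pip install -q optuna sentencepiece
```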
- Download the repository:
  - `git clone https://github.com/Mushfiqur6087/NLBSE-25.git`
  - `cd NLBSE-25`
- Install dependencies (if running locally):
  - `pip install -r requirements.txt`
- Adjust Input Paths:
  - Update the file paths in the notebooks or scripts to match your local directory structure. For example, replace `pd.read_csv('/kaggle/input/your-dataset-name/filename.csv')` with `pd.read_csv('path-to-your-local-dataset/filename.csv')`.
- Remove the wandb login (an alternative is sketched below):
  - Comment out the wandb calls, i.e. `# import wandb` and `# wandb.init()`.
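Alternatively, instead of editing the code, W&B can be disabled for the whole run through its environment variable; a minimal sketch (assuming the notebook only uses the standard `wandb` entry points):

```python
import os

# Turn off all Weights & Biases logging for this process.
os.environ["WANDB_MODE"] = "disabled"
```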