A Comparative Study of Fine-Tuning Scenarios for Transformer Models in Indonesian Sentiment Analysis
This repository contains the code and notebooks for a research project aimed at evaluating various fine-tuning strategies on transformer models for the task of Indonesian-language sentiment analysis.
This research conducts a comparative study to evaluate three fine-tuning scenarios—Standard Fine-Tuning, Gradual Unfreezing, and Differential Learning Rates—across three model architectures: IndoBERT-base, IndoBERTweet, and RoBERTa. The experiments were performed on two datasets from different domains: app reviews (BBM Dataset) and political comments (Pemilu Dataset), using F1-Score as the primary evaluation metric. The results show that IndoBERTweet consistently emerged as the top-performing model, while the Standard Fine-Tuning strategy with an optimized learning rate proved to be superior to the other two, more complex techniques.
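For context, the F1-Score is the harmonic mean of precision and recall. A minimal computation with scikit-learn, assuming a three-class label scheme and macro averaging (neither detail is confirmed by the report):

```python
from sklearn.metrics import f1_score

# Toy labels: 0 = negative, 1 = neutral, 2 = positive (class scheme assumed).
y_true = [0, 1, 2, 1, 0, 2]
y_pred = [0, 1, 1, 1, 0, 2]

# average="macro" is an assumption; the report may use weighted or binary F1.
print(f1_score(y_true, y_pred, average="macro"))
```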
- IndoBERTweet is the Best Model: This model consistently outperformed other architectures. In its best-case scenario (Standard Fine-Tuning), IndoBERTweet achieved an F1-Score of 0.9218 on the BBM Dataset and 0.7431 on the Pemilu Dataset.
- Hyperparameter Tuning is the Best Strategy: Scenario 1, with an optimized learning rate, yielded peak performance. For instance, IndoBERTweet on the BBM Dataset reached an F1-Score of 0.9218 with this method, surpassing the results from Scenario 3 (Differential LR, 0.9159) and Scenario 2 (Gradual Unfreezing, 0.9042).
- Differential LR is Superior to Gradual Unfreezing: Among the two advanced techniques, Differential Learning Rates (S3) proved far more robust. On the RoBERTa model with the BBM Dataset, Scenario 3 achieved an F1-Score of 0.8530, whereas Scenario 2 collapsed to only 0.4890, a dramatic gap in effectiveness.
- Data Quality Has a Major Impact: Model performance varied drastically between datasets. The highest F1-Score on the BBM Dataset (0.9218) was nearly 25% higher than the highest score on the Pemilu Dataset (0.7431). This disparity was also reflected in the validation loss, where the lowest loss on the Pemilu Dataset (~0.64) was significantly higher than on the BBM Dataset (~0.27), suggesting that domain complexity, noise, and data ambiguity are major challenges in political sentiment analysis.
```
kp-penelitian/
│
├── data/
│   ├── dataset_bbm.csv        # BBM app review dataset
│   └── dataset_pemilu.csv     # General Election comments dataset
│
├── notebooks/
│   ├── 1_Skenario_Fine_Tuning.ipynb
│   ├── 2_Skenario_Gradual_Unfreezing.ipynb
│   └── 3_Skenario_Differential_LR.ipynb
│
├── .gitignore
└── README.md
```
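A quick way to inspect the two datasets after cloning; the CSV column names are not documented here, so `head()` is used to discover them rather than assuming them:

```python
import pandas as pd

# Load both datasets; column names are intentionally not assumed here.
bbm = pd.read_csv("data/dataset_bbm.csv")
pemilu = pd.read_csv("data/dataset_pemilu.csv")

print(bbm.shape, pemilu.shape)  # row/column counts
print(bbm.head())               # inspect the actual column names
```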
The experiment was conducted by comparing three models across three training scenarios:
- IndoBERT-base (`indobenchmark/indobert-base-p1`)
- IndoBERTweet (`indobenchmark/indobertweet-base-p1`)
- RoBERTa (`roberta-base` or other Indonesian variants); all three load through the same Hugging Face interface, as sketched below
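A minimal loading sketch with the `transformers` library; the `num_labels=3` class count is an assumption, not confirmed by the report:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_IDS = [
    "indobenchmark/indobert-base-p1",
    "indobenchmark/indobertweet-base-p1",
    "roberta-base",  # or an Indonesian RoBERTa variant
]

# num_labels=3 assumes negative/neutral/positive sentiment classes.
tokenizer = AutoTokenizer.from_pretrained(MODEL_IDS[1])
model = AutoModelForSequenceClassification.from_pretrained(MODEL_IDS[1], num_labels=3)
```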
- Scenario 1: Standard Fine-Tuning & LR Optimization: Training the entire model simultaneously while searching for the best learning rate via cross-validation.
- Scenario 2: Gradual Unfreezing: Training the model layer by layer, starting with the classifier head and progressively unfreezing the layers beneath it.
- Scenario 3: Differential Learning Rates: Training the entire model simultaneously but applying different learning rates to distinct layer groups. (Scenarios 2 and 3 are sketched in code after this list.)
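A minimal sketch of how Scenarios 2 and 3 differ in code, using PyTorch and `transformers`; the layer-group boundaries and learning rates below are illustrative assumptions, not the notebooks' exact settings:

```python
import torch
from transformers import AutoModelForSequenceClassification

# num_labels=3 is an assumed class count, not confirmed by the report.
model = AutoModelForSequenceClassification.from_pretrained(
    "indobenchmark/indobertweet-base-p1", num_labels=3
)
encoder_layers = model.base_model.encoder.layer  # 12 layers in a BERT-base model

# Scenario 2 (Gradual Unfreezing): freeze the encoder so only the classifier
# head trains at first, then unfreeze encoder layers top-down between stages.
for param in model.base_model.parameters():
    param.requires_grad = False

def unfreeze_top(n_layers):
    """Unfreeze the top n encoder layers; call once per training stage."""
    for layer in encoder_layers[-n_layers:]:
        for param in layer.parameters():
            param.requires_grad = True

# Scenario 3 (Differential Learning Rates): the scenarios are alternatives,
# so here every parameter stays trainable, but lower (more general) layers
# get a smaller learning rate than upper layers and the classifier head.
for param in model.parameters():
    param.requires_grad = True
optimizer = torch.optim.AdamW([
    {"params": encoder_layers[:6].parameters(), "lr": 1e-5},
    {"params": encoder_layers[6:].parameters(), "lr": 2e-5},
    {"params": model.classifier.parameters(),   "lr": 3e-5},
])
```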
To reproduce the results of this research, follow these steps:
- Python 3.8+
- `pip` and `venv` (recommended)
- A GPU with sufficient VRAM (at least 8 GB recommended) for training the models
```bash
# Clone the repository
git clone https://github.com/rifqimaruf/kp-penelitian.git
cd kp-penelitian

# Create a virtual environment
python -m venv venv

# Activate the environment (Windows)
.\venv\Scripts\activate

# Activate the environment (macOS/Linux)
source venv/bin/activate
```
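The repository tree above does not show a pinned dependency file, so the exact package list is an assumption; the notebooks presumably rely on the standard Hugging Face stack:

```bash
# Assumed dependencies: check the notebook imports for the authoritative list
pip install torch transformers scikit-learn pandas jupyter
```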
Detailed results from each scenario, including comparison tables, F1-Scores, and loss curves for each model and dataset, can be found in the final research report. In summary, the best combination found was (see the sketch below):
- Model: `IndoBERTweet`
- Strategy: Standard Fine-Tuning
- Optimal Learning Rate: `3e-05` (for both datasets)
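As a sketch, the winning configuration maps onto `transformers.TrainingArguments` like this; only the learning rate comes from the report, while the epoch count and batch size are placeholder assumptions:

```python
from transformers import TrainingArguments

# learning_rate matches the reported optimum; the other values are assumptions.
args = TrainingArguments(
    output_dir="outputs/indobertweet-standard-ft",
    learning_rate=3e-5,
    num_train_epochs=3,
    per_device_train_batch_size=16,
)
```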
According to the researcher, there are three reasons why Scenarios 2 and 3 failed to improve on the baseline:
- The baseline had already been tuned over several learning rates and validated with 3-Fold Cross-Validation (sketched below), setting a very high benchmark.
- The freezing schedule in the Gradual Unfreezing scenario was too rigid and should have been adapted to each model's specific architecture.
- The learning-rate combination in the Differential Learning Rates scenario was not yet optimized.
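For reference, the baseline's learning-rate search (first point above) can be sketched as a grid over 3-fold cross-validation. `train_and_eval_f1` is a hypothetical stand-in for the notebooks' own fine-tuning loop, and the candidate grid is an assumption:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def search_learning_rate(texts, labels, candidate_lrs=(1e-5, 2e-5, 3e-5, 5e-5)):
    """Return the candidate LR with the best mean F1 over 3 folds.

    texts and labels are expected as NumPy arrays so fold indexing works.
    """
    kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    mean_f1 = {}
    for lr in candidate_lrs:
        fold_scores = []
        for train_idx, val_idx in kfold.split(texts, labels):
            # train_and_eval_f1 is hypothetical: fine-tune on the train fold,
            # return validation F1 on the held-out fold.
            fold_scores.append(train_and_eval_f1(
                texts[train_idx], labels[train_idx],
                texts[val_idx], labels[val_idx], lr=lr,
            ))
        mean_f1[lr] = float(np.mean(fold_scores))
    return max(mean_f1, key=mean_f1.get)
```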