This repository contains two Python scripts that fine-tune a `roberta-base` model for Named Entity Recognition (NER) on the PLOD-CW-25 dataset, augmented with 25% and 50% additional samples from the PLODv2-filtered dataset. The goal is to evaluate the performance gains in token-level classification from partial data augmentation.
| Filename | Description |
|---|---|
| `RoBERTa+25%data.py` | Fine-tunes `roberta-base` on PLOD-CW-25 with 25% of PLODv2-filtered samples added to the training and validation sets. |
| `RoBERTa+50%data.py` | Fine-tunes `roberta-base` on PLOD-CW-25 with 50% of PLODv2-filtered samples added to the training and validation sets. |
- **PLOD-CW-25**: Legal NER dataset with annotated tokens from case law.
- **PLODv2-filtered**: A filtered version of PLODv2 used for optional training-data augmentation.
Each script:
- Loads both datasets.
- Randomly samples a fraction of PLODv2-filtered.
- Merges it with the original training and validation splits.
- Converts the data into Hugging Face `Datasets` format.
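The sample-and-merge step above could be sketched as follows. This is a minimal illustration using plain lists of examples in place of the actual Hugging Face `Dataset` objects; the function names and seed are illustrative, not taken from the scripts:

```python
import random

def sample_fraction(examples, frac, seed=42):
    """Randomly sample a fraction of examples without replacement."""
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    k = int(len(examples) * frac)
    return rng.sample(examples, k)

def merge_splits(base, extra, frac, seed=42):
    """Append a sampled fraction of `extra` to the `base` split."""
    return base + sample_fraction(extra, frac, seed)

# Toy stand-ins for the PLOD-CW-25 and PLODv2-filtered training splits.
base_train = [{"tokens": ["EU", "law"], "ner_tags": [0, 0]}] * 100
extra = [{"tokens": ["LF", "long", "form"], "ner_tags": [1, 2, 2]}] * 200

merged = merge_splits(base_train, extra, 0.25)
print(len(merged))  # 100 base + 50 sampled = 150
```

In the actual scripts, the same idea would apply via `Dataset.shuffle().select()` and `concatenate_datasets` from the `datasets` library.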
- Model: `roberta-base`
- Task: Token classification
- Tags: `O`, `B-AC`, `B-LF`, `I-LF`
- Tokenizer: `RobertaTokenizerFast` with `add_prefix_space=True`
- Optimizer: Adafactor
- Epochs: 3
- Batch size: 16
- Scheduler: Constant learning rate
- Evaluation metric: `seqeval`
- Reports:
  - Overall: precision, recall, F1, accuracy
  - Entity-wise: per-class F1
  - Visuals: confusion matrix & bar plot for metrics
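Because RoBERTa's BPE tokenizer splits words into subwords, the word-level tags listed above have to be realigned to subword tokens before training. A minimal sketch of that alignment, assuming the `word_ids` mapping that `RobertaTokenizerFast` returns for a fast tokenizer (the helper name is hypothetical):

```python
IGNORE = -100  # label index ignored by the cross-entropy loss

def align_labels(word_ids, word_labels):
    """Map word-level tag ids onto subword tokens.

    word_ids: per-token word index from a fast tokenizer
              (None for special tokens like <s> and </s>).
    word_labels: one tag id per original word.
    """
    labels, prev = [], None
    for wid in word_ids:
        if wid is None:          # special token: no label
            labels.append(IGNORE)
        elif wid != prev:        # first subword of a word keeps its tag
            labels.append(word_labels[wid])
        else:                    # later subwords are masked out
            labels.append(IGNORE)
        prev = wid
    return labels

# Two words; the second splits into three subwords (tag 2 = B-LF).
print(align_labels([None, 0, 1, 1, 1, None], [0, 2]))
# [-100, 0, 2, -100, -100, -100]
```

`add_prefix_space=True` is required here because RoBERTa's tokenizer is byte-level and needs a leading space to tokenize pre-split words consistently.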
Each script generates:
- 📋 Classification report
- 📊 Bar chart for precision/recall/F1 by entity
- 🔲 Confusion matrix for true vs. predicted labels
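The per-class scores in the classification report are entity-level, as computed by `seqeval`: a prediction counts as correct only if the whole span and its type match. A pure-Python sketch of that span-matching idea (simplified, not the actual seqeval implementation; function names are illustrative):

```python
def extract_entities(tags):
    """Collect (type, start, end) spans from BIO tags, e.g. B-LF/I-LF."""
    spans, start, etype = [], None, None
    for i, t in enumerate(tags + ["O"]):        # sentinel closes open spans
        if t.startswith("B-") or t == "O":
            if start is not None:               # close the current span
                spans.append((etype, start, i))
                start, etype = None, None
        if t.startswith("B-"):
            start, etype = i, t[2:]
        elif t.startswith("I-") and start is None:
            start, etype = i, t[2:]             # tolerate a stray I- tag
    return set(spans)

def entity_f1(true_tags, pred_tags):
    """Span-level F1: exact (type, start, end) matches only."""
    gold, pred = extract_entities(true_tags), extract_entities(pred_tags)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

print(entity_f1(["O", "B-LF", "I-LF"], ["O", "B-LF", "I-LF"]))  # 1.0
print(entity_f1(["O", "B-LF", "I-LF"], ["O", "B-LF", "O"]))     # 0.0
```

Note the second case: predicting only part of a long-form span yields zero credit, which is why entity-level F1 is stricter than token accuracy.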
| Metric | 25% Additional Data | 50% Additional Data |
|---|---|---|
| Precision | ~0.88 | ~0.90 |
| Recall | ~0.89 | ~0.91 |
| F1 score | ~0.88–0.89 | ~0.90+ |
| Accuracy | ~0.89 | ~0.90 |
```bash
pip install datasets transformers huggingface_hub evaluate seqeval nbconvert
```
- Scripts are optimized for execution in Google Colab.
- Designed for experimenting with data augmentation in NER tasks.
- Results help assess the trade-off between adding more data and increased training time.
```json
{
  "precision": 0.89,
  "recall": 0.91,
  "f1": 0.90,
  "accuracy": 0.90
}
```
📁 Check the detailed output in the confusion matrix and classification report printed at the end of each script.
Aaditya Singh – MSc Data Science
For academic use and performance benchmarking of NLP models in the legal domain.