
🧠 RoBERTa Token Classification with Additional PLODv2 Data

This repository contains two Python scripts that fine-tune a roberta-base model for Named Entity Recognition (NER) on the PLOD-CW-25 dataset, with 25% and 50% additional samples from the PLODv2-filtered dataset. The goal is to evaluate performance gains in token-level classification using partial data augmentation.


📁 Files

| Filename | Description |
| --- | --- |
| RoBERTa+25%data.py | Fine-tunes roberta-base on PLOD-CW-25 with 25% of PLODv2-filtered samples added to the training and validation sets. |
| RoBERTa+50%data.py | Fine-tunes roberta-base on PLOD-CW-25 with 50% of PLODv2-filtered samples added to the training and validation sets. |

🔍 Dataset Overview

  • PLOD-CW-25: An abbreviation-detection dataset with token-level annotations of abbreviations and their long forms.
  • PLODv2-filtered: A filtered release of the larger PLODv2 corpus, used here as an optional source of additional training data.

Each script follows the same pipeline (sketched in code after this list):

  • Loads both datasets.
  • Randomly samples a fraction of PLODv2-filtered.
  • Merges it with the original training and validation splits.
  • Converts the data into Hugging Face Datasets format.
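A minimal sketch of that pipeline using the Hugging Face datasets library is shown below. The hub IDs, the fraction, and the seed are illustrative assumptions, not the scripts' actual values.

from datasets import load_dataset, concatenate_datasets

# Hub IDs below are placeholders for illustration; substitute the
# dataset paths the scripts actually use.
plod_cw = load_dataset("surrey-nlp/PLOD-CW-25")
plod_v2 = load_dataset("surrey-nlp/PLOD-filtered")

def sample_fraction(split, fraction, seed=42):
    # Shuffle deterministically, then keep the first `fraction` of the rows.
    return split.shuffle(seed=seed).select(range(int(len(split) * fraction)))

fraction = 0.5  # 0.25 for the 25% script
# Both datasets must share the same column schema (e.g. tokens / ner_tags)
# for concatenate_datasets to succeed.
train = concatenate_datasets([plod_cw["train"],
                              sample_fraction(plod_v2["train"], fraction)])
val = concatenate_datasets([plod_cw["validation"],
                            sample_fraction(plod_v2["validation"], fraction)])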

🔧 Model Architecture

  • Model: roberta-base
  • Task: Token classification
  • Tags: O, B-AC (abbreviation), B-LF and I-LF (long form)
  • Tokenizer: RobertaTokenizerFast with add_prefix_space=True
  • Optimizer: Adafactor
  • Epochs: 3
  • Batch Size: 16
  • Scheduler: Constant learning rate (see the configuration sketch below)
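A configuration sketch matching the settings above, assuming the standard transformers Trainer setup (output_dir is a placeholder):

from transformers import (RobertaTokenizerFast, RobertaForTokenClassification,
                          TrainingArguments)

label_list = ["O", "B-AC", "B-LF", "I-LF"]
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base",
                                                 add_prefix_space=True)
model = RobertaForTokenClassification.from_pretrained("roberta-base",
                                                      num_labels=len(label_list))

args = TrainingArguments(
    output_dir="roberta-plod",         # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=16,
    optim="adafactor",                 # Adafactor optimizer
    lr_scheduler_type="constant",      # constant learning rate
)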

📊 Metrics & Evaluation

  • Evaluation Metric: seqeval (usage sketched after this list)
  • Reports:
    • Overall: Precision, Recall, F1, Accuracy
    • Entity-wise: Per-class F1
    • Visuals: Confusion Matrix & Bar Plot for metrics
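A typical seqeval hookup is sketched below, assuming the usual compute_metrics pattern in which special tokens carry the label -100:

import numpy as np
import evaluate

seqeval = evaluate.load("seqeval")
label_list = ["O", "B-AC", "B-LF", "I-LF"]

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Strip positions labelled -100 (special tokens / padding) before scoring.
    true_preds = [[label_list[p] for p, l in zip(ps, ls) if l != -100]
                  for ps, ls in zip(preds, labels)]
    true_labels = [[label_list[l] for p, l in zip(ps, ls) if l != -100]
                   for ps, ls in zip(preds, labels)]
    results = seqeval.compute(predictions=true_preds, references=true_labels)
    return {"precision": results["overall_precision"],
            "recall": results["overall_recall"],
            "f1": results["overall_f1"],
            "accuracy": results["overall_accuracy"]}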

📈 Visual Outputs

Each script generates:

  • 📋 Classification report
  • 📊 Bar chart for precision/recall/F1 by entity
  • 🔲 Confusion matrix for true vs. predicted labels (plotting sketched below)
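The confusion matrix can be drawn with scikit-learn, for example. The flat_true and flat_pred lists below are dummy stand-ins for the per-token gold and predicted tags flattened across sentences, so the snippet runs on its own:

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

label_list = ["O", "B-AC", "B-LF", "I-LF"]
# In the scripts these come from the evaluation step; tiny dummy
# sequences are shown here for illustration only.
flat_true = ["O", "B-AC", "B-LF", "I-LF", "O"]
flat_pred = ["O", "B-AC", "B-LF", "B-LF", "O"]

cm = confusion_matrix(flat_true, flat_pred, labels=label_list)
ConfusionMatrixDisplay(cm, display_labels=label_list).plot()
plt.show()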

📊 Side-by-Side Result Comparison

| Metric | 25% Additional Data | 50% Additional Data |
| --- | --- | --- |
| Precision | ~0.88 | ~0.90 |
| Recall | ~0.89 | ~0.91 |
| F1 Score | ~0.88–0.89 | ~0.90+ |
| Accuracy | ~0.89 | ~0.90 |

🛠️ Installation

pip install datasets transformers huggingface_hub evaluate seqeval nbconvert

📌 Notes

  • Scripts are written to run in Google Colab.
  • Designed for experimentation with data augmentation in NER tasks.
  • Results help assess the trade-off between adding training data and increased training time.

📤 Output Example

{
  "precision": 0.89,
  "recall": 0.91,
  "f1": 0.90,
  "accuracy": 0.90
}

📁 See the confusion matrix and classification report printed at the end of each script for detailed per-class results.


✍️ Author

Aaditya Singh – MSc Data Science
For academic use and performance benchmarking of NLP models on abbreviation and long-form detection.


About

This repo explores token classification for abbreviation and long-form detection using RoBERTa. We evaluate the impact of adding 25% and 50% of the PLODv2-filtered dataset, with the 50% setting achieving improved F1 and recall. The repo includes methodology, evaluation using seqeval, and confusion matrix analysis.
