This project tackles the problem of subjectivity detection in natural language 🌐—a fundamental task for applications like fake news detection ❌📰 and fact-checking ✅. The goal is to classify sentences as subjective (SUBJ) or objective (OBJ) across five languages: Arabic, German, English, Italian, and Bulgarian.
We employ two primary approaches for subjectivity detection:

- mDeBERTaV3-base 📖 and ModernBERT-base 🔍
  - Fine-tuned on language-specific datasets with integrated sentiment information 💬 for enhanced performance.
- Llama3.2-1B 🦙
  - Evaluated on its ability to capture subjectivity from general knowledge representations.
- BERT-like models exhibit superior performance in capturing nuanced information compared to LLMs.
- Incorporating sentiment information significantly improves the SUBJ F1 score for English and Italian, with smaller gains for the other languages.
- Decision threshold calibration is essential for improving performance when handling imbalanced label distributions.
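The threshold calibration mentioned above can be sketched as a simple sweep over candidate cut-offs on the development set, keeping the one that maximizes macro-average F1. This is an illustrative reconstruction; the function and variable names below are ours, not the repository's.

```python
import numpy as np
from sklearn.metrics import f1_score

def calibrate_threshold(dev_probs, dev_labels, grid=np.linspace(0.1, 0.9, 81)):
    """Pick the decision threshold that maximizes macro-average F1 on the dev set.

    dev_probs  : array of model scores P(SUBJ) per sentence
    dev_labels : array of gold labels (1 = SUBJ, 0 = OBJ)
    """
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        preds = (dev_probs >= t).astype(int)
        score = f1_score(dev_labels, preds, average="macro")
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1

# Toy imbalance-style example: the default 0.5 cut-off is not optimal here.
probs = np.array([0.35, 0.40, 0.45, 0.60, 0.30, 0.20, 0.48, 0.55])
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
t, f1 = calibrate_threshold(probs, labels)
```

When the SUBJ/OBJ distribution is skewed, the calibrated threshold typically drifts away from 0.5 toward the minority class, which is why this step matters for the imbalanced splits.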
- Data Preparation: 📂 Data augmentation using sentiment scores, tokenization, and preprocessing.
- Model Training: 🔧 Fine-tuning mDeBERTaV3, ModernBERT, and Llama3.2-1B.
- Evaluation: 📈 Evaluation metrics include macro-average F1 score and SUBJ F1 score with focus on threshold optimization.
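One simple way to realize the sentiment-augmentation step in data preparation is to prepend sentiment scores to each sentence before tokenization, so the encoder sees them alongside the text. This is only one common pattern and a sketch under our own assumptions; the project's actual injection mechanism (e.g. feature concatenation at the hidden-state level) may differ.

```python
def augment_with_sentiment(sentence: str, sentiment: dict) -> str:
    """Prepend coarse sentiment probabilities to the input text.

    `sentiment` is assumed to hold 'negative', 'neutral', and 'positive'
    probabilities, e.g. from an off-the-shelf sentiment classifier.
    """
    prefix = (
        f"[NEG={sentiment['negative']:.2f}] "
        f"[NEU={sentiment['neutral']:.2f}] "
        f"[POS={sentiment['positive']:.2f}] "
    )
    return prefix + sentence

example = augment_with_sentiment(
    "The new policy is a disaster.",
    {"negative": 0.91, "neutral": 0.07, "positive": 0.02},
)
```

The augmented string is then tokenized and fine-tuned on as usual, letting the model correlate strong sentiment with subjectivity.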
The architecture of the proposed system is illustrated in the diagram included in the repository.
- Python 3.x 🐍
- PyTorch 🔥
- Hugging Face Transformers 🤗
- Dependencies specified in `requirements.txt` 📋
- Clone the repository:

  ```bash
  git clone https://github.com/MatteoFasulo/clef2025-checkthat.git
  cd clef2025-checkthat
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
To evaluate the model performance on the development set for English, use:

```bash
python scorer/evaluate.py -g data/english/dev_en.tsv -p results/dev_english_predicted.tsv
```
To evaluate the sentiment-enhanced model:

```bash
python scorer/evaluate.py -g data/english/dev_en.tsv -p results/dev_english_sentiment_predicted_.tsv
```
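The two headline metrics the scorer reports — macro-average F1 and SUBJ F1 — can be reproduced with scikit-learn, assuming gold and predicted labels have been read from the TSV files. The label lists below are illustrative, and the official scorer may report additional statistics.

```python
from sklearn.metrics import f1_score

# Illustrative gold and predicted labels (in practice, read from the TSVs).
gold = ["SUBJ", "OBJ", "SUBJ", "OBJ", "OBJ"]
pred = ["SUBJ", "OBJ", "OBJ", "OBJ", "OBJ"]

# Macro-average F1: unweighted mean of per-class F1 scores.
macro_f1 = f1_score(gold, pred, average="macro")

# SUBJ F1: F1 computed with SUBJ as the positive class.
subj_f1 = f1_score(gold, pred, pos_label="SUBJ", average="binary")
```

Macro averaging weighs SUBJ and OBJ equally regardless of their frequency, which is why it is the primary metric under class imbalance.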
- [GitHub Repository](https://github.com/MatteoFasulo/clef2025-checkthat) 📂
- Dataset 🗃️
This project highlights the effectiveness of BERT-like models for subjectivity detection and emphasizes the importance of handling linguistic variability and class imbalance. Future work will focus on enhancing LLM performance and addressing challenges identified in the error analysis.
Licensed under the MIT License - see the LICENSE file for details.