CTX-SARC-Embed is a state-of-the-art sarcasm detection system that uses transfer learning with RoBERTa-base to classify social media text into three categories: Sarcasm, Irony, and Regular. The system achieves 93.9% accuracy and 93.6% macro F1-score through advanced data preprocessing, model optimization, and comprehensive experiment tracking.
**Google Drive folder**: <https://drive.google.com/drive/folders/1GJ0pQNgMWu3WHQq1-GHKDNse4nEpfcW1?usp=sharing>
### 🌟 Highlights

- ✅ 93.9% Test Accuracy with balanced performance across all classes
- ✅ Fixed Critical Data Issues: Eliminated 31.89% data leakage, corrected 23,276 mislabeled samples
- ✅ Parameter Efficiency: 97.7% efficiency through transfer learning (231K trainable out of 124.9M total)
- ✅ Fast Training: 10-15 minutes training time with frozen backbone approach
- ✅ Production Ready: Comprehensive MLflow tracking and deployment artifacts
- ✅ Reproducible Pipeline: Complete experiment tracking and version control
### 🧠 Technical Details
- **Base Model**: RoBERTa-base (124.6M parameters)
- **Frozen Backbone**: transfer-learning approach for efficiency; the encoder is never updated
- **Trainable Classifier**: 3-layer MLP head (768→256→128→3)
- **Parameter Efficiency**: 97.7% (only 231K trainable parameters)
**Classifier Architecture** (see the PyTorch sketch after the regularization list):

```
Input: RoBERTa [CLS] token (768d)
├── Dropout(0.2) + Linear(768→256) + ReLU + BatchNorm
├── Dropout(0.1) + Linear(256→128) + ReLU + BatchNorm
└── Dropout(0.06) + Linear(128→3) → [Sarcasm, Irony, Regular]
```
**Regularization Techniques**:
- Multi-layer dropout (0.2, 0.1, 0.06)
- Batch normalization
- Weight decay (0.01)
- Early stopping (patience: 4)
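
For concreteness, here is a minimal PyTorch sketch of the frozen-backbone model and head described above. The class name is illustrative rather than the project's actual module; with these layer sizes the head comes to roughly 231K trainable parameters, consistent with the efficiency figures quoted earlier.

```python
# Minimal sketch of the architecture above (class name is illustrative).
import torch
import torch.nn as nn
from transformers import AutoModel

class SarcasmClassifierSketch(nn.Module):
    def __init__(self, num_classes: int = 3):
        super().__init__()
        self.backbone = AutoModel.from_pretrained("roberta-base")
        # Freeze the RoBERTa encoder: only the MLP head below is trained.
        for p in self.backbone.parameters():
            p.requires_grad = False
        # 768 -> 256 -> 128 -> 3 head with per-layer dropout and batch norm,
        # matching the regularization settings listed above (~231K parameters).
        self.head = nn.Sequential(
            nn.Dropout(0.2), nn.Linear(768, 256), nn.ReLU(), nn.BatchNorm1d(256),
            nn.Dropout(0.1), nn.Linear(256, 128), nn.ReLU(), nn.BatchNorm1d(128),
            nn.Dropout(0.06), nn.Linear(128, num_classes),
        )

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # embedding at the [CLS] position (768d)
        return self.head(cls)
```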
## 📊 Data Pipeline

### 🔧 Data Processing Workflow
1. **Raw Data Analysis** (81,408 train + 8,128 test samples)
2. **Critical Issue Detection & Fixing**:
   - **Data leakage**: 31.89% → 0% ✅
   - **Mislabeled samples**: 23,276 corrected ✅
   - **Duplicates**: 16.47% removed ✅
   - **Encoding issues**: 3,609 HTML entities fixed ✅
3. **Advanced Preprocessing** (see the tokenization sketch below):
   - RoBERTa tokenization (max length: 128)
   - Context simulation for social-media text
   - Hashtag preservation to retain semantic meaning
   - Special token handling (RoBERTa's `<s>`/`</s>`, the [CLS]/[SEP] equivalents)
4. **Final Clean Dataset** (64,657 train + 7,185 test samples)
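
As a hedged illustration of step 3, the tokenization could look like the following; the helper name and sample text are hypothetical, not taken from the pipeline.

```python
# Illustrative tokenization step (helper name and sample text are hypothetical).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def preprocess(text: str):
    # Hashtags are passed through untouched, matching the hashtag-preservation step;
    # the tokenizer adds RoBERTa's <s>/</s> special tokens automatically.
    return tokenizer(
        text,
        truncation=True,
        padding="max_length",
        max_length=128,
        return_tensors="pt",
    )

batch = preprocess("Wow, #MondayMotivation is really working out for me")
print(batch["input_ids"].shape)  # torch.Size([1, 128])
```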
### 📈 Data Quality Metrics
| Metric | Original | After Cleaning | Improvement |
| ---------------------- | -------- | -------------- | ----------- |
| **Data Leakage** | 31.89% | 0% | ✅ 100% |
| **Mislabeled Samples** | 23,276 | 0 | ✅ Fixed |
| **Duplicates** | 16.47% | 0% | ✅ Removed |
| **Encoding Issues** | 3,609 | 0 | ✅ Fixed |
| **Clean Rate** | ~60% | 99.9% | ✅ +39.9% |
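
For reference, the leakage and duplicate figures above can be measured along these lines; the file paths and the `text` column name are assumptions, not the project's actual layout.

```python
# Hedged sketch for measuring the data-quality metrics above
# (file paths and the "text" column are assumed, not confirmed).
import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Train/test leakage: share of test texts that also appear in the training split.
leakage = test["text"].isin(set(train["text"])).mean()
print(f"Data leakage: {leakage:.2%}")

# Duplicate rate within the training split.
dup_rate = train["text"].duplicated().mean()
print(f"Duplicate rate: {dup_rate:.2%}")
```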
## 📊 Training & Performance

### 🎯 Training Configuration
```yaml
Model: RoBERTa-base (frozen) + 3-layer MLP
Optimizer: AdamW
Learning Rate: 3e-5
Batch Size: 64
Epochs: 10
Scheduler: ReduceLROnPlateau
Weight Decay: 0.01
Early Stopping: Patience 4
Device: CUDA (if available)
```
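
Wiring these settings together might look like the sketch below; the scheduler's reduction factor and internal patience are assumptions, since the config only names the scheduler type.

```python
# Optimizer/scheduler/early-stopping wiring for the config above (a sketch;
# the scheduler's factor and patience are assumptions).
import torch
import torch.nn as nn

model = nn.Linear(768, 3)  # stand-in for the frozen-RoBERTa + MLP model
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),  # only unfrozen parameters
    lr=3e-5,
    weight_decay=0.01,
)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=2
)

best_f1, stale_epochs = 0.0, 0
for epoch in range(10):
    val_f1 = 0.0  # ...train one epoch, then compute macro F1 on the validation set
    scheduler.step(val_f1)  # lower the LR when validation F1 plateaus
    if val_f1 > best_f1:
        best_f1, stale_epochs = val_f1, 0
    else:
        stale_epochs += 1
        if stale_epochs >= 4:  # early stopping with patience 4
            break
```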
### 🏆 Final Results

| Metric | Score | Percentage |
| ----------------- | ------ | ---------- |
| Test Accuracy | 0.9389 | 93.9% |
| Macro F1-Score | 0.9364 | 93.6% |
| Weighted F1-Score | 0.9388 | 93.9% |
**Per-Class Performance**:
- Sarcasm: 93.1% F1-score
- Irony: 93.7% F1-score
- Regular: 94.0% F1-score
### 📋 Dataset Statistics

**Training Set Distribution**:
- Sarcasm: 33.3% (21,552 samples)
- Irony: 33.4% (21,584 samples)
- Regular: 33.3% (21,521 samples)
**Text Statistics**:
- Average Length: 67.8 characters
- Token Range: 12-142 tokens
- Language: English (social media text)
- Context: Twitter-like short messages
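
The statistics above could be recomputed along these lines; the file path and column name are assumptions.

```python
# Recomputing the text statistics (sketch; file path and column name assumed).
import pandas as pd
from transformers import AutoTokenizer

train = pd.read_csv("train.csv")
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

char_lengths = train["text"].str.len()
token_counts = train["text"].map(lambda t: len(tokenizer(t)["input_ids"]))
print(f"Average length: {char_lengths.mean():.1f} characters")
print(f"Token range: {token_counts.min()}-{token_counts.max()} tokens")
```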
### 🧪 MLflow Experiment Tracking

**Tracked Metrics**:
- Training/Validation Loss & F1-Score
- Learning rate per epoch
- Final test performance metrics
- Model convergence monitoring
**Logged Parameters**:
- Model architecture settings
- Training hyperparameters
- Data preprocessing configuration
- Regularization parameters
**Artifacts**:
- Best model checkpoints (`best_model_*.pt`)
- Training visualizations
- Performance reports
- Configuration files
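
A minimal sketch of how these items map onto MLflow calls; the run, metric, and parameter names here are illustrative, not the pipeline's actual keys.

```python
# Hedged MLflow logging sketch (run/metric/parameter names are illustrative).
import mlflow

history = [(0.52, 0.88), (0.31, 0.92)]  # example (train_loss, val_f1) pairs

with mlflow.start_run(run_name="ctx-sarc-embed"):
    # Hyperparameters and architecture/preprocessing settings.
    mlflow.log_params({"lr": 3e-5, "batch_size": 64, "epochs": 10,
                       "weight_decay": 0.01, "max_length": 128})
    # Per-epoch training curves.
    for epoch, (train_loss, val_f1) in enumerate(history):
        mlflow.log_metric("train_loss", train_loss, step=epoch)
        mlflow.log_metric("val_f1", val_f1, step=epoch)
    # Final test performance and artifacts.
    mlflow.log_metric("test_accuracy", 0.9389)
    mlflow.log_artifact("best_model_20250601_173155.pt")
```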
## 🚀 Quick Start

```bash
# Clone repository
git clone <repository-url>
cd CTX-SARC-Embed

# Install dependencies
pip install -r requirements.txt

# Run optimized training pipeline
python optimized_training_pipeline_2025-01-01.py

# Monitor with MLflow
mlflow ui

# Generate comprehensive analysis
python comprehensive_analysis_2025-01-01.py

# Create detailed visualizations
python advanced_comprehensive_visualization_2025-01-01.py
python detailed_performance_visualization_2025-01-01.py
```
### 🔮 Inference Example

```python
import torch
from transformers import AutoTokenizer
from your_model import AdvancedSarcasmClassifier

# Load model and tokenizer
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = AdvancedSarcasmClassifier().to(device)
model.load_state_dict(torch.load('best_model_20250601_173155.pt', map_location=device))
model.eval()  # required: BatchNorm layers must be in eval mode for single-sample inference
tokenizer = AutoTokenizer.from_pretrained('roberta-base')

# Predict
text = "Oh great, another Monday morning! 🙄"
inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=128)
with torch.no_grad():
    logits, _ = model(inputs['input_ids'].to(device), inputs['attention_mask'].to(device))
prediction = torch.argmax(logits, dim=1)

# Class indices: 0=Sarcasm, 1=Irony, 2=Regular
print(['Sarcasm', 'Irony', 'Regular'][prediction.item()])
```