This repository contains our implementation for SemEval 2025 Task 9: The Food Hazard Detection Challenge. The challenge focuses on explainable classification systems for food-incident report titles collected from the web. The goal is to develop automated systems that identify and extract food-related hazards with high transparency and explainability.
Our system focuses on Subtask 1 (ST1): Text classification for food hazard prediction.
We utilize three distinct datasets in our approach:
- Original Dataset: The unmodified baseline dataset
- Aug1 Dataset: Targeted augmentation with 100 additional samples for:
  - 9 lowest-represented product categories
  - 4 lowest-represented hazard categories
- Aug2 Dataset: Comprehensive augmentation achieving near-balanced class distributions
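The selection of the lowest-represented categories for targeted augmentation can be sketched as below. This is a minimal illustration, not the repository's code: the function name `lowest_represented` and the toy labels are hypothetical, and only the idea of picking the k least frequent categories comes from the dataset description.

```python
from collections import Counter


def lowest_represented(labels, k):
    """Return the k least frequent labels, rarest first.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(labels)
    ranked = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    return [label for label, _ in ranked[:k]]


# Toy hazard-category labels, made up for illustration only.
hazard_labels = (
    ["biological"] * 50
    + ["allergens"] * 30
    + ["migration"] * 2
    + ["food additives"] * 1
)

rare_hazards = lowest_represented(hazard_labels, k=2)
print(rare_hazards)
```

The returned categories would then be the ones that receive additional samples.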
- Strategic Data Augmentation:
  - Aug1: Focused on addressing severe class imbalances
  - Aug2: Comprehensive balancing across all classes
- Robust Model Ensemble: The ensemble consists of 13 specialized models:
  - 6 models for hazard-category classification
  - 5 models for product-category classification
  - 1 multitask model handling both classifications

  All models are based on two primary architectures: deberta-v3-large and roberta-large. Model variations are created through different token chunking strategies during preprocessing.
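The README does not spell out which chunking strategies are used, so the following is only a sketch of one common option: overlapping fixed-size windows over a token sequence. The function name `chunk_tokens` and the `max_length` and `stride` values are assumptions chosen for illustration.

```python
def chunk_tokens(token_ids, max_length, stride):
    """Split token_ids into windows of at most max_length tokens.

    Consecutive windows overlap by `stride` tokens, so text cut at a
    window boundary still appears with some context in the next window.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")

    step = max_length - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_length])
        # Stop once a window has reached the end of the sequence.
        if start + max_length >= len(token_ids):
            break
    return chunks


# Ten dummy token ids, windows of 4 tokens overlapping by 1 token.
windows = chunk_tokens(list(range(10)), max_length=4, stride=1)
print(windows)
```

Varying `max_length` and `stride` yields different views of the same report, which is one way preprocessing alone could produce distinct ensemble members.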
Our system achieved:
- #1 Position on the Final Leaderboard
- Top performance during the Conception Phase

These results validate our ensemble approach and preprocessing strategies.
- Python == 3.10
- Clone the repository:
    git clone https://github.com/Zhennor/Semeval-Task9-The-Food-Hazard-Detection-Challenge-2025
    cd Semeval-Task9-The-Food-Hazard-Detection-Challenge-2025
- Install necessary libraries:
    pip install -r requirements.txt
- Train model:
Multitask training: this approach trains the hazard-category and product-category classifiers simultaneously, which can lead to better performance through shared learning.
    python3 train_multitask.py \
        --input_file /path/to/your/train_chunk.json \
        --output_dir ./results \
        --model_output_dir ./result \
        --learning_rate 2e-5 \
        --num_epochs 10 \
        --train_batch_size 4 \
        --eval_batch_size 2 \
        --gradient_accumulation_steps 4 \
        --oversample_count 50 \
        --undersample_count 500 \
        --seed 42
- input_file: Path to training data JSON file
- output_dir: Directory for saving training results
- model_output_dir: Directory for saving model checkpoints
- learning_rate: Learning rate for training (default: 2e-5)
- num_epochs: Number of training epochs (default: 10)
- train_batch_size: Batch size for training (default: 4)
- eval_batch_size: Batch size for evaluation (default: 2)
- gradient_accumulation_steps: Number of steps to accumulate gradients (default: 4)
- oversample_count: Count for oversampling minority classes (default: 50)
- undersample_count: Count for undersampling majority classes (default: 500)
- seed: Random seed for reproducibility (default: 42)
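How the oversampling and undersampling counts might act on the training set can be sketched as follows. The exact semantics inside `train_multitask.py` are not documented here, so this assumes that classes with fewer than `oversample_count` samples are resampled with replacement up to that count, and classes with more than `undersample_count` samples are randomly reduced to it. The function `rebalance` and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def rebalance(samples, label_key, oversample_count=50, undersample_count=500, seed=42):
    """Return a resampled copy of samples with per-class sizes clamped.

    Assumed behaviour (not confirmed by the repository):
    - classes below oversample_count are padded by sampling with replacement
    - classes above undersample_count are cut by sampling without replacement
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[label_key]].append(sample)

    balanced = []
    for group in by_label.values():
        if len(group) < oversample_count:
            group = group + rng.choices(group, k=oversample_count - len(group))
        elif len(group) > undersample_count:
            group = rng.sample(group, undersample_count)
        balanced.extend(group)

    rng.shuffle(balanced)
    return balanced


# Toy data with made-up labels: one rare, one frequent, one in between.
toy = (
    [{"title": f"rare {i}", "label": "a"} for i in range(3)]
    + [{"title": f"common {i}", "label": "b"} for i in range(20)]
    + [{"title": f"medium {i}", "label": "c"} for i in range(7)]
)
resampled = rebalance(toy, "label", oversample_count=5, undersample_count=10)
class_sizes = Counter(sample["label"] for sample in resampled)
print(class_sizes)
```

Fixing the seed keeps the resampled set identical across runs, which matches the purpose of the `seed` parameter.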
Independent training: use this approach when you want to train the hazard-category and product-category models separately.
    python3 train_independent.py \
        --data_path /path/to/data.json \
        --model_path microsoft/deberta-v3-large \
        --task hazard \
        --max_length 512 \
        --output_dir output_classification \
        --batch_size 1 \
        --learning_rate 1e-5 \
        --num_epochs 15
- data_path: Path to training data JSON file
- model_path: Path to pretrained model (default: microsoft/deberta-v3-large)
- task: Options: 'hazard' or 'product'. Specifies whether to train the hazard-category or product-category classifier
- max_length: Maximum sequence length (default: 1280)
- output_dir: Directory for saving outputs
- batch_size: Batch size for training and evaluation (default: 1)
- learning_rate: Learning rate for training (default: 1e-5)
- num_epochs: Number of training epochs (default: 15)
- Predict:
Use this approach when you have a single model trained for both tasks:
    python3 predict_multitask.py \
        --model_name "microsoft/deberta-v3-large" \
        --input_json "path/to/data" \
        --output_dir "output" \
        --batch_size 8 \
        --label_mapping "data/label_mappings.json"
- model_name: HuggingFace model name or path (default: Quintu/deberta-v3-large-multitask-food)
- input_json: Path to test data JSON file
- output_dir: Directory to save predictions
- batch_size: Batch size for inference (default: 8)
- label_mapping: Path to label mapping file (default: data/label_mappings.json)
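The role of the label mapping file can be illustrated with a short sketch. The real layout of `data/label_mappings.json` is not shown in this README, so the structure below (task name mapped to a label-to-id dictionary) and the helper `build_decoders` are assumptions, and the labels are toy values.

```python
import json

# Assumed layout of a label mapping file: task -> {label: class id}.
mapping_json = """
{
  "hazard-category": {"allergens": 0, "biological": 1},
  "product-category": {"dairy": 0, "seafood": 1}
}
"""


def build_decoders(mapping):
    """Invert {label: id} into {id: label} for every task."""
    return {
        task: {class_id: label for label, class_id in labels.items()}
        for task, labels in mapping.items()
    }


decoders = build_decoders(json.loads(mapping_json))

# Class ids as a model might output them for two reports (made up).
predicted_ids = {"hazard-category": [1, 0], "product-category": [0, 1]}
decoded = {
    task: [decoders[task][class_id] for class_id in ids]
    for task, ids in predicted_ids.items()
}
print(decoded)
```

Keeping the mapping in a file lets training and prediction agree on which class id stands for which category.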
Use this approach when you have separate models for hazard and product classification:
    python3 predict_independent.py \
        --hazard_model "huggingface_hazard_model_path" \
        --product_model "huggingface_product_model_path" \
        --input_json "data/private_test_512.json" \
        --output_csv "submission.csv" \
        --output_zip "submission.zip" \
        --output_hazard_json "hazard_predictions.json" \
        --output_product_json "product_predictions.json"
- hazard_models: Space-separated list of hazard model paths
- product_models: Space-separated list of product model paths
- input_json: Path to test data JSON file
- output_csv: Path to the output CSV file
- output_zip: Path to the output ZIP file
- output_hazard_json: Path to the output hazard predictions JSON
- output_product_json: Path to the output product predictions JSON
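When several models are supplied per task, their predictions have to be combined before the submission files are written. The README does not state how `predict_independent.py` does this, so the sketch below assumes simple majority voting and a two-column CSV. The helpers `majority_vote` and `write_submission`, the column layout, and the toy predictions are all hypothetical.

```python
import csv
import tempfile
import zipfile
from collections import Counter
from pathlib import Path


def majority_vote(per_model_predictions):
    """Combine aligned prediction lists (one list per model) by majority vote.

    Ties go to the label that appears first, i.e. the earliest model.
    """
    return [
        Counter(sample_predictions).most_common(1)[0][0]
        for sample_predictions in zip(*per_model_predictions)
    ]


def write_submission(hazards, products, csv_path, zip_path):
    """Write predictions to a CSV file and pack that file into a ZIP archive."""
    csv_path, zip_path = Path(csv_path), Path(zip_path)
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["hazard-category", "product-category"])  # assumed columns
        writer.writerows(zip(hazards, products))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(csv_path, arcname=csv_path.name)


# Made-up predictions from three models per task for two reports.
hazards = majority_vote([
    ["biological", "allergens"],
    ["biological", "chemical"],
    ["fraud", "allergens"],
])
products = majority_vote([
    ["seafood", "dairy"],
    ["seafood", "dairy"],
    ["meat", "dairy"],
])

with tempfile.TemporaryDirectory() as tmp:
    csv_file = Path(tmp) / "submission.csv"
    zip_file = Path(tmp) / "submission.zip"
    write_submission(hazards, products, csv_file, zip_file)
    csv_rows = csv_file.read_text(encoding="utf-8").splitlines()
    with zipfile.ZipFile(zip_file) as archive:
        archived_names = archive.namelist()

print(hazards, products, archived_names)
```

Other combination rules, such as averaging class probabilities, would fit the same interface; voting is used here only because it needs nothing beyond the predicted labels.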
- Quintu/deberta-multitask-v0: Combined model for both hazard and product classification