This repository contains my solution for the GO DATA SCIENCE 4.0 Hackathon hosted on Zindi, where I achieved 42nd place out of 194 participants. The challenge focused on classifying mental health-related text discussions into predefined categories using Natural Language Processing (NLP) techniques.
- Rank: 42 out of 194 participants
- Validation Accuracy: 74.4%
- Public Leaderboard Score: 0.7686
- Private Leaderboard Score: 0.7528
| Rank | Team Name | Public Score | Private Score |
|---|---|---|---|
| 1 | Recursive Duo | 0.8189 | 0.7996 |
| ... | ... | ... | ... |
| 9 | one crew | 0.7786 | 0.7792 |
| 25 | Llama | 0.7610 | 0.7648 |
| 43 | SamehAissa (Me) | 0.7686 | 0.7528 |
| 44 | ... | ... | ... |
- Top Scores:
  - The winning team achieved a public score of 0.8189 and a private score of 0.7996.
- My Performance:
  - Achieved a public score of 0.7686 and a private score of 0.7528.
  - Ranked 42nd, placing in the top 22% of participants.
- Leaderboard Insights:
  - A small gap between public and private scores indicates robust models.
  - The competition was highly competitive, with close scores among top teams.
The goal was to develop a model that accurately classifies text entries (titles and content) from online discussions into categories representing mental health issues. Each entry in the dataset included:
- `id`: Unique identifier
- `title`: Discussion title
- `content`: Main body of the text
- `target`: Mental health category (only in training data)
| id | title | content | target |
|---|---|---|---|
| 101 | Feeling Hopeless and Lost | I've been struggling with depression for a while... | Depression |
| 102 | Panic Attacks Are Getting Worse | Lately, my panic attacks have been more frequent... | Anxiety |
The model's performance was evaluated using Private Accuracy as the primary metric.
- Data Preprocessing:
  - Combined `title` and `content` into a single text feature.
  - Handled missing values and cleaned text data.
  - Encoded target labels into numerical format (see the sketch below).
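A minimal preprocessing sketch, assuming the training data ships as a `Train.csv` with the columns listed above (the file name and exact cleaning steps are placeholders, not the competition pipeline verbatim):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.read_csv("Train.csv")  # hypothetical path

# Combine title and content into one text feature, treating missing values as empty strings.
train["text"] = (
    train["title"].fillna("") + " " + train["content"].fillna("")
).str.strip()

# Encode the target categories (e.g. "Depression", "Anxiety") as integers.
label_encoder = LabelEncoder()
train["label"] = label_encoder.fit_transform(train["target"])
```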
- Modeling:
  - Experimented with BERT and RoBERTa architectures.
  - Implemented class weighting to handle imbalanced data (see the sketch below).
  - Used Text Augmentation (EDA) to improve generalization.
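A sketch of the class-weighting idea using scikit-learn's `compute_class_weight`; the resulting tensor would typically be passed to the loss function during fine-tuning (variable names follow the preprocessing sketch above and are assumptions):

```python
import numpy as np
import torch
from sklearn.utils.class_weight import compute_class_weight

labels = train["label"].to_numpy()  # integer labels from the preprocessing step

# "balanced" gives each class a weight inversely proportional to its frequency.
class_weights = compute_class_weight(
    class_weight="balanced", classes=np.unique(labels), y=labels
)
class_weights = torch.tensor(class_weights, dtype=torch.float)
```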
- Training:
  - Fine-tuned transformer models using Hugging Face's `Trainer` API.
  - Applied Focal Loss to focus on hard-to-classify examples (see the sketch below).
  - Used Test-Time Augmentation (TTA) for robust predictions.
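A sketch of how Focal Loss can be plugged into Hugging Face's `Trainer` by overriding `compute_loss`; the `gamma` value and the optional class weights are assumptions, not necessarily the exact settings used here:

```python
import torch
import torch.nn.functional as F
from transformers import Trainer

class FocalLossTrainer(Trainer):
    def __init__(self, *args, gamma=2.0, class_weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.gamma = gamma
        self.class_weights = class_weights

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        # Per-example cross-entropy, optionally weighted by class frequency.
        ce = F.cross_entropy(
            logits, labels, weight=self.class_weights, reduction="none"
        )
        # Focal term down-weights easy examples (high p_t) so training
        # concentrates on hard-to-classify posts.
        pt = torch.exp(-ce)
        loss = ((1 - pt) ** self.gamma * ce).mean()
        return (loss, outputs) if return_outputs else loss
```

`FocalLossTrainer` is then used like the stock `Trainer`, passing the model, training arguments, and datasets as usual.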
- Evaluation:
  - Achieved ~76.8% accuracy on the public leaderboard.
  - Secured 42nd place on the final leaderboard.
- Deployment:
  - Build a Streamlit/Gradio app for real-time predictions.
  - Deploy the model using FastAPI or Flask (see the sketch below).
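A minimal FastAPI serving sketch wrapping a `text-classification` pipeline; the model path and request schema are placeholders, since deployment is still a planned step:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
classifier = pipeline("text-classification", model="./fine_tuned_model")  # hypothetical path

class Post(BaseModel):
    title: str
    content: str

@app.post("/predict")
def predict(post: Post):
    # Mirror the training-time preprocessing: concatenate title and content.
    text = f"{post.title} {post.content}".strip()
    result = classifier(text, truncation=True)[0]
    return {"category": result["label"], "score": result["score"]}
```

Run locally with `uvicorn app:app --reload` (assuming the file is named `app.py`).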
- Explainability:
  - Use SHAP or LIME to explain model predictions (see the sketch below).
  - Visualize attention weights for transformer models.
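A sketch of explaining predictions with SHAP's text explainer on top of a transformers pipeline, following the pattern from SHAP's documentation; the model path and sample text are placeholders:

```python
import shap
from transformers import pipeline

# top_k=None makes the pipeline return scores for every class, which SHAP expects.
classifier = pipeline("text-classification", model="./fine_tuned_model", top_k=None)
explainer = shap.Explainer(classifier)

sample = ["Lately, my panic attacks have been more frequent and harder to control."]
shap_values = explainer(sample)
shap.plots.text(shap_values)  # highlights the tokens driving the predicted category
```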
- Advanced Models:
  - Experiment with DeBERTa, GPT-based models, or ensemble methods (see the sketch below).
  - Use knowledge distillation to combine multiple models.
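A sketch of a simple probability-averaging ensemble across fine-tuned checkpoints; the checkpoint paths are placeholders and this is an illustration of the idea rather than part of the submitted pipeline:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoints = ["./bert_finetuned", "./roberta_finetuned"]  # hypothetical paths
text = "I've been struggling with depression for a while..."

probs = []
for ckpt in checkpoints:
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForSequenceClassification.from_pretrained(ckpt)
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs.append(torch.softmax(logits, dim=-1))

# Average the class probabilities from all models and take the most likely class.
ensemble_probs = torch.stack(probs).mean(dim=0)
predicted_class = ensemble_probs.argmax(dim=-1).item()
```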