This project explores how language and behavior on Twitter can reveal potential indicators of mental health struggles. We use a data-driven, responsible AI approach to detect linguistic and behavioral patterns associated with depression, without making medical claims or diagnoses.
⚠️ Note: This tool does not diagnose depression. It simply highlights digital markers statistically correlated with depressive behavior in social media data.
We focus on two major components of the MDDL dataset:
- Contains user metadata for 6,022 labeled Twitter accounts (both depressed and non-depressed).
- Includes: `followers_count`, `friends_count`, `verified`, `location`, `statuses_count`, etc.
- Up to 3,000 tweets per user, with timestamps and interaction metrics (likes, retweets, quotes, replies).
- Enables longitudinal analysis of user behavior rather than snapshot-based judgment.
- After preprocessing (e.g., keeping only English tweets for compatibility with BERTweet), the dataset includes 1,348,915 tweets.
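The English-only filtering step can be sketched as below. This is a minimal illustration, not the notebook's actual code: it assumes each tweet is a dict carrying a `lang` tag, and the helper name `keep_english` is hypothetical.

```python
from typing import Dict, Iterable, List


def keep_english(tweets: Iterable[Dict], lang_key: str = "lang") -> List[Dict]:
    """Return only the tweets tagged as English.

    Assumption: each tweet dict carries a language tag under `lang_key`.
    Tweets without a tag are dropped rather than guessed at.
    """
    return [t for t in tweets if t.get(lang_key) == "en"]


sample = [
    {"text": "feeling tired again today", "lang": "en"},
    {"text": "hoy fue un buen día", "lang": "es"},
    {"text": "no language tag here"},
]
english = keep_english(sample)
print(len(english), "of", len(sample), "tweets kept")
```

Dropping untagged tweets is a conservative choice for this sketch; a language-identification model could be used instead to recover them.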
Notebook: (Depression_Prediction_From_Tweeter_Data.ipynb)
- Used structured user-profile metadata (e.g., number of friends, followers, verification status).
- Tested standard classifiers (Logistic Regression, Random Forest, SVM, Gradient Boosting).
- Baseline Accuracy: 69% (profile-only features)
- Conclusion: Metadata alone provides limited insight.
Aggregated information over time to capture behavioral trends:
- Tweet frequency
- Retweet & favorite patterns
- Quote & reply behavior
- Day/Night posting distribution
- Interactions with other depressed users
- Achieved 95.18% accuracy with GBM
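To make the aggregation concrete, here is a rough sketch of how a few of the features above might be computed from one user's timeline. The field names (`created_at`, `is_retweet`, `is_reply`), the 22:00 to 06:00 night window, and the function name are all assumptions for illustration; the notebook's real feature definitions may differ.

```python
from datetime import datetime
from typing import Dict, List


def behavioral_features(tweets: List[dict]) -> Dict[str, float]:
    """Aggregate one user's timeline into a few summary features."""
    if not tweets:
        return {"tweets_per_day": 0.0, "night_ratio": 0.0,
                "retweet_ratio": 0.0, "reply_ratio": 0.0}

    times = [datetime.fromisoformat(t["created_at"]) for t in tweets]
    # Whole days covered by the timeline; at least 1 to avoid dividing by zero.
    span_days = max((max(times) - min(times)).days, 1)
    # Assumed night window: 22:00 up to (but not including) 06:00.
    night = sum(1 for ts in times if ts.hour >= 22 or ts.hour < 6)
    n = len(tweets)

    return {
        "tweets_per_day": n / span_days,
        "night_ratio": night / n,
        "retweet_ratio": sum(1 for t in tweets if t.get("is_retweet")) / n,
        "reply_ratio": sum(1 for t in tweets if t.get("is_reply")) / n,
    }


timeline = [
    {"created_at": "2021-03-01T23:15:00", "is_retweet": False, "is_reply": False},
    {"created_at": "2021-03-02T02:40:00", "is_retweet": True, "is_reply": False},
    {"created_at": "2021-03-03T14:05:00", "is_retweet": False, "is_reply": True},
    {"created_at": "2021-03-05T09:30:00", "is_retweet": False, "is_reply": False},
]
features = behavioral_features(timeline)
print(features)
```

One such feature row per user would then be fed to the classifier (GBM in the results reported above).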
Notebook: (Depression_Prediction_From_Language_Data.ipynb)
While user timelines offer rich information, full access is often limited by privacy concerns. Individual tweets, in contrast, are more readily accessible, either via public scraping or direct user input.
- Fine-tuned `vinai/bertweet-base` on individual user tweets.
- Validation Accuracy: 83.95%
- Test Accuracy: 83.42%
This enables a lighter version of depression detection that uses only text data (no profile scraping), making it easier to deploy and more privacy-conscious.
- Extracted leaf embeddings from the Gradient Boosting model (structured data).
- Aggregated tweets into samples (max 128 tokens) and extracted sentence embeddings from the fine-tuned BERTweet model.
- Concatenated both and passed to several MLP classifiers.
- Despite high feature richness, results were underwhelming due to:
  - High-dimensional feature space
  - Limited sample size (6,022 users)
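The step of aggregating tweets into samples of at most 128 tokens could look roughly like the greedy packing below. This is a sketch under stated assumptions: token counts come from a whitespace split as a stand-in for the BERTweet tokenizer (which produces subword tokens, so real counts would be higher), and `pack_tweets` is a hypothetical helper, not a function from the notebook.

```python
from typing import Callable, List


def pack_tweets(
    tweets: List[str],
    max_tokens: int = 128,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily concatenate consecutive tweets into samples within a token budget.

    A single tweet longer than `max_tokens` still becomes its own sample;
    truncating it is left to the tokenizer downstream.
    """
    samples: List[str] = []
    current: List[str] = []
    used = 0
    for tweet in tweets:
        n = min(count_tokens(tweet), max_tokens)
        # Start a new sample when this tweet would overflow the budget.
        if current and used + n > max_tokens:
            samples.append(" ".join(current))
            current, used = [], 0
        current.append(tweet)
        used += n
    if current:
        samples.append(" ".join(current))
    return samples


# Four synthetic "tweets" of 60, 60, 20 and 100 whitespace tokens.
tweets = [" ".join(["word"] * k) for k in (60, 60, 20, 100)]
samples = pack_tweets(tweets)
print([len(s.split()) for s in samples])
```

Passing the real tokenizer's length function as `count_tokens` would make the budget match what the model actually sees.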
- 🔍 Input: Paste one or more tweets
- 📢 Output: Depression prediction + confidence score
- 🖥️ Frameworks: Built using Python, hosted on Streamlit
- 🧠 Input a Twitter handle (with consent)
- 🛠️ Backend scrapes timeline, computes structured + behavioral features
- 💡 Generates a full, explained report: timeline trends, engagement stats, and depression-related patterns, each with an explanation
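For the tweet-based flow above (paste tweets, get a prediction plus a confidence score), one plausible way to derive the confidence is to apply a softmax to the classifier's two output logits and report the larger probability. The sketch below assumes exactly that; the label order and the function name are illustrative assumptions, and the deployed app may compute its score differently.

```python
import math
from typing import List, Tuple

# Assumed label order for the two-class output.
LABELS = ("non-depressed", "depressed")


def predict_with_confidence(logits: List[float]) -> Tuple[str, float]:
    """Turn raw two-class logits into (label, softmax probability of that label)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


label, confidence = predict_with_confidence([-1.0, 2.0])
print(f"{label} ({confidence:.1%})")
```

A softmax probability is a model score, not a calibrated likelihood, so it is best presented as a relative indicator.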
Model Type | Description | Accuracy |
---|---|---|
Profile-only Classifier | Structured metadata | 69% |
Behavioral Features | Aggregated timeline activity | 95.18% |
Tweet-based (BERTweet) | Raw tweet text only | 83.42% |
Multimodal Fusion | Structured + Text embeddings | ❌ Not viable (yet) |
We take a careful, non-diagnostic approach to mental health modeling. This tool should not be used for clinical purposes but can serve as:
- A research experiment in mental health signal detection
- A prototype to showcase responsible, explainable AI
- A starting point for digital well-being assessments

If this tool encourages even one person to reflect on their digital behavior, we've made progress.
- MDDL GitHub: MDDL
Interested in deploying, improving, or applying this tool in real life? Let’s collaborate on impactful, ethical AI. Reach out via LinkedIn.