This project explores the most streamed Spotify songs of 2024 using data from Kaggle. It combines data cleaning, exploratory data analysis (EDA), and classification modeling to uncover trends in music popularity, cross-platform engagement, and the characteristics of explicit content.
- Understand the popularity and distribution of artists and songs
- Analyze music release trends by year and month
- Explore cross-platform relationships between streaming metrics (Spotify, YouTube, TikTok, Shazam)
- Predict whether a track is explicit based on streaming and release patterns
- Reflect on statistical concepts (e.g. ROC, coefficient analysis) via interpretable models
- Source: Kaggle – Most Streamed Spotify Songs 2024
- Includes streams/views/likes across Spotify, YouTube, TikTok, Shazam, Pandora, and more
- Contains metadata like track name, artist, release date, and explicit flag
- Top artists and most streamed songs on Spotify and YouTube
- Month-wise release trends and artist streaming averages
- Cross-platform scatterplots (e.g. TikTok Views vs Spotify Streams)
- Distribution of explicit vs. non-explicit songs (via boxplots)
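
As a rough illustration of the month-wise release analysis, the sketch below counts releases per calendar month using only the standard library. The date format (`%Y-%m-%d`) and the sample dates are assumptions made for the example; the actual dataset's release date column may use a different format, and a pandas `groupby` would be the more natural choice on the full data.

```python
from collections import Counter
from datetime import datetime

def releases_per_month(release_dates, fmt="%Y-%m-%d"):
    """Count releases per calendar month (1-12), pooled across years."""
    months = Counter(datetime.strptime(d, fmt).month for d in release_dates)
    # Include months with no releases so the result always has 12 entries.
    return {month: months.get(month, 0) for month in range(1, 13)}

# Hypothetical release dates, not taken from the dataset.
dates = ["2024-01-05", "2023-01-20", "2024-05-10", "2022-12-02"]
counts = releases_per_month(dates)
print(counts)
```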
The target variable is explicit content (binary: 1 = explicit, 0 = not explicit). The dataset is imbalanced, with more non-explicit tracks than explicit ones. To avoid predictions biased toward the majority class, both models were trained using class_weight='balanced', which adjusts for this skew by weighting minority-class samples more heavily.
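
To make the weighting concrete, here is a minimal sketch of the formula scikit-learn documents for the 'balanced' mode: `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` and the sample labels are hypothetical and only serve the illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical labels: 6 non-explicit (0) and 2 explicit (1) tracks.
y = [0, 0, 0, 0, 0, 0, 1, 1]
weights = balanced_class_weights(y)
print(weights)  # the rarer class (1) receives the larger weight
```

With these labels the explicit class gets weight 2.0 and the non-explicit class about 0.67, so each class contributes equally to the training loss in aggregate.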
- Balanced Random Forest
  - Robust and interpretable with strong overall performance
  - Automatically balances class weights during training
  - Captures key predictors like Shazam Counts, YouTube Likes, and TikTok activity
- Balanced Logistic Regression
  - Interpretable model with strong recall for explicit tracks
  - Useful for coefficient-based insights and statistical diagnostics
  - Prioritizes identifying explicit content even at the cost of precision
Training with balanced class weights keeps the models from defaulting to the majority class and supports better detection of the minority (explicit) class.
| Model | Accuracy | Recall (Explicit) | AUC | Notes |
|---|---|---|---|---|
| Balanced Random Forest | 67% | 0.39 | 0.70 | Most balanced overall |
| Balanced Logistic Regression | 51% | 0.72 | 0.59 | High recall, low precision |
Both models outperform a majority-class baseline on explicit recall and AUC (such a baseline has recall 0 and AUC 0.5) and uncover meaningful patterns behind explicit labeling.
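
For reference, the sketch below shows how a majority-class baseline scores on accuracy and explicit recall. The labels are hypothetical (7 non-explicit, 3 explicit) and do not reflect the dataset's actual class ratio; the helper name `accuracy_and_recall` is likewise made up for this example.

```python
def accuracy_and_recall(y_true, y_pred, positive=1):
    """Return (accuracy, recall for the positive class) from two label lists."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(1 for t, p in pairs if t == p)
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    accuracy = correct / len(pairs)
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, recall

# Hypothetical labels, not the project's data: 7 non-explicit, 3 explicit.
y_true = [0] * 7 + [1] * 3
majority_pred = [0] * len(y_true)  # always predict the majority class
acc, rec = accuracy_and_recall(y_true, majority_pred)
print(f"baseline accuracy={acc:.2f}, explicit recall={rec:.2f}")
```

The baseline reaches an accuracy equal to the majority share while never finding an explicit track, which is why recall and AUC are more informative than accuracy alone on imbalanced data.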
To compare the models across thresholds, ROC curves were plotted:
- Random Forest: AUC ≈ 0.70, moderate separation between classes
- Logistic Regression: AUC ≈ 0.59, weaker separation but higher recall at the default threshold
- ROC helps visualize trade-offs between true positive rate (recall) and false positive rate, independent of a 0.5 threshold
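
The ROC quantities can be sketched without any library. AUC equals the probability that a randomly chosen explicit track receives a higher score than a randomly chosen non-explicit one, and each threshold yields one (false positive rate, true positive rate) point on the curve. The scores and labels below are hypothetical; in practice scikit-learn's `roc_curve` and `roc_auc_score` would be the usual tools.

```python
def roc_auc(y_true, scores):
    """AUC as P(score of a positive > score of a negative); ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def roc_point(y_true, scores, threshold):
    """Return (false positive rate, true positive rate) at a given threshold."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    return fp / n_neg, tp / n_pos

# Hypothetical predicted probabilities of "explicit" for seven tracks.
y_true = [0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.4, 0.35, 0.6, 0.7, 0.9]
print(roc_auc(y_true, scores))
print(roc_point(y_true, scores, 0.5))  # stricter threshold
print(roc_point(y_true, scores, 0.3))  # looser threshold: more recall, more false positives
```

Lowering the threshold from 0.5 to 0.3 raises recall and the false positive rate together, which is the trade-off the curve visualizes.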
- Python, Pandas, NumPy
- Seaborn, Matplotlib
- Scikit-learn (classification models, metrics, preprocessing)
- Virality matters: Tracks with more TikTok and YouTube activity tend to have more Spotify streams
- Explicit content thrives: Explicit songs are common among top-streamed tracks, defying the idea that they perform worse
- YouTube Likes and TikTok Likes were among the strongest predictors of explicit labeling, likely reflecting audience engagement
- Balanced models helped uncover these insights without being biased toward the majority class