YouTuber Income Prediction (R) — GDP-Contextualized Classification + Clustering

A research-style data analysis project exploring how national economic context (GDP) relates to YouTuber income levels, with classification (high vs. low income) and unsupervised clustering of channels. Built in R with modeling, visualization, and a small Shiny component for interactive exploration.

Authors: Yuanfu Cao, Yuxin Gu

Report date: 2023-10-21

Project artifact: project2.html (rendered R Markdown)

🔍 Problem & Approach

Goal. Predict whether a YouTuber belongs to a high or low earning group relative to the economic context of their country, and uncover natural channel segments with clustering.

Key idea. Rather than using a raw global income threshold, we:

Forecast each country’s 2023 GDP using 2019–2022 time series (ARIMA),
Normalize both income and GDP into z-scores, and
Label a channel as High Income if its income z-score exceeds its country’s GDP z-score (context-aware labeling).

📦 Data

YouTube Top Creators dataset: channel metrics (subscribers, views, uploads, monthly/yearly earnings, etc.).
World Bank GDP (per country, 2019–2022) → ARIMA forecast to 2023.
Additional socio-economic indicators (e.g., tertiary education enrollment, unemployment, urban population).

🧰 Methods

1) GDP Forecasting & Labeling

2019–2022 GDP per country → ARIMA → Predicted 2023 GDP.
Convert Average Yearly Income (mean of lowest & highest yearly estimates) and Predicted 2023 GDP to z-scores.
Label rule: High Income if Z_Income > Z_GDP; else Low Income.

2) Feature Selection

Two feature sets:
- Set A (AUC-driven): subscribers_for_last_30_days, video_views_for_the_last_30_days, Unemployment.rate, Urban_population, Population, Gross.tertiary.education.enrollment.
- Set B (LASSO-driven): subscribers, category, subscribers_for_last_30_days, created_year, Gross.tertiary.education.enrollment, video_views_for_the_last_30_days.

3) Models & Evaluation

Decision Tree and SVM classifiers trained on both feature sets.
Metrics: AUC, Accuracy, Precision, Recall, F1 on Train / Calibration / Test splits.
Reported highlight (Decision Tree): Test AUC ≈ 0.97, with strong accuracy/precision/recall/F1 across splits.

4) Clustering & Embedding

k-means (k=4) on channel + socio-economic features, profiled with cluster means.
Visualized structure via PCA and t-SNE embeddings.

📈 Results (Highlights)

Classification. Context-aware labels (income vs. GDP z-scores) are predictable from channel & socio-economic signals. A Decision Tree model reached Test AUC ≈ 0.97 and high Accuracy/Precision/Recall/F1, indicating good generalization.
Clustering. Four segments emerged: a powerhouse cluster (very high subs/views/earnings), a low-performing cluster, and two mid-tier groups with distinctive socio-economic patterns. Cluster sizes were [272, 45, 152, 358] (k=4).

📁 Repo Structure (suggested)

/ (repo root)
├─ project2.html          # Rendered R Markdown report (open in browser)
├─ data/                  # Raw & cleaned data (YouTube + GDP), if shareable
├─ R/                     # R scripts (wrangling, modeling, evaluation, plotting)
├─ shiny/                 # (Optional) Shiny app files
└─ README.md              # This file

Note for Shiny: the app first sources the analysis code (to load data into memory) before UI runs. Keep scripts importable from global.R or server.R.

▶️ Reproducibility (suggested steps)

Clone the repo and open in RStudio.
Install packages listed in the Rmd: tidyverse, forecast, glmnet, e1071, rpart, ROCR, etc.
Run the Rmd to generate project2.html, or open the provided HTML directly.
(Optional) Run Shiny app from the shiny/ folder once data-loading scripts are sourced.

✨ What I did (for Recruiters)

Co-designed the problem framing (GDP-contextualized labels) and end-to-end pipeline in R.
Implemented ARIMA forecasting (GDP → 2023), z-score normalization, and labeling logic.
Built and evaluated Decision Tree & SVM classifiers; reported AUC/Accuracy/Precision/Recall/F1 on proper splits.
Performed k-means clustering, and visualized structure with PCA and t-SNE.
Authored the analysis and documentation; delivered the HTML report and (optional) Shiny demo.

📹 Demo / Report

HTML report: see project2.html in this repo.

⚖️ Notes

The YouTube dataset includes top creators, so global medians are not representative; hence our GDP-relative labeling strategy.
Remove secrets/tokens if adding any data-fetching scripts.
If sharing data is restricted, keep only aggregated/derived tables necessary to reproduce figures.

📜 Citation

If you reference this work, please cite the repo and the HTML report (project2.html).

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
GDP.csv		GDP.csv
LICENSE		LICENSE
README.md		README.md
Shinny parts for project2.R		Shinny parts for project2.R
project2.Rmd		project2.Rmd
project2.html		project2.html
shiny_of_project2.Rmd		shiny_of_project2.Rmd
youtube_UTF_8.csv		youtube_UTF_8.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

YouTuber Income Prediction (R) — GDP-Contextualized Classification + Clustering

🔍 Problem & Approach

📦 Data

🧰 Methods

1) GDP Forecasting & Labeling

2) Feature Selection

3) Models & Evaluation

4) Clustering & Embedding

📈 Results (Highlights)

📁 Repo Structure (suggested)

▶️ Reproducibility (suggested steps)

✨ What I did (for Recruiters)

📹 Demo / Report

⚖️ Notes

📜 Citation

About

Uh oh!

Releases

Packages

Languages

License

SoleilGU/Youtube_earning_Analysis_2023

Folders and files

Latest commit

History

Repository files navigation

YouTuber Income Prediction (R) — GDP-Contextualized Classification + Clustering

🔍 Problem & Approach

📦 Data

🧰 Methods

1) GDP Forecasting & Labeling

2) Feature Selection

3) Models & Evaluation

4) Clustering & Embedding

📈 Results (Highlights)

📁 Repo Structure (suggested)

▶️ Reproducibility (suggested steps)

✨ What I did (for Recruiters)

📹 Demo / Report

⚖️ Notes

📜 Citation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages