This academic project explores the online interactions and shared experiences of users in the Reddit community focused on Graves’ disease — an autoimmune thyroid condition. Using Social Network Analysis (SNA) and Natural Language Processing (NLP) techniques, the study uncovers communication patterns, influential users, key discussion topics, and sentiment trends within the community.
🗂 Full project report: 📄 Text As Data.pdf
Category | Tools / Libraries |
---|---|
Data Collection | Reddit API, PRAW |
SNA & Graphs | NetworkX, iGraph, Gephi |
NLP & Text Mining | spaCy, NLTK, Scikit-learn, Gensim |
Visualization | Matplotlib, WordCloud, custom butterfly-shaped plots |
Topic Modeling | TF-IDF, LDA, Coherence Score, Perplexity |
Sentiment Analysis | VADER |
- Detect influential users and communication flows within the Reddit community.
- Identify central users and community structures using SNA metrics.
- Analyze emotional tone and shared concerns using NLP techniques.
- Surface common keywords and latent discussion topics.
- Visualize sentiment-driven word clouds using a custom butterfly (Thyroid) shape.
- Fetched the top 30 posts and all of their comments from `/r/gravesdisease` using the Reddit API.
- Saved the results as structured CSV files for posts, comments, nodes, and edges.
- Labeled users as posters or commenters, and constructed interaction graphs.
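
As a rough illustration of the labeling and graph-construction step, the sketch below derives node and edge tables from already-fetched records. The field names (`post_id`, `author`), the sample users, and the edge direction (commenter to post author) are assumptions made for this example, not details taken from the project code.

```python
import csv
import io

# Hypothetical records; the project itself pulled these via the Reddit API.
posts = [
    {"post_id": "p1", "author": "alice"},
    {"post_id": "p2", "author": "bob"},
]
comments = [
    {"comment_id": "c1", "post_id": "p1", "author": "bob"},
    {"comment_id": "c2", "post_id": "p1", "author": "carol"},
    {"comment_id": "c3", "post_id": "p2", "author": "carol"},
]


def build_nodes_and_edges(posts, comments):
    """Return (nodes, edges): user roles and weighted commenter -> poster edges."""
    post_author = {p["post_id"]: p["author"] for p in posts}
    roles = {author: "poster" for author in post_author.values()}
    weights = {}
    for c in comments:
        # A user who has posted keeps the "poster" label (assumed rule).
        roles.setdefault(c["author"], "commenter")
        target = post_author.get(c["post_id"])
        if target is None or target == c["author"]:
            continue  # skip orphaned comments and self-replies
        key = (c["author"], target)
        weights[key] = weights.get(key, 0) + 1
    nodes = [{"user": u, "role": r} for u, r in sorted(roles.items())]
    edges = [
        {"source": s, "target": t, "weight": w}
        for (s, t), w in sorted(weights.items())
    ]
    return nodes, edges


def to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


nodes, edges = build_nodes_and_edges(posts, comments)
print(to_csv(nodes, ["user", "role"]))
print(to_csv(edges, ["source", "target", "weight"]))
```

Aggregating repeated interactions into a `weight` column keeps the edge file compact; a per-comment edge list would work equally well if timestamps matter.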
- Built directed graphs using NetworkX & iGraph.
- Computed:
- Degree Centrality
- Betweenness Centrality
- Harmonic Closeness Centrality
- Detected cut vertices and critical bridges in communication.
- Identified strongly and weakly connected components.
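
The metrics listed above can be sketched with NetworkX on a toy graph. The users and edges here are invented, and treating the graph as undirected for cut vertices and bridges is an assumption of this example (NetworkX defines both only for undirected graphs).

```python
import networkx as nx

# Toy directed graph; an edge u -> v is assumed to mean "u replied to v".
G = nx.DiGraph()
G.add_edges_from([
    ("bob", "alice"), ("carol", "alice"), ("alice", "bob"),
    ("dave", "carol"), ("erin", "dave"),
])

degree = nx.degree_centrality(G)            # (in + out degree) / (n - 1)
betweenness = nx.betweenness_centrality(G)  # share of shortest paths through a node
harmonic = nx.harmonic_centrality(G)        # sum of 1/distance from other nodes to a node

# Cut vertices and bridges are computed on the undirected view of the graph.
U = G.to_undirected()
cut_vertices = set(nx.articulation_points(U))
bridges = set(nx.bridges(U))

strong = list(nx.strongly_connected_components(G))
weak = list(nx.weakly_connected_components(G))

print("cut vertices:", sorted(cut_vertices))
print("strong components:", len(strong), "weak components:", len(weak))
```

On a real interaction graph, sorting each centrality dictionary by value gives a quick ranking of candidate influential users.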
🖼 Initial Network visualization by Gephi
- Used the Girvan-Newman algorithm to detect communities.
- Visualized results with circular layouts and modularity tracking.
- Interpreted central vs. peripheral communities, inter-community ties, and influence hubs.
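
A minimal sketch of Girvan-Newman with modularity tracking, using NetworkX on a toy graph of two cliques joined by one edge. Choosing the partition with the highest modularity is one common selection rule and is an assumption here; the project may have chosen its cut differently.

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman, modularity

# Toy graph: two 4-cliques connected by a single bridging edge (3, 4).
G = nx.Graph()
G.add_edges_from((u, v) for u in range(4) for v in range(u + 1, 4))
G.add_edges_from((u, v) for u in range(4, 8) for v in range(u + 1, 8))
G.add_edge(3, 4)

best_q, best_partition = float("-inf"), None
history = []  # (number of communities, modularity) at each split

# girvan_newman yields successively finer partitions as edges are removed.
for partition in girvan_newman(G):
    q = modularity(G, partition)
    history.append((len(partition), q))
    if q > best_q:
        best_q, best_partition = q, partition

print("best modularity:", round(best_q, 3))
print("communities:", [sorted(c) for c in best_partition])
```

`nx.circular_layout(G)` returns node positions that can be passed to `nx.draw` for a circular view of the detected communities.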
🖼 Community Layout (Circular Structure)
- Emoji-to-text, lowercasing, regex cleaning, stopword removal, lemmatization.
- Used `spaCy`, `NLTK`, and regex for fine-grained cleaning.
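
The cleaning steps can be illustrated with a standard-library stand-in. The emoji map, stopword set, and lemma table below are tiny hypothetical placeholders; the project used spaCy and NLTK for these steps, which this sketch does not reproduce.

```python
import re

# Hypothetical, deliberately small lookup tables for illustration only.
EMOJI_TO_TEXT = {"😊": " smiling_face ", "😢": " crying_face "}
STOPWORDS = {"a", "an", "and", "i", "is", "my", "of", "the", "to", "was"}
LEMMAS = {"levels": "level", "doctors": "doctor", "feeling": "feel"}

URL_RE = re.compile(r"https?://\S+")
NON_WORD_RE = re.compile(r"[^a-z_\s]")  # keep letters, underscores, whitespace


def preprocess(text):
    """Emoji-to-text, lowercase, regex cleaning, stopword removal, lemma lookup."""
    for emoji, name in EMOJI_TO_TEXT.items():
        text = text.replace(emoji, name)
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text)
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return [LEMMAS.get(t, t) for t in tokens]


print(preprocess("My TSH levels dropped 😊 see https://example.com!"))
```

Converting emoji before lowercasing and punctuation stripping keeps their meaning available to the later sentiment step instead of discarding them as noise.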
- Used VADER to assign polarity scores to each post/comment.
- Classified each as `Positive`, `Negative`, or `Neutral`.
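
One way to turn VADER's compound score into the three labels is sketched below. The ±0.05 cutoffs are the thresholds conventionally suggested for VADER and are an assumption here, since the README does not state which cutoffs were used. The helper names are hypothetical.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


def label_text(text, analyzer):
    """`analyzer` is any object exposing polarity_scores(text) -> dict."""
    return label_from_compound(analyzer.polarity_scores(text)["compound"])


# With the vaderSentiment package installed, usage would look like:
#   from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
#   label_text("I finally feel great on my new dose!", SentimentIntensityAnalyzer())
```

Passing the analyzer in as an argument means it is constructed once and reused across all posts and comments.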
- Extracted the most frequent terms using `CountVectorizer`.
- Performed LDA with TF-IDF to uncover dominant themes.
- Selected optimal number of topics using Coherence Score and Perplexity.
- Designed a custom butterfly-shaped mask to generate three word clouds:
- 🟢 Right wing → Positive sentiment
- 🔴 Left wing → Negative sentiment
- 🔵 Body → Neutral sentiment
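
A sketch of how the three clouds could be produced: term frequencies are grouped by sentiment label, and each group is rendered inside its own mask image. The mask file names are hypothetical, and using one mask file per region of the butterfly is an assumption of this example.

```python
from collections import Counter


def frequencies_by_sentiment(labeled_docs):
    """labeled_docs: iterable of (label, tokens). Returns {label: Counter}."""
    freqs = {"Positive": Counter(), "Negative": Counter(), "Neutral": Counter()}
    for label, tokens in labeled_docs:
        freqs[label].update(tokens)
    return freqs


def render_cloud(freqs, mask_path, out_path):
    """Render one word cloud inside a mask image (pure white pixels stay empty)."""
    # Third-party imports live here so the counting code runs without them.
    import numpy as np
    from PIL import Image
    from wordcloud import WordCloud

    mask = np.array(Image.open(mask_path))
    cloud = WordCloud(mask=mask, background_color="white")
    cloud.generate_from_frequencies(freqs)
    cloud.to_file(out_path)


labeled = [
    ("Positive", ["remission", "better", "support"]),
    ("Negative", ["anxiety", "tired", "anxiety"]),
    ("Neutral", ["dose", "labs"]),
]
freqs = frequencies_by_sentiment(labeled)

# Hypothetical mask files, one per region of the butterfly:
# render_cloud(freqs["Positive"], "right_wing.png", "positive_cloud.png")
# render_cloud(freqs["Negative"], "left_wing.png", "negative_cloud.png")
# render_cloud(freqs["Neutral"], "body.png", "neutral_cloud.png")
```

The three rendered images can then be composited into a single butterfly figure in any image editor or with Pillow.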
This project demonstrates the combination of social graph theory, natural language processing, and visual storytelling to explore real-world online health communities.