Reddit Data Analysis: Analyzing a large dataset from Reddit to uncover insights about subreddits, topics, user engagement, and comment dynamics. Utilizes Python, NLTK, and MRJob for efficient data processing.

Reddit Data Analysis with Hadoop MRJob (31.6 GB Dataset)

This project analyzes a large Reddit dataset of user comments and discussions, approximately 31.6 GB in size. The analysis covers the most frequent subreddits, the most discussed topics per subreddit and per author, the rate of replies and controversiality, and the topics with the highest upvotes. The implementation is built on the MRJob framework and applies natural language processing techniques, including tokenization, lemmatization, and bigram generation. The results provide insights into patterns and trends within the Reddit community.
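The counting steps above follow the classic MapReduce pattern that MRJob implements. The sketch below shows the core logic for two of the analyses (most frequent subreddits, and the top bigram "topic" per subreddit) using only the standard library, so it runs without a Hadoop cluster; it assumes each input line is a JSON Reddit comment with `subreddit` and `body` fields, as in the public Reddit comment dumps, and a regex tokenizer stands in for NLTK's tokenization and lemmatization:

```python
import json
import re
from collections import Counter, defaultdict

# Hypothetical stand-in for NLTK tokenization: lowercase alphabetic tokens.
TOKEN_RE = re.compile(r"[a-z']+")

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

def bigrams(tokens):
    # Adjacent token pairs, equivalent to nltk.bigrams.
    return zip(tokens, tokens[1:])

def analyze(lines):
    """Map/reduce sketch: comment counts per subreddit and bigram counts per subreddit."""
    subreddit_counts = Counter()
    topic_counts = defaultdict(Counter)
    for line in lines:
        comment = json.loads(line)
        sub = comment["subreddit"]          # field name assumed from Reddit dumps
        tokens = tokenize(comment["body"])  # field name assumed from Reddit dumps
        subreddit_counts[sub] += 1          # "map" emits (subreddit, 1); "reduce" sums
        topic_counts[sub].update(bigrams(tokens))
    # Most common bigram per subreddit, a rough proxy for its top topic.
    top_topic = {s: c.most_common(1)[0][0] for s, c in topic_counts.items() if c}
    return subreddit_counts, top_topic

sample = [
    json.dumps({"subreddit": "python", "body": "machine learning is fun"}),
    json.dumps({"subreddit": "python", "body": "machine learning rocks"}),
    json.dumps({"subreddit": "askreddit", "body": "what is your favorite book"}),
]
counts, topics = analyze(sample)
print(counts["python"])   # 2
print(topics["python"])   # ('machine', 'learning')
```

In the actual MRJob version, the per-line body of `analyze` becomes a `mapper` yielding key/value pairs and the `Counter` aggregation becomes a `reducer`, which is what lets the same logic scale to the full 31.6 GB dataset across Hadoop nodes.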
