This project analyzes a large (~31.6 GB) Reddit dataset of user comments and discussions. The analysis covers several questions: the most frequent subreddits, the most discussed topics per subreddit and per author, the rate of replies and of controversial comments, and the topics that attract the highest upvotes. The implementation is built on the MRJob framework and applies natural language processing techniques, including tokenization, lemmatization, and bigram generation. The results offer insights into patterns and trends within the Reddit community.
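To illustrate the MRJob-style processing described above, here is a minimal plain-Python sketch of the map/reduce logic for counting comments per subreddit. The helper names (`map_subreddit`, `reduce_counts`) and the sample lines are hypothetical; the only assumption about the data is Reddit's dump format of one JSON comment per line with a `subreddit` field, which matches the public Reddit comment archives.

```python
import json
from collections import Counter
from itertools import chain

def map_subreddit(line):
    """Map phase: emit (subreddit, 1) for each valid comment line."""
    try:
        comment = json.loads(line)
    except json.JSONDecodeError:
        return  # skip malformed lines, as a real job would
    if "subreddit" in comment:
        yield comment["subreddit"], 1

def reduce_counts(pairs):
    """Reduce phase: sum the emitted counts per subreddit key."""
    totals = Counter()
    for key, n in pairs:
        totals[key] += n
    return totals

# Hypothetical sample of the one-JSON-object-per-line input format.
sample = [
    '{"subreddit": "askscience", "body": "Why is the sky blue?"}',
    '{"subreddit": "python", "body": "A question about MRJob"}',
    '{"subreddit": "askscience", "body": "Follow-up question"}',
]
counts = reduce_counts(chain.from_iterable(map_subreddit(l) for l in sample))
# counts.most_common(1) → [("askscience", 2)]
```

In the actual project these two phases would be the `mapper` and `reducer` methods of an `mrjob.job.MRJob` subclass, so Hadoop can shard the 31.6 GB input across workers; the pure-Python version above only shows the per-record logic.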
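The text-processing side (tokenization and bigram generation) can be sketched as follows. This is a simplified stand-in, not the project's code: the real pipeline uses NLTK (e.g. `nltk.word_tokenize`, `nltk.bigrams`, and a lemmatizer such as `WordNetLemmatizer`), while the regex tokenizer below just keeps the example self-contained.

```python
import re

def tokenize(text):
    """Lowercase the text and keep alphabetic runs only.
    (A minimal stand-in for NLTK's word_tokenize; the real
    pipeline also lemmatizes each token with NLTK.)"""
    return re.findall(r"[a-z]+", text.lower())

def bigrams(tokens):
    """Pair each token with its successor, as nltk.bigrams does."""
    return list(zip(tokens, tokens[1:]))

tokens = tokenize("The sky is blue, the sea is blue.")
# tokens → ['the', 'sky', 'is', 'blue', 'the', 'sea', 'is', 'blue']
pairs = bigrams(tokens)
# pairs[0] → ('the', 'sky')
```

Counting these bigrams per subreddit or per author is what backs the "most discussed topics" results mentioned above.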
Reddit Data Analysis: Analyzing a large dataset from Reddit to uncover insights about subreddits, topics, user engagement, and comment dynamics. Utilizes Python, NLTK, and MRJob for efficient data processing.
kareemfarahat/Reddit-Data-Analysis-with-Hadoop-MRJob-Large-Dataset-31.6GB-