dogukangundemir/Sentiment-Analysis-with-Big-Data

Reddit Sentiment Analysis Using PySpark with Big Data

This project builds a sentiment analysis system that classifies posts from the AITA ("Am I the Asshole") subreddit into three categories: YTA, NTA, and Neutral. MongoDB stores the data and Apache Spark handles scalable processing, with a PySpark data pipeline built to process large datasets efficiently. The main steps are text preprocessing (tokenization, stop word removal, n-grams, and TF-IDF) and training classifiers with PySpark's MLlib, which the project reports reach high accuracy on the validation and test datasets.

The system includes a sentiment prediction function for testing the trained model on new inputs. The project also analyzes the most common words and phrases in the posts and visualizes the results with WordCloud and Plotly. Together, these pieces show how big data tools and natural language processing can be applied to social media text to understand user sentiment.
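
The common word and phrase analysis can be illustrated with a small plain-Python sketch. The posts, the stop word list, and the helper name `tokenize` are hypothetical, and the project itself may compute these counts in Spark rather than in the driver process.

```python
import re
from collections import Counter

# Hypothetical toy posts; not taken from the project's dataset.
posts = [
    "my roommate ate my food again",
    "my roommate refused to pay rent",
    "i ate my sister's leftover food",
]

# Small illustrative stop word list (an assumption, not the project's list).
STOP_WORDS = {"my", "i", "to", "the", "a", "again"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


word_counts = Counter()
bigram_counts = Counter()
for post in posts:
    tokens = tokenize(post)
    word_counts.update(tokens)
    # Pair each token with its successor to count two-word phrases.
    bigram_counts.update(zip(tokens, tokens[1:]))

print(word_counts.most_common(3))
print(bigram_counts.most_common(3))
```

A frequency mapping like `word_counts` is the kind of input that the `wordcloud` package's `WordCloud.generate_from_frequencies` accepts, so counts of this shape could drive the word cloud visualization.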

About

A sentiment analysis system with PySpark and MongoDB to classify Reddit posts.

Contributors 2
