Reddit Sentiment Analysis Using PySpark with Big Data

This project builds a sentiment analysis system that classifies posts from the AITA subreddit into three categories: YTA, NTA, and Neutral. MongoDB stores the data and Apache Spark handles scalable processing, with a PySpark data pipeline built to process large datasets efficiently. Key steps include text preprocessing (tokenization, stop word removal, n-grams, and TF-IDF) and training classifiers with PySpark's MLlib that achieve high accuracy on the validation and test datasets.
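The preprocessing and training steps named above could be chained in a single MLlib `Pipeline`. The following is a minimal sketch, not the notebook's actual code: the column names `text` and `label`, the choice of bigrams, the `num_features` default, and the use of `LogisticRegression` as the classifier are all assumptions made for illustration.

```python
def build_text_pipeline(text_col="text", label_col="label", num_features=1 << 16):
    """Return an unfitted MLlib Pipeline: tokens -> stop words -> n-grams -> TF-IDF -> classifier.

    Column names, bigram size, and the classifier are illustrative assumptions.
    PySpark is imported lazily, so an active SparkSession is only needed when
    this function is actually called.
    """
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import IDF, HashingTF, NGram, StopWordsRemover, Tokenizer

    # Lowercase the text and split it on whitespace.
    tokenizer = Tokenizer(inputCol=text_col, outputCol="tokens")
    # Drop common English stop words (MLlib's default list).
    remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
    # Build bigrams from the remaining tokens (n=2 is an assumed setting).
    ngrams = NGram(n=2, inputCol="filtered", outputCol="ngrams")
    # Term frequencies via feature hashing, then inverse-document-frequency weighting.
    tf = HashingTF(inputCol="ngrams", outputCol="tf", numFeatures=num_features)
    idf = IDF(inputCol="tf", outputCol="features")
    # Multiclass classifier over the TF-IDF vectors (assumed model choice).
    clf = LogisticRegression(featuresCol="features", labelCol=label_col, maxIter=20)

    return Pipeline(stages=[tokenizer, remover, ngrams, tf, idf, clf])
```

With an active `SparkSession` and a DataFrame `train_df` that has a string `text` column and a numeric `label` column, `build_text_pipeline().fit(train_df)` would return a fitted `PipelineModel` whose `transform` method adds a `prediction` column.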

The system includes a sentiment prediction function for testing the model on new inputs. Common words and phrases are also analyzed, and WordCloud and Plotly visualizations support interpretation of the data. The project shows how big data technologies and natural language processing can be applied to social media analysis to understand user sentiment.
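A prediction helper of the kind described above might look like the sketch below. It assumes a fitted pipeline model that reads a `text` column and writes a numeric `prediction` column; the function name `predict_sentiment` and the index-to-label mapping in `LABELS` are hypothetical, since the real mapping depends on how the labels were encoded during training.

```python
# Hypothetical mapping from predicted class index to label; the actual order
# depends on how the training labels were indexed.
LABELS = {0.0: "NTA", 1.0: "YTA", 2.0: "Neutral"}


def predict_sentiment(spark, model, text, text_col="text"):
    """Classify one piece of text with a fitted pipeline model.

    `spark` is a SparkSession and `model` is a fitted PipelineModel whose
    first stage reads `text_col` (assumed column name).
    """
    # Wrap the input in a one-row DataFrame so the pipeline can transform it.
    df = spark.createDataFrame([(text,)], [text_col])
    row = model.transform(df).select("prediction").first()
    # Translate the numeric class index into a readable label.
    return LABELS.get(float(row["prediction"]), "Unknown")
```

Calling `predict_sentiment(spark, model, "some new post text")` would return one of the three label strings, or `"Unknown"` if the model emits an index outside the assumed mapping.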

Folders and files

NTA_Dataset.jsonl
Neutral_Dataset.jsonl
README.md
Sentiment_Analysis.ipynb
YTA_Dataset.jsonl

dogukangundemir/Sentiment-Analysis-with-Big-Data
