This project develops a sentiment analysis system that classifies posts from the r/AmItheAsshole (AITA) subreddit into three categories: YTA ("You're the Asshole"), NTA ("Not the Asshole"), and Neutral. MongoDB stores the data and Apache Spark provides scalable processing, with a PySpark data pipeline that handles large datasets efficiently. Key steps include text preprocessing (tokenization, stop word removal, n-gram extraction, and TF-IDF weighting) and training classifiers with PySpark's MLlib to achieve high accuracy on validation and test datasets.
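The preprocessing steps can be illustrated with a minimal sketch in plain Python. It is not the project's PySpark code: the stop word list, function names, and sample posts are assumptions made for illustration, and the standard library is used so the example runs without a Spark session. In PySpark the corresponding stages would typically be `Tokenizer`, `StopWordsRemover`, `NGram`, `HashingTF`, and `IDF` from `pyspark.ml.feature`.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop word list (assumption; the project's actual list is not given).
STOP_WORDS = {"i", "my", "the", "a", "to", "and", "was", "for", "me", "at"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n=2):
    """Join every run of n consecutive tokens into one space-separated term."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def featurize(text, n=2):
    """Return the unigrams plus n-grams of a post after stop word removal."""
    tokens = remove_stop_words(tokenize(text))
    return tokens + ngrams(tokens, n)


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    TF is the raw term count; IDF is the smoothed form
    log((N + 1) / (df + 1)), where N is the number of documents
    and df is the number of documents containing the term.
    """
    term_lists = [featurize(d) for d in docs]
    n_docs = len(term_lists)
    doc_freq = Counter(term for terms in term_lists for term in set(terms))
    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        vectors.append({
            term: count * math.log((n_docs + 1) / (doc_freq[term] + 1))
            for term, count in counts.items()
        })
    return vectors
```

With this weighting, a term that occurs in every document gets a weight of zero, while terms and phrases specific to one post are weighted up, which is what makes the features useful to a downstream classifier.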
The system includes a sentiment prediction function that applies the trained model to new inputs, showing how it can be used in practice. In addition, the most common words and phrases are analyzed, and visualization tools such as WordCloud and Plotly are used to aid interpretation of the data. The project demonstrates how big data technologies and natural language processing can be applied to social media analysis to understand user sentiment.
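The common word and phrase analysis can be sketched as a frequency count over n-grams. This is an illustrative stand-in written in plain Python, not the project's code: `top_terms`, the stop word list, and the sample posts are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"i", "my", "the", "a", "to", "and"}


def top_terms(posts, n=1, k=3):
    """Return the k most frequent n-grams across posts after stop word removal."""
    counts = Counter()
    for post in posts:
        tokens = [
            t for t in re.findall(r"[a-z']+", post.lower())
            if t not in STOP_WORDS
        ]
        # Count each run of n consecutive tokens as one term.
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts.most_common(k)


# Made-up sample posts, not taken from the project's dataset.
posts = ["My roommate ate my food", "The roommate ate the cake"]
common_words = top_terms(posts, n=1, k=2)
common_phrases = top_terms(posts, n=2, k=1)
```

A list of `(term, count)` pairs like this converts directly to a dictionary with `dict(...)`, which is the kind of input a word cloud or a bar chart of term frequencies expects.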