This machine learning project applies topic modelling to Twitter data using LDA and BERTopic. It covers data cleaning, EDA, model building, evaluation, and visualization to discover latent themes in political tweets.
This repository contains a topic modelling project completed as part of the ICT606 Machine Learning unit at Murdoch University. The project applies Natural Language Processing (NLP) techniques to analyze political Twitter data using Latent Dirichlet Allocation (LDA) and BERTopic models.
Twitter is a rich source of real-time public opinion. In this project, we explore how topic modelling can uncover hidden themes in tweets. Two main models were implemented:
- LDA (Latent Dirichlet Allocation) from scikit-learn
- BERTopic using transformer-based embeddings
The project involves:
- Cleaning and preprocessing 5,000 sampled tweets
- Performing Exploratory Data Analysis (EDA)
- Applying and evaluating topic models
- Visualizing and interpreting the discovered topics
The dataset was sourced from Kaggle:
🔗 Twitter Sentiment Dataset
- `cleaned_text`: The tweet text
- `category`: Sentiment label (-1 = negative, 0 = neutral, 1 = positive)
The preprocessing pipeline applies the following steps:
- Lowercasing
- Removing URLs, mentions, hashtags
- Removing punctuation and numbers
- Tokenization
- Stopword removal (NLTK)
- Lemmatization (WordNetLemmatizer)
- Filtering short words (<3 chars)
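The steps above can be sketched as a single function. This is a minimal illustration, not the project's actual code: the name `clean_tweet`, the regular expressions, whitespace tokenization, and dropping hashtags entirely (rather than only the `#` sign) are all assumptions. The stopword set and lemmatizer are passed in so the function can be tried without downloading NLTK data; `nltk_resources` shows one way to obtain them from NLTK.

```python
import re

# Assumed patterns; the project may have used different expressions.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_HASHTAG_RE = re.compile(r"[@#]\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text, stop_words, lemmatize, min_len=3):
    """Return the cleaned tokens of one tweet."""
    text = text.lower()                          # lowercasing
    text = URL_RE.sub(" ", text)                 # remove URLs
    text = MENTION_HASHTAG_RE.sub(" ", text)     # remove mentions and hashtags
    text = NON_LETTER_RE.sub(" ", text)          # remove punctuation and numbers
    tokens = text.split()                        # whitespace tokenization (assumption)
    tokens = [t for t in tokens if t not in stop_words]   # stopword removal
    tokens = [lemmatize(t) for t in tokens]               # lemmatization
    return [t for t in tokens if len(t) >= min_len]       # drop words under 3 chars


def nltk_resources():
    """Load NLTK stopwords and a WordNet lemmatizer (downloads data if missing)."""
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    nltk.download("stopwords", quiet=True)
    nltk.download("wordnet", quiet=True)
    return set(stopwords.words("english")), WordNetLemmatizer().lemmatize
```

With NLTK installed, `stop_words, lemmatize = nltk_resources()` followed by `clean_tweet(tweet, stop_words, lemmatize)` runs the full pipeline on one tweet.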
Exploratory data analysis produced:
- Sentiment distribution plot
- Word clouds for each sentiment category
- Top 20 TF-IDF words
- Word2Vec embeddings visualized using t-SNE
The LDA model:
- Used `CountVectorizer` with parameters `max_df=0.95`, `min_df=10`, `stop_words='english'`
- Extracted 10 topics
- Visualized topic-word heatmaps
- Assigned dominant topic to each tweet
The BERTopic model:
- Uses BERT embeddings, UMAP for dimensionality reduction, and HDBSCAN for clustering
- Achieved a higher topic diversity score (0.86 vs 0.69 for LDA)
- Visualized intertopic distance map and topic bar charts
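A BERTopic setup along these lines might look like the sketch below. Everything specific here is an assumption: the embedding model name `all-MiniLM-L6-v2`, the UMAP and HDBSCAN hyperparameters, and the helper name `fit_bertopic` are illustrative choices, not the settings used in the project. The imports sit inside the function so the module loads even when the optional packages are not installed.

```python
def fit_bertopic(docs, min_cluster_size=15, random_state=42):
    """Fit a BERTopic model with explicit UMAP and HDBSCAN components."""
    # Local imports: bertopic, umap-learn, and hdbscan are optional dependencies.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from umap import UMAP

    # Reduce the embeddings to a few dimensions before clustering (assumed settings).
    umap_model = UMAP(
        n_neighbors=15,
        n_components=5,
        min_dist=0.0,
        metric="cosine",
        random_state=random_state,
    )
    # Density-based clustering; documents that fit no cluster get topic -1.
    hdbscan_model = HDBSCAN(
        min_cluster_size=min_cluster_size,
        metric="euclidean",
        prediction_data=True,
    )
    topic_model = BERTopic(
        embedding_model="all-MiniLM-L6-v2",   # assumed sentence-transformers model
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
    )
    topics, _ = topic_model.fit_transform(docs)
    return topic_model, topics
```

After fitting, `topic_model.visualize_topics()` renders an intertopic distance map and `topic_model.visualize_barchart()` renders per-topic bar charts.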
| Metric | LDA | BERTopic |
|---|---|---|
| Coherence Score | 0.40 | 0.38 |
| Topic Diversity | 0.69 | 0.86 |
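For readers unfamiliar with the topic diversity metric, one common definition is the fraction of unique words among the top-k words of all topics. Whether the scores above were computed exactly this way, and with which k, is not stated here, so the function below is an assumed formulation rather than the project's evaluation code.

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of every topic.

    `topics` is a list of word lists, each ordered from most to least important.
    A score of 1.0 means no word is shared between topics.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)
```

For example, two topics with top words `["modi", "bjp", "vote"]` and `["modi", "congress", "vote"]` share two of their six words, giving a diversity of 4/6.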
Key findings:
- LDA had slightly higher coherence (0.40 vs 0.38)
- BERTopic showed greater topic diversity (0.86 vs 0.69) and semantic separation
- Topics revealed strong focus on Indian politics (Modi, BJP, Congress)
- BERTopic captured more distinct themes like geopolitical issues and national pride
- Word embeddings improved semantic clustering in BERTopic
Tools and libraries:
- Python
- Scikit-learn
- BERTopic
- Pandas, NumPy
- NLTK, WordNet
- Matplotlib, Seaborn