Topic-Modeling-on-Twitter-Data

The goal of this machine learning project is to perform topic modelling on Twitter data using LDA and BERTopic. It covers data cleaning, EDA, model building, evaluation, and visualization to discover latent themes in political tweets.

NLP Topic Modelling on Twitter Data

This repository contains a topic modelling project completed as part of the ICT606 Machine Learning unit at Murdoch University. The project applies Natural Language Processing (NLP) techniques to analyze political Twitter data using Latent Dirichlet Allocation (LDA) and BERTopic models.

📌 Project Overview

Twitter is a rich source of real-time public opinion. In this project, we explore how topic modelling can uncover hidden themes in tweets. Two main models were implemented:

LDA (Latent Dirichlet Allocation) from scikit-learn
BERTopic using transformer-based embeddings

The project involves:

Cleaning and preprocessing 5,000 sampled tweets
Performing Exploratory Data Analysis (EDA)
Applying and evaluating topic models
Visualizing and interpreting the discovered topics

Dataset

The dataset was sourced from Kaggle:
🔗 Twitter Sentiment Dataset

cleaned_text: The tweet text
category: Sentiment label (-1 = negative, 0 = neutral, 1 = positive)
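
A minimal sketch of how the 5,000-tweet sample could be drawn with pandas. The function name `sample_tweets`, the fixed seed, and the choice to drop rows with missing values are assumptions made for illustration, not details confirmed by the project.

```python
import pandas as pd


def sample_tweets(df: pd.DataFrame, n: int = 5000, seed: int = 42) -> pd.DataFrame:
    """Drop incomplete rows and return a reproducible random sample of at most n tweets."""
    # Column names are taken from the dataset description above.
    complete = df.dropna(subset=["cleaned_text", "category"])
    # Never request more rows than are available.
    n = min(n, len(complete))
    return complete.sample(n=n, random_state=seed).reset_index(drop=True)
```

With the repository's CSV this would be called as `sample_tweets(pd.read_csv("Twitter_Data.csv"))`, assuming the file exposes the two columns listed above.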

Preprocessing Steps

Lowercasing
Removing URLs, mentions, hashtags
Removing punctuation and numbers
Tokenization
Stopword removal (NLTK)
Lemmatization (WordNetLemmatizer)
Filtering short words (<3 chars)
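
The steps above can be sketched as a single function. This is an illustration under stated assumptions rather than the project's actual code: the name `preprocess`, the regular expressions, whitespace tokenization, and removing a hashtag together with its word are all choices made here. The stopword list and lemmatizer are passed in as arguments so the sketch runs without downloading NLTK corpora.

```python
import re
from typing import Callable, Iterable, List

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_HASHTAG_RE = re.compile(r"[@#]\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def preprocess(
    text: str,
    stopwords: Iterable[str] = (),
    lemmatize: Callable[[str], str] = lambda w: w,
    min_len: int = 3,
) -> List[str]:
    """Apply the listed cleaning steps in order and return the surviving tokens."""
    stop = set(stopwords)
    text = text.lower()                        # lowercasing
    text = URL_RE.sub(" ", text)               # remove URLs
    text = MENTION_HASHTAG_RE.sub(" ", text)   # remove mentions and hashtags
    text = NON_LETTER_RE.sub(" ", text)        # remove punctuation and numbers
    tokens = text.split()                      # whitespace tokenization (assumption)
    tokens = [t for t in tokens if t not in stop]     # stopword removal
    tokens = [lemmatize(t) for t in tokens]           # lemmatization
    return [t for t in tokens if len(t) >= min_len]   # drop words under min_len chars
```

To match the steps listed, one would pass NLTK's `stopwords.words("english")` as `stopwords` and `WordNetLemmatizer().lemmatize` as `lemmatize`, after downloading the `stopwords` and `wordnet` corpora.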

Exploratory Data Analysis (EDA)

Sentiment distribution plot
Word clouds for each sentiment category
Top 20 TF-IDF words
Word2Vec embeddings visualized using t-SNE

Topic Modelling

LDA (Latent Dirichlet Allocation)

Used CountVectorizer with parameters:
max_df=0.95, min_df=10, stop_words='english'
Extracted 10 topics
Visualized topic-word heatmaps
Assigned dominant topic to each tweet

BERTopic

Uses BERT embeddings + UMAP + HDBSCAN
Higher topic diversity score (0.86 vs 0.69 for LDA)
Visualized intertopic distance map and topic bar charts
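
The embedding + UMAP + HDBSCAN pipeline can be made explicit when constructing BERTopic. The sketch below is an assumption-heavy illustration: the embedding model name `all-MiniLM-L6-v2`, every UMAP and HDBSCAN parameter value, and the helper names `fit_bertopic` and `outlier_share` are chosen here and are not stated in the project.

```python
def fit_bertopic(docs, min_cluster_size=10, seed=42):
    """Fit BERTopic with explicit embedding, UMAP and HDBSCAN components."""
    # Imported lazily: these packages are heavy and download model weights on first use.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from umap import UMAP

    embedding_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                      metric="cosine", random_state=seed)       # assumed settings
    hdbscan_model = HDBSCAN(min_cluster_size=min_cluster_size, metric="euclidean",
                            cluster_selection_method="eom", prediction_data=True)
    topic_model = BERTopic(embedding_model=embedding_model,
                           umap_model=umap_model,
                           hdbscan_model=hdbscan_model)
    topics, _ = topic_model.fit_transform(docs)  # one topic id per document
    return topic_model, topics


def outlier_share(topics):
    """Fraction of documents left unclustered (BERTopic labels these as topic -1)."""
    n = len(topics)
    return sum(1 for t in topics if t == -1) / n if n else 0.0
```

On the fitted model, `topic_model.visualize_topics()` and `topic_model.visualize_barchart()` produce the intertopic distance map and topic bar charts mentioned above.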

📈 Evaluation Metrics

Metric	LDA	BERTopic
Coherence Score	0.40	0.38
Topic Diversity	0.69	0.86

LDA had slightly higher coherence (0.40 vs 0.38)
BERTopic showed greater topic diversity and semantic separation
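
The README does not state how topic diversity was computed. A common definition, assumed here, is the share of unique words among the top-k words of all topics, so 1.0 means no word is shared between topics. The function name `topic_diversity` and the default `top_k=10` are illustrative.

```python
def topic_diversity(topics_top_words, top_k=10):
    """Share of unique words among the top_k words of every topic (0.0 to 1.0)."""
    # Flatten the top_k words of each topic into one list.
    words = [w for topic in topics_top_words for w in topic[:top_k]]
    if not words:
        return 0.0
    return len(set(words)) / len(words)
```

For example, two topics with three top words each that share one word contain five unique words out of six, giving a diversity of 5/6.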

📌 Key Insights

Topics revealed a strong focus on Indian politics (Modi, BJP, Congress)
BERTopic captured more distinct themes like geopolitical issues and national pride
Word embeddings improved semantic clustering in BERTopic

🛠️ Tech Stack

Python
Scikit-learn
BERTopic
Pandas, Numpy
NLTK, WordNet
Matplotlib, Seaborn
