Sub-event Detection in Twitter Streams - NLP Kaggle Competition

CSC_51054_EP Machine and Deep Learning, Fall 2024

Team LLY Members: Ziyi LIU, Ling LIU, Yixing YANG

This repository contains the code for the Kaggle competition: Sub-event Detection in Twitter streams, a binary classification task.

Our team, "LLY", achieved 1st place on the private leaderboard and 12th place on the public leaderboard.

Project Goal

The goal of this project is to detect sub-events within Twitter streams related to specific main events. This involves processing tweets to identify and classify event-related information.

File Structure

.
├── data/                     # Contains raw and processed datasets (ignored by .gitignore)
│   └── challenge_data/
├── src/
│   ├── data.py               # Advanced data preprocessing and embedding generation
│   └── model.py              # RNN model implementation
├── challenge_data.py         # Initial script for data loading and basic preprocessing
├── cnn.ipynb                 # Jupyter notebook for CNN model experimentation
├── lstm.ipynb                # Jupyter notebook for LSTM model experimentation
├── rnn.ipynb                 # Jupyter notebook for RNN model experimentation
├── machine_learnings.ipynb   # Jupyter notebook for various traditional ML models
├── Logistic.ipynb            # Jupyter notebook for Logistic Regression model
├── voting.ipynb              # Jupyter notebook for combining models using a voting classifier
├── requirements.txt          # Python dependencies
├── DataChallengeReport.pdf   # Project report
├── INF554-Challenge-2024.pdf # Competition details
└── README.md                 # This file

Core Logic

Data Preprocessing

The primary data preprocessing is handled by src/data.py, with an initial version in challenge_data.py. Key steps include:

Text Cleaning: Lowercasing, removing URLs, user mentions (while keeping hashtags), punctuation, and numbers.
Language Handling: Translation of non-English tweets to English and emoji conversion to text.
Normalization: Unicode normalization and contraction expansion.
Tokenization, Stopword Removal, and Lemmatization: Standard NLP techniques to prepare text for modeling.
Parallel Processing: Used to speed up preprocessing of large datasets.
Embedding Generation: src/data.py uses pre-trained GloVe embeddings (glove-twitter-200) to convert tweets into numerical vectors.
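
The cleaning and embedding steps above can be sketched roughly as follows. This is an illustration, not the code in src/data.py: the function names `clean_tweet` and `embed_tweet` are hypothetical, stripping the `#` symbol while keeping the hashtag word is an assumption, and averaging word vectors into one tweet vector is an assumed pooling strategy that the README does not confirm. Translation, emoji conversion, contraction expansion, stopword removal, and lemmatization are left out to keep the sketch dependency-free.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
DIGIT_RE = re.compile(r"\d+")
# Deleting ASCII punctuation also drops '#', so only the hashtag word
# survives (assumption about what "keeping hashtags" means).
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and strip URLs, mentions, punctuation, and digits."""
    text = text.lower()
    text = URL_RE.sub(" ", text)       # remove URLs before punctuation is stripped
    text = MENTION_RE.sub(" ", text)   # remove @user mentions
    text = text.translate(PUNCT_TABLE)
    text = DIGIT_RE.sub(" ", text)
    return " ".join(text.split())      # collapse repeated whitespace


def embed_tweet(text: str, vectors: dict, dim: int) -> list:
    """Average the vectors of known tokens (assumed pooling strategy).

    `vectors` maps a token to a list of `dim` floats. In the real pipeline
    this would be the pre-trained glove-twitter-200 table (dim = 200).
    """
    tokens = [t for t in clean_tweet(text).split() if t in vectors]
    if not tokens:
        return [0.0] * dim             # no known token: fall back to a zero vector
    return [sum(vectors[t][i] for t in tokens) / len(tokens) for i in range(dim)]


if __name__ == "__main__":
    toy_vectors = {"goal": [1.0, 0.0], "messi": [0.0, 1.0]}  # toy 2-d stand-in
    print(clean_tweet("GOAL!!! @fifa Messi scores in minute 90 #WorldCup https://t.co/abc123"))
    print(embed_tweet("Goal Messi!", toy_vectors, dim=2))
```

A plain dictionary stands in for the embedding table so the sketch runs without downloading anything. Any mapping with the same lookup behaviour could be swapped in for it.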

Models

Various models were explored and implemented:

RNN: A custom RNNBinaryClassifier is defined in src/model.py using PyTorch.
CNN, LSTM: Explored in their respective Jupyter notebooks (cnn.ipynb, lstm.ipynb).
Traditional Machine Learning Models: Experiments with models like Logistic Regression are found in machine_learnings.ipynb and Logistic.ipynb.
Voting Classifier: The voting.ipynb notebook combines the predictions of different models to improve performance.
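
As an illustration of how several binary classifiers can be combined, here is a minimal hard-voting sketch. It assumes simple majority voting over 0/1 labels with ties resolved to 0. The README does not say whether voting.ipynb uses hard or soft voting, and the function name `majority_vote` is hypothetical.

```python
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-model 0/1 predictions by hard majority vote.

    predictions[m][i] is the label model m assigns to sample i.
    A sample is labelled 1 only if more than half of the models say 1,
    so ties (possible with an even number of models) resolve to 0.
    """
    if not predictions:
        raise ValueError("need predictions from at least one model")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all models must predict the same number of samples")
    n_models = len(predictions)
    # zip(*predictions) iterates over samples, yielding one vote per model.
    return [1 if 2 * sum(votes) > n_models else 0 for votes in zip(*predictions)]


if __name__ == "__main__":
    # Made-up labels for four samples, purely for illustration.
    rnn_pred = [1, 0, 1, 0]
    cnn_pred = [1, 1, 0, 0]
    logreg_pred = [0, 1, 1, 0]
    print(majority_vote([rnn_pred, cnn_pred, logreg_pred]))
```

Using an odd number of models avoids ties altogether. With probability outputs, averaging the scores before thresholding (soft voting) would be the usual alternative.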

Setup and Usage

Clone the repository:

git clone https://github.com/llada60/Sub-event_Detection_in_Twitter_streams.git
cd Sub-event_Detection_in_Twitter_streams

Install dependencies:
```
pip install -r requirements.txt
```
Download Data: The data/challenge_data/ directory is excluded by .gitignore. You will need to download the competition data from Kaggle and place it in this directory.
Run Preprocessing: The src/data.py script can be run to perform the full preprocessing pipeline. It saves intermediate and final processed files.
Explore Notebooks: The Jupyter notebooks (*.ipynb) contain the model training, experimentation, and evaluation logic. Open and run these using Jupyter Lab or Jupyter Notebook.
