Russian Twitter Troll Detection

This project detects Russian troll tweets in a dataset of tweets collected around the 2016 US election. Using K-Nearest Neighbors (KNN) and Support Vector Machine (SVM) classifiers on top of Natural Language Processing features, we build models that distinguish normal tweets from those written by Russian troll accounts.

Dataset

The dataset comes from a collection of Russian troll tweets released by Twitter as part of the House Intelligence Committee investigation into how Russia may have influenced the 2016 US Election. The dataset contains tweets from accounts believed to be connected to Russia's Internet Research Agency, a company known for operating social media troll accounts.

Dataset source: Russian Troll Tweets on Kaggle

The dataset consists of:

Russian troll user accounts information
Russian troll tweets
Normal tweets for comparison

Preprocessing

Our data preprocessing pipeline includes:

Cleaning and normalizing tweet text
Removing emojis, punctuation, URLs, and special characters
Extracting hashtags and mentions
Handling missing values
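
The steps above could be sketched roughly as follows. This is a minimal illustration using only Python's `re` module, not the notebook's actual code: the function name `preprocess_tweet`, the regular expressions, and the choice to keep only lowercase ASCII letters and digits are all assumptions.

```python
import re

# Hypothetical patterns, chosen for illustration only
URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def preprocess_tweet(text):
    """Return the cleaned text plus the hashtags and mentions found in it."""
    # Missing values (None, NaN) are treated as empty tweets
    if not isinstance(text, str):
        text = ""

    # Drop URLs first so that "#fragment" parts are not read as hashtags
    no_urls = URL_RE.sub(" ", text)

    # Extract hashtags and mentions before the symbols are stripped
    hashtags = [h.lower() for h in HASHTAG_RE.findall(no_urls)]
    mentions = [m.lower() for m in MENTION_RE.findall(no_urls)]

    # Remove mentions entirely, keep the hashtag word without the '#'
    cleaned = MENTION_RE.sub(" ", no_urls).replace("#", " ").lower()

    # Assumption: English-only text. Anything that is not an ASCII letter,
    # digit, or whitespace (emojis, punctuation, other scripts) is dropped.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", cleaned)

    # Collapse repeated whitespace
    cleaned = " ".join(cleaned.split())
    return {"text": cleaned, "hashtags": hashtags, "mentions": mentions}
```

Applied to a pandas column, this would look something like `df["content"].apply(preprocess_tweet)`, where the column name `content` is a placeholder.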

Feature Engineering

We implemented three different NLP approaches for feature extraction:

One-hot encoding of text
Word2Vec with simple averaging of word embeddings
Word2Vec with TF-IDF weighted averaging
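
To make the difference between the two Word2Vec variants concrete, here is a small sketch of both averaging schemes. It is illustrative only: `idf_weights` and `average_embedding` are hypothetical helpers, `vectors` is a plain dict standing in for a trained model's word vectors, and the smoothed IDF formula is an assumed choice rather than the one used in the notebook.

```python
import math
from collections import Counter


def idf_weights(tokenized_docs):
    """Smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(tokenized_docs)
    # Count each token once per document it appears in
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return {
        tok: math.log((1 + n_docs) / (1 + df)) + 1.0
        for tok, df in doc_freq.items()
    }


def average_embedding(tokens, vectors, dim, weights=None):
    """Average the word vectors of `tokens` into one tweet vector.

    weights=None gives the simple average. Passing IDF weights gives a
    TF-IDF weighted average: repeated tokens are summed once per
    occurrence, which supplies the term-frequency part.
    """
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in vectors:
            continue  # skip out-of-vocabulary tokens
        w = 1.0 if weights is None else weights.get(tok, 0.0)
        for i, x in enumerate(vectors[tok]):
            total[i] += w * x
        weight_sum += w
    if weight_sum == 0.0:
        return total  # no known tokens: fall back to the zero vector
    return [x / weight_sum for x in total]


# Toy two-dimensional vectors, invented for the example
vectors = {"vote": [1.0, 0.0], "the": [0.0, 1.0]}
docs = [["vote", "the"], ["the"]]

simple = average_embedding(docs[0], vectors, dim=2)
weighted = average_embedding(docs[0], vectors, dim=2, weights=idf_weights(docs))
```

In the toy corpus, "the" appears in every document, so its IDF weight is lower and the weighted vector leans toward "vote". With a trained gensim `Word2Vec` model, `model.wv` supports the same `in` and `[]` lookups as the dict used here.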

Models

We trained and evaluated the following models:

K-Nearest Neighbors (KNN)
Support Vector Machine (SVM) with various kernels
Linear Support Vector Classification (LinearSVC)

Each model was trained on each of the feature extraction methods, with hyperparameters tuned by grid search with cross-validation.
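
A grid search over these three model families could look roughly like the sketch below, using scikit-learn. The parameter grids, the 5-fold setting, and the synthetic data from `make_classification` are assumptions made for the example. In the project, `X` would be the tweet feature vectors and `y` the troll/normal labels.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC

# Placeholder data standing in for tweet vectors and troll/normal labels
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical search spaces, one per model family
candidates = {
    "knn": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 11]}),
    "svc": (SVC(), {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}),
    "linear_svc": (LinearSVC(max_iter=5000), {"C": [0.1, 1, 10]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # 5-fold cross-validated grid search, scored by accuracy
    search = GridSearchCV(estimator, grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    results[name] = {
        "best_params": search.best_params_,
        "cv_accuracy": search.best_score_,
        # Accuracy of the refitted best estimator on held-out data
        "test_accuracy": search.score(X_test, y_test),
    }

for name, summary in results.items():
    print(name, summary)
```

Keeping a held-out test split separate from the cross-validation folds gives an accuracy estimate that is not biased by the hyperparameter search.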

Results

Our best-performing model achieved over 80% accuracy in distinguishing troll tweets from normal tweets. Models using Word2Vec with TF-IDF weighting generally outperformed those using one-hot encoding.

The most discriminative features in our models include political terms, emotional language patterns, and specific topics that were targeted during the election period.

Technologies Used

pandas for data manipulation
scikit-learn for machine learning models
Word2Vec from gensim for word embeddings
matplotlib and seaborn for visualization

gael-vanderlee/TwitterTrolls

Folders and files

Latest commit

History

Repository files navigation

Russian Twitter Troll Detection

Dataset

Preprocessing

Feature Engineering

Models

Results

Technologies Used

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages