HTCondor Cluster Built on AWS for Sentiment Analysis of X (Formerly Twitter) Data
Analyzing Twitter data, with its real-time and dynamic nature, presents a unique set of challenges and opportunities. To efficiently process and extract sentiments from large volumes of tweets, parallel computing becomes essential. This is where High-Throughput Computing (HTC) clusters, such as HTCondor, come into play. HTCondor provides a robust framework for managing and executing parallel and distributed computing tasks across a cluster of machines. Leveraging HTCondor for sentiment analysis on Twitter data not only enhances the speed and efficiency of the analysis but also enables the handling of the massive scale of information generated on social media platforms. This guide explores the utilization of HTCondor clusters for parallel processing of sentiment analysis on Twitter data. By distributing the computational workload across multiple nodes in the cluster, we can significantly reduce the processing time, allowing for near real-time analysis of sentiments expressed in tweets. We will delve into the setup, configuration, and deployment of HTCondor, outlining steps to harness the power of parallel computing for sentiment analysis tasks.
- Prepare an HTCondor infrastructure on the AWS instances (including an RDS server and the NFS protocol)
- Ensure that the HTCondor Submission Host can distribute jobs among the Execution Hosts to allow for parallel processing
- Create a scheduler on the HTCondor Submission Host to extract data from the RDS Database at fixed intervals (see the cron sketch after this list)
- Test the cluster by running three different Sentiment Analysis Models and verifying the end results by exporting the performance metrics to a file at the end of the process
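The fixed-interval extraction mentioned above can be driven by cron on the Submission Host. The following is a minimal sketch under stated assumptions: the wrapper script name extract_and_submit.sh and the 15-minute interval are hypothetical and not part of the original setup.
# Hypothetical crontab entry (added via "crontab -e") on the Submission Host:
# every 15 minutes, pull the latest tweets from RDS and submit the HTCondor jobs.
*/15 * * * * /home/ubuntu/extract_and_submit.sh >> /home/ubuntu/scheduler.log 2>&1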
Phase 1 Data Scraping:
- Scraping the data from Twitter using Tweepy, removing unnecessary or redundant data, and applying any preprocessing steps
- Load the data into the RDS Database, which also contains the Python files necessary to conduct the modelling
Phase 2 Data Processing:
- The Submission Host extracts the Tweets.csv files from the RDS Database and submits the job via "condor_submit", distributing the data among the three Execution Hosts and directing which model should be used for the processing (see the sketch after this list)
- Each Execution Host receives the files and conducts the modelling process; after modelling is complete, the relevant performance metrics are recorded and exported to the RDS Database
- The metrics can be exported from the RDS Database for analysis outside of the AWS ecosystem
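As a rough illustration of this hand-off, the sketch below pulls the tweets from RDS into the NFS share and then submits the HTCondor job. It assumes a MySQL-compatible RDS instance; the endpoint, credentials, database, table, and column names are placeholders rather than values from the project.
# Hypothetical extraction from RDS into the shared folder, then job submission.
# mysql prompts for the password; --batch emits tab-separated output, so convert to CSV if needed.
mysql -h mydb.xxxxxxxx.us-east-1.rds.amazonaws.com -u admin -p --batch \
  -e "SELECT tweet_id, tweet_text FROM tweets" tweetsdb > /home/ubuntu/mounter/Tweets.csv
condor_submit job.sub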
This project focuses only on Phase 2 of the methodology outlined above. The components that need to be set up in the AWS environment are six instances of the sizes stated below.
- HTCondorManager
- SubmissionHost // Requires the NFS kernel server to be set up on this instance
- ExecutionHost // Requires at least two instances (three are used in this project); all three instances require nfs-common and mount the shared folder from the SubmissionHost instance
- RDS Server
This section focuses on the infrastructure setup for the HTCondor cluster. The following table lists the instances required for this study:

| Name | Role | Size | Required |
| --- | --- | --- | --- |
| Central_Manager | HTCondor Central Manager | Micro | Yes |
| Submission_Host | HTCondor Submission Host | Nano | Yes |
| Execution_Host_1 | HTCondor Execution Host | Small | Yes |
| Execution_Host_2 | HTCondor Execution Host | Small | No |
| Execution_Host_3 | HTCondor Execution Host | Small | No |
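The instances can be created through the EC2 console (as assumed later in this guide) or, equivalently, with the AWS CLI. The sketch below is illustrative only: the AMI ID, key pair, security group, and subnet are placeholders, and t2.micro corresponds to the Micro size of the Central_Manager row.
# Illustrative AWS CLI launch of the Central_Manager instance (all IDs are placeholders)
aws ec2 run-instances \
  --image-id ami-0abcdef1234567890 \
  --instance-type t2.micro \
  --key-name my-key-pair \
  --security-group-ids sg-0123456789abcdef0 \
  --subnet-id subnet-0123456789abcdef0 \
  --count 1 \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=Central_Manager}]'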
The following are the requirements for the systems to function. The declaration of roles within the HTCondor system must be done during the setup of the instances and will be covered in the following section.
The central manager's role in the cluster is to manage the system resources and assign jobs to free execution hosts. The overall memory needed for this role is low, which is why a small memory size is allocated to the instance. After the Central Manager node is created, it can be accessed via the EC2 terminal to begin the setup phase. The following commands install HTCondor and set the role to central manager:
HTCondor
sudo apt-get update
curl -fsSL https://get.htcondor.org | sudo /bin/bash -s -- --no-dry-run --password "abc123" --central-manager
sudo systemctl restart condor
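After the installer finishes and the service restarts, the daemon state can be checked before continuing; this verification step is an addition to the original notes.
# Confirm the condor service is running and the pool responds
# (condor_status will show no slots until execution hosts join)
sudo systemctl status condor
condor_status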
NFS Setup
sudo apt install nfs-kernel-server
mkdir condor_shared
sudo vim /etc/exports # add /home/ubuntu/condor_shared *(rw,sync,no_subtree_check)
sudo exportfs -ra
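Before the other hosts try to mount the share, the export can be verified on the server side; this check is an addition to the original instructions.
# Verify that the share is exported (run on the host holding the NFS kernel server)
sudo exportfs -v
showmount -e localhost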
GitHub Clone
git clone
git config credential.helper store
git pull
Create commit.sh file:
git add .
git commit -m "Committed from EC2"
git push **************
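If the commands above are saved as commit.sh, the file must be made executable before it can be run; this step is implied rather than stated in the original notes.
chmod +x commit.sh
./commit.sh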
The submission host will create the job request to be submitted to the execution hosts. It requires the central manager's private IP to form the cluster and also requires a job.sub file to submit the jobs.
HTCondor
sudo apt-get update
curl -fsSL https://get.htcondor.org | sudo /bin/bash -s -- --no-dry-run --password "abc123" --submit
sudo systemctl restart condor
NFS Setup
sudo apt install nfs-common
mkdir mounter
sudo mount 172.31.54.189:/home/ubuntu/condor_shared /home/ubuntu/mounter
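Optionally, the mount can be persisted across reboots via /etc/fstab; this is a suggestion beyond the original instructions and reuses the same server IP and paths as above.
# Optional: make the NFS mount survive reboots
echo "172.31.54.189:/home/ubuntu/condor_shared /home/ubuntu/mounter nfs defaults,_netdev 0 0" | sudo tee -a /etc/fstab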
Create job.sub file:
vim job.sub
executable = $(filename)
output = output_$(Process).txt
error = error_$(Process).txt
log = log.txt
requirements = True
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = /home/ubuntu/mounter/local_dataset.xlsx
filename = script1.py
queue
filename = script2.py
queue
filename = script3.py
queue
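With job.sub in place, the jobs are submitted from the Submission Host and can be tracked with HTCondor's standard tools; the monitoring commands are an addition to the original notes.
# Submit the three queued jobs described in job.sub
condor_submit job.sub
# Monitor the job queue and the state of the execution hosts
condor_q
condor_status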
The execution hosts are the workers of the cluster and will do the bulk of the processing. They require the Central Manager's IP to connect to the cluster.
HTCondor
sudo apt-get update
curl -fsSL https://get.htcondor.org | sudo /bin/bash -s -- --no-dry-run --password "abc123" --execute
sudo systemctl restart condor
NFS Setup
sudo apt install nfs-common
mkdir mounter
sudo mount 172.31.54.189:/home/ubuntu/condor_shared /home/ubuntu/mounter
Python Setup
Because the scripts are written for Python 3, some Python packages must be installed on the execution hosts before the jobs can run.
sudo apt install python3-pip
sudo pip3 install nltk
sudo pip3 install seaborn
sudo pip3 install wordcloud
sudo pip3 install scikit-learn
sudo pip3 install xgboost
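Depending on which corpora the sentiment scripts rely on, NLTK data may also need to be downloaded, and the installation can be sanity-checked by importing the packages. The specific corpora below (vader_lexicon, stopwords, punkt) are assumptions about typical sentiment-analysis scripts, not requirements stated in the original notes.
# Download commonly used NLTK data (assumed corpora; adjust to what the scripts actually need)
python3 -m nltk.downloader vader_lexicon stopwords punkt
# Quick check that the installed packages import cleanly
python3 -c "import nltk, seaborn, wordcloud, sklearn, xgboost; print('packages OK')"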