
Distributed environment for analyzing tweets using Spark. Both historical and streaming tweets are analyzed. Sentiment analysis is performed to determine which of the following characteristics influences the sentiment of a tweet: day, time, state and length.


Report - Group 06

Introduction

This document presents an overview of our project. We start with its architecture and then delve into each major component, giving a short description of it along with the technologies used.

The main idea behind the project is to determine which of the following characteristics influences the sentiment of a tweet the most: day, time, state and length. We chose to use this dataset for our project since every tweet contains a timestamp and a geotag for the time and location it was sent from, respectively. To accomplish our goal we perform a series of map/reduce steps on the dataset in order to group the tweets by the aforementioned characteristics and compute the mean sentiment for each group.

Architecture

The architecture of the application follows the structure presented in the diagram below.

[Architecture diagram]

The arrows show the data flow of our architecture. The tweets are retrieved from the database by the Spark drivers. We have two Spark drivers, one responsible for the historical data and one responsible for the streaming data. The former retrieves the data using a simple SQL query, while the latter gets it via a message queue. Both drivers then send the tweets to the Spark cluster for processing.

The processing consists of reducing the initial data based on the length, hour, day and state by aggregating the sentiment score of the tweets. More details about the process can be found in the Spark driver for historical data description.

After the processing is done, the workers store the results in the initial database.

Database

Technologies: ElasticSearch, Python
Description: The database is responsible for storing the initial dataset of tweets as well as the results of the processing pipeline. The results should ideally be stored in a separate database, but this approach gives us a less resource-hungry development environment. However, the initial data and the various results are separated into individual ElasticSearch indices for convenience, ease of access and portability. This also means that moving the results to a database other than the one holding the original data should be trivial.

For implementing the database itself we chose to use ElasticSearch because it is distributed by nature, document-oriented (so there is no schema that needs to be defined and updated throughout the development process), fast and easily scalable; it is also something none of us had worked with before, and this project looked like a good opportunity to learn it. For populating the database we used ElasticSearch's Python library, which is a wrapper around ElasticSearch's RESTful API. This allowed us to abstract and streamline most of the querying, as well as pushing data in parallel.
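As a rough illustration, the snippet below sketches how tweets could be pushed into an index with the official Python client and its bulk helper; the host, index name and document fields are placeholders for this sketch, not necessarily the ones used in the project.

```python
from elasticsearch import Elasticsearch, helpers

# Hypothetical host, index name and document fields, for illustration only.
es = Elasticsearch("http://localhost:9200")

tweets = [
    {"text": "hello world", "timestamp": "2020-03-01T12:00:00", "lat": 40.7, "lon": -74.0},
]

def generate_actions(docs):
    """Wrap raw tweet dicts into bulk-index actions targeting a 'tweets' index."""
    for doc in docs:
        yield {"_index": "tweets", "_source": doc}

# helpers.bulk streams the actions to ElasticSearch in batches,
# so large datasets can be pushed efficiently.
helpers.bulk(es, generate_actions(tweets))
```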

Message queue

Technologies: Kafka, Python

Description: For handling a continuous flow of data we use stream processing, and a message queue is required to handle the continuous stream. Kafka is a unified, high-throughput, low-latency platform for handling streaming data. Kafka uses a binary TCP-based protocol that is optimized for efficiency and relies on a "message set" abstraction that naturally groups messages together to reduce the overhead of the network roundtrip. Kafka stores key-value messages that come from processes called producers; the messages are divided into topics and partitions, and within a partition messages are strictly ordered by their offsets. Consumers read the messages from the partitions. Kafka runs on a cluster of one or more servers (called brokers), and the partitions of all topics are distributed across the cluster nodes. Additionally, partitions are replicated to multiple brokers.

In our application we use the confluent_kafka library, which provides a Python wrapper around Kafka. The producer queries the database (ElasticSearch) for data, 100 data points at a time, and sends it to Kafka under the topic stream. Our consumer is based on Spark, which reads the data from the Kafka cluster using the same topic name.
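A minimal sketch of the producer side under these assumptions: the ElasticSearch node address, Kafka broker address and tweets index are placeholders, while the topic name stream comes from the description above.

```python
import json

from confluent_kafka import Producer
from elasticsearch import Elasticsearch

# Hypothetical addresses and index name, for illustration only.
es = Elasticsearch("http://localhost:9200")
producer = Producer({"bootstrap.servers": "localhost:9092"})

# Fetch a batch of 100 tweets from ElasticSearch and publish them on the 'stream' topic.
response = es.search(index="tweets", size=100)
for hit in response["hits"]["hits"]:
    producer.produce("stream", value=json.dumps(hit["_source"]).encode("utf-8"))

# Block until all buffered messages have been delivered to the broker.
producer.flush()
```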

Spark Driver - Historical Data

Technologies: Python, Spark (PySpark)

Description: Our driver for the historical data processing performs a set of mapreduce steps, each of which we can use to analyse the relation between one variable and the sentiment of the tweet. As you may recall from the [Introduction][introduction], we have four variables we want to analyse: day, time, state and length. When we group the tweets from our dataset by these characteristics, we also add some relevant statistics about the groups we created.

One mapreduce step from our pipeline looks like this:

[Diagram: one mapreduce step]

The relevant statistics are added by applying a single shared function, which makes sure that the format of the statistics remains the same across all mapreduce steps.
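A minimal sketch of one such step, assuming a DataFrame with hypothetical day and sentiment columns; the exact statistics and column names in our pipeline may differ.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("historical_sentiment").getOrCreate()

# Hypothetical input: one row per tweet, with a precomputed sentiment score.
tweets = spark.createDataFrame(
    [("Monday", 0.4), ("Monday", -0.1), ("Tuesday", 0.7)],
    ["day", "sentiment"],
)

def add_statistics(grouped):
    """Apply the same aggregations for every grouping variable,
    so all result DataFrames share the same format."""
    return grouped.agg(
        F.mean("sentiment").alias("mean_sentiment"),
        F.max("sentiment").alias("max_sentiment"),
        F.count("*").alias("tweet_count"),
    )

# Group by one characteristic (here: day) and compute its statistics.
sentiment_by_day = add_statistics(tweets.groupBy("day"))
sentiment_by_day.show()
```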

This mini pipeline is run for each variable whose effect we want to analyse. The DataFrames with the processed results are each stored in their own index in the ElasticSearch cluster, as mentioned in the [Database][database] section.
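Writing one of these result DataFrames to its own index could look roughly like this, assuming the elasticsearch-hadoop Spark connector is available on the cluster; the node address and index name are placeholders, and sentiment_by_day is the DataFrame from the sketch above.

```python
# Assumes the elasticsearch-hadoop connector (org.elasticsearch.spark.sql)
# is on the Spark classpath; node address and index name are illustrative.
(sentiment_by_day.write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "localhost")
    .option("es.port", "9200")
    .mode("append")
    .save("sentiment_by_day"))
```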

Spark Driver - Streaming Data

Technologies: Spark (PySpark)

Description: Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Internally it works as follows: Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Spark Streaming has three major components: an input source, the streaming engine and an output source. The streaming engine processes incoming data from the input sources and writes the results to the output sources. Spark Streaming has a micro-batch architecture, where the stream is treated as a series of batches of data and new batches are created at regular intervals. In our application we use the windowing approach, where windowed computations allow us to apply transformations over a sliding window of data. We specify two parameters:

window length: the duration of the window, in seconds.

sliding interval: the interval at which window operations are performed, in seconds.

We map the latitude and longitude to a location (a particular US state) and each tweet to a sentiment score. We then perform aggregation operations to find the mean, maximum and absolute sentiment in the interval. The results can then be stored in the database for visualization.
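A rough sketch of this windowed computation, written against Spark Structured Streaming's Kafka source rather than the DStream API; the broker address, JSON schema and the two placeholder UDFs are assumptions, and the spark-sql-kafka package has to be on the classpath.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("streaming_sentiment").getOrCreate()

# Hypothetical schema for the JSON tweets arriving on the 'stream' topic.
schema = StructType([
    StructField("text", StringType()),
    StructField("lat", DoubleType()),
    StructField("lon", DoubleType()),
])

# Placeholder UDFs standing in for the lat/lon-to-state lookup and the sentiment model.
to_state = F.udf(lambda lat, lon: "unknown", StringType())
sentiment_score = F.udf(lambda text: 0.0, DoubleType())

# Read from the same Kafka topic the producer writes to (assumed broker address).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "stream")
       .load())

tweets = raw.select(
    F.from_json(F.col("value").cast("string"), schema).alias("tweet"),
    F.col("timestamp"),
)

# Window length of 60 seconds, sliding interval of 30 seconds.
windowed = (tweets
    .withColumn("state", to_state(F.col("tweet.lat"), F.col("tweet.lon")))
    .withColumn("sentiment", sentiment_score(F.col("tweet.text")))
    .groupBy(F.window("timestamp", "60 seconds", "30 seconds"), "state")
    .agg(F.mean("sentiment").alias("mean_sentiment"),
         F.max("sentiment").alias("max_sentiment")))

query = windowed.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```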


Spark cluster

Technologies: Spark (PySpark)
Description: Spark is a distributed data processing engine. It lets us process big data sets faster by splitting the work into chunks and assigning those chunks across a number of workers. The cluster used for our development process is formed by one Spark Master (referred to as the Leader in our project) and two Spark Workers, but the number of workers can easily be scaled up by simply changing the number of replicas of the worker pod. Spark's integrated map/reduce implementation stands at the core of our processing pipeline. We chose to use Spark over the alternatives because it is faster, well documented and well integrated with Python via the PySpark library.
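Connecting a driver to the cluster could look like the sketch below; the spark-leader host name and the default standalone port 7077 are assumptions about how the Leader is exposed.

```python
from pyspark.sql import SparkSession

# 'spark-leader' and port 7077 are assumptions; substitute the service name
# under which the Spark Master (Leader) is exposed in the cluster.
spark = (SparkSession.builder
         .appName("twitter_sentiment")
         .master("spark://spark-leader:7077")
         .getOrCreate())

# Parallelism grows automatically as more worker replicas join the cluster.
print(spark.sparkContext.defaultParallelism)
spark.stop()
```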
