Spark Tweet Processor

This project is a demonstration of a simple Apache Spark batch data processing task using a file of tweets.

In this demo a file of tweets collected during a World Cup are provided. These tweets are to be processed to answer various questions and understand the contents of the tweets better. The project contains two tweet processing algorithms

A simple algorithm to find the most common words
A more advanced algorithm to find useful insights on the tweets

NOTE: This project is dependent on Spark and Gson projects. The simplest way to run this project is to import it as a Maven project into an IDE (such as IntelliJ) and run it from here.

Running a Simple Tweet Processing Demo

The aim of the first demo is to find the most common words seen in the list of tweets.

The tweet processing demo can be found at com.danosoftware.spark.demo.TweetProcessorDemo. This will process the file of tweets held at resources/tweets.json as follows:

Extract the line of text from each tweet
Split out individual words
Create a count per unique word
Filter the words that appear most often
Print the most common words

Run the main method of the above class to see the results.

Running the Advanced Tweet Processing Algorithm

TBC...

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
src/main		src/main
.gitignore		.gitignore
README.md		README.md
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Spark Tweet Processor

Running a Simple Tweet Processing Demo

Running the Advanced Tweet Processing Algorithm

About

Uh oh!

Releases

Packages

Languages

DannyNicholas/spark-tweet-processor

Folders and files

Latest commit

History

Repository files navigation

Spark Tweet Processor

Running a Simple Tweet Processing Demo

Running the Advanced Tweet Processing Algorithm

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages