DataEngineerChallenge

This is an interview challenge for PayPay. Please feel free to fork. Pull Requests will be ignored.

The challenge is to make analytical observations about the data using the distributed tools below.

Processing & Analytical goals:

Sessionize the web log by IP. To sessionize is to aggregate all page hits by visitor/IP during a session (see https://en.wikipedia.org/wiki/Session_(web_analytics)).
Determine the average session time
Determine unique URL visits per session. To clarify, count a hit to a unique URL only once per session.
Find the most engaged users, i.e. the IPs with the longest session times
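
The four goals above can be illustrated with a minimal single-machine sketch in plain Python. This is not the Spark script from this repository, only an outline of the logic under stated assumptions: hits arrive as hypothetical (ip, timestamp, url) tuples, a session ends after 15 minutes of inactivity (the goals above do not fix a timeout, so this value is an assumption), and "most engaged" is read as the longest single session per IP. All function names are illustrative.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumption: inactivity gap that closes a session; not specified in the goals.
SESSION_TIMEOUT = timedelta(minutes=15)


def sessionize(hits, timeout=SESSION_TIMEOUT):
    """Group (ip, timestamp, url) hits into per-IP sessions."""
    by_ip = defaultdict(list)
    for ip, ts, url in hits:
        by_ip[ip].append((ts, url))

    sessions = []
    for ip, events in by_ip.items():
        events.sort(key=lambda e: e[0])  # order each IP's hits by time
        current = None
        for ts, url in events:
            # Start a new session on the first hit or after a long gap.
            if current is None or ts - current["end"] > timeout:
                current = {"ip": ip, "start": ts, "end": ts, "urls": set()}
                sessions.append(current)
            current["end"] = ts
            current["urls"].add(url)  # a set counts each URL once per session
    return sessions


def duration_seconds(session):
    return (session["end"] - session["start"]).total_seconds()


def average_session_seconds(sessions):
    if not sessions:
        return 0.0
    return sum(duration_seconds(s) for s in sessions) / len(sessions)


def unique_urls_per_session(sessions):
    return [len(s["urls"]) for s in sessions]


def most_engaged(sessions, n=10):
    """Return up to n (ip, longest_session_seconds) pairs, longest first."""
    longest = {}
    for s in sessions:
        d = duration_seconds(s)
        longest[s["ip"]] = max(d, longest.get(s["ip"], 0.0))
    return sorted(longest.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

In Spark, the same idea is usually expressed with a window partitioned by IP and ordered by timestamp, flagging a new session whenever the gap to the previous hit exceeds the timeout. Note that a session with a single hit has a duration of zero in this sketch, which pulls the average down; whether to keep or drop such sessions is a design choice.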

View results

Run Scala script

Install Spark if it is not already installed
Download the script folder from the repository
Open a terminal in that folder, or use cd to navigate to it from an existing terminal
Run this command: spark-shell -i PayPay_Challenge.scala
Once the script has finished, enter :q to close the Spark shell

Run notebook
