This is an interview challenge for PayPay. Please feel free to fork. Pull Requests will be ignored.
The challenge is to make analytical observations about the data using the distributed tools below.
- Sessionize the web log by IP. Sessionize = aggregate all page hits by visitor/IP during a session. https://en.wikipedia.org/wiki/Session_(web_analytics)
- Determine the average session time.
- Determine unique URL visits per session. To clarify, count a hit to a unique URL only once per session.
- Find the most engaged users, i.e. the IPs with the longest session times.
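The four goals above can be illustrated with a small sketch in plain Scala on in-memory collections. This is not the code in PayPay_Challenge.scala or the notebook; it only shows the logic under stated assumptions. The `Hit` and `Session` types, the object and function names, and the 15-minute inactivity timeout are all hypothetical choices made for this example (the challenge text does not fix a timeout), and session time is assumed to be the span between the first and last hit, so a single-hit session has length 0. In Spark the same idea is usually expressed with window functions partitioned by IP and ordered by timestamp.

```scala
// Hypothetical record type; a real log line carries many more fields.
final case class Hit(ip: String, epochSeconds: Long, url: String)

// One session of one IP. uniqueUrls counts each distinct URL once.
final case class Session(ip: String, start: Long, end: Long, uniqueUrls: Int) {
  // Assumption: session time = last hit minus first hit.
  def durationSeconds: Long = end - start
}

object SessionizeSketch {
  // Assumed inactivity timeout (15 minutes); not specified by the challenge.
  val TimeoutSeconds: Long = 15 * 60

  // Goals 1 and 3: group hits by IP, split on gaps longer than the timeout,
  // and count distinct URLs inside each session.
  def sessionize(hits: Seq[Hit]): Seq[Session] =
    hits.groupBy(_.ip).toSeq.flatMap { case (ip, ipHits) =>
      val sorted = ipHits.sortBy(_.epochSeconds)
      val groups = sorted.foldLeft(Vector.empty[Vector[Hit]]) { (acc, hit) =>
        acc.lastOption match {
          // Gap within the timeout: the hit extends the current session.
          case Some(cur) if hit.epochSeconds - cur.last.epochSeconds <= TimeoutSeconds =>
            acc.init :+ (cur :+ hit)
          // First hit for this IP, or gap too long: start a new session.
          case _ => acc :+ Vector(hit)
        }
      }
      groups.map { g =>
        Session(ip, g.head.epochSeconds, g.last.epochSeconds, g.map(_.url).distinct.size)
      }
    }

  // Goal 2: mean session length in seconds over all sessions.
  def averageSessionSeconds(sessions: Seq[Session]): Double =
    if (sessions.isEmpty) 0.0
    else sessions.map(_.durationSeconds).sum.toDouble / sessions.size

  // Goal 4: the n IPs whose longest single session is the longest.
  def mostEngaged(sessions: Seq[Session], n: Int): Seq[(String, Long)] =
    sessions.groupBy(_.ip).toSeq
      .map { case (ip, ss) => ip -> ss.map(_.durationSeconds).max }
      .sortBy { case (_, longest) => -longest }
      .take(n)
}

object SessionizeDemo {
  def main(args: Array[String]): Unit = {
    // Made-up sample data, for illustration only.
    val hits = Seq(
      Hit("10.0.0.1", 0L, "/a"),
      Hit("10.0.0.1", 60L, "/a"),
      Hit("10.0.0.1", 120L, "/b"),
      Hit("10.0.0.1", 5000L, "/a"), // gap above 900 s, so a new session starts
      Hit("10.0.0.2", 10L, "/c")
    )
    val sessions = SessionizeSketch.sessionize(hits)
    sessions.foreach(println)
    println(SessionizeSketch.averageSessionSeconds(sessions))
    println(SessionizeSketch.mostEngaged(sessions, 1))
  }
}
```

With the sample data, the first three hits from 10.0.0.1 form one 120-second session with two unique URLs, and the remaining two hits each form a zero-length session, so the average session time is 40 seconds. Whether single-hit sessions should count toward the average is a design choice; excluding them would raise the figure.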
Tools used
- Spark (Scala)
- JupyterHub (IDE)
View results
- Notebook located within the jupyter folder (Notebook Link)
- Nbviewer if notebook is not rendering (Notebook Link)
Run Scala script
- Install Spark if not already installed
- Download the script folder from the repository
- Open a terminal in that folder, or use cd to navigate to it from an existing terminal
- Run this command: spark-shell -i PayPay_Challenge.scala
- Once the script has finished, type :q to exit the Spark shell
Run notebook
- Install JupyterHub (requires Python)
- Install Spark
- Install a Scala kernel so that Spark code can run in the notebook
- Run all cells in the notebook