Project: Data Lake with Spark and S3

Summary of the Project

Sparkify is a startup that wants to analyze the data it has been collecting on songs and user activity in its new music streaming app. The analytics team is particularly interested in understanding which songs users are listening to.

Files in the Repository

The project includes three files:

etl.py: loads data from S3 into tables using Spark, then saves that data back to S3.
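The load-then-save flow described for etl.py can be sketched roughly as follows. This is a minimal illustration rather than the repository's actual code: the function names, the JSON input format, the Parquet output format, and the de-duplication step are all assumptions.

```python
def create_spark_session():
    """Build a SparkSession (hypothetical helper, not taken from etl.py).

    pyspark is imported lazily so the function below can be imported
    and exercised on a machine without Spark installed.
    """
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName("sparkify-etl-sketch").getOrCreate()


def copy_json_to_parquet(spark, input_path, output_path, key_columns):
    """Read JSON files, drop duplicate rows on key_columns, write Parquet.

    Assumes JSON input and Parquet output; the real etl.py may use
    different formats and will likely build several tables.
    """
    df = spark.read.json(input_path)
    table = df.dropDuplicates(key_columns)
    # "overwrite" replaces any previous output at the same location.
    table.write.mode("overwrite").parquet(output_path)
    return table
```

On the cluster this could be called as `copy_json_to_parquet(create_spark_session(), "s3://example-input-bucket/songs/*.json", "s3://example-output-bucket/songs/", ["song_id"])`, where both bucket names and the `song_id` column are placeholders.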

Configuration File

[default]
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
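One way a script could consume this file is with Python's standard `configparser`, copying the two keys into environment variables. The sketch below is an assumption about usage, not the repository's code: the helper name `load_aws_credentials` and the example filename `dl.cfg` are hypothetical.

```python
import configparser
import os


def load_aws_credentials(config_path, section="default"):
    """Read the AWS keys from an INI-style file and export them to os.environ.

    Hypothetical helper: the repository's etl.py may read the file differently.
    """
    parser = configparser.ConfigParser()
    # read() returns the list of files it managed to parse.
    if not parser.read(config_path):
        raise FileNotFoundError(f"Could not read config file: {config_path}")

    credentials = {}
    for key in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
        # Option lookup is case-insensitive by default in configparser.
        credentials[key] = parser[section][key]
        os.environ[key] = credentials[key]
    return credentials
```

A call such as `load_aws_credentials("dl.cfg")` would then make the keys visible to libraries that read AWS credentials from the environment.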

EMR Environment configuration

Add the following lines to the ~/.bashrc file on the cluster:

export SPARK_HOME=/usr/lib/spark
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
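The py4j version in the path above depends on the Spark release installed on the cluster, so it may be worth checking which zip is actually present before hardcoding it. A small sketch, assuming the standard `python/lib` layout under `SPARK_HOME`; the helper name `find_py4j_zip` is hypothetical.

```python
import glob
import os


def find_py4j_zip(spark_home):
    """Return the path of the py4j source zip under spark_home, or None if absent."""
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")
    matches = sorted(glob.glob(pattern))
    return matches[0] if matches else None


def describe_spark_env(spark_home):
    """Return a one-line summary suitable for printing on the cluster."""
    zip_path = find_py4j_zip(spark_home)
    if zip_path is None:
        return f"No py4j zip found under {spark_home}"
    return f"Found {zip_path}"
```

Running `print(describe_spark_env(os.environ.get("SPARK_HOME", "/usr/lib/spark")))` on the cluster shows the file name to use in the first `PYTHONPATH` export.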

How to run the Python scripts

Connect to the EMR cluster: ssh -i key.pem hadoop@ecX-X-X-X-X.compute-1.amazonaws.com
Run etl.py to execute the ETL with Spark: python etl.py
