Workshop #2: Data Engineer

Created by Juan Andrés Ruiz Muñoz (@JuanRuizIng).

Overview ✨

This workshop is an exercise in building an ETL pipeline with Apache Airflow: information is extracted from three different data sources (an API, a CSV file and a database), transformed, merged, and finally loaded into the database, where it is available for visualization in Power BI. We work with Spotify and Grammy Awards data, running loading, cleaning and transformation processes over it to find interesting insights.

The Python libraries needed are:

  • Pandas
  • Matplotlib
  • Seaborn
  • SQLAlchemy
  • Python-dotenv

These libraries are included in the requirements.txt file.

Dataset Information

The first dataset used, the Spotify Dataset, has 114,000 rows and 21 columns describing Spotify tracks across 125 different genres. Each track has a set of audio features associated with it. The dataset is further transformed so it can be processed more easily by the visual analysis tool. Initially, its column names and their respective dtypes are the following (a quick pandas sanity check is sketched after the list):

  • "Unnamed: 0" ➜ int
  • "track_id" ➜ object
  • "artists" ➜ object
  • "album_name" ➜ object
  • "track_name" ➜ object
  • "popularity" ➜ int
  • "duration_ms" ➜ int
  • "explicit" ➜ bool
  • "danceability" ➜ float
  • "energy" ➜ float
  • "key" ➜ int
  • "loudness"➜ float
  • "mode" ➜ int
  • "speechiness" ➜ float
  • "acousticness" ➜ float
  • "instrumentalness" ➜ float
  • "liveness" ➜ float
  • "valence" ➜ float
  • "tempo" ➜ float
  • "time_signature" ➜ int
  • "track_genre" ➜ object

The second dataset used, Grammy Awards, has 4,800 rows and 10 columns. The Grammy is an award presented by The Recording Academy to recognize achievements in the music industry; the trophy depicts a gilded gramophone. The annual presentation ceremony features performances by prominent artists and the presentation of the awards with the most popular interest. The Grammys are the second of the Big Three major music awards held annually (between the American Music Awards in the fall and the Billboard Music Awards in the summer). This dataset is transformed and prepared to be merged into the Spotify dataset (a conceptual sketch of that merge follows the column list). Initially, its column names and their respective dtypes are:

  • "year" ➜ int
  • "title" ➜ object
  • "published_at" ➜ object
  • "updated_at" ➜ object
  • "category" ➜ object
  • "nominee" ➜ object
  • "artist" ➜ object
  • "workers" ➜ object
  • "img" ➜ object
  • "winner" ➜ bool

Clone the repository

Execute the following command to clone the repository:

  git clone https://github.com/JuanRuizIng/Workshop002_Data_Engineering.git

Environment variables

From now on, the steps will be done in VS Code.

To establish the connection to the database, we use a module called database.py. This Python script reads a file where our environment variables are stored. This is how we create that file (a minimal sketch of how database.py might read these variables appears after the steps below):

  1. We create a directory named auth inside src folder.

  2. There we create a file called .env.

  3. In that file we declare 6 environment variables. Remember that in this case the values go without double quotes, i.e. without the string notation ("):

     PG_HOST = # host address, e.g. localhost or 127.0.0.1
     PG_PORT = # PostgreSQL port, e.g. 5432
    
     PG_USER = # your PostgreSQL user
     PG_PASSWORD = # your user password
     
     PG_DRIVER = postgresql+psycopg2
     PG_DATABASE = # your database name, e.g. postgres
  4. (Optional) If you want the files to be saved to Google Drive, you will need to save your Google services key inside auth and edit the store.py file.
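
For reference, database.py might read these variables roughly as follows. Only the PG_* variable names and the .env location come from this README; the function name and the rest of this sketch are assumptions.

    import os
    from dotenv import load_dotenv
    from sqlalchemy import create_engine

    # Load the variables declared in src/auth/.env
    load_dotenv("src/auth/.env")

    def get_engine():
        """Build a SQLAlchemy engine from the PG_* environment variables."""
        url = (
            f"{os.getenv('PG_DRIVER')}://{os.getenv('PG_USER')}:{os.getenv('PG_PASSWORD')}"
            f"@{os.getenv('PG_HOST')}:{os.getenv('PG_PORT')}/{os.getenv('PG_DATABASE')}"
        )
        return create_engine(url)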

Set the csvs

Place spotify_dataset.csv and the_grammys_awards.csv inside the src/data folder.

Airflow configuration

In your airflow.cfg, you need to configure the DAGs folder so Airflow can find and run the DAGs:

dags_folder = /home/[USER]/Workshop002_Data_Engineering/dags
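
For orientation, a DAG living in that folder generally looks like the minimal sketch below. The dag_id, task names and callables here are illustrative assumptions, not the exact DAG shipped in this repository.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Placeholder callables; the real tasks extract, transform and merge the data.
    def extract(): ...
    def transform(): ...
    def load(): ...

    with DAG(
        dag_id="etl_spotify_grammys",   # hypothetical name
        start_date=datetime(2024, 1, 1),
        schedule_interval=None,         # trigger manually from the UI
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        transform_task = PythonOperator(task_id="transform", python_callable=transform)
        load_task = PythonOperator(task_id="load", python_callable=load)

        extract_task >> transform_task >> load_task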

Libraries

Activate your virtual environment and execute the following in the terminal:

pip install -r requirements.txt

Running the code

  1. Execute loadRaw.py in the src/loadRaw folder. A rough sketch of what a raw-load script like this might do appears after this list.

  2. In the terminal, execute airflow standalone and log in to Airflow.

  3. Trigger the ETL DAG from the Airflow dashboard.

  4. Check that everything is in order in the logs or by viewing the executed tasks in the dashboard.

  5. Review the DAG graph view and confirm that all tasks completed successfully.

  6. To view the EDA process, execute the two notebooks in the notebooks folder, in any order. You can run each one just by pressing the "Execute All" button:

    1. EDA_grammys.ipynb
    2. EDA_spotify.ipynb

Remember to choose the appropriate Python kernel when running the notebooks, and install ipykernel so that VS Code supports Jupyter notebooks with the venv virtual environment.
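
For orientation, a raw-load script along the lines of loadRaw.py would read both CSVs and push them into PostgreSQL. The table names below are hypothetical; only the file paths and the PG_* variables come from this README.

    import os
    import pandas as pd
    from dotenv import load_dotenv
    from sqlalchemy import create_engine

    load_dotenv("src/auth/.env")
    engine = create_engine(
        f"{os.getenv('PG_DRIVER')}://{os.getenv('PG_USER')}:{os.getenv('PG_PASSWORD')}"
        f"@{os.getenv('PG_HOST')}:{os.getenv('PG_PORT')}/{os.getenv('PG_DATABASE')}"
    )

    # Hypothetical raw table names; adjust to whatever loadRaw.py actually creates.
    pd.read_csv("src/data/spotify_dataset.csv").to_sql(
        "spotify_raw", engine, if_exists="replace", index=False
    )
    pd.read_csv("src/data/the_grammys_awards.csv").to_sql(
        "grammys_raw", engine, if_exists="replace", index=False
    )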

Connecting the database with Power BI

  1. Open Power BI Desktop and create a new dashboard. Select the Get data option and be sure to choose the "PostgreSQL Database" option.

  2. Insert the PostgreSQL Server and Database Name.

  3. Fill in the following fields with your credentials.

  4. If you manage to connect to the database, its tables will appear, including merged_data_spotify_grammys.

  5. Choose the merged_data_spotify_grammys table and start making your own visualizations! An optional Python check that this table exists is sketched below.
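
If the table does not show up in Power BI, you can optionally confirm from Python that the DAG actually created it. This check is not part of the workshop itself and reuses the get_engine() helper sketched in the environment-variables section, which is itself an assumption.

    from sqlalchemy import inspect

    # Adjust the import path to wherever database.py lives in your checkout.
    from database import get_engine

    engine = get_engine()

    # merged_data_spotify_grammys should be listed once the DAG has run successfully.
    print(inspect(engine).get_table_names())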

Thank you! 💩🐍

Law 7, The 48 Laws of Power - Robert Greene
