Created by Juan Andrés Ruiz Muñoz (@JuanRuizIng).
In this workshop we use Spotify and Grammy Awards data. We run loading, cleaning, and transformation processes on this data to find interesting insights, using the following tools:
- Python ➜ Download site
- Jupyter Notebook ➜ VS Code extension for working with notebooks
- PostgreSQL ➜ Download site
- Power BI (Desktop version) ➜ Download site
- Apache Airflow ➜ Download site
The Python libraries needed are:
- Pandas
- Matplotlib
- Seaborn
- SQLAlchemy
- Python-dotenv
These libraries are included in the requirements.txt file.
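For reference, a names-only sketch of what requirements.txt contains (the repository's actual file may pin specific versions or include additional packages, such as a PostgreSQL driver):

```
pandas
matplotlib
seaborn
SQLAlchemy
python-dotenv
```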
The first dataset used, the Spotify Dataset, has 114,000 rows and 21 columns of Spotify tracks across a range of 125 different genres. Each track has a set of audio features associated with it. This dataset is later transformed so it can be better processed by the visual analysis tool. Initially, the column names of the dataset and their respective dtypes are (a quick verification snippet follows the column list):
- "Unnamed: 0" ➜ int
- "track_id" ➜ object
- "artists" ➜ object
- "album_name" ➜ object
- "track_name" ➜ object
- "popularity" ➜ int
- "duration_ms" ➜ int
- "explicit" ➜ bool
- "danceability" ➜ float
- "energy" ➜ float
- "key" ➜ int
- "loudness"➜ float
- "mode" ➜ int
- "speechiness" ➜ float
- "acousticness" ➜ float
- "instrumentalness" ➜ float
- "liveness" ➜ float
- "valence" ➜ float
- "tempo" ➜ float
- "time_signature" ➜ int
- "track_genre" ➜ object
The second dataset used, Grammy Awards, has 4,800 rows and 10 columns. The Grammy is an award presented by The Recording Academy to recognize achievements in the music industry; the trophy depicts a gilded gramophone. The annual presentation ceremony features performances by prominent artists and the presentation of the awards of broadest popular interest. The Grammys are the second of the Big Three major music awards held annually (between the American Music Awards in the fall and the Billboard Music Awards in the summer). This dataset will be transformed and prepared for a merge with the Spotify dataset (a hedged merge sketch follows the column list). Initially, the column names of the dataset and their respective dtypes are:
- "year" ➜ int
- "title" ➜ object
- "published_at" ➜ object
- "updated_at" ➜ object
- "category" ➜ object
- "nominee" ➜ object
- "artist" ➜ object
- "workers" ➜ object
- "img" ➜ object
- "winner" ➜ bool
Execute the following command to clone the repository:
git clone https://github.com/JuanRuizIng/Workshop002_Data_Engineering.git
From now on, the steps will be done in VS Code.
To establish the connection to the database, we use a module called database.py. This Python script reads a file where our environment variables are stored (a minimal connection sketch follows these steps). This is how we create that file:
- We create a directory named auth inside the src folder.
- There we create a file called .env.
- In that file we declare 6 environment variables. Remember that the values in this case go without double quotes, i.e. without the string notation ("):

  PG_HOST = # host address, e.g. localhost or 127.0.0.1
  PG_PORT = # PostgreSQL port, e.g. 5432
  PG_USER = # your PostgreSQL user
  PG_PASSWORD = # your user password
  PG_DRIVER = postgresql+psycopg2
  PG_DATABASE = # your database name, e.g. postgres
- (Optional) If you want the file to be saved to Google Drive, you will need to save your Google services key inside auth and edit the store.py file. For more information, see here.
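With those variables in place, a minimal sketch of what database.py might look like. The get_engine name and the .env path are illustrative, not necessarily the repository's exact code:

```python
# Sketch of database.py: build a SQLAlchemy engine from the six variables
# declared in src/auth/.env. get_engine() is an assumed helper name.
import os

from dotenv import load_dotenv
from sqlalchemy import create_engine

load_dotenv("src/auth/.env")  # read the environment variables declared above

def get_engine():
    """Build a SQLAlchemy engine for the workshop's PostgreSQL database."""
    url = (
        f"{os.getenv('PG_DRIVER')}://{os.getenv('PG_USER')}:{os.getenv('PG_PASSWORD')}"
        f"@{os.getenv('PG_HOST')}:{os.getenv('PG_PORT')}/{os.getenv('PG_DATABASE')}"
    )
    return create_engine(url)
```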
Place spotify_dataset.csv and the_grammys_awards.csv inside the src/data folder.
In your airflow.cfg, you need to configure the dags_folder so Airflow can locate the DAGs:
dags_folder = /home/[USER]/Workshop002_Data_Engineering/dags
You will need to activate your virtual environment and then execute the following in the terminal:
pip install -r requirements.txt
- Execute loadRaw.py in the src/loadRaw folder.
- In the terminal, execute "airflow standalone" and log in to Airflow.
- Execute the ETL process (DAG) from the dashboard (a minimal DAG sketch follows this list).
- Check that everything is in order in the logs or by viewing the executed tasks in the dashboard.
- Your graphs should look like this:
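For orientation, a minimal sketch of what an ETL DAG in the dags folder might look like. The dag_id, task names, and callables here are assumptions rather than the repository's actual code:

```python
# Sketch of an Airflow ETL DAG; names and callables are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    ...  # read the raw Spotify and Grammys data

def transform():
    ...  # clean both datasets and merge them

def load():
    ...  # write merged_data_spotify_grammys to PostgreSQL

with DAG(
    dag_id="spotify_grammys_etl",  # assumed name
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,        # trigger manually from the dashboard
    catchup=False,
):
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
```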
To view the EDA process, execute the two notebooks in any order. You can run each one just by pressing the "Run All" button. Remember, the notebooks are in the notebooks folder:
- EDA_grammys.ipynb
- EDA_spotify.ipynb
Remember to choose the appropriate Python kernel when running the notebooks, and install ipykernel (pip install ipykernel) so VS Code can run Jupyter notebooks with the venv virtual environment.
- Open Power BI Desktop and create a new dashboard. Select the Get data option, making sure you choose the "PostgreSQL Database" option.
- Insert the PostgreSQL Server and Database Name.
- Fill in the following fields with your credentials.
- If you manage to connect to the database, the following tables will appear:
- Choose the merged_data_spotify_grammys table and start making your own visualizations! A quick way to preview this table from Python first is shown below.
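As an optional sanity check before building visuals, a small sketch that previews the merged table. It assumes the get_engine helper from the database.py sketch above; adjust the import path to where the module actually lives:

```python
# Preview the merged table before connecting Power BI.
import pandas as pd

from database import get_engine  # assumed helper sketched earlier

preview = pd.read_sql("SELECT * FROM merged_data_spotify_grammys LIMIT 5", get_engine())
print(preview)
```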
Law 7, The 48 Laws of Power - Robert Greene