Workshop #2: Data Engineer

Created by Juan Andrés Ruiz Muñoz (@JuanRuizIng).

Overview ✨

This workshop is an exercise in building an ETL pipeline with Apache Airflow: information is extracted from three different data sources (an API, a CSV file and a database), transformed, merged, and finally loaded into the database, where it is available for visualization in Power BI. We work with Spotify and Grammy Awards data, running loading, cleaning and transformation processes over it to find interesting insights.

The Python libraries needed are:

  • Pandas
  • Matplotlib
  • Seaborn
  • SQLAlchemy
  • Python-dotenv

These libraries are included in the requirements.txt file.

Dataset Information

The first dataset used, the Spotify Dataset, has 114,000 rows and 21 columns describing Spotify tracks across 125 different genres. Each track has a set of audio features associated with it. The dataset is further transformed so it can be processed more easily by the visual analysis tool. Initially, its column names and their respective dtypes are the following (a quick pandas sanity check is sketched after the list):

  • "Unnamed: 0" ➜ int
  • "track_id" ➜ object
  • "artists" ➜ object
  • "album_name" ➜ object
  • "track_name" ➜ object
  • "popularity" ➜ int
  • "duration_ms" ➜ int
  • "explicit" ➜ bool
  • "danceability" ➜ float
  • "energy" ➜ float
  • "key" ➜ int
  • "loudness"➜ float
  • "mode" ➜ int
  • "speechiness" ➜ float
  • "acousticness" ➜ float
  • "instrumentalness" ➜ float
  • "liveness" ➜ float
  • "valence" ➜ float
  • "tempo" ➜ float
  • "time_signature" ➜ int
  • "track_genre" ➜ object

The second dataset used, Grammy Awards, has 4,800 rows and 10 columns. The Grammy is an award presented by The Recording Academy to recognize achievements in the music industry; the trophy depicts a gilded gramophone. The annual presentation ceremony features performances by prominent artists and the presentation of the awards with the most popular interest. The Grammys are the second of the Big Three major music awards held annually (between the American Music Awards in the fall and the Billboard Music Awards in the summer). This dataset is transformed and prepared to be merged into the Spotify dataset (a conceptual sketch of that merge follows the column list). Initially, its column names and their respective dtypes are:

  • "year" ➜ int
  • "title" ➜ object
  • "published_at" ➜ object
  • "updated_at" ➜ object
  • "category" ➜ object
  • "nominee" ➜ object
  • "artist" ➜ object
  • "workers" ➜ object
  • "img" ➜ object
  • "winner" ➜ bool

Clone the repository

Execute the following command to clone the repository:

  git clone https://github.com/JuanRuizIng/Workshop002_Data_Engineering.git

Environment variables

From now on, the steps will be done in VS Code.

To establish the connection to the database, we use a module called database.py. This Python script reads a file where our environment variables are stored. This is how we create that file (a minimal sketch of how database.py might read these variables appears after the steps below):

  1. We create a directory named auth inside src folder.

  2. There we create a file called .env.

  3. In that file we declare 6 environment variables. Remember that in this case the values go without double quotes, i.e. without the string notation ("):

     PG_HOST = # host address, e.g. localhost or 127.0.0.1
     PG_PORT = # PostgreSQL port, e.g. 5432
    
     PG_USER = # your PostgreSQL user
     PG_PASSWORD = # your user password
     
     PG_DRIVER = postgresql+psycopg2
     PG_DATABASE = # your database name, e.g. postgres
  4. (Optional) If you want the files to be saved to Google Drive, you will need to save your Google services key inside auth and edit the store.py file.
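
For reference, database.py might read these variables roughly as follows. Only the PG_* variable names and the .env location come from this README; the function name and the rest of this sketch are assumptions.

    import os
    from dotenv import load_dotenv
    from sqlalchemy import create_engine

    # Load the variables declared in src/auth/.env
    load_dotenv("src/auth/.env")

    def get_engine():
        """Build a SQLAlchemy engine from the PG_* environment variables."""
        url = (
            f"{os.getenv('PG_DRIVER')}://{os.getenv('PG_USER')}:{os.getenv('PG_PASSWORD')}"
            f"@{os.getenv('PG_HOST')}:{os.getenv('PG_PORT')}/{os.getenv('PG_DATABASE')}"
        )
        return create_engine(url)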

Set the csvs

Place spotify_dataset.csv and the_grammys_awards.csv inside the src/data folder.

Airflow configuration

In your airflow.cfg, you need to configure the DAGs folder so Airflow can find and run the DAGs:

dags_folder = /home/[USER]/Workshop002_Data_Engineering/dags
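
For orientation, a DAG living in that folder generally looks like the minimal sketch below. The dag_id, task names and callables here are illustrative assumptions, not the exact DAG shipped in this repository.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Placeholder callables; the real tasks extract, transform and merge the data.
    def extract(): ...
    def transform(): ...
    def load(): ...

    with DAG(
        dag_id="etl_spotify_grammys",   # hypothetical name
        start_date=datetime(2024, 1, 1),
        schedule_interval=None,         # trigger manually from the UI
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        transform_task = PythonOperator(task_id="transform", python_callable=transform)
        load_task = PythonOperator(task_id="load", python_callable=load)

        extract_task >> transform_task >> load_task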

Libraries

Activate your virtual environment and execute the following in the terminal:

pip install -r requirements.txt

Running the code

  1. Execute loadRaw.py in the src/loadRaw folder. A rough sketch of what a raw-load script like this might do appears after this list.

  2. In the terminal, execute airflow standalone and log in to Airflow.

  3. Trigger the ETL DAG from the Airflow dashboard.

  4. Check that everything is in order in the logs or by viewing the executed tasks in the dashboard.

  5. Review the DAG graph view and confirm that all tasks completed successfully.

  6. To view the EDA process, execute the two notebooks in the notebooks folder, in any order. You can run each one just by pressing the "Execute All" button:

    1. EDA_grammys.ipynb
    2. EDA_spotify.ipynb

Remember to choose the appropriate Python kernel when running the notebooks, and install ipykernel so that VS Code supports Jupyter notebooks with the venv virtual environment.
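
For orientation, a raw-load script along the lines of loadRaw.py would read both CSVs and push them into PostgreSQL. The table names below are hypothetical; only the file paths and the PG_* variables come from this README.

    import os
    import pandas as pd
    from dotenv import load_dotenv
    from sqlalchemy import create_engine

    load_dotenv("src/auth/.env")
    engine = create_engine(
        f"{os.getenv('PG_DRIVER')}://{os.getenv('PG_USER')}:{os.getenv('PG_PASSWORD')}"
        f"@{os.getenv('PG_HOST')}:{os.getenv('PG_PORT')}/{os.getenv('PG_DATABASE')}"
    )

    # Hypothetical raw table names; adjust to whatever loadRaw.py actually creates.
    pd.read_csv("src/data/spotify_dataset.csv").to_sql(
        "spotify_raw", engine, if_exists="replace", index=False
    )
    pd.read_csv("src/data/the_grammys_awards.csv").to_sql(
        "grammys_raw", engine, if_exists="replace", index=False
    )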

Connecting the database with Power BI

  1. Open Power BI Desktop and create a new dashboard. Select the Get data option and be sure to choose the "PostgreSQL Database" option.

  2. Insert the PostgreSQL Server and Database Name.

  3. Fill in the following fields with your credentials.

  4. If you manage to connect to the database, its tables will appear, including merged_data_spotify_grammys.

  5. Choose the merged_data_spotify_grammys table and start making your own visualizations! An optional Python check that this table exists is sketched below.
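
If the table does not show up in Power BI, you can optionally confirm from Python that the DAG actually created it. This check is not part of the workshop itself and reuses the get_engine() helper sketched in the environment-variables section, which is itself an assumption.

    from sqlalchemy import inspect

    # Adjust the import path to wherever database.py lives in your checkout.
    from database import get_engine

    engine = get_engine()

    # merged_data_spotify_grammys should be listed once the DAG has run successfully.
    print(inspect(engine).get_table_names())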

Thank you! 💩🐍

Law 7, The 48 Laws of Power - Robert Greene
