
Activity 1: Open Weather Map API Data Pipeline


YouTube Video Walkthroughs

Part One of the Video Walkthrough

Part Two of the Video Walkthrough starts at section VIII.

I. Tools

  • Python
  • Docker
  • Apache Airflow
  • AWS (S3, EC2, Route 53)
  • GCP (GCS, BigQuery)
  • Ubuntu

Be sure to have your AWS and GCP accounts created before starting the activity! For information on setting up your cloud accounts, see the Requirements Wiki

II. Objective

Set up your local environment, Linux server, and cloud services by pulling code from the orangutan-stem repository. This will familiarize you with cloud services like AWS and GCP, introduce you to frameworks such as Apache Airflow, and expose you to setting up server-side data applications.

III. Local environment setup

To get started, you will first want to create an open weather map account and retrieve an API key. API keys can take a few hours (sometimes 12+) to activate, so be patient. To create an open weather map account, see the Open Weather Map API section in the requirements wiki. There is a code snippet in the python installation section below for testing your new api key in your local IDE/editor before attempting the airflow DAG version.

This activity requires docker (with docker compose) and git. If you install docker on a Windows, Linux, or macOS desktop with the installer, it should already include docker compose. In addition, you should also have an AWS account, an access/secret key pair, and a GCP service account. Instructions for setting up your cloud accounts can be found in the Requirements wiki.

a. Python

Install python >=3.7 on your local machine. Note, the docker image in this activity uses a different (older) version of python. It's useful to have a local (and newer) version of python installed for writing/testing snippets of code (e.g. checking that your open weather map api key is valid, since it takes a few hours to activate).

import requests
import pprint


# Bohorok, Northern Sumatra lat / long
lat_long = {'lat': 3.5553, 'long': 98.1448}
# Chicago, Illinois lat / long (note the negative longitude, west of the prime meridian)
# lat_long = {'lat': 41.8781, 'long': -87.6298}
open_weather_api_key = ''  # put your api key here inside of single quotes
open_weather_api_response = requests.get(
    url=f"https://api.openweathermap.org/data/2.5/weather?lat={lat_long['lat']}&lon={lat_long['long']}&appid={open_weather_api_key}")
weather_data_content = open_weather_api_response.json()
# 'main' contains temperature (Kelvin by default), pressure, and humidity
main_weather_content = weather_data_content['main']
pprint.pprint(main_weather_content)
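
Two things worth knowing while you test: the API returns temperatures in Kelvin unless you pass the documented units parameter, and a brand-new key typically responds with a 401 status until it finishes activating. A small variation of the snippet above (same coordinates, placeholder key) that requests metric units and checks for that case:

import requests

open_weather_api_key = ''  # put your api key here inside of single quotes
response = requests.get(
    "https://api.openweathermap.org/data/2.5/weather",
    params={"lat": 3.5553, "lon": 98.1448, "units": "metric", "appid": open_weather_api_key},
)
if response.status_code == 401:
    print("Key not active yet (or incorrect), wait a bit longer and retry.")
else:
    print(response.json()["main"])  # temperatures now in Celsius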

Python 3.11.5 is probably the best version to install as of 8/25/23

Be sure to select the appropriate version for your operating system

Python Installer Download Site

To verify python was properly installed, open terminal/command line/powershell and run,

python3 --version (Mac OS/Linux)

python --version (Windows OS)

b. Docker

Docker has an installer available. To download docker via the installer, see the docker installation requirements

To verify you have docker,

docker --version

To verify you have docker-compose,

docker-compose --version

A command you can test on your terminal/command line/windows powershell is,

docker ps

This lists the containers currently running on your machine (and how many there are). If you just downloaded docker, there will be none listed.

c. VS Code

In the youtube video, I will be using VS (Visual Studio) Code. This is a free open-source development editor. Note, this is not a traditional IDE (integrated development environment) such as PyCharm, but it is very popular amongst developers. With all of the extensions VS Code offers, it works very similarly to a traditional IDE.

VS Code Download

d. git

Go to the git-scm website and install the appropriate version for your operating system.

To check the install worked successfully, open terminal/command line/powershell and run,

git -v or git --version

Once you have python, docker, vscode, and git installed, you should be good to go!

IV. Clone the codebase locally

You will want a directory on your machine where you can easily find the projects you clone from git or create yourself. Create one somewhere on your machine with terminal/command line/powershell by running,

mkdir orangutan_stem_projects

Change your current working directory to the new one just created with,

cd orangutan_stem_projects

Once you are in the new directory, you will clone the orangutan-stem codebase with the following command on terminal/command line/powershell,

git clone https://github.com/mikestack15/orangutan-stem.git

If this is your first time using the git cli, you may be prompted for some authentication/permissions in order to successfully clone the codebase.

V. Run Docker (Locally)

Once the codebase is cloned, change directories into the airflow folder

cd orangutan-stem/airflow

Once you are in the airflow directory, you will see a yaml file called 'docker-compose.yaml'. Docker Compose is a tool for defining and running multi-container Docker applications. It allows you to define your application's services, networks, and volumes in a single file called docker-compose.yaml, and to describe how multiple containers work together to build a complex application. The public docker image can be found on Docker Hub.

According to airflow's documentation, it's best practice to first run,

docker-compose up airflow-init

within the orangutan-stem/airflow directory to initialize the database. The local docker setup should work even without this command, but the server-side setup will not. More on this in the later sections.

Within the orangutan-stem/airflow directory, run the following command to start up airflow locally,

docker-compose up

When you run docker-compose up, Docker Compose reads the configuration from the file, pulls the necessary images (if not already available), and creates containers for each service defined in the docker-compose.yaml file.

There are a lot of ways to set up airflow: virtual environments, custom machine setups, kubernetes, cloud-hosted services (GCP Cloud Composer, AWS MWAA), and docker, to name the most popular. My preferred approach is a docker container. It's simple, portable, and can work very well in production-grade enterprise applications. In this demo, I will show how to run it on your local machine and on a server (AWS EC2).

VI. Accessing the airflow webserver (Locally)

After docker-compose up finishes running, you can go to localhost:8080, where you will see the login screen for the airflow webserver. With the current yaml file cloned from the orangutan-stem repository, you will be able to log in with

username: airflow

password: airflow

Once you are logged in, you will see a red banner stating that one of the DAGs in the DagBag is broken. This is okay; it indicates that we need to set up a declared variable used by the dag. To do this, go to the top ribbon, under Admin > Variables.

On the Variables page, create a new variable by clicking the blue box with a plus-sign in it. Be sure to name the Key 'open_weather_api_key', and paste the Open Weather Map API key you were given as the Val. To create an Open Weather Map API key, see the Requirements Wiki.
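
For reference, the DAG reads this value through Airflow's Variables API. A minimal sketch of how a task might use it (the key name comes from the step above; the function and request details are illustrative, not the repository's exact code):

from airflow.models import Variable
import requests


def fetch_bohorok_weather():
    # Read the key stored under Admin > Variables in the webserver UI
    api_key = Variable.get("open_weather_api_key")
    response = requests.get(
        "https://api.openweathermap.org/data/2.5/weather",
        params={"lat": 3.5553, "lon": 98.1448, "appid": api_key},
    )
    response.raise_for_status()
    return response.json()["main"]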

Once you have the variable set up, the red error banner will disappear (give it a minute and refresh the browser), and you will see a dag called 'extract_open_weather_data_to_lake'.

You can click this dag and attempt to run it; however, it will fail on the second task because you have not yet configured the aws connection used to write the data retrieved from the open weather map api out to s3. The third task will also fail without the bigquery_default and google_cloud_default connections set up.

VII. Setting up Connections (Locally and Server-side)

You will need to set up a bucket in S3 and GCS, a BigQuery dataset and table, and three connections for this extract_open_weather_data_to_lake DAG to run successfully:

A. Connection for S3

  • aws_default // Connection Type: Amazon Web Services // Requires AWS account/access keys/S3 bucket setup
  1. create a bucket in s3 called orangutan-orchard
  2. go to the airflow webserver UI, and at the top banner click Admin -> Connections
  3. create a connection id, aws_default, with connection type Amazon Web Services
  4. add your AWS Access Key and AWS Secret Access Key to the appropriate fields and click save (an optional check of the keys and bucket is sketched below)
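
As an optional sanity check (not part of the repository's code), you can confirm that the access/secret key pair can write to the new bucket before wiring it into Airflow. A boto3 sketch, with placeholder credentials and object key:

import boto3

# Hypothetical smoke test: upload a tiny object with the same keys you gave Airflow
s3 = boto3.client(
    "s3",
    aws_access_key_id="YOUR_AWS_ACCESS_KEY",
    aws_secret_access_key="YOUR_AWS_SECRET_ACCESS_KEY",
)
s3.put_object(Bucket="orangutan-orchard", Key="connection_check/hello.txt", Body=b"hello")
print(s3.list_objects_v2(Bucket="orangutan-orchard", Prefix="connection_check/")["KeyCount"])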

B. Connection for BigQuery

  • bigquery_default // Connection Type: Google Cloud // Requires GCP Account/BigQuery enabled/scopes listed enabled
  1. create a connection id, bigquery_default, with Connection Type Google Cloud
  2. enter your gcp project-id
  3. add the scopes
  • https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/bigquery
  4. copy and paste your gcp service account json file contents into the Keyfile JSON field (see the Requirements wiki on how to create this)
  5. create a dataset/table in bigquery in the GCP BigQuery UI (a scripted alternative is sketched below)
  • dataset: bukit_lawang_weather
  • table: raw_ingested_main_weather_content
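
If you prefer to create the dataset and table from code rather than the BigQuery UI, here is a sketch using the google-cloud-bigquery client library. The dataset and table names come from the list above; the project id is a placeholder and the schema is an assumption based on the 'main' fields the api returns, so adjust it to match what the DAG actually loads:

from google.cloud import bigquery

# Assumes GOOGLE_APPLICATION_CREDENTIALS points at your service account json file
client = bigquery.Client(project="your-gcp-project-id")

client.create_dataset("bukit_lawang_weather", exists_ok=True)

# Hypothetical schema mirroring the 'main' block of the open weather map response
schema = [
    bigquery.SchemaField("temp", "FLOAT"),
    bigquery.SchemaField("feels_like", "FLOAT"),
    bigquery.SchemaField("temp_min", "FLOAT"),
    bigquery.SchemaField("temp_max", "FLOAT"),
    bigquery.SchemaField("pressure", "INTEGER"),
    bigquery.SchemaField("humidity", "INTEGER"),
]
table_id = "your-gcp-project-id.bukit_lawang_weather.raw_ingested_main_weather_content"
client.create_table(bigquery.Table(table_id, schema=schema), exists_ok=True)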

C. Connection for Google Cloud

  • google_cloud_default // Connection Type: Google Cloud // Requires GCP Account/Cloud Storage+bucket setup/scopes listed enabled
  1. create a google cloud storage bucket called orangutan-orchard and make sure your service account has the appropriate permissions (an optional write check is sketched below)
  2. create a connection id, google_cloud_default, with Connection Type Google Cloud
  3. populate the project-id, and add the following scope:
  • https://www.googleapis.com/auth/cloud-platform
  4. copy and paste your gcp service account json file contents into the Keyfile JSON field
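
As with S3, you can optionally verify that the service account can write to the GCS bucket before running the DAG. A minimal sketch using the google-cloud-storage client library, with a placeholder path for your service account keyfile:

from google.cloud import storage

# Authenticate with the same service account json used for the Airflow connection
client = storage.Client.from_service_account_json("path/to/your-service-account.json")

bucket = client.bucket("orangutan-orchard")
blob = bucket.blob("connection_check/hello.txt")
blob.upload_from_string("hello")
print([b.name for b in client.list_blobs("orangutan-orchard", prefix="connection_check/")])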

Once all of these connections are set up and your open weather map api key is properly activated, the DAG should run end-to-end! Remember that your buckets in S3/GCS and your bigquery dataset/table need to exist for the dag to work.

VIII. Setting up this build on an EC2 Server with Docker, Traefik, Route 53, and an Elastic IP (Server-side)

In this section, which marks the beginning of Part 2 of our YouTube video walkthrough, we'll focus on setting up an Airflow Docker server on AWS EC2. Cost considerations are essential, so let's dive into that. For this setup, you'll need at least a t2.large EC2 instance, though a t2.xlarge (or bigger) is recommended for production-grade environments. The hourly rate for these instances starts at approximately $0.07/hour. AWS does offer cloud credits, but once they're depleted, you'll need to budget for ongoing operational costs. Make sure that your cloud billing account is prepared for these expenses to ensure that the cost of running this build server-side is justified. If you're pursuing this activity solely for learning purposes, it's crucial to shut down your EC2 instance once you've completed the activity to avoid incurring additional charges. For detailed information on AWS EC2 costs, see their documentation.
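
For a rough sense of scale, at the quoted ~$0.07/hour a t2.large left running around the clock works out to roughly $0.07 × 24 hours × 30 days ≈ $50 per month, before storage and data transfer charges are factored in.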

EC2 Airflow Docker Build (Server-Side) Overview

If you navigate through the orangutan-stem codebase, you will see that there are several .yml/.yaml files in various directories. Based on where and how we run docker-compose, we can use specific yaml files in this codebase (as you originally cloned it) to run on an EC2 server. Note, you will have to update some code in the traefik.yml file found at airflow/compose/prod/traefik/traefik.yml; more on this further down. Just like the local airflow setup, we will use docker-compose to run our airflow docker application.

a. Setup EC2 Server

  1. Create an EC2 instance with the latest ubuntu version (22.04.3 LTS)
  2. For the instance type (size), be sure to select a t2.large/t2.xlarge, since docker requires a good chunk of cpu/memory. Also be sure to configure enough storage on your ec2 instance (up to 200gb).
  3. Create a new ssh access identity file/key as part of setting up the EC2 instance (a .pem file for openssh is preferred). Save it as something like 'orangutan-server-key.pem'. On terminal/command prompt, cd into your Downloads directory and run mv orangutan-server-key.pem ~/.ssh/ (using the appropriate file name). On windows, you can manually drag it from your Downloads folder into the .ssh/ directory (usually found in your C drive). If on a mac, you may have to run the following command for the .pem file to be properly accessed,

chmod 600 ~/.ssh/orangutan-server-key.pem

  4. Launch the EC2 instance, and once it is provisioned, you will want to allocate an elastic ip. Be sure to set your security group (inbound rules) to allow TCP traffic (all) for ports 80, 443, and 22.

b. Allocate Elastic IP/DNS Record to EC2 Server and Route 53

We will want to assign the ec2 instance an elastic ip so it has a stable, fixed ip address.

  1. Within EC2, on the left-hand menu bar select "Elastic IPs" and click "Allocate IP Address". Once the Elastic IP is created, be sure to associate it with your newly created EC2 instance
  2. Navigate to Route 53, and on the left-hand menu click 'Registered Domains -> Register Domains'
  3. Find a cheap domain that you can point your elastic ip at, so the webserver is reachable over the public internet under something more intuitive than an ip address, such as 'airflow.orangutan-stem.com'. Registering a domain takes a few minutes to a few hours, and costs ~$13 USD annually for low-demand domain names.
  4. Once your domain is registered, go to 'Hosted Zones' on the left-hand menu and add a new hosted zone for 'airflow.{yourdomainname.com}'. Replace {yourdomainname.com} with your newly registered domain, and don't forget the 'airflow' subdomain prefix. Once 'airflow.yourdomainname.com' is added as a hosted zone, copy its NS record value(s) into the NS record value(s) of the 'yourdomainname.com' hosted zone. See the screenshots below: the first is taken after 'airflow.yourdomainname.com' is created as a hosted zone, and the second shows where to paste the values in the 'yourdomainname.com' NS records.

Copy these NS record values, and paste them to the yourdomainname.com hosted zone record in the Value section

  5. Create an A record for airflow (airflow.yourdomainname.com) within the yourdomainname.com hosted zone. The Value should be the elastic IP address you created earlier.
  6. Open a terminal/command prompt, and cd .ssh/
  7. Create a new file (no file extension) called 'config'. This can be done with New-Item config on windows or touch config on mac/linux os
  8. From the .ssh/ directory, open the config file with vim/nano (or run code . to open/edit it in vs code) and add the following content:
Host airflow.yourdomainname.com
  HostName 3.100.100.100
  IdentityFile ~/.ssh/orangutan-server-key.pem
  User ubuntu

Be sure to replace these values with your correct dns name (Host), the elastic ip you assigned to your ec2 instance (HostName), and the .pem file you moved into your .ssh/ directory in the earlier step (IdentityFile). User should always be ubuntu for the default setup.

Once you have set up your config file properly, you should be able to ssh into your ec2 instance! To do this, open a terminal/command prompt and type:

ssh airflow.yourdomainname.com

You may be prompted to type 'yes', after which you should find yourself logged into the server.

IX. Clone orangutan-stem codebase onto EC2 server

Git should be installed by default on the ec2 server's ubuntu linux distro. To check, run git --version. If this works, run the following commands to clone the orangutan-stem repo,

mkdir projects

cd projects

git clone https://github.com/mikestack15/orangutan-stem.git

X. Install Docker on EC2 Server

Once the codebase is cloned, you will have to install docker and docker compose. This has to be done over the ssh command line/terminal, in the EC2 instance session you should already be in.

Update the apt package index, install the packages that allow apt to use a repository over HTTPS, and add Docker's official GPG key. The second command block sets up the Docker apt repository.

# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg

# Add the repository to Apt sources:
echo \
  "deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  "$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update


Install docker engine, containerd, and docker compose

sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

Verify everything works:

sudo docker run hello-world

We want to be able to run docker without sudo, and in order to do this, we will need to create a docker group and add authenticated users to it.

sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker

Verify everything works without sudo

docker run hello-world

Configure Docker to start on boot with systemd

sudo systemctl enable docker.service
sudo systemctl enable containerd.service

XI. Set up an .env file and modify airflow/compose/prod/traefik/traefik.yml

  1. Navigate to the airflow/ directory and run the following command,

touch .env

  2. use vim/nano to modify the .env file with the following content:
AIRFLOW_UID=50000
AIRFLOW_GID=0
AIRFLOW__WEBSERVER__WARN_DEPLOYMENT_EXPOSURE=false
  3. Traefik serves as a reverse proxy and load balancer, much like Nginx. We employ Traefik to make the Airflow webserver publicly accessible over the internet; without it, the Airflow webserver would be operational on the EC2 instance but not publicly reachable. As a subsequent step, we will also need to update the default authentication password for the Airflow user account. Use vim/nano to modify the airflow/compose/prod/traefik/traefik.yml file and update the following three lines with your proper dns name:
[line 25]      email: "mailbox@mail.airflow.orangutan-stem.com" # need to adjust this with your correct dns name, i.e. airflow.yourdomain.com
[line 34]      rule: "Host(`airflow.orangutan-stem.com`)" # need to adjust this with your correct dns name, i.e. airflow.yourdomain.com
[line 45]      rule: "Host(`airflow.orangutan-stem.com`)" # need to adjust this with your correct dns name, i.e. airflow.yourdomain.com

XII. Run Docker Compose to initialize container on EC2 Server

  1. from the airflow/ directory run the following command:

docker compose -f prod.yaml build

  2. for first-time setup, or after the db is cleared, run,

docker compose -f prod.yaml up airflow-init

  3. After this completes, be sure to exit with Ctrl/Cmd + C (or some other way). This step migrates and seeds the database, which should only need to be done once.

  4. Open a new terminal, ssh into your ec2 instance, cd into the airflow/ directory, and run,

docker compose -f prod.yaml up -d

  5. if you go to your public domain, airflow.yourdomainname.com, you should be able to reach the airflow webserver and log in with user airflow, password airflow. You should update this immediately! One way to do this is with the airflow cli, which comes with the docker build. Open a new ssh terminal to your ec2 instance, and run,

cd projects/orangutan-stem/airflow

docker compose run airflow-cli bash

airflow users create \
          --username admin \
          --firstname FIRST_NAME \
          --lastname LAST_NAME \
          --role Admin \
          --email admin@example.org

Assign a complex password when prompted, then delete the default airflow user with the following command,

airflow users delete -u airflow

You will need the user 'airflow' created and saved in the database in order for docker compose up commands to work, so recreate it, but with a secure password (not 'airflow'),

airflow users create \
          --username airflow \
          --firstname Airflow \
          --lastname Airflow \
          --role Admin \
          --email airflow@example.org

To kill the docker container running airflow-cli once you are finished with the steps above, open a new ssh terminal to your ec2 server and run,

docker ps

to identify which container is running airflow-cli, and then run,

docker kill {airflow_cli_container_id}

where {airflow_cli_container_id} is the actual alphanumeric container id that needs to be removed. Be sure not to kill any of the other containers, as they are necessary to run airflow on EC2.

If you made it this far and completed everything above successfully, you have just constructed a local and a server-side, production-grade, multi-cloud airflow environment that can be extremely useful for many different projects!
