This repo houses function code and deployment code for VEDA projects.
- `dags` contains the Directed Acyclic Graphs (DAGs) that make up the Airflow workflows. This includes the Python code that runs each task as well as the Python definitions of the structure of these DAGs. Files that define a DAG object are considered top-level DAG files and are processed by Airflow.
- `groups` contains common groups of tasks that can be reused in multiple DAGs.
- `utils` contains shared low-level functions used in tasks and DAGs.
- `airflow_worker` contains a Dockerfile and the requirements for the Airflow workers. This service runs the tasks in the DAGs.
- `airflow_services` contains a Dockerfile and the requirements for the Airflow scheduler, DAG processor, and webserver. These requirements largely depend on the Airflow version, plus anything needed to parse the top-level DAG code.
- `infrastructure` contains the Terraform necessary to deploy all resources to AWS.
- `scripts` contains bash scripts for deploying and operating Airflow instances.
- `sm2a-local-config` contains the Airflow configuration for running Airflow locally.
- Install Docker and Docker Compose (see install-docker-and-docker-compose).
- Install uv.
- Run `uv sync` to install the required Python packages. By default, all optional dependencies are included; to avoid this, use `uv sync --no-default-groups`.
⚠️ You need to copy `./sm2a-local-config/env_example` to `./sm2a-local-config/.env`. You can define AWS credentials or other custom environment variables in the `.env` file.

⚠️ If you update the `./sm2a-local-config/.env` file, you should run `make sm2a-local-run` again.
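The copy step above, run from the repository root, is simply:

```bash
# Create your local env file from the provided example
cp ./sm2a-local-config/env_example ./sm2a-local-config/.env
# Then edit .env to add AWS credentials or other custom environment variables
```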
To retrieve the variables for a stage that has already been deployed, AWS Secrets Manager can be used to quickly populate a `.env` file with `scripts/sync-env-local.sh`:

```bash
./scripts/sync-env-local.sh <app-secret-name>
```
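Conceptually, this reads the application secret and writes its key/value pairs into `.env`. The sketch below shows a rough AWS CLI equivalent, assuming the secret stores a flat JSON map of environment variables and that `jq` is available; the actual script may differ.

```bash
# Hypothetical sketch of a secrets-to-.env sync; the real script may differ.
# Assumes the secret's SecretString is a flat JSON object of KEY: value pairs.
SECRET_NAME=<app-secret-name>

aws secretsmanager get-secret-value \
  --secret-id "$SECRET_NAME" \
  --query SecretString \
  --output text \
  | jq -r 'to_entries[] | "\(.key)=\(.value)"' > .env
```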
**Important:** Be careful not to check in `.env` (or whatever you called your env file) when committing work.
Currently, the client ID and domain of an existing Cognito user pool programmatic client must be supplied in configuration as `VEDA_CLIENT_ID` and `VEDA_COGNITO_DOMAIN` (the veda-auth project can be used to deploy a Cognito user pool and client). To dispense auth tokens via the workflows API Swagger docs, an administrator must add the ingest API Lambda URL to the allowed callbacks of the Cognito client.
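For local development these typically go into your `.env` file; the values below are placeholders, not real identifiers:

```bash
# Placeholder values; use the client ID and domain of your own Cognito user pool client
VEDA_CLIENT_ID=<your-cognito-app-client-id>
VEDA_COGNITO_DOMAIN=<your-cognito-domain>
```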
- Build services

  ```bash
  make sm2a-local-build
  ```

- Initialize the metadata DB

  **Note:** This command is typically required only once, at the beginning. After running it, you generally do not need to run it again unless you run `make clean`, which will require you to reinitialize SM2A with `make sm2a-local-init`.

  ```bash
  make sm2a-local-init
  ```

  This will create an Airflow user with username `airflow` and password `airflow`.
- Start all services

  ```bash
  make sm2a-local-run
  ```

  This will start the SM2A services at http://localhost:8080 (see the optional health check after this list).
- Stop all services

  ```bash
  make sm2a-local-stop
  ```
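Once the services are up, you can optionally confirm that the webserver is responding. This sketch assumes the standard Airflow `/health` endpoint is exposed on the local port:

```bash
# Optional: check that the local Airflow webserver is up
# (assumes the standard Airflow /health endpoint)
curl -s http://localhost:8080/health
```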
This project uses Terraform modules to deploy Apache Airflow and related AWS resources. Typically, your code will deploy automatically via GitHub Actions after your pull request has been approved and merged.
- `cicd.yml` defines multiple jobs that:
  - Run the linter
  - Run the unit tests
  - Determine the environment where the deployment will happen
- `deploy.yml` uses OpenID Connect (OIDC) to obtain AWS credentials and deploys the Terraform modules to AWS. The necessary environment variables are retrieved from AWS Secrets Manager using a Python script.
- `gitflow.yml` provides a structured way to manage the development, testing, and deployment of the Terraform modules. For more info, refer to gitflow.
To log in to the Airflow UI, you must be added to a specific GitHub team, which depends on the deployed Airflow configuration. Contact your instance administrator to be added to the appropriate team with access to the Airflow instance. Once added, you can log in by visiting https://<domain_name> and using your GitHub credentials.
Please review the docs folder for more information on how to create a DAG, add tasks, and manage dependencies and variables.
The DAGs are defined in Python files located in the `dags` directory. Each DAG is a Python module that defines a DAG object. The DAGs are scheduled by the Airflow scheduler. Since we aim to keep the scheduler lightweight, every task-dependent library should be imported inside the tasks rather than at the DAG (module) level. Example: let's assume we need the numpy library in a task; we should not import it like this:
```python
from airflow import DAG
import pendulum
from airflow.operators.python import PythonOperator
from airflow.operators.empty import EmptyOperator
import numpy as np  # module-level import: the scheduler must have numpy installed


def foo_task():
    # ML_processing stands in for the task's actual processing logic
    process = ML_processing(np.random.rand())
    return process
```
But rather like this:
```python
from airflow import DAG
import pendulum
from airflow.operators.python import PythonOperator
from airflow.operators.empty import EmptyOperator


def foo_task():
    import numpy as np  # task-level import: only the worker running the task needs numpy

    # ML_processing stands in for the task's actual processing logic
    process = ML_processing(np.random.rand())
    return process
```
This way, the scheduler won't need numpy installed in order to schedule the task; only the workers that execute it will.
The DAG Launcher role in Airflow is designed to provide users with the necessary permissions to manage and launch DAGs in the Airflow UI. This role allows users to perform actions such as reading DAG runs and interacting with various views like Task Instances, Jobs, and XComs.
The permissions granted are tailored to streamline the interaction with the DAG management interface, enabling effective monitoring and control over DAG executions.
- Permissions on DAG runs for `veda_discover`, `veda_dataset_pipeline`, and `veda_collection_pipeline`.
- Read access to "My Profile", "DAG Runs", "Jobs", "Task Instances", "XComs", "DAG Dependencies", "Task Logs", and "Website".
- Create, Read, Edit, and Menu Access for DAG runs and DAG-related views (e.g., DAGs, Documentation).
- Edit access to specific DAGs, such as `veda_discover`.
- **Integrating GitHub Users with the Airflow DAG UI:** To grant users access to the Airflow UI, including the ability to manage DAGs and view the Swagger interface, GitHub users must be added to a GitHub team.