This is your new Kedro project with Kedro-Viz setup, which was generated using Kedro 0.19.6.
Take a look at the Kedro documentation to get started.
In order to get the best out of the template:
- Don't remove any lines from the `.gitignore` file we provide (an illustrative sketch of its intent follows this list)
- Make sure your results can be reproduced by following a data engineering convention
- Don't commit data to your repository
- Don't commit any credentials or your local configuration to your repository. Keep all your credentials and local configuration in `conf/local/`
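As an illustration only (the `.gitignore` generated with your project is the authoritative version and may differ), the kind of entries that keep data and local configuration out of version control look roughly like this:

```
# illustrative sketch only - not the exact file shipped with the template
# keep local configuration and credentials out of version control
conf/local/**
conf/**/*credentials*
# never commit data, but keep the empty folder placeholders
data/**
!**/.gitkeep
```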
Declare any dependencies in `requirements.txt` for `pip` installation. To install them, run:
pip install -r requirements.txt
You can run your Kedro project with:
kedro run
Have a look at the files `src/tests/test_run.py` and `src/tests/pipelines/data_science/test_pipeline.py` for instructions on how to write your tests. Run the tests as follows:
pytest
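As a rough illustration (the function and file names here are hypothetical, not taken from this project), a test typically imports a node function and checks its behaviour on a small input:

```python
# hypothetical sketch of a node test, e.g. in src/tests/pipelines/data_science/test_pipeline.py
import pandas as pd


def drop_missing_rows(df: pd.DataFrame) -> pd.DataFrame:
    # stand-in for a real node function imported from your pipeline
    return df.dropna()


def test_drop_missing_rows():
    df = pd.DataFrame({"price": [1.0, None, 3.0]})
    result = drop_missing_rows(df)
    assert len(result) == 2
    assert result["price"].notna().all()
```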
To configure the coverage threshold, look at the `.coveragerc` file.
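For reference, a coverage threshold is usually set via coverage.py's `fail_under` option; the value below is illustrative, not necessarily what your generated file contains:

```ini
[report]
fail_under = 80
show_missing = true
```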
To see and update the dependency requirements for your project, use `requirements.txt`. Install the project requirements with `pip install -r requirements.txt`.
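As a hedged example of what the file might contain (packages and versions are illustrative; your generated `requirements.txt` is the source of truth):

```
kedro~=0.19.6
kedro-datasets
kedro-viz
pytest
pytest-cov
```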
Further information about project dependencies
Note: Using `kedro jupyter` or `kedro ipython` to run your notebook provides these variables in scope: `catalog`, `context`, `pipelines` and `session`.

Jupyter, JupyterLab, and IPython are already included in the project requirements by default, so once you have run `pip install -r requirements.txt` you will not need to take any extra steps before you use them.
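As a quick, illustrative example of those variables (the dataset name is a placeholder; use one registered in your own `conf/base/catalog.yml`):

```python
# Inside a notebook started with `kedro jupyter notebook` or `kedro ipython`
print(catalog.list())              # dataset names registered in the Data Catalog
print(context.project_path)        # project root, exposed via the KedroContext
print(list(pipelines.keys()))      # names of the registered pipelines

companies = catalog.load("companies")  # "companies" is a placeholder dataset name
session.run()                          # run the project's default pipeline
```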
To use Jupyter notebooks in your Kedro project, you need to install Jupyter:
pip install jupyter
After installing Jupyter, you can start a local notebook server:
kedro jupyter notebook
To use JupyterLab, you need to install it:
pip install jupyterlab
You can also start JupyterLab:
kedro jupyter lab
And if you want to run an IPython session:
kedro ipython
To automatically strip out all output cell contents before committing to `git`, you can use tools like `nbstripout`. For example, you can add a hook in `.git/config` with `nbstripout --install`. This will run `nbstripout` before anything is committed to `git`.

Note: Your output cells will be retained locally.
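A minimal sequence, assuming you want `nbstripout` installed into this environment and hooked into this repository:

```bash
pip install nbstripout   # install the tool
nbstripout --install     # add the git filter to .git/config for this repo
```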
Further information about using notebooks for experiments within Kedro projects.
Further information about building project documentation and packaging your project.
Micromamba is a lightweight and fast package manager and environment manager from the Mamba project, designed to work with the Conda ecosystem. It's similar to Conda but aims to be more efficient and quicker in resolving dependencies and managing environments.
- Lightweight: Micromamba is much smaller in size compared to Conda.
- Fast: It has faster dependency resolution and environment management.
- No Python Dependency: Unlike Conda, Micromamba does not require Python to be installed, making it a more self-contained solution.
Yes, Micromamba can be used to create and manage virtual environments. Here’s a basic example of how to use Micromamba to create and activate a virtual environment:
- Install Micromamba: you can install Micromamba using a minimal installer or from pre-built binaries. For example: `curl -L https://micromamba.snakepit.net/api/micromamba/linux-64/latest | tar -xvj`
- Create a new environment: `micromamba create -n myenv python=3.8`. This command creates a new environment named `myenv` with Python 3.8.
- Activate the environment: `micromamba activate myenv`
- Install packages: `micromamba install numpy pandas`
- Deactivate the environment: `micromamba deactivate`
Micromamba is particularly useful for users who want a lightweight alternative to Conda while still leveraging the Conda ecosystem for package management.
- To initialize the current zsh shell, run: `eval "$(micromamba shell hook --shell zsh)"`
- Activate the virtual environment using the micromamba package manager: `micromamba activate myenv` (`spaceflights310` in this case)
- Install the requirements for the specific project: `pip install -r requirements.txt`
- For the training phase, run the following command: `spaceflights train`
- For the inference phase, run the following command: `spaceflights inference`
This guide will walk you through the steps to integrate Apache Airflow with your Kedro project in a Jupyter environment. The integration allows you to orchestrate your Kedro pipelines using Airflow's powerful scheduling and task management capabilities.
To get started, install Apache Airflow using the official guidelines. If you are working in a Python environment like conda or micromamba, you can install it with:
pip install apache-airflow kedro-airflow
pip install virtualenv "apache-airflow[cncf.kubernetes]"
Ensure that you initialize the Airflow database after installation:
export AIRFLOW_HOME=~/airflow # Define where Airflow will store its files
airflow db init # Initialize the Airflow database
The Kedro-Airflow plugin helps you convert Kedro pipelines into Airflow DAGs (Directed Acyclic Graphs).
From your project directory, run:
kedro airflow create
This will generate an Airflow DAG in the `airflow_dags/` directory based on your Kedro pipelines.
Copy the generated DAG file (`airflow_dags/my_project_name.py`) into Airflow's DAGs folder. For example:
mkdir -p ~/airflow/dags
cp airflow_dags/* ~/airflow/dags/
This makes the Kedro pipeline available as an Airflow DAG.
Airflow needs two processes: the web server to view and manage DAGs, and the scheduler to run them.
Open two terminal tabs and run:
- In the first tab, start the Airflow web server: `airflow webserver --port 8080`
- In the second tab, start the Airflow scheduler: `airflow scheduler`
Now you can access the Airflow UI in your browser at http://localhost:8080.
In the Airflow UI, you'll see your Kedro pipeline as a DAG. You can trigger it manually or schedule it to run at specific intervals.
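If you prefer the command line, the standard Airflow CLI can do the same; `your_dag_id` is a placeholder for the DAG name generated from your project:

```bash
airflow dags unpause your_dag_id   # DAGs start paused under the default configuration
airflow dags trigger your_dag_id   # start a manual run
```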
If you prefer running Kedro directly from a Jupyter notebook, you can trigger Airflow DAGs from within Jupyter using Airflow's CLI.
Here's how you can do it from Jupyter:
```python
import subprocess

# Trigger the DAG; replace "your_dag_id" with the DAG name generated for your project
dag_id = "your_dag_id"
subprocess.run(["airflow", "dags", "trigger", dag_id], check=True)
```
This will trigger your Airflow DAG (which runs your Kedro pipeline) from Jupyter.
Airflow uses Directed Acyclic Graphs (DAGs) to represent workflows. You need to create a DAG that triggers your Kedro pipeline. In your DAG file, import the necessary Kedro and Airflow modules, then define tasks that will run your Kedro pipeline.
Example structure of your Airflow DAG file:
```python
from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.operators.python import PythonOperator
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project


def run_kedro_pipeline():
    # Path to the root of your Kedro project (placeholder - adjust to your setup)
    project_path = Path("/path/to/your_kedro_project")
    bootstrap_project(project_path)
    with KedroSession.create(project_path=project_path) as session:
        session.run()


default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2023, 1, 1),
}

with DAG("your_kedro_dag", default_args=default_args, schedule_interval="@daily") as dag:
    run_pipeline = PythonOperator(
        task_id="run_kedro_pipeline",
        python_callable=run_kedro_pipeline,
    )
```
Save this Python file in your Airflow DAGs directory, typically located in `~/airflow/dags/`.
After setting up your DAG file:
- Restart the Airflow scheduler and web server.
- In the Airflow UI (http://localhost:8080), you should see the DAG listed.
- Trigger the DAG manually and verify that your Kedro pipeline runs successfully (a CLI alternative is shown after this list).
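To double-check from a terminal rather than the UI (using the example DAG name from the snippet above):

```bash
airflow dags list | grep your_kedro_dag    # confirm the DAG has been picked up
airflow dags list-runs -d your_kedro_dag   # inspect the state of recent runs
```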
If you're prompted to enter a username and password at http://localhost:8080, the default credentials are:

- Username: `airflow`
- Password: `airflow`

If this doesn't work, you may need to set up authentication by modifying the `airflow.cfg` file.
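If no account exists at all (common with a fresh standalone Airflow install), you can also create an admin user with the Airflow CLI; all of the values below are placeholders:

```bash
airflow users create \
  --username admin \
  --password admin \
  --firstname Ada \
  --lastname Lovelace \
  --role Admin \
  --email admin@example.com
```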
If you encounter an error about task instances still running when attempting to delete a DAG, follow these steps:
- Clear the task instances using the Airflow UI or from the command line: `airflow tasks clear <dag_id> --yes`
- Alternatively, find running task processes with `ps aux | grep airflow` and kill them manually: `kill -9 <task_instance_pid>`
Then, you can safely delete the DAG:
airflow dags delete <dag_id>
Sometimes you may encounter issues with Airflow's task runner processes that prevent you from deleting or stopping a DAG. To resolve this:
- Identify running Airflow processes using: `ps aux | grep airflow`
- Kill any running task runner or web server processes manually: `kill -9 <process_id>`
- Restart the Airflow scheduler and web server.
Integrating Kedro with Airflow allows you to leverage Airflow's orchestration and scheduling features with Kedro's pipeline framework. Once set up, Airflow will help you automate and monitor your Kedro pipelines with ease.
For more information, refer to the official documentation for Apache Airflow and Kedro.
If you want to start fresh with Apache Airflow and remove all existing configurations, databases, and processes, you can follow these steps. This approach will clear out your existing Airflow setup and allow you to start from scratch.
Ensure that all Airflow services (webserver, scheduler, etc.) are stopped. You can use the `pkill` or `kill` commands to stop them:
pkill -f "airflow"
Airflow stores various files and directories that you might want to remove. Be careful with this step as it will delete all your existing DAGs, configurations, and logs.
# Remove Airflow logs
rm -rf /path/to/your/airflow/logs
# Remove Airflow DAGs
rm -rf /path/to/your/airflow/dags
# Remove Airflow configurations
rm -rf /path/to/your/airflow/config
Replace `/path/to/your/airflow/` with the actual path to your Airflow installation. The default path is often `~/airflow`.
Airflow uses a database to store its metadata. If you're using SQLite (the default for a new installation), you can remove the database file:
rm /path/to/your/airflow/airflow.db
If you're using another database like PostgreSQL or MySQL, you'll need to drop the database or delete the schema manually.
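For example, with PostgreSQL you could drop and recreate the metadata database using the standard client tools (the database and role names below are placeholders), or let Airflow rebuild it for you:

```bash
dropdb airflow_db                         # placeholder database name
createdb --owner=airflow_user airflow_db  # placeholder owner role

# Alternatively, regardless of backend, burn down and rebuild the metadata database:
airflow db reset
```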
Remove any stale PID files that might be causing issues:
rm /path/to/your/airflow/airflow-webserver.pid
rm /path/to/your/airflow/airflow-scheduler.pid
You can reinstall Airflow to ensure a clean setup:
pip uninstall apache-airflow
pip install apache-airflow
If the Airflow webserver is already running and using port 8080, you need to stop it first. If you can't find the exact process ID (PID), you can use the `pkill` command to stop it:
pkill -f "airflow webserver"
pkill -f "airflow scheduler"
Alternatively, you can use `lsof` or `netstat` to find the PID of the process using port 8080 and then kill it:
lsof -i :8080
Look for the PID in the output and then use:
kill -9 <PID>
Airflow uses PID files to track running processes. If these files are stale, they might prevent Airflow from starting properly. Remove any existing PID files:
rm /path/to/your/airflow/airflow-webserver.pid
rm /path/to/your/airflow/airflow-scheduler.pid
Replace `/path/to/your/airflow/` with the actual path to your Airflow installation.
**Repeat the whole process from the beginning in order to have a successful Airflow initialization.**
If you continue to experience issues or if killing the processes does not work, consider these additional steps:
- Check for any lingering Airflow processes: ensure that there are no remaining Airflow processes running that might still be using port 8080.
- Restart your system: sometimes a reboot can help resolve lingering port conflicts, especially if the above steps do not free up the port.
- Change the port (if necessary): if you can't free up port 8080, consider running Airflow on a different port by specifying it with the `--port` flag (e.g., `--port 8081`), as shown below.
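For example, to run the web server on port 8081 instead:

```bash
airflow webserver --port 8081
```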