This repository contains a collection of Jupyter notebooks demonstrating exploratory data analysis techniques using Kubox AI.
- Introduction
- AWS Cloud Setup
- Install Kubox CLI
- Basic EDA on NYC Taxi Dataset
- Local Development
- Contributing
- License
Tip
Kubox is currently in its early-stage public preview and under active development. We’re continuously improving and refining the platform, so things may change as we grow. We welcome your feedback and suggestions to help shape the future of Kubox AI.
In these examples, we showcase how deploying a data platform co-located with the data (data locality) can reduce cost and time. Consider a scenario where you run a query to analyse the 300 GB New York taxi dataset stored in AWS S3 from a Google Colab notebook. This setup would result in USD 20 in AWS egress costs and take over an hour just to download the data. For more information, see this blog post.
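As a rough sanity check, the figures quoted above can be turned into an implied per-GB egress rate and an implied download throughput. The sketch below is only arithmetic on the numbers in this README (300 GB, USD 20, one hour); the function names are hypothetical, decimal units (1 GB = 1000 MB) are assumed, and actual AWS pricing and network speeds will differ.

```shell
#!/bin/sh
# Back-of-the-envelope arithmetic on the README's example figures.
# Assumption: decimal units, 1 GB = 1000 MB.

# implied_rate USD GB -> effective egress price in USD per GB
implied_rate() {
  awk -v usd="$1" -v gb="$2" 'BEGIN { printf "%.3f\n", usd / gb }'
}

# implied_throughput GB SECONDS -> sustained throughput in MB/s
implied_throughput() {
  awk -v gb="$1" -v s="$2" 'BEGIN { printf "%.1f\n", gb * 1000 / s }'
}

implied_rate 20 300          # USD per GB implied by USD 20 for 300 GB
implied_throughput 300 3600  # MB/s needed to finish in exactly one hour
```

A download that takes more than an hour therefore corresponds to a sustained throughput below the second figure, which is the cost that running the analysis next to the data is meant to avoid.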
Create an AWS IAM role for the EC2 Instances
If you are creating a GPU-based Kubox cluster outside ap-southeast-2, you will need to create an Amazon Machine Image (AMI) in your desired AWS region.
Download and install the kubox command-line tool, a single binary required to create a Kubernetes-based Kubox cluster.
curl https://kubox.sh | sh
Set up the AWS CLI
aws configure
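Before creating a cluster, it can help to confirm that the command-line tools used in this README are on your PATH. The helper below is a minimal sketch; `require_tools` is a hypothetical name, not part of Kubox.

```shell
#!/bin/sh
# Hypothetical helper: report every tool that is not on PATH and
# return non-zero if at least one is missing.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example usage (not run here):
#   require_tools aws kubox kubectl && aws sts get-caller-identity
```

The `aws sts get-caller-identity` call in the usage comment is a standard AWS CLI command that prints the account and identity your configured credentials resolve to, which is a quick way to confirm that `aws configure` worked.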
Tip
If you are new to Kubox, see how to create your Hello World AWS Cluster.
https://docs.kubox.ai/examples/nyc-taxi
Clone the Kubox notebooks repository to your local machine:
git clone https://github.com/kubox-ai/notebooks.git
Create a Kubox cluster from the root of the cloned repository:
kubox create -f cluster-basic.yaml
If you have issues creating the cluster, see the troubleshooting guide: https://docs.kubox.ai/kb/troubleshooting
A kubeconfig file will be generated as part of the cluster creation process. Set it as an environment variable:
export KUBECONFIG=./basic/cluster/config/kubeconfig
Verify that you can connect to the Kubernetes cluster:
kubectl get pods -n kubox
Port forward the notebook server to your local machine:
kubectl port-forward service/notebook 8080:80 -n kubox
Open your browser and navigate to http://localhost:8080
Delete the cluster when done:
kubox delete -f cluster-basic.yaml
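The kubeconfig path exported in the steps above is relative, so `kubectl` may stop finding it once you change directory in the same shell. One possible workaround, sketched below, is to export an absolute path instead. `use_kubeconfig` is a hypothetical helper, and its default path is simply the one shown above.

```shell
#!/bin/sh
# Hypothetical helper: export KUBECONFIG as an absolute path and fail
# early if the file does not exist.
use_kubeconfig() {
  cfg="${1:-./basic/cluster/config/kubeconfig}"
  if [ ! -f "$cfg" ]; then
    echo "kubeconfig not found: $cfg" >&2
    return 1
  fi
  # Resolve the containing directory so the path survives later `cd` calls.
  dir="$(cd "$(dirname "$cfg")" && pwd)"
  KUBECONFIG="$dir/$(basename "$cfg")"
  export KUBECONFIG
}

# Example usage, from the root of the cloned repository (not run here):
#   use_kubeconfig && kubectl get pods -n kubox
```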
Navigate to the basic or gpu directory
cd ./basic
Create a local Python virtual environment and activate it. We use pyenv to manage our Python versions; you can use pyenv install to install a Python version.
Set the current python version to 3.11.9
pyenv shell 3.11.9
Create a virtual environment
python -m venv .venv
Activate the virtual environment
source .venv/bin/activate
Install poetry
cd code
pip install poetry
Install dependencies
poetry install
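If dependency installation fails, one thing worth checking is whether the active interpreter is the version selected with pyenv above. The sketch below compares a version string against the 3.11 series. `is_expected_python` is a hypothetical helper, and treating any 3.11.x patch release as acceptable is an assumption, since this README only names 3.11.9.

```shell
#!/bin/sh
# Hypothetical helper: succeed only for Python 3.11.x version strings.
# Assumption: any 3.11 patch release is acceptable, not just 3.11.9.
is_expected_python() {
  case "$1" in
    3.11.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example usage inside the activated virtual environment (not run here):
#   is_expected_python "$(python -c 'import platform; print(platform.python_version())')" \
#     || echo "unexpected Python version" >&2
```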
We welcome contributions! If you find a bug, have a feature request, or want to improve the notebooks, feel free to open an issue or submit a pull request.
This repository is licensed under the Apache License 2.0. You are free to use, modify, and distribute this project under the terms of the license. See the LICENSE file for more details.