Install Minikube by following the official guide: https://minikube.sigs.k8s.io/docs/start/
Docker is also required; see the installation instructions for Ubuntu: https://docs.docker.com/engine/install/ubuntu/
Add your user to the docker group (log out and back in for the change to take effect), then start Minikube:
$ sudo usermod -aG docker $USER
$ minikube start
$ minikube dashboard
Point your local Docker CLI at Minikube's Docker daemon, then build the image:
$ eval $(minikube docker-env)
$ docker build -t spark-hadoop:latest -f ./docker/Dockerfile ./docker
Create the deployments and services:
$ kubectl create -f ./kubernetes/spark-master-deployment.yaml
$ kubectl create -f ./kubernetes/spark-master-service.yaml
$ kubectl create -f ./kubernetes/spark-worker-deployment.yaml
Enable the Ingress addon and apply the ingress rule:
$ minikube addons enable ingress
$ kubectl apply -f ./kubernetes/minikube-ingress.yaml
Add an entry to /etc/hosts that maps the Minikube IP to the host declared in ./kubernetes/minikube-ingress.yaml (shown here as the placeholder <ingress-host>):
$ echo "$(minikube ip) <ingress-host>" | sudo tee -a /etc/hosts
Run the following in the Spark shell to check that everything works:
// Simple word count: fold the words into a Map of word -> occurrences
val myWords = "HI HI HOW ARE YOU HAH"
val mySplit = myWords.split(" ").foldLeft(Map.empty[String, Int]) {
  (count, word) => count + (word -> (count.getOrElse(word, 0) + 1))
}
// Expected: Map(HI -> 2, HOW -> 1, ARE -> 1, YOU -> 1, HAH -> 1)
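Note that the snippet above runs entirely on the driver. To confirm that work is actually distributed to the executors, here is a minimal PySpark sketch of the same word count; the master URL spark://spark-master:7077 is an assumption and should be replaced with the address of your Spark master service if it differs.

# Hedged sketch: a distributed word count in PySpark.
# The master URL is an assumption -- adjust it to match your Spark master service.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("word-count-check")
         .master("spark://spark-master:7077")  # assumed service name and port
         .getOrCreate())

words = spark.sparkContext.parallelize("HI HI HOW ARE YOU HAH".split(" "))
counts = (words.map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b)
               .collect())
print(counts)  # e.g. [('HI', 2), ('HOW', 1), ('ARE', 1), ('YOU', 1), ('HAH', 1)]

spark.stop()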
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads. But it is much more than that!
This is a rewritten notebook example from this blog post by Databricks. The intention is to show why Delta Lake is a big deal and how to run Delta Lake without Databricks services; a condensed code sketch follows the list below.
Delta Lake examples in this notebook:
- Convert data to Delta Lake format
- Create Delta Lake table
- Spark SQL capabilities
- Delete data
- Update data
- View the audit history of a table
- Merge (upsert) two tables: remove duplicates, update rows, and add new rows
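Here is a condensed, hedged sketch of these operations in PySpark. It assumes the delta-spark pip package is installed; the table path, table name, and column names (/tmp/loans_delta, loans, loan_id, status) are illustrative, not taken from the notebook.

# Hedged sketch of the Delta Lake operations listed above.
# Assumes: pip install delta-spark. Paths, table and column names are illustrative.
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

builder = (SparkSession.builder
           .appName("delta-demo")
           .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog",
                   "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Convert data to Delta Lake format: write a DataFrame out as Delta
data = spark.createDataFrame([(1, "paid"), (2, "open")], ["loan_id", "status"])
data.write.format("delta").mode("overwrite").save("/tmp/loans_delta")

# Create a Delta Lake table and query it with Spark SQL
spark.sql("CREATE TABLE IF NOT EXISTS loans USING DELTA LOCATION '/tmp/loans_delta'")
spark.sql("SELECT * FROM loans").show()

# Delete and update data in place
table = DeltaTable.forPath(spark, "/tmp/loans_delta")
table.delete(col("status") == "paid")
table.update(col("loan_id") == 2, {"status": "'closed'"})

# View the audit history of the table
table.history().show()

# Merge (upsert): update matching rows, insert new ones
updates = spark.createDataFrame([(2, "open"), (3, "new")], ["loan_id", "status"])
(table.alias("t")
      .merge(updates.alias("u"), "t.loan_id = u.loan_id")
      .whenMatchedUpdateAll()
      .whenNotMatchedInsertAll()
      .execute())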
Generate the .py files from the notebook:
$ pip install -r requirements.txt
$ python ipynb2py.py
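For reference, a minimal sketch of what a converter like ipynb2py.py might do, using the nbformat library; this is an assumption about the script's approach, and the actual implementation may differ.

# Hedged sketch: extract code cells from .ipynb files into .py files.
# nbformat usage is an assumption about how ipynb2py.py works.
import glob
import nbformat

for path in glob.glob("*.ipynb"):
    nb = nbformat.read(path, as_version=4)
    code = "\n\n".join(c.source for c in nb.cells if c.cell_type == "code")
    out = path.replace(".ipynb", ".py")
    with open(out, "w") as f:
        f.write(code + "\n")
    print(f"wrote {out}")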