
Modify metaflow tutorial to use k8s to run metaflow flows #216

Draft
wants to merge 1 commit into base: master
70 changes: 18 additions & 52 deletions docs/examples/metaflow/README.md
@@ -48,11 +48,11 @@ client = run.data.client_model
client.predict(np.array([[1, 2, 3, 4]]))
```

## Run Flow on AWS and Deploy to Remote Kubernetes
## Run Flow and Deploy to Remote Kubernetes

We will now run our flow on AWS Batch and will launch Tempo artifacts onto a remote Kubernetes cluster.
We will now run our flow and deploy Tempo artifacts onto a remote Kubernetes cluster.

### Setup AWS Metaflow Support
### Setup Metaflow

[Install Metaflow with remote AWS support](https://docs.metaflow.org/metaflow-on-aws/metaflow-on-aws).
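If Metaflow is not yet configured on the machine you run the flow from, the interactive configurator is the quickest route. A minimal sketch from a notebook cell; the actual values come from your own AWS deployment of Metaflow:

```python
# One-time interactive setup of the S3 datastore, metadata service, etc.
# (writes ~/.metaflowconfig/config.json)
!metaflow configure aws

# Inspect the configuration the flow will use
!metaflow configure show
```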

@@ -64,48 +64,8 @@ For deploying to a remote Kubernetes cluster with Seldon Core installed do the following:

Create a GKE cluster and install Seldon Core on it using the [Ansible collection for installing Seldon Core on a Kubernetes cluster](https://github.com/SeldonIO/ansible-k8s-collection).
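A rough sketch of that install from a notebook cell, assuming the collection's playbooks are used as published in the repository above (the playbook path is illustrative, so check the collection's README for the current names; the Seldon Core controller normally ends up in the `seldon-system` namespace):

```python
# Install the Ansible collection that ships the Seldon Core playbooks
!ansible-galaxy collection install git+https://github.com/SeldonIO/ansible-k8s-collection.git

# Playbook name is illustrative -- see the collection's README for the exact path
!ansible-playbook playbooks/seldon_core.yaml

# Verify that the Seldon Core controller is running
!kubectl get pods -n seldon-system
```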


### K8S Auth from Metaflow

To deploy services to our Kubernetes cluster with Seldon Core installed, Metaflow steps that run on AWS Batch and use Tempo will need to be able to access the K8S API. This step will depend on whether you're using GKE or AWS EKS to run your cluster.

#### Option 1. K8S cluster runs on GKE

We will need to create two files in the flow src folder:

```bash
kubeconfig.yaml
gsa-key.json
```

Follow the steps outlined in [GKE server authentication](https://cloud.google.com/kubernetes-engine/docs/how-to/api-server-authentication#environments-without-gcloud).




#### Option 2. K8S cluster runs on AWS EKS

Make note of two AWS IAM role names, for example find them in the IAM console. The names depend on how you deployed Metaflow and EKS in the first place:

1. The role used by Metaflow tasks executed on AWS Batch. If you used the default CloudFormation template to deploy Metaflow, it is the role that has `*BatchS3TaskRole*` in its name.

2. The role used by EKS nodes. If you used `eksctl` to create your EKS cluster, it is the role that starts with `eksctl-<your-cluster-name>-NodeInstanceRole-*`

Now, we need to make sure that the AWS Batch role has permissions to access the K8S cluster. For this, add a policy to the AWS Batch task role (1) that has `eks:*` permissions on your EKS cluster (TODO: narrow this down).

You'll also need to add a mapping for that role to the `aws-auth` ConfigMap in the `kube-system` namespace. For more details, see the [AWS docs](https://docs.aws.amazon.com/eks/latest/userguide/add-user-role.html) (under "To add an IAM user or role to an Amazon EKS cluster"). In short, you'd need to add this to the `mapRoles` section of the aws-auth ConfigMap:
```
- rolearn: <batch task role ARN>
username: cluster-admin
groups:
- system:masters
```

We also need to make sure that the code running in K8S can access S3. For this, add a policy to the EKS node role (2) to allow it to read and write Metaflow S3 buckets.

### S3 Authentication
Services deployed to Seldon will need to access Metaflow S3 bucket to download trained models. The exact configuration will depend on whether you're using GKE or AWS EKS to run your cluster.
Both the services deployed to Seldon and the Metaflow step code will need to access the Metaflow S3 bucket to download trained models. The exact configuration will depend on whether you're using GKE or AWS EKS to run your cluster.

From the base templates provided below, create your `k8s/s3_secret.yaml`.

@@ -131,7 +91,7 @@ For GKE, to access S3 we'll need to add the following variables to use key/secret

For AWS EKS, we'll use the instance role assigned to the node, so we only need to set one env variable:
```yaml
RCLONE_CONFIG_S3_ENV_AUTH: "true"
RCLONE_CONFIG_S3_ENV_AUTH: "true"
```
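For the env-auth route to work, the node's instance role must itself be allowed to read the Metaflow bucket. A sketch with the AWS CLI, using the broad AWS-managed read-only policy and a placeholder role name (in practice, scope a custom policy to the Metaflow bucket and include write access if your steps upload artifacts):

```python
# <NodeInstanceRole> is a placeholder -- use the EKS node role noted for your cluster
!aws iam attach-role-policy \
    --role-name <NodeInstanceRole> \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
```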

We provide two templates to use in the `k8s` folder:
@@ -143,35 +103,41 @@ s3_secret.yaml.tmpl.gke

Use one of them to create the file `s3_secret.yaml` in the same folder.

Create a Secret from the `k8s/s3_secret.yaml.tmpl` file by adding your AWS key that can read from S3 and saving the result as `k8s/s3_secret.yaml`.

```python
!kubectl create -f k8s/s3_secret.yaml -n production
```
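After applying it, confirm the Secret exists under the name the flow references later (`authSecretName: "s3-secret"` in `src/irisflow.py`):

```python
!kubectl get secret s3-secret -n production
```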

## Setup RBAC and Secret on Kubernetes Cluster

These steps assume you have already authenticated to your cluster and have a working kubectl configuration for it.


Create a namespace and set up RBAC for Seldon deployments
```python
!kubectl create ns production
```


```python
!kubectl create -f k8s/tempo-pipeline-rbac.yaml -n production
```

Create a Secret from the `k8s/s3_secret.yaml.tmpl` file by adding your AWS key that can read from S3 and saving the result as `k8s/s3_secret.yaml`.

Create a namespace and set up RBAC for Metaflow jobs running on Kubernetes
```python
!kubectl create ns metaflow
```

```python
!kubectl create -f k8s/s3_secret.yaml -n production
!kubectl create -f k8s/metaflow-pipeline-rbac.yaml -n production
```
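To sanity-check the binding, ask the API server whether the `default` ServiceAccount in the `metaflow` namespace (the subject targeted by `metaflow-pipeline-rbac.yaml` below) can now manage SeldonDeployments in `production`:

```python
!kubectl auth can-i create seldondeployments.machinelearning.seldon.io \
    -n production --as=system:serviceaccount:metaflow:default
```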

## Run Metaflow on AWS Batch

## Run Metaflow

```python
!python src/irisflow.py \
--environment=conda \
--with batch:image=seldonio/seldon-core-s2i-python37-ubi8:1.10.0-dev \
--with kubernetes:image=seldonio/seldon-core-s2i-python37-ubi8:1.10.0-dev \
run
```
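When the run completes, the remotely deployed model can be called from a notebook in the same way as shown at the top of this README, through the `client_model` artifact (a sketch; `IrisFlow` is the flow class defined in `src/irisflow.py`):

```python
import numpy as np
from metaflow import Flow

# Latest successful run of the flow and its Tempo client artifact
run = Flow("IrisFlow").latest_successful_run
client = run.data.client_model

# Send a prediction to the model deployed on the remote cluster
print(client.predict(np.array([[1, 2, 3, 4]])))
```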

31 changes: 31 additions & 0 deletions docs/examples/metaflow/k8s/metaflow-pipeline-rbac.yaml
@@ -0,0 +1,31 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: metaflow-pipeline
rules:
- apiGroups:
- machinelearning.seldon.io
resources:
- seldondeployments
- seldondeployments/status
verbs:
- "*"
- apiGroups:
- serving.kubeflow.org
resources:
- inferenceservices/status
verbs:
- get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: metaflow-pipeline-rolebinding
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: metaflow-pipeline
subjects:
- kind: ServiceAccount
name: default
namespace: metaflow
22 changes: 4 additions & 18 deletions docs/examples/metaflow/src/irisflow.py
@@ -1,5 +1,7 @@
from metaflow import FlowSpec, IncludeFile, Parameter, conda, step
from utils import pip
import os


PIPELINE_FOLDER_NAME = "classifier"
SKLEARN_FOLDER_NAME = "sklearn"
@@ -28,14 +30,6 @@ class IrisFlow(FlowSpec):
conda_env = IncludeFile(
"conda_env", help="The path to conda environment for classifier", default=script_path("conda.yaml")
)
kubeconfig = IncludeFile("kubeconfig", help="The path to kubeconfig", default=script_path("kubeconfig.yaml"))
gsa_key = IncludeFile(
"gsa_key", help="The path to google service account json", default=script_path("gsa-key.json")
)
k8s_provider = Parameter(
"k8s_provider", help="kubernetes provider. Needed for non local run to deploy", default="gke"
)
eks_cluster_name = Parameter("eks_cluster_name", help="AWS EKS cluster name (if using EKS)", default="")

@conda(libraries={"scikit-learn": "0.24.1"})
@step
Expand Down Expand Up @@ -111,17 +105,9 @@ def deploy_tempo_remote(self, classifier):
import numpy as np

from tempo import deploy_remote
from tempo.metaflow.utils import aws_authenticate, gke_authenticate
from tempo.serve.deploy import get_client
from tempo.serve.metadata import SeldonCoreOptions

if self.k8s_provider == "gke":
gke_authenticate(self.kubeconfig, self.gsa_key)
elif self.k8s_provider == "aws":
aws_authenticate(self.eks_cluster_name)
else:
raise Exception(f"Unknown Kubernetes Provider {self.k8s_provider}")

runtime_options = SeldonCoreOptions(
**{"remote_options": {"namespace": "production", "authSecretName": "s3-secret"}}
)
@@ -155,8 +141,8 @@ def tempo(self):
from tempo.metaflow.utils import running_aws_batch

classifier, s3_active = self.create_tempo_artifacts()
if s3_active and running_aws_batch(self.tempo):
print("Deploying to remote k8s cluster")
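# KUBERNETES_SERVICE_HOST is injected by Kubernetes into every pod, so its
# presence indicates this step is running in-cluster and should deploy remotely.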
if os.getenv("KUBERNETES_SERVICE_HOST"):
print("Deploying to k8s cluster")
self.deploy_tempo_remote(classifier)
else:
print("Deploying to local Docker")