
Commit 96c04ac

adwk67, maltesander, and sbernauer authored
feat: Jupyterhub with keycloak, spark and s3 (#155)
* initial keycloak setup
* wip: jupyterhub + keycloak
* wip
* wip: certificates work but callback does not
* wip: various tweaks
* added some temp docs
* add login info
* added some readme info
* corrected ingress secret, set python cacert explicitly
* wip: working version
* clean-up realm-config
* delegate user check to Keycloak
* use demo-specific keycloak
* removed unnecessary settings
* specify ports
* add jupyterhub.yaml to stack
* wip: working nb/spark combo
* read/write from s3
* remove driver service resource in favour of the ones produced dynamically
* use secret for minio credentials, add demo entry
* set endpoints via extra config
* mount notebook
* user-specific job name
* add some notebook comments
* typos and add password to stack
* first draft of demo docs
* typo, fixed title
* added hdfs write/read steps
* updated docs
* doc cleanup
* Apply suggestions from code review (review comments, co-authored-by: Malte Sander <malte.sander.it@gmail.com>)
* review suggestions: remove HDFS, improve docs and server options
* Update docs/modules/demos/pages/jupyterhub-keycloak.adoc (co-authored-by: Sebastian Bernauer <sebastian.bernauer@stackable.de>)
* Update docs/modules/demos/pages/jupyterhub-keycloak.adoc (co-authored-by: Sebastian Bernauer <sebastian.bernauer@stackable.de>)
* Update docs/modules/demos/pages/jupyterhub-keycloak.adoc (co-authored-by: Malte Sander <malte.sander.it@gmail.com>)
* Update docs/modules/demos/pages/jupyterhub-keycloak.adoc (co-authored-by: Malte Sander <malte.sander.it@gmail.com>)
* added a note about proxy reachability

---------

Co-authored-by: Malte Sander <malte.sander.it@gmail.com>
Co-authored-by: Sebastian Bernauer <sebastian.bernauer@stackable.de>
1 parent 4803b96 commit 96c04ac

21 files changed: +1300 -0 lines changed

demos/demos-v2.yaml

Lines changed: 17 additions & 0 deletions

@@ -226,3 +226,20 @@ demos:
      cpu: "3"
      memory: 5098Mi
      pvc: 16Gi
  jupyterhub-keycloak:
    description: Demo showing jupyterhub notebooks secured with keycloak
    documentation: https://docs.stackable.tech/stackablectl/stable/demos/jupyterhub-keycloak.html
    stackableStack: jupyterhub-keycloak
    labels:
      - jupyterhub
      - keycloak
      - spark
      - S3
    manifests:
      # TODO: revert paths
      - plainYaml: demos/jupyterhub-keycloak/load-gas-data.yaml
    supportedNamespaces: []
    resourceRequests:
      cpu: 6400m
      memory: 12622Mi
      pvc: 20Gi
demos/jupyterhub-keycloak/load-gas-data.yaml

Lines changed: 21 additions & 0 deletions

---
apiVersion: batch/v1
kind: Job
metadata:
  name: load-gas-data
spec:
  template:
    spec:
      containers:
        - name: load-gas-data
          image: "bitnami/minio:2022-debian-10"
          command: ["bash", "-c", "cd /tmp; curl -O https://repo.stackable.tech/repository/misc/datasets/gas-sensor-data/20160930_203718.csv && mc --insecure alias set minio http://minio:9000/ $(cat /minio-s3-credentials/accessKey) $(cat /minio-s3-credentials/secretKey) && mc cp 20160930_203718.csv minio/demo/gas-sensor/raw/;"]
          volumeMounts:
            - name: minio-s3-credentials
              mountPath: /minio-s3-credentials
      volumes:
        - name: minio-s3-credentials
          secret:
            secretName: minio-s3-credentials
      restartPolicy: OnFailure
  backoffLimit: 50
docs/modules/demos/pages/jupyterhub-keycloak.adoc

Lines changed: 193 additions & 0 deletions

= jupyterhub-keycloak

:k8s-cpu: https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/#cpu
:spark-pkg: https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html
:pyspark: https://spark.apache.org/docs/latest/api/python/getting_started/index.html
:jupyterhub-k8s: https://github.com/jupyterhub/zero-to-jupyterhub-k8s
:jupyterlab: https://jupyterlab.readthedocs.io/en/stable/
:jupyter: https://jupyter.org
:keycloak: https://www.keycloak.org/
:gas-sensor: https://archive.ics.uci.edu/dataset/487/gas+sensor+array+temperature+modulation

This demo showcases the integration between {jupyter}[JupyterHub] and {keycloak}[Keycloak], deployed on the Stackable Data Platform (SDP) on a Kubernetes cluster.
{jupyterlab}[JupyterLab] is deployed using the {jupyterhub-k8s}[pyspark-notebook stack] provided by the Jupyter community.
A simple notebook is provided that shows how to start a distributed Spark cluster, reading data from and writing data to an S3 instance.

For this demo a small sample of {gas-sensor}[gas sensor measurements*] is provided.
Install this demo on an existing Kubernetes cluster:

[source,console]
----
$ stackablectl demo install jupyterhub-keycloak
----
WARNING: When running a distributed Spark cluster from within a JupyterHub notebook, the notebook acts as the driver and requests executor Pods from Kubernetes.
These Pods can in turn mount *all* volumes and Secrets in that namespace.
To prevent this from breaking user separation, it is planned to use an OPA gatekeeper to define OPA rules that restrict what the created executor Pods can mount. This is not yet implemented in this demo.

[#system-requirements]
== System requirements

To run this demo, your system needs at least:

* 8 {k8s-cpu}[cpu units] (core/hyperthread)
* 32GiB memory

You may need more resources depending on how many concurrent users are logged in and which notebook profiles they are using.

== Aim / Context

This demo shows how to authenticate JupyterHub users against a Keycloak backend using JupyterHub's OAuthenticator.
The same users as in the xref:end-to-end-security.adoc[End-to-end security] demo are configured in Keycloak, and these will be used as examples.
The notebook offers a simple template for using Spark to interact with S3 as a storage backend.

== Overview

This demo will:

* Install the required Stackable Data Platform operators
* Spin up the following data products:
** *JupyterHub*: A multi-user server for Jupyter notebooks
** *Keycloak*: An identity and access management product
** *S3*: A MinIO instance for data storage
* Download a sample of the gas sensor dataset into S3
* Install the Jupyter notebook
* Demonstrate some basic data operations against S3
* Illustrate multi-user usage

== JupyterHub

Have a look at the available Pods before logging in:

[source,console]
----
$ kubectl get pods
NAME                        READY   STATUS      RESTARTS   AGE
hub-84f49ccbd7-29h7j        1/1     Running     0          56m
keycloak-544d757f57-f55kr   2/2     Running     0          57m
load-gas-data-m6z5p         0/1     Completed   0          54m
minio-5486d7584f-x2jn8      1/1     Running     0          57m
proxy-648bf7f45b-62vqg      1/1     Running     0          56m
----

The `proxy` Pod has an associated `proxy-public` Service with a statically-defined port (31095), exposed with type NodePort. The `keycloak` Pod has a Service called `keycloak`, also of type NodePort with a fixed port (31093).
To reach the JupyterHub web interface, navigate to the `proxy-public` Service.
The node IP can be found in the ConfigMap `keycloak-address` (written by the Keycloak Deployment as it starts up).
On Kind this can be any node, not necessarily the one where the proxy Pod is running; this is due to the way in which Docker networking is used within the cluster.
On other clusters it will be necessary to use the exact Node on which the proxy is running.

In the example below that would be 172.19.0.5:31095:

[source,yaml]
----
apiVersion: v1
data:
  keycloakAddress: 172.19.0.5:31093 # Keycloak itself
  keycloakNodeIp: 172.19.0.5 # can be used to access the proxy-public service
kind: ConfigMap
metadata:
  name: keycloak-address
  namespace: default
----
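Given a ConfigMap like the one above, the two entry-point URLs can be derived from its data. A minimal sketch (the dictionary mirrors the example values; in a live cluster you would read them with `kubectl get cm keycloak-address`):

```python
# Compose the demo's entry-point URLs from the keycloak-address ConfigMap data.
configmap_data = {
    "keycloakAddress": "172.19.0.5:31093",  # Keycloak itself
    "keycloakNodeIp": "172.19.0.5",         # node IP for the proxy-public Service
}

PROXY_PUBLIC_NODEPORT = 31095  # statically-defined NodePort of proxy-public

jupyterhub_url = f"http://{configmap_data['keycloakNodeIp']}:{PROXY_PUBLIC_NODEPORT}"
keycloak_url = f"http://{configmap_data['keycloakAddress']}"

print(jupyterhub_url)  # http://172.19.0.5:31095
print(keycloak_url)    # http://172.19.0.5:31093
```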

NOTE: The `hub` Pod may show a `CreateContainerConfigError` for a few moments on start-up, as it requires the ConfigMap written by the Keycloak Deployment.

You should see the JupyterHub login page, which indicates a redirect to the OAuth service (Keycloak):

image::jupyterhub-keycloak/oauth-login.png[]

Click on the sign-in button.
You will be redirected to the Keycloak login, where you can enter one of the aforementioned users (e.g. `justin.martin` or `isla.williams`; the password is the same as the username):

image::jupyterhub-keycloak/keycloak-login.png[]

A successful login will redirect you back to JupyterHub, where different profiles are listed (the drop-down options are visible when you click on the respective fields):

image::jupyterhub-keycloak/server-options.png[]

The explorer window on the left includes a notebook that is already mounted.

Double-click on the file `notebook/process-s3.ipynb`:

image::jupyterhub-keycloak/load-nb.png[]

Run the notebook by selecting "Run All Cells" from the menu:

image::jupyterhub-keycloak/run-nb.png[]

The notebook includes some comments regarding image compatibility: it uses a custom image built on the official Spark image, matching the Spark version used in the notebook.
The Java versions also match exactly.
Python versions need to match at the major.minor level, which is why Python 3.11 is used in the custom image.
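The driver side of that major.minor constraint can be confirmed from inside any notebook cell with a quick stdlib check (a generic snippet, not part of the demo notebook itself):

```python
import sys

# Spark requires driver and executor Python to agree at the major.minor level;
# print the notebook kernel's version to compare against the executor image.
major_minor = f"{sys.version_info.major}.{sys.version_info.minor}"
print(major_minor)  # e.g. 3.11 in this demo's custom image
```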

Once the Spark executor has been started (we have specified `spark.executor.instances` = 1) it will spin up as an extra Pod.
We have named the Spark job to incorporate the current user (justin-martin).
JupyterHub has started a Pod for the user's notebook instance (`jupyter-justin-martin---bdd3b4a1`) and another one for the Spark executor (`process-s3-jupyter-justin-martin-bdd3b4a1-9e9da995473f481f-exec-1`):

[source,console]
----
$ kubectl get pods
NAME                                  READY   STATUS    RESTARTS   AGE
...
jupyter-justin-martin---bdd3b4a1      1/1     Running   0          17m
process-s3-jupyter-justin-martin-...  1/1     Running   0          2m9s
----
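The user-specific job name and single executor seen above come from the notebook's Spark configuration. A sketch of how such settings could be assembled (the helper function and the endpoint default are illustrative assumptions, not the notebook's verbatim code; the `fs.s3a.*` keys are standard Hadoop S3A settings):

```python
def spark_s3_conf(user: str, endpoint: str = "http://minio:9000") -> dict:
    """Build SparkSession config entries for S3A access against in-cluster MinIO.

    The app name embeds the current user, which is why the executor Pod
    name starts with process-s3-jupyter-<user>.
    """
    return {
        "spark.app.name": f"process-s3-jupyter-{user}",
        "spark.executor.instances": "1",  # as specified in the demo notebook
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }

conf = spark_s3_conf("justin-martin")
print(conf["spark.app.name"])  # process-s3-jupyter-justin-martin
```

Each key/value pair would be applied via `SparkSession.builder.config()` before starting the session.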

Stop the kernel in the notebook (which will shut down the Spark session and thus the executor) and log out as the current user.
Log in now as `daniel.king` and then again as `isla.williams` (you may need to do this in a clean browser session so that existing login cookies are removed).
The latter user has been defined as an admin user in the JupyterHub configuration:

[source,yaml]
----
...
hub:
  config:
    Authenticator:
      # don't filter here: delegate to Keycloak
      allow_all: True
      admin_users:
        - isla.williams
...
----
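The delegation to Keycloak itself is wired up through OAuthenticator settings in the same `hub.config` section. A sketch of what that wiring can look like with JupyterHub's GenericOAuthenticator (the client ID/secret and realm placeholders below are illustrative, not this stack's actual values):

```yaml
hub:
  config:
    JupyterHub:
      authenticator_class: generic-oauth
    GenericOAuthenticator:
      client_id: <client-id>          # placeholder
      client_secret: <client-secret>  # placeholder
      authorize_url: https://<keycloak-address>/realms/<realm>/protocol/openid-connect/auth
      token_url: https://<keycloak-address>/realms/<realm>/protocol/openid-connect/token
      userdata_url: https://<keycloak-address>/realms/<realm>/protocol/openid-connect/userinfo
      username_claim: preferred_username
```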

You should now see user-specific Pods for all three users:

[source,console]
----
$ kubectl get pods
NAME                               READY   STATUS    RESTARTS   AGE
...
jupyter-daniel-king---181a80ce     1/1     Running   0          6m17s
jupyter-isla-williams---14730816   1/1     Running   0          4m50s
jupyter-justin-martin---bdd3b4a1   1/1     Running   0          3h47m
...
----

The admin user (`isla.williams`) will also have an extra Admin tab in the JupyterHub console where current users can be managed.
You can find this in the JupyterHub UI at http://<ip>:31095/hub/admin, e.g. http://172.19.0.5:31095/hub/admin:

image::jupyterhub-keycloak/admin-tab.png[]

You can inspect the S3 buckets by using `stackablectl stacklet list` to return the MinIO endpoint and logging in there with `admin/adminadmin`:

[source,console]
----
$ stackablectl stacklet list

┌─────────┬───────────────┬───────────┬───────────────────────────────┬────────────┐
│ PRODUCT ┆ NAME          ┆ NAMESPACE ┆ ENDPOINTS                     ┆ CONDITIONS │
╞═════════╪═══════════════╪═══════════╪═══════════════════════════════╪════════════╡
│ minio   ┆ minio-console ┆ default   ┆ http http://172.19.0.5:32470  ┆            │
└─────────┴───────────────┴───────────┴───────────────────────────────┴────────────┘
----

image::jupyterhub-keycloak/s3-buckets.png[]

NOTE: If you attempt to re-run the notebook, you will first need to remove the `_temporary` folders from the S3 buckets.
These are created by Spark jobs and are not removed from the bucket when a job has completed.

*See: Burgués, Javier, Juan Manuel Jiménez-Soto, and Santiago Marco. "Estimation of the limit of detection in semiconductor gas sensors through linearized calibration models." Analytica Chimica Acta 1013 (2018): 13-25.
Burgués, Javier, and Santiago Marco. "Multivariate estimation of the limit of detection by orthogonal partial least squares in temperature-modulated MOX sensors." Analytica Chimica Acta 1019 (2018): 49-64.
