Commit b8cc2e9

docs: JupyterHub/Keycloak demo - add marketing input (#159)

* added marketing input
* cleanup
* minor text change

1 parent c5ca9ff commit b8cc2e9

File tree

1 file changed: +115 -19 lines changed

docs/modules/demos/pages/jupyterhub-keycloak.adoc

Lines changed: 115 additions & 19 deletions
@@ -9,21 +9,20 @@
:keycloak: https://www.keycloak.org/
:gas-sensor: https://archive.ics.uci.edu/dataset/487/gas+sensor+array+temperature+modulation

-This demo showcases the integration between {jupyter}[JupyterHub] and {keycloak}[Keycloak] deployed on the Stackable Data Platform (SDP) onto a Kubernetes cluster.
-{jupyterlab}[JupyterLab] is deployed using the {jupyterhub-k8s}[pyspark-notebook stack] provided by the Jupyter community.
-A simple notebook is provided that shows how to start a distributed Spark cluster, reading and writing data from an S3 instance.
+== Installation

-For this demo a small sample of {gas-sensor}[gas sensor measurements*] is provided.
-Install this demo on an existing Kubernetes cluster:
+To install the demo on an existing Kubernetes cluster, use the following command:

[source,console]
----
$ stackablectl demo install jupyterhub-keycloak
----

-WARNING: When running a distributed Spark cluster from within a JupyterHub notebook, the notebook acts as the driver and requests executor Pods from k8s.
-These Pods in turn can mount *all* volumes and Secrets in that namespace.
-To prevent this from breaking user separation, it is planned to use an OPA gatekeeper to define OPA rules that restrict what the created executor Pods can mount. This is not yet implemented in this demo.
+== Accessing the JupyterHub Interface
+
+* Navigate to the {jupyter}[JupyterHub] web interface using the NodePort IP and port (e.g., http://<ip>:31095)
+* Log in using the predefined user credentials (e.g., `justin.martin` or `isla.williams`, with the password matching the username)
+* Select a {jupyterhub-k8s}[notebook] profile (provided by the Jupyter community) and start processing data using the provided notebook

[#system-requirements]
== System requirements
@@ -33,27 +32,103 @@ To run this demo, your system needs at least:
* 8 {k8s-cpu}[cpu units] (core/hyperthread)
* 32GiB memory

-You may need more resources depending on how many concurrent users are logged in, and which notebook profiles they are using.
+Additional resources may be required depending on the number of concurrent users and their selected notebook profiles.

-== Aim / Context
+== Overview

-This demo shows how to authenticate JupyterHub users against a Keycloak backend using JupyterHub's OAuthenticator.
-The same users as in the xref:end-to-end-security.adoc[End-to-end-security] demo are configured in Keycloak and these will be used as examples.
-The notebook offers a simple template for using Spark to interact with S3 as a storage backend.
+The JupyterHub-Keycloak integration demo offers a comprehensive and secure multi-user data science environment on Kubernetes.
+This demo highlights several key features:

-== Overview
+* Secure Authentication: Utilizes {keycloak}[Keycloak] for robust user authentication and identity management
+* Dynamic Spark Integration: Demonstrates how to start a distributed Spark cluster directly from a Jupyter notebook, with dynamic resource allocation
+* S3 Storage Interaction: Illustrates reading from and writing to S3-compatible storage (MinIO) using Spark, with secure credential management
+* Scalable and Flexible: Leverages Kubernetes for scalable resource management, allowing users to select from predefined resource profiles
+* User-Friendly: Provides an intuitive interface for data scientists to perform common data operations with ease

This demo will:

* Install the required Stackable Data Platform operators
* Spin up the following data products:
-** *JupyterHub*: A multi-user server for Jupyter notebooks
-** *Keycloak*: An identity and access management product
-** *S3*: A Minio instance for data storage
-* Download a sample of the gas sensor dataset into S3
+** JupyterHub: A multi-user server for Jupyter notebooks
+** Keycloak: An identity and access management product
+** S3: A MinIO instance for data storage
+* Download a sample of {gas-sensor}[gas sensor measurements*] into S3
* Install the Jupyter notebook
* Demonstrate some basic data operations against S3
-* Illustrate multi-user usage
+* Enable multi-user usage
+
+== Introduction to the Demo
+
+The JupyterHub-Keycloak demo is designed to provide data scientists with a typical environment for data analysis and processing.
+This demo integrates JupyterHub with Keycloak for secure user management and utilizes Apache Spark for distributed data processing.
+The environment is deployed on a Kubernetes cluster, ensuring scalability and efficient resource utilization.
+
+NOTE: There are some security considerations to be aware of when using distributed Spark.
+Each Spark cluster runs using the same service account, and it is possible for an executor pod to mount any Secret in the namespace.
+It is planned to implement OPA gatekeeper rules in later versions of this demo to restrict this.
+Until this is implemented, users' environments are kept separate but not private.
+
+== Showcased Features of the Demo
+
+=== Secure User Authentication with Keycloak
+
+* **OAuthenticator**: JupyterHub is configured to use Keycloak for user authentication, ensuring secure and manageable access control (see the sketch below).
+* **Admin Users**: Certain users (e.g. `isla.williams` in this demo) are configured as admin users with access to user management features in the JupyterHub admin console.
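+
+A minimal sketch of how this might look in a `jupyterhub_config.py` (the client ID and secret shown here are placeholders, not the demo's actual values):
+
+[source,python]
+----
+from oauthenticator.generic import GenericOAuthenticator
+
+c.JupyterHub.authenticator_class = GenericOAuthenticator
+c.GenericOAuthenticator.client_id = "jupyterhub"           # placeholder
+c.GenericOAuthenticator.client_secret = "<client-secret>"  # placeholder
+c.GenericOAuthenticator.username_claim = "preferred_username"
+
+# Users granted access to the user management features in the admin console
+c.Authenticator.admin_users = {"isla.williams"}
+----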
+
+=== Dynamic Spark Configuration
+
+* **Client Mode**: Spark is configured to run in client mode, with the notebook acting as the driver.
+This setup is ideal for interactive data processing.
+* **Executor Management**: Spark executors are dynamically spawned as Kubernetes pods, with executor resources defined by each user's Spark session (see the sketch below).
+* **Compatibility**: Ensures compatibility between the driver and executors by matching Spark, Python, and Java versions.
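+
+As an illustration, a client-mode session with user-defined executor resources might be started from the notebook roughly as follows (the master URL and resource values are examples, not the demo's exact settings):
+
+[source,python]
+----
+from pyspark.sql import SparkSession
+
+# The notebook acts as the driver (client mode); executors are spawned as pods.
+spark = (
+    SparkSession.builder
+    .master("k8s://https://kubernetes.default.svc:443")  # in-cluster API server
+    .appName("jupyterhub-keycloak-demo")
+    .config("spark.kubernetes.container.image",
+            "oci.stackable.tech/sandbox/spark:3.5.2-python311")
+    .config("spark.executor.instances", "2")  # example resource settings
+    .config("spark.executor.memory", "1g")
+    .config("spark.executor.cores", "1")
+    .getOrCreate()
+)
+----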
+
+=== S3 Storage Integration
+
+* **MinIO**: Utilizes MinIO as an S3-compatible storage solution for storing and retrieving data.
+* **Secure Credential Management**: MinIO credentials are managed using Kubernetes Secrets, keeping them separate from notebook code (see the sketch below).
+* **Data Operations**: Demonstrates reading from and writing to S3 storage using Spark, with support for CSV and Parquet formats.
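+
+For instance, the credentials might be read from a Secret mounted into the notebook pod and handed to Spark's S3A connector roughly like this (the mount path and endpoint are illustrative):
+
+[source,python]
+----
+# Read MinIO credentials from files mounted from a Kubernetes Secret
+with open("/minio-s3-credentials/accessKey") as f:
+    access_key = f.read().strip()
+with open("/minio-s3-credentials/secretKey") as f:
+    secret_key = f.read().strip()
+
+# Point the S3A connector at MinIO (continuing from the session above)
+hconf = spark.sparkContext._jsc.hadoopConfiguration()
+hconf.set("fs.s3a.endpoint", "http://minio:9000")  # illustrative endpoint
+hconf.set("fs.s3a.access.key", access_key)
+hconf.set("fs.s3a.secret.key", secret_key)
+hconf.set("fs.s3a.path.style.access", "true")      # MinIO uses path-style URLs
+----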
+
+== Configuration Settings Overview
+
+=== Keycloak Configuration
+
+* **Deployment**: Keycloak is deployed using a Kubernetes Deployment with a ConfigMap for realm configuration.
+* **Services**: The Keycloak and JupyterHub services use fixed NodePorts (31093 for Keycloak and 31095 for JupyterHub).
+
+=== JupyterHub Configuration
+
+* **Authentication**: Configured to use the GenericOAuthenticator for authenticating against Keycloak.
+* **Certificates**: Utilizes self-signed certificates for secure communication between JupyterHub and Keycloak.
+* **Endpoints**: The endpoints for the OAuth callback, authorization, token, and user data are dynamically set using environment variables and a ConfigMap, as sketched below.
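+
+The endpoint wiring might look roughly like this in the hub configuration (the environment variable names and realm are illustrative, not the demo's actual values):
+
+[source,python]
+----
+import os
+
+keycloak_url = os.environ["KEYCLOAK_URL"]  # e.g. https://<node-ip>:31093
+realm = "demo"                             # placeholder realm name
+
+c.GenericOAuthenticator.authorize_url = f"{keycloak_url}/realms/{realm}/protocol/openid-connect/auth"
+c.GenericOAuthenticator.token_url = f"{keycloak_url}/realms/{realm}/protocol/openid-connect/token"
+c.GenericOAuthenticator.userdata_url = f"{keycloak_url}/realms/{realm}/protocol/openid-connect/userinfo"
+c.GenericOAuthenticator.oauth_callback_url = os.environ["OAUTH_CALLBACK_URL"]
+----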
+
+=== Spark Configuration
+
+* **Executor Image**: Uses a custom image `oci.stackable.tech/sandbox/spark:3.5.2-python311` (built on the standard Spark image) for the executors, matching the Python version of the notebook.
+* **Resource Allocation**: Configures Spark executor instances, memory, and cores through settings defined in the notebook.
+* **Hadoop and AWS Libraries**: Includes the necessary Hadoop and AWS libraries for S3 operations, matching the notebook image version (see the sketch below).
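+
+For example, the S3A connector and its AWS SDK dependency might be pulled in via `spark.jars.packages` when building the session (the versions here are illustrative and must match the Hadoop build bundled with the Spark image):
+
+[source,python]
+----
+from pyspark.sql import SparkSession
+
+# Extends the session builder shown earlier with the S3 libraries
+spark = (
+    SparkSession.builder
+    .config("spark.jars.packages",
+            "org.apache.hadoop:hadoop-aws:3.3.4,"
+            "com.amazonaws:aws-java-sdk-bundle:1.12.262")
+    .getOrCreate()
+)
+----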
+
+For more details, see the https://docs.stackable.tech/home/stable/tutorials/jupyterhub/[tutorial].
+
+== Detailed Demo/Notebook Walkthrough
+
+The demo showcases an IPython notebook that begins by outputting the versions of Python, Java, and PySpark in use.
+It reads MinIO credentials from a mounted Secret to access the S3 storage.
+This ensures that the environment is correctly set up and that the necessary credentials are available for S3 operations.
+The notebook configures Spark to interact with an S3 bucket hosted on MinIO.
+It includes the necessary Hadoop and AWS libraries to facilitate S3 operations.
+The Spark session is configured with various settings, including executor instances, memory, and cores, to ensure optimal performance.
+
+The demo then performs various data processing tasks (sketched below), including:
+
+* **Creating an In-Memory DataFrame**: Verifies compatibility between the driver and executor libraries.
+* **Inspecting S3 Buckets with PyArrow**: Lists files in the S3 bucket using the PyArrow library.
+* **Read/Write Operations**: Demonstrates reading CSV data from S3, performing basic transformations, and writing the results back to S3 in CSV and Parquet formats.
+* **Data Aggregation**: Aggregates data by hour and writes the aggregated results back to S3.
+* **DataFrame Conversion**: Shows how to convert between Spark and Pandas DataFrames for further analysis or visualization.
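+
+A condensed sketch of these steps (the bucket, paths, and column names are illustrative, not the demo's actual schema):
+
+[source,python]
+----
+import pyarrow.fs as pafs
+import pyspark.sql.functions as F
+
+# Inspect the bucket with PyArrow (endpoint and credentials as configured above)
+s3 = pafs.S3FileSystem(endpoint_override="http://minio:9000",
+                       access_key=access_key, secret_key=secret_key)
+print(s3.get_file_info(pafs.FileSelector("demo/gas-sensor/raw/")))
+
+# Read the raw CSV data from S3
+df = spark.read.csv("s3a://demo/gas-sensor/raw/", header=True, inferSchema=True)
+
+# Aggregate by hour and write the result back as Parquet
+hourly = (
+    df.withColumn("hour", F.date_trunc("hour", F.col("timestamp")))
+      .groupBy("hour")
+      .agg(F.avg("humidity").alias("avg_humidity"))
+)
+hourly.write.mode("overwrite").parquet("s3a://demo/gas-sensor/agg/")
+
+# Convert to a Pandas DataFrame for further analysis or visualization
+pdf = hourly.toPandas()
+----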
+
+== Users
+
+The same users as in the xref:end-to-end-security.adoc[End-to-end-security] demo are configured in Keycloak, and these will be used as examples.

== JupyterHub

@@ -189,5 +264,26 @@ image::jupyterhub-keycloak/s3-buckets.png[]
NOTE: If you attempt to re-run the notebook, you will need to first remove the `_temporary` folders from the S3 buckets.
These are created by Spark jobs and are not removed from the bucket when the job has completed.

+== Where to go from here
+
+=== Add your own data
+
+You can augment the demo dataset with your own data by creating new buckets and folders and uploading it via the MinIO UI.
+
+=== Scale up and out
+
+There are several possibilities here (all of which will depend to some degree on the resources available to the cluster):
+
+* Allocate more CPU and memory resources to the JupyterHub notebooks, or change notebook profiles, by modifying `singleuser.profileList` in the Helm chart values
+* Add concurrent users
+* Alter Spark session settings by changing `spark.executor.instances`, `spark.executor.memory`, or `spark.executor.cores`
+* Integrate other data sources, for example HDFS (see the https://docs.stackable.tech/home/nightly/demos/jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data/[JupyterHub-PySpark] demo)
+
+== Conclusion
+
+The JupyterHub-Keycloak integration demo, with its dynamic Spark integration and S3 storage interaction, is a great starting point for data scientists to begin building complex data operations.
+
+For further details and customization options, refer to the demo notebook and configuration files provided in the repository.
+This environment is ideal for data scientists with a platform engineering background, offering a template solution for secure and efficient data processing.

*See: Burgués, Javier, Juan Manuel Jiménez-Soto, and Santiago Marco. "Estimation of the limit of detection in semiconductor gas sensors through linearized calibration models." Analytica Chimica Acta 1013 (2018): 13-25.
Burgués, Javier, and Santiago Marco. "Multivariate estimation of the limit of detection by orthogonal partial least squares in temperature-modulated MOX sensors." Analytica Chimica Acta 1019 (2018): 49-64.
