
Commit 5e5df23

adwk67 and xeniape authored
docs: added note about adding jupyterhub dependencies (#105)
* docs: added note about adding jupyterhub dependencies
* added section for product images
* wording
* formatting
* link to notebook from image
* corrected relative link path
* updated docs in light of notebook/image changes
* Update docs/modules/demos/pages/jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data.adoc
* Update docs/modules/demos/pages/signal-processing.adoc
* omit redundant sentence

Co-authored-by: Xenia <xenia.fischer@stackable.tech>
1 parent 819b870 commit 5e5df23

File tree: 2 files changed, +100 -6 lines changed


docs/modules/demos/pages/jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data.adoc

Lines changed: 60 additions & 6 deletions
@@ -136,17 +136,71 @@ You should arrive at your workspace:
 image::jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data/jupyter_hub_workspace.png[]
 
 Now you can double-click on the `notebook` folder on the left, open and run the contained file.
-Click on the double arrow (⏩️) to execute the Python scripts.
+Click on the double arrow (⏩️) to execute the Python scripts (click on the image below to go to the notebook file).
 
-image::jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data/jupyter_hub_run_notebook.png[]
+image::jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data/jupyter_hub_run_notebook.png[link=https://github.com/stackabletech/demos/blob/main/stacks/jupyterhub-pyspark-hdfs/notebook.ipynb,window=_blank]
 
 You can also inspect the `hdfs` folder where the `core-site.xml` and `hdfs-site.xml` from the discovery ConfigMap of the HDFS cluster are located.
 
-[NOTE]
-====
 The image defined for the spark job must contain all dependencies needed for that job to run.
-For pyspark jobs, this will mean that Python libraries either need to be baked into the image (this demo contains a Dockerfile that was used to generate an image containing scikit-learn, pandas and their dependencies) or {spark-pkg}[packaged in some other way].
-====
+For PySpark jobs, this means that Python libraries either need to be baked into the image or {spark-pkg}[packaged in some other way].
+This demo uses a custom image, built from a Dockerfile, that bundles scikit-learn, pandas and their dependencies.
+This is described below.
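+As a sketch of the second approach, dependencies could instead be shipped alongside the job; the archive name below is hypothetical and only illustrates the `spark.submit.pyFiles` mechanism:
+
+[source,python]
+----
+# hypothetical: distribute a zip of pure-Python dependencies with the job
+.config("spark.submit.pyFiles", "deps.zip")
+...
+----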
+
+=== Install the libraries into a product image
+
+Libraries can be added to a custom *product* image launched by the notebook. Suppose a Spark job is prepared like this:
+
+[source,python]
+----
+spark = (SparkSession
+    .builder
+    .master(f'k8s://https://{os.environ["KUBERNETES_SERVICE_HOST"]}:{os.environ["KUBERNETES_SERVICE_PORT"]}')
+    .config("spark.kubernetes.container.image", "docker.stackable.tech/demos/spark-k8s-with-scikit-learn:3.5.0-stackable24.3.0")
+    .config("spark.driver.port", "2222")
+    .config("spark.driver.blockManager.port", "7777")
+    .config("spark.driver.host", "driver-service.default.svc.cluster.local")
+    .config("spark.driver.bindAddress", "0.0.0.0")
+    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
+    .config("spark.kubernetes.authenticate.serviceAccountName", "spark")
+    .config("spark.executor.instances", "4")
+    .config("spark.kubernetes.container.image.pullPolicy", "IfNotPresent")
+    .appName("taxi-data-anomaly-detection")
+    .getOrCreate()
+)
+----
+
+It requires a specific Spark image:
+
+[source,python]
+----
+.config("spark.kubernetes.container.image",
+        "docker.stackable.tech/demos/spark-k8s-with-scikit-learn:3.5.0-stackable24.3.0")
+...
+----
+
+This is created by taking a Spark image, in this case `docker.stackable.tech/stackable/spark-k8s:3.5.0-stackable24.3.0`, installing specific Python libraries into it, and re-tagging the image:
+
+[source,console]
+----
+FROM docker.stackable.tech/stackable/spark-k8s:3.5.0-stackable24.3.0
+
+COPY demos/jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data/requirements.txt .
+
+RUN pip install --no-cache-dir --upgrade pip && \
+    pip install --no-cache-dir -r ./requirements.txt
+----
+
+Where `requirements.txt` contains:
+
+[source,console]
+----
+scikit-learn==1.3.1
+pandas==2.0.3
+----
+
+NOTE: Using a custom image requires access to a repository where the image can be made available.
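+For example, the image might be built and published like this (the registry `registry.example.com` is a placeholder; substitute a repository you can push to):
+
+[source,console]
+----
+docker build -t registry.example.com/demos/spark-k8s-with-scikit-learn:3.5.0-stackable24.3.0 .
+docker push registry.example.com/demos/spark-k8s-with-scikit-learn:3.5.0-stackable24.3.0
+----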
 
 == Model details
 
docs/modules/demos/pages/signal-processing.adoc

Lines changed: 40 additions & 0 deletions
@@ -65,6 +65,46 @@ image::signal-processing/notebook.png[]
 
 The notebook reads the measurement data in windowed batches using a loop, computes some predictions for each batch and persists the scores in a separate timescale table.
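 A minimal sketch of the shape of such a loop, assuming a hypothetical `measurements` source table, a `scores` target table and a `score_batch` scoring helper (none of these names are the demo's actual ones):
 
 [source,python]
 ----
 import pandas as pd
 import psycopg2
 
 conn = psycopg2.connect("host=timescaledb dbname=demo user=demo")  # placeholder DSN
 window = pd.Timedelta(minutes=5)
 ts, end = pd.Timestamp("2023-01-01"), pd.Timestamp("2023-01-02")
 
 while ts < end:
     # read one window of raw measurements
     batch = pd.read_sql(
         "SELECT time, value FROM measurements WHERE time >= %s AND time < %s ORDER BY time",
         conn, params=(ts, ts + window))
     if not batch.empty:
         scores = score_batch(batch["value"].to_numpy())  # hypothetical model call
         rows = [(t, float(s)) for t, s in zip(batch["time"], scores)]
         # persist the scores into the separate table
         with conn.cursor() as cur:
             cur.executemany("INSERT INTO scores (time, score) VALUES (%s, %s)", rows)
         conn.commit()
     ts += window
 ----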
 
+=== Adding libraries
+
+There are two ways of doing this:
+
+==== Install from within the notebook
+
+This can be done by executing `!pip install` from within a notebook cell, as shown below:
+
+[source,console]
+----
+!pip install psycopg2-binary
+!pip install alibi-detect
+----
+
+==== Install the libraries into a custom image
+
+Alternatively, dependencies can be added to the base image used for JupyterHub.
+This can make use of any Dockerfile mechanism (downloading via `curl`, using a package manager, etc.) and is not limited to Python libraries.
+To achieve the same imports as in the previous section, write the Dockerfile like this:
+
+[source,console]
+----
+FROM jupyter/pyspark-notebook:python-3.9
+
+COPY demos/signal-processing/requirements.txt .
+
+RUN pip install --no-cache-dir --upgrade pip && \
+    pip install --no-cache-dir -r ./requirements.txt
+----
+
+Where `requirements.txt` contains:
+
+[source,console]
+----
+psycopg2-binary==2.9.9
+alibi-detect==0.11.4
+----
+
+NOTE: Using a custom image requires access to a repository where the image can be made available.
+
 == Model details
 
 The enriched data is calculated using an online, unsupervised https://docs.seldon.io/projects/alibi-detect/en/stable/od/methods/sr.html[model] that uses a technique called http://www.houxiaodi.com/assets/papers/cvpr07.pdf[Spectral Residuals].
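 A hypothetical instantiation of such a detector with alibi-detect (parameter values here are illustrative, not the notebook's actual settings):
 
 [source,python]
 ----
 import numpy as np
 from alibi_detect.od import SpectralResidual
 
 od = SpectralResidual(
     threshold=1.0,    # instance score above which a point is flagged as an outlier
     window_amp=20,    # window used to average the log amplitude of the spectrum
     window_local=20,  # window used to compute the local average of the saliency map
     n_est_points=20,  # number of points used to extrapolate the series
 )
 X = np.random.randn(200).astype(np.float32)  # stand-in signal
 preds = od.predict(X)
 print(preds["data"]["instance_score"][:5])
 ----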
