diff --git a/docs/user_guides/projects/jobs/notebook_job.md b/docs/user_guides/projects/jobs/notebook_job.md
index a17788651..364b5900e 100644
--- a/docs/user_guides/projects/jobs/notebook_job.md
+++ b/docs/user_guides/projects/jobs/notebook_job.md
@@ -82,7 +82,7 @@ It is possible to also set following configuration settings for a `PYTHON` job.
* `Environment`: The python environment to use
* `Container memory`: The amount of memory in MB to be allocated to the Jupyter Notebook script
* `Container cores`: The number of cores to be allocated for the Jupyter Notebook script
-* `Additional files`: List of files that will be locally accessible by the application
+* `Additional files`: List of files that will be locally accessible in the working directory of the application. Recommended only if the project datasets are not mounted under `/hopsfs`.
You can always modify the arguments in the job settings.
@@ -142,7 +142,7 @@ In this snippet we get the `JobsApi` object to get the default job configuration
```python
-jobs_api = project.get_jobs_api()
+jobs_api = project.get_job_api()
notebook_job_config = jobs_api.get_configuration("PYTHON")
@@ -166,7 +166,33 @@ In this code snippet, we execute the job with arguments and wait until it reache
execution = job.run(args='-p a 2 -p b 5', await_termination=True)
```
-### API Reference
+## Configuration
+The following table describes the JSON payload returned by `jobs_api.get_configuration("PYTHON")`:
+
+| Field | Type | Description | Default |
+|-------------------------|----------------|------------------------------------------------------|--------------------------|
+| `type` | string | Type of the job configuration | `"pythonJobConfiguration"` |
+| `appPath`               | string         | Project path to notebook (e.g. `Resources/foo.ipynb`) | `null`                   |
+| `environmentName` | string | Name of the python environment | `"pandas-training-pipeline"` |
+| `resourceConfig.cores` | number (float) | Number of CPU cores to be allocated | `1.0` |
+| `resourceConfig.memory` | number (int) | Number of MBs to be allocated | `2048` |
+| `resourceConfig.gpus` | number (int) | Number of GPUs to be allocated | `0` |
+| `logRedirection` | boolean | Whether logs are redirected | `true` |
+| `jobType` | string | Type of job | `"PYTHON"` |
+
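+For example, a minimal sketch of overriding some of these defaults before creating a job, assuming the nested `resourceConfig` layout shown above; the job name and notebook path are placeholders:
+
+```python
+notebook_job_config = jobs_api.get_configuration("PYTHON")
+
+# Placeholder notebook path and resource overrides
+notebook_job_config["appPath"] = "Resources/my_notebook.ipynb"
+notebook_job_config["resourceConfig"]["memory"] = 4096  # MB
+notebook_job_config["resourceConfig"]["cores"] = 2.0
+
+job = jobs_api.create_job("my_notebook_job", notebook_job_config)
+```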
+
+## Accessing project data
+!!! note "Recommended approach if `/hopsfs` is mounted"
+    If your Hopsworks installation is configured to mount the project datasets under `/hopsfs` (as it is in most cases), refer to this section for referencing file resources instead of using the `Additional files` property.
+
+### Absolute paths
+The project datasets are mounted under `/hopsfs`, so you can access `data.csv` from the `Resources` dataset using `/hopsfs/Resources/data.csv` in your notebook.
+
+### Relative paths
+The notebook's working directory is the folder it is located in. For example, if the notebook is located in the `Resources` dataset and that dataset contains a file named `data.csv`, you can access it simply as `data.csv`. Likewise, any local file you write, for example `output.txt`, is saved to the `Resources` dataset.
+
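+A minimal sketch of both styles, assuming `pandas` is available in the job's environment and the notebook lives in the `Resources` dataset:
+
+```python
+import pandas as pd
+
+# Absolute path through the /hopsfs mount
+df = pd.read_csv("/hopsfs/Resources/data.csv")
+
+# Relative path, resolved against the notebook's working directory
+df = pd.read_csv("data.csv")
+
+# Local writes land in the same dataset
+df.describe().to_csv("output.txt")
+```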
+
+## API Reference
[Jobs](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/jobs/)
diff --git a/docs/user_guides/projects/jobs/pyspark_job.md b/docs/user_guides/projects/jobs/pyspark_job.md
index 3cc9e3030..c0cb7e804 100644
--- a/docs/user_guides/projects/jobs/pyspark_job.md
+++ b/docs/user_guides/projects/jobs/pyspark_job.md
@@ -8,7 +8,7 @@ description: Documentation on how to configure and execute a PySpark job on Hops
All members of a project in Hopsworks can launch the following types of applications through a project's Jobs service:
-- Python (*Hopsworks Enterprise only*)
+- Python
- Apache Spark
-Launching a job of any type is very similar process, what mostly differs between job types is
+Launching a job of any type is a very similar process; what mostly differs between job types is
@@ -179,7 +179,7 @@ In this snippet we get the `JobsApi` object to get the default job configuration
```python
-jobs_api = project.get_jobs_api()
+jobs_api = project.get_job_api()
spark_config = jobs_api.get_configuration("PYSPARK")
@@ -211,7 +211,45 @@ print(f_err.read())
```
-### API Reference
+## Configuration
+The following table describes the JSON payload returned by `jobs_api.get_configuration("PYSPARK")`:
+
+| Field | Type | Description | Default |
+| ------------------------------------------ | -------------- |-----------------------------------------------------| -------------------------- |
+| `type` | string | Type of the job configuration | `"sparkJobConfiguration"` |
+| `appPath`                                  | string         | Project path to script (e.g. `Resources/foo.py`)    | `null`                     |
+| `environmentName` | string | Name of the project spark environment | `"spark-feature-pipeline"` |
+| `spark.driver.cores` | number (float) | Number of CPU cores allocated for the driver | `1.0` |
+| `spark.driver.memory` | number (int) | Memory allocated for the driver (in MB) | `2048` |
+| `spark.executor.instances` | number (int) | Number of executor instances | `1` |
+| `spark.executor.cores` | number (float) | Number of CPU cores per executor | `1.0` |
+| `spark.executor.memory` | number (int) | Memory allocated per executor (in MB) | `4096` |
+| `spark.dynamicAllocation.enabled` | boolean | Enable dynamic allocation of executors | `true` |
+| `spark.dynamicAllocation.minExecutors` | number (int) | Minimum number of executors with dynamic allocation | `1` |
+| `spark.dynamicAllocation.maxExecutors` | number (int) | Maximum number of executors with dynamic allocation | `2` |
+| `spark.dynamicAllocation.initialExecutors` | number (int) | Initial number of executors with dynamic allocation | `1` |
+| `spark.blacklist.enabled` | boolean | Whether executor/node blacklisting is enabled | `false` |
+
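+For example, a minimal sketch of tuning executor resources before creating a job, assuming the dotted Spark fields above are literal keys in the payload; the job name and script path are placeholders:
+
+```python
+spark_config = jobs_api.get_configuration("PYSPARK")
+
+spark_config["appPath"] = "Resources/my_script.py"         # placeholder script
+spark_config["spark.executor.memory"] = 8192               # MB per executor
+spark_config["spark.dynamicAllocation.maxExecutors"] = 4
+
+job = jobs_api.create_job("my_pyspark_job", spark_config)
+```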
+
+## Accessing project data
+
+### Read directly from the filesystem (recommended)
+
+To read a dataset in your project using Spark, use the full filesystem path where the data is stored. For example, to read a CSV file named `data.csv` located in the `Resources` dataset of a project called `my_project`:
+
+```python
+df = spark.read.csv("/Projects/my_project/Resources/data.csv", header=True, inferSchema=True)
+df.show()
+```
+
+### Additional files
+
+Files of different types can be attached to the Spark job and are made available in the `/srv/hops/artifacts` folder when the PySpark job starts. This configuration is mainly useful when you need additional setup, such as JARs that need to be added to the CLASSPATH.
+
+When reading data in your Spark job, it is recommended to use the Spark read API as demonstrated above, since it reads directly from the filesystem, whereas the `Additional files` option downloads each file in its entirety and does not scale to large files.
+
+
+## API Reference
[Jobs](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/jobs/)
diff --git a/docs/user_guides/projects/jobs/python_job.md b/docs/user_guides/projects/jobs/python_job.md
index 4fa58cfa6..420e38e49 100644
--- a/docs/user_guides/projects/jobs/python_job.md
+++ b/docs/user_guides/projects/jobs/python_job.md
@@ -81,7 +81,8 @@ It is possible to also set following configuration settings for a `PYTHON` job.
* `Environment`: The python environment to use
* `Container memory`: The amount of memory in MB to be allocated to the Python script
* `Container cores`: The number of cores to be allocated for the Python script
-* `Additional files`: List of files that will be locally accessible by the application
+* `Additional files`: List of files that will be locally accessible in the working directory of the application. Recommended only if the project datasets are not mounted under `/hopsfs`.
+You can always modify the arguments in the job settings.
@@ -129,7 +130,7 @@ In this snippet we get the `JobsApi` object to get the default job configuration
```python
-jobs_api = project.get_jobs_api()
+jobs_api = project.get_job_api()
py_job_config = jobs_api.get_configuration("PYTHON")
@@ -163,7 +164,33 @@ print(f_err.read())
```
-### API Reference
+## Configuration
+The following table describes the JSON payload returned by `jobs_api.get_configuration("PYTHON")`:
+
+| Field | Type | Description | Default |
+|-------------------------|----------------|-------------------------------------------------|--------------------------|
+| `type` | string | Type of the job configuration | `"pythonJobConfiguration"` |
+| `appPath`               | string         | Project path to script (e.g. `Resources/foo.py`) | `null`                   |
+| `environmentName` | string | Name of the project python environment | `"pandas-training-pipeline"` |
+| `resourceConfig.cores` | number (float) | Number of CPU cores to be allocated | `1.0` |
+| `resourceConfig.memory` | number (int) | Number of MBs to be allocated | `2048` |
+| `resourceConfig.gpus` | number (int) | Number of GPUs to be allocated | `0` |
+| `logRedirection` | boolean | Whether logs are redirected | `true` |
+| `jobType` | string | Type of job | `"PYTHON"` |
+
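+For example, a minimal sketch of adjusting resources before creating a job, assuming the nested `resourceConfig` layout shown above; the job name and script path are placeholders:
+
+```python
+py_job_config = jobs_api.get_configuration("PYTHON")
+
+py_job_config["appPath"] = "Resources/my_script.py"
+py_job_config["resourceConfig"]["memory"] = 4096  # MB
+
+job = jobs_api.create_job("my_python_job", py_job_config)
+```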
+
+## Accessing project data
+!!! note "Recommended approach if `/hopsfs` is mounted"
+    If your Hopsworks installation is configured to mount the project datasets under `/hopsfs` (as it is in most cases), refer to this section for referencing file resources instead of using the `Additional files` property.
+
+### Absolute paths
+The project datasets are mounted under `/hopsfs`, so you can access `data.csv` from the `Resources` dataset using `/hopsfs/Resources/data.csv` in your script.
+
+### Relative paths
+The script's working directory is the folder it is located in. For example, if the script is located in the `Resources` dataset and that dataset contains a file named `data.csv`, you can access it simply as `data.csv`. Likewise, any local file you write, for example `output.txt`, is saved to the `Resources` dataset.
+
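+A minimal sketch of both styles, assuming `pandas` is available in the job's environment and the script lives in the `Resources` dataset:
+
+```python
+import pandas as pd
+
+df = pd.read_csv("/hopsfs/Resources/data.csv")  # absolute path through the mount
+df = pd.read_csv("data.csv")                    # relative to the script's folder
+```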
+
+## API Reference
[Jobs](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/jobs/)
diff --git a/docs/user_guides/projects/jobs/ray_job.md b/docs/user_guides/projects/jobs/ray_job.md
index 99312f4a2..1b79a6f49 100644
--- a/docs/user_guides/projects/jobs/ray_job.md
+++ b/docs/user_guides/projects/jobs/ray_job.md
@@ -8,7 +8,7 @@ description: Documentation on how to configure and execute a Ray job on Hopswork
All members of a project in Hopsworks can launch the following types of applications through a project's Jobs service:
-- Python (*Hopsworks Enterprise only*)
+- Python
- Apache Spark
- Ray
@@ -168,7 +168,7 @@ In this snippet we get the `JobsApi` object to get the default job configuration
```python
-jobs_api = project.get_jobs_api()
+jobs_api = project.get_job_api()
ray_config = jobs_api.get_configuration("RAY")
@@ -203,7 +203,12 @@ print(f_err.read())
```
-### API Reference
+## Accessing project data
+
+The project datasets are mounted under `/home/yarnapp/hopsfs`, so you can access `data.csv` from the `Resources` dataset using `/home/yarnapp/hopsfs/Resources/data.csv` in your script.
+
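+A minimal sketch, assuming `pandas` is available in the job's environment:
+
+```python
+import pandas as pd
+
+# Read through the dataset mount inside the Ray container
+df = pd.read_csv("/home/yarnapp/hopsfs/Resources/data.csv")
+```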
+
+## API Reference
[Jobs](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/jobs/)
diff --git a/docs/user_guides/projects/jobs/spark_job.md b/docs/user_guides/projects/jobs/spark_job.md
index 66be8c001..6d0f0510b 100644
--- a/docs/user_guides/projects/jobs/spark_job.md
+++ b/docs/user_guides/projects/jobs/spark_job.md
@@ -183,7 +183,7 @@ In this snippet we get the `JobsApi` object to get the default job configuration
```python
-jobs_api = project.get_jobs_api()
+jobs_api = project.get_job_api()
spark_config = jobs_api.get_configuration("SPARK")
@@ -212,7 +212,48 @@ print(f_err.read())
```
-### API Reference
+## Configuration
+The following table describes the JSON payload returned by `jobs_api.get_configuration("SPARK")`:
+
+| Field | Type | Description | Default |
+|--------------------------------------------| -------------- |---------------------------------------------------------| -------------------------- |
+| `type` | string | Type of the job configuration | `"sparkJobConfiguration"` |
+| `appPath`                                  | string         | Project path to Spark program (e.g. `Resources/foo.jar`) | `null`                     |
+| `mainClass`                                | string         | Name of the main class to run (e.g. `org.company.Main`)  | `null`                     |
+| `environmentName` | string | Name of the project spark environment | `"spark-feature-pipeline"` |
+| `spark.driver.cores` | number (float) | Number of CPU cores allocated for the driver | `1.0` |
+| `spark.driver.memory` | number (int) | Memory allocated for the driver (in MB) | `2048` |
+| `spark.executor.instances` | number (int) | Number of executor instances | `1` |
+| `spark.executor.cores` | number (float) | Number of CPU cores per executor | `1.0` |
+| `spark.executor.memory` | number (int) | Memory allocated per executor (in MB) | `4096` |
+| `spark.dynamicAllocation.enabled` | boolean | Enable dynamic allocation of executors | `true` |
+| `spark.dynamicAllocation.minExecutors` | number (int) | Minimum number of executors with dynamic allocation | `1` |
+| `spark.dynamicAllocation.maxExecutors` | number (int) | Maximum number of executors with dynamic allocation | `2` |
+| `spark.dynamicAllocation.initialExecutors` | number (int) | Initial number of executors with dynamic allocation | `1` |
+| `spark.blacklist.enabled` | boolean | Whether executor/node blacklisting is enabled | `false` |
+
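+For example, a minimal sketch of configuring a job around a JAR, assuming the dotted Spark fields above are literal keys in the payload; the job name, path, and class are placeholders:
+
+```python
+spark_config = jobs_api.get_configuration("SPARK")
+
+spark_config["appPath"] = "Resources/my_program.jar"
+spark_config["mainClass"] = "org.company.Main"
+spark_config["spark.executor.memory"] = 8192  # MB per executor
+
+job = jobs_api.create_job("my_spark_job", spark_config)
+```
+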
+## Accessing project data
+
+### Read directly from the filesystem (recommended)
+
+To read a dataset in your project using Spark, use the full filesystem path where the data is stored. For example, to read a CSV file named `data.csv` located in the `Resources` dataset of a project called `my_project`:
+
+```java
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+
+Dataset<Row> df = spark.read()
+    .option("header", "true")       // CSV file has a header row
+    .option("inferSchema", "true")  // Infer column data types
+    .csv("/Projects/my_project/Resources/data.csv");
+
+df.show();
+```
+
+### Additional files
+
+Files of different types can be attached to the Spark job and are made available in the `/srv/hops/artifacts` folder when the Spark job starts. This configuration is mainly useful when you need additional setup, such as JARs that need to be added to the CLASSPATH.
+
+When reading data in your Spark job, it is recommended to use the Spark read API as demonstrated above, since it reads directly from the filesystem, whereas the `Additional files` option downloads each file in its entirety and does not scale to large files.
+
+## API Reference
[Jobs](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/jobs/)
diff --git a/docs/user_guides/projects/jupyter/python_notebook.md b/docs/user_guides/projects/jupyter/python_notebook.md
index 3412a0d96..409faa6d5 100644
--- a/docs/user_guides/projects/jupyter/python_notebook.md
+++ b/docs/user_guides/projects/jupyter/python_notebook.md
@@ -5,7 +5,7 @@
-Jupyter is provided as a service in Hopsworks, providing the same user experience and features as if run on your laptop.
+Jupyter is provided as a service in Hopsworks, offering the same user experience and features as if it were running on your laptop.
* Supports JupyterLab and the classic Jupyter front-end
-* Configured with Python and PySpark kernels
+* Configured with Python 3, PySpark and Ray kernels
## Step 1: Jupyter dashboard
@@ -82,6 +82,17 @@ Start the Jupyter instance by clicking the `Run Jupyter` button.
+## Accessing project data
+!!! note "Recommended approach if `/hopsfs` is mounted"
+    If your Hopsworks installation is configured to mount the project datasets under `/hopsfs` (as it is in most cases), refer to this section.
+    If the file system is not mounted, project files can instead be downloaded into the current working directory using the [download API](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/datasets/#download).
+
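+A minimal sketch of the download approach; the file name is a placeholder:
+
+```python
+import hopsworks
+
+project = hopsworks.login()
+dataset_api = project.get_dataset_api()
+
+# Download the file and get the local path it was saved to
+local_path = dataset_api.download("Resources/data.csv")
+```
+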
+### Absolute paths
+The project datasets are mounted under `/hopsfs`, so you can access `data.csv` from the `Resources` dataset using `/hopsfs/Resources/data.csv` in your notebook.
+
+### Relative paths
+The notebook's working directory is the folder it is located in. For example, if the notebook is located in the `Resources` dataset and that dataset contains a file named `data.csv`, you can access it simply as `data.csv`. Likewise, any local file you write, for example `output.txt`, is saved to the `Resources` dataset.
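+
+A minimal sketch of both styles, assuming `pandas` is installed in the environment and the notebook lives in the `Resources` dataset:
+
+```python
+import pandas as pd
+
+df = pd.read_csv("/hopsfs/Resources/data.csv")  # absolute path through the mount
+df = pd.read_csv("data.csv")                    # relative to the notebook's folder
+```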
+
## Going Further
diff --git a/docs/user_guides/projects/jupyter/ray_notebook.md b/docs/user_guides/projects/jupyter/ray_notebook.md
index d6d4eae3e..f008583e1 100644
--- a/docs/user_guides/projects/jupyter/ray_notebook.md
+++ b/docs/user_guides/projects/jupyter/ray_notebook.md
@@ -139,4 +139,8 @@ In the Ray Dashboard, you can monitor the resources used by code you are runnin
Access Ray Dashboard for Jupyter Ray session
-
\ No newline at end of file
+
+
+## Accessing project data
+
+The project datasets are mounted under `/home/yarnapp/hopsfs` in the Ray containers, so you can access `data.csv` from the `Resources` dataset using `/home/yarnapp/hopsfs/Resources/data.csv`.
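+
+A minimal sketch, assuming `pandas` is available in the environment:
+
+```python
+import pandas as pd
+
+df = pd.read_csv("/home/yarnapp/hopsfs/Resources/data.csv")
+```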
diff --git a/docs/user_guides/projects/jupyter/spark_notebook.md b/docs/user_guides/projects/jupyter/spark_notebook.md
index c358bee61..689df54ba 100644
--- a/docs/user_guides/projects/jupyter/spark_notebook.md
+++ b/docs/user_guides/projects/jupyter/spark_notebook.md
@@ -135,6 +135,23 @@ Navigate back to Hopsworks and a Spark session will have appeared, click on the
+## Accessing project data
+
+### Read directly from the filesystem (recommended)
+
+To read a dataset in your project using Spark, use the full filesystem path where the data is stored. For example, to read a CSV file named `data.csv` located in the `Resources` dataset of a project called `my_project`:
+
+```python
+df = spark.read.csv("/Projects/my_project/Resources/data.csv", header=True, inferSchema=True)
+df.show()
+```
+
+### Additional files
+
+Different files can be attached to the Jupyter session and are made available in the `/srv/hops/artifacts` folder when the PySpark kernel starts. This configuration is mainly useful when you need additional setup, such as JARs that need to be added to the CLASSPATH.
+
+When reading data in your Spark application, it is recommended to use the Spark read API as demonstrated above, since it reads directly from the filesystem, whereas the `Additional files` option downloads each file in its entirety and does not scale to large files.
+
## Going Further
You can learn how to [install a library](../python/python_install.md) so that it can be used in a notebook.
diff --git a/mkdocs.yml b/mkdocs.yml
index a84b46b95..7223164d1 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -152,11 +152,11 @@ nav:
- Run Ray Notebook: user_guides/projects/jupyter/ray_notebook.md
- Remote Filesystem Driver: user_guides/projects/jupyter/remote_filesystem_driver.md
- Jobs:
+ - Run Python Job: user_guides/projects/jobs/python_job.md
+ - Run Jupyter Notebook Job: user_guides/projects/jobs/notebook_job.md
- Run PySpark Job: user_guides/projects/jobs/pyspark_job.md
- Run Spark Job: user_guides/projects/jobs/spark_job.md
- - Run Python Job: user_guides/projects/jobs/python_job.md
- Run Ray Job: user_guides/projects/jobs/ray_job.md
- - Run Jupyter Notebook Job: user_guides/projects/jobs/notebook_job.md
- Scheduling: user_guides/projects/jobs/schedule_job.md
- Kubernetes Scheduling: user_guides/projects/scheduling/kube_scheduler.md
- Airflow: user_guides/projects/airflow/airflow.md