
Added in-depth comments highlighting lakeFS use cases in spark-demo #212


Merged · 4 commits · Jul 18, 2024
40 changes: 37 additions & 3 deletions 00_notebooks/spark-demo.ipynb
@@ -11,7 +11,15 @@
"\n",
"Use Case: Isolated Testing Environment\n",
"\n",
"Access lakeFS using the S3A gateway"
"Access lakeFS using the S3 gateway. Applicable for all S3 compatible storage, including Azure Blob."
]
},
{
"cell_type": "markdown",
"id": "28326d80",
"metadata": {},
"source": [
"In this demo, you'll learn how to use lakeFS to create an isolated testing environment for your ETL pipelines without duplicating data. The notebook will guide you through creating branches and merging changes back to the main branch seamlessly using Spark, and accessing lakeFS using the S3 gateway. This approach ensures safe, efficient, and complete testing with datasets. "
]
},
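For readers skimming the diff: a minimal sketch of what this gateway setup looks like on the Spark side, assuming a local lakeFS instance. The endpoint, credentials, and paths below are placeholders, not values from the notebook.

```python
# A minimal sketch: point Spark's S3A connector at the lakeFS S3 gateway
# instead of AWS S3. All values below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakefs-spark-demo")
    .config("spark.hadoop.fs.s3a.endpoint", "http://lakefs:8000")      # placeholder endpoint
    .config("spark.hadoop.fs.s3a.access.key", "AKIAIOSFODNN7EXAMPLE")  # placeholder key
    .config("spark.hadoop.fs.s3a.secret.key", "<secret>")              # placeholder secret
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Branches appear as path prefixes: s3a://<repository>/<branch>/<path>
df = spark.read.csv("s3a://example-repo/main/demo/data.csv")  # hypothetical path
```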
{
@@ -292,7 +300,7 @@
"id": "8e234b50",
"metadata": {},
"source": [
"## Reading data by using S3A Gateway"
"## Reading data by using S3 Gateway"
]
},
{
@@ -310,6 +318,16 @@
"df.show()"
]
},
{
"cell_type": "markdown",
"id": "7a80e29b",
"metadata": {},
"source": [
"This section demonstrates how to use the lakeFS [S3 gateway](https://docs.lakefs.io/integrations/spark.html#s3-compatible-api) to interact with Spark. The S3 gateway is an endpoint provided by lakeFS that implements a subset of the S3 API, acting as a bridge between Spark and the underlying object storage.\n",
"\n",
"This simplifies data access. Alternatively, you can use the lakeFS [Spark client](https://docs.lakefs.io/integrations/spark.html#lakefs-hadoop-filesystem) for direct data flow to/from object storage, which can offer improved performance and enhanced security.\n"
]
},
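To make the trade-off concrete: a hedged sketch of the Spark client (lakeFS Hadoop FileSystem) alternative mentioned above. Endpoint, keys, repository, and path are placeholders.

```python
# Sketch of the lakeFS Hadoop FileSystem setup: data flows directly
# to/from the object store, and only metadata calls go through lakeFS.
# All values are placeholders; the underlying fs.s3a.* credentials for
# the storage bucket must also be configured.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakefs-filesystem-sketch")
    .config("spark.hadoop.fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
    .config("spark.hadoop.fs.lakefs.endpoint", "http://lakefs:8000/api/v1")  # placeholder
    .config("spark.hadoop.fs.lakefs.access.key", "AKIAIOSFODNN7EXAMPLE")     # placeholder
    .config("spark.hadoop.fs.lakefs.secret.key", "<secret>")                 # placeholder
    .getOrCreate()
)

# Paths use the lakefs:// scheme instead of s3a://
df = spark.read.parquet("lakefs://example-repo/main/demo/users/")  # hypothetical path
df.show()
```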
{
"cell_type": "markdown",
"id": "2f29fc32",
@@ -360,12 +378,20 @@
"print(f\"{newBranch} ref:\", branchNew.get_commit().id)"
]
},
{
"cell_type": "markdown",
"id": "43254cc6",
"metadata": {},
"source": [
"In the above, we create a new branch using lakeFS by utilizing 0-copy branching. This means that instead of duplicating the actual data files, lakeFS only manipulates metadata and pointers to the data. This makes the process almost instantaneous at any scale, allowing us to safely experiment with a complete identical dataset in an isolated environment without affecting the main branch."
]
},
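As a companion to the cell above: a sketch of zero-copy branching with the high-level `lakefs` Python SDK; repository and branch names are illustrative.

```python
# Zero-copy branch creation: only metadata is written, so the new branch
# is available almost instantly regardless of data size.
import lakefs

repo = lakefs.repository("example-repo")  # hypothetical repository
main = repo.branch("main")
test_branch = repo.branch("test-isolated-etl").create(source_reference="main")

# Right after creation, both branches point at the same commit.
print("main ref:", main.get_commit().id)
print("test ref:", test_branch.get_commit().id)
```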
{
"cell_type": "markdown",
"id": "90b8c7c0",
"metadata": {},
"source": [
"## Partition the data and write to new branch by using S3A Gateway"
"## Partition the data and write to new branch by using S3 Gateway"
]
},
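A sketch of what such a partitioned write looks like through the gateway; the repository, branch, path, and partition column are illustrative, not taken from the notebook.

```python
# Write a partitioned copy of the DataFrame to the isolated branch via
# the S3 gateway; main is untouched until we explicitly merge.
(
    df.write
    .partitionBy("category")  # hypothetical partition column
    .mode("overwrite")
    .parquet("s3a://example-repo/test-isolated-etl/demo/partitioned/")
)
```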
{
@@ -424,6 +450,14 @@
"print_diff(diff)"
]
},
{
"cell_type": "markdown",
"id": "d8875684",
"metadata": {},
"source": [
"lakeFS helps in managing versions of this data, allowing us to experiment safely in our new branch without affecting the main branch. "
]
},
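Rounding out the workflow: a sketch of inspecting the branch's changes and promoting them atomically, assuming the same SDK objects as in the branching sketch above.

```python
# List what changed on the test branch relative to main; each entry is a
# metadata-level change record.
for change in main.diff(other_ref=test_branch):
    print(change.type, change.path)

# Once the results look right, merge back: an atomic promotion of the
# tested data to the main branch.
test_branch.merge_into(main)
```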
{
"cell_type": "markdown",
"id": "a4749992",
@@ -22,6 +22,14 @@
"###### Or, alternatively, refer to lakeFS Quickstart doc (https://docs.lakefs.io/quickstart/installing.html)."
]
},
{
"cell_type": "markdown",
"id": "d960cc24",
"metadata": {},
"source": [
"In this demo, you'll learn how to integrate lakeFS with Apache Airflow to perform isolated job runs with atomic promotions to production. Here we can use an existing Airflow DAG to demonstrate lakeFS for your ETL pipelines. The notebook will guide you through creating a lakeFS repository and visualizing your workflow in the Airflow UI. "
]
},
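An illustrative skeleton of the pattern this demo automates (branch, run in isolation, promote atomically), assuming Airflow 2.x and the `lakefs` SDK; the DAG id, repository, and task wiring are hypothetical, not the demo's actual DAG.

```python
# Hypothetical branch -> run -> promote DAG skeleton.
from datetime import datetime

import lakefs
from airflow import DAG
from airflow.operators.python import PythonOperator

REPO, BRANCH = "example-repo", "airflow-run"  # placeholders


def create_branch():
    lakefs.repository(REPO).branch(BRANCH).create(source_reference="main")


def promote_to_main():
    repo = lakefs.repository(REPO)
    # Atomic promotion: merge the isolated run's results into production.
    repo.branch(BRANCH).merge_into(repo.branch("main"))


with DAG("lakefs_etl_demo", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    create = PythonOperator(task_id="create_branch", python_callable=create_branch)
    promote = PythonOperator(task_id="promote_to_main", python_callable=promote_to_main)
    # The ETL tasks writing to the isolated branch would sit between these.
    create >> promote
```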
{
"cell_type": "markdown",
"id": "16ddc884-bdf5-4fc5-97b1-38662358268c",
8 changes: 8 additions & 0 deletions 01_standalone_examples/airflow-01/Airflow Demo New DAG.ipynb
@@ -22,6 +22,14 @@
"###### Or, alternatively, refer to lakeFS Quickstart doc (https://docs.lakefs.io/quickstart/installing.html)."
]
},
{
"cell_type": "markdown",
"id": "fa757972",
"metadata": {},
"source": [
"In this demo, you'll learn how to create and troubleshoot a new DAG using lakeFS and Apache Airflow. The notebook will guide you through setting up a lakeFS repository, configuring branches, and visualizing the workflow in Airflow. You'll also see how to identify and resolve errors using the Airflow UI, with a troubleshooting demo. "
]
},
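For orientation, a sketch of the repository-setup step this notebook walks through, using the `lakefs` Python SDK; the repository name and storage namespace are placeholders.

```python
# Create (or reuse) a demo repository; values are placeholders.
import lakefs

repo = lakefs.Repository("airflow-new-dag-demo").create(
    storage_namespace="s3://example-bucket/airflow-new-dag-demo",  # placeholder
    default_branch="main",
    exist_ok=True,  # don't fail if the demo repository already exists
)
print("main ref:", repo.branch("main").get_commit().id)
```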
{
"cell_type": "markdown",
"id": "16ddc884-bdf5-4fc5-97b1-38662358268c",