
Added in-depth comments highlighting lakeFS use cases in spark-demo #212


Merged
merged 4 commits into main from documentation-improvements
Jul 18, 2024

Conversation

omertzionitreeverse
Contributor

Added an introductory summary, descriptions for the sections that use the S3A gateway, and an explanation of lakeFS zero-copy branching. The goal is to make the demos easier to follow.

@omertzionitreeverse omertzionitreeverse added the documentation Improvements or additions to documentation label Jul 9, 2024
@omertzionitreeverse
Contributor Author

Added a summary for the Airflow demos to improve clarity.

@iddoavn iddoavn requested a review from kesarwam July 11, 2024 18:14
"id": "7a80e29b",
"metadata": {},
"source": [
"The S3A gateway makes it easy to use data tools like Spark with S3 storage. It handles large datasets efficiently, saves costs with affordable storage, and organizes data neatly for better management. Combined with lakeFS, one can safely experiment with data versions without risking the original. In the above we read the Parquet file from the source branch using the S3A Gateway. After reading, the data is stored in lakeFS, which allows us to version control and manage our data efficiently. This way, we can safely read and analyze the data without worrying about accidental changes.\n"
Contributor

@iddoavn iddoavn Jul 11, 2024


How about something like:

This example demonstrates using the lakeFS S3 gateway to interact with Spark.
The S3 gateway is an endpoint provided by lakeFS that implements a subset of the S3 API. 
It acts as a bridge between Spark and the underlying object storage, providing a simplified way to access data.
Alternatively, the lakeFS Spark client can be used for direct data flow to/from object storage, potentially offering improved performance and security.
For more information on the lakeFS Spark client, refer to: https://docs.lakefs.io/reference/spark-client.html
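
For illustration, here is a minimal sketch of how Spark could be pointed at the S3 gateway through the standard Hadoop S3A settings. The helper names (`lakefs_s3a_conf`, `lakefs_s3a_uri`, `build_session`) are hypothetical and not taken from the notebook, and the endpoint, credentials, repository, and file names used with them are placeholders. It assumes the gateway is addressed with path-style URIs of the form `s3a://<repository>/<branch>/<path>`.

```python
def lakefs_s3a_conf(endpoint, access_key, secret_key):
    """Return Spark settings that route S3A traffic to a lakeFS S3 gateway.

    All three arguments are placeholders supplied by the caller.
    """
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # The repository is the "bucket", so path-style addressing is needed.
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


def lakefs_s3a_uri(repo, ref, path):
    """Build an s3a:// URI addressing `path` on branch (or ref) `ref` of `repo`."""
    return f"s3a://{repo}/{ref}/{path.lstrip('/')}"


def build_session(conf, app_name="lakefs-s3a-sketch"):
    """Create a SparkSession with the given settings (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily; only needed here

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With a running lakeFS instance, a read from a branch might then look like `build_session(lakefs_s3a_conf(endpoint, key, secret)).read.parquet(lakefs_s3a_uri("my-repo", "main", "data/file.parquet"))`, where the repository, branch, and path are again made up for the example.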

@kesarwam kesarwam merged commit 5215aca into main Jul 18, 2024
1 check passed
@kesarwam kesarwam deleted the documentation-improvements branch July 18, 2024 20:59
Labels
documentation Improvements or additions to documentation

3 participants