Blog/apache spark #74
Conversation
✅ Deploy Preview for infraspec ready!
### 2. Spark Executors
Executors (processes) are the “workers” that actually do the processing. They take instructions from the driver, execute the tasks, and send back the results. Every Spark application gets its own set of executors, which run on different machines. They are responsible for completing the tasks, saving data, reporting results, and re-running any tasks that fail.
Fix the spacing between (process) and are.
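The driver–executor split is easy to see in a small job. A minimal PySpark sketch (assuming a local installation; `local[4]` simply simulates four executor cores in one process rather than a real multi-machine cluster):

```python
from pyspark.sql import SparkSession

# The driver process starts here: it builds the session and plans the job.
spark = (SparkSession.builder
         .appName("driver-executor-demo")
         .master("local[4]")
         .getOrCreate())
sc = spark.sparkContext

# The driver splits this collection into 8 partitions, one task each...
rdd = sc.parallelize(range(1, 1_000_001), numSlices=8)

# ...and the executors run the filter/map logic on their own partitions.
result = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x).sum()

# Executors ship partial results back to the driver, which combines them.
print(result)
spark.stop()
```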
That’s when I came across Apache Spark. It handles large datasets by distributing the work across multiple machines.
When I ran my first Spark job, I was honestly amazed. It processed my data way faster than Pandas. **The coolest part?** The driver–executor model. The driver assigns tasks, and the executors do the heavy lifting. If something goes wrong, Spark retries only the failed tasks instead of starting over, giving you fault tolerance and efficient distribution. Plus, it works well with cluster managers like **YARN** and **Kubernetes**, making it easy to scale up.
Grammar issue: driver-executor
Suggestion: driver–executor
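A hedged sketch of how this retry behaviour is tuned (assuming PySpark; `spark.task.maxFailures` is the real setting that caps per-task retry attempts, though it mainly takes effect on an actual cluster rather than in plain local mode):

```python
from pyspark.sql import SparkSession

# spark.task.maxFailures caps how many times one task may fail before
# the whole job is aborted (the default on a cluster is 4). Plain local
# mode does not retry tasks, so this setting matters most under a
# cluster manager such as YARN or Kubernetes.
spark = (SparkSession.builder
         .appName("retry-demo")
         .master("local[2]")
         .config("spark.task.maxFailures", "4")
         .getOrCreate())

df = spark.range(0, 1_000_000)

# If a task dies mid-job (lost executor, transient I/O error), only that
# task is rescheduled; partitions that already finished are not redone.
print(df.selectExpr("sum(id) AS total").collect())
spark.stop()
```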
### Cluster Managers and Storage Systems
Apache Spark can run on multiple cluster managers: Standalone, Spark’s built-in resource manager for small to medium-sized clusters; Hadoop YARN (Yet Another Resource Negotiator); Apache Mesos; and Kubernetes, for orchestrating Spark workloads in containerized environments.
Spelling issue: another
Suggestion: Another
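Switching cluster managers is mostly a one-line change to the master URL when the session is created. A sketch of the common options (the Kubernetes API-server address below is a placeholder, and the commented lines assume the matching cluster actually exists):

```python
from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("cluster-manager-demo")

# Standalone: Spark's built-in manager, spark://<master-host>:7077
# spark = builder.master("spark://master-host:7077").getOrCreate()

# Hadoop YARN: the master URL is simply "yarn"; cluster details come
# from the Hadoop configuration on the classpath.
# spark = builder.master("yarn").getOrCreate()

# Kubernetes: k8s:// followed by the API server address.
# spark = builder.master("k8s://https://kubernetes.example.com:6443").getOrCreate()

# Local fallback so this sketch runs anywhere:
spark = builder.master("local[*]").getOrCreate()
print(spark.sparkContext.master)
spark.stop()
```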
<h3 id="rdd"> RDDs (Resilient Distributed Datasets)</h3> | ||
|
||
They are the fundamental building block of Spark's older API, introduced in the Spark 1.x series. While RDDs are still available in Spark 2.x and beyond, they are no longer the default API due to the introduction of higher-level abstractions like DataFrames and Datasets. However, every operation in Spark ultimately gets compiled down to RDDs, making it important to understand their basics. The Spark UI also displays job execution details in terms of RDDs, so having a working knowledge of them is essential for debugging and optimization.
Spelling issue: more
Suggestion: More
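Even in modern Spark, the RDD API is still reachable through the `SparkContext`. A minimal word-count-style sketch to make the ideas above concrete (assuming PySpark):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("rdd-basics")
         .master("local[2]")
         .getOrCreate())
sc = spark.sparkContext

# Create an RDD from a Python collection; Spark records its lineage,
# which is what makes it "resilient": lost partitions are recomputed.
words = sc.parallelize(["spark", "rdd", "dataframe", "spark", "rdd", "spark"])

# Transformations (map, reduceByKey) are lazy; nothing runs yet.
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# collect() is an action: it triggers the job you would see in the Spark UI.
print(counts.collect())  # e.g. [('spark', 3), ('rdd', 2), ('dataframe', 1)]
spark.stop()
```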
LGTM 👍
Good work Shivani 👏
@arihant-2310 , I have made the requested changes. Thanks.