
Blog/apache spark #74

Merged: 23 commits into main on Feb 7, 2025

Conversation

ShivaniBanke (Collaborator)

No description provided.


netlify bot commented Jan 24, 2025

Deploy Preview for infraspec ready!

🔨 Latest commit: c648793
🔍 Latest deploy log: https://app.netlify.com/sites/infraspec/deploys/67a5e6b33d736e00083434a5
😎 Deploy Preview: https://deploy-preview-74--infraspec.netlify.app


### 2. Spark Executors

Executors (process)are the “workers” that actually do the processing. They take instructions from the driver, execute the tasks, and send back the results. Every Spark application gets its own set of executors, which run on different machines. They are responsible for completing the tasks, saving data, reporting results, and re-running any tasks that fail.
Fix the spacing between (process) and are.
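
To make the executor description above concrete, here is a minimal PySpark sketch; the executor count, memory, and core settings are illustrative values, not figures from the post:

```python
from pyspark.sql import SparkSession

# Illustrative request: 4 executors, each with 2 cores and 2 GB of memory.
spark = (
    SparkSession.builder
    .appName("executor-demo")
    .config("spark.executor.instances", "4")
    .config("spark.executor.memory", "2g")
    .config("spark.executor.cores", "2")
    .getOrCreate()
)

# The driver defines the job; the executors run its tasks in parallel
# and send their partial results back to the driver.
total = spark.range(1_000_000).selectExpr("sum(id) AS total").collect()
print(total)

spark.stop()
```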


That’s when I came across Apache Spark. It handles large datasets by distributing the work across multiple machines.

When I ran my first Spark job, I was honestly amazed. It processed my data way faster than Pandas. **The coolest part?** The driver-executor model. The driver assigns tasks, and the executors do the heavy lifting. If something goes wrong, Spark retries only the failed tasks instead of starting over, offering fault tolerance, and efficient distribution. Plus, it works well with cluster managers like **YARN** and **Kubernetes**, making it easy to scale up.
Grammar issue: driver-executor
Suggestion: driver–executor
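
As a rough illustration of the retry behaviour mentioned above, the sketch below sets Spark's task-retry limit and triggers a distributed action; the file path is a placeholder, and the retry limit shown simply mirrors Spark's default of 4:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("fault-tolerance-demo")
    .config("spark.task.maxFailures", "4")  # retry a failing task up to 4 times
    .getOrCreate()
)

# Runs on the driver: this only builds a plan, nothing is computed yet.
df = spark.read.csv("data.csv", header=True, inferSchema=True)  # placeholder path

# The action ships tasks to the executors; a failed task is retried
# individually instead of restarting the whole job.
print(df.count())

spark.stop()
```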


### Cluster Managers and Storage Systems

Apache Spark can run on multiple cluster managers like, Standalone which is Spark’s built-in resource manager for small to medium-sized clusters, Hadoop YARN (Yet another resource navigator), Apache Mesos, Kubernetes for orchestrating Spark workloads in containerized environments.
Spelling issue: another
Suggestion: Another
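
For reference, the cluster manager is chosen through the master URL passed to the session; the sketch below lists the common forms (host names and ports are placeholders) and starts a local session:

```python
from pyspark.sql import SparkSession

# Master URLs for the managers mentioned above (hosts/ports are placeholders):
#   local[*]                      - no cluster manager, everything in one process
#   spark://master-host:7077      - Spark's built-in Standalone manager
#   yarn                          - Hadoop YARN
#   mesos://master-host:5050      - Apache Mesos
#   k8s://https://api-server:6443 - Kubernetes
spark = (
    SparkSession.builder
    .appName("cluster-manager-demo")
    .master("local[*]")  # swap for one of the URLs above on a real cluster
    .getOrCreate()
)
```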


<h3 id="rdd"> RDDs (Resilient Distributed Datasets)</h3>

They are the fundamental building block of Spark's older API, introduced in the Spark 1.x series. While RDDs are still available in Spark 2.x and beyond, they are no longer the default API due to the introduction of higher-level abstractions like DataFrames and Datasets. However, every operation in Spark ultimately gets compiled down to RDDs, making it important to understand their basics. The Spark UI also displays job execution details in terms of RDDs, so having a working knowledge of them is essential for debugging and optimization.
Spelling issue: more
Suggestion: More
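
To ground the RDD section, here is a small PySpark example (the values are arbitrary) showing an RDD built from a local collection, two lazy transformations, and the action that actually launches the job you would then see in the Spark UI:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-basics").getOrCreate()
sc = spark.sparkContext

# An RDD built from a local collection, split into 4 partitions.
numbers = sc.parallelize(range(10), numSlices=4)

# Transformations are lazy: they only describe the lineage.
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# The action triggers the job that shows up in the Spark UI.
print(evens.collect())  # [0, 4, 16, 36, 64]

spark.stop()
```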

@arihant-2310 (Contributor) left a comment:

LGTM 👍

Good work Shivani 👏

@ShivaniBanke (Collaborator, Author)

@arihant-2310, I have made the requested changes.

@ShivaniBanke (Collaborator, Author)

> LGTM 👍
>
> Good work Shivani 👏

Thanks

@ShivaniBanke merged commit 5d272e4 into main on Feb 7, 2025
5 checks passed