Blog/apache spark #74
Conversation
✅ Deploy Preview for infraspec ready!
### 2. Spark Executors
Executors (processes) are the “workers” that actually do the processing. They take instructions from the driver, execute the tasks, and send back the results. Every Spark application gets its own set of executors, which run on different machines. They are responsible for completing the tasks, saving data, reporting results, and re-running any tasks that fail.
Fix the spacing between (process) and are.
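The driver–executor split is easy to see in a small job. A minimal PySpark sketch (assuming a local installation; `local[4]` simply simulates four executor cores in one process rather than a real multi-machine cluster):

```python
from pyspark.sql import SparkSession

# The driver process starts here: it builds the session and plans the job.
spark = (SparkSession.builder
         .appName("driver-executor-demo")
         .master("local[4]")
         .getOrCreate())
sc = spark.sparkContext

# The driver splits this collection into 8 partitions, one task each...
rdd = sc.parallelize(range(1, 1_000_001), numSlices=8)

# ...and the executors run the filter/map logic on their own partitions.
result = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x).sum()

# Executors ship partial results back to the driver, which combines them.
print(result)
spark.stop()
```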
That’s when I came across Apache Spark. It handles large datasets by distributing the work across multiple machines.
When I ran my first Spark job, I was honestly amazed. It processed my data way faster than Pandas. **The coolest part?** The driver–executor model. The driver assigns tasks, and the executors do the heavy lifting. If something goes wrong, Spark retries only the failed tasks instead of starting over, giving you fault tolerance and efficient distribution. Plus, it works well with cluster managers like **YARN** and **Kubernetes**, making it easy to scale up.
Grammar issue: driver-executor
Suggestion: driver–executor
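A hedged sketch of how this retry behaviour is tuned (assuming PySpark; `spark.task.maxFailures` is the real setting that caps per-task retry attempts, though it mainly takes effect on an actual cluster rather than in plain local mode):

```python
from pyspark.sql import SparkSession

# spark.task.maxFailures caps how many times one task may fail before
# the whole job is aborted (the default on a cluster is 4). Plain local
# mode does not retry tasks, so this setting matters most under a
# cluster manager such as YARN or Kubernetes.
spark = (SparkSession.builder
         .appName("retry-demo")
         .master("local[2]")
         .config("spark.task.maxFailures", "4")
         .getOrCreate())

df = spark.range(0, 1_000_000)

# If a task dies mid-job (lost executor, transient I/O error), only that
# task is rescheduled; partitions that already finished are not redone.
print(df.selectExpr("sum(id) AS total").collect())
spark.stop()
```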
### Cluster Managers and Storage Systems
Apache Spark can run on multiple cluster managers: Standalone, Spark’s built-in resource manager for small to medium-sized clusters; Hadoop YARN (Yet Another Resource Negotiator); Apache Mesos; and Kubernetes, for orchestrating Spark workloads in containerized environments.
Spelling issue: another
Suggestion: Another
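Switching cluster managers is mostly a one-line change to the master URL when the session is created. A sketch of the common options (the Kubernetes API-server address below is a placeholder, and the commented lines assume the matching cluster actually exists):

```python
from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("cluster-manager-demo")

# Standalone: Spark's built-in manager, spark://<master-host>:7077
# spark = builder.master("spark://master-host:7077").getOrCreate()

# Hadoop YARN: the master URL is simply "yarn"; cluster details come
# from the Hadoop configuration on the classpath.
# spark = builder.master("yarn").getOrCreate()

# Kubernetes: k8s:// followed by the API server address.
# spark = builder.master("k8s://https://kubernetes.example.com:6443").getOrCreate()

# Local fallback so this sketch runs anywhere:
spark = builder.master("local[*]").getOrCreate()
print(spark.sparkContext.master)
spark.stop()
```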
<h3 id="rdd"> RDDs (Resilient Distributed Datasets)</h3> | ||
|
||
They are the fundamental building block of Spark's older API, introduced in the Spark 1.x series. While RDDs are still available in Spark 2.x and beyond, they are no longer the default API due to the introduction of higher-level abstractions like DataFrames and Datasets. However, every operation in Spark ultimately gets compiled down to RDDs, making it important to understand their basics. The Spark UI also displays job execution details in terms of RDDs, so having a working knowledge of them is essential for debugging and optimization.
Spelling issue: more
Suggestion: More
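Even in modern Spark, the RDD API is still reachable through the `SparkContext`. A minimal word-count-style sketch to make the ideas above concrete (assuming PySpark):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("rdd-basics")
         .master("local[2]")
         .getOrCreate())
sc = spark.sparkContext

# Create an RDD from a Python collection; Spark records its lineage,
# which is what makes it "resilient": lost partitions are recomputed.
words = sc.parallelize(["spark", "rdd", "dataframe", "spark", "rdd", "spark"])

# Transformations (map, reduceByKey) are lazy; nothing runs yet.
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# collect() is an action: it triggers the job you would see in the Spark UI.
print(counts.collect())  # e.g. [('spark', 3), ('rdd', 2), ('dataframe', 1)]
spark.stop()
```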
LGTM 👍
Good work Shivani 👏
@arihant-2310 , I have made the requested changes. Thanks.