Apache-Spark-for-Data-Engineers

This is a code base that serves as accompanying content for the Spark training prepared and taught by David Vrba.

Training Description

This training focuses on three areas of Spark SQL. The first is Spark internals, with a detailed description of what happens under the hood when a query is sent for execution. The second is optimization based on a proper understanding of the query plan. The third is data layout in the file system, that is, how to use Spark to prepare data for analytical queries.
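To give a flavour of the first area, here is a toy sketch of a rule-based rewrite of an expression tree, in the spirit of a constant-folding optimizer rule. It is plain Python and does not use Spark; the class names (`Literal`, `Column`, `Add`) and the function `fold_constants` are hypothetical and are not Catalyst's actual API.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


# Hypothetical miniature expression tree, not Spark's real classes.
Expr = Union[Literal, Column, Add]


def fold_constants(expr: Expr) -> Expr:
    """Rewrite the tree bottom-up, replacing Add(Literal, Literal) with one Literal."""
    if isinstance(expr, Add):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Literal) and isinstance(right, Literal):
            return Literal(left.value + right.value)
        return Add(left, right)
    return expr  # literals and columns are leaves, nothing to rewrite


# Expression for: price + (1 + 2)
plan = Add(Column("price"), Add(Literal(1), Literal(2)))
optimized = fold_constants(plan)
print(optimized)
```

The rewritten tree is `price + 3`: the constant sub-expression is evaluated once at planning time instead of once per row. A real optimizer applies many such rules repeatedly until the plan stops changing.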

Training Format

6 hours
50% theory, 50% hands-on
Language: Scala or Python

Objectives of the training

Learn how the Spark SQL module works under the hood
Learn how the optimization engine works in Spark
Understand what happens under the hood when you send a query for execution
Understand query plans and use that information to optimize queries
Learn advanced optimization techniques to achieve high performance
Learn how to prepare data for analytical queries

Training Outline

Spark SQL internals - Query Execution
- Logical Planning
  - Catalyst API
  - Analyzer
  - Cache Manager
  - Optimizer
  - Rules
  - Extending the optimizer
  - Limiting the optimizer
- Physical Planning
  - Spark Plan (Query Planner, Strategies)
  - Executed Plan (Preparation rules)
  - Understanding operators in Physical Plan
- Cost Based Optimization
  - How CBO works
  - Statistics Collection
  - Statistics Usage
Lab I
Query Optimization & Performance tuning
- Shuffle elimination
  - Bucketing
  - Data repartition (when and how)
- Optimizing joins
  - One-side shuffle-free join
  - Broadcast join vs Sort-Merge join
- Data Reuse
  - Caching
  - Checkpointing
  - Reuse Exchange
- Optimization tips
- Shuffle partitions
Lab II
Data Layout
- Different File Formats
- Partitioning
- Bucketing
Lab III
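
As an illustration of the bucketing topic in the outline above, here is a toy sketch of why two datasets bucketed the same way can be joined without moving rows between buckets. It is plain Python, not Spark code; `NUM_BUCKETS`, `bucket_id`, `bucketize` and `bucketed_join` are hypothetical names, and `zlib.crc32` is only a stand-in for the hash function Spark uses internally.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, chosen only for this example


def bucket_id(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a key to a bucket. crc32 is a stand-in, not Spark's hash function."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def bucketize(rows):
    """Group (key, value) rows by the bucket of their key."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_id(row[0])].append(row)
    return buckets


def bucketed_join(left_rows, right_rows):
    """Inner join on the key, matching rows only within the same bucket."""
    left = bucketize(left_rows)
    right = bucketize(right_rows)
    joined = []
    for b in range(NUM_BUCKETS):
        # Equal keys always hash to the same bucket, so bucket b of the left
        # side only ever needs to see bucket b of the right side.
        lookup = defaultdict(list)
        for key, value in right.get(b, []):
            lookup[key].append(value)
        for key, value in left.get(b, []):
            for other in lookup.get(key, []):
                joined.append((key, value, other))
    return joined


users = [("u1", "Alice"), ("u2", "Bob"), ("u3", "Carol")]
orders = [("u1", 30), ("u3", 12), ("u1", 7), ("u4", 99)]
result = sorted(bucketed_join(users, orders))
print(result)
```

Because both sides are grouped by the same function and the same bucket count, each bucket pair can be joined independently. That is the property that lets a query engine skip the shuffle when both tables were written with matching bucketing.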

For more information about the training, you can contact the lecturer directly via LinkedIn.
