This is a code base that serves as accompanying content for the Spark Training prepared and lectured by David Vrba.
The training is oriented on three areas of Spark SQL. The first area is Spark internals, with a detailed description of what happens under the cover when a query is sent for execution. The second area is optimizations based on a proper understanding of the query plan. The last area is data layout in the file system, i.e. how to prepare data with Spark for analytical queries.
- 6 hours
- 50% theory, 50% hands-on
- Language: Scala or Python
- Learn how Spark SQL module works under the hood
- Learn how the optimization engine works in Spark
- Understand what is happening under the cover when you send a query for execution
- Understand query plans and use that information to optimize queries
- Learn advanced optimization techniques to achieve high performance
- Learn how to prepare data for analytical queries
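
To make the "query plans" part of these objectives concrete, here is a minimal PySpark sketch (hypothetical data, not part of the training labs; it assumes Spark 3.x for the `mode` argument) that prints the plans Spark builds when a query is sent for execution:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("query-plans").getOrCreate()

# Hypothetical example data; the labs in this repo use their own data sets.
events = spark.createDataFrame(
    [(1, "click"), (2, "view"), (1, "click")],
    ["user_id", "event_type"],
)

agg = events.groupBy("event_type").agg(F.count("*").alias("cnt"))

# mode="extended" prints the parsed, analyzed and optimized logical plans
# together with the physical plan selected by the query planner.
agg.explain(mode="extended")
```
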
- Spark SQL internals - Query Execution
  - Logical Planning
    - Catalyst API
    - Analyzer
    - Cache Manager
    - Optimizer
      - Rules
      - Extending the optimizer
      - Limiting the optimizer
  - Physical Planning
    - Spark Plan (Query Planner, Strategies)
    - Executed Plan (Preparation rules)
    - Understanding operators in Physical Plan
  - Cost Based Optimization
    - How CBO works
    - Statistics Collection
    - Statistics Usage
  - Lab I
- Query Optimization & Performance Tuning
  - Shuffle elimination
  - Bucketing
  - Data repartitioning (when and how)
  - Optimizing joins
    - One-side shuffle-free join
    - Broadcast join vs Sort-Merge join
  - Data Reuse
    - Caching
    - Checkpointing
    - Reuse Exchange
  - Optimization tips
    - Shuffle partitions
    - Shuffle elimination
  - Lab II
- Data Layout
  - Different File Formats
  - Partitioning
  - Bucketing
  - Lab III
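
As a small illustration of two of the optimization topics above (broadcast join vs sort-merge join and the shuffle-partitions setting), the following PySpark sketch uses hypothetical tables and settings; the Lab II notebooks work with the repo's own data sets:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("join-optimization").getOrCreate()

# Shuffle partitions: the default of 200 is often tuned to the data volume.
spark.conf.set("spark.sql.shuffle.partitions", "64")

# Turn automatic broadcasting off so the difference between the two
# join strategies is visible in the plans below.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

# Hypothetical inputs: a large fact table and a small dimension table.
orders = spark.range(1_000_000).withColumn("country_id", F.col("id") % 100)
countries = (
    spark.range(100)
    .withColumnRenamed("id", "country_id")
    .withColumn("name", F.concat(F.lit("country_"), F.col("country_id").cast("string")))
)

# Sort-merge join: both sides are shuffled (Exchange) and sorted on the join key.
orders.join(countries, "country_id").explain()

# Broadcast hash join: the hint ships the small side to every executor,
# which eliminates the shuffle of the large side.
orders.join(F.broadcast(countries), "country_id").explain()
```
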
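For the Data Layout part, here is a minimal sketch of the two layout techniques listed above, partitioning and bucketing, applied when writing out a table (again with hypothetical names and paths):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data-layout").getOrCreate()

# Hypothetical fact table, same shape as in the previous sketch.
orders = spark.range(1_000_000).withColumn("country_id", F.col("id") % 100)

# Partitioning: one directory per value of the partition column, so queries
# filtering on country_id can skip whole directories (partition pruning).
(orders.write
    .mode("overwrite")
    .partitionBy("country_id")
    .parquet("/tmp/orders_partitioned"))

# Bucketing: rows are distributed into a fixed number of buckets by key and
# stored as a table, so later joins and aggregations on that key can avoid a shuffle.
(orders.write
    .mode("overwrite")
    .bucketBy(16, "country_id")
    .sortBy("country_id")
    .saveAsTable("orders_bucketed"))
```
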
For more information about the training, you can contact the lecturer directly via LinkedIn.