This code base serves as accompanying content for the Spark Training prepared and lectured by David Vrba. It is a two-day course that covers Apache Spark from three different perspectives:
- The core part: the programming interface of the DataFrame API (in Spark 2.4)
- The internal processes of Spark SQL and the execution layer, together with various performance tips
- APIs of ML Pipelines and GraphFrames for advanced analytics
The course is offered in Python; a Scala version is being prepared. The Python version is taught in a Jupyter notebook environment, while the Scala version uses Apache Zeppelin. See the installation notes for the complete stack used throughout the course.

Course format:
- 2 days
- 50% theory, 50% hands-on
- Language: Python
What you will learn:
- Basic concepts of Apache Spark and distributed computing
- How to use the DataFrame API in Spark for ETL jobs and ad hoc data analysis
- How the DataFrame API works under the hood
- How the optimization engine in Spark works
- What happens under the covers when you send a query for execution
- How a Spark application is executed
- How to understand query plans and use that information to optimize queries
- Basic concepts of the ML Pipelines and GraphFrames libraries
- How to use these libraries for advanced analytics
- Basic concepts of Structured Streaming in Spark
- What's new in Spark 2.2, 2.3, and 2.4
Course outline:
- Introduction to Apache Spark
  - High-level introduction to Spark
  - Introduction to Spark architecture
  - Spark APIs: high-level vs low-level vs internal APIs
- Structured APIs in Spark
  - Basic concepts of the DataFrame API
    - DataFrame, Row, Column
    - Operations in Spark SQL: transformations and actions
  - Working with DataFrames: creating a DataFrame and basic transformations
    - Working with different data types (Integer, String, Date, Timestamp, Boolean)
    - Filtering
    - Conditions
    - Dealing with null values
    - Joins
- Lab I
  - Simple ETL (see the code sketch below)
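
To give a flavor of the module above, here is a minimal PySpark sketch of the kind of transformations Lab I practices: reading data, filtering, handling nulls, joining, and writing the result. The paths and column names are hypothetical, not the ones used in the actual lab.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.appName("simple-etl-sketch").getOrCreate()

# Hypothetical extracts of the Stack Exchange data used in the course
users = spark.read.parquet("data/users")
answers = spark.read.parquet("data/answers")

result = (
    users
    .filter(f.col("reputation") > 100)                                         # filtering
    .withColumn("location", f.coalesce(f.col("location"), f.lit("unknown")))   # null handling
    .join(answers, on="user_id", how="left")                                   # join
    .select("user_id", "display_name", "creation_date", "score")
)

result.write.mode("overwrite").parquet("output/active_users")
```
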
- Advanced transformations with DataFrames
  - Aggregations and window functions
  - User-defined functions (UDFs)
  - Higher-order functions and complex data types (new in Spark 2.4)
- Lab II
  - Analyzing data using the DataFrame API (see the code sketch below)
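
A short sketch of the techniques from the advanced-transformations module: a window function, a Python UDF, and a Spark 2.4 higher-order function on an array column. The higher-order function is written as a SQL expression because the Python DSL wrappers for it arrived only later, in Spark 3.x; the data is invented for illustration.

```python
from pyspark.sql import SparkSession, functions as f
from pyspark.sql.window import Window
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("advanced-sketch").getOrCreate()

# Invented data: questions with a tag, a score, and an array of answer scores
df = spark.createDataFrame(
    [(1, "python", 10, [1, 2, 3]), (2, "python", 30, [4, 5]), (3, "scala", 20, [6])],
    ["question_id", "tag", "score", "answer_scores"],
)

# Window function: rank questions by score within each tag
w = Window.partitionBy("tag").orderBy(f.desc("score"))
ranked = df.withColumn("rank", f.row_number().over(w))

# Python UDF: runs row by row in a Python worker, opaque to the optimizer
@f.udf(IntegerType())
def score_bucket(score):
    return score // 10

# Higher-order function (new in Spark 2.4): transforms an array column
# element by element without a UDF
result = (
    ranked
    .selectExpr("*", "transform(answer_scores, x -> x + 1) AS shifted_scores")
    .withColumn("score_bucket", score_bucket("score"))
)
result.show()
```
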
- Introduction to internal processes in Spark SQL
  - Catalyst: the optimization engine in Spark
    - Logical planning
    - Physical planning
    - Cost-based optimizations
- Execution Layer
  - Introduction to low-level APIs: RDDs
  - Structure of a Spark job (stages, tasks, shuffle)
  - DAG Scheduler
  - Lifecycle of a Spark application
- Lab III
  - Spark UI (see the code sketch below)
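
As a small illustration of the two internals modules above, the snippet below shows how to inspect what Catalyst produces for a query and where the Spark UI fits in; the query itself is a made-up example.

```python
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.appName("plans-sketch").getOrCreate()

df = spark.range(10 ** 6).withColumn("bucket", f.col("id") % 10)
agg = df.filter(f.col("id") > 100).groupBy("bucket").count()

# Prints the parsed, analyzed, and optimized logical plans plus the
# physical plan, i.e. the output of the Catalyst phases covered above
agg.explain(extended=True)

# Running an action turns the plan into stages and tasks, which can be
# inspected in the Spark UI (http://localhost:4040 by default)
agg.collect()
```
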
- Introduction to performance tuning in Spark (see the code sketch after Lab IV below)
  - Data persistence: caching and checkpointing
  - The most common bottlenecks in Spark applications
  - Bucketing & partitioning
- Lab IV
  - Cost-based optimization and the metastore
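
The sketch below illustrates the data-persistence and layout techniques from the performance-tuning module: caching, bucketing, and partitioning. The paths, table name, and columns are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tuning-sketch").getOrCreate()

questions = spark.read.parquet("data/questions")  # hypothetical input path

# Caching: keep a DataFrame that is reused several times in memory
questions.cache()
questions.count()  # the first action materializes the cache

# Bucketing: pre-organize the data by the join key at write time so that
# later joins/aggregations on user_id can avoid a shuffle; bucketed
# tables must be saved through the catalog (saveAsTable)
(questions.write
    .bucketBy(50, "user_id")
    .sortBy("user_id")
    .mode("overwrite")
    .saveAsTable("questions_bucketed"))

# Partitioning: lay the files out by a column that is often filtered on,
# so that queries can prune whole directories at read time
(questions.write
    .partitionBy("creation_year")
    .mode("overwrite")
    .parquet("output/questions_partitioned"))
```
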
- Introduction to advanced analytics in Spark
  - Basic concepts of ML Pipelines (the native machine learning library)
  - Basic concepts of GraphFrames (a library for graph processing)
- Lab V
  - Machine learning & graph processing (see the code sketch below)
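
A compact sketch of the two libraries covered in this module: an ML Pipeline chaining feature transformers with an estimator, and a GraphFrame built from a vertex and an edge DataFrame. GraphFrames is an external package (added e.g. via --packages when starting Spark), and the tiny datasets here are invented.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression
from graphframes import GraphFrame  # external package, shipped separately

spark = SparkSession.builder.appName("analytics-sketch").getOrCreate()

# ML Pipelines: feature transformers and an estimator chained together
train = spark.createDataFrame(
    [("spark is fast", 1.0), ("my query is slow", 0.0)],
    ["text", "label"],
)
pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="features"),
    LogisticRegression(maxIter=10),
])
model = pipeline.fit(train)

# GraphFrames: a graph is just a vertex DataFrame and an edge DataFrame
vertices = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
edges = spark.createDataFrame([("a", "b", "answers")], ["src", "dst", "relationship"])
g = GraphFrame(vertices, edges)
g.inDegrees.show()
```
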
- Structured Streaming
  - Basic concepts of streaming in Spark
  - Stateful vs stateless transformations
  - What a watermark is and how to use it to close the state
- Lab VI
  - Structured Streaming API (see the code sketch below)
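
Finally, a sketch of a stateful streaming aggregation with a watermark, the mechanism used to close (drop) old state. The built-in rate source is used only because it needs no input data; a real job would read from files or Kafka.

```python
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# The rate source emits (timestamp, value) rows, handy for demos
events = (
    spark.readStream.format("rate").option("rowsPerSecond", 10).load()
    .withColumn("key", f.col("value") % 5)
)

# Stateful aggregation: counts per key in 1-minute event-time windows.
# The watermark declares how late data may arrive; state for windows
# older than (max event time seen - 10 minutes) can be finalized and dropped.
counts = (
    events
    .withWatermark("timestamp", "10 minutes")
    .groupBy(f.window("timestamp", "1 minute"), f.col("key"))
    .count()
)

query = (
    counts.writeStream
    .outputMode("update")
    .format("console")
    .start()
)
query.awaitTermination()
```
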
For more information about the training, you can contact the lecturer directly via LinkedIn.
The data used in the training is downloaded from the Stack Exchange database.