This code base serves as accompanying content for the Spark Training prepared and lectured by David Vrba. It is a two-day course that covers Apache Spark from three different perspectives:
- The core part: the programming interface of the DataFrame API (in Spark 2.4)
- The internal processes of Spark SQL and the execution layer, together with various performance tips
- APIs of ML Pipelines and GraphFrames for advanced analytics
The course is offered in Python; a Scala version is being prepared. The Python version is taught in a Jupyter notebook environment, while the Scala version uses Apache Zeppelin. See the installation notes for the complete stack used throughout the course.

Course format:
- 2 days
- 50% theory, 50% hands-on
- Language: Python
What you will learn:
- Basic concepts of Apache Spark and distributed computing
- How to use the DataFrame API in Spark for ETL jobs and ad hoc data analysis
- How the DataFrame API works under the hood
- How the optimization engine in Spark works
- What happens under the covers when you send a query for execution
- How a Spark application is executed
- How to understand query plans and use that information to optimize queries
- Basic concepts of the ML Pipelines and GraphFrames libraries
- How to use these libraries for advanced analytics
- Basic concepts of Structured Streaming in Spark
- What's new in Spark 2.2, 2.3, and 2.4
Course outline:
- Introduction to Apache Spark
  - High-level introduction to Spark
  - Introduction to Spark architecture
  - Spark APIs: high-level vs low-level vs internal APIs
- Structured APIs in Spark
  - Basic concepts of the DataFrame API
    - DataFrame, Row, Column
    - Operations in Spark SQL: transformations and actions
  - Working with DataFrames: creating a DataFrame and basic transformations
    - Working with different data types (Integer, String, Date, Timestamp, Boolean)
    - Filtering
    - Conditions
    - Dealing with null values
    - Joins
- Lab I
  - Simple ETL (see the code sketch below)
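
To give a flavor of the module above, here is a minimal PySpark sketch of the kind of transformations Lab I practices: reading data, filtering, handling nulls, joining, and writing the result. The paths and column names are hypothetical, not the ones used in the actual lab.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.appName("simple-etl-sketch").getOrCreate()

# Hypothetical extracts of the Stack Exchange data used in the course
users = spark.read.parquet("data/users")
answers = spark.read.parquet("data/answers")

result = (
    users
    .filter(f.col("reputation") > 100)                                         # filtering
    .withColumn("location", f.coalesce(f.col("location"), f.lit("unknown")))   # null handling
    .join(answers, on="user_id", how="left")                                   # join
    .select("user_id", "display_name", "creation_date", "score")
)

result.write.mode("overwrite").parquet("output/active_users")
```
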
- Advanced transformations with DataFrames
  - Aggregations and window functions
  - User-defined functions (UDFs)
  - Higher-order functions and complex data types (new in Spark 2.4)
- Lab II
  - Analyzing data using the DataFrame API (see the code sketch below)
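
A short sketch of the techniques from the advanced-transformations module: a window function, a Python UDF, and a Spark 2.4 higher-order function on an array column. The higher-order function is written as a SQL expression because the Python DSL wrappers for it arrived only later, in Spark 3.x; the data is invented for illustration.

```python
from pyspark.sql import SparkSession, functions as f
from pyspark.sql.window import Window
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("advanced-sketch").getOrCreate()

# Invented data: questions with a tag, a score, and an array of answer scores
df = spark.createDataFrame(
    [(1, "python", 10, [1, 2, 3]), (2, "python", 30, [4, 5]), (3, "scala", 20, [6])],
    ["question_id", "tag", "score", "answer_scores"],
)

# Window function: rank questions by score within each tag
w = Window.partitionBy("tag").orderBy(f.desc("score"))
ranked = df.withColumn("rank", f.row_number().over(w))

# Python UDF: runs row by row in a Python worker, opaque to the optimizer
@f.udf(IntegerType())
def score_bucket(score):
    return score // 10

# Higher-order function (new in Spark 2.4): transforms an array column
# element by element without a UDF
result = (
    ranked
    .selectExpr("*", "transform(answer_scores, x -> x + 1) AS shifted_scores")
    .withColumn("score_bucket", score_bucket("score"))
)
result.show()
```
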
- Introduction to internal processes in Spark SQL
  - Catalyst: the optimization engine in Spark
    - Logical planning
    - Physical planning
    - Cost-based optimizations
- Execution Layer
  - Introduction to low-level APIs: RDDs
  - Structure of a Spark job (stages, tasks, shuffle)
  - DAG Scheduler
  - Lifecycle of a Spark application
- Lab III
  - Spark UI (see the code sketch below)
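
As a small illustration of the two internals modules above, the snippet below shows how to inspect what Catalyst produces for a query and where the Spark UI fits in; the query itself is a made-up example.

```python
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.appName("plans-sketch").getOrCreate()

df = spark.range(10 ** 6).withColumn("bucket", f.col("id") % 10)
agg = df.filter(f.col("id") > 100).groupBy("bucket").count()

# Prints the parsed, analyzed, and optimized logical plans plus the
# physical plan, i.e. the output of the Catalyst phases covered above
agg.explain(extended=True)

# Running an action turns the plan into stages and tasks, which can be
# inspected in the Spark UI (http://localhost:4040 by default)
agg.collect()
```
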
- Introduction to performance tuning in Spark (see the code sketch after Lab IV below)
  - Data persistence: caching and checkpointing
  - The most common bottlenecks in Spark applications
  - Bucketing & partitioning
- Lab IV
  - Cost-based optimization and the metastore
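
The sketch below illustrates the data-persistence and layout techniques from the performance-tuning module: caching, bucketing, and partitioning. The paths, table name, and columns are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tuning-sketch").getOrCreate()

questions = spark.read.parquet("data/questions")  # hypothetical input path

# Caching: keep a DataFrame that is reused several times in memory
questions.cache()
questions.count()  # the first action materializes the cache

# Bucketing: pre-organize the data by the join key at write time so that
# later joins/aggregations on user_id can avoid a shuffle; bucketed
# tables must be saved through the catalog (saveAsTable)
(questions.write
    .bucketBy(50, "user_id")
    .sortBy("user_id")
    .mode("overwrite")
    .saveAsTable("questions_bucketed"))

# Partitioning: lay the files out by a column that is often filtered on,
# so that queries can prune whole directories at read time
(questions.write
    .partitionBy("creation_year")
    .mode("overwrite")
    .parquet("output/questions_partitioned"))
```
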
- Introduction to advanced analytics in Spark
  - Basic concepts of ML Pipelines (the native machine learning library)
  - Basic concepts of GraphFrames (a library for graph processing)
- Lab V
  - Machine learning & graph processing (see the code sketch below)
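
A compact sketch of the two libraries covered in this module: an ML Pipeline chaining feature transformers with an estimator, and a GraphFrame built from a vertex and an edge DataFrame. GraphFrames is an external package (added e.g. via --packages when starting Spark), and the tiny datasets here are invented.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression
from graphframes import GraphFrame  # external package, shipped separately

spark = SparkSession.builder.appName("analytics-sketch").getOrCreate()

# ML Pipelines: feature transformers and an estimator chained together
train = spark.createDataFrame(
    [("spark is fast", 1.0), ("my query is slow", 0.0)],
    ["text", "label"],
)
pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="features"),
    LogisticRegression(maxIter=10),
])
model = pipeline.fit(train)

# GraphFrames: a graph is just a vertex DataFrame and an edge DataFrame
vertices = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
edges = spark.createDataFrame([("a", "b", "answers")], ["src", "dst", "relationship"])
g = GraphFrame(vertices, edges)
g.inDegrees.show()
```
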
- Structured Streaming
  - Basic concepts of streaming in Spark
  - Stateful vs stateless transformations
  - What a watermark is and how to use it to close the state
- Lab VI
  - Structured Streaming API (see the code sketch below)
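
Finally, a sketch of a stateful streaming aggregation with a watermark, the mechanism used to close (drop) old state. The built-in rate source is used only because it needs no input data; a real job would read from files or Kafka.

```python
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# The rate source emits (timestamp, value) rows, handy for demos
events = (
    spark.readStream.format("rate").option("rowsPerSecond", 10).load()
    .withColumn("key", f.col("value") % 5)
)

# Stateful aggregation: counts per key in 1-minute event-time windows.
# The watermark declares how late data may arrive; state for windows
# older than (max event time seen - 10 minutes) can be finalized and dropped.
counts = (
    events
    .withWatermark("timestamp", "10 minutes")
    .groupBy(f.window("timestamp", "1 minute"), f.col("key"))
    .count()
)

query = (
    counts.writeStream
    .outputMode("update")
    .format("console")
    .start()
)
query.awaitTermination()
```
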
For more information about the training, you can contact the lecturer directly via LinkedIn.
The data used in the training is downloaded from the Stack Exchange database.