This code base is the accompanying content for the Spark training prepared and taught by David Vrba.
Structured APIs are the recommended way to process structured or semi-structured data in current versions of Spark. This course is an introduction to Apache Spark and its structured APIs. It focuses on Spark SQL and mainly covers the programming interface of the DataFrame API, from basic concepts (simple DataFrame transformations) to more advanced features such as user-defined functions, advanced aggregations with window functions, and complex data types with higher-order functions.
- 6 hours
- 50% theory, 50% hands-on
- Language: Scala or Python
- Understand basic concepts of Apache Spark and distributed computing
- Learn how to use the DataFrame API in Spark for ETL jobs or ad hoc data analysis
- Learn advanced features of the DataFrame API
  - Aggregation and Window functions
  - User Defined Functions
  - Higher Order Functions with complex data types
- Understand the information shown in the Spark UI
- Introduction to Apache Spark
  - High level introduction to Spark
  - Introduction to Spark architecture
  - Spark APIs: high level vs low level vs internal APIs
  - Structured APIs in Spark
- Basic concepts of the DataFrame API
  - DataFrame, Row, Column
  - Operations in Spark SQL: transformations, actions
  - Working with DataFrame: creating a DataFrame and basic transformations
  - Working with different data types (Integer, String, Date, Timestamp, Boolean)
  - Filtering
  - Conditions
  - Dealing with null values
  - Joins
  - Lab I
- Advanced transformations with DataFrames
  - Aggregations and Window functions
  - Lab II
  - User Defined Functions
  - Lab III
  - Higher Order Functions and complex data types (new in Spark 2.4)
  - Lab IV
- Spark UI
  - Understand the information in the Spark UI
- No prior knowledge of Spark is required
- Basic knowledge of Python or Scala
For more information about the training, contact the lecturer directly via LinkedIn.