PS: This assignment is not mandatory if you think you have collected enough points. Though, it is always good to practice more!
You will implement various ways of using Spark. The goal of this assignment is to exercise the core functionality of Spark.
Hint: All the answers to the tasks given below can be found in the Spark Python documentation.
Spark SQL and the Spark DataFrame API can be used in the following ways.

To do the tasks with Spark SQL, write SQL queries and execute them with Spark SQL, for example:
```python
your_query = 'select * from tablename'
sparkSession.sql(your_query).show()
```
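For example, here is a minimal sketch, assuming your SparkSession is stored in a variable called `spark` and a DataFrame named `reviews_df` has already been read (the column name `overall` comes from the Amazon review data and is an assumption if you use another dataset). The DataFrame is registered as a temporary view so that it can be queried by name:

```python
# Assumption: reviews_df is a DataFrame you have already read, e.g. with spark.read.json(...)
reviews_df.createOrReplaceTempView("reviews")   # make the DataFrame queryable by name

your_query = "SELECT overall, COUNT(*) AS n_reviews FROM reviews GROUP BY overall"
spark.sql(your_query).show()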
To do the tasks with the Spark DataFrame API, you need to use `pyspark.sql.DataFrame` functions like the following:
```python
data.select(...).groupBy(...)
```
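For instance, a rough equivalent of the SQL query above using the DataFrame API (again, `reviews_df` and the column name are assumptions):

```python
from pyspark.sql import functions as F

(reviews_df
    .groupBy("overall")                       # group by the rating column
    .agg(F.count("*").alias("n_reviews"))     # count rows per rating
    .show())
```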
You can use the Amazon Product Review Dataset (either the full dataset or the smallest subset), or any other dataset that you feel comfortable with. You may need at least two datasets to be able to perform some operations, such as joins.
For the Amazon product review dataset, the review and the metadata files could serve as your two datasets.
You can also use small and simple datasets or create the datasets on your own, since some of the tasks may require multiple datasets. If you are using other datasets, please provide a way to download them in the notebook. Remember, your notebooks should be reproducible, meaning that when you run them, they should have everything needed.
PS: Data files should not be uploaded to Git unless they are really small and for sample use only.
In this assignment, you need to do the following.
Create a notebook where you read a dataset, explore the dataset in various ways, extract some statistics, and showcase your results.
Your notebook should contain the following headings.
- Explore configuration options on SparkSession
- Explain the difference between `SQLContext`, `SparkSession`, `SparkContext`, and `SparkConf`.
- Use these classes in an example.
- Exploring the data with Spark SQL
- Exploring the data with Spark DataFrame
- Find at least 5 different Spark configuration options and initiate your Spark session based on these options. An example of one way of setting a config option is the following (a fuller sketch appears after this list):

  `SparkSession.builder.config("spark.some.config.option", "some-value")`
- Use each of the `SQLContext`, `SparkSession`, `SparkContext`, and `SparkConf` classes and explain in detail how they differ and when to use which one. Give examples, use these classes in your examples, and compare them (see the second sketch after this list).
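As a starting point for the configuration task, here is one possible way to build a SparkSession with several options set. The specific option values are assumptions for a small local run, and some settings (such as driver memory) only take effect if they are applied before the underlying JVM starts:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-assignment")                          # sets spark.app.name
    .master("local[*]")                                   # run locally on all cores
    .config("spark.sql.shuffle.partitions", "8")          # fewer shuffle partitions for small data
    .config("spark.sql.session.timeZone", "UTC")          # consistent timestamp handling
    .config("spark.sql.repl.eagerEval.enabled", "true")   # nicer DataFrame display in notebooks
    .config("spark.driver.memory", "2g")                  # driver heap size (must be set before JVM start)
    .getOrCreate()
)

# Inspect the options that are actually in effect.
print(spark.sparkContext.getConf().getAll())
```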
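And here is a minimal sketch that touches all four classes; treat it as an illustration of how they relate rather than a recommended setup, since in recent Spark versions `SQLContext` is kept mainly for backward compatibility:

```python
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, SQLContext

# SparkConf: a plain container of key/value settings, prepared before anything starts.
conf = SparkConf().setAppName("class-comparison").setMaster("local[*]")

# SparkContext: the low-level entry point (RDDs, accumulators, broadcast variables).
sc = SparkContext.getOrCreate(conf)

# SparkSession: the unified entry point since Spark 2.x (DataFrames and SQL).
spark = SparkSession.builder.config(conf=conf).getOrCreate()

# SQLContext: the pre-2.x entry point for SQL/DataFrames, now wrapped by SparkSession.
sql_context = SQLContext(sc)   # deprecated in newer releases; prefer spark.sql(...)
```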
From the list below, do all of the operations.
- Read data as RDD
- Read data as DataFrame
- Convert RDD to Spark DataFrame
- Convert Spark DataFrame to RDD
- Convert Spark DataFrame to Pandas DataFrame
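A minimal sketch of these five operations, assuming the session variable `spark` and a hypothetical file path `data/reviews.json` (adjust both to your setup):

```python
from pyspark.sql import Row

# Read data as an RDD (here: an RDD of raw text lines).
lines_rdd = spark.sparkContext.textFile("data/reviews.json")

# Read data as a DataFrame.
reviews_df = spark.read.json("data/reviews.json")

# Convert an RDD to a Spark DataFrame (a tiny RDD of Rows for illustration).
toy_rdd = spark.sparkContext.parallelize([Row(id=1, score=5.0), Row(id=2, score=3.0)])
toy_df = spark.createDataFrame(toy_rdd)

# Convert a Spark DataFrame back to an RDD of Row objects.
rows_rdd = reviews_df.rdd

# Convert a Spark DataFrame to a pandas DataFrame (collects to the driver, so keep it small).
pandas_df = toy_df.toPandas()
```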
From the following tasks, you need to do at least 8 of them, in both ways (by writing SQL queries and by using `pyspark.sql.DataFrame` functions), to get the full points.
You must show the result of each task by displaying the first few rows of your dataset or, where rows are not the natural output, the aggregated result from your code.
- Select the first 10 rows of the dataset.
- Show the schema of the dataset.
- Group by and get max, min, count of a column in the dataset.
- Filter your dataset by some conditions based on your column.
- Apply group by with having clause.
- Apply order by.
- Apply inner join/ left join/ right join on your two tables.
- Select distinct records by a column.
- Register a user defined function to Spark and use it in your Spark SQL Query.
- Transform the data type of columns from int to string/ string to integer.
- Apply a min-max normalization on any numerical column.
- Apply a standard normalization on any numerical column.
- Transform a categorical column with OneHot Encoding.
- Use `pyspark.sql.DataFrameStatFunctions` to collect statistics from the data:
  - Get covariance
  - Get correlation
  - Get crosstab
  - Get freqItems
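To illustrate what "both ways" means, here is a hedged sketch of one task (group by with a having clause) done with a SQL query and with DataFrame functions. The column names `asin` and `overall` come from the Amazon review data and are assumptions if you use another dataset:

```python
from pyspark.sql import functions as F

reviews_df.createOrReplaceTempView("reviews")

# Spark SQL version: group by product, keep products with at least 10 reviews.
spark.sql("""
    SELECT asin, COUNT(*) AS n_reviews, AVG(overall) AS avg_rating
    FROM reviews
    GROUP BY asin
    HAVING COUNT(*) >= 10
    ORDER BY n_reviews DESC
""").show(10)

# DataFrame API version of the same task.
(reviews_df
    .groupBy("asin")
    .agg(F.count("*").alias("n_reviews"), F.avg("overall").alias("avg_rating"))
    .filter(F.col("n_reviews") >= 10)
    .orderBy(F.col("n_reviews").desc())
    .show(10))
```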
The following table gives the meaning of each file.
File | Description |
---|---|
README.md ** | A descriptive file that gives an introduction to the current project/assignment. |
Instructions.md ** | A copy of the current README.md file. |
LICENCE ** | The licence file that every project should have. |
.gitignore | The file that controls which files should be ignored by Git. |
*.ipynb | The assignment notebook with your Spark code. |
- I have completed all the tasks in the tasks section.
- I edited this README file and checkmarked the things I've completed in the tasks section.
- My notebook(s) are well organized with headings and comments that make them visually appealing.
- My notebook(s) contain the results of my execution.
- My notebook(s) are reproducible.
- I downloaded the final version of my repository and uploaded it to Blackboard!