PS: This assignment is not mandatory if you think you have collected enough points. Though, it is always good to practice more!
You will implement various ways of using Spark. The goal of this assignment is to exercise the core functionality of Spark.
Hint: All the answers to the tasks given below can be found in the Spark Python documentation.
Spark SQL and the Spark DataFrame API can be used in the following ways.

To do the tasks with Spark SQL, write SQL queries and execute them with Spark SQL, for example:
```python
your_query = 'select * from tablename'
sparkSession.sql(your_query).show()
```
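For example, here is a minimal sketch, assuming your SparkSession is stored in a variable called `spark` and a DataFrame named `reviews_df` has already been read (the column name `overall` comes from the Amazon review data and is an assumption if you use another dataset). The DataFrame is registered as a temporary view so that it can be queried by name:

```python
# Assumption: reviews_df is a DataFrame you have already read, e.g. with spark.read.json(...)
reviews_df.createOrReplaceTempView("reviews")   # make the DataFrame queryable by name

your_query = "SELECT overall, COUNT(*) AS n_reviews FROM reviews GROUP BY overall"
spark.sql(your_query).show()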
To do the tasks with the Spark DataFrame API, you need to use `pyspark.sql.DataFrame` functions like the following:
```python
data.select(...).groupBy(...)
```
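For instance, a rough equivalent of the SQL query above using the DataFrame API (again, `reviews_df` and the column name are assumptions):

```python
from pyspark.sql import functions as F

(reviews_df
    .groupBy("overall")                       # group by the rating column
    .agg(F.count("*").alias("n_reviews"))     # count rows per rating
    .show())
```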
You can use the Amazon Product Review Dataset (either the full dataset or the smallest subset), or any other dataset that you feel comfortable with. You may need at least two datasets to be able to perform some operations, such as joins.
For the Amazon product review dataset, the review and the metadata files could serve as your two datasets.
You can also use small and simple datasets or create the datasets on your own, since some of the tasks may require multiple datasets. If you are using other datasets, please provide a way to download them in the notebook. Remember, your notebooks should be reproducible, meaning that when you run them, they should have everything needed.
PS: Data files should not be uploaded to Git unless they are really small and for sample use only.
In this assignment, you need to do the following.
Create a notebook where you read a dataset, explore the dataset in various ways, extract some statistics, and showcase your results.
Your notebook should contain the following headings.
- Explore configuration options on SparkSession
- Explain the difference between `SQLContext`, `SparkSession`, `SparkContext`, and `SparkConf`.
- Use these classes in an example.
- Exploring the data with Spark SQL
- Exploring the data with Spark DataFrame
- Find at least 5 different Spark configuration options and initiate your Spark session based on these options. An example of one way of setting a config option is the following (a fuller sketch appears after this list):

  `SparkSession.builder.config("spark.some.config.option", "some-value")`
- Use each of the `SQLContext`, `SparkSession`, `SparkContext`, and `SparkConf` classes and explain in detail how they differ and when to use which one. Give examples, use these classes in your examples, and compare them (see the second sketch after this list).
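As a starting point for the configuration task, here is one possible way to build a SparkSession with several options set. The specific option values are assumptions for a small local run, and some settings (such as driver memory) only take effect if they are applied before the underlying JVM starts:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-assignment")                          # sets spark.app.name
    .master("local[*]")                                   # run locally on all cores
    .config("spark.sql.shuffle.partitions", "8")          # fewer shuffle partitions for small data
    .config("spark.sql.session.timeZone", "UTC")          # consistent timestamp handling
    .config("spark.sql.repl.eagerEval.enabled", "true")   # nicer DataFrame display in notebooks
    .config("spark.driver.memory", "2g")                  # driver heap size (must be set before JVM start)
    .getOrCreate()
)

# Inspect the options that are actually in effect.
print(spark.sparkContext.getConf().getAll())
```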
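And here is a minimal sketch that touches all four classes; treat it as an illustration of how they relate rather than a recommended setup, since in recent Spark versions `SQLContext` is kept mainly for backward compatibility:

```python
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, SQLContext

# SparkConf: a plain container of key/value settings, prepared before anything starts.
conf = SparkConf().setAppName("class-comparison").setMaster("local[*]")

# SparkContext: the low-level entry point (RDDs, accumulators, broadcast variables).
sc = SparkContext.getOrCreate(conf)

# SparkSession: the unified entry point since Spark 2.x (DataFrames and SQL).
spark = SparkSession.builder.config(conf=conf).getOrCreate()

# SQLContext: the pre-2.x entry point for SQL/DataFrames, now wrapped by SparkSession.
sql_context = SQLContext(sc)   # deprecated in newer releases; prefer spark.sql(...)
```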
From the list below, do all of the operations.
- Read data as RDD
- Read data as DataFrame
- Convert RDD to Spark DataFrame
- Convert Spark DataFrame to RDD
- Convert Spark DataFrame to Pandas DataFrame
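A minimal sketch of these five operations, assuming the session variable `spark` and a hypothetical file path `data/reviews.json` (adjust both to your setup):

```python
from pyspark.sql import Row

# Read data as an RDD (here: an RDD of raw text lines).
lines_rdd = spark.sparkContext.textFile("data/reviews.json")

# Read data as a DataFrame.
reviews_df = spark.read.json("data/reviews.json")

# Convert an RDD to a Spark DataFrame (a tiny RDD of Rows for illustration).
toy_rdd = spark.sparkContext.parallelize([Row(id=1, score=5.0), Row(id=2, score=3.0)])
toy_df = spark.createDataFrame(toy_rdd)

# Convert a Spark DataFrame back to an RDD of Row objects.
rows_rdd = reviews_df.rdd

# Convert a Spark DataFrame to a pandas DataFrame (collects to the driver, so keep it small).
pandas_df = toy_df.toPandas()
```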
From the following tasks, you need to do at least 8 of them, in both ways (by writing SQL queries and by using `pyspark.sql.DataFrame` functions), to get the full points.
You must show the result of each task by displaying the first few rows of your dataset or, where rows are not the natural output, the aggregated result from your code.
- Select the first 10 rows of the dataset.
- Show the schema of the dataset.
- Group by and get max, min, count of a column in the dataset.
- Filter your dataset by some conditions based on your column.
- Apply group by with having clause.
- Apply order by.
- Apply inner join/ left join/ right join on your two tables.
- Select distinct records by a column.
- Register a user defined function to Spark and use it in your Spark SQL Query.
- Transform the data type of columns from int to string/ string to integer.
- Apply a min-max normalization on any numerical column.
- Apply a standard normalization on any numerical column.
- Transform a categorical column with OneHot Encoding.
- Use `pyspark.sql.DataFrameStatFunctions` to collect statistics from the data:
  - Get covariance
  - Get correlation
  - Get crosstab
  - Get freqItems
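To illustrate what "both ways" means, here is a hedged sketch of one task (group by with a having clause) done with a SQL query and with DataFrame functions. The column names `asin` and `overall` come from the Amazon review data and are assumptions if you use another dataset:

```python
from pyspark.sql import functions as F

reviews_df.createOrReplaceTempView("reviews")

# Spark SQL version: group by product, keep products with at least 10 reviews.
spark.sql("""
    SELECT asin, COUNT(*) AS n_reviews, AVG(overall) AS avg_rating
    FROM reviews
    GROUP BY asin
    HAVING COUNT(*) >= 10
    ORDER BY n_reviews DESC
""").show(10)

# DataFrame API version of the same task.
(reviews_df
    .groupBy("asin")
    .agg(F.count("*").alias("n_reviews"), F.avg("overall").alias("avg_rating"))
    .filter(F.col("n_reviews") >= 10)
    .orderBy(F.col("n_reviews").desc())
    .show(10))
```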
The following table gives the meaning of each file.
File | Description |
---|---|
README.md ** | A descriptive file that gives an introduction to the current project/assignment. |
Instructions.md ** | A copy of the current README.md file. |
LICENCE ** | The licence file that every project should have. |
.gitignore | The file that controls which files should be ignored by Git. |
*.ipynb | The assignment notebook with your Spark code. |
- I have completed all the tasks in the tasks section.
- I edited this README file and checkmarked the things I've completed in the tasks section.
- My notebook(s) are well organized with headings and comments that make them visually appealing.
- My notebook(s) contain the results of my execution.
- My notebook(s) are reproducible.
- I downloaded the final version of my repository and uploaded it to Blackboard!