This project analyzes Wikimedia page view statistics using Apache Spark, implementing each task in both the map-reduce paradigm and a loop-based ("Spark loops") style so that the performance of the two approaches can be compared.
Ensure you have the following:
- A Google Colab account
- Apache Spark set up in Google Colab
- The Wikimedia pagecounts dataset (pagecounts-20160101-000000.gz), downloaded from here
Download the dataset and upload it to your Google Drive. The dataset should be accessible in your Colab environment.
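If the dataset is stored in Google Drive, one common way to make it visible to the Colab runtime is to mount the drive. A minimal sketch; the path assigned to `dataset_path` is a hypothetical example and should point to wherever the file was actually uploaded:

```python
from google.colab import drive

# Mount Google Drive into the Colab filesystem (prompts for authorization).
drive.mount('/content/drive')

# Hypothetical location of the dataset; adjust to your own Drive layout.
dataset_path = "/content/drive/MyDrive/data/pagecounts-20160101-000000.gz"
```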
```
wikimedia-page-view-analysis/
│
├── data/
│   └── pagecounts-20160101-000000.gz
├── src/
│   ├── spark_map_reduce.ipynb
│   └── spark_loops.ipynb
└── README.md
```
- Upload the dataset to your Google Drive.
- Set up Spark in your Colab environment using the following commands:

```bash
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.mirrors.tds.net/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz
!tar xf spark-3.1.1-bin-hadoop2.7.tgz
!pip install -q findspark
```
- Configure environment variables:

```python
import os

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop2.7"
```
- Initialize Spark:

```python
import findspark

findspark.init()
```
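After `findspark.init()`, the notebooks presumably create a Spark context before running the tasks. A minimal sketch, assuming a local-mode session (the application name is only an illustrative choice):

```python
from pyspark.sql import SparkSession

# Start a local Spark session using all cores of the Colab VM.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("wikimedia-page-view-analysis")  # illustrative name
         .getOrCreate())

# The RDD-based tasks below use the underlying SparkContext.
sc = spark.sparkContext
```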
Open the spark_map_reduce.ipynb notebook in the src/ directory in Colab and run the cells (an illustrative data-loading sketch follows the list):
- Compute min, max, and average page size
- Count page titles starting with "The"
- Determine unique terms in page titles
- Extract each title and number of times it was repeated
- Combine data of pages with the same title
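The exact cell contents live in the notebook; as a rough, hedged sketch, loading and parsing the pagecounts file into an RDD could look like the following. The field order follows the standard pagecounts layout (project, page title, view count, bytes transferred), and `dataset_path` and `sc` are the hypothetical names introduced in the setup sketches above:

```python
# Each line of a pagecounts dump has four space-separated fields:
# project, page_title, count_views, bytes_transferred.
lines = sc.textFile(dataset_path)  # Spark reads .gz files transparently

records = (lines.map(lambda line: line.split(" "))
                .filter(lambda parts: len(parts) == 4)  # drop malformed lines
                .map(lambda p: (p[0], p[1], int(p[2]), int(p[3]))))
```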
Open the spark_loops.ipynb notebook in the src/ directory in Colab and run the cells (an illustrative loop-style sketch follows the list):
- Compute min, max, and average page size
- Count page titles starting with "The"
- Determine unique terms in page titles
- Extract each title and number of times it was repeated
- Combine data of pages with the same title
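"Spark loops" is not a standard Spark term; the sketch below assumes one common loop-style reading, in which records are streamed to the driver with toLocalIterator() and the statistics are accumulated in plain Python (`records` is the parsed RDD from the loading sketch above):

```python
# Assumption: the loop variant walks over records on the driver and keeps
# running totals in ordinary Python variables instead of using RDD actions.
min_size = float("inf")
max_size = 0
total_size = 0
count = 0

for project, title, views, size in records.toLocalIterator():
    min_size = min(min_size, size)
    max_size = max(max_size, size)
    total_size += size
    count += 1

avg_size = total_size / count if count else 0.0
print(min_size, max_size, avg_size)
```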
For each task, the Map-Reduce and Spark Loops notes are paired below; an illustrative sketch of the map-reduce version follows each pair.

- Min, max, and average page size
  - Map-Reduce: The notebook uses Spark's RDD transformations and actions to compute the required statistics.
  - Spark Loops: The notebook iterates over the RDD using Spark's built-in functions to achieve the same results.
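As a rough sketch (not necessarily the notebook's exact code), the map-reduce version of this task can be expressed with numeric RDD actions over the size field; `records` is the parsed RDD from the loading sketch above:

```python
# Project out the page size (fourth field) and reduce to the statistics.
sizes = records.map(lambda r: r[3])

min_size = sizes.min()
max_size = sizes.max()
avg_size = sizes.mean()  # equivalently sizes.sum() / sizes.count()

print(min_size, max_size, avg_size)
```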
- Page titles starting with "The"
  - Map-Reduce: The notebook filters and counts titles starting with "The" and separately counts those that are not part of the English project.
  - Spark Loops: The same task is accomplished using loop constructs.
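A hedged sketch of the filter-and-count approach; the reading that "not part of the English project" means a project code other than "en" is an assumption, not something stated in the notebook:

```python
# Titles that start with "The".
the_titles = records.filter(lambda r: r[1].startswith("The"))
the_count = the_titles.count()

# Of those, pages whose project code is not English Wikipedia ("en").
# Assumption: "not part of the English project" == project != "en".
non_english_count = the_titles.filter(lambda r: r[0] != "en").count()

print(the_count, non_english_count)
```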
- Unique terms in page titles
  - Map-Reduce: The notebook splits titles into terms, normalizes them, and counts the unique terms.
  - Spark Loops: Terms are processed the same way using loops.
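One possible map-reduce formulation, assuming terms are separated by underscores (the convention in pagecounts titles) and that normalization simply means lower-casing:

```python
# Split titles on underscores, lower-case each term, and count distinct terms.
terms = (records.flatMap(lambda r: r[1].split("_"))
                .map(lambda t: t.lower())
                .filter(lambda t: t != ""))

unique_term_count = terms.distinct().count()
print(unique_term_count)
```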
- Each title and the number of times it was repeated
  - Map-Reduce: The notebook groups titles and counts their occurrences.
  - Spark Loops: The same is done using looping constructs.
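A sketch of the classic word-count pattern applied to titles (not necessarily the notebook's exact code):

```python
# Count how many times each title appears in the file.
title_counts = (records.map(lambda r: (r[1], 1))
                       .reduceByKey(lambda a, b: a + b))

# For example, the ten most repeated titles.
print(title_counts.takeOrdered(10, key=lambda kv: -kv[1]))
```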
- Combining data of pages with the same title
  - Map-Reduce: The notebook aggregates the data of pages that share a title.
  - Spark Loops: The task is achieved using loop constructs.
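A hedged sketch of one way to combine records that share a title, summing views and bytes per title with reduceByKey; exactly which fields the notebook combines is an assumption:

```python
# Key by title and sum views and sizes across all rows with that title.
combined = (records.map(lambda r: (r[1], (r[2], r[3])))
                   .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])))

print(combined.take(5))
```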
The results of the performance comparison for each task are produced when the two notebooks are run.
The project demonstrates how both the map-reduce and loop-based approaches can process a large dataset with Apache Spark, and the performance comparison highlights the advantages and trade-offs of each approach.