This project uses Apache Spark to analyse the "1.88 Million US Wildfires" dataset. It provides pipelines for feature engineering, model training, and evaluation on a Hadoop YARN cluster.
- SSH access to the cluster node `co246a-1`.
- `curl` and `unzip` installed on your local machine.
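If you want to confirm the local tools are available before starting, a quick check (not part of the project's scripts, just standard shell usage) is:

```bash
# Both tools should print a path; a missing tool prints nothing for that entry
command -v curl unzip
```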
The process has two steps: first, prepare the dataset locally; second, run the analysis on the cluster.
From your local machine, run the automated download and processing script. This will download the source data, generate the required CSV, and clean up intermediate files.
```bash
# Make the script executable (only needs to be done once)
chmod +x download_and_process.sh

# Run the script
./download_and_process.sh
```
This will create the final dataset at `data/wildfire_processed_no_leakage.csv`.
Note: If you already have the processed CSV, simply ensure it is in the correct location and skip this step.
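Before moving to the cluster, it is worth a quick sanity check that the file was actually created and is non-empty. The commands below only inspect the CSV named above; they are standard shell tools, not part of the project's scripts.

```bash
# Confirm the processed dataset exists, view its header row, and count lines
ls -lh data/wildfire_processed_no_leakage.csv
head -n 1 data/wildfire_processed_no_leakage.csv
wc -l data/wildfire_processed_no_leakage.csv
```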
- Connect to the Cluster:

  ```bash
  ssh co246a-1
  ```
- Edit the Run Script: The `main.py` script can perform several different analyses. Open `run-spark.sh` and set the `--experiment` flag at the bottom of the file to choose which one to run (a filled-in example appears after the list of choices below).

  ```bash
  # In run-spark.sh, find the spark-submit command:
  spark-submit \
      ... main.py \
      --dataset_directory /user/$USER/input/data \
      --experiment <CHOOSE_ONE>
  ```
  Your choices for `<CHOOSE_ONE>` are:

  - `class_weight`: Compares model performance with and without class weighting.
  - `scaling`: Compares model performance with and without feature scaling.
  - `pca`: Runs the analysis using PCA for dimensionality reduction.
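  For reference, a filled-in command might look like the sketch below. The YARN and resource flags (`--master`, `--deploy-mode`, `--num-executors`, `--executor-memory`) are illustrative assumptions, not values taken from this repository; only `main.py`, `--dataset_directory`, and `--experiment` come from the project.

  ```bash
  # Illustrative sketch only -- adjust resource settings to your cluster.
  spark-submit \
      --master yarn \
      --deploy-mode client \
      --num-executors 4 \
      --executor-memory 4G \
      main.py \
      --dataset_directory /user/$USER/input/data \
      --experiment class_weight   # or: scaling, pca
  ```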
- Execute the Job: Navigate to your project directory on the cluster and run the script.

  ```bash
  cd /path/to/your/project/
  ./run-spark.sh
  ```
The script will handle uploading the data to HDFS and submitting the selected experiment to the Spark cluster.
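For orientation, the HDFS staging that `run-spark.sh` performs typically amounts to something like the commands below. The exact commands inside the script are an assumption; only the HDFS path `/user/$USER/input/data` comes from the spark-submit call above.

```bash
# Illustrative sketch (assumption): how the data is typically staged in HDFS.
# Check run-spark.sh for the commands it actually runs.
hdfs dfs -mkdir -p /user/$USER/input/data
hdfs dfs -put -f data/wildfire_processed_no_leakage.csv /user/$USER/input/data/
hdfs dfs -ls /user/$USER/input/data    # confirm the file is in place
```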