Pump it Up: Data Mining the Water Table

This project predicts the functionality of water pumps across Tanzania using Scikit-Learn and Pandas. It follows standard ML pipeline practice: data inspection and cleaning, model training, and hyperparameter tuning with Optuna. The project uses the following ML models:

Logistic Regression
Random Forest Classifier
Gradient Boosting Classifier
Hist Gradient Boosting Classifier
MLPClassifier

The project achieved an accuracy of 82.36% with the tuned RandomForestClassifier, ranking in the top 10% of worldwide submissions for the competition.

Instructions

Data exploration is conducted in the data_exploration.py file
Data cleaning is conducted in the cleaner.py file
Model training is conducted in the part1.py file
Hyperparameter tuning is conducted in the hpo.py file
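
As a rough illustration of what the model-training step could look like, the sketch below combines a numerical scaler, a categorical encoder, and a classifier in a single scikit-learn pipeline. It is not taken from part1.py: the build_pipeline helper, the column names, the toy rows, and the estimator settings are all assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_pipeline(numeric_cols, categorical_cols):
    """Hypothetical helper: scale numeric columns, one-hot encode
    categorical columns, then fit a random forest on the result."""
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        # handle_unknown="ignore" avoids errors on categories unseen in training
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        # Settings are placeholders, not the tuned values from this project
        ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
    ])


# Invented toy rows; column names and values are illustrative only
X = pd.DataFrame({
    "amount_tsh": [0.0, 50.0, 0.0, 500.0, 20.0, 0.0],
    "gps_height": [1390, 1399, 686, 263, 0, 0],
    "basin": ["A", "B", "C", "D", "B", "C"],
})
y = pd.Series([
    "functional", "functional", "non functional",
    "non functional", "functional", "non functional",
])

pipeline = build_pipeline(["amount_tsh", "gps_height"], ["basin"])
pipeline.fit(X, y)
predictions = pipeline.predict(X)
print(predictions)
```

Keeping preprocessing inside the pipeline means the scaler and encoder are fitted only on training data, which helps avoid leakage when the same pipeline is later cross-validated or tuned.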

To run the script, navigate to the root directory and run the following command:

python3 part1.py <train-input-file> <train-labels-file> <test-input-file> <numerical-preprocessing> <categorical-preprocessing> <model-type> <test-prediction-output-file>

where:

<train-input-file>, <train-labels-file>, <test-input-file> are the paths to the .csv data files provided by the competition.
<numerical-preprocessing> represents the type of scaling method for numerical features. Valid values include: None and StandardScaler.
<categorical-preprocessing> represents the type of encoding for categorical features. Valid values include: OneHotEncoder, OrdinalEncoder, and TargetEncoder.
<model-type> represents the model type. Valid values include: LogisticRegression, RandomForestClassifier, GradientBoostingClassifier, HistGradientBoostingClassifier and MLPClassifier.
<test-prediction-output-file> is the path of the file to which the predictions on the competition's test dataset are written. The file must follow the .csv submission format required by the competition.
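
The seven positional arguments described above could be validated with argparse along the following lines. This is a minimal sketch, not the parsing code in part1.py: the build_parser function, the argument destination names, and the example file names are assumptions; only the lists of valid values come from this README.

```python
import argparse

# Valid values as listed in this README
NUMERICAL_CHOICES = ["None", "StandardScaler"]
CATEGORICAL_CHOICES = ["OneHotEncoder", "OrdinalEncoder", "TargetEncoder"]
MODEL_CHOICES = [
    "LogisticRegression",
    "RandomForestClassifier",
    "GradientBoostingClassifier",
    "HistGradientBoostingClassifier",
    "MLPClassifier",
]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the command-line usage shown above."""
    parser = argparse.ArgumentParser(prog="part1.py")
    parser.add_argument("train_input_file")
    parser.add_argument("train_labels_file")
    parser.add_argument("test_input_file")
    # choices= makes argparse reject any value outside the documented set
    parser.add_argument("numerical_preprocessing", choices=NUMERICAL_CHOICES)
    parser.add_argument("categorical_preprocessing", choices=CATEGORICAL_CHOICES)
    parser.add_argument("model_type", choices=MODEL_CHOICES)
    parser.add_argument("test_prediction_output_file")
    return parser


# Example invocation with placeholder file names (assumed, not from the repo)
args = build_parser().parse_args([
    "train_values.csv", "train_labels.csv", "test_values.csv",
    "StandardScaler", "OneHotEncoder", "RandomForestClassifier",
    "predictions.csv",
])
print(args.model_type, args.categorical_preprocessing)
```

With this approach an unsupported value, such as an unknown model name, makes the script exit with a usage message before any data is loaded.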

nsengupta5/Pump-It-Up-Challenge
