A Simple Framework for Oversampling Imbalanced Tibbles
This package currently contains only three functions to oversample imbalanced datasets. It was developed for Brett Devine's QMBE 3740 - Data Mining at Hamline University to be a simple wrapper for handling class imbalances without teaching the complexities and nuances of Tidymodels and the Tidyverse.
This package is not on CRAN, so it can be installed via:
devtools::install_github('andrewargeros/simpleoversample')
Alternatively, clone this repo or download the release and install it locally with devtools::install({PATH}).
random_oversample() randomly duplicates minority-class rows from a tibble or data frame, sampling with replacement, and appends them to the original data. The ratio of minority to majority observations can be controlled using the prop parameter.
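To make the idea concrete, here is a minimal base R sketch of random oversampling. It is not the package's implementation: the helper name `oversample_sketch` and its arguments are hypothetical, and it assumes `prop` means the target size of each minority class as a fraction of the largest class.

```r
# Hypothetical helper for illustration only; not part of simpleoversample.
# Assumption: prop = 1 grows every minority class to the majority class size.
oversample_sketch <- function(df, target, prop = 1) {
  counts <- table(df[[target]])
  goal <- ceiling(prop * max(counts))   # desired rows per minority class

  extra <- lapply(names(counts), function(cls) {
    rows <- which(df[[target]] == cls)
    n_add <- goal - length(rows)
    # Skip classes that are already large enough (or empty factor levels)
    if (n_add <= 0 || length(rows) == 0) return(NULL)
    # Draw row indices with replacement, so rows may repeat
    df[rows[sample.int(length(rows), n_add, replace = TRUE)], , drop = FALSE]
  })
  extra <- Filter(Negate(is.null), extra)

  # Append the duplicated rows to the original data
  rbind(df, do.call(rbind, extra))
}

set.seed(42)
toy <- data.frame(x = 1:10, class = factor(c(rep("a", 8), rep("b", 2))))
balanced <- oversample_sketch(toy, "class")
table(balanced$class)
# a b
# 8 8
```

Because the new rows are exact copies, random oversampling adds no new information; it only changes how much weight a model gives the minority class.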
smote() applies the SMOTE (Synthetic Minority Oversampling Technique) algorithm to balance classes using artificially generated data. Note: SMOTE can only handle numeric data, so this function removes any non-numeric predictor columns. To avoid losing them, use something like recipes::step_dummy(all_nominal()) before passing the data to smote().
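The core of SMOTE is interpolation between a minority observation and one of its nearest minority neighbours, which is also why it needs numeric columns. The sketch below shows that step in base R. The helper `smote_sketch` and its arguments are hypothetical and are not the package's smote() function; it handles a single minority class passed as a numeric matrix.

```r
# Hypothetical helper for illustration only; not part of simpleoversample.
# x_min: numeric matrix of minority-class rows; n_new: rows to synthesise.
smote_sketch <- function(x_min, n_new, k = 5) {
  x_min <- as.matrix(x_min)
  n <- nrow(x_min)
  stopifnot(n >= 2)
  k <- min(k, n - 1)                 # cannot have more neighbours than points

  d <- as.matrix(dist(x_min))        # pairwise Euclidean distances
  diag(d) <- Inf                     # a point is not its own neighbour

  synth <- matrix(NA_real_, n_new, ncol(x_min),
                  dimnames = list(NULL, colnames(x_min)))
  for (i in seq_len(n_new)) {
    base <- sample.int(n, 1)                 # random minority point
    nn <- order(d[base, ])[seq_len(k)]       # its k nearest neighbours
    nb <- nn[sample.int(k, 1)]               # pick one neighbour
    gap <- runif(1)                          # position along the segment
    synth[i, ] <- x_min[base, ] + gap * (x_min[nb, ] - x_min[base, ])
  }
  synth
}

set.seed(1)
minority <- cbind(bill  = c(39.1, 40.3, 38.9, 41.2),
                  depth = c(18.7, 18.0, 19.2, 17.6))
new_rows <- smote_sketch(minority, n_new = 6, k = 2)
dim(new_rows)
# 6 2
```

Each synthetic row lies on the line segment between two real minority rows, so it always stays within the range of the observed minority data.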
This function is roughly the same as the SMOTE algorithm, but uses a density distribution to decide where synthetic observations are generated, with the aim of producing more realistic data, as presented in He et al. (2008).
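In He et al. (2008), the approach known as ADASYN, the density distribution decides how many synthetic rows to generate around each minority observation: points with more majority-class neighbours receive more. The sketch below computes only that allocation in base R. The helper `adasyn_allocation_sketch` is hypothetical and is not the package's function, and the uniform fallback when no minority point has a majority neighbour is an assumption of this sketch.

```r
# Hypothetical helper for illustration only; not part of simpleoversample.
# x: numeric data; is_minority: logical vector; n_new: total rows to create.
adasyn_allocation_sketch <- function(x, is_minority, n_new, k = 5) {
  x <- as.matrix(x)
  k <- min(k, nrow(x) - 1)
  d <- as.matrix(dist(x))            # distances over ALL observations
  diag(d) <- Inf

  min_idx <- which(is_minority)
  # r_i: share of majority-class points among the k nearest neighbours
  r <- vapply(min_idx, function(i) {
    nn <- order(d[i, ])[seq_len(k)]
    mean(!is_minority[nn])
  }, numeric(1))

  # Assumption of this sketch: fall back to equal weights if every r_i is 0
  if (sum(r) == 0) r <- rep(1, length(r))

  r_hat <- r / sum(r)                # normalise into a density distribution
  round(r_hat * n_new)               # synthetic rows per minority point
}

x <- c(0, 0.1, 0.2, 5, 5.1, 5.2, 5.3, 5.4, 5.15)
is_minority <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)
alloc <- adasyn_allocation_sketch(x, is_minority, n_new = 10, k = 2)
alloc
# 0 0 0 10
```

Here the three minority points clustered near 0 have only minority neighbours and receive nothing, while the minority point at 5.15, surrounded by majority points, receives all ten synthetic rows. The rows themselves would then be generated by SMOTE-style interpolation.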
These examples use the Palmer Penguins dataset.
library(tidyverse)
pens <- palmerpenguins::penguins
pens %>% count(species) # Baseline: note the imbalance
# A tibble: 3 × 2
# species n
# <fct> <int>
# Adelie 152
# Chinstrap 68
# Gentoo 124
library(simpleoversample)
# Using SMOTE
pens %>%
  drop_na() %>%
  smote('species') %>%
  count(species)
# A tibble: 3 × 2
# species n
# <fct> <int>
# Adelie 146
# Chinstrap 146
# Gentoo 146
# Using Random Oversampling
pens %>%
  random_oversample('species') %>%
  count(species)
# A tibble: 3 × 2
# species n
# <fct> <int>
# Adelie 152
# Chinstrap 152
# Gentoo 152