simpleoversample

A Simple Framework for Oversampling Imbalanced Tibbles

This package currently contains three functions for oversampling imbalanced datasets. It was developed for Brett Devine's QMBE 3740 - Data Mining at Hamline University as a simple wrapper for handling class imbalance without having to teach the complexities and nuances of Tidymodels and the Tidyverse.

Installation

This package is not on CRAN, so it can be installed via:

devtools::install_github('andrewargeros/simpleoversample')

Alternatively, clone this repo or download the release and install it locally with devtools::install({PATH}).

random_oversample()

This function randomly samples minority-class rows from a tibble or data frame with replacement and appends the duplicates to the original data. The proportion of minority:majority observations can be controlled using the prop parameter.
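The idea behind random oversampling can be sketched in a few lines of base R. This is a minimal illustration, not the package's implementation: the helper name random_oversample_sketch and its arguments are hypothetical, and it always balances every class up to the size of the largest one rather than modelling the prop parameter.

```r
# Hypothetical helper for illustration only; not part of simpleoversample.
random_oversample_sketch <- function(data, target, seed = 1) {
  set.seed(seed)
  counts <- table(data[[target]])
  counts <- counts[counts > 0]          # ignore unused factor levels
  n_max <- max(counts)                  # size of the largest class

  extra <- lapply(names(counts), function(cls) {
    idx <- which(data[[target]] == cls)
    n_needed <- n_max - length(idx)
    if (n_needed == 0) return(data[0, , drop = FALSE])
    # Draw row positions with replacement from this class only
    picked <- idx[sample.int(length(idx), n_needed, replace = TRUE)]
    data[picked, , drop = FALSE]
  })

  # Append the duplicated rows to the untouched original data
  rbind(data, do.call(rbind, extra))
}

toy <- data.frame(
  x = 1:10,
  class = factor(c(rep("a", 7), rep("b", 3)))
)
balanced <- random_oversample_sketch(toy, "class")
table(balanced$class)
```

Because the new rows are exact copies, no new information is added; the duplicates only change how much weight the minority class carries during model fitting.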

smote()

This function applies the SMOTE (Synthetic Minority Oversampling Technique) algorithm to balance classes using artificially generated data. Note: SMOTE can only handle numeric data, so this function removes any non-numeric predictor columns. To keep those predictors, encode them first with something like recipes::step_dummy(all_nominal()) before passing the data to smote().
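The core SMOTE step, interpolating between a minority point and one of its nearest minority neighbours, can be sketched as follows. This is an illustrative base R version under stated assumptions, not the code used by smote(): the function name smote_sketch, its arguments, and the toy data are all hypothetical. It also shows why the inputs must be numeric, since the interpolation is plain arithmetic on the feature values.

```r
# Hypothetical sketch of the SMOTE interpolation step; not the package's code.
# X_min: numeric matrix of minority-class rows (needs at least 2 rows).
smote_sketch <- function(X_min, n_new, k = 5, seed = 1) {
  set.seed(seed)
  X_min <- as.matrix(X_min)
  n <- nrow(X_min)
  k <- min(k, n - 1)                    # cannot have more neighbours than points
  d <- as.matrix(dist(X_min))           # pairwise Euclidean distances
  diag(d) <- Inf                        # a point is not its own neighbour

  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(X_min),
                  dimnames = list(NULL, colnames(X_min)))
  for (j in seq_len(n_new)) {
    i  <- sample.int(n, 1)              # pick a minority point at random
    nn <- order(d[i, ])[seq_len(k)]     # its k nearest minority neighbours
    nb <- nn[sample.int(k, 1)]          # pick one neighbour at random
    gap <- runif(1)                     # random position along the segment
    synth[j, ] <- X_min[i, ] + gap * (X_min[nb, ] - X_min[i, ])
  }
  synth
}

minority <- cbind(x = c(1, 2, 3, 4), y = c(1, 3, 2, 4))
new_points <- smote_sketch(minority, n_new = 6, k = 2)
new_points
```

Each synthetic row lies on the line segment between two real minority rows, so it always stays inside the range of the observed minority data.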

adasyn()

This function works much like smote(), but weights each minority observation by a density distribution, so that more synthetic points are generated for observations that are harder to learn, as presented in He et al. (2008).
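The density weighting can be sketched as follows, based on my reading of He et al. (2008): each minority point gets a ratio r_i equal to the share of majority points among its k nearest neighbours, the ratios are normalised to sum to one, and the total number of synthetic points G is split in proportion. The function name adasyn_allocation_sketch, its arguments, and the toy data are hypothetical, and this is not necessarily how adasyn() is implemented.

```r
# Hypothetical sketch of the ADASYN allocation step; not the package's code.
# X: numeric matrix of all rows; is_minority: logical vector, one per row.
adasyn_allocation_sketch <- function(X, is_minority, k = 5, beta = 1) {
  X <- as.matrix(X)
  d <- as.matrix(dist(X))               # distances over the whole dataset
  diag(d) <- Inf
  min_idx <- which(is_minority)

  # G: total synthetic points needed (beta = 1 means fully balanced)
  G <- (sum(!is_minority) - length(min_idx)) * beta

  # r_i: share of majority points among the k nearest neighbours
  r <- vapply(min_idx, function(i) {
    nn <- order(d[i, ])[seq_len(k)]
    mean(!is_minority[nn])
  }, numeric(1))

  # Guard against division by zero when no minority point has a majority neighbour
  if (sum(r) == 0) return(rep(0, length(min_idx)))

  r_hat <- r / sum(r)                   # normalise into a density distribution
  round(r_hat * G)                      # synthetic points per minority row
}

# Toy 1-D data: seven majority points and three minority points
x <- c(0, 1, 2, 3, 4, 5, 6, 2.4, 20, 20.5)
is_minority <- c(rep(FALSE, 7), rep(TRUE, 3))
g <- adasyn_allocation_sketch(matrix(x, ncol = 1), is_minority, k = 2)
g
# 2 1 1
```

The minority point at 2.4 sits in the middle of the majority class, so it receives twice as many synthetic points as the two isolated minority points. The synthetic rows themselves would then be generated by interpolation, as in SMOTE.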

Example

These examples use the Palmer Penguins dataset.

library(tidyverse)
pens = palmerpenguins::penguins
pens %>% count(species) # Baseline: note the imbalance

  #   A tibble: 3 × 2
  #   species	n
  #   <fct>	<int>
  #   Adelie	152
  #   Chinstrap	68
  #   Gentoo	124

library(simpleoversample)
pens %>%
  drop_na() %>%
  smote('species') %>%
  count(species)

  #   A tibble: 3 × 2
  #   species	n
  #   <fct>	<int>
  #   Adelie	146
  #   Chinstrap	146
  #   Gentoo	146  

# Using Random Oversampling

pens %>%
  random_oversample('species') %>%
  count(species)

  #   A tibble: 3 × 2
  #   species	n
  #   <fct>	<int>
  #   Adelie	152
  #   Chinstrap	152
  #   Gentoo	152    
