CangyuanLi/rapidstats

rapidstats:

Documentation

What is it?

rapidstats is a minimal library that implements fast statistical routines in Rust and Polars. While similar in spirit, it does not aim to be a complete re-implementation of libraries like scikit-learn or scipy. Only functions that can be significantly faster (e.g. a bootstrap class that offers optimized Rust kernels for metrics such as ROC-AUC) or significantly more ergonomic (e.g. dataframe-first encoders and scalers) are added.

This library is in an alpha state. Although all functions are tested against existing libraries, use at your own risk. The API is subject to change very frequently.

Usage:

Dependencies

rapidstats has a minimal set of dependencies. It only depends on polars, narwhals (for dataframe compatibility), and tqdm (for progress bars). You may install pyarrow (pip install rapidstats[pyarrow]) to allow functions to take numpy arrays, pandas objects, and other objects that may be converted through Arrow.

Installing

The easiest way to install rapidstats is from PyPI using pip:

pip install rapidstats

Performance

rapidstats is very fast. For example, say you want the confusion matrix metrics for a 50,000-row dataset. You aren't sure which threshold you want yet, so you decide to compute the metrics at multiple thresholds, say 500. With sklearn, this takes 40 seconds. With rapidstats, it takes just 0.2 seconds, a 198x speedup! Furthermore, rapidstats can use a cumulative sum algorithm that computes the metrics at all possible thresholds, not just these particular 500, so finding the metrics for 500 or 50,000 thresholds takes the same amount of time. Even just looping the rapidstats version is a 58x speedup, because rapidstats applies several optimizations, such as computing the basic confusion matrix (TP, FP, FN, TN) with a bincount trick and reusing that matrix across metrics instead of recomputing it for each one.
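To make the two ideas above concrete, here is a minimal NumPy sketch. It is not the rapidstats implementation or API: the function names `confusion_counts` and `counts_at_all_thresholds` are hypothetical, labels are assumed to be 0/1, inputs are assumed non-empty, and a row is assumed to be predicted positive when its score is greater than or equal to the threshold.

```python
import numpy as np


def confusion_counts(y_true, y_pred):
    """Return (TN, FP, FN, TP) for 0/1 labels using one bincount pass."""
    y_true = np.asarray(y_true, dtype=np.int64)
    y_pred = np.asarray(y_pred, dtype=np.int64)
    # Encode each (true, pred) pair as an integer in {0, 1, 2, 3}:
    # 0 = TN, 1 = FP, 2 = FN, 3 = TP.
    codes = 2 * y_true + y_pred
    tn, fp, fn, tp = np.bincount(codes, minlength=4)
    return int(tn), int(fp), int(fn), int(tp)


def counts_at_all_thresholds(y_true, y_score):
    """Return (thresholds, TP, FP, FN, TN) at every distinct score.

    Assumes the positive class is predicted when score >= threshold.
    """
    y_true = np.asarray(y_true, dtype=np.int64)
    y_score = np.asarray(y_score, dtype=np.float64)

    # Sort by descending score so a running sum counts everything that
    # would be predicted positive at each candidate threshold.
    order = np.argsort(-y_score, kind="stable")
    y_sorted = y_true[order]
    s_sorted = y_score[order]

    cum_tp = np.cumsum(y_sorted)
    cum_fp = np.cumsum(1 - y_sorted)

    # For tied scores, keep only the last row of each run of equal values.
    last = np.r_[np.nonzero(np.diff(s_sorted))[0], len(s_sorted) - 1]

    tp = cum_tp[last]
    fp = cum_fp[last]
    fn = cum_tp[-1] - tp  # positives not yet reached
    tn = cum_fp[-1] - fp  # negatives not yet reached
    return s_sorted[last], tp, fp, fn, tn
```

After a single sort, the cost of the second function no longer depends on how many thresholds are requested, which is why 500 or 50,000 thresholds can take the same time. Any metric (precision, recall, and so on) can then be derived from the four count arrays without rebuilding the matrix.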

Similarly, calculating the bootstrapped (100 iterations) ROC-AUC of a 25,000-sample dataset takes only 0.15 seconds, compared to 0.83 seconds for the equivalent sklearn + scipy operation, a speedup of 5.3x.
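For readers unfamiliar with the operation being timed, the following is a plain NumPy sketch of a bootstrapped ROC-AUC. It illustrates the general technique only, not the rapidstats Rust kernel or its API: `roc_auc` and `bootstrap_roc_auc` are hypothetical names, the AUC is computed with the rank-sum (Mann-Whitney U) formulation, and the interval is a simple percentile interval, which may differ from the method rapidstats uses.

```python
import numpy as np


def roc_auc(y_true, y_score):
    """ROC-AUC via the rank-sum (Mann-Whitney U) formulation, ties averaged."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=np.float64)
    n_pos = int(y_true.sum())
    n_neg = y_true.size - n_pos
    if n_pos == 0 or n_neg == 0:
        return float("nan")  # AUC is undefined with a single class

    # Average 1-based ranks: a tied group ending at rank e with c members
    # gets rank e - (c - 1) / 2.
    _, inverse, counts = np.unique(
        y_score, return_inverse=True, return_counts=True
    )
    group_end = np.cumsum(counts)
    ranks = (group_end - (counts - 1) / 2.0)[inverse]

    u = ranks[y_true].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u / (n_pos * n_neg))


def bootstrap_roc_auc(y_true, y_score, iterations=100, confidence=0.95, seed=0):
    """Return (lower, mean, upper) from a percentile bootstrap of ROC-AUC."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    rng = np.random.default_rng(seed)
    n = y_true.size

    stats = np.empty(iterations)
    for i in range(iterations):
        # Resample rows with replacement, keeping labels and scores paired.
        idx = rng.integers(0, n, size=n)
        stats[i] = roc_auc(y_true[idx], y_score[idx])

    alpha = (1.0 - confidence) / 2.0
    lower, upper = np.nanquantile(stats, [alpha, 1.0 - alpha])
    return float(lower), float(np.nanmean(stats)), float(upper)
```

The Python loop over resamples is the part a compiled kernel can speed up, since each iteration repeats the same ranking work on a fresh sample.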
