capybara

About

tl;dr: If you have a 2-4GB dataset and you need to estimate a (generalized) linear model with a large number of fixed effects, this package is for you. It works with larger datasets as well and facilitates computing clustered standard errors.

‘capybara’ is a fast, small-footprint package that provides efficient functions for demeaning variables before GLM estimation. This technique is particularly useful when estimating linear models with multiple group fixed effects. It is a fork of the excellent Alpaca package created and maintained by Dr. Amrei Stammann. The software can estimate Exponential Family models (e.g., Poisson) and Negative Binomial models.

Traditional QR estimation can be infeasible because of its additional memory requirements. The method, which is based on Halperin's (1962) vector projections, offers important time and memory savings without compromising numerical stability in the estimation process.
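
The projection idea can be illustrated in a few lines of base R. This is only a sketch under stated assumptions, not capybara's C++ implementation: the helpers `demean_by()` and `demean_alternating()` and the simulated `firm` and `year` factors are hypothetical names made up for this example. Each variable is demeaned within one grouping factor at a time, cycling over the factors until it stops changing, and the slope from the demeaned regression is then compared with the slope from a regression that includes every dummy variable.

```r
# Hypothetical helper: remove group means for one grouping factor (one projection).
demean_by <- function(x, g) x - ave(x, g)

# Hypothetical helper: cycle the projections over all grouping factors until
# the largest change in the vector falls below the tolerance.
demean_alternating <- function(x, groups, tol = 1e-10, max_iter = 10000L) {
  for (iter in seq_len(max_iter)) {
    x_old <- x
    for (g in groups) x <- demean_by(x, g)
    if (max(abs(x - x_old)) < tol) break
  }
  x
}

# Simulated unbalanced panel with two fixed effects (illustrative data only).
set.seed(42)
n <- 500
firm <- factor(sample(20, n, replace = TRUE))
year <- factor(sample(10, n, replace = TRUE))
x <- rnorm(n) + as.integer(firm) / 10
y <- 2 * x + as.integer(firm) / 5 - as.integer(year) / 3 + rnorm(n)

groups <- list(firm, year)
y_tilde <- demean_alternating(y, groups)
x_tilde <- demean_alternating(x, groups)

# Slope from the demeaned data versus slope from the full dummy regression.
beta_demeaned <- unname(coef(lm(y_tilde ~ x_tilde - 1)))
beta_dummies <- unname(coef(lm(y ~ x + firm + year))["x"])

all.equal(beta_demeaned, beta_dummies, tolerance = 1e-6)
```

The demeaned regression never builds the dummy-variable design matrix, which is where the memory savings come from when the number of fixed effects is large.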

The software borrows heavily from the work of Gaure (2013) and Stammann (2018) on OLS and GLM estimation with large fixed effects, implemented in the ‘lfe’ and ‘alpaca’ packages. The difference is that ‘capybara’ uses neither C nor Rcpp code; instead, it uses cpp11 and cpp11armadillo.

The summary tables borrow from Stata outputs. I have also provided integrations with ‘broom’ to facilitate the inclusion of statistical tables in Quarto/Jupyter notebooks.

If this software is useful to you, please consider donating on Buy Me A Coffee. All donations will be used to continue improving capybara.

Installation

You can install the development version of capybara like so:

remotes::install_github("pachadotdev/capybara")

Examples

See the documentation: https://pacha.dev/capybara/.

Here is a simple example of estimating a linear model and a Poisson model with fixed effects:

library(capybara)

# Linear and Poisson models with cyl as a fixed effect
m1 <- felm(mpg ~ wt | cyl, mtcars)
m2 <- fepoisson(mpg ~ wt | cyl, mtcars)
summary_table(m1, m2, model_names = c("Linear", "Poisson"))

|     Variable     |       Linear        |      Poisson      |
|------------------|---------------------|-------------------|
| wt               |           -3.206*** |           -0.180* |
|                  |             (0.295) |           (0.072) |
|                  |                     |                   |
| Fixed effects    |                     |                   |
| cyl              |                 Yes |               Yes |
|                  |                     |                   |
| N                |                  32 |                32 |
| R-squared        |               0.837 |             0.616 |

Standard errors in parentheses
Significance levels: *** p < 0.001; ** p < 0.01; * p < 0.05; . p < 0.1
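
As a cross-check that needs only base R, the same models can be written with explicit dummy variables for `cyl`. This is a sketch, not part of capybara: `fit_lm` and `fit_pois` are names chosen for this example. The point estimates for `wt` should agree with the fixed effects fits, while the standard errors need not match because they depend on the variance estimator used.

```r
# Dummy-variable versions of the two fixed effects models (base R only).
fit_lm <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# mpg is not an integer count, so glm() warns with the Poisson family;
# the coefficient estimates are still computed.
fit_pois <- suppressWarnings(
  glm(mpg ~ wt + factor(cyl), family = poisson(), data = mtcars)
)

# Coefficient on wt in each model, rounded as in the summary table.
round(coef(fit_lm)[["wt"]], 3)
round(coef(fit_pois)[["wt"]], 3)
```

With only three levels of `cyl` the dummy-variable approach is cheap; the demeaning approach pays off when the fixed effects have thousands of levels.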

Design choices

Capybara is full of trade-offs. I have used ‘data.table’ to benefit from in-place modifications. The model fitting is done on the C++ side. While the code aims to be fast, I prefer to accept some bottlenecks rather than sacrifice numerical stability or reinvent the wheel. Armadillo works well for the data sizes and models that I use in my research. The guiding principle was: “He who gives up code safety for code speed deserves neither.” (Wickham, 2014).

Benchmarks

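Median time and memory footprint for the different models in the book An Advanced Guide to Trade Policy Analysis. The number after each measurement is its rank among the four packages for that model (1 = fastest or smallest memory footprint).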

| Model             | Package  | Median Time - Rank | Memory - Rank |
|-------------------|----------|--------------------|---------------|
| PPML              | Alpaca   | 720.07 ms - 3      | 302.64 MB - 3 |
| PPML              | Base R   | 41.72 s - 4        | 2.73 GB - 4   |
| PPML              | Capybara | 405.89 ms - 2      | 19.22 MB - 1  |
| PPML              | Fixest   | 130.1 ms - 1       | 44.59 MB - 2  |
| Trade Diversion   | Alpaca   | 3.79 s - 3         | 339.79 MB - 3 |
| Trade Diversion   | Base R   | 39.84 s - 4        | 2.6 GB - 4    |
| Trade Diversion   | Capybara | 947.96 ms - 2      | 26.22 MB - 1  |
| Trade Diversion   | Fixest   | 932.78 ms - 1      | 36.59 MB - 2  |
| Endogeneity       | Alpaca   | 2.65 s - 3         | 306.27 MB - 3 |
| Endogeneity       | Base R   | 10.7 m - 4         | 11.94 GB - 4  |
| Endogeneity       | Capybara | 1.32 s - 2         | 15.55 MB - 1  |
| Endogeneity       | Fixest   | 225.64 ms - 1      | 28.08 MB - 2  |
| Reverse Causality | Alpaca   | 3.36 s - 3         | 335.61 MB - 3 |
| Reverse Causality | Base R   | 10.69 m - 4        | 11.94 GB - 4  |
| Reverse Causality | Capybara | 1.36 s - 2         | 17.73 MB - 1  |
| Reverse Causality | Fixest   | 296.63 ms - 1      | 32.43 MB - 2  |
| Phasing Effects   | Alpaca   | 4.6 s - 3          | 393.86 MB - 3 |
| Phasing Effects   | Base R   | 10.75 m - 4        | 11.95 GB - 4  |
| Phasing Effects   | Capybara | 1.57 s - 2         | 22.08 MB - 1  |
| Phasing Effects   | Fixest   | 471.1 ms - 1       | 41.12 MB - 2  |
| Globalization     | Alpaca   | 8.2 s - 3          | 539.49 MB - 3 |
| Globalization     | Base R   | 10.79 m - 4        | 11.97 GB - 4  |
| Globalization     | Capybara | 2.07 s - 2         | 32.98 MB - 1  |
| Globalization     | Fixest   | 869.62 ms - 1      | 62.87 MB - 2  |

Changing the number of cores

You can use Sys.setenv(CAPYBARA_NCORES = 4) (or any other positive integer) to change the number of cores that capybara uses. Here is an example of how it affects performance:

| Cores | PPML  | Trade Diversion |
|-------|-------|-----------------|
| 2     | 1.8 s | 16.2 s          |
| 4     | 1.5 s | 14.0 s          |
| 6     | 0.8 s | 2.4 s           |
| 8     | 0.4 s | 0.9 s           |
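
Rather than hard-coding a number, the value can be derived from the machine. The sketch below uses `parallel::detectCores()` from base R and sets `CAPYBARA_NCORES` to half of the detected cores. The "half of the cores" rule is only an assumption for illustration, not a recommendation from the package, and setting the variable before fitting any model is the safest guess about when it is read.

```r
# detectCores() can return NA on some platforms, so fall back to one core.
available <- parallel::detectCores()
if (is.na(available)) available <- 1L

# Illustrative rule (an assumption): use half of the cores, but at least one.
n_cores <- max(1L, available %/% 2L)

# Set the environment variable, then read it back to confirm the value.
Sys.setenv(CAPYBARA_NCORES = n_cores)
Sys.getenv("CAPYBARA_NCORES")
```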

Installing with compiler optimizations

CRAN packages are built with the -O2 compiler flag, which is sufficient for most packages, including capybara. However, if you want to use the maximum compiler optimizations, you can do so by setting the -O3 compiler flag.

To do that, create a user Makevars file in your home directory (~/.R/Makevars) and add the following lines:

# Copy to ~/.R/Makevars if you want to override R's default optimization
CXXFLAGS = -O3
CXX11FLAGS = -O3
CXX14FLAGS = -O3
CXX17FLAGS = -O3
CXX20FLAGS = -O3
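
The same file can also be created from R. The helper below, `write_o3_makevars()`, is a hypothetical function written for this example and is not part of capybara. It writes the five flags shown above and refuses to overwrite an existing Makevars, since that file may already hold settings for other packages. The demonstration writes to a temporary directory so that nothing in the home directory is touched.

```r
# Hypothetical helper: write the -O3 flags to a Makevars file,
# stopping if the file already exists.
write_o3_makevars <- function(path = file.path("~", ".R", "Makevars")) {
  path <- path.expand(path)
  if (file.exists(path)) {
    stop("Makevars already exists at ", path, "; edit it by hand instead.")
  }
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  flags <- c("CXXFLAGS", "CXX11FLAGS", "CXX14FLAGS", "CXX17FLAGS", "CXX20FLAGS")
  writeLines(paste(flags, "= -O3"), path)
  invisible(path)
}

# Dry run in a fresh temporary directory. Calling write_o3_makevars()
# with no arguments would target ~/.R/Makevars instead.
demo_path <- file.path(tempfile("makevars-demo-"), "Makevars")
write_o3_makevars(demo_path)
readLines(demo_path)
```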

Additional optimizations can be enabled by setting the CAPYBARA_PORTABLE environment variable to "no" before installing the package. This will enable hardware-specific compiler flags that can significantly improve performance (sometimes 2-4x faster than just using portable flags).

Sys.setenv(CAPYBARA_OPTIMIZATIONS = "yes")

# CRAN version
install.packages("capybara", type = "source")

# Local version
install.packages(".", repos = NULL, type = "source")
# or
devtools::install()

The installation will then determine whether your hardware supports the hardware-specific compiler flags.

Code of Conduct

Please note that the capybara project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Acknowledgements

Thanks a lot to Prof. Yoto Yotov for reviewing the summary functions.
