Pronounced like fiasco, but with a t instead of a c
(F)ormulas (I)n (AST) (O)ut
A Language-Agnostic modern Wilkinson's formula parser and lexer.
This library is in test and actively changing.
Formula parsing and materialization are normally done in a single library.
Python, for example, has patsy, formulaic, and formulae, which all do parsing and materialization.
R's model.matrix also handles formula parsing and design matrix creation.
There is nothing wrong with this coupling. I wanted to try decoupling the parsing and materialization.
I thought this would allow a focused library that could be used in multiple languages or dataframe libraries.
This package has a clear purpose: to parse and/or lex formulas and return structured JSON metadata.
Note: Technically an AST is not returned. A simplified, structured intermediate representation (IR) in the form of JSON is returned. This JSON IR should be easy for many language bindings to use.
The library exposes a clean, focused API:
- parse_formula(): Takes a Wilkinson's formula string and returns structured JSON metadata
- lex_formula(): Tokenizes a formula string and returns JSON describing each token

"Only two functions?! What kind of library is this?!" An easy to maintain library with a small surface area. The best kind.
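To make "returns JSON describing each token" concrete, here is a toy lexer written in Python. It is only a sketch of the idea and is not fiasto's lexer: the function name `toy_lex`, the token kinds, and the JSON field names (`kind`, `text`, `start`, `end`) are all assumptions made for illustration, and the real output of lex_formula() may look different.

```python
import json
import re

# Hypothetical token kinds for illustration; fiasto's real token names may differ.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT", r"[A-Za-z_][A-Za-z0-9_.]*"),
    ("TILDE", r"~"),
    ("PLUS", r"\+"),
    ("MINUS", r"-"),
    ("STAR", r"\*"),
    ("COLON", r":"),
    ("CARET", r"\^"),
    ("PIPE", r"\|"),
    ("COMMA", r","),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def toy_lex(formula: str) -> str:
    """Return a JSON array with one object per token in `formula`."""
    tokens = []
    pos = 0
    while pos < len(formula):
        match = TOKEN_RE.match(formula, pos)
        if match is None:
            raise ValueError(
                f"unexpected character {formula[pos]!r} at position {pos}"
            )
        # Whitespace is consumed but not reported as a token.
        if match.lastgroup != "SKIP":
            tokens.append(
                {
                    "kind": match.lastgroup,
                    "text": match.group(),
                    "start": match.start(),
                    "end": match.end(),
                }
            )
        pos = match.end()
    return json.dumps(tokens)


if __name__ == "__main__":
    print(toy_lex("y ~ x1*x2 + (1|g)"))
```

Because the output is plain JSON, any language with a JSON parser can consume the token stream without knowing how the lexer is implemented.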
The parser returns a variable-centric JSON structure where each variable is described with its roles, transformations, interactions, and random effects. This makes it easy to understand the complete model structure and generate appropriate design matrices. wayne is a Python package that can take this JSON and generate design matrices for use in statistical modeling.
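As a rough picture of what "variable-centric" means for a consumer, the sketch below walks a small hand-written JSON document and groups variables by role. The schema here (`columns`, `roles`, `transformations`, and the role names) is an assumption made up for this example; check the actual parse_formula() output for the real field names.

```python
import json

# Hypothetical IR for "y ~ x + poly(z, 2)". Field and role names are
# illustrative assumptions, not fiasto's documented schema.
EXAMPLE_IR = json.loads(
    """
{
  "formula": "y ~ x + poly(z, 2)",
  "columns": {
    "y": {"roles": ["Response"], "transformations": []},
    "x": {"roles": ["FixedEffect"], "transformations": []},
    "z": {
      "roles": ["FixedEffect"],
      "transformations": [{"function": "poly", "degree": 2}]
    }
  }
}
"""
)


def variables_by_role(ir: dict) -> dict:
    """Group variable names by each role they play in the model."""
    grouped = {}
    for name, info in ir["columns"].items():
        for role in info["roles"]:
            grouped.setdefault(role, []).append(name)
    return grouped


if __name__ == "__main__":
    print(variables_by_role(EXAMPLE_IR))
```

A materializer could start from a grouping like this to decide which columns become the response and which feed the design matrix.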
- Comprehensive Formula Support: Full R/Wilkinson notation including complex random effects and intercept-only models
- Variable-Centric Output: Variables are first-class citizens with detailed metadata
- Advanced Random Effects: brms-style syntax with correlation control and grouping options
- Intercept-Only Models: Full support for y ~ 1 and y ~ 0 formulas with proper metadata generation
- Multivariate Models: Full support for bind(y1, y2) ~ x formulas with multiple response variables
- Pretty Error Messages: Colored, contextual error reporting with syntax highlighting
- Robust Error Recovery: Graceful handling of malformed formulas with specific error types
- Language Agnostic Output: JSON format for easy integration with various programming languages
- Comprehensive Documentation: Detailed usage examples and grammar rules
- Comprehensive Metadata: Variable roles, transformations, interactions, and relationships
- Automatic Naming For Generated Columns: Consistent, descriptive names for transformed and interaction terms
- Dual API: Both parsing and lexing functions for flexibility
- Efficient tokenization: using one of the fastest lexer generators for Rust (logos crate)
- Fast pattern matching: using match statements and enum-based token handling. Rust match statements are zero-cost abstractions.
- Minimal string copying: with extensive use of string slices (&str) where possible
- Formula Validation: Check if formulas are valid against datasets before expensive computation
- Cross-Platform Model Specs: Define models once, implement in multiple statistical frameworks
- Intercept-Only Models: Support for null models like y ~ 1 and y ~ 0 for baseline comparisons
- Multivariate Models: Support for multiple response variables like bind(y1, y2) ~ x for joint modeling
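The formula-validation use case can be sketched in a few lines. This is an illustration only and not part of fiasto: `missing_variables` is a hypothetical helper, and it assumes the variable names have already been read out of the parser's JSON output.

```python
def missing_variables(formula_variables, dataset_columns):
    """Return the formula variables that are absent from the dataset.

    `formula_variables` is assumed to be the list of variable names taken
    from the parser's JSON output; `dataset_columns` is any iterable of
    column names (for example, the columns of a dataframe).
    """
    available = set(dataset_columns)
    # Preserve the order in which the variables appear in the formula.
    return [name for name in formula_variables if name not in available]


if __name__ == "__main__":
    # Variables of "y ~ x1 * x2 + (1 | g2)" checked against a small dataset.
    missing = missing_variables(["y", "x1", "x2", "g2"], ["y", "x1", "x2", "z"])
    if missing:
        print(f"formula refers to columns not in the dataset: {missing}")
```

Running a check like this before building a design matrix catches misspelled column names early, before any expensive model fitting starts.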
I can't think of every kind of formula that could be parsed. I do have a checklist to start with.
To my knowledge, the brms formula syntax is the most complex and possibly the most complete.
I would like to start with this as a baseline and then continue to extend as needed.
I also offer a clean_name for each parameter. This will allow a materializer to use a simpler name for the parameter.
Polynomials, for example, would result in names like x1_poly_1 or x1_poly_2 as opposed to [s]^2. I keep clean_names in snake case.
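The naming pattern in the polynomial example can be reproduced with a short helper. This is a sketch inferred only from the two names shown above (x1_poly_1, x1_poly_2); `poly_clean_names` is a hypothetical function, and the library's actual naming rules for other transformations are not covered here.

```python
from typing import List


def poly_clean_names(variable: str, degree: int) -> List[str]:
    """Build one clean name per polynomial term, following `<var>_poly_<k>`."""
    if degree < 1:
        raise ValueError("degree must be at least 1")
    return [f"{variable}_poly_{k}" for k in range(1, degree + 1)]


if __name__ == "__main__":
    print(poly_clean_names("x1", 2))
```

Names of this shape are valid identifiers in most languages and dataframe libraries, which is what makes them convenient for a materializer.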
y ~ 1 -> y ~ 1 (null model with intercept)
y ~ 0 -> y ~ 0 (null model without intercept)
bind(y1, y2) ~ x -> bind(y1, y2) ~ x (multivariate response model)
y ~ x1*x2 + s(z) + (1+x1|1) + (1|g2) - 1 -> y ~ x1 * x2 + s(z) + (1 + x1 | 1) + (1 | g2) - 1
y ~ x1*x2 + s(z) + (1+x1|1) + (1|g2), sigma ~ x1 + (1|g2) -> y ~ x1 * x2 + s(z) + (1 + x1 | 1) + (1 | g2) and sigma ~ x1 + (1 | g2)
bf(y ~ a1 - a2^x, a1 + a2 ~ 1, nl = TRUE)
y ~ a1 - a2^x
a1 ~ 1
a2 ~ 1
bf(y ~ a1 - a2^x, a1 ~ 1, a2 ~ x + (x|g), nl = TRUE)
y ~ a1 - a2^x
a1 ~ 1
a2 ~ x + (x | g)
bf(y ~ a1 - a2^x, a1 ~ 1 + (1 |2| g), a2 ~ x + (x |2| g), nl = TRUE)
y ~ a1 - a2^x
a1 ~ 1 + (1 | 2 | g)
a2 ~ x + (x | 2 | g)
bf(y ~ a1 - a2^x, a1 ~ 1 + (1 | gr(g, id = 2)), a2 ~ x + (x | gr(g, id = 2)), nl = TRUE)
y ~ a1 - a2^x
a1 ~ 1 + (1 | gr(g, id = 2))
a2 ~ x + (x | gr(g, id = 2))
mvbind(y1, y2) ~ x * z + (1|g)
y1 ~ x * z + (1 | g)
y2 ~ x * z + (1 | g)
bf(y ~ x * z + (1+x|ID1|g), zi ~ x + (1|ID1|g))
y ~ x * z + (1 + x | ID1 | g)
zi ~ x + (1 | ID1 | g)
bf(y ~ mo(x) + more_predictors)
y ~ mo(x) + more_predictors
bf(y ~ cs(x) + more_predictors)
y ~ cs(x) + more_predictors
bf(y ~ cs(x) + (cs(1)|g))
y ~ cs(x) + (cs(1) | g)
bf(y ~ person + item, disc ~ item)
y ~ person + item
disc ~ item
bf(y ~ me(x, sdx))
y ~ me(x, sdx)
Specify predictors on all parameters of the Wiener diffusion model; the main formula models the drift rate 'delta'.
bf(rt | dec(decision) ~ x, bs ~ x, ndt ~ x, bias ~ x)
rt | dec(decision) ~ x
bs ~ x
ndt ~ x
bias ~ x
bf(rt | dec(decision) ~ x, bias = 0.5)
rt | dec(decision) ~ x
bias = 0.5
mix <- mixture(gaussian, gaussian)
mix <- mixture(gaussian, gaussian)
bf(y ~ 1, mu1 ~ x, mu2 ~ z, family = mix)
y ~ 1
mu1 ~ x
mu2 ~ z
bf(y ~ x, sigma2 = "sigma1", family = mix)
y ~ x
sigma2 = sigma1
bf(y ~ 1) + nlf(sigma ~ a * exp(b * x), a ~ x) + lf(b ~ z + (1|g), dpar = "sigma") + gaussian()
y ~ 1
sigma ~ a * exp(b * x)
a ~ x
b ~ z + (1 | g)
bf(y1 ~ x + (1|g)) + gaussian() + cor_ar(~1|g) + bf(y2 ~ z) + poisson()
y1 ~ x + (1 | g)
autocor ~ arma(time = NA, gr = g, p = 1, q = 0, cov = FALSE)
y2 ~ z
bf(y1 ~ 1 + x + (1|c|obs), sigma = 1) + gaussian()
bf(y2 ~ 1 + x + (1|c|obs)) + poisson()
bf(bmi ~ age * mi(chl)) + bf(chl | mi() ~ age) + set_rescor(FALSE)
bmi ~ age * mi(chl)
chl | mi() ~ age
bf(y ~ eta, nl = TRUE) + lf(eta ~ 1 + x) + nlf(sigma ~ tau * sqrt(eta)) + lf(tau ~ 1)
y ~ eta
eta ~ 1 + x
sigma ~ tau * sqrt(eta)
tau ~ 1
(y1 ~ x + (1|g) + (y2 ~ s(z))
y1 ~ x + (1 | g)
y2 ~ s(z)
y ~ x + (1 | g), fill = "mean"
For detailed documentation, see gr() Function Documentation.
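Two patterns recur in the checklist above: a specification with several comma-separated formulas, and an mvbind(...) response that expands into one formula per response variable. The Python sketch below shows one simple way to handle both. It is not how fiasto implements them; `split_top_level` and `expand_mvbind` are hypothetical helpers, they do no whitespace normalization, and they assume the input has balanced parentheses.

```python
import re
from typing import List


def split_top_level(spec: str) -> List[str]:
    """Split `spec` on commas that are not nested inside parentheses."""
    parts, current, depth = [], [], 0
    for ch in spec:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced ')'")
        if ch == "," and depth == 0:
            # A comma at depth 0 separates two formulas.
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    if depth != 0:
        raise ValueError("unbalanced '('")
    parts.append("".join(current).strip())
    return parts


def expand_mvbind(formula: str) -> List[str]:
    """Turn `mvbind(a, b) ~ rhs` into `a ~ rhs` and `b ~ rhs`."""
    match = re.fullmatch(r"\s*mvbind\((.*?)\)\s*~\s*(.*)", formula)
    if match is None:
        # Not an mvbind formula: return it unchanged.
        return [formula.strip()]
    responses = [name.strip() for name in match.group(1).split(",")]
    rhs = match.group(2).strip()
    return [f"{response} ~ {rhs}" for response in responses]


if __name__ == "__main__":
    print(split_top_level("y ~ a1 - a2^x, a1 ~ 1 + (1 | gr(g, id = 2))"))
    print(expand_mvbind("mvbind(y1, y2) ~ x * z + (1|g)"))
```

Tracking parenthesis depth is what keeps the comma inside gr(g, id = 2) from being treated as a formula separator.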