Datablok is a collection of experiments in novel and high-performance applications of the Rust database building blocks (Apache DataFusion, Arrow & Parquet).
- parquet-nested-parallel - Writing 1 billion nested records to Parquet with a per-core throughput of ~1.3 million records per second, using a multi-stage parallel pipeline.
- tantivy-byte-array-index - Embedding arbitrary data in Parquet and exploiting it to improve DataFusion query performance. In this instance we embed a Tantivy full-text index to accelerate LIKE queries.
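
To give a feel for what a multi-stage parallel pipeline looks like, here is a minimal sketch that uses only the Rust standard library. It is not the code from parquet-nested-parallel. The names `Record`, `EncodedBatch`, `generate_batch`, `encode_batch` and `run_pipeline` are hypothetical, the record shape is invented for illustration, and the final stage only counts what it receives instead of writing Parquet. The encoding step flattens nested lists into an offsets vector plus a values vector, loosely modelled on how columnar formats represent lists.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical nested record: an id plus a variable-length list of tags.
#[derive(Debug, Clone, PartialEq)]
pub struct Record {
    pub id: u64,
    pub tags: Vec<u32>,
}

/// Hypothetical columnar form of a batch: record `i` owns
/// `values[offsets[i]..offsets[i + 1]]`.
pub struct EncodedBatch {
    pub ids: Vec<u64>,
    pub offsets: Vec<usize>,
    pub values: Vec<u32>,
}

/// Stage 1 helper: deterministically generate one batch of nested records.
fn generate_batch(start: u64, len: u64) -> Vec<Record> {
    (start..start + len)
        .map(|id| Record { id, tags: (0..(id % 4) as u32).collect() })
        .collect()
}

/// Stage 2 helper: flatten the nested lists into offsets and values.
fn encode_batch(batch: &[Record]) -> EncodedBatch {
    let mut ids = Vec::with_capacity(batch.len());
    let mut offsets = Vec::with_capacity(batch.len() + 1);
    let mut values = Vec::new();
    offsets.push(0);
    for record in batch {
        ids.push(record.id);
        values.extend_from_slice(&record.tags);
        offsets.push(values.len());
    }
    EncodedBatch { ids, offsets, values }
}

/// Runs generate -> encode (parallel) -> "write" and returns
/// (records seen by the writer, nested values seen by the writer).
pub fn run_pipeline(total: u64, batch_size: u64, workers: usize) -> (u64, u64) {
    assert!(batch_size > 0 && workers > 0);
    // Bounded channels provide back-pressure between stages.
    let (raw_tx, raw_rx) = mpsc::sync_channel::<Vec<Record>>(workers * 2);
    let (enc_tx, enc_rx) = mpsc::sync_channel::<EncodedBatch>(workers * 2);
    let raw_rx = Arc::new(Mutex::new(raw_rx));

    // Stage 1: a single generator thread produces raw batches.
    let generator = thread::spawn(move || {
        let mut start = 0;
        while start < total {
            let len = batch_size.min(total - start);
            if raw_tx.send(generate_batch(start, len)).is_err() {
                break;
            }
            start += len;
        }
        // `raw_tx` is dropped here, which closes the first channel.
    });

    // Stage 2: several encoder threads share the receiving end.
    let encoders: Vec<_> = (0..workers)
        .map(|_| {
            let rx = Arc::clone(&raw_rx);
            let tx = enc_tx.clone();
            thread::spawn(move || loop {
                // The lock is released at the end of this statement,
                // so encoding itself runs without holding it.
                let batch = match rx.lock().unwrap().recv() {
                    Ok(batch) => batch,
                    Err(_) => break, // generator finished
                };
                if tx.send(encode_batch(&batch)).is_err() {
                    break;
                }
            })
        })
        .collect();
    drop(enc_tx); // only the encoder clones keep the second channel open

    // Stage 3: the calling thread plays the writer and tallies the batches.
    let (mut records, mut values) = (0u64, 0u64);
    for encoded in enc_rx {
        records += encoded.ids.len() as u64;
        values += encoded.values.len() as u64;
    }

    generator.join().unwrap();
    for encoder in encoders {
        encoder.join().unwrap();
    }
    (records, values)
}
```

Calling `run_pipeline(1000, 64, 4)` pushes 1000 records through four encoder threads in batches of 64. Batches may reach the writer out of order, which is why the sketch only aggregates counts; a real writer would need to decide whether ordering matters.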
- Test the performance limits of single-node data processing.
- Explore novel ways of composing database building blocks.
- Find and contribute improvements to the underlying libraries.
To run a specific experiment, use the -p or --package flag for cargo from the root of the repository.
For example, to run the hello-datafusion doodle:
cargo run -p hello-datafusion
The verify.sh script mirrors the CI pipeline. Running it before pushing code changes is good practice, because catching errors locally is much faster than waiting for the CI pipeline to discover them.
# Run all checks on the 'hello-datafusion' package
./scripts/verify.sh hello-datafusion
# For a more detailed output, use the --verbose flag
./scripts/verify.sh --verbose hello-datafusion