smallpond


A lightweight data processing framework built on DuckDB and 3FS.

Features

  • 🚀 High-performance data processing powered by DuckDB
  • 🌍 Scalable to handle PB-scale datasets
  • 🛠️ Easy operations with no long-running services
  • 🪣 Direct reading from S3 storage

Installation

Python 3.8 to 3.12 is supported.

pip install smallpond
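
Once installed, you can sanity-check the environment with a short session. This is a minimal sketch; the version lookup uses the Python standard library rather than any smallpond API:

from importlib.metadata import version
import smallpond

# Confirm the package resolves and report the installed version.
print(version("smallpond"))

# Initializing a session needs no long-running services.
sp = smallpond.init()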

Quick Start

# Download example data
wget https://duckdb.org/data/prices.parquet

import smallpond

# Initialize session
sp = smallpond.init()

# Load data
df = sp.read_parquet("prices.parquet")

# Process data
df = df.repartition(3, hash_by="ticker")
df = sp.partial_sql("SELECT ticker, min(price), max(price) FROM {0} GROUP BY ticker", df)

# Save results
df.write_parquet("output/")
# Show results
print(df.to_pandas())
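
Note that partial_sql runs its query independently on each partition. That is why the example hash-partitions by ticker first: all rows for a given ticker land in the same partition, so the per-partition GROUP BY produces complete results. As a rough sketch using only the calls shown above, the same aggregation can also be done in two stages without hash partitioning (assuming repartition(1) collapses the data to a single partition, as with the repartition call above):

# First pass: partial aggregates computed within each partition.
partial = sp.partial_sql(
    "SELECT ticker, min(price) AS min_price, max(price) AS max_price "
    "FROM {0} GROUP BY ticker", df)

# Second pass: merge the partial results on a single partition.
merged = sp.partial_sql(
    "SELECT ticker, min(min_price) AS min_price, max(max_price) AS max_price "
    "FROM {0} GROUP BY ticker", partial.repartition(1))
print(merged.to_pandas())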

Reading from S3

import smallpond

# Initialize session
sp = smallpond.init()

# Load data directly from S3
df = sp.read_parquet("s3://my-bucket/data/prices.parquet", 
                     s3_region="us-west-2")

# You can also provide explicit credentials
df = sp.read_parquet("s3://my-bucket/data/*.parquet",
                     recursive=True,
                     s3_region="us-west-2",
                     s3_access_key_id="YOUR_ACCESS_KEY",
                     s3_secret_access_key="YOUR_SECRET_KEY")

# For S3-compatible storage (like MinIO, Ceph, etc.)
df = sp.read_parquet("s3://my-bucket/data.parquet",
                     s3_endpoint="https://minio.example.com",
                     s3_region="us-east-1",
                     s3_access_key_id="YOUR_ACCESS_KEY",
                     s3_secret_access_key="YOUR_SECRET_KEY")

# Process data as usual
df = sp.partial_sql("SELECT * FROM {0} LIMIT 10", df)
print(df.to_pandas())

smallpond uses DuckDB's Secrets Manager under the hood to securely handle S3 credentials across the entire session.
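
For reference, the secret smallpond registers is roughly equivalent to the following standalone DuckDB snippet. This is a sketch of DuckDB's CREATE SECRET syntax, not smallpond's internal code; the bucket and credentials are placeholders:

import duckdb

con = duckdb.connect()
# Register S3 credentials once; later s3:// reads in this session use them.
con.sql("""
    CREATE SECRET my_s3 (
        TYPE S3,
        KEY_ID 'YOUR_ACCESS_KEY',
        SECRET 'YOUR_SECRET_KEY',
        REGION 'us-west-2'
    )
""")
con.sql("SELECT * FROM 's3://my-bucket/data/prices.parquet' LIMIT 10").show()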

Documentation

For detailed guides and API reference, see the project documentation.

Performance

We evaluated smallpond using the GraySort benchmark (script) on a cluster of 50 compute nodes and 25 storage nodes running 3FS. The benchmark sorted 110.5 TiB of data in 30 minutes and 14 seconds, achieving an average throughput of 3.66 TiB/min.

Details can be found in 3FS - Gray Sort.

Development

pip install ".[dev]"

# run unit tests
pytest -v tests/test*.py

# build documentation
pip install ".[docs]"
cd docs
make html
python -m http.server --directory build/html

License

This project is licensed under the MIT License.
