Add parquet support #20

@kevmo314

This will be substantially more challenging than the CSV format, but it could be rewarding. We want to add Apache Parquet support because Parquet is widely used in real-world data science applications.

One challenge is that we currently require a byte pointer to a specific record, whereas Parquet is columnar, meaning a record's values are split across different locations in the file.
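To make the problem concrete, here is a toy sketch in Python (not real Parquet) that contrasts a row-oriented layout with a column-oriented one. The fixed-width encoding and the names `row_range`, `column_ranges`, and `read_columnar` are assumptions made for illustration. Real Parquet files add row groups, column chunks, pages, encodings, and compression, so offsets cannot be computed arithmetically like this.

```python
import struct

# Toy data: three records with two fields each (id: int32, score: float64).
records = [(1, 0.5), (2, 1.5), (3, 2.5)]

# Row-oriented layout (CSV-like): each record's fields sit next to each other.
row_bytes = b"".join(struct.pack("<id", i, s) for i, s in records)
ROW_SIZE = struct.calcsize("<id")  # 4 + 8 bytes, no padding with "<"


def row_range(n):
    """A single (offset, length) pair is enough to locate record n."""
    return [(n * ROW_SIZE, ROW_SIZE)]


# Column-oriented layout (Parquet-like): all ids first, then all scores.
id_col = b"".join(struct.pack("<i", i) for i, _ in records)
score_col = b"".join(struct.pack("<d", s) for _, s in records)
col_bytes = id_col + score_col


def column_ranges(n):
    """Record n now needs one (offset, length) pair per column."""
    return [(n * 4, 4), (len(id_col) + n * 8, 8)]


def read_columnar(n):
    """Reassemble record n from its two separate byte ranges."""
    (id_off, id_len), (s_off, s_len) = column_ranges(n)
    (i,) = struct.unpack("<i", col_bytes[id_off:id_off + id_len])
    (s,) = struct.unpack("<d", col_bytes[s_off:s_off + s_len])
    return (i, s)
```

Even in this simplified form, a single byte pointer no longer identifies a record: the index would have to carry one location per column.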

https://github.com/apache/parquet-format

We will likely need to write our own Parquet file parser to figure out the correct byte offsets, and then be particular in the js library about how exactly each record is fetched and parsed. This may require returning additional metadata beyond the byte offset, which we can do via an intermediate pointer in the index.
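One possible shape for such an intermediate pointer, sketched in Python for discussion only. `ColumnRange`, `RecordPointer`, `range_header`, and all of their fields are hypothetical names, not an existing API in this project or in any Parquet library. The sketch also assumes the client fetches data with HTTP range requests, and that each range covers a whole page from which the record's value is then extracted.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ColumnRange:
    column: str       # column name (hypothetical field)
    offset: int       # byte offset of the page holding the value
    length: int       # byte length of that page
    row_in_page: int  # position of the record's value inside the page


@dataclass(frozen=True)
class RecordPointer:
    # One entry per column, instead of a single byte offset per record.
    ranges: List[ColumnRange]


def range_header(pointer: RecordPointer) -> str:
    """Format the pointer's byte ranges as an HTTP Range header value."""
    ordered = sorted(pointer.ranges, key=lambda r: r.offset)
    parts = [f"{r.offset}-{r.offset + r.length - 1}" for r in ordered]
    return "bytes=" + ",".join(parts)


pointer = RecordPointer(ranges=[
    ColumnRange(column="score", offset=300, length=50, row_in_page=2),
    ColumnRange(column="id", offset=4, length=100, row_in_page=2),
])
header = range_header(pointer)
```

Not every server honors multi-range requests, so a client might instead need to issue one request per range or fetch a single span covering all of them. That trade-off is part of what should be discussed before implementation.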

Anyway, let's talk about this one before working on it. It'll be super educational about how Parquet works, but I don't want us to get lost in the complexity.

Labels: feature (New feature or request)
