Parquet skeletons

I had just been thinking about using feather (arrow IPC) or parquet for skeleton storage and as is often the case, you've thought about it first.

Something I'd considered would be to use a null rather than -1 for the `parent_id` of root nodes. This means we're not wasting a bit to encode a single value per skeleton, and we can map node IDs onto the whole uint64 space (not that we're likely to run out of IDs, but they're not necessarily counting up from 0). AFAIK, nulls are encoded in the header of the parquet so retrieving root nodes could be much faster than checking through the whole file. If we could fix the pandas version to >=2, we could use the arrow backend and switch navis generally onto using a nullable column, but until then it's a fairly simple switch to do at the IO stage. N.B. using the arrow backend would, I think, allow an extremely fast IO mode where the memory buffer could be either dumped straight to a file or read directly by other libraries as feather format (NBLAST memory sharing?).

For bundling several parquets together I'd consider tar rather than zip (tar-quet??) - parquet files are probably best compressed using internal codecs which make best use of the file's structure, and any compression over the top of that will slow IO without significant space savings. Tarballs don't have that overhead.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Parquet skeletons #134

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Parquet skeletons #134

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions