Parquet skeletons #134

@clbarnes

I had just been thinking about using Feather (Arrow IPC) or Parquet for skeleton storage and, as is often the case, you've thought about it first.

Something I'd considered would be to use a null rather than -1 for the parent_id of root nodes. This means we're not spending the sign bit on a single sentinel value per skeleton, and we can map node IDs onto the whole uint64 space (not that we're likely to run out of IDs, but they're not necessarily counting up from 0). AFAIK, Parquet records per-column null counts in its footer metadata, so a reader could skip row groups with no nulls, and retrieving root nodes could be much faster than checking through the whole file. If we could pin pandas to >=2, we could use the Arrow backend and switch navis generally onto a nullable column, but until then it's a fairly simple conversion to do at the IO stage. N.B. using the Arrow backend would, I think, allow an extremely fast IO mode where the memory buffer could be either dumped straight to a file or read directly by other libraries as Feather format (NBLAST memory sharing?).

For bundling several Parquet files together I'd consider tar rather than zip (tar-quet??) - Parquet files are probably best compressed using their internal codecs, which make best use of the file's structure, and any compression layered on top of that will slow IO without significant space savings. Uncompressed tarballs don't have that overhead.
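A rough sketch of the tarball bundling, using only the standard library. The helper names `bundle` and `read_member`, and the member names, are hypothetical; the payloads stand in for already-compressed Parquet bytes.

```python
import io
import tarfile


def bundle(files: dict[str, bytes]) -> bytes:
    """Pack named byte payloads into an uncompressed tar archive."""
    out = io.BytesIO()
    # mode="w" writes a plain tar with no outer compression layer.
    with tarfile.open(fileobj=out, mode="w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return out.getvalue()


def read_member(archive: bytes, name: str) -> bytes:
    """Read one member back out of the archive by name."""
    # mode="r:" opens the archive strictly as uncompressed tar.
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:") as tar:
        member = tar.extractfile(name)
        if member is None:
            raise ValueError(f"{name} is not a regular file")
        return member.read()


# Placeholder payloads standing in for per-skeleton Parquet files.
skeletons = {"1.parquet": b"PAR1-one", "2.parquet": b"PAR1-two"}
archive = bundle(skeletons)
restored = read_member(archive, "2.parquet")
```

Each member is stored byte-for-byte, so a reader can hand the extracted bytes straight to a Parquet reader with no second decompression pass.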

Labels: enhancement