This repository was archived by the owner on Feb 18, 2024. It is now read-only.
Writing records row by row #1053
ishitatsuyuki started this conversation in General
Replies: 1 comment 3 replies
-
Hey @ishitatsuyuki and thanks for the issue. I converted it to a discussion for now since it seems a bit broad to be actionable (yet). Maybe the primary question I have here is: do you need to go through Arrow? It seems to me that you can skip Arrow entirely and work directly with Parquet. I am asking because it sounds like you would like to write to Parquet in a streaming fashion, and thus having "arrow arrays" does not seem beneficial. Thinking through this angle,
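For illustration only, here is a minimal sketch of that direction: buffer values in plain `Vec`s and write them out as one row group through a low-level Parquet column writer. It uses the arrow-rs `parquet` crate's writer API purely as an example (not arrow2/parquet2), and the schema, file name, and values are made up:

```rust
use std::fs::File;
use std::sync::Arc;

use parquet::data_type::Int64Type;
use parquet::file::properties::WriterProperties;
use parquet::file::writer::SerializedFileWriter;
use parquet::schema::parser::parse_message_type;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // One-column schema just for illustration; a profiler would have more fields.
    let schema = Arc::new(parse_message_type(
        "message profile { REQUIRED INT64 timestamp; }",
    )?);
    let props = Arc::new(WriterProperties::builder().build());
    let file = File::create("profile.parquet")?;
    let mut writer = SerializedFileWriter::new(file, schema, props)?;

    // Whatever the application has buffered so far becomes one row group.
    let buffered_timestamps: Vec<i64> = vec![100, 105, 110];
    let mut row_group = writer.next_row_group()?;
    while let Some(mut column) = row_group.next_column()? {
        column
            .typed::<Int64Type>()
            .write_batch(&buffered_timestamps, None, None)?;
        column.close()?;
    }
    row_group.close()?;
    writer.close()?;
    Ok(())
}
```

Each flush of the application-side buffer becomes one row group, so the buffer size directly controls both the memory footprint and the row-group granularity.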
-
I'm currently working on a profiler that uses Parquet as the on-disk profile format. We don't do much columnar processing, but we would like to take advantage of Parquet's compression and efficiency benefits.
The records come in row by row, but the arrow2 crate's writer only accepts an entire column of a row group at a time. It should technically be possible to serialize with tighter buffering; e.g., for delta bitpacked encoding, you can serialize as soon as you have buffered a miniblock's worth of data. Doing so would give the in-memory buffer a smaller footprint.
It might therefore be a worthwhile addition to the API to be able to write records one by one.
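To make the footprint concern concrete, here is a minimal, hypothetical sketch (the field names and the `flush` callback are made up; no arrow2/parquet2 calls are used) of what a caller has to do with a column-at-a-time writer: accumulate an entire row group's worth of values per column before anything can be handed off for serialization.

```rust
/// Hypothetical sketch: accumulate incoming records in per-column buffers and
/// hand a full row group to a flush callback, which would do the actual
/// column-chunk serialization.
struct RowGroupBuffer<F: FnMut(&[i64], &[u64])> {
    timestamps: Vec<i64>, // example column: sample timestamp
    stack_ids: Vec<u64>,  // example column: interned stack id
    rows_per_group: usize,
    flush: F,
}

impl<F: FnMut(&[i64], &[u64])> RowGroupBuffer<F> {
    fn new(rows_per_group: usize, flush: F) -> Self {
        Self {
            timestamps: Vec::with_capacity(rows_per_group),
            stack_ids: Vec::with_capacity(rows_per_group),
            rows_per_group,
            flush,
        }
    }

    /// Append one record; emit a row group once the buffer is full.
    fn push(&mut self, timestamp: i64, stack_id: u64) {
        self.timestamps.push(timestamp);
        self.stack_ids.push(stack_id);
        if self.timestamps.len() == self.rows_per_group {
            (self.flush)(&self.timestamps, &self.stack_ids);
            self.timestamps.clear();
            self.stack_ids.clear();
        }
    }

    /// Emit any remaining rows as a final, shorter row group.
    fn finish(mut self) {
        if !self.timestamps.is_empty() {
            (self.flush)(&self.timestamps, &self.stack_ids);
        }
    }
}

fn main() {
    let mut buf = RowGroupBuffer::new(2, |ts, ids| {
        // Stand-in for serializing one column chunk per field.
        println!("row group: ts={:?} stack_ids={:?}", ts, ids);
    });
    buf.push(100, 1);
    buf.push(105, 2);
    buf.push(110, 3); // stays buffered until finish()
    buf.finish();
}
```

With a column-at-a-time API, `rows_per_group` has to be large enough to produce reasonably sized row groups, and that buffer is exactly the in-memory footprint a row-by-row or miniblock-granular write API could shrink.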