Description
In follow up to #486, it'd sure be nice to be able to move away from our current multiprocessing.shared_memory approach for real-time quote/tick ingest and possibly leverage an Apache standard format such as Arrow and Parquet.
As part of improving the .parquet file based tsdb IO from #486, it'd obviously be ideal to support df appends instead of only full overwrites 😂.
ToDo content from #486 pertaining to StorageClient.write_ohlcv():
write on backfills and rt ingest. rn the write is masked out, mostly bc there are some details to work out on when/how frequently the writes to parquet files should happen, particularly whether to "append" to parquet files. Turns out there are options for appending to parquet (faster than overwriting, presumably?), particularly using fastparquet; see the below resources:
- for python we can likely use: https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write
- also note the `times` option with the int96 format, which embeds nanoseconds B)
- the `custom_metadata: dict` can only be used on overwrite 👀
- can use https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.update_file_custom_metadata to update metadata if needed?
- https://stackoverflow.com/questions/39234391/how-to-append-data-to-an-existing-parquet-file
- other langs and spark related: