Design thought - tabular data #74
YoelShoshan
started this conversation in
Ideas
-
Sounds good @YoelShoshan !
-
The upcoming data pipeline is based on having each sample in a nested dict.
While it's super useful and flexible, sometimes it makes sense to operate in a vectorized way on large tabular datasets
(especially for datasets with over a million entries, where the processing of each sample is very light).
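As a toy illustration of the tradeoff (this is purely illustrative, not project code): applying a tiny operation sample-by-sample through nested dicts incurs per-sample Python overhead, while the equivalent vectorized column operation is done in one shot.

```python
# Toy illustration only -- not project code. Compares a per-sample loop over
# nested-dict samples with the equivalent vectorized column operation.
import numpy as np

n = 1_000_000
rng = np.random.default_rng(0)
ages = rng.uniform(0, 100, n)

# Per-sample style: one tiny operation applied to each nested-dict sample.
per_sample = [{"age": a * 2.0} for a in ages]

# Vectorized style: the same operation over the whole column at once,
# which is typically far faster when the per-sample work is this small.
doubled = ages * 2.0
```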
For that purpose, I wonder what your thoughts are on the following design idea:
The current BaseOp remains, and it continues to operate at the single-sample level.
We add a BaseFullDatasetOp (or a better name) which expects the following:
This should be useful in at least two scenarios:
imagined usage code:
In its simplest form, a user can call those pipelines directly.
One advantage is that it can mostly reuse our existing code; it just operates in a scenario where only a single "sample" is processed through the pipeline.
The final dataframe will be cached.
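To make the idea concrete, here is a minimal sketch of what a whole-dataset op could look like, assuming the dataset is held as a pandas DataFrame. The names `BaseFullDatasetOp` and `OpNormalizeColumn` are hypothetical, not the project's actual API.

```python
# Hypothetical sketch -- BaseFullDatasetOp and OpNormalizeColumn are
# illustrative names, not the project's actual API.
from abc import ABC, abstractmethod
import pandas as pd

class BaseFullDatasetOp(ABC):
    """An op that receives the entire dataset at once, so it can work vectorized."""
    @abstractmethod
    def __call__(self, df: pd.DataFrame) -> pd.DataFrame:
        ...

class OpNormalizeColumn(BaseFullDatasetOp):
    """Z-score normalize one column across the whole dataset in a single vectorized pass."""
    def __init__(self, col: str):
        self.col = col

    def __call__(self, df: pd.DataFrame) -> pd.DataFrame:
        df = df.copy()
        df[self.col] = (df[self.col] - df[self.col].mean()) / df[self.col].std()
        return df

# Usage: the op sees all rows at once, unlike a per-sample BaseOp.
df = pd.DataFrame({"age": [20.0, 30.0, 40.0]})
df = OpNormalizeColumn("age")(df)
```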
Additionally, a combination could be useful: initially process the data in the "single-sample-is-world" way,
and from that point use our standard pipeline, which operates in "single-sample" mode.
One can imagine such a scenario being used like this: (very rough draft)
BTW, TabularOp is not necessarily a good name; any name that captures that the nested dict contains the entire "world" in this "special pipeline" should be used.
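The two-stage combination described above might be sketched roughly as follows. Everything here is an assumption for illustration: the stage runners, the `{"data": ...}` sample layout, and the example ops are hypothetical.

```python
# Rough hypothetical sketch of the two-stage combination: a dataset-level
# stage runs vectorized ops on the full table, then each row becomes a
# nested-dict sample and flows through per-sample ops. All names are
# illustrative assumptions, not the project's actual API.
import pandas as pd

def run_dataset_stage(df, dataset_ops):
    # Stage 1: each op sees the entire table (the "world").
    for op in dataset_ops:
        df = op(df)
    return df

def run_sample_stage(df, sample_ops):
    # Stage 2: each row becomes a nested-dict sample for the standard pipeline.
    samples = []
    for _, row in df.iterrows():
        sample = {"data": row.to_dict()}
        for op in sample_ops:
            sample = op(sample)
        samples.append(sample)
    return samples

def fill_missing(df):
    return df.fillna(0.0)  # dataset-level, vectorized

def add_adult_flag(sample):
    sample["data"]["is_adult"] = sample["data"]["age"] >= 18.0  # per-sample
    return sample

df = pd.DataFrame({"age": [15.0, None, 30.0]})
samples = run_sample_stage(run_dataset_stage(df, [fill_missing]), [add_adult_flag])
```

The dataset-level result after stage 1 is exactly what the proposal suggests caching as the final dataframe, so stage 2 can be rerun cheaply.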