Replies: 5 comments
-
Oh hey, amazing! I used Buckaroo not long ago, glad to be of assistance 😄 This library was written to fill a need to store arbitrary metadata on a Polars DataFrame. There’s a GitHub issue on the Polars repo which I landed on while looking for a way to do that:
If you run a code search on GitHub for
I had to review the code there to recall what the metadata plugin was used for.
The way it’s used in the polars-expr-hopper plugin is:
As people pointed out in that Polars repo issue thread, this is not really a foolproof substitute for official Polars DataFrame metadata support, but in my view it’s still pretty good in the absence of that 😃 You can also persist metadata to the Parquet files themselves (and I’ve seen people do some impressive tricks with that, like this). There might be room for this to grow into the Rust extension side of things (especially if you’re trying to make it particularly performant), but so far it’s a pretty simple Python extension. I’m currently working on a different Polars plugin, polars-genson, for JSON handling. That one is a Series handling plugin with Rust code interfacing with the Python code, so it is a little different to this one, which is a simple Python-only namespace plugin. I’ll check out your hashing plugin, and do feel free to open issues or ping me with an @ if I am not responding!
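To make the "metadata stored beside the DataFrame" idea concrete, here is a minimal sketch. It is not the polars-config-meta implementation: `Frame`, `MetaStore`, and `copy_to` are hypothetical names, and a plain Python class stands in for a DataFrame. The assumption is simply that metadata lives in a side table keyed by object identity and has to be carried over to derived frames explicitly.

```python
import weakref


class Frame:
    """Hypothetical stand-in for a DataFrame (any weak-referenceable object works)."""

    def __init__(self, columns):
        self.columns = dict(columns)


class MetaStore:
    """Hypothetical side table mapping object identity to a metadata dict."""

    def __init__(self):
        self._meta = {}

    def set(self, obj, **kwargs):
        key = id(obj)
        if key not in self._meta:
            self._meta[key] = {}
            # Drop the entry once obj is garbage collected, so a recycled id()
            # never inherits stale metadata.
            weakref.finalize(obj, self._meta.pop, key, None)
        self._meta[key].update(kwargs)

    def get(self, obj):
        # Return a copy so callers cannot mutate the stored metadata by accident.
        return dict(self._meta.get(id(obj), {}))

    def copy_to(self, src, dst):
        """Carry metadata over to a derived object explicitly."""
        self.set(dst, **self.get(src))


store = MetaStore()
df = Frame({"a": [1, 2, 3]})
store.set(df, owner="Alice")

# A "transformed" frame is a new object, so its metadata must be copied over.
df2 = Frame({**df.columns, "b": [4, 5, 6]})
store.copy_to(df, df2)
```

The explicit `copy_to` step is why this kind of approach is not a foolproof substitute for built-in metadata: any transformation that does not go through the store produces a frame with no metadata attached.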
-
I left a comment on your repo issue about possibly using rapidhash as a faster hashing library, but don’t quote me on that: I’ve never used it, I only heard about it this week 😄 I’d say polars-config-meta definitely sounds like the best way to store your cached columns for Polars, yeah. For polars-expr-hopper, if I were you I would only use it if you were happy to let the expressions you are storing apply “as soon as they are ready” (i.e. as soon as the dataframe gets the columns that the stored expressions refer to). It isn’t designed to apply them with any logic besides that “when they’re ready” trigger, so if you’re after more than that it might not be sufficient for your needs. But it sounds like that is what you want when you say “lazy application”, so it could be a good fit, yeah 👍
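The "apply as soon as they are ready" behaviour can be sketched roughly as follows. This is an illustration only, not polars-expr-hopper's code: `Hopper`, `PendingFilter`, and `apply_ready` are hypothetical names, and a list of dict rows stands in for a DataFrame.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PendingFilter:
    needs: frozenset                   # column names the predicate reads
    predicate: Callable[[dict], bool]  # row (as a dict) -> keep this row?


@dataclass
class Hopper:
    pending: list = field(default_factory=list)

    def add(self, needs, predicate):
        self.pending.append(PendingFilter(frozenset(needs), predicate))

    def apply_ready(self, rows):
        """Apply every filter whose columns are all present, then forget it."""
        available = set(rows[0]) if rows else set()
        still_waiting = []
        for f in self.pending:
            if f.needs <= available:
                rows = [r for r in rows if f.predicate(r)]
            else:
                still_waiting.append(f)  # columns not there yet, keep for later
        self.pending = still_waiting
        return rows


hopper = Hopper()
hopper.add({"age"}, lambda r: r["age"] > 18)
hopper.add({"score"}, lambda r: r["score"] >= 0.5)

rows = [{"age": 17}, {"age": 30}, {"age": 45}]
rows = hopper.apply_ready(rows)  # only the "age" filter can run here
waiting_after_first_pass = len(hopper.pending)

# Once a "score" column appears, the remaining filter becomes ready.
rows = [{**r, "score": s} for r, s in zip(rows, [0.9, 0.2])]
rows = hopper.apply_ready(rows)
```

The only trigger in this sketch is column availability, which mirrors the limitation described above: there is no hook for applying a stored filter under any other condition.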
-
Hi. I'm revisiting this because I'm starting to write my cache bits. I really do hope that Polars implements metadata. I just took a crack at trying to identify series that are reused across dataframes.
fc = FileCache()
assert fc.check_series(df['a1']) == False
assert fc.check_series(df['b2']) == False
fc.add_df(df)
assert fc.check_series(df['a1']) == True
# buffer info for string series is unreliable, commented out for now
# look at polars-core/src/series/buffer.rs::get_buffers_from_string
# assert fc.check_series(df['b2']) == True
# I don't even see this reliably working
# assert not fc._get_buffer_key(df['b2']) == fc._get_buffer_key(df['b2'])
# this should show that the same physical memory is used by df2['a1'] as df['a1']
df2 = df.select(pl.col('a1').alias('alias_a1'),
                pl.col('b2').alias('alias_b2'))
assert fc.check_series(df2['alias_a1']) == True
# assert fc.check_series(df2['alias_b2']) == True
It works for numeric datatypes, but not strings, because strings don't consistently return the same buffer addresses.
Do you want to have a video call sometime to talk about what we are both building? I'm still not completely sure I understand how polars-config-meta fits in with what I'm trying to do, but it seems close.
-
So, would polars-config-meta work for this use case? With Buckaroo, I don't expect to have control of the dataframes that are passed in. It feels like a big lift for my users to say "always load a dataframe with polars-config-meta". Can I monkeypatch dataframes that are sent in?
BuckarooWidget(polars_df)
# next cell
df2 = polars_df.with_columns(pl.col('profit').cum_sum().alias('profit_cumsum'))
Because BuckarooWidget monkeypatched (not returned) a polars-config-meta onto the
-
Very cool!
🧐
🤔 💭
Sure, are you on Discord? I am in the Polars Discord server, perhaps you are too? Username @permutans
-
What's the use case you built this project for?
I'm building Buckaroo - a DataFrame table display library with summary stats. I also just wrote pl_series_hash that hashes series to a single u64 for use in caching. I'm planning to build out a caching system for buckaroo polars so that summary stats need to be computed only once and dataframes can be lazily displayed.
All of this seems similar to things you are building for polars-expr-hopper and polars-config-meta. Want to talk and see how we can collaborate?
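As a rough illustration of the caching idea (summary stats computed once per distinct column content), here is a minimal sketch. It is an assumption-laden stand-in, not Buckaroo's or pl_series_hash's code: `series_hash` and `StatsCache` are hypothetical names, plain lists of numbers stand in for Polars Series, and the 64-bit hash comes from the standard library's BLAKE2b rather than the plugin. It assumes non-empty numeric columns.

```python
import hashlib
import struct


def series_hash(values):
    """Hypothetical stand-in for a series hash: a 64-bit digest of float values."""
    h = hashlib.blake2b(digest_size=8)
    for v in values:
        h.update(struct.pack("<d", float(v)))
    return int.from_bytes(h.digest(), "little")


class StatsCache:
    """Hypothetical cache of summary stats keyed by column content hash."""

    def __init__(self):
        self._stats = {}
        self.computed = 0  # how many times stats were actually computed

    def summary(self, values):
        key = series_hash(values)
        if key not in self._stats:
            self.computed += 1
            self._stats[key] = {
                "min": min(values),
                "max": max(values),
                "mean": sum(values) / len(values),
            }
        return self._stats[key]


cache = StatsCache()
profit = [10.0, 20.0, 30.0]
first = cache.summary(profit)
again = cache.summary(list(profit))  # equal content, different object: cache hit
```

Keying on content rather than object identity means the cache still hits when the same column arrives in a different DataFrame object, at the cost of one pass over the data to hash it.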