A Polars plugin for persistent DataFrame-level metadata.
polars-config-meta offers a simple way to store and propagate Python-side metadata for Polars DataFrames and LazyFrames. It achieves this by:
- Registering a custom `config_meta` namespace on each `DataFrame` and `LazyFrame`.
- Keeping an internal dictionary keyed by `id(df)`, with automatic weak-reference cleanup to avoid memory leaks.
- Automatically patching common Polars methods (like `with_columns`, `select`, `filter`, etc.) so that metadata is preserved even when using regular Polars syntax.
- Providing a "fallthrough" mechanism so you can write `df.config_meta.some_polars_method(...)` and have the resulting new `DataFrame` automatically inherit the old metadata, either to make the metadata transfer explicit or as a backup if a method was not monkeypatched (please file a bug report if you find any!).
- Optionally embedding that metadata in file-level Parquet metadata when you call `df.config_meta.write_parquet(...)`, and retrieving it with `read_parquet_with_meta(...)` (eager) or `scan_parquet_with_meta(...)` (lazy).
```
pip install polars-config-meta[polars]
```

On older CPUs, add the `polars-lts-cpu` extra:

```
pip install polars-config-meta[polars-lts-cpu]
```

For Parquet file-level metadata reading/writing, add the `pyarrow` extra:

```
pip install polars-config-meta[pyarrow]
```
Key features:

- **Automatic Metadata Preservation** (New in v0.2.0!)

  By default, the plugin patches common Polars DataFrame methods (`with_columns`, `select`, `filter`, `sort`, etc.) to automatically preserve metadata. This means both of these will preserve metadata:

  ```python
  df.with_columns(...)              # ← regular Polars method (automatically patched)
  df.config_meta.with_columns(...)  # ← through the namespace
  ```

  This behavior can be configured globally (see Configuration below).

- **Weak-Reference Based**

  We store metadata in class-level dictionaries keyed by `id(df)` and hold a `weakref` to the DataFrame. Once the DataFrame is garbage-collected, its metadata is removed too (see the sketch after this list).

- **Works with DataFrames and LazyFrames**

  The plugin supports both eager (`DataFrame`) and lazy (`LazyFrame`) execution modes.

- **Parquet Integration**

  - `df.config_meta.write_parquet("file.parquet")` automatically embeds the plugin metadata into the Arrow schema's `metadata`.
  - `read_parquet_with_meta("file.parquet")` reads the file, extracts that metadata, and reattaches it to the returned `DataFrame`.
  - `scan_parquet_with_meta("file.parquet")` scans the file, extracts that metadata, and reattaches it to the returned `LazyFrame`.

- **Chainable Operations**

  Since metadata is preserved across transformations, you can chain multiple operations:

  ```python
  result = (
      df.config_meta.set(owner="Alice")
      .with_columns(doubled=pl.col("a") * 2)
      .filter(pl.col("doubled") > 5)
      .select(["doubled"])
  )
  # Metadata is preserved throughout the chain!
  ```
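To see the weak-reference cleanup in action, here is a small sketch (it shows only the user-visible effect; the internal bookkeeping is sketched further below):

```python
import gc

import polars as pl
import polars_config_meta  # registers the plugin

df = pl.DataFrame({"a": [1, 2, 3]})
df.config_meta.set(owner="Alice")

del df        # drop the last strong reference to the frame
gc.collect()  # the weakref cleanup discards its stored metadata

# A new frame starts clean, even if CPython happens to reuse the old id:
df = pl.DataFrame({"a": [1, 2, 3]})
print(df.config_meta.get_metadata())  # -> {}
```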
Basic usage:

```python
import polars as pl
import polars_config_meta  # this registers the plugin

df = pl.DataFrame({"a": [1, 2, 3]})
df.config_meta.set(owner="Alice", confidence=0.95)

# Both of these preserve metadata (auto-patching is enabled by default):
df2 = df.with_columns(doubled=pl.col("a") * 2)
print(df2.config_meta.get_metadata())
# -> {'owner': 'Alice', 'confidence': 0.95}

df3 = df.config_meta.with_columns(tripled=pl.col("a") * 3)
print(df3.config_meta.get_metadata())
# -> {'owner': 'Alice', 'confidence': 0.95}

# Chain operations - metadata flows through:
df4 = (
    df.with_columns(squared=pl.col("a") ** 2)
    .filter(pl.col("squared") > 4)
    .select(["a", "squared"])
)
print(df4.config_meta.get_metadata())
# -> {'owner': 'Alice', 'confidence': 0.95}

# Write to Parquet, storing the metadata in file-level metadata:
df4.config_meta.write_parquet("output.parquet")

# Later, read it back:
from polars_config_meta import read_parquet_with_meta

df_in = read_parquet_with_meta("output.parquet")
print(df_in.config_meta.get_metadata())
# -> {'owner': 'Alice', 'confidence': 0.95}
```

The plugin provides a `ConfigMetaOpts` class to control automatic metadata preservation behavior:
```python
from polars_config_meta import ConfigMetaOpts

# Disable automatic metadata preservation for regular DataFrame methods
ConfigMetaOpts.disable_auto_preserve()

df = pl.DataFrame({"a": [1, 2, 3]})
df.config_meta.set(owner="Alice")

df2 = df.with_columns(doubled=pl.col("a") * 2)
print(df2.config_meta.get_metadata())
# -> {} (metadata NOT preserved with regular methods)

df3 = df.config_meta.with_columns(tripled=pl.col("a") * 3)
print(df3.config_meta.get_metadata())
# -> {'owner': 'Alice'} (still works via namespace!)

# Re-enable automatic preservation
ConfigMetaOpts.enable_auto_preserve()

df4 = df.with_columns(quadrupled=pl.col("a") * 4)
print(df4.config_meta.get_metadata())
# -> {'owner': 'Alice'} (metadata preserved again)
```

- `ConfigMetaOpts.enable_auto_preserve()`: Enable automatic metadata preservation for regular DataFrame methods (this is the default behavior).
- `ConfigMetaOpts.disable_auto_preserve()`: Disable automatic preservation; only `df.config_meta.<method>()` will preserve metadata.
Note: The `df.config_meta.<method>()` syntax always preserves metadata, regardless of the configuration setting.
The main methods provided by the plugin:

- `df.config_meta.set(**kwargs)`: Set metadata key-value pairs.

  ```python
  df.config_meta.set(owner="Alice", confidence=0.95, version=2)
  ```

- `df.config_meta.get_metadata()`: Get all metadata as a dictionary.

  ```python
  metadata = df.config_meta.get_metadata()
  # -> {'owner': 'Alice', 'confidence': 0.95, 'version': 2}
  ```

- `df.config_meta.update(mapping)`: Update metadata from a dictionary.

  ```python
  df.config_meta.update({"confidence": 0.99, "validated": True})
  ```

- `df.config_meta.merge(*dfs)`: Merge metadata from other DataFrames.

  ```python
  df3.config_meta.merge(df1, df2)
  # df3 now has metadata from both df1 and df2
  ```

- `df.config_meta.clear_metadata()`: Remove all metadata for this DataFrame.

  ```python
  df.config_meta.clear_metadata()
  ```

- `df.config_meta.write_parquet(file_path, **kwargs)`: Write the DataFrame to Parquet with embedded metadata.

  ```python
  df.config_meta.write_parquet("output.parquet")
  ```

- `read_parquet_with_meta(file_path, **kwargs)`: Read a Parquet file with metadata (eager).

  ```python
  from polars_config_meta import read_parquet_with_meta

  df = read_parquet_with_meta("output.parquet")
  ```

- `scan_parquet_with_meta(file_path, **kwargs)`: Scan a Parquet file with metadata (lazy).

  ```python
  from polars_config_meta import scan_parquet_with_meta

  lf = scan_parquet_with_meta("output.parquet")
  ```
Any Polars DataFrame/LazyFrame method can be called through `df.config_meta.<method>()`:

```python
# All of these preserve metadata:
df.config_meta.with_columns(new_col=pl.col("a") * 2)
df.config_meta.select(["a", "b"])
df.config_meta.filter(pl.col("a") > 0)
df.config_meta.sort("a")
df.config_meta.unique()
df.config_meta.drop(["col1"])
df.config_meta.rename({"old": "new"})
# ... and many more!
```

Tag a DataFrame with provenance metadata at creation:

```python
df = pl.DataFrame({"a": [1, 2, 3]})
df.config_meta.set(
    source="user_upload",
    timestamp="2025-01-15",
    validated=False,
)
```

Chain transformations freely:

```python
result = (
    df.with_columns(normalized=pl.col("value") / pl.col("value").sum())
    .filter(pl.col("normalized") > 0.1)
    .sort("normalized", descending=True)
)
# Metadata flows through the entire chain
```

Combine metadata from multiple DataFrames, for example after a `pl.concat`:

```python
df1.config_meta.set(source="api", quality="high")
df2.config_meta.set(source="cache", timestamp="2025-01-15")
df3 = pl.concat([df1, df2])
df3.config_meta.merge(df1, df2)
# df3 now has: {'source': 'cache', 'quality': 'high', 'timestamp': '2025-01-15'}
# Note: Later DataFrames' values override earlier ones
```

Round-trip the metadata through Parquet:

```python
# Save with metadata
df.config_meta.set(lineage="raw_data", version=1)
df.config_meta.write_parquet("data_v1.parquet")

# Load with metadata
df_loaded = read_parquet_with_meta("data_v1.parquet")
print(df_loaded.config_meta.get_metadata())
# -> {'lineage': 'raw_data', 'version': 1}
```
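Because the metadata is embedded in the Arrow schema's file-level metadata, you can also inspect it directly with pyarrow. A minimal sketch, assuming the `pyarrow` extra is installed (the exact key the plugin writes under is an internal detail, so this just dumps all key/value pairs):

```python
import pyarrow.parquet as pq

# Read only the schema; file-level metadata is exposed as
# a dict of bytes keys and bytes values.
schema = pq.read_schema("data_v1.parquet")
print(schema.metadata)
```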
When you first access `.config_meta` on any DataFrame, the plugin automatically patches common Polars methods like:

- `with_columns`, `select`, `filter`, `sort`, `unique`, `drop`, `rename`, `cast`
- `drop_nulls`, `fill_null`, `fill_nan`
- `head`, `tail`, `sample`, `slice`, `limit`
- `reverse`, `rechunk`, `clone`, `clear`
- ... and more
These patched methods automatically copy metadata from the source DataFrame to the result DataFrame.
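The patching can be pictured roughly like the following sketch (illustrative only, not the plugin's actual code; `_meta_store` and `patch_method` are hypothetical stand-ins, and the real plugin also handles LazyFrame patching and weakref cleanup):

```python
import functools

import polars as pl

_meta_store: dict[int, dict] = {}  # hypothetical stand-in for the plugin's store


def patch_method(method_name: str) -> None:
    """Wrap a DataFrame method so its result inherits the caller's metadata."""
    original = getattr(pl.DataFrame, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        result = original(self, *args, **kwargs)
        if isinstance(result, (pl.DataFrame, pl.LazyFrame)):
            # copy the caller's metadata (if any) onto the new frame
            _meta_store[id(result)] = dict(_meta_store.get(id(self), {}))
        return result

    setattr(pl.DataFrame, method_name, wrapper)


for name in ("with_columns", "select", "filter", "sort"):
    patch_method(name)
```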
Internally, the plugin stores metadata in a global dictionary, `_df_id_to_meta`, keyed by `id(df)`, and also keeps a `weakref` to each DataFrame. As soon as a DataFrame goes out of scope and is garbage-collected, its entry in `_df_id_to_meta` is automatically removed. This prevents memory leaks and keeps the plugin simple to use.
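In sketch form, id-keyed storage with weak-reference cleanup might look like this (names mirror the description above; `attach_meta` is a hypothetical helper, not the plugin's actual source):

```python
import weakref

import polars as pl

_df_id_to_meta: dict[int, dict] = {}


def attach_meta(df: pl.DataFrame, **meta) -> None:
    key = id(df)
    _df_id_to_meta.setdefault(key, {}).update(meta)
    # when df is garbage-collected, drop its metadata entry
    weakref.finalize(df, _df_id_to_meta.pop, key, None)
```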
When you call `df.config_meta.some_method(...)`:

- The plugin checks if `some_method` exists on the plugin itself (like `set`, `get_metadata`, `write_parquet`).
- If not, it forwards the call to the underlying DataFrame's method.
- If the result is a new DataFrame/LazyFrame, it automatically copies the metadata.
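A minimal sketch of that fallthrough (illustrative, not the plugin's implementation; `_store` and `copy_metadata` are hypothetical helpers):

```python
import polars as pl

_store: dict[int, dict] = {}  # hypothetical metadata store


def copy_metadata(src, dst) -> None:
    # transfer src's stored metadata to dst
    _store[id(dst)] = dict(_store.get(id(src), {}))


class ConfigMetaNamespace:
    """Illustrative fallthrough namespace."""

    def __init__(self, df: pl.DataFrame):
        self._df = df

    def __getattr__(self, name: str):
        attr = getattr(self._df, name)  # forward unknown names to the frame
        if not callable(attr):
            return attr

        def wrapper(*args, **kwargs):
            result = attr(*args, **kwargs)
            if isinstance(result, (pl.DataFrame, pl.LazyFrame)):
                copy_metadata(self._df, result)
            return result

        return wrapper
```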
A few caveats to keep in mind:

- **Python-Layer Only**: This is purely at the Python layer; Polars doesn't guarantee stable object IDs or offer official hooks for such metadata.
- **Metadata is Ephemeral (Unless Saved)**: Metadata is stored in memory and tied to DataFrame object IDs. It won't survive serialization unless you explicitly use `df.config_meta.write_parquet()` and `read_parquet_with_meta()`.
- **Other Formats Not Supported**: Currently, only Parquet supports automatic metadata embedding/extraction. For CSV, Arrow IPC, etc., you'd need to implement your own serialization logic (see the sketch below).
- **Configuration is Global**: The `ConfigMetaOpts` settings apply globally to all DataFrames in your Python session.
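For example, one DIY option for formats without embedded metadata is a JSON sidecar file. This is a sketch, not part of the library; `write_csv_with_meta` and `read_csv_with_meta` are hypothetical helpers built on the documented `get_metadata()` and `update()` calls:

```python
import json

import polars as pl
import polars_config_meta  # registers the config_meta namespace


def write_csv_with_meta(df: pl.DataFrame, path: str) -> None:
    df.write_csv(path)
    with open(path + ".meta.json", "w") as f:
        json.dump(df.config_meta.get_metadata(), f)


def read_csv_with_meta(path: str) -> pl.DataFrame:
    df = pl.read_csv(path)
    with open(path + ".meta.json") as f:
        df.config_meta.update(json.load(f))
    return df
```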
- Issues & Discussions: Please open a GitHub issue for bugs, ideas, or questions.
- Pull Requests: PRs are welcome! This plugin is a community-driven approach to persisting DataFrame-level metadata in Polars.
There is ongoing work to support file-level metadata in Polars's own Parquet writer; see this PR for details. Once that lands, this plugin may be able to integrate more seamlessly.
This project is licensed under the MIT License.