Description
Since last April, intake-esm has implemented reading csv files with polars, which is pretty cool and should help for large ones like MRCC-disponible. However, polars.scan_csv does not have the same arguments as pd.read_csv, and our built-in read_csv_kwargs break the new version.
This means the following:
"variable" column
We use the read kwargs to convert variable tuple reps (i.e. "('pr',)") to actual tuples. intake-esm already has a trick for that in the columns_with_iterables=["variable"] init arg, which we could use. However, the polars implementation expects lists instead of tuples, so ['pr'] instead of "('pr',)". This means we need to reformat all previously existing catalogs before fixing this and upgrading intake-esm.
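A minimal sketch of what that reformatting could look like, using only the standard library. The function names and the assumption that every "variable" cell holds a valid Python literal are mine, not part of xscen or intake-esm; real catalogs may need extra handling for empty cells.

```python
import ast
import csv
import io


def tuple_repr_to_list_repr(cell: str) -> str:
    """Turn a tuple rep like "('pr',)" into a list rep like "['pr']".

    Hypothetical helper: assumes the cell is a valid Python literal.
    """
    value = ast.literal_eval(cell)
    return str(list(value))


def convert_catalog_csv(text: str, column: str = "variable") -> str:
    """Rewrite one column of a catalog CSV (passed as a string)."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = tuple_repr_to_list_repr(row[column])
        writer.writerow(row)
    return out.getvalue()
```

Cells that already hold a list rep pass through unchanged, so running the conversion twice on the same catalog should be harmless.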
Sadly, this has the consequence of storing the variables in lists, which are not hashable. This triggers errors in the test suite, for example where we look for "duplicates" (search_data_catalogs). I am looking into that.
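One possible workaround, sketched here in plain Python, is to turn list cells into tuples only when building the key used for the duplicate check. The helper names and the row layout are hypothetical and this is not how search_data_catalogs is actually written.

```python
def hashable_key(row: dict, columns: list) -> tuple:
    """Build a hashable key from a row, converting list cells to tuples."""
    return tuple(
        tuple(row[c]) if isinstance(row[c], list) else row[c] for c in columns
    )


def find_duplicates(rows: list, columns: list) -> list:
    """Return the indices of rows whose key was already seen earlier."""
    seen = set()
    duplicates = []
    for i, row in enumerate(rows):
        key = hashable_key(row, columns)
        if key in seen:
            duplicates.append(i)
        else:
            seen.add(key)
    return duplicates
```

With a pandas DataFrame, the same idea would presumably amount to applying tuple to the "variable" column before checking for duplicates.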
xrfreq
The read kwargs also convert old xrfreq values to the "new" ones.
I say that enough time has passed and we can simply drop that. "Old" xrfreq values will trigger errors and that's ok.
dtypes
The kwargs ensure string[pyarrow] as the dtype for the "path" column and "category" for the other string columns. We could change the way we specify those to match what polars wants, or simply test the performance without them. Maybe this is not needed anymore.