
Intake-esm polars' implementation incompatible with xscen's CSVs #618

@aulemahal

Since last April, intake-esm has implemented reading CSV files with polars, which is pretty cool and should help for large catalogs like MRCC-disponible. However, polars.scan_csv does not take the same arguments as pd.read_csv, and our built-in read_csv_kwargs break the new version.

This has the following consequences.

"variable" column

We use the read kwargs to convert string representations of variable tuples (i.e. "('pr',)") to actual tuples. intake-esm already has a mechanism for that, the columns_with_iterables=["variable"] init arg, which we could use. However, the polars implementation expects lists instead of tuples, so "['pr']" instead of "('pr',)". This means we need to reformat all previously existing catalogs before fixing this and upgrading intake-esm.
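The reformatting of existing catalogs could look something like the sketch below. It only uses the standard library, and the helper name tuple_repr_to_list_repr is hypothetical, not part of xscen or intake-esm. It assumes the target format is the Python-style list representation shown above ("['pr']"); whether the polars reader accepts other quoting styles is not checked here.

```python
import ast


def tuple_repr_to_list_repr(cell: str) -> str:
    """Turn a stringified tuple like "('pr',)" into a stringified list "['pr']".

    Hypothetical helper for illustration only; cells that already hold a
    stringified list are returned in the same list form.
    """
    # literal_eval safely parses the Python literal without executing code
    value = ast.literal_eval(cell)
    if not isinstance(value, (tuple, list)):
        raise ValueError(f"Expected a tuple or list representation, got: {cell!r}")
    return repr(list(value))


# Example cells as they might appear in the "variable" column of an old catalog
old_cells = ["('pr',)", "('tasmin', 'tasmax')", "['tas']"]
new_cells = [tuple_repr_to_list_repr(cell) for cell in old_cells]
print(new_cells)
```

Applying this to every cell of the "variable" column and writing the CSV back would be a one-time migration of each catalog.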

Sadly, this has the consequence of storing the variables in lists, which are not hashable. This triggers errors in the test suite, for example where we look for "duplicates" (search_data_catalogs). I am looking into that.
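To illustrate the hashability problem, here is a minimal plain-Python sketch, not the actual search_data_catalogs logic. One possible workaround, assumed here and not yet decided, is to convert lists back to tuples just before any duplicate check. The as_hashable helper and the sample rows are hypothetical.

```python
def as_hashable(value):
    """Return a tuple for list values so they can be used in sets and dict keys."""
    return tuple(value) if isinstance(value, list) else value


# Hypothetical catalog rows where "variable" was parsed as a list
rows = [
    {"id": "a", "variable": ["pr"]},
    {"id": "a", "variable": ["pr"]},  # duplicate of the first row
    {"id": "b", "variable": ["tas", "pr"]},
]

# A list cannot be hashed directly, which is what breaks duplicate detection
try:
    hash(rows[0]["variable"])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# Deduplicate on a key built from hashable versions of each column
seen = set()
unique_rows = []
for row in rows:
    key = tuple(as_hashable(row[col]) for col in ("id", "variable"))
    if key not in seen:
        seen.add(key)
        unique_rows.append(row)

print(list_is_hashable, len(unique_rows))
```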

xrfreq

The read kwargs also convert old xrfreq to the "new" ones.

I'd say enough time has passed that we can simply drop this conversion. Catalogs with "old" xrfreq values will trigger errors, and that's OK.

dtypes

The kwargs ensure string[pyarrow] as the dtype for the "path" column and "category" for the other string columns. We could change the way we specify those to match what polars expects, or simply test the performance without them. Maybe this is not needed anymore.

Labels: bug (Something isn't working)
