Replies: 1 comment
-
As a workaround I tried to coerce the variable to the vlen dtype that Xarray uses internally, but this had no effect:
VLEN_DTYPE = xr.coding.strings.create_vlen_dtype(str)
ds["foo"] = ds["foo"].astype(VLEN_DTYPE)
I think this is because
-
I want to concatenate a collection of datasets with a variable that contains variable length strings and save them to Zarr. The strings are mostly very short (1-2 characters), but a few can be much longer (thousands of characters), and can be of arbitrary length, so not suitable for a fixed size representation.
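To make the size concern concrete, here is a minimal sketch using plain NumPy. The sample strings are invented for illustration and are not taken from the real datasets. A fixed-width unicode array pads every element to the length of the longest string, while an object array only holds references to ordinary Python strings.

```python
import numpy as np

# Hypothetical sample: mostly short strings plus one very long outlier.
values = ["A", "CT", "G" * 5000]

# Fixed-width unicode: every element is padded to the longest string,
# at 4 bytes per character.
fixed = np.array(values)

# Object dtype: the array stores references to Python str objects,
# so short strings do not pay for the long one.
variable = np.array(values, dtype=object)

print(fixed.dtype, fixed.nbytes)  # dtype and size of the padded buffer
print(variable.dtype)
```

With only a few long outliers among millions of short strings, the padded buffer grows with the longest string times the number of elements, which is why a fixed-size dtype is a poor fit here.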
Zarr's VLenUTF8 is ideal for this. It is recommended on http://xarray.pydata.org/en/latest/user-guide/io.html#zarr, where it says "To store variable length strings, convert them to object arrays first with dtype=object".
The datasets to be concatenated are produced in parallel, and the size of each is not known in advance.
Here's the MVCE:
This produces a warning, because it has loaded the entire ds.foo variable into memory to determine its type. Obviously this is not scalable for large datasets.
So, my question is: how can I concatenate datasets with variable length strings without materializing them in memory?