Can't Dataset class create table directly from dict? #7202

forestbat · 2022-10-24T04:22:26Z


I have a dict named train_test_dict whose structure is like this:
datetime64[ns]: [data1, data2, ..., datas]
But when I use Dataset.from_dict to convert the dict to an xarray Dataset, it crashes like this:

past_test.py:79: in gen_train_test_x
    train_test_ds = xa.Dataset.from_dict(train_test_dict)
C:\Python310\lib\site-packages\xarray\core\dataset.py:6522: in from_dict
    variable_dict = {
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

.0 = <dict_itemiterator object at 0x000001F6B01A9530>

    variable_dict = {
>       k: (v["dims"], v["data"], v.get("attrs")) for k, v in variables
    }
E   TypeError: list indices must be integers or slices, not str

Must I convert the dict to a pandas.DataFrame and then convert the DataFrame to a Dataset?

Then I encountered a new problem when I tried to save my dataset:

nc_path = 'sta56_2020-10-18-2020-10-3_train_test_12d.nc'
test_ds.to_netcdf(nc_path)

It crashed with this report:

past_test.py:94: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
C:\Python310\lib\site-packages\xarray\core\dataset.py:1899: in to_netcdf
    return to_netcdf(  # type: ignore  # mypy cannot resolve the overloads:(
C:\Python310\lib\site-packages\xarray\backends\api.py:1182: in to_netcdf
    _validate_dataset_names(dataset)
C:\Python310\lib\site-packages\xarray\backends\api.py:162: in _validate_dataset_names
    check_name(k)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

name = 0

    def check_name(name: Hashable):
        if isinstance(name, str):
            if not name:
                raise ValueError(
                    f"Invalid name {name!r} for DataArray or Dataset key: "
                    "string must be length 1 or greater for "
                    "serialization to netCDF files"
                )
        elif name is not None:
>           raise TypeError(
                f"Invalid name {name!r} for DataArray or Dataset key: "
                "must be either a string or None for serialization to netCDF "
                "files"
            )
E           TypeError: Invalid name 0 for DataArray or Dataset key: must be either a string or None for serialization to netCDF files

C:\Python310\lib\site-packages\xarray\backends\api.py:155: TypeError

My test_ds looks like this:

<xarray.Dataset>
Dimensions:  (index: 73)
Coordinates:
  * index    (index) datetime64[ns] 2020-10-18 ... 2020-10-30
Data variables: (12/192)
    0        (index) float64 -0.2407 -0.1393 -0.712 -0.9285 ... 0.0 0.3513 0.0
    1        (index) float64 -0.09629 -0.04643 -0.8268 ... 0.7426 0.3513 0.0
    2        (index) float64 0.04815 0.0 -0.8268 -0.7892 ... 0.3713 0.0 0.0
    3        (index) float64 0.0 -0.04643 -0.8268 -0.5571 ... 0.0 0.1856 0.0 0.0
    4        (index) float64 -0.09629 -0.2553 -0.8728 -0.5107 ... 0.3713 0.0 0.0
    5        (index) float64 0.0 -0.3714 -0.7809 -0.2321 ... 0.3713 0.0 -0.361
    ...       ...
    186      (index) float64 0.0 0.0 0.1614 1.0 0.0 ... 0.0 0.0 1.0 0.0 -0.5
    187      (index) float64 0.0 0.0 0.1614 1.0 0.0 0.5 ... 0.0 0.0 1.0 0.5 -0.5
    188      (index) float64 0.0 0.0 0.1614 0.0 0.0 0.5 ... 0.0 0.0 1.0 0.0 -0.5
    189      (index) float64 0.1024 0.122 0.1614 0.0 0.0 ... 0.2825 0.0 0.0 -0.5
    190      (index) float64 0.0 0.122 0.1614 0.0 0.0 ... 0.565 0.0 -0.5 -0.5
    191      (index) float64 0.0 0.122 0.1614 0.0 1.0 ... 0.565 1.0 -0.5 -0.5

Answered by keewis


TomNicholas · 2022-10-24T16:45:06Z

Maintainer

Hi @forestbat, sorry to hear you're having problems.

when I use Dataset.from_dict to convert the dict to xarray dataset, it crashed

It doesn't look like your input follows the format required by Dataset.from_dict(): each value must itself be a dict with "dims" and "data" keys, whereas your values are plain lists. If you look at the docstring here, you can see the example given. I'm not sure exactly how you wish your dataset to be structured. Do you have many data points for each point in time? If so, perhaps you want something like this:

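import numpy as np
import xarray as xr

data1, data2, data3 = 0.1, 0.2, 0.3  # placeholder values standing in for your data
d = {
    "var": {"dims": ("t", "x"), "data": [[data1, data2, data3]]},
    "t": {"dims": ("t",), "data": [np.datetime64("2020-10-18")]},  # one timestamp
}
ds = xr.Dataset.from_dict(d)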

Here we would be creating a length-1 "t" coordinate containing one timestamp, with your multiple data points varying along an additional dimension "x". Without more detail, though, it's hard to tell whether this is what you intend.
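For reference, from_dict also accepts a nested layout with explicit "coords" and "data_vars" keys, which is the shape that Dataset.to_dict() produces. The following is only a minimal sketch under assumptions: the three timestamps, the values, and the names "t", "x" and "var" are made up for illustration and are not taken from the original data.

```python
from datetime import datetime

import xarray as xr

# Hypothetical input: three timestamps, two values per timestamp.
d = {
    "coords": {
        "t": {
            "dims": "t",
            "data": [datetime(2020, 10, 18, hour) for hour in range(3)],
        },
    },
    "data_vars": {
        "var": {
            "dims": ("t", "x"),
            "data": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
        },
    },
}

ds = xr.Dataset.from_dict(d)
print(ds.sizes)
```

With more than one timestamp, the outer list of "data" has one row per entry of "t", so the two must have the same length.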

If your data is already in a file format (e.g. netCDF), I encourage you to create your dataset with xarray.open_dataset() instead.

Must I convert the dict to a pandas.DataFrame and then convert the DataFrame to a Dataset?

You should never need to do this. I recommend you read our page on data structures and first decide how you want your dataset to be structured, then you should be able to use from_dict to create what you want directly.

Then I encountered a new problem when I tried to save my dataset:

TypeError: Invalid name 0 for DataArray or Dataset key: must be either a string or None for serialization to netCDF files

This error message is telling you what the problem is. Your test_ds uses integers as keys (i.e. names of variables in the dataset). This is allowed by xarray, but it's not allowed by netCDF. You therefore have a valid dataset, but you cannot save it as a valid netCDF file using ints as keys. You could use .rename to change the names of your variables to strings.
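A minimal sketch of that renaming step, assuming a small stand-in dataset with integer variable names similar to test_ds. The "col_" prefix and the toy values are arbitrary choices made for this example, not something xarray or netCDF requires.

```python
import numpy as np
import xarray as xr

# Hypothetical dataset with integer variable names, similar to test_ds.
index = np.array(
    ["2020-10-18", "2020-10-19", "2020-10-20"], dtype="datetime64[ns]"
)
ds = xr.Dataset(
    {i: ("index", np.arange(3.0) + i) for i in range(4)},
    coords={"index": index},
)

# Map every non-string variable name to a string name.
name_map = {
    name: f"col_{name}" for name in ds.data_vars if not isinstance(name, str)
}
renamed = ds.rename(name_map)

print(list(renamed.data_vars))
```

After this, every data variable name is a string, which is what the name check in the traceback requires.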

However, it looks to me as if you are using many different variables to store different values of your data, when you should instead store the different values as varying along a dimension of the dataset. Can you represent your data as a single numpy array first?
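One possible way to do that, sketched under the assumption that every list in the dict has the same length. The dict contents, the dimension names "time" and "column", and the variable name "features" are all hypothetical.

```python
from datetime import datetime

import numpy as np
import xarray as xr

# Hypothetical stand-in for train_test_dict: timestamp -> equal-length list.
train_test_dict = {
    datetime(2020, 10, 18, 0): [0.1, 0.2, 0.3],
    datetime(2020, 10, 18, 4): [0.4, 0.5, 0.6],
}

# Keys become the time coordinate, values become the rows of a 2-D array.
times = np.array(list(train_test_dict.keys()), dtype="datetime64[ns]")
values = np.array(list(train_test_dict.values()))  # shape (n_times, n_columns)

da = xr.DataArray(
    values,
    dims=("time", "column"),
    coords={"time": times},
    name="features",
)
ds = da.to_dataset()
print(ds)
```

Because the only variable name here is the string "features", this layout does not run into the integer-name restriction when saving.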

1 reply

forestbat Oct 24, 2022
Author

This is my data structure as a pandas DataFrame; its original form is a dict like {datetime64[ns]: list}, and the index is the datetime.
Although I have read articles about xarray, I'm honestly a bit 'lazy' and don't want to specify the columns of my dataset manually, which is why I asked this question.

keewis · 2022-10-24T18:11:39Z

Maintainer

Given the structure you posted in #7202 (reply in thread) (mapping of timestep to row), you can translate to something .from_dict accepts with:

In [8]: import xarray as xr
   ...: from datetime import datetime
   ...: 
   ...: original_data = {
   ...:     datetime(2022, 10, 24, 0, 0, 0): [0, 1, 2, 3, 4, 5],
   ...:     datetime(2022, 10, 24, 1, 0, 0): [1, 2, 3, 4, 5, 6],
   ...:     datetime(2022, 10, 24, 2, 0, 0): [2, 3, 4, 5, 6, 7],
   ...: }
   ...: data = {
   ...:     "time": {"dims": "time", "data": list(original_data.keys())},
   ...:     "data": {"dims": ["time", "columns"], "data": list(original_data.values())},
   ...: }
   ...: ds = xr.Dataset.from_dict(data)
   ...: ds
Out[8]: 
<xarray.Dataset>
Dimensions:  (time: 3, columns: 6)
Coordinates:
  * time     (time) datetime64[ns] 2022-10-24 ... 2022-10-24T02:00:00
Dimensions without coordinates: columns
Data variables:
    data     (time, columns) int64 0 1 2 3 4 5 1 2 3 4 5 6 2 3 4 5 6 7

In [9]: ds.data
Out[9]: 
<xarray.DataArray 'data' (time: 3, columns: 6)>
array([[0, 1, 2, 3, 4, 5],
       [1, 2, 3, 4, 5, 6],
       [2, 3, 4, 5, 6, 7]])
Coordinates:
  * time     (time) datetime64[ns] 2022-10-24 ... 2022-10-24T02:00:00
Dimensions without coordinates: columns

This will give you a dataset with the structure @TomNicholas suggested. If you actually need the structure you described in #7202 (comment) (one variable per column), you can get that by converting the dataset:

In [10]: by_column = ds.data.to_dataset(dim="columns")
    ...: by_column
Out[10]: 
<xarray.Dataset>
Dimensions:  (time: 3)
Coordinates:
  * time     (time) datetime64[ns] 2022-10-24 ... 2022-10-24T02:00:00
Data variables:
    0        (time) int64 0 1 2
    1        (time) int64 1 2 3
    2        (time) int64 2 3 4
    3        (time) int64 3 4 5
    4        (time) int64 4 5 6
    5        (time) int64 5 6 7

However, the netCDF file format does not seem to allow integer variable names (even though xarray does), so I'd recommend saving ds instead of by_column, or alternatively renaming with:

In [11]: renamed = by_column.rename({name: str(name) for name in by_column.data_vars.keys()})
    ...: renamed
Out[11]: 
<xarray.Dataset>
Dimensions:  (time: 3)
Coordinates:
  * time     (time) datetime64[ns] 2022-10-24 ... 2022-10-24T02:00:00
Data variables:
    0        (time) int64 0 1 2
    1        (time) int64 1 2 3
    2        (time) int64 2 3 4
    3        (time) int64 3 4 5
    4        (time) int64 4 5 6
    5        (time) int64 5 6 7

and save that.
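Going the other way, a dataset that already has one variable per column (like by_column, or the test_ds from the question) can be collapsed back into a single 2-D variable. This is only a sketch: the stand-in dataset below is rebuilt by hand, and the names "columns" and "data" simply mirror the example above.

```python
import numpy as np
import xarray as xr

# Hypothetical dataset with one integer-named variable per column.
time = np.array(
    ["2022-10-24T00", "2022-10-24T01", "2022-10-24T02"], dtype="datetime64[ns]"
)
by_column = xr.Dataset(
    {i: ("time", np.arange(3) + i) for i in range(6)},
    coords={"time": time},
)

# Stack the variables along a new "columns" dimension, then put "time" first.
stacked = by_column.to_array(dim="columns").transpose("time", "columns")
ds = stacked.to_dataset(name="data")
print(ds)
```

The former variable names end up as values of the "columns" coordinate, so no variable name in the result is an integer.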

0 replies
