`Dataset.to_array()` consists of broadcasting all data variables in the dataset against each other, then concatenating them along a new dimension into a new array while preserving coordinates. `DataArray.to_dataset()` performs the inverse operation.
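For concreteness, here is a minimal round-trip sketch (the variable names and shapes are made up):

```python
import numpy as np
import xarray as xr

# Toy dataset: two variables sharing a "time" dimension.
ds = xr.Dataset(
    {
        "temperature": ("time", np.random.rand(10)),
        "pressure": ("time", np.random.rand(10)),
    },
    coords={"time": np.arange(10)},
)

# Stack all data variables along a new "variable" dimension.
da = ds.to_array(dim="variable")  # dims: ("variable", "time")

# Split them back out into a Dataset (the inverse operation).
ds2 = da.to_dataset(dim="variable")
```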
I was wondering whether these two operations involve copying the data, or whether the reshaping is done by referencing the original values in the background.
When dealing with very large tensors lazily loaded with dask, every time `to_array()` and `to_dataset()` are called, am I effectively passing through all the data? I cannot find any documentation on this.
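One way I have tried to probe this myself is to check whether the result still wraps a dask array after conversion (sketch below; the file name and chunk sizes are placeholders):

```python
import xarray as xr

# Hypothetical file opened lazily with dask chunks.
ds = xr.open_dataset("data.nc", chunks={"time": 1000})

da = ds.to_array(dim="variable")

# If the conversion stays lazy, the result still wraps a dask array and
# nothing has been read yet; only .compute() / .load() would trigger IO.
print(type(da.data))  # expect a dask array if lazy
print(da.chunks)      # chunk layout of the still-unloaded result
```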
I often read a large DataArray with various variables stacked along a "feature" dimension, and then apply a function to a specific subset of that "feature" dimension. My first instinct is to convert the DataArray to a Dataset and then apply the function. But is this efficient when dealing with large DataArrays? Is smart referencing performed?
On the other hand, defining functions on DataArray subsets is quite ugly (see below).
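Something along these lines (a made-up illustration, assuming a "feature" dimension with labeled entries):

```python
import numpy as np
import xarray as xr

# Hypothetical DataArray with three variables stacked along "feature".
da = xr.DataArray(
    np.random.rand(3, 100),
    dims=("feature", "time"),
    coords={"feature": ["u", "v", "w"]},
)

# Option 1: convert to a Dataset first, then use named variables.
ds = da.to_dataset(dim="feature")
speed = np.hypot(ds["u"], ds["v"])

# Option 2: work on the DataArray directly -- every access needs an
# explicit .sel() on the "feature" dimension, which gets verbose fast.
speed = np.hypot(da.sel(feature="u"), da.sel(feature="v"))
```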
Thanks in advance for your answers :)