10. Time series - Try to reproduce the 'Windowing dataset' section with tf.keras.preprocessing.timeseries_dataset_from_array() #199

remy-r · 2021-09-17T10:06:36Z

Hi,

I am trying to reproduce the section 'Format Data Part 2: Windowing dataset' with the TensorFlow function built for this purpose: tf.keras.preprocessing.timeseries_dataset_from_array()

In the notebook, this is the method used:

...

full_windows, full_labels = make_windows(prices, window_size=WINDOW_SIZE, horizon=HORIZON)


...

train_windows, test_windows, train_labels, test_labels = make_train_test_splits(full_windows, full_labels)

What I tried using the timeseries_dataset_from_array function:

split = int(0.8 * len(prices))

windows_dataset = tf.keras.utils.timeseries_dataset_from_array(prices, prices[WINDOW:],
                                             sequence_length=WINDOW, batch_size=len(prices)).unbatch()

train_windows_ds = windows_dataset.take(split).batch(128)
test_windows_ds = windows_dataset.skip(split).batch(128)

The shape of my datasets looks suspicious: I was expecting ((None, 7), (None,))

train_windows_ds
<BatchDataset shapes: ((None, None), (None,)), types: (tf.float64, tf.float64)>

The problem is that when I train my model, I get a very different MAE from the notebook:

model_1 = tf.keras.Sequential([
  tf.keras.layers.Input(shape=(None, 7)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(HORIZON, activation='linear')
], name="model_1")

model_1.compile(loss=tf.keras.losses.mae,
                optimizer=tf.keras.optimizers.Adam(),
                metrics=['mae'])

model_1.fit(train_windows_ds,
            epochs=100,
            validation_data=test_windows_ds,
            callbacks=[make_callbacks(name=model_1.name)])

model_1 = tf.keras.models.load_model("model_experiments/model_1")
model_1.evaluate(test_windows_ds)

=== >  890.6848754882812

890 is very different from the 568 the notebook gets when evaluating Model 1.

Does anyone see what I'm missing?

Thank you for your help

mrdbourke · 2021-09-20T06:15:52Z

mrdbourke
Sep 20, 2021
Maintainer

Edit: @remy-r has discovered the difference is because NumPy arrays are shuffled by default (shuffle=True) when using the fit() function - https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit

However, when using tf.data.Dataset objects, such as those created with tf.keras.utils.timeseries_dataset_from_array(), the shuffle parameter in fit() gets ignored.

This explains the different results.

When tf.data.Dataset objects are shuffled manually with .shuffle(), the results start to line up.

See the comment below for more: #199 (reply in thread)

Hey there,

Massive effort giving this a go.

I'd say the difference could be coming from when you create your train/test sets.

You may have to make the split before you window.

I'd also check to see where the starting indexes are using the start_index parameter in https://www.tensorflow.org/api_docs/python/tf/keras/utils/timeseries_dataset_from_array

Check that the windows_dataset = tf.keras.utils.timeseries_dataset_from_array(prices, prices[WINDOW:], ... line in your code is correct; you may need to index prices differently to line things up.

Example 2 in the docs is the most similar to what you'd want to set up:

input_data = data[:-10] # <- inputs indexed
targets = data[10:] # <- also indexed
dataset = tf.keras.preprocessing.timeseries_dataset_from_array(
    input_data, targets, sequence_length=10)
for batch in dataset:
  inputs, targets = batch
  assert np.array_equal(inputs[0], data[:10])  # First sequence: steps [0-9]
  assert np.array_equal(targets[0], data[10])  # Corresponding target: step 10
  break
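To see the window/target alignment without TensorFlow, here is a minimal NumPy sketch. The toy series, the WINDOW value of 7 and the variable names are assumptions for illustration, not code from the notebook. It uses data[:-1] as inputs (rather than data[:-WINDOW]) so that the number of windows exactly matches the number of targets for a horizon of 1.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

WINDOW = 7                                # assumed window size, as used in this thread
data = np.arange(20, dtype=np.float64)    # toy series standing in for prices (value == index)

inputs = data[:-1]        # the last value can only ever be a target
targets = data[WINDOW:]   # targets[i] is the value right after window i

# One row per window: row i holds data[i : i + WINDOW]
windows = sliding_window_view(inputs, WINDOW)

print(windows.shape, targets.shape)   # same number of rows in both
print(windows[0], targets[0])         # first window and its target
print(windows[-1], targets[-1])       # last window and its target
```

Because the toy values equal their indexes, each target should be exactly one more than the last value of its window, which makes an off-by-one in the slicing easy to spot.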

4 replies

remy-r Sep 21, 2021
Author

Thank you Daniel for taking the time to look at my problem.

I also feel there is something wrong in my data splitting...

So I tried to split before windowing, as you advised:

split = int(0.8 * len(prices))

features_prices_train = prices[:split+WINDOW-1]
features_prices_test = prices[split:-1]

labels_prices_train = prices[WINDOW:split+WINDOW]
labels_prices_test = prices[split+WINDOW:]

train_windows_ds = tf.keras.utils.timeseries_dataset_from_array(features_prices_train, labels_prices_train,
                                             sequence_length=WINDOW).prefetch(tf.data.AUTOTUNE)
test_windows_ds = tf.keras.utils.timeseries_dataset_from_array(features_prices_test, labels_prices_test,
                                             sequence_length=WINDOW).prefetch(tf.data.AUTOTUNE)
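As a sanity check on the slicing above, the arithmetic can be verified on a toy series without building any datasets. This is only a sketch: the toy prices array and the count_windows helper are hypothetical, introduced for illustration. It checks that each slice of features yields exactly as many windows as there are labels, and that the labels at the train/test boundary follow on from their windows.

```python
import numpy as np

WINDOW = 7
prices = np.arange(50, dtype=np.float64)   # toy series; value == index
split = int(0.8 * len(prices))

# Same slicing as in the post above
features_prices_train = prices[:split + WINDOW - 1]
features_prices_test = prices[split:-1]
labels_prices_train = prices[WINDOW:split + WINDOW]
labels_prices_test = prices[split + WINDOW:]

def count_windows(series, window):
    """Hypothetical helper: number of full windows of length `window` in `series`."""
    return len(series) - window + 1

n_train = count_windows(features_prices_train, WINDOW)
n_test = count_windows(features_prices_test, WINDOW)

# Every window needs exactly one label
assert n_train == len(labels_prices_train)
assert n_test == len(labels_prices_test)

# Last train window / label and first test window / label
last_train_window = features_prices_train[n_train - 1:n_train - 1 + WINDOW]
first_test_window = features_prices_test[:WINDOW]
print(last_train_window, labels_prices_train[-1])
print(first_test_window, labels_prices_test[0])
```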

With the timeseries_dataset_from_array implementation above, I get exactly the same split as with the notebook method:

First five training windows with the notebook method:

train_windows[:5], train_labels[:5]
====>
(array([[123.65499, 125.455  , 108.58483, 118.67466, 121.33866, 120.65533,
         121.795  ],
        [125.455  , 108.58483, 118.67466, 121.33866, 120.65533, 121.795  ,
         123.033  ],
        [108.58483, 118.67466, 121.33866, 120.65533, 121.795  , 123.033  ,
         124.049  ],
        [118.67466, 121.33866, 120.65533, 121.795  , 123.033  , 124.049  ,
         125.96116],
        [121.33866, 120.65533, 121.795  , 123.033  , 124.049  , 125.96116,
         125.27966]]), array([[123.033  ],
        [124.049  ],
        [125.96116],
        [125.27966],
        [125.9275 ]]))

First five training windows with the timeseries_dataset_from_array method:

train_windows_ds.take(1).get_single_element()[0][:5], train_windows_ds.take(1).get_single_element()[1][:5]
====>
(<tf.Tensor: shape=(5, 7), dtype=float64, numpy=
 array([[123.65499, 125.455  , 108.58483, 118.67466, 121.33866, 120.65533,
         121.795  ],
        [125.455  , 108.58483, 118.67466, 121.33866, 120.65533, 121.795  ,
         123.033  ],
        [108.58483, 118.67466, 121.33866, 120.65533, 121.795  , 123.033  ,
         124.049  ],
        [118.67466, 121.33866, 120.65533, 121.795  , 123.033  , 124.049  ,
         125.96116],
        [121.33866, 120.65533, 121.795  , 123.033  , 124.049  , 125.96116,
         125.27966]])>,
 <tf.Tensor: shape=(5,), dtype=float64, numpy=array([123.033  , 124.049  , 125.96116, 125.27966, 125.9275 ])>)

Last five training windows with the notebook method:

train_windows[-5:], train_labels[-5:]
====>
(array([[9290.89660239, 9202.41545055, 9369.62808116, 9326.59962378,
         9335.75240233, 9226.48582088, 8794.35864452],
        [9202.41545055, 9369.62808116, 9326.59962378, 9335.75240233,
         9226.48582088, 8794.35864452, 8798.04205463],
        [9369.62808116, 9326.59962378, 9335.75240233, 9226.48582088,
         8794.35864452, 8798.04205463, 9081.18687849],
        [9326.59962378, 9335.75240233, 9226.48582088, 8794.35864452,
         8798.04205463, 9081.18687849, 8711.53433917],
        [9335.75240233, 9226.48582088, 8794.35864452, 8798.04205463,
         9081.18687849, 8711.53433917, 8760.89271814]]),
 array([[8798.04205463],
        [9081.18687849],
        [8711.53433917],
        [8760.89271814],
        [8749.52059102]]))

Last five training windows with the timeseries_dataset_from_array method:

train_windows_ds.skip(len(train_windows_ds)-1).get_single_element()[0][-5:], train_windows_ds.skip(len(train_windows_ds)-1).get_single_element()[1][-5:]
====>
(<tf.Tensor: shape=(5, 7), dtype=float64, numpy=
 array([[9290.89660239, 9202.41545055, 9369.62808116, 9326.59962378,
         9335.75240233, 9226.48582088, 8794.35864452],
        [9202.41545055, 9369.62808116, 9326.59962378, 9335.75240233,
         9226.48582088, 8794.35864452, 8798.04205463],
        [9369.62808116, 9326.59962378, 9335.75240233, 9226.48582088,
         8794.35864452, 8798.04205463, 9081.18687849],
        [9326.59962378, 9335.75240233, 9226.48582088, 8794.35864452,
         8798.04205463, 9081.18687849, 8711.53433917],
        [9335.75240233, 9226.48582088, 8794.35864452, 8798.04205463,
         9081.18687849, 8711.53433917, 8760.89271814]])>,
 <tf.Tensor: shape=(5,), dtype=float64, numpy=
 array([8798.04205463, 9081.18687849, 8711.53433917, 8760.89271814,
        8749.52059102])>)

First five test windows with the notebook method:

test_windows[:5], test_labels[:5]
====>
(array([[9226.48582088, 8794.35864452, 8798.04205463, 9081.18687849,
         8711.53433917, 8760.89271814, 8749.52059102],
        [8794.35864452, 8798.04205463, 9081.18687849, 8711.53433917,
         8760.89271814, 8749.52059102, 8656.97092235],
        [8798.04205463, 9081.18687849, 8711.53433917, 8760.89271814,
         8749.52059102, 8656.97092235, 8500.64355816],
        [9081.18687849, 8711.53433917, 8760.89271814, 8749.52059102,
         8656.97092235, 8500.64355816, 8469.2608989 ],
        [8711.53433917, 8760.89271814, 8749.52059102, 8656.97092235,
         8500.64355816, 8469.2608989 , 8537.33965197]]),
 array([[8656.97092235],
        [8500.64355816],
        [8469.2608989 ],
        [8537.33965197],
        [8205.80636599]]))

First five test windows with the timeseries_dataset_from_array method:

test_windows_ds.take(1).get_single_element()[0][:5], test_windows_ds.take(1).get_single_element()[1][:5]
====>
(<tf.Tensor: shape=(5, 7), dtype=float64, numpy=
 array([[9226.48582088, 8794.35864452, 8798.04205463, 9081.18687849,
         8711.53433917, 8760.89271814, 8749.52059102],
        [8794.35864452, 8798.04205463, 9081.18687849, 8711.53433917,
         8760.89271814, 8749.52059102, 8656.97092235],
        [8798.04205463, 9081.18687849, 8711.53433917, 8760.89271814,
         8749.52059102, 8656.97092235, 8500.64355816],
        [9081.18687849, 8711.53433917, 8760.89271814, 8749.52059102,
         8656.97092235, 8500.64355816, 8469.2608989 ],
        [8711.53433917, 8760.89271814, 8749.52059102, 8656.97092235,
         8500.64355816, 8469.2608989 , 8537.33965197]])>,
 <tf.Tensor: shape=(5,), dtype=float64, numpy=
 array([8656.97092235, 8500.64355816, 8469.2608989 , 8537.33965197,
        8205.80636599])>)

Last five test windows with the notebook method:

test_windows[-5:], test_labels[-5:]
====>
(array([[56583.84987917, 57107.12067189, 58788.20967893, 58102.19142623,
         55715.54665129, 56573.5554719 , 52147.82118698],
        [57107.12067189, 58788.20967893, 58102.19142623, 55715.54665129,
         56573.5554719 , 52147.82118698, 49764.1320816 ],
        [58788.20967893, 58102.19142623, 55715.54665129, 56573.5554719 ,
         52147.82118698, 49764.1320816 , 50032.69313676],
        [58102.19142623, 55715.54665129, 56573.5554719 , 52147.82118698,
         49764.1320816 , 50032.69313676, 47885.62525472],
        [55715.54665129, 56573.5554719 , 52147.82118698, 49764.1320816 ,
         50032.69313676, 47885.62525472, 45604.61575361]]),
 array([[49764.1320816 ],
        [50032.69313676],
        [47885.62525472],
        [45604.61575361],
        [43144.47129086]]))

Last five test windows with the timeseries_dataset_from_array method:

test_windows_ds.skip(len(test_windows_ds)-1).get_single_element()[0][-5:], test_windows_ds.skip(len(test_windows_ds)-1).get_single_element()[1][-5:]
====>
(<tf.Tensor: shape=(5, 7), dtype=float64, numpy=
 array([[56583.84987917, 57107.12067189, 58788.20967893, 58102.19142623,
         55715.54665129, 56573.5554719 , 52147.82118698],
        [57107.12067189, 58788.20967893, 58102.19142623, 55715.54665129,
         56573.5554719 , 52147.82118698, 49764.1320816 ],
        [58788.20967893, 58102.19142623, 55715.54665129, 56573.5554719 ,
         52147.82118698, 49764.1320816 , 50032.69313676],
        [58102.19142623, 55715.54665129, 56573.5554719 , 52147.82118698,
         49764.1320816 , 50032.69313676, 47885.62525472],
        [55715.54665129, 56573.5554719 , 52147.82118698, 49764.1320816 ,
         50032.69313676, 47885.62525472, 45604.61575361]])>,
 <tf.Tensor: shape=(5,), dtype=float64, numpy=
 array([49764.1320816 , 50032.69313676, 47885.62525472, 45604.61575361,
        43144.47129086])>)

With this implementation I'm sure that train_windows, train_labels, test_windows, test_labels are equivalent to train_windows_ds, test_windows_ds.

But when I train the model:

model_1 = tf.keras.Sequential([
  tf.keras.layers.Input(shape=(None, 7)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(HORIZON, activation='linear')
], name="model_1")

model_1.compile(loss=tf.keras.losses.mae,
                optimizer=tf.keras.optimizers.Adam(),
                metrics=['mae'])

model_1.fit(train_windows_ds,
            steps_per_epoch = len(train_windows_ds),
            epochs=100,
            validation_data=test_windows_ds,
            validation_steps = len(test_windows_ds),
            callbacks=[make_callbacks(name=model_1.name)])

model_1 = tf.keras.models.load_model("model_experiments/model_1")
model_1.evaluate(test_windows_ds)

The result is an MAE > 700.

I don't understand how the MAE can be different with the same data.

Thank you for your help

mrdbourke Sep 22, 2021
Maintainer

Woah, they do look the same. I'm not sure what's going on here either.

It could be datatypes? Can you confirm both are float32 or float64? It shouldn't influence that much but maybe...

And are the batch sizes the same?

remy-r Sep 22, 2021
Author

I confirm the datatypes and batch sizes are the same.

I found two possible causes of my problem:

First: my Input layer has the wrong shape ==> I have to use (7,) and not (None, 7)
But even with this correction the MAE values are still very different.
Second: the way data is shuffled in the Model.fit method differs between np.array inputs and Dataset inputs:

https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit

shuffle	Boolean (whether to shuffle the training data before each epoch) or str (for 'batch'). This argument is ignored when x is a generator or an object of tf.data.Dataset. 'batch' is a special option for dealing with the limitations of HDF5 data; it shuffles in batch-sized chunks. Has no effect when steps_per_epoch is not None.

The default value is shuffle=True

So, as I understand it, np.array inputs are shuffled before each epoch by default. That is not the case with datasets.

So when I shuffle my train dataset before training, I finally achieve an MAE of 590 with tf.keras.preprocessing.timeseries_dataset_from_array:

split = int(0.8 * len(prices))

features_prices_train = prices[:split+WINDOW-1]
features_prices_test = prices[split:-1]

labels_prices_train = prices[WINDOW:split+WINDOW]
labels_prices_test = prices[split+WINDOW:]

train_windows_ds = tf.keras.utils.timeseries_dataset_from_array(features_prices_train, labels_prices_train, batch_size=128,
                                             sequence_length=WINDOW).shuffle(buffer_size=10000).prefetch(tf.data.AUTOTUNE)
test_windows_ds = tf.keras.utils.timeseries_dataset_from_array(features_prices_test, labels_prices_test, batch_size=128,
                                             sequence_length=WINDOW).prefetch(tf.data.AUTOTUNE)
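One thing to keep in mind with the snippet above: .shuffle() is applied after batching, so it reorders whole batches of 128 windows while the windows inside each batch stay in chronological order. Below is a minimal sketch of two ways to shuffle individual windows instead. The toy series, the batch size of 32 and the variable names are assumptions for illustration; whether this closes the remaining gap to the notebook's MAE has not been tested here.

```python
import numpy as np
import tensorflow as tf

WINDOW = 7
prices = np.arange(100, dtype=np.float64)   # toy stand-in for the real price series

features = prices[:-1]      # the last value is only ever a label
labels = prices[WINDOW:]    # label for the window starting at i is prices[i + WINDOW]

# Option 1: let the utility shuffle individual windows before it batches them.
shuffled_ds = tf.keras.utils.timeseries_dataset_from_array(
    features, labels, sequence_length=WINDOW, batch_size=32,
    shuffle=True, seed=42)

# Option 2: build the windows in order, then unbatch -> shuffle -> batch again.
ordered_ds = tf.keras.utils.timeseries_dataset_from_array(
    features, labels, sequence_length=WINDOW, batch_size=32)
reshuffled_ds = (ordered_ds.unbatch()
                 .shuffle(buffer_size=1000, seed=42)
                 .batch(32)
                 .prefetch(tf.data.AUTOTUNE))

# Whatever the order, each label must still follow its own window.
# (On the toy series, the label is the last window value plus one.)
for windows, targets in shuffled_ds:
    assert np.allclose(targets.numpy(), windows.numpy()[:, -1] + 1)
```

The test dataset is left unshuffled in both options, since shuffling does not change evaluation metrics such as MAE.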

Is my analysis wrong?

Thank you

mrdbourke Sep 23, 2021
Maintainer

Wow, great find.

You're right, it seems shuffle=True is the default for fit() when using NumPy arrays (like the notebook), but shuffle gets ignored when using tf.data.Dataset (like your example).

Though when you manually set shuffle with .shuffle() on your tf.data.Dataset, the results start to line up more.

This makes sense of the different results you've found.

Thank you for following up.
