As more and more people are using PyTorch now, I wonder if we could have a PyTorch example for pescador.
In fact, I already tried a few things...
Let's assume the (among audio researchers) infamous scenario of randomly sampling small excerpts from longer audio tracks, where we want to see all of the data once per epoch, in random order:
```python
import numpy as np
import pescador

# hyperparameters (undefined in the original snippet; placeholder values for illustration)
nb_tracks = 10
track_length = 4096
excerpt_length = 256
excerpt_hop = 128
batch_size = 16

# define audio tracks as (nb_samples, nb_features)
tracks = [np.random.random((track_length, 1)) for i in range(nb_tracks)]

# yield excerpts from audio tracks
def excerpt_gen(data):
    for i in range(0, data.shape[0] - excerpt_length, excerpt_hop):
        yield dict(X=data[i:i + excerpt_length, :])

# set up one streamer per track
streams = [pescador.Streamer(excerpt_gen, track) for track in tracks]

# randomly sample from the streamers, exhausting each stream exactly once
mux = pescador.StochasticMux(streams, nb_tracks, rate=None, mode='exhaustive')
buffered_sample_gen = pescador.buffer_stream(mux, batch_size)

# iterate over the data
for batch in buffered_sample_gen:
    print(batch['X'].mean())
```
While this would obviously work with PyTorch directly by feeding in the batches, I wonder if we could leverage PyTorch's `Dataset` and `DataLoader` classes to simplify the code and maybe make use of PyTorch's internal parallelization within the `DataLoader`.
It turns out PyTorch does allow you to override the `Sampler` and `BatchSampler` classes used by its `DataLoader` (see here). But since they are all based on indices into your dataset, using pescador there wouldn't be exactly elegant (or I'm just missing the point).
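For reference, here's a minimal sketch of that index-based contract (the class name is hypothetical, and this assumes the `torch.utils.data.Sampler` base class): a sampler's `__iter__` has to yield integer indices into the dataset, which is exactly what a pescador stream doesn't produce:

```python
import torch
import torch.utils.data

class ShuffledIndexSampler(torch.utils.data.Sampler):
    """Hypothetical sampler: yields dataset indices in random order."""
    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        # the DataLoader consumes indices, not samples
        return iter(torch.randperm(len(self.data_source)).tolist())

    def __len__(self):
        return len(self.data_source)

# plugged in via e.g.
# DataLoader(dataset, sampler=ShuffledIndexSampler(dataset))
```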
For now, I came up with the following, which works and yields the same batches as the vanilla example above. It extends the `Dataset` class so that the `DataLoader` simply fetches more samples and the index is ignored.
```python
import numpy as np
import pescador
import torch.utils.data

# same hyperparameters as in the vanilla example above
nb_tracks = 10
track_length = 4096
excerpt_length = 256
excerpt_hop = 128
batch_size = 16

# define audio tracks as (nb_samples, nb_features)
tracks = [np.random.random((track_length, 1)) for i in range(nb_tracks)]

# yield excerpts from audio tracks
def excerpt_gen(data):
    for i in range(0, data.shape[0] - excerpt_length, excerpt_hop):
        yield dict(X=data[i:i + excerpt_length, :])

class TrackData(torch.utils.data.Dataset):
    def __init__(self, tracks):
        self.tracks = tracks
        self.streams = [pescador.Streamer(excerpt_gen, track) for track in tracks]
        self.mux = pescador.tuples(
            pescador.StochasticMux(
                self.streams, nb_tracks, rate=None, mode='exhaustive'
            ),
            'X', 'X'
        )

    def __len__(self):
        # total number of excerpts that excerpt_gen yields across all tracks
        return len(self.tracks) * len(
            range(0, track_length - excerpt_length, excerpt_hop)
        )

    def __iter__(self):
        # pescador.tuples returns a plain generator, so iterate over it directly
        return iter(self.mux)

    def __getitem__(self, idx):
        # ignore the index and just draw the next sample from the mux
        return next(self.mux)

dataset = TrackData(tracks)
train_loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=batch_size
)

for batch_idx, (X, y) in enumerate(train_loader):
    print(X.mean())
```
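One caveat I'm aware of (untested here): since `__getitem__` ignores the index, the order of samples is driven entirely by the mux, and with `num_workers > 0` each `DataLoader` worker gets its own copy of the dataset (and hence of the mux), so the workers would draw overlapping excerpts independently. The example above therefore assumes single-process loading.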
I would love to hear your feedback on this, and of course I would be happy to make a PR once we agree on an elegant solution.