
add pytorch example #133

@faroit

Description

As more and more people are using PyTorch now, I wonder if we could have a PyTorch example for use with pescador?

In fact, I already tried a few things.

Let's assume the infamous (among audio researchers) scenario of randomly sampling small excerpts from longer audio tracks, where we want to see all of the data once per epoch, in random order:

import numpy as np
import pescador

# example parameters (values are only illustrative)
nb_tracks = 10
track_length = 1000
excerpt_length = 100
excerpt_hop = 50
batch_size = 16

# define audio tracks as (nb_samples, nb_features)
tracks = [np.random.random((track_length, 1)) for i in range(nb_tracks)]

# yield excerpts from an audio track
def excerpt_gen(data):
    for i in range(0, data.shape[0] - excerpt_length, excerpt_hop):
        yield dict(X=data[i:i + excerpt_length, :])

# set up one streamer per track
streams = [pescador.Streamer(excerpt_gen, track) for track in tracks]

# randomly sample from the streamers until all are exhausted
mux = pescador.StochasticMux(streams, nb_tracks, rate=None, mode='exhaustive')

# collect samples into batches of size batch_size
buffered_sample_gen = pescador.buffer_stream(mux, batch_size)

# iterate over the data
for batch in buffered_sample_gen:
    print(batch['X'].mean())
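As far as I understand the pescador API, with rate=None and mode='exhaustive' the StochasticMux plays every streamer until it is exhausted and does not revive it, so each excerpt is seen exactly once per epoch, just interleaved in random order across tracks.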

While this would obviously work with PyTorch directly by feeding in batches of data, I wonder if we could leverage the PyTorch Dataset and DataLoader classes to simplify the code and maybe make use of PyTorch's internal parallelization within the DataLoader.

It turns out PyTorch does allow you to override the Sampler and BatchSampler (see here) used by its DataLoader. But since those are all based on indices into your dataset class, using pescador for this wouldn't be exactly elegant (or I'm just missing the point). A custom sampler has to produce integer indices, so the best we could do is emit dummy indices and ignore them, as the sketch below shows.
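Just to illustrate the mismatch, a minimal sampler sketch (RandomStreamSampler is a hypothetical name, not part of any library):

import torch.utils.data

class RandomStreamSampler(torch.utils.data.Sampler):
    # a Sampler must yield indices into the dataset, but a pescador
    # stream has no meaningful index, so we can only emit dummy values
    # and ignore them on the dataset side
    def __init__(self, nb_samples):
        self.nb_samples = nb_samples

    def __iter__(self):
        # dummy indices; the real sample order would be decided by the mux
        return iter(range(self.nb_samples))

    def __len__(self):
        return self.nb_samples

That index plumbing is exactly the part that feels redundant once pescador already controls the sampling.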

For now, I came up with the following, which works and yields the same batches as the vanilla example above. It works by extending the Dataset class so that the DataLoader just draws successive samples and ignores the index.

import numpy as np
import pescador
import torch.utils.data

# reuses the parameters (nb_tracks, track_length, excerpt_length,
# excerpt_hop, batch_size) defined in the first example

# define audio tracks as (nb_samples, nb_features)
tracks = [np.random.random((track_length, 1)) for i in range(nb_tracks)]

# yield excerpts from an audio track
def excerpt_gen(data):
    for i in range(0, data.shape[0] - excerpt_length, excerpt_hop):
        yield dict(X=data[i:i + excerpt_length, :])

class TrackData(torch.utils.data.Dataset):
    def __init__(self, tracks):
        self.tracks = tracks
        self.streams = [pescador.Streamer(excerpt_gen, track) for track in tracks]
        # tuples() turns each dict sample into an (X, X) pair, so the
        # DataLoader sees a (data, target)-style tuple
        self.mux = pescador.tuples(
            pescador.StochasticMux(
                self.streams, nb_tracks, rate=None, mode='exhaustive'
            ),
            'X', 'X'
        )

    def __len__(self):
        # number of excerpts per track, matching the range() in excerpt_gen
        excerpts_per_track = len(range(0, track_length - excerpt_length, excerpt_hop))
        return len(self.tracks) * excerpts_per_track

    def __iter__(self):
        return iter(self.mux)

    def __getitem__(self, idx):
        # ignore idx: just draw the next sample from the mux
        return next(self.mux)

dataset = TrackData(tracks)

train_loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=batch_size
)

for batch_idx, (X, y) in enumerate(train_loader):
    print(X.mean())
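A rough alternative sketch (untested, and assuming a PyTorch version that provides torch.utils.data.IterableDataset; the class name PescadorDataset is hypothetical) would skip the index-based protocol entirely and let the DataLoader consume the stream directly:

import numpy as np
import pescador
import torch.utils.data

class PescadorDataset(torch.utils.data.IterableDataset):
    # minimal wrapper: the DataLoader pulls samples straight from the stream
    def __init__(self, mux):
        self.mux = mux

    def __iter__(self):
        # yield plain arrays; the default collate_fn stacks them into batches
        return (sample['X'] for sample in self.mux)

mux = pescador.StochasticMux(streams, nb_tracks, rate=None, mode='exhaustive')
loader = torch.utils.data.DataLoader(PescadorDataset(mux), batch_size=batch_size)

for X in loader:
    print(X.mean())

One caveat with either variant that would be worth checking: with num_workers > 0, every worker process gets its own copy of the dataset (and hence of the mux), so samples would be duplicated across workers unless the streams are partitioned per worker (e.g. via torch.utils.data.get_worker_info()).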

I would love to hear your feedback on this, and of course I would be happy to make a PR once we've agreed on an elegant solution.
