
add pytorch example #133

@faroit

Description

As more and more people are using PyTorch now, I wonder if we could have a PyTorch example for use with pescador?

In fact, I already tried a few things.

Let's assume the infamous (among audio researchers) scenario of randomly sampling small excerpts from longer audio tracks, where we want to see all of the data once per epoch, in random order:

import numpy as np
import pescador

# example parameters (values are only illustrative)
nb_tracks = 10
track_length = 1000
excerpt_length = 100
excerpt_hop = 50
batch_size = 16

# define audio tracks as (nb_samples, nb_features)
tracks = [np.random.random((track_length, 1)) for i in range(nb_tracks)]

# yield excerpts from an audio track
def excerpt_gen(data):
    for i in range(0, data.shape[0] - excerpt_length, excerpt_hop):
        yield dict(X=data[i:i + excerpt_length, :])

# set up one streamer per track
streams = [pescador.Streamer(excerpt_gen, track) for track in tracks]

# randomly sample from the streamers until all are exhausted
mux = pescador.StochasticMux(streams, nb_tracks, rate=None, mode='exhaustive')

# collect samples into batches of size batch_size
buffered_sample_gen = pescador.buffer_stream(mux, batch_size)

# iterate over the data
for batch in buffered_sample_gen:
    print(batch['X'].mean())
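As far as I understand the pescador API, with rate=None and mode='exhaustive' the StochasticMux plays every streamer until it is exhausted and does not revive it, so each excerpt is seen exactly once per epoch, just interleaved in random order across tracks.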

While this would obviously work with PyTorch directly by feeding in batches of data, I wonder if we could leverage the PyTorch Dataset and DataLoader classes to simplify the code and maybe make use of PyTorch's internal parallelization within the DataLoader.

It turns out PyTorch does allow you to override the Sampler and BatchSampler (see here) used by its DataLoader. But since those are all based on indices into your dataset class, using pescador for this wouldn't be exactly elegant (or I'm just missing the point). A custom sampler has to produce integer indices, so the best we could do is emit dummy indices and ignore them, as the sketch below shows.
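Just to illustrate the mismatch, a minimal sampler sketch (RandomStreamSampler is a hypothetical name, not part of any library):

import torch.utils.data

class RandomStreamSampler(torch.utils.data.Sampler):
    # a Sampler must yield indices into the dataset, but a pescador
    # stream has no meaningful index, so we can only emit dummy values
    # and ignore them on the dataset side
    def __init__(self, nb_samples):
        self.nb_samples = nb_samples

    def __iter__(self):
        # dummy indices; the real sample order would be decided by the mux
        return iter(range(self.nb_samples))

    def __len__(self):
        return self.nb_samples

That index plumbing is exactly the part that feels redundant once pescador already controls the sampling.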

For now, I came up with the following, which works and yields the same batches as the vanilla example above. It works by extending the Dataset class so that the DataLoader just draws successive samples and ignores the index.

import numpy as np
import pescador
import torch.utils.data

# reuses the parameters (nb_tracks, track_length, excerpt_length,
# excerpt_hop, batch_size) defined in the first example

# define audio tracks as (nb_samples, nb_features)
tracks = [np.random.random((track_length, 1)) for i in range(nb_tracks)]

# yield excerpts from an audio track
def excerpt_gen(data):
    for i in range(0, data.shape[0] - excerpt_length, excerpt_hop):
        yield dict(X=data[i:i + excerpt_length, :])

class TrackData(torch.utils.data.Dataset):
    def __init__(self, tracks):
        self.tracks = tracks
        self.streams = [pescador.Streamer(excerpt_gen, track) for track in tracks]
        # tuples() turns each dict sample into an (X, X) pair, so the
        # DataLoader sees a (data, target)-style tuple
        self.mux = pescador.tuples(
            pescador.StochasticMux(
                self.streams, nb_tracks, rate=None, mode='exhaustive'
            ),
            'X', 'X'
        )

    def __len__(self):
        # number of excerpts per track, matching the range() in excerpt_gen
        excerpts_per_track = len(range(0, track_length - excerpt_length, excerpt_hop))
        return len(self.tracks) * excerpts_per_track

    def __iter__(self):
        return iter(self.mux)

    def __getitem__(self, idx):
        # ignore idx: just draw the next sample from the mux
        return next(self.mux)

dataset = TrackData(tracks)

train_loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=batch_size
)

for batch_idx, (X, y) in enumerate(train_loader):
    print(X.mean())
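A rough alternative sketch (untested, and assuming a PyTorch version that provides torch.utils.data.IterableDataset; the class name PescadorDataset is hypothetical) would skip the index-based protocol entirely and let the DataLoader consume the stream directly:

import numpy as np
import pescador
import torch.utils.data

class PescadorDataset(torch.utils.data.IterableDataset):
    # minimal wrapper: the DataLoader pulls samples straight from the stream
    def __init__(self, mux):
        self.mux = mux

    def __iter__(self):
        # yield plain arrays; the default collate_fn stacks them into batches
        return (sample['X'] for sample in self.mux)

mux = pescador.StochasticMux(streams, nb_tracks, rate=None, mode='exhaustive')
loader = torch.utils.data.DataLoader(PescadorDataset(mux), batch_size=batch_size)

for X in loader:
    print(X.mean())

One caveat with either variant that would be worth checking: with num_workers > 0, every worker process gets its own copy of the dataset (and hence of the mux), so samples would be duplicated across workers unless the streams are partitioned per worker (e.g. via torch.utils.data.get_worker_info()).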

I would love to hear your feedback on this, and of course I would be happy to make a PR once we've agreed on an elegant solution.
