Trouble saving manifests for CutSets created with LazyHFDatasetIterator #1469

@kinanmartin

I'm attempting to write a recipe for the English subset of the MLS dataset as available on Hugging Face here. This version is restructured relative to the original release (the one the existing MLS preparation script in lhotse handles) and is packaged as .parquet files, so I imagine a new recipe is needed.

I've found that this PR implemented a wrapper CutSet.from_huggingface_dataset for loading HF datasets using the LazyHFDatasetIterator object, and it seems like a good tool to help prepare this dataset as a Lhotse CutSet object.

However, because the Recording object is read from bytes and seemingly stored in memory (see here), after creating a CutSet object, I cannot save any manifests for this CutSet because the audio data cannot be serialized into JSON (as alluded to by this warning in the documentation). This means I cannot save the .jsonl.gz files for use by, for instance, icefall ASR model training scripts, and I can't save feature manifests after doing CutSet.compute_and_store_features either.
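To illustrate the general obstacle in isolation, here is a minimal sketch using only Python's standard `json` module. It does not reproduce Lhotse's own serialization path; the entry dictionaries and the `try_serialize` helper are hypothetical stand-ins for a manifest entry that holds audio bytes versus one that holds a file path.

```python
import json

# Hypothetical stand-in for a manifest entry whose audio lives in memory as bytes.
in_memory_entry = {"id": "utt-0001", "audio": b"RIFF\x00\x00\x00\x00WAVEfmt "}

# Hypothetical stand-in for an entry that only references a file on disk.
path_entry = {"id": "utt-0001", "audio": "audio/utt-0001.wav"}


def try_serialize(entry):
    """Return the JSON string, or an error message if serialization fails."""
    try:
        return json.dumps(entry)
    except TypeError as exc:
        # json.dumps has no default encoding for bytes objects.
        return f"TypeError: {exc}"


result = try_serialize(in_memory_entry)
print(result)
print(try_serialize(path_entry))
```

The bytes-backed entry fails while the path-backed entry serializes, which is why a manifest that points at files on disk is the shape I'm after.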

I'm not sure of the best way to solve this problem. Several possibilities come to mind: extracting or transforming the original HF dataset out of the .parquet format, modifying LazyHFDatasetIterator to write intermediate temporary files, or working around the lack of these manifests in the training code entirely (perhaps by relying on on-the-fly feature computation). However, all of these seem quite intensive or suboptimal, and they perhaps defeat the purpose of this method.
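For concreteness, the first possibility (extracting the audio out of the dataset rows) could look roughly like the sketch below. It uses only the standard library and makes several assumptions that are mine, not Lhotse's or Hugging Face's: the row layout `{"id", "audio": {"bytes"}, "text"}`, the `export_rows` helper, the `audio_ext` parameter, and the manifest fields are all hypothetical, and the output is a plain JSONL.gz file rather than a Lhotse manifest.

```python
import gzip
import json
from pathlib import Path


def export_rows(rows, out_dir, manifest_name="manifest.jsonl.gz", audio_ext=".wav"):
    """Write each row's audio bytes to disk and record only the path in a manifest.

    Assumed (hypothetical) row layout:
        {"id": str, "audio": {"bytes": bytes}, "text": str}
    """
    out_dir = Path(out_dir)
    audio_dir = out_dir / "audio"
    audio_dir.mkdir(parents=True, exist_ok=True)
    manifest_path = out_dir / manifest_name

    with gzip.open(manifest_path, "wt", encoding="utf-8") as manifest:
        for row in rows:
            # The extension is an assumption; it should match the container
            # format that the stored bytes actually use.
            audio_path = audio_dir / f"{row['id']}{audio_ext}"
            audio_path.write_bytes(row["audio"]["bytes"])
            entry = {
                "id": row["id"],
                "audio_path": str(audio_path),
                "text": row["text"],
            }
            manifest.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return manifest_path
```

Once the audio exists as files, I assume something like `Recording.from_file` could build recordings that reference paths instead of in-memory bytes, but this duplicates the whole dataset on disk, which is the kind of cost I was hoping to avoid.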

Is there any easier solution that I'm missing to create manifests from these CutSets without losing the audio data?

By the way, I have the whole HF dataset downloaded to disk, so I'm not restricted to only streaming the dataset.
