Description
I'm attempting to write a recipe for the English subset of the MLS dataset as available on Hugging Face here. This version of the dataset is restructured relative to the original release that the existing MLS preparation script in lhotse handles, and it is packaged as `.parquet` files, so I imagine a new recipe is needed.
I've found that this PR implemented a `CutSet.from_huggingface_dataset` wrapper for loading HF datasets via the `LazyHFDatasetIterator` object, and it seems like a good tool for preparing this dataset as a lhotse `CutSet`.
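For context, I'm constructing the `CutSet` roughly like this (a minimal sketch: I'm assuming the wrapper forwards its arguments to `datasets.load_dataset`, and the `audio_key`/`text_key` keyword names and column names are my guesses, so the exact spelling may differ):

```python
from lhotse import CutSet

# Minimal sketch: build a lazy CutSet straight from the HF dataset.
# Assumptions: the wrapper forwards these arguments to datasets.load_dataset(),
# and audio_key/text_key name the relevant columns of this dataset.
cuts = CutSet.from_huggingface_dataset(
    "parquet",
    data_files={"train": "path/to/mls_eng/train/*.parquet"},  # my local copy
    split="train",
    audio_key="audio",
    text_key="transcript",  # guessing the transcript column name
)
```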
However, because the `Recording` object is read from bytes and seemingly stored in memory (see here), I cannot save any manifests for the resulting `CutSet`: the audio data cannot be serialized into JSON (as alluded to by this warning in the documentation). This means I cannot save the `.jsonl.gz` files needed by, for instance, icefall ASR model training scripts, and I can't save feature manifests after `CutSet.compute_and_store_features` either.
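Concretely, inspecting a cut shows the audio is held in RAM rather than referenced on disk, and writing the manifest is where things break down (sketch; I'm going off lhotse's `AudioSource.type` field here):

```python
cut = next(iter(cuts))
print(cut.recording.sources[0].type)  # "memory": the raw bytes live inside the
                                      # AudioSource instead of pointing at a file

cuts.to_file("mls_eng_train.jsonl.gz")  # fails: in-memory audio can't be serialized
```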
I'm not sure of the best way to solve this. A few possibilities come to mind: extracting or transforming the original HF dataset out of the `.parquet` format, modifying `LazyHFDatasetIterator` to write intermediate temporary files (sketched below), or working around the missing manifests in the training code entirely (perhaps by relying on on-the-fly feature computation). But all of these seem quite intensive or suboptimal, and they arguably defeat the purpose of this method.
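For illustration, the intermediate-files option might look like the untested sketch below, using `CutSet.save_audios` to materialize the audio to disk so the cuts reference real files and serialize normally; the obvious downside is that it re-writes the entire corpus:

```python
from lhotse import Fbank

# Sketch of the intermediate-files workaround: copy each utterance's audio out
# of the parquet shards into standalone files; the returned cuts reference
# those files, so the manifest can be written as usual.
cuts_on_disk = cuts.save_audios("data/mls_eng_audio")
cuts_on_disk.to_file("data/mls_eng_cuts.jsonl.gz")

# Feature manifests should then also work:
cuts_with_feats = cuts_on_disk.compute_and_store_features(
    extractor=Fbank(),
    storage_path="data/mls_eng_feats",
)
```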
Is there an easier solution I'm missing for creating manifests from these CutSets without losing the audio data?
By the way, I have the whole HF dataset downloaded to disk, so I'm not restricted to only streaming the dataset.