Description
I'm attempting to write a recipe for the English subset of the MLS dataset as available on Hugging Face here. This version of the dataset is restructured relative to the original release that the existing MLS preparation script in lhotse handles, and it is packaged as `.parquet` files, so I imagine a new recipe is needed.
I've found that this PR implemented a `CutSet.from_huggingface_dataset` wrapper for loading HF datasets via the `LazyHFDatasetIterator` object, and it seems like a good tool for preparing this dataset as a lhotse `CutSet`.
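For context, I'm constructing the `CutSet` roughly like this (a minimal sketch: I'm assuming the wrapper forwards its arguments to `datasets.load_dataset`, and the `audio_key`/`text_key` keyword names and column names are my guesses, so the exact spelling may differ):

```python
from lhotse import CutSet

# Minimal sketch: build a lazy CutSet straight from the HF dataset.
# Assumptions: the wrapper forwards these arguments to datasets.load_dataset(),
# and audio_key/text_key name the relevant columns of this dataset.
cuts = CutSet.from_huggingface_dataset(
    "parquet",
    data_files={"train": "path/to/mls_eng/train/*.parquet"},  # my local copy
    split="train",
    audio_key="audio",
    text_key="transcript",  # guessing the transcript column name
)
```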
However, because the `Recording` object is read from bytes and seemingly stored in memory (see here), I cannot save any manifests for the resulting `CutSet`: the audio data cannot be serialized into JSON (as alluded to by this warning in the documentation). This means I cannot save the `.jsonl.gz` files needed by, for instance, icefall ASR model training scripts, and I can't save feature manifests after `CutSet.compute_and_store_features` either.
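Concretely, inspecting a cut shows the audio is held in RAM rather than referenced on disk, and writing the manifest is where things break down (sketch; I'm going off lhotse's `AudioSource.type` field here):

```python
cut = next(iter(cuts))
print(cut.recording.sources[0].type)  # "memory": the raw bytes live inside the
                                      # AudioSource instead of pointing at a file

cuts.to_file("mls_eng_train.jsonl.gz")  # fails: in-memory audio can't be serialized
```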
I'm not sure of the best way to solve this. A few possibilities come to mind: extracting or transforming the original HF dataset out of the `.parquet` format, modifying `LazyHFDatasetIterator` to write intermediate temporary files (sketched below), or working around the missing manifests in the training code entirely (perhaps by relying on on-the-fly feature computation). But all of these seem quite intensive or suboptimal, and they arguably defeat the purpose of this method.
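For illustration, the intermediate-files option might look like the untested sketch below, using `CutSet.save_audios` to materialize the audio to disk so the cuts reference real files and serialize normally; the obvious downside is that it re-writes the entire corpus:

```python
from lhotse import Fbank

# Sketch of the intermediate-files workaround: copy each utterance's audio out
# of the parquet shards into standalone files; the returned cuts reference
# those files, so the manifest can be written as usual.
cuts_on_disk = cuts.save_audios("data/mls_eng_audio")
cuts_on_disk.to_file("data/mls_eng_cuts.jsonl.gz")

# Feature manifests should then also work:
cuts_with_feats = cuts_on_disk.compute_and_store_features(
    extractor=Fbank(),
    storage_path="data/mls_eng_feats",
)
```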
Is there an easier solution I'm missing for creating manifests from these CutSets without losing the audio data?
By the way, I have the whole HF dataset downloaded to disk, so I'm not restricted to only streaming the dataset.