FLAC decoder sync error when processing ReazonSpeech dataset #1468

@nroyliu

Description

I encountered "LibsndfileError: flac decoder lost sync" while downloading and processing the ReazonSpeech "small-v1" dataset with lhotse. The error occurs during the dataset mapping phase and is reproducible: the run always fails at example 778 of 2637.

2025-03-27 14:48:34,459 INFO [config.py:54] PyTorch version 2.4.1+cu118 available.
2025-03-27 14:48:34,690 INFO [reazonspeech.py:101] Downloading ReazonSpeech part: small-v1
Downloading data: 100%|███████████████████████████████████████████████████████████████| 275k/275k [00:00<00:00, 895kB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████| 321M/321M [00:16<00:00, 20.0MB/s]
Generating train split: 2637 examples [00:00, 9166.16 examples/s]
Map:  30%|███████████████████▊                                               | 778/2637 [00:01<00:04, 458.21 examples/s]
Traceback (most recent call last):
  File "/home/nroy/anaconda3/envs/icefall/bin/lhotse", line 8, in <module>
    sys.exit(cli())
  File "/home/nroy/anaconda3/envs/icefall/lib/python3.8/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
  File "/home/nroy/anaconda3/envs/icefall/lib/python3.8/site-packages/click/core.py", line 1082, in main
    rv = self.invoke(ctx)
  File "/home/nroy/anaconda3/envs/icefall/lib/python3.8/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/nroy/anaconda3/envs/icefall/lib/python3.8/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/nroy/anaconda3/envs/icefall/lib/python3.8/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/nroy/anaconda3/envs/icefall/lib/python3.8/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
  File "/home/nroy/anaconda3/envs/icefall/lib/python3.8/site-packages/lhotse/bin/modes/recipes/reazonspeech.py", line 59, in reazonspeech
    download_reazonspeech(target_dir, dataset_parts=subset, num_jobs=num_jobs)
  File "/home/nroy/anaconda3/envs/icefall/lib/python3.8/site-packages/lhotse/recipes/reazonspeech.py", line 119, in download_reazonspeech
    ds = ds.map(
  File "/home/nroy/anaconda3/envs/icefall/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 560, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/nroy/anaconda3/envs/icefall/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3055, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/home/nroy/anaconda3/envs/icefall/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3428, in _map_single
    example = apply_function_on_filtered_inputs(example, i, offset=offset)
  File "/home/nroy/anaconda3/envs/icefall/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3320, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/home/nroy/anaconda3/envs/icefall/lib/python3.8/site-packages/lhotse/recipes/reazonspeech.py", line 113, in format_example
    example["audio_filepath"] = example["audio"]["path"]
  File "/home/nroy/anaconda3/envs/icefall/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 279, in __getitem__
    value = self.format(key)
  File "/home/nroy/anaconda3/envs/icefall/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 377, in format
    return self.formatter.format_column(self.pa_table.select([key]))[0]
  File "/home/nroy/anaconda3/envs/icefall/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 449, in format_column
    column = self.python_features_decoder.decode_column(column, pa_table.column_names[0])
  File "/home/nroy/anaconda3/envs/icefall/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 225, in decode_column
    return self.features.decode_column(column, column_name) if self.features else column
  File "/home/nroy/anaconda3/envs/icefall/lib/python3.8/site-packages/datasets/features/features.py", line 2066, in decode_column
    [decode_nested_example(self[column_name], value) if value is not None else None for value in column]
  File "/home/nroy/anaconda3/envs/icefall/lib/python3.8/site-packages/datasets/features/features.py", line 2066, in <listcomp>
    [decode_nested_example(self[column_name], value) if value is not None else None for value in column]
  File "/home/nroy/anaconda3/envs/icefall/lib/python3.8/site-packages/datasets/features/features.py", line 1405, in decode_nested_example
    return schema.decode_example(obj, token_per_repo_id=token_per_repo_id)
  File "/home/nroy/anaconda3/envs/icefall/lib/python3.8/site-packages/datasets/features/audio.py", line 184, in decode_example
    array, sampling_rate = sf.read(f)
  File "/home/nroy/anaconda3/envs/icefall/lib/python3.8/site-packages/soundfile.py", line 308, in read
    data = f.read(frames, dtype, always_2d, fill_value, out)
  File "/home/nroy/anaconda3/envs/icefall/lib/python3.8/site-packages/soundfile.py", line 942, in read
    frames = self._array_io('read', out, frames)
  File "/home/nroy/anaconda3/envs/icefall/lib/python3.8/site-packages/soundfile.py", line 1394, in _array_io
    return self._cdata_io(action, cdata, ctype, frames)
  File "/home/nroy/anaconda3/envs/icefall/lib/python3.8/site-packages/soundfile.py", line 1404, in _cdata_io
    _error_check(self._errorcode)
  File "/home/nroy/anaconda3/envs/icefall/lib/python3.8/site-packages/soundfile.py", line 1480, in _error_check
    raise LibsndfileError(err, prefix=prefix)
soundfile.LibsndfileError: Error : flac decoder lost sync.

Environment

Python: 3.8
PyTorch: 2.4.1+cu118
lhotse: 1.31.0.dev0+git.aa38c0f.clean
soundfile: 0.13.1
datasets: 3.1.0
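
The traceback shows that reading example["audio"] inside format_example makes datasets decode the FLAC data through sf.read, so a single unreadable file aborts the whole map. A minimal sketch for locating the offending file(s) follows; it is not part of lhotse. The helper names find_undecodable and decode_with_soundfile are hypothetical, and it assumes the audio column has been switched to undecoded form so that each entry is a dict with "path" and "bytes" keys.

```python
import io
from typing import Callable, Iterable, List, Optional, Tuple


def find_undecodable(
    examples: Iterable[dict],
    decode: Callable[[dict], object],
) -> List[Tuple[int, Optional[str], str]]:
    """Return (index, path, error message) for each example whose audio fails to decode."""
    failures = []
    for index, example in enumerate(examples):
        audio = example["audio"]
        try:
            decode(audio)
        except Exception as exc:  # broad on purpose: the goal is only to locate bad files
            failures.append((index, audio.get("path"), str(exc)))
    return failures


def decode_with_soundfile(audio: dict) -> None:
    """Decode one undecoded audio entry ({"path": ..., "bytes": ...}) with soundfile."""
    import soundfile as sf  # imported lazily so find_undecodable has no hard dependency

    # Prefer the in-memory bytes when present, otherwise fall back to the file path.
    source = io.BytesIO(audio["bytes"]) if audio.get("bytes") else audio["path"]
    sf.read(source)


# Possible usage (assumption: `ds` is the split that lhotse maps over):
#   from datasets import Audio
#   ds = ds.cast_column("audio", Audio(decode=False))
#   for index, path, error in find_undecodable(ds, decode_with_soundfile):
#       print(index, path, error)
```

If this reports only the entry at index 778, the downloaded file itself is probably truncated or corrupted, and clearing the cached download and fetching it again would be the first thing to try.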
