
urgent2025_challenge

Official data preparation scripts for the URGENT 2025 Challenge.

The metadata files generated by this repo are compatible with the baseline code. See the instructions for more details on how to run the baseline code.

Updates

❗️❗️[2025-1-5] There was a bug in prepare_espnet_data.sh: the final scp files (generated under data/speech_track*/) did not include the Chinese subset of the CommonVoice dataset. Please pull the latest commit and do the following:

# change the track number accordingly
track=track1

# then remove .done file so the necessary steps are not skipped
rm data/tmp/speech_train_${track}.done data/tmp/commonvoice19_${track}.done

# generate the scp files
. prepare_espnet_data.sh

❗️❗️[2024-11-27] We added a troubleshooting section for some known issues at the end of this README. Please check it first if you encounter any problems.

❗️❗️[2024-11-19] We modified the ESTOI evaluation to be deterministic (it previously involved randomness).

❗️❗️[2024-11-18] We have added a missing file, commonvoice_19.0_es_train_track2.json.gz, which is necessary for data preparation in Track 2. If you cloned the repository before Nov. 18, please pull the latest commit.

❗️❗️[2024-11-16] We have modified some data preparation and evaluation scripts. If you cloned the repository before Nov. 16, please pull the latest commit.

Notes

  • If you do not have a license for the WSJ corpora, please reach out to the organizers (urgent.challenge@gmail.com) for a temporary license supported by LDC. Please include your name, organization/affiliation, and the username used on the leaderboard in the email for a smooth procedure. Note that we do not accept requests unless you have registered for the challenge leaderboard (refer to this page for how to register).

  • Please check the troubleshooting section at the end of this README first if you encounter any problems. Please raise an issue if you find any other problems.

  • The default generated data/speech_train subset is intended only for dynamic mixing (on-the-fly simulation) in the ESPnet framework. Its spk1.scp (clean reference speech) and wav.scp (which would normally point to noisy speech) have identical content, to facilitate on-the-fly simulation of different distortions.

  • The validation set made by this script is different from the official validation set used in the leaderboard, although the data sources and the types of distortions are the same. The official one is here. Note that we provide only the noisy data, not the ground truth, of the official validation set until the leaderboard switches to the test phase (Dec. 23), to avoid cheating on the leaderboard.

  • The unofficial validation set made by this script can be used to select the best checkpoint. Participants may freely change the configuration used to generate it.
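As a quick sanity check of the data/speech_train note above (a sketch, assuming the default output layout under data/), you can confirm that the two scp files are indeed identical:

```shell
# spk1.scp and wav.scp should have identical content in data/speech_train,
# since both point to the clean reference speech for on-the-fly simulation
diff -q data/speech_train/spk1.scp data/speech_train/wav.scp \
  && echo "identical (as expected)"
```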

Dataset description

There are four data splits in the challenge. For more details, please refer to this page.

  • Training/unofficial validation set: the training and validation sets automatically prepared by the scripts in this repo. Participants are allowed to make unofficial validation data with their own configuration.

  • Official validation set: the validation set used on the leaderboard during the validation phase (2024/11/25 - 2024/12/23). Noisy and clean speech as well as the metadata are available.

  • Non-blind test set: the dataset used in the non-blind test phase (2024/12/23 - 2025/01/20). Only noisy speech is available now; clean speech and metadata will be released once the non-blind test phase ends.

  • Blind test set: the dataset used for the final ranking. It will be available after 2025/01/21.

Requirements for data preparation

  • More than 8 CPU cores

  • At least 1.3 TB of free disk space for Track 1 and ??? TB for Track 2

  • Data-size breakdown
    • Note: we only counted audio files and did not include the size of archived files (e.g., .zip or .tar.gz files). You can remove the archived files once the data preparation is done.
    • Speech
      • DNS5 speech (original 131 GB + resampled 94 GB): 225 GB
      • LibriTTS (original 44 GB + resampled 7 GB): 51 GB
      • VCTK: 12 GB
      • WSJ (original sph 24GB + converted 31 GB): 55 GB
      • EARS: 61 GB
      • CommonVoice 19.0 speech
        • Track 1 (original mp3 221 GB + resampled 200 GB): 421 GB
        • Track 2 (original mp3 221 GB + resampled ??? GB): ??? GB
      • MLS (less compressed version downloaded from LibriVox)
        • Track 1 (original 60 GB + resampled 60 GB): 120 GB
        • Track 2 (original 6TB + resampled ???TB): ???TB
    • Noise
      • DNS5 noise (original 58 GB + resampled 35 GB): 93 GB
      • WHAM! noise (48 kHz): 76 GB
      • FSD50K (original 24 GB + resampled 6 GB): 30 GB
      • FMA: (original 24 GB + resampled 36 GB): 60 GB
    • RIR
      • DNS5 RIRs (48 kHz): 6 GB
    • Others
      • default simulated validation data: 2 GB
      • simulated wind noise for training (with default config): 1 GB
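To check whether your machine meets these requirements, the following commands (assuming a Linux shell with coreutils) report the core count and the free space on the filesystem holding this repo:

```shell
nproc    # number of available CPU cores
df -h .  # free disk space on the current filesystem
```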

Instructions for data preparation

  1. After cloning this repository, run the following command to initialize the submodules:

    git submodule update --init --recursive
  2. Set up the environment. Python 3.10 and Torch 2.0.1+ are recommended. With Conda, just run

    conda env create -f environment.yaml
    conda activate urgent2025

    In case of the following error

      ERROR: Failed building wheel for pypesq
    ERROR: Could not build wheels for pypesq, which is required to install pyproject.toml-based projects
    

    you can manually install pypesq beforehand (make sure numpy is installed first to avoid compilation errors):

    python -m pip install https://github.com/vBaiCai/python-pesq/archive/master.zip
  3. Get the download links for the CommonVoice dataset v19.0 from https://commonvoice.mozilla.org/en/datasets

    For German, English, Spanish, French, and Chinese (China), please do the following.

    a. Select Common Voice Corpus 19.0

    b. Enter your email and check the two mandatory boxes

    c. Right-click the Download Dataset Bundle button and select "Copy link"

    d. Paste the links into the URLs=(...) array in utils/prepare_CommonVoice19_speech.sh, like

    URLs=(
      "https://storage.googleapis.com/common-voice-prod-prod-datasets/cv-corpus-19.0-2024-09-13/cv-corpus-19.0-2024-09-13-de.tar.gz?xxxxxx"
      "https://storage.googleapis.com/common-voice-prod-prod-datasets/cv-corpus-19.0-2024-09-13/cv-corpus-19.0-2024-09-13-en.tar.gz?xxxxxx"
      "https://storage.googleapis.com/common-voice-prod-prod-datasets/cv-corpus-19.0-2024-09-13/cv-corpus-19.0-2024-09-13-es.tar.gz?xxxxxx"
      "https://storage.googleapis.com/common-voice-prod-prod-datasets/cv-corpus-19.0-2024-09-13/cv-corpus-19.0-2024-09-13-fr.tar.gz?xxxxxx"
      "https://storage.googleapis.com/common-voice-prod-prod-datasets/cv-corpus-19.0-2024-09-13/cv-corpus-19.0-2024-09-13-zh-CN.tar.gz?xxxxxx"
    )
  4. Make symbolic links to the wsj0 and wsj1 data

    a. Make a directory ./wsj

    b. Make symbolic links to wsj0 and wsj1 under ./wsj (i.e., ./wsj/wsj0/ and ./wsj/wsj1/)

    NOTE: If you do not have a license for the WSJ corpora, please reach out to the organizers (urgent.challenge@gmail.com) for a temporary license supported by LDC, as described in the Notes section above. Also note that participants are allowed to train their systems using only a subset of the given dataset, so preliminary investigations (or even the final submission) can be done without the WSJ corpora.
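    Steps a and b above can be sketched as follows; the source paths /path/to/wsj0 and /path/to/wsj1 are placeholders for wherever your WSJ copies actually live:

    ```shell
    mkdir -p ./wsj
    # replace the first argument of each ln with the actual location of your WSJ corpora
    ln -s /path/to/wsj0 ./wsj/wsj0
    ln -s /path/to/wsj1 ./wsj/wsj1
    ```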

  5. FFmpeg-related

    To simulate wind noise and codec artifacts, our scripts utilize FFmpeg.

    a. Activate your python environment

    b. Get the path to FFmpeg with which ffmpeg

    c. Change /path/to/ffmpeg in simulation/simulate_data_from_param.py to the path to your ffmpeg.
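    Steps a-c can be scripted as below (a sketch; the sed pattern assumes the placeholder string /path/to/ffmpeg appears verbatim in the file):

    ```shell
    # with your python environment activated (step a):
    FFMPEG=$(which ffmpeg)   # step b: locate FFmpeg
    # step c: replace the placeholder with the detected path
    sed -i "s|/path/to/ffmpeg|${FFMPEG}|g" simulation/simulate_data_from_param.py
    ```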

  6. Run the script

    ./prepare_espnet_data.sh

    NOTE: Please do not change output_dir in each shell script called in prepare_{dataset}.sh. If you want to download datasets somewhere else, make a symbolic link to that directory.

    # example when you want to download FSD50K noise to /path/to/somewhere
    # prepare_fsd50k_noise.sh specifies ./fsd50k as output_dir, so make a symbolic link from /path/to/somewhere to ./fsd50k
    mkdir -p /path/to/somewhere
    ln -s /path/to/somewhere ./fsd50k
  7. Install eSpeak-NG (used for the phoneme similarity metric computation)

Troubleshooting

Errors when unpacking MLS .tar.gz files

An error like the following sometimes happens when unpacking .tar.gz files in utils/prepare_MLS_speech.sh.

If you encounter this error, retry the script after deleting ./mls_segments/download_mls_${lang}_${split}_${track}.done for the failed language, split (train or dev), and track (track1 or track2).

In the following example, one needs to remove ./mls_segments/download_mls_spanish_train_track1.done before rerunning the script.

=== Preparing MLS data for track1 ===
=== Preparing MLS german train data ===
[MLS-german-train_track1] downloading data
=== Preparing MLS german dev data ===
[MLS-german-dev] downloading data
=== Preparing MLS french train data ===
[MLS-french-train_track1] downloading data
=== Preparing MLS french dev data ===
[MLS-french-dev] downloading data
=== Preparing MLS spanish train data ===
[MLS-spanish-train_track1] downloading data
tar: ./3946/3579: Cannot mkdir: No such file or directory
tar: ./3946/8075: Cannot mkdir: No such file or directory
tar: ./9972/10719: Cannot mkdir: No such file or directory
tar: Exiting with failure status due to previous errors
tar: Exiting with failure status due to previous errors
tar: Exiting with failure status due to previous errors
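For the failed Spanish train split in the log above, the retry would look like this (a sketch; substitute the language, split, and track that failed on your machine):

```shell
rm ./mls_segments/download_mls_spanish_train_track1.done
./prepare_espnet_data.sh  # rerun; completed steps are skipped via their .done files
```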

Warnings when processing FMA data

When preparing the FMA data, the following warnings may appear, but you can safely ignore them.

[FMA noise] split training and validation data
[FMA noise] resampling to estimated audio bandwidth
  0%|                                                                                          | 0/19902 [00:00<?, ?it/s][src/libmpg123/layer3.c:INT123_do_layer3():1801] error: dequantization failed!
[src/libmpg123/layer3.c:INT123_do_layer3():1771] error: part2_3_length (3264) too large for available bit count (3224)
[src/libmpg123/layer3.c:INT123_do_layer3():1841] error: dequantization failed!
[src/libmpg123/layer3.c:INT123_do_layer3():1801] error: dequantization failed!
...

TypeError when running calculate_phoneme_similarity.py

The following error may happen when running evaluation_metrics/calculate_phoneme_similarity.py.

This happens because the phoneme recognizer requires eSpeak-NG's lib directory, while only the bin directory may exist, depending on how you built eSpeak-NG.

Adding the path to the lib directory to LD_LIBRARY_PATH solves the issue.

evaluation_metrics/calculate_phoneme_similarity.py", line 58, in __init__
  self.phoneme_predictor = PhonemePredictor(device=device)
urgent2025_challenge/evaluation_metrics/calculate_phoneme_similarity.py", line 29, in __init__
    self.processor = Wav2Vec2Processor.from_pretrained(checkpoint)

TypeError: Received a bool for argument tokenizer, but a PreTrainedTokenizerBase was expected.
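For example, if you installed eSpeak-NG under /path/to/espeak-ng (a hypothetical prefix; substitute your actual install location):

```shell
# make the eSpeak-NG shared libraries visible to the dynamic linker
export LD_LIBRARY_PATH="/path/to/espeak-ng/lib:${LD_LIBRARY_PATH}"
```

Add the export to your shell profile if you want the fix to persist across sessions.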

SDR/ESTOI scores from calculate_intrusive_se_metrics.py are weird

There are cases where the SDR and ESTOI scores obtained from evaluation_metrics/calculate_intrusive_se_metrics.py are obviously wrong (e.g., SDR becomes 50 or -50).

This could be because np.linalg.solve can give different solutions on different machines (cf. here).

We have found that upgrading NumPy to 1.26 solves this problem.

For debugging, you can run the evaluation on the noisy speech after upgrading NumPy and compare the scores with those on the leaderboard.

Note that while ESPnet requires NumPy<=1.24, we have not encountered any issues caused by upgrading NumPy to 1.26.
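A minimal sketch of the upgrade, run inside the urgent2025 environment (the ~=1.26.0 pin is our suggestion for staying on the 1.26 series, not an official requirement):

```shell
python -m pip install "numpy~=1.26.0"
# verify which version is now installed
python -c "import numpy; print(numpy.__version__)"
```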

About

Official data preparation and metric evaluation scripts for the Low Resource Audio Codec (LRAC) challenge.
