Official data preparation scripts for the URGENT 2025 Challenge.
The metadata files generated by this repo are compatible with the baseline code. See the instructions for more details on how to run the baseline code.
❗️❗️[2025-1-5] There was a bug in `prepare_espnet_data.sh` where the final scp files (generated under `data/speech_track*/`) did not include the Chinese subset of the CommonVoice dataset. Please pull the latest commit and run the following:

```shell
# change the track number accordingly
track=track1
# then remove the .done files so the necessary steps are not skipped
rm data/tmp/speech_train_${track}.done data/tmp/commonvoice19_${track}.done
# regenerate the scp files
. prepare_espnet_data.sh
```
❗️❗️[2024-11-27] We added a troubleshooting section for some known issues at the end of this README. Please check it first if you encounter any problems.
❗️❗️[2024-11-19] We modified the ESTOI evaluation to be deterministic (it previously involved randomness).
❗️❗️[2024-11-18] We have added some missing files necessary for data preparation in Track 2 (commonvoice_19.0_es_train_track2.json.gz). If you cloned the repository before Nov. 18, please pull the latest commit.
❗️❗️[2024-11-16] We have modified some data preparation and evaluation scripts. If you cloned the repository before Nov. 16, please pull the latest commit.
- If you do not have a license for the WSJ corpora, please reach out to the organizers (urgent.challenge@gmail.com) for a temporary license supported by LDC. Please include your name, organization/affiliation, and the username used in the leaderboard in the email for a smooth process. Note that we do not accept requests unless you have registered for the challenge leaderboard (refer to this page for how to register).
- Please check the troubleshooting section at the end of this README first if you encounter any problems. Please open an issue if you find any other problems.
- The default generated `data/speech_train` subset is only intended for dynamic mixing (on-the-fly simulation) in the ESPnet framework. It has the same content in `spk1.scp` (clean reference speech) and `wav.scp` (originally intended to point to noisy speech) to facilitate on-the-fly simulation of different distortions.
- The validation set made by this script is different from the official validation set used in the leaderboard, although the data sources and the types of distortions are the same. The official one is here. Note that we only provide the noisy data, not the ground truth, of the official validation set until the leaderboard switches to the test phase (Dec. 23) to avoid cheating in the leaderboard.
- The unofficial validation set made by this script can be used to select the best checkpoint. Participants can freely change the configuration to generate the unofficial validation set.
There are four data splits in the challenge. For more details, please refer to this page.
- Training/unofficial validation set: the training and validation sets automatically prepared by the scripts in this repo. Participants are allowed to make unofficial validation data with their own configuration.
- Official validation set: the validation set used in the leaderboard during the validation phase (2024/11/25 - 2024/12/23). Noisy and clean speech as well as the metadata are available.
- Non-blind test set: the dataset used in the non-blind test phase (2024/12/23 - 2025/01/20). Only noisy speech is available now; clean speech and metadata will be released once the non-blind test phase ends.
- Blind test set: the dataset used for the final ranking. It will be available after 2025/01/21.
- More than 8 CPU cores
- At least 1.3 TB of free disk space for track 1 and ??? TB for track 2
- Data-size breakdown
- Note: we only counted audio files and did not include the size of archived files (e.g., .zip or .tar.gz files). You can remove the archived files once the data preparation is done.
- Speech
- DNS5 speech (original 131 GB + resampled 94 GB): 225 GB
- LibriTTS (original 44 GB + resampled 7 GB): 51 GB
- VCTK: 12 GB
- WSJ (original sph 24 GB + converted 31 GB): 55 GB
- EARS: 61 GB
- CommonVoice 19.0 speech
- Track 1 (original mp3 221 GB + resampled 200 GB): 421 GB
- Track 2 (original mp3 221 GB + resampled ??? GB): ??? GB
- MLS (less compressed version downloaded from LibriVox)
- Track 1 (original 60 GB + resampled 60 GB): 120 GB
- Track 2 (original 6TB + resampled ???TB): ???TB
- Noise
- DNS5 noise (original 58 GB + resampled 35 GB): 93 GB
- WHAM! noise (48 kHz): 76 GB
- FSD50K (original 24 GB + resampled 6 GB): 30 GB
- FMA (original 24 GB + resampled 36 GB): 60 GB
- RIR
- DNS5 RIRs (48 kHz): 6 GB
- Others
- default simulated validation data: 2 GB
- simulated wind noise for training (with default config): 1 GB
- After cloning this repository, run the following command to initialize the submodules:

```shell
git submodule update --init --recursive
```
- Install the environment. Python 3.10 and Torch 2.0.1+ are recommended. With Conda, just run:

```shell
conda env create -f environment.yaml
conda activate urgent2025
```
In case of the following error:

```
ERROR: Failed building wheel for pypesq
ERROR: Could not build wheels for pypesq, which is required to install pyproject.toml-based projects
```

you could manually install `pypesq` in advance (make sure you have `numpy` installed before trying this, to avoid compilation errors):

```shell
python -m pip install https://github.com/vBaiCai/python-pesq/archive/master.zip
```
- Get the download links for the CommonVoice dataset v19.0 from https://commonvoice.mozilla.org/en/datasets
  For German, English, Spanish, French, and Chinese (China), do the following:
  a. Select `Common Voice Corpus 19.0`
  b. Enter your email and check the two mandatory boxes
  c. Right-click the `Download Dataset Bundle` button and select "Copy link"
  d. Paste the links into the `URLs=(...)` array in utils/prepare_CommonVoice19_speech.sh, like:

```shell
URLs=(
    "https://storage.googleapis.com/common-voice-prod-prod-datasets/cv-corpus-19.0-2024-09-13/cv-corpus-19.0-2024-09-13-de.tar.gz?xxxxxx"
    "https://storage.googleapis.com/common-voice-prod-prod-datasets/cv-corpus-19.0-2024-09-13/cv-corpus-19.0-2024-09-13-en.tar.gz?xxxxxx"
    "https://storage.googleapis.com/common-voice-prod-prod-datasets/cv-corpus-19.0-2024-09-13/cv-corpus-19.0-2024-09-13-es.tar.gz?xxxxxx"
    "https://storage.googleapis.com/common-voice-prod-prod-datasets/cv-corpus-19.0-2024-09-13/cv-corpus-19.0-2024-09-13-fr.tar.gz?xxxxxx"
    "https://storage.googleapis.com/common-voice-prod-prod-datasets/cv-corpus-19.0-2024-09-13/cv-corpus-19.0-2024-09-13-zh-CN.tar.gz?xxxxxx"
)
```
- Make symbolic links to the wsj0 and wsj1 data
  a. Make a directory `./wsj`
  b. Make symbolic links to wsj0 and wsj1 under `./wsj` (`./wsj/wsj0/` and `./wsj/wsj1/`)

  NOTE: If you do not have a license for the WSJ corpora, please reach out to the organizers (urgent.challenge@gmail.com) for a temporary license supported by LDC. Please include your name, organization/affiliation, and the username used in the leaderboard in the email for a smooth process. Note that we do not accept requests unless you have registered for the challenge leaderboard (refer to this page for how to register). Note that participants are allowed to train their systems using only a subset of the given datasets, so preliminary investigation (or even the final submission) can be done without the WSJ corpora.
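The two symlink steps above can be sketched as follows. The LDC source directories here are assumptions for illustration; replace them with wherever your licensed WSJ copies actually live.

```shell
# Hypothetical locations of your licensed WSJ copies -- adjust to your install
WSJ0_SRC=/tmp/ldc_demo/wsj0_src
WSJ1_SRC=/tmp/ldc_demo/wsj1_src
mkdir -p "$WSJ0_SRC" "$WSJ1_SRC"    # only so the demo links are not dangling

# step a: make the ./wsj directory
mkdir -p ./wsj
# step b: link wsj0 and wsj1 under ./wsj
ln -sfn "$WSJ0_SRC" ./wsj/wsj0
ln -sfn "$WSJ1_SRC" ./wsj/wsj1
ls -l ./wsj
```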
- FFmpeg-related
  To simulate wind noise and codec artifacts, our scripts utilize FFmpeg.
  a. Activate your Python environment
  b. Get the path to FFmpeg via `which ffmpeg`
  c. Change `/path/to/ffmpeg` in simulation/simulate_data_from_param.py to the path to your ffmpeg.
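Steps b and c can be scripted with `sed`; the sketch below demonstrates the substitution on a mock file so it is safe to run anywhere. To apply it for real, point the `sed` command at simulation/simulate_data_from_param.py instead. The fallback path is an assumption for machines where `ffmpeg` is not on `PATH`.

```shell
# step b: locate ffmpeg (assumed fallback path if it is not installed)
FFMPEG_BIN=$(command -v ffmpeg || echo /usr/bin/ffmpeg)

# mock file standing in for simulation/simulate_data_from_param.py
printf 'ffmpeg = "/path/to/ffmpeg"\n' > /tmp/simulate_param_demo.py

# step c: replace the placeholder with the real binary path
sed -i "s|/path/to/ffmpeg|${FFMPEG_BIN}|g" /tmp/simulate_param_demo.py
cat /tmp/simulate_param_demo.py
```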
- Run the script `./prepare_espnet_data.sh`

  NOTE: Please do not change `output_dir` in each shell script called in prepare_{dataset}.sh. If you want to download datasets somewhere else, make a symbolic link to that directory:

```shell
# example: you want to download the FSD50K noise to /path/to/somewhere
# prepare_fsd50k_noise.sh specifies ./fsd50k as output_dir, so make a symbolic link from /path/to/somewhere to ./fsd50k
mkdir -p /path/to/somewhere
ln -s /path/to/somewhere ./fsd50k
```
- Install eSpeak-NG (used for the phoneme similarity metric computation)
  - Follow the instructions in https://github.com/espeak-ng/espeak-ng/blob/master/docs/guide.md#linux
  - NOTE: if you build eSpeak-NG from source (not via e.g. apt-get), it may cause an error when running evaluation_metrics/calculate_phoneme_similarity.py. Refer to the troubleshooting below if you encounter the issue.
Errors when unpacking MLS .tar.gz files
Sometimes an error like the following happens when unpacking .tar.gz files in utils/prepare_MLS_speech.sh.
If you encounter this error, simply retry the script after deleting ./mls_segments/download_mls_${lang}_${split}_${track}.done for the failed language, split (train or dev), and track (track1 or track2).
In the following example, one needs to remove ./mls_segments/download_mls_spanish_train_track1.done before rerunning the script.
```
=== Preparing MLS data for track1 ===
=== Preparing MLS german train data ===
[MLS-german-train_track1] downloading data
=== Preparing MLS german dev data ===
[MLS-german-dev] downloading data
=== Preparing MLS french train data ===
[MLS-french-train_track1] downloading data
=== Preparing MLS french dev data ===
[MLS-french-dev] downloading data
=== Preparing MLS spanish train data ===
[MLS-spanish-train_track1] downloading data
tar: ./3946/3579: Cannot mkdir: No such file or directory
tar: ./3946/8075: Cannot mkdir: No such file or directory
tar: ./9972/10719: Cannot mkdir: No such file or directory
tar: Exiting with failure status due to previous errors
tar: Exiting with failure status due to previous errors
tar: Exiting with failure status due to previous errors
```
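The cleanup for the failed combination in the log above can be sketched as a few shell lines (the `lang`/`split`/`track` values here match that example; set them to whichever combination failed for you):

```shell
# set these to the failed combination from your log
lang=spanish; split=train; track=track1

# remove the .done marker so the preparation script redoes this step
marker=./mls_segments/download_mls_${lang}_${split}_${track}.done
rm -f "$marker"
echo "Removed $marker; now rerun the preparation script"
```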
Warnings when processing FMA data
When preparing FMA data, the following warnings appear, but you can just ignore them:
```
[FMA noise] split training and validation data
[FMA noise] resampling to estimated audio bandwidth
0%| | 0/19902 [00:00<?, ?it/s][src/libmpg123/layer3.c:INT123_do_layer3():1801] error: dequantization failed!
[src/libmpg123/layer3.c:INT123_do_layer3():1771] error: part2_3_length (3264) too large for available bit count (3224)
[src/libmpg123/layer3.c:INT123_do_layer3():1841] error: dequantization failed!
[src/libmpg123/layer3.c:INT123_do_layer3():1801] error: dequantization failed!
...
```
TypeError when running calculate_phoneme_similarity.py
The following error may happen when running evaluation_metrics/calculate_phoneme_similarity.py.
This is because the phoneme recognizer requires the `lib` directory while only the `bin` directory exists, depending on how you built eSpeak-NG.
Adding the path to the `lib` directory to `LD_LIBRARY_PATH` solves the issue.
```
evaluation_metrics/calculate_phoneme_similarity.py", line 58, in __init__
    self.phoneme_predictor = PhonemePredictor(device=device)
urgent2025_challenge/evaluation_metrics/calculate_phoneme_similarity.py", line 29, in __init__
    self.processor = Wav2Vec2Processor.from_pretrained(checkpoint)
TypeError: Received a bool for argument tokenizer, but a PreTrainedTokenizerBase was expected.
```
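The `LD_LIBRARY_PATH` fix can be sketched as below. `/usr/local/lib` is an assumption based on the default install prefix when building eSpeak-NG from source; adjust it if you configured a different prefix.

```shell
# prepend the eSpeak-NG lib directory (assumed /usr/local/lib) to the
# dynamic linker search path, keeping any existing entries
export LD_LIBRARY_PATH="/usr/local/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
echo "$LD_LIBRARY_PATH"
```

Add the `export` line to your shell profile (or job script) so it is applied whenever you run the evaluation.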
SDR/ESTOI scores from calculate_intrusive_se_metrics.py are weird
There are cases where the SDR and ESTOI scores obtained from evaluation_metrics/calculate_intrusive_se_metrics.py are obviously wrong (e.g., SDR becomes 50 or -50).
This could be because np.linalg.solve can give different solutions on different machines (cf. here).
We have found that upgrading NumPy to 1.26 solves this problem.
For debugging, you can run the evaluation on the noisy speech after upgrading NumPy and compare the scores with those in the leaderboard.
Note that while ESPnet requires NumPy<=1.24, we have not encountered any issue caused by upgrading NumPy to 1.26.
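A quick way to check which NumPy version your evaluation environment actually uses is sketched below (`python3` on `PATH` is an assumption; the pinned version in the suggestion is just one 1.26.x release):

```shell
# report the NumPy version in the active environment
ver=$(python3 -c 'import numpy; print(numpy.__version__)' 2>/dev/null || echo "not installed")
case "$ver" in
    1.26.*|2.*) echo "NumPy $ver: OK" ;;
    *) echo "NumPy $ver: consider 'python -m pip install numpy==1.26.4'" ;;
esac
```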