Created by Zixiang Wan, 2025
Visualize the relationships between phonemes and codec tokens in a speech dataset (LJSpeech in this walkthrough).
Use Montreal Forced Aligner (MFA) to obtain phoneme timestamps.
- Co-occurrence heatmap
- t-SNE visualization
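As a rough illustration of what the co-occurrence heatmap is built from, the sketch below counts how often each phoneme label coincides with each codec token. It assumes the two sequences are already frame-aligned and of equal length. The function name `cooccurrence_matrix` and the toy data are hypothetical and are not taken from draw.py.

```python
from collections import Counter

def cooccurrence_matrix(phonemes, tokens):
    """Count how often each phoneme label coincides with each codec token.

    `phonemes` and `tokens` are equal-length, frame-aligned sequences.
    Returns (row_labels, col_labels, matrix) where matrix[i][j] is the
    number of frames with phoneme row_labels[i] and token col_labels[j].
    """
    if len(phonemes) != len(tokens):
        raise ValueError("phonemes and tokens must be frame-aligned")
    counts = Counter(zip(phonemes, tokens))
    rows = sorted(set(phonemes))
    cols = sorted(set(tokens))
    matrix = [[counts[(p, t)] for t in cols] for p in rows]
    return rows, cols, matrix

# Toy, made-up data: four frames of phoneme labels and codec token IDs.
rows, cols, matrix = cooccurrence_matrix(["AH", "AH", "T", "T"], [7, 7, 7, 3])
```

The resulting nested list can be passed to any plotting library's heatmap function, with `rows` and `cols` as the axis tick labels.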
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar -xjvf LJSpeech-1.1.tar.bz2
conda install -c conda-forge montreal-forced-aligner
Note: This step may take a while.
Resample the audio to 16 kHz before alignment (the original LJSpeech recordings are 22.05 kHz).
mkdir -p LJSpeech-1.1/wavs_16k
for file in LJSpeech-1.1/wavs/*.wav; do
base=$(basename "$file")
sox "$file" -r 16000 "LJSpeech-1.1/wavs_16k/$base"
done
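After resampling, it may be worth confirming that every converted file really is at 16 kHz. The following is a minimal sketch using only Python's standard `wave` module. The helper names `wav_sample_rate` and `files_not_at_rate` are hypothetical, and the sketch assumes the files are plain PCM WAV.

```python
import wave

def wav_sample_rate(path):
    """Return the sample rate (Hz) stored in a PCM WAV file header."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getframerate()

def files_not_at_rate(paths, expected_hz=16000):
    """List the paths whose sample rate differs from `expected_hz`."""
    return [p for p in paths if wav_sample_rate(p) != expected_hz]
```

For example, `files_not_at_rate(Path("LJSpeech-1.1/wavs_16k").glob("*.wav"))` (with `Path` imported from `pathlib`) should return an empty list if the loop above completed for every file.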
python prepare_files_for_MFA.py
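The contents of prepare_files_for_MFA.py are not shown here. MFA expects each audio file to have a transcript file with the same base name (for example a `.lab` file) in the corpus directory, so a preparation step of this kind might look roughly like the sketch below. It assumes the LJSpeech `metadata.csv` layout (pipe-separated: id, raw transcription, normalized transcription). The function name `write_lab_files` is hypothetical and may differ from what the actual script does.

```python
from pathlib import Path

def write_lab_files(metadata_path, wav_dir):
    """Write one `<utterance_id>.lab` transcript next to each WAV file.

    Assumes LJSpeech's pipe-separated metadata.csv:
    id|raw transcription|normalized transcription
    Returns the number of transcript files written.
    """
    wav_dir = Path(wav_dir)
    written = 0
    with open(metadata_path, encoding="utf-8") as metadata:
        for line in metadata:
            fields = line.rstrip("\n").split("|")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            utt_id = fields[0]
            # Prefer the normalized text (last field), fall back to raw text.
            text = fields[-1] or fields[1]
            if not (wav_dir / f"{utt_id}.wav").exists():
                continue  # no matching audio, nothing to align
            (wav_dir / f"{utt_id}.lab").write_text(text + "\n", encoding="utf-8")
            written += 1
    return written
```

A call such as `write_lab_files("LJSpeech-1.1/metadata.csv", "LJSpeech-1.1/wavs_16k")` would place the transcripts alongside the resampled audio.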
mfa model download acoustic english_us_arpa
mfa model download dictionary english_us_arpa
Optional: Check available models.
mfa model list acoustic
mfa model list dictionary
More details about MFA commands can be found in the MFA User Guide.
Details about MFA models and dictionaries can be found in the MFA Models Documentation.
mfa align LJSpeech-1.1/wavs_16k english_us_arpa english_us_arpa textgrids
After alignment is completed, the output folder textgrids will contain TextGrid files corresponding to the audio files. These files contain phoneme-level timestamp information.
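To give a sense of how the phoneme timestamps can be read back, here is a minimal sketch that extracts the intervals of one tier from the text of a TextGrid file. It assumes the long (verbose) Praat TextGrid layout, a phoneme tier named `phones`, and labels without embedded quotes. The function name `read_tier_intervals` is hypothetical, and a dedicated TextGrid library would be more robust than this regular-expression approach.

```python
import re

# One interval in the long TextGrid format: xmin, xmax, then the label.
INTERVAL_RE = re.compile(
    r'intervals \[\d+\]:\s*xmin = ([\d.]+)\s*xmax = ([\d.]+)\s*text = "([^"]*)"'
)

def read_tier_intervals(textgrid_text, tier_name="phones"):
    """Return [(start_sec, end_sec, label), ...] for the named tier.

    Returns an empty list if no tier with that name is found.
    """
    # Each tier starts with a line like `item [1]:`; the first chunk is the header.
    for tier in re.split(r"item \[\d+\]:", textgrid_text)[1:]:
        name = re.search(r'name = "([^"]*)"', tier)
        if name and name.group(1) == tier_name:
            return [
                (float(start), float(end), label)
                for start, end, label in INTERVAL_RE.findall(tier)
            ]
    return []
```

Typical use would be `read_tier_intervals(Path("textgrids/LJ001-0001.TextGrid").read_text(encoding="utf-8"))`, where the exact file name depends on the corpus.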
python draw.py
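draw.py itself is not reproduced here. One step any such visualization likely needs is pairing each codec frame with a phoneme. The sketch below labels every frame with the phoneme whose interval covers the frame centre, given intervals as (start, end, label) tuples in seconds. The function name `frame_labels` and the default of 50 frames per second are assumptions for illustration; the real frame rate depends on the codec used.

```python
def frame_labels(intervals, num_frames, frames_per_second=50.0, fill=""):
    """Give each codec frame the phoneme whose interval covers the frame centre.

    `intervals` is a list of (start_sec, end_sec, phoneme) tuples.
    Frames not covered by any interval keep the `fill` label.
    """
    labels = [fill] * num_frames
    for start, end, phoneme in intervals:
        for i in range(num_frames):
            centre = (i + 0.5) / frames_per_second  # frame midpoint in seconds
            if start <= centre < end:
                labels[i] = phoneme
    return labels

# Toy, made-up intervals: "AH" for 40 ms, then "T" for 60 ms.
labels = frame_labels([(0.0, 0.04, "AH"), (0.04, 0.1, "T")], num_frames=6)
```

The double loop is kept for readability. For long utterances, computing the first and last frame index of each interval directly would be faster.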
Below are examples of the code execution results: