Created by Zixiang Wan, 2025
Visualize the relationships between phonemes and codec tokens in a specialized speech dataset.
Use Montreal Forced Aligner (MFA) to obtain phoneme timestamps.
- Co-occurrence heatmap
- t-SNE visualization
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar -xjvf LJSpeech-1.1.tar.bz2

conda install -c conda-forge montreal-forced-aligner

Note: This step may take a while.
MFA only supports 16 kHz audio, so resample the LJSpeech recordings first.
mkdir -p LJSpeech-1.1/wavs_16k
for file in LJSpeech-1.1/wavs/*.wav; do
    base=$(basename "$file")
    sox "$file" -r 16000 "LJSpeech-1.1/wavs_16k/$base"
done

python prepare_files_for_MFA.py

mfa model download acoustic english_us_arpa
mfa model download dictionary english_us_arpa

Optional: Check available models.
mfa model list acoustic
mfa model list dictionary

More details about MFA commands can be found in the MFA User Guide.
Details about MFA models and dictionaries can be found in the MFA Models Documentation.
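
MFA pairs each audio file with a transcript file of the same base name (.lab or .txt). The contents of prepare_files_for_MFA.py are not shown here, so the following is only a sketch of what such a preparation step might look like. It assumes the standard LJSpeech metadata.csv layout (id|raw text|normalized text); the function name write_lab_files and the paths in the usage example are hypothetical.

```python
import csv
from pathlib import Path


def write_lab_files(metadata_csv, wav_dir):
    """Write one .lab transcript next to each wav listed in the metadata.

    Hypothetical helper, not the actual prepare_files_for_MFA.py.
    Returns the number of transcripts written.
    """
    wav_dir = Path(wav_dir)
    written = 0
    with open(metadata_csv, encoding="utf-8", newline="") as f:
        # LJSpeech metadata is pipe-separated and unquoted.
        reader = csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 2:
                continue  # skip malformed or empty lines
            utt_id = row[0]
            # Prefer the normalized text (third column) when present.
            text = row[2] if len(row) > 2 and row[2] else row[1]
            if not (wav_dir / f"{utt_id}.wav").exists():
                continue  # no audio for this entry, nothing to align
            lab_path = wav_dir / f"{utt_id}.lab"
            lab_path.write_text(text.strip() + "\n", encoding="utf-8")
            written += 1
    return written


if __name__ == "__main__":
    # Assumed paths, matching the directory names used above.
    count = write_lab_files("LJSpeech-1.1/metadata.csv", "LJSpeech-1.1/wavs_16k")
    print(f"wrote {count} transcripts")
```

Writing the transcripts into the same folder as the resampled audio lets a single corpus directory be passed to the aligner.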
mfa align LJSpeech-1.1/wavs_16k english_us_arpa english_us_arpa textgrids

When alignment finishes, the output folder textgrids contains one TextGrid file per audio file. These files hold phoneme-level timestamp information.
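
To relate these timestamps to codec tokens, each codec frame needs a phoneme label. The sketch below shows one way this could be done; it is not taken from draw.py. It assumes the phone intervals have already been read from a TextGrid as (start, end, label) tuples, that the codec emits one token per frame at a fixed frame rate (the 50 Hz used in the example is an arbitrary placeholder), and that empty labels mean silence. The function names are hypothetical.

```python
from collections import Counter


def frame_phonemes(intervals, num_frames, frame_rate, silence="sil"):
    """Label each codec frame with the phoneme active at the frame centre.

    intervals: list of (start_sec, end_sec, phone) tuples.
    Frames not covered by any interval, or covered by an empty label,
    are assigned the silence label.
    """
    labels = []
    for i in range(num_frames):
        t = (i + 0.5) / frame_rate  # centre time of frame i in seconds
        label = silence
        for start, end, phone in intervals:
            if start <= t < end:
                label = phone or silence
                break
        labels.append(label)
    return labels


def cooccurrence(labels, tokens):
    """Count how often each (phoneme, codec token) pair occurs together."""
    if len(labels) != len(tokens):
        raise ValueError("labels and tokens must have the same length")
    return Counter(zip(labels, tokens))


if __name__ == "__main__":
    # Toy values for illustration only.
    intervals = [(0.0, 0.04, "HH"), (0.04, 0.10, "AH0")]
    tokens = [7, 7, 3, 3, 7, 1]
    labels = frame_phonemes(intervals, num_frames=len(tokens), frame_rate=50)
    for (phone, token), count in sorted(cooccurrence(labels, tokens).items()):
        print(phone, token, count)
```

Counts accumulated this way over all utterances form the phoneme-by-token matrix that a co-occurrence heatmap would display.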
python draw.py

Below are examples of the code execution results:

