This repository provides instructions to reproduce the results from the Interspeech 2025 paper *Echoes of Phonetics: Unveiling Relevant Acoustic Cues for ASR via Feature Attribution*. The study analyzes feature attribution explanations on the TIMIT corpus using a state-of-the-art Conformer-based ASR model and the SPES explainability framework.
**Audio Conversion**

The first step involves converting the TIMIT audio files from SPH to WAV format using the `audio_conversion.py` script.
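The actual interface is defined by `audio_conversion.py` itself; purely as an illustration of what this step entails, here is a minimal sketch using the `soundfile` library (libsndfile can decode NIST SPHERE, the container TIMIT ships in). The paths and directory layout are assumptions, not the script's real arguments.

```python
import os
import soundfile as sf  # libsndfile decodes NIST SPHERE, the format TIMIT uses

def convert_sph_to_wav(timit_root: str, out_root: str) -> None:
    """Recursively convert every SPHERE file under timit_root to RIFF WAV."""
    for dirpath, _, filenames in os.walk(timit_root):
        for name in filenames:
            # TIMIT ships SPHERE data, sometimes with a .wav extension
            if not name.lower().endswith((".sph", ".wav")):
                continue
            src = os.path.join(dirpath, name)
            audio, sr = sf.read(src)  # decodes the SPHERE container
            dst_dir = os.path.join(out_root, os.path.relpath(dirpath, timit_root))
            os.makedirs(dst_dir, exist_ok=True)
            dst = os.path.join(dst_dir, os.path.splitext(name)[0] + ".wav")
            sf.write(dst, audio, sr, subtype="PCM_16")  # standard 16-bit RIFF WAV

convert_sph_to_wav("TIMIT/", "TIMIT_wav/")  # hypothetical input/output roots
```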
**Generate TSV Metadata**

Use the `generate_tsv.py` script to generate a TSV file that includes information about each WAV file, along with its orthographic and phonetic annotations.
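Again, the actual columns are defined by `generate_tsv.py`; the sketch below only illustrates the idea, pairing each WAV with its TIMIT `.TXT` (orthographic) and `.PHN` (phonetic) files. The column names and ID scheme are hypothetical.

```python
import csv
import glob
import os

def build_tsv(wav_root: str, tsv_path: str) -> None:
    """Write one row per WAV file with its orthographic and phonetic annotations."""
    with open(tsv_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["id", "audio", "orthographic", "phonetic"])  # hypothetical columns
        for wav in sorted(glob.glob(os.path.join(wav_root, "**", "*.wav"), recursive=True)):
            stem = os.path.splitext(wav)[0]
            with open(stem + ".TXT") as g:  # TIMIT .TXT: "<start> <end> <sentence>"
                text = " ".join(g.read().split()[2:])
            with open(stem + ".PHN") as g:  # TIMIT .PHN: one "<start> <end> <phone>" per line
                phones = " ".join(line.split()[2] for line in g)
            writer.writerow([os.path.relpath(stem, wav_root), wav, text, phones])

build_tsv("TIMIT_wav/", "timit.tsv")
```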
**Preprocessing**

Run the `generate_fbanks.py` script to preprocess the data. This will produce a `${DATA_FILENAME}.tsv` file containing the processed metadata. The `${DATA_FOLDER}` variable should point to the directory where the output TSV file will be saved. Make sure this path is correctly set, as it will be used by subsequent scripts to locate the preprocessed data.
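The features themselves are computed by `generate_fbanks.py`. As an indication of what this preprocessing typically involves for a Conformer ASR model, here is a sketch extracting 80-dimensional log Mel filterbanks with `torchaudio`; the exact parameters used in the paper are those in the script, not these.

```python
import torchaudio
from torchaudio.compliance.kaldi import fbank

# Assumption: 80-dim log Mel filterbanks, a common input representation for
# Conformer ASR; the actual parameters are defined in generate_fbanks.py.
waveform, sample_rate = torchaudio.load("TIMIT_wav/TRAIN/DR1/FCJF0/SA1.wav")
features = fbank(waveform, num_mel_bins=80, sample_frequency=sample_rate)
print(features.shape)  # (num_frames, 80)
```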
**Merge Annotations**

For the analyses, it is useful to group all orthographic and phonetic annotations (originally stored in one file per sample) into two JSON files. This can be done using the `merge_annotations.py` script (a minimal sketch of the idea follows this list), which generates:

- `${JSON_ORTHOGRAPHIC}`: a single JSON file containing all orthographic annotations.
- `${JSON_PHONETIC}`: a single JSON file containing all phonetic annotations.
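As with the other steps, the authoritative logic lives in `merge_annotations.py`; this sketch only shows the general merging pattern, and the JSON structure (a flat `{sample_id: annotation}` mapping) is an assumption.

```python
import glob
import json
import os

def merge_annotations(annotation_root: str, pattern: str, out_json: str) -> None:
    """Collect one annotation file per sample into a single {sample_id: text} JSON."""
    merged = {}
    for path in sorted(glob.glob(os.path.join(annotation_root, "**", pattern), recursive=True)):
        sample_id = os.path.relpath(os.path.splitext(path)[0], annotation_root)
        with open(path) as f:
            merged[sample_id] = f.read().strip()
    with open(out_json, "w") as f:
        json.dump(merged, f, indent=2)

merge_annotations("TIMIT_wav/", "*.TXT", "orthographic.json")  # ${JSON_ORTHOGRAPHIC}
merge_annotations("TIMIT_wav/", "*.PHN", "phonetic.json")      # ${JSON_PHONETIC}
```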
To generate saliency maps, we use the FBK-fairseq repository, which must be installed beforehand. For detailed instructions on how to run SPES, refer to the XAI Feature Attribution README in the repository.
We begin by running standard inference to obtain the model's transcriptions:
```bash
python /path/to/FBK-fairseq/fairseq_cli/generate.py ${DATA_FOLDER} \
    --gen-subset ${DATA_FILENAME} \
    --user-dir examples/speech_to_text \
    --max-tokens 40000 \
    --model-overrides "{'batch_unsafe_relative_shift':False}" \
    --config-yaml config_generate.yaml \
    --beam 5 \
    --task speech_to_text_ctc \
    --criterion ctc_multi_loss \
    --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --no-repeat-ngram-size 5 \
    --path ${MODEL_CHECKPOINT} > ${TRANSCRIPTION_OUTPUT}
```
```bash
# Extract the hypothesis lines (H-<id>\t<score>\t<text>) from the fairseq output,
# sort them by sample id, and keep only the transcription text
egrep '^H' ${TRANSCRIPTION_OUTPUT} | cut -d"-" -f2- | sort -n | cut -f3 > hyp1
# Prepend the tgt_text column header
echo "tgt_text" | cat - hyp1 > hyp2
# Rebuild the TSV with the model's transcriptions in the tgt_text column
paste <(cut -f1-4 ${DATA_FOLDER}/${DATA_FILENAME}.tsv) <(cut -f1 hyp2) <(cut -f6 ${DATA_FOLDER}/${DATA_FILENAME}.tsv) > ${DATA_FOLDER}/${DATA_FILENAME}_explain.tsv
rm -f hyp1 hyp2
```
Here, `${MODEL_CHECKPOINT}` refers to the path of the pretrained ASR model; pretrained models can be obtained from the SBAAM repository. The resulting transcriptions are saved to `${TRANSCRIPTION_OUTPUT}` and then post-processed into `${DATA_FILENAME}_explain.tsv`. This processed file is required for generating feature attribution maps.
Next, compute the original token probabilities needed by SPES, and store them in `${ORIG_PROBS}`:
```bash
python /path/to/FBK-fairseq/examples/speech_to_text/get_probs_from_constrained_decoding.py ${DATA_FOLDER} \
    --gen-subset ${DATA_FILENAME}_explain \
    --user-dir examples/speech_to_text \
    --max-tokens 10000 \
    --config-yaml config_explain.yaml \
    --task speech_to_text_ctc \
    --model-overrides "{'batch_unsafe_relative_shift':False}" \
    --criterion ctc_multi_loss \
    --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --path ${MODEL_CHECKPOINT} \
    --save-file ${ORIG_PROBS}
```
Now generate the saliency heatmaps, which will be saved in `${SALIENCY_MAPS}`:
```bash
python /path/to/FBK-fairseq/examples/speech_to_text/generate_occlusion_explanation.py ${DATA_FOLDER} \
    --gen-subset ${DATA_FILENAME}_explain \
    --user-dir examples/speech_to_text \
    --max-tokens 160000 \
    --num-workers 1 \
    --config-yaml config_explain.yaml \
    --perturb-config perturb_config.yaml \
    --task speech_to_text_ctc \
    --criterion ctc_multi_loss \
    --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --no-repeat-ngram-size 5 \
    --path ${MODEL_CHECKPOINT} \
    --original-probs ${ORIG_PROBS} \
    --save-file ${SALIENCY_MAPS}
```
All results and plots included in the paper can be reproduced by running the `analyses.ipynb` notebook.
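The notebook operates on the saliency maps produced above. Independently of the actual file format of `${SALIENCY_MAPS}` (which is determined by FBK-fairseq), a time-frequency saliency map can be visualized as follows, assuming it has already been loaded as a NumPy array:

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical example: a saliency map over filterbank features, shaped
# (time_frames, mel_bins). Loading the real data from ${SALIENCY_MAPS}
# depends on the storage format used by FBK-fairseq.
saliency = np.random.rand(300, 80)

plt.imshow(saliency.T, origin="lower", aspect="auto", cmap="viridis")
plt.xlabel("Time frames")
plt.ylabel("Mel bins")
plt.colorbar(label="Attribution score")
plt.title("Saliency map for one utterance")
plt.show()
```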
```bibtex
@inproceedings{fucci-et-al-2025-unveiling,
    title = "Echoes of Phonetics: Unveiling Relevant Acoustic Cues for ASR via Feature Attribution",
    author = "Fucci, Dennis and Gaido, Marco and Negri, Matteo and Cettolo, Mauro and Bentivogli, Luisa",
    booktitle = "Proc. of Interspeech 2025",
    year = "2025",
    address = "Rotterdam, The Netherlands"
}
```