This repository presents a fast and efficient speech tokenization framework based on bidirectional Mamba, designed for spoken term detection (STD). The method introduces a speech tokenizer that produces language-agnostic and speaker-independent tokens, ensuring consistent token sequences across different utterances of the same word. The repository includes the implementation, datasets, and pre-trained models.
Language-Agnostic Speech Tokenizer for Spoken Term Detection with Efficient Retrieval
Anup Singh, Kris Demuynck, Vipul Arora
Paper: https://www.isca-archive.org/interspeech_2025/singh25d_interspeech.html
git clone https://github.com/anupsingh15/LAST.git
cd LAST
conda create -n mSTD anaconda
conda activate mSTD
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install mamba-ssm
pip install "causal-conv1d>=1.4.0"
pip install tslearn
pip install -U tensorboard
pip install POT
pip install librosa
pip install npy-append-array
pip install faiss-cpu
pip install Levenshtein
To train the model, run:
python main.py
To create the database, build the index, and perform retrieval and word-pair tokenization, see: demo/
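As an illustrative sketch (not the repo's actual API), the core retrieval idea is that the tokenizer maps every utterance of the same word to a near-identical discrete token sequence, so spoken term detection reduces to comparing token sequences. The example below uses a plain Levenshtein edit distance over hypothetical token sequences; the actual pipeline in demo/ builds a FAISS index for scalable search, and the token values and word labels here are made up for clarity.

```python
# Illustrative only: token-sequence matching for spoken term detection.
# Real token sequences would come from the LAST tokenizer; the values
# and labels below are hypothetical.

def edit_distance(a, b):
    """Levenshtein distance between two token sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

# Hypothetical database: word label -> token sequence for one utterance.
database = {
    "hello": [12, 7, 7, 44, 3],
    "world": [9, 15, 22, 22, 6],
}

# A different utterance of "hello" yields a near-identical token sequence,
# so it is closest in edit distance to the stored "hello" entry.
query = [12, 7, 44, 3]
best = min(database, key=lambda w: edit_distance(database[w], query))
print(best)  # -> hello
```

In the full system, a FAISS index prunes the database to a small candidate set before any sequence-level comparison, so the edit-distance step only runs on a handful of candidates rather than every stored utterance.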
- Dataset: Kathbath Word Alignments
- Pre-trained Models: Download from Google Drive
If you find our work useful, please cite:
@inproceedings{singh25d_interspeech,
  title     = {{Language-Agnostic Speech Tokenizer for Spoken Term Detection with Efficient Retrieval}},
  author    = {Anup Singh and Kris Demuynck and Vipul Arora},
  year      = {2025},
  booktitle = {Interspeech 2025},
  pages     = {2630--2634},
  doi       = {10.21437/Interspeech.2025-2722},
  issn      = {2958-1796},
}
👉 You may also check out our earlier work on the Monolingual Speech Tokenizer:
BEST-STD: Bidirectional Mamba-Enhanced Speech Tokenization for Spoken Term Detection
Anup Singh, Kris Demuynck, Vipul Arora
Paper: https://ieeexplore.ieee.org/abstract/document/10889633
We are actively working on enhancing this method. Stay tuned for upcoming improvements, including:
- More efficient tokens
- Improved token consistency across different noise conditions