WEST (We Speech Toolkit): LLM-based Speech Toolkit for Speech Understanding, Generation, and Interaction.
- Fully LLM-based: Standing on the shoulders of giants by reusing mature architectures, ecosystems (e.g., Hugging Face), and methods (e.g., sequence packing) from large models; see the packing sketch after this list.
- Full-stack: Supports tasks such as recognition, synthesis, understanding, dialogue, and multimodal capabilities, with extensibility to incorporate open-source models.
- Simple and Stupid: A simple and stupid speech toolkit that everyone can Touch.
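To give a concrete flavor of the "methods from large models" point above, here is a minimal, self-contained sketch of sequence packing. It is a generic illustration of the technique, not WEST's actual implementation, and every name in it is ours:

```python
# Minimal sketch of sequence packing: concatenate variable-length token
# sequences into fixed-size rows so little compute is wasted on padding.
# Generic illustration only -- not WEST's actual implementation.
from typing import List, Tuple

def pack_sequences(seqs: List[List[int]], max_len: int,
                   pad_id: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Greedily pack token sequences into rows of length max_len.

    Returns (tokens, segment_ids) pairs; segment_ids mark sequence
    boundaries so attention can later be restricted to each segment.
    """
    assert all(len(s) <= max_len for s in seqs), "each sequence must fit in one row"
    rows, tokens, segments, seg = [], [], [], 1
    for seq in seqs:
        if len(tokens) + len(seq) > max_len:  # row is full: pad and flush it
            pad = max_len - len(tokens)
            rows.append((tokens + [pad_id] * pad, segments + [0] * pad))
            tokens, segments, seg = [], [], 1
        tokens += seq
        segments += [seg] * len(seq)
        seg += 1
    if tokens:  # flush the final, partially filled row
        pad = max_len - len(tokens)
        rows.append((tokens + [pad_id] * pad, segments + [0] * pad))
    return rows

# Three short sequences packed into rows of length 8.
print(pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_len=8))
```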
```
conda create -n west python=3.10
conda activate west
pip install -r requirements.txt
```
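After installation, a quick sanity check that the environment resolved correctly. Assuming `torch` and `transformers` are among the pinned requirements is our guess, not something stated by the toolkit itself:

```python
# Sanity check for the "west" env. The assumption that torch and
# transformers are pinned in requirements.txt is ours.
import torch
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
```

The built-in models, tasks, and recipes: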
| Task | Model | Recipe |
|---|---|---|
| Speech Recognition | TouchASU (Built-in) | aishell |
| Speech Synthesis | TouchTTS (Built-in) | libritts |
| Speech QA | TouchASU (Built-in) | belle_1.4M_qa |
| Speech Interaction | TouchChat (Built-in) | |
| Multimodal Interaction | TouchOmni (Built-in) | |
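Since WEST reuses the Hugging Face ecosystem, the Touch* models pair a speech front end with an off-the-shelf LLM backbone. As a generic illustration of pulling such a backbone from the Hub (the checkpoint name is an arbitrary placeholder of ours, not the one WEST ships with):

```python
# Generic illustration of reusing a Hugging Face LLM backbone.
# The checkpoint below is an arbitrary placeholder, not WEST's choice.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# In an LLM-based speech model, speech embeddings would be fused with the
# text prompt upstream; here we only exercise the text side.
inputs = tokenizer("Briefly define speech recognition.", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```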
Our paper is available on arXiv, and you can cite it as:
```bibtex
@misc{zhang2025westllmbasedspeech,
  title={WEST: LLM based Speech Toolkit for Speech Understanding, Generation, and Interaction},
  author={Binbin Zhang and Chengdong Liang and Shuai Wang and Xuelong Geng and Zhao Guo and Haoyu Li and Hao Yin and Xipeng Yang and Pengshen Zhang and Changwei Ma and Lei Xie},
  year={2025},
  eprint={2509.19902},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.19902},
}
```
We have created a WeChat group for discussion and quicker responses. Please scan the personal QR code on the left; its owner will invite you to the chat group. You can also scan the QR code on the right to follow the official account of the WeNet Community.