
Commit 6390127

add TensorRT-LLM backend
1 parent 15419cf · commit 6390127

34 files changed: +774 −356 lines changed

README.MD

Lines changed: 10 additions & 10 deletions
@@ -18,16 +18,16 @@
 
 ## ✨ Feature Highlights
 
-|     | Feature | Description |
-|-----|-------------|----------------------------------------------------|
-| 🚀 | **Multi-backend Acceleration** | Supports multiple high-performance inference engines such as `vllm`, `sglang`, `llama-cpp`, and `mlx-lm` |
-| 🎯 | **High Concurrency** | Dynamic batching and asynchronous queues handle heavy traffic with ease |
-| 🎛️ | **Full Parameter Control** | Adjust pitch, speaking rate, temperature, emotion tags, and more |
-| 📱 | **Lightweight Deployment** | Built on FastAPI; start with a single command; minimal dependencies |
-| 🔊 | **Long-form Synthesis** | Supports very long texts with a consistent voice throughout |
-| 🔄 | **Streaming TTS** | Play audio while it is being generated, reducing wait time and improving interactivity |
-| 🎭 | **Multi-character Dialog** | Synthesize multiple roles within the same text; ideal for script dubbing |
-| 🎨 | **Modern Frontend** | Web-ready interface |
+|     | Feature | Description |
+|-----|-------------|-------------------------------------------------------------------|
+| 🚀 | **Multi-backend Acceleration** | Supports multiple high-performance inference engines such as `vllm`, `sglang`, `llama-cpp`, `mlx-lm`, and `tensorrt-llm` |
+| 🎯 | **High Concurrency** | Dynamic batching and asynchronous queues handle heavy traffic with ease |
+| 🎛️ | **Full Parameter Control** | Adjust pitch, speaking rate, temperature, emotion tags, and more |
+| 📱 | **Lightweight Deployment** | Built on FastAPI; start with a single command; minimal dependencies |
+| 🔊 | **Long-form Synthesis** | Supports very long texts with a consistent voice throughout |
+| 🔄 | **Streaming TTS** | Play audio while it is being generated, reducing wait time and improving interactivity |
+| 🎭 | **Multi-character Dialog** | Synthesize multiple roles within the same text; ideal for script dubbing |
+| 🎨 | **Modern Frontend** | Web-ready interface |
 
 ## 🖼️ Frontend Demo

README_EN.MD

Lines changed: 11 additions & 10 deletions
@@ -19,16 +19,16 @@
 
 ## ✨ Highlights
 
-|     | Feature                        | Description |
-|-----|--------------------------------|------------------------------------------------------------------------------------------------|
-| 🚀  | **Multi-backend Acceleration** | Supports high-performance inference engines like `vllm`, `sglang`, `llama-cpp`, `mlx-lm`, etc. |
-| 🎯  | **High Concurrency**           | Dynamic batching and asynchronous queues to handle heavy traffic with ease |
-| 🎛️ | **Full Parameter Control**     | Adjust pitch, speaking rate, temperature, emotion tags, and more |
-| 📱  | **Lightweight Deployment**     | Built on FastAPI—start with a single command; minimal dependencies |
-| 🔊  | **Long-form Synthesis**        | Supports very long texts while maintaining consistent voice quality |
-| 🔄  | **Streaming TTS**              | Generate and play audio in real time; reduces wait time, enhances interactivity |
-| 🎭  | **Multi-character Dialog**     | Synthesize multiple roles within the same text—ideal for script dubbing |
-| 🎨  | **Modern Frontend**            | Web-ready, responsive interface |
+|     | Feature                        | Description |
+|-----|--------------------------------|---------------------------------------------------------------------------------------------------------------|
+| 🚀  | **Multi-backend Acceleration** | Supports high-performance inference engines like `vllm`, `sglang`, `llama-cpp`, `mlx-lm`, `tensorrt-llm`, etc. |
+| 🎯  | **High Concurrency**           | Dynamic batching and asynchronous queues to handle heavy traffic with ease |
+| 🎛️ | **Full Parameter Control**     | Adjust pitch, speaking rate, temperature, emotion tags, and more |
+| 📱  | **Lightweight Deployment**     | Built on FastAPI—start with a single command; minimal dependencies |
+| 🔊  | **Long-form Synthesis**        | Supports very long texts while maintaining consistent voice quality |
+| 🔄  | **Streaming TTS**              | Generate and play audio in real time; reduces wait time, enhances interactivity |
+| 🎭  | **Multi-character Dialog**     | Synthesize multiple roles within the same text—ideal for script dubbing |
+| 🎨  | **Modern Frontend**            | Web-ready, responsive interface |
 
 ## 🖼️ Frontend Demo

@@ -134,6 +134,7 @@ pip install flashtts
 
 For detailed installation steps, please refer to: [installation guide](docs/zh/get_started/installation.md)
 
 Local inference command:
+
 ```bash
 flashtts infer \
   -i "hello world." \

docs/en/get_started/installation.md

Lines changed: 142 additions & 65 deletions
@@ -1,95 +1,172 @@
 ## Flash-TTS Installation Guide
 
 > This document provides a detailed walkthrough for installing and deploying the Flash-TTS inference engine, including
-> environment requirements, model weight downloads, and dependency installation.
+> environment requirements, model weight downloads, and dependency installation steps.
 
 ---
 
 ### Environment Requirements
 
-- **Python**: 3.10+
-- **Operating System**: Linux x86_64, macOS, or Windows (WSL2 recommended)
-- **Required Dependencies**:
-  - `fastapi`
-  - One of the supported inference backends: `vllm`, `sglang`, `llama-cpp-python`, `mlx-lm`
+* **Python**: Version 3.10 or above
+* **Operating System**: Linux x86_64, macOS, or Windows (WSL2 is recommended)
+* **Required Dependencies**:
+
+  * `fastapi`
+  * At least one inference backend: `vllm`, `sglang`, `llama-cpp-python`, `mlx-lm`, or `tensorrt-llm`
 
 ---
 
-### Downloading Model Weights
+### Model Weight Downloads
 
-| Model | HuggingFace | ModelScope | GGUF |
-|:-----:|:-----------:|:----------:|:----:|
-| Spark-TTS | [SparkAudio/Spark-TTS-0.5B](https://huggingface.co/SparkAudio/Spark-TTS-0.5B) | [SparkAudio/Spark-TTS-0.5B](https://modelscope.cn/models/SparkAudio/Spark-TTS-0.5B) | [SparkTTS-LLM-GGUF](https://huggingface.co/mradermacher/SparkTTS-LLM-GGUF) |
-| Orpheus-TTS | [canopylabs/orpheus-3b-0.1-ft](https://huggingface.co/canopylabs/orpheus-3b-0.1-ft) & [hubertsiuzdak/snac_24khz](https://huggingface.co/hubertsiuzdak/snac_24khz) | [canopylabs/orpheus-3b-0.1-ft](https://modelscope.cn/models/canopylabs/orpheus-3b-0.1-ft) | [orpheus-gguf](https://huggingface.co/isaiahbjork/orpheus-3b-0.1-ft-Q4_K_M-GGUF) |
-| Orpheus-TTS (Multilingual) | [orpheus-multilingual-research-release](https://huggingface.co/collections/canopylabs/orpheus-multilingual-research-release-67f5894cd16794db163786ba) & [hubertsiuzdak/snac_24khz](https://huggingface.co/hubertsiuzdak/snac_24khz) | - | - |
-| MegaTTS3 | [ByteDance/MegaTTS3](https://huggingface.co/ByteDance/MegaTTS3) | - | - |
+| Model | Hugging Face | ModelScope | GGUF |
+|:-----:|:------------:|:----------:|:----:|
+| Spark-TTS | [SparkAudio/Spark-TTS-0.5B](https://huggingface.co/SparkAudio/Spark-TTS-0.5B) | [SparkAudio/Spark-TTS-0.5B](https://modelscope.cn/models/SparkAudio/Spark-TTS-0.5B) | [SparkTTS-LLM-GGUF](https://huggingface.co/mradermacher/SparkTTS-LLM-GGUF) |
+| Orpheus-TTS | [canopylabs/orpheus-3b-0.1-ft](https://huggingface.co/canopylabs/orpheus-3b-0.1-ft) & [hubertsiuzdak/snac\_24khz](https://huggingface.co/hubertsiuzdak/snac_24khz) | [canopylabs/orpheus-3b-0.1-ft](https://modelscope.cn/models/canopylabs/orpheus-3b-0.1-ft) | [orpheus-gguf](https://huggingface.co/isaiahbjork/orpheus-3b-0.1-ft-Q4_K_M-GGUF) |
+| Orpheus-TTS (Multilingual) | [orpheus-multilingual-research-release](https://huggingface.co/collections/canopylabs/orpheus-multilingual-research-release-67f5894cd16794db163786ba) & [hubertsiuzdak/snac\_24khz](https://huggingface.co/hubertsiuzdak/snac_24khz) | - | - |
+| MegaTTS3 | [ByteDance/MegaTTS3](https://huggingface.co/ByteDance/MegaTTS3) | - | - |
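The guide lists the weight repositories but no download command. As a convenience, a minimal sketch using the Hugging Face CLI; the tool choice is an assumption, since the guide itself does not prescribe one:

```bash
# Sketch: download the Spark-TTS weights from the table above with the
# Hugging Face CLI; any equivalent download method works just as well.
pip install "huggingface_hub[cli]"
huggingface-cli download SparkAudio/Spark-TTS-0.5B --local-dir Spark-TTS-0.5B
```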

 ---
 
-### Installing Dependencies
-
-#### 1. Install `torch` and `torchaudio`
+### Dependency Installation
 
-Visit the [official PyTorch website](https://pytorch.org/get-started/locally/) to find the correct install command for
-your environment. Make sure to check your CUDA version and other device details.
+#### 1. Install PyTorch
 
-For example, with CUDA 12.4:
+Visit the [official PyTorch website](https://pytorch.org/get-started/locally/) to get the installation command suitable
+for your system and CUDA version.
+For example, for CUDA 12.4:
 
 ```bash
 pip install torch==2.6.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
 ```
 
-#### 2. Install `flashtts`
+#### 2. Install Flash-TTS
+
+* Install via pip:
+
+  ```bash
+  pip install flashtts
+  ```
+
+* Or install from source:
 
-- **pip**
-  ```bash
-  pip install flashtts
-  ```
-
-- **source code**
-  ```bash
-  git clone https://github.com/HuiResearch/FlashTTS.git
-  cd FlashTTS
-  pip install .
-  ```
+  ```bash
+  git clone https://github.com/HuiResearch/FlashTTS.git
+  cd FlashTTS
+  pip install .
+  ```
 
-If you encounter an error installing `WeTextProcessing` in a Windows environment due to the need for a VS C++ compiler, you can first install `pynini==2.1.6` using `conda`:
+> **Windows User Notice**:
+> If you encounter compilation errors when installing `WeTextProcessing`, you can install the dependency via Conda first:
 
 ```bash
 conda install -c conda-forge pynini==2.1.6
 pip install WeTextProcessing==1.0.4.1
 ```
 
#### 3. Install Inference Backend (Choose One)
64-
65-
- **vLLM** (version > 0.7.2)
66-
67-
Default installation command for CUDA 12.4:
68-
```bash
69-
pip install vllm
70-
```
71-
For other CUDA versions, refer to: https://docs.vllm.ai/en/latest/getting_started/installation.html
72-
73-
- **llama-cpp-python**
74-
```bash
75-
pip install llama-cpp-python
76-
```
77-
- If using GGUF weights, place `model.gguf` in the `checkpoints/<model>/LLM/` directory.
78-
- To convert manually:
79-
```bash
80-
git clone https://github.com/ggml-org/llama.cpp.git
81-
cd llama.cpp
82-
python convert_hf_to_gguf.py Spark-TTS-0.5B/LLM --outfile Spark-TTS-0.5B/LLM/model.gguf
83-
```
84-
85-
- **sglang**
86-
```bash
87-
pip install sglang
88-
```
89-
Reference: https://docs.sglang.ai/start/install.html
90-
91-
- **mlx-lm** (for Apple Silicon macOS)
92-
```bash
93-
pip install mlx-lm
94-
```
95-
Reference: https://github.com/ml-explore/mlx-lm
66+
---
67+
68+
### Install Inference Backends (choose as needed)
69+
70+
#### vLLM (Recommended)
71+
72+
* Version ≥ 0.7.2 is required. For CUDA 12.4:
73+
74+
```bash
75+
pip install vllm
76+
```
77+
78+
* For other versions, refer to
79+
the [vLLM official documentation](https://docs.vllm.ai/en/latest/getting_started/installation.html)
80+
81+
---
82+
83+
#### llama-cpp-python
84+
85+
```bash
86+
pip install llama-cpp-python
87+
```
88+
89+
* If using GGUF format weights, place the `model.gguf` file under the `checkpoints/<model>/LLM/` directory.
90+
* To convert weights, use the following commands:
91+
92+
```bash
93+
git clone https://github.com/ggml-org/llama.cpp.git
94+
cd llama.cpp
95+
python convert_hf_to_gguf.py Spark-TTS-0.5B/LLM --outfile Spark-TTS-0.5B/LLM/model.gguf
96+
```
97+
98+
---
99+
100+
#### sglang
101+
102+
```bash
103+
pip install sglang
104+
```
105+
106+
* For more information, refer to the [sglang installation guide](https://docs.sglang.ai/start/install.html)
107+
108+
---
109+
110+
#### mlx-lm (Apple Silicon Only)
111+
112+
```bash
113+
pip install mlx-lm
114+
```
115+
116+
* More info: [mlx-lm GitHub project](https://github.com/ml-explore/mlx-lm)
117+
118+
---
119+
120+
#### TensorRT-LLM
121+
122+
Example for CUDA 12.4:
123+
124+
```bash
125+
pip install tensorrt-llm --extra-index-url https://pypi.nvidia.com --extra-index-url https://download.pytorch.org/whl/cu124
126+
```
127+
128+
> **Notes**:
129+
>
130+
> * TensorRT-LLM on Windows currently supports only Python 3.10.
131+
> * Latest supported version for Windows: 0.16.0 (as of 2025-05-05)
132+
> * Details: [NVIDIA PyPI Repository](https://pypi.nvidia.com)
133+
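Given the notes above, a Windows install would pin the last supported release; a minimal sketch, assuming the 0.16.0 ceiling still holds:

```bash
# Windows (Python 3.10): pin to the last Windows-supported release noted
# above; check the NVIDIA PyPI index for the current version ceiling.
pip install tensorrt-llm==0.16.0 --extra-index-url https://pypi.nvidia.com --extra-index-url https://download.pytorch.org/whl/cu124
```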
134+
+Verify the installation:
+
+```bash
+python -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
+```
+
+##### Convert LLM Weights to a TensorRT Engine
+
+1. Refer to the official model conversion docs:
+   [https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/)
+2. Choose the appropriate model type (e.g., Spark-TTS uses `qwen`).
+3. After conversion, rename the output engine folder to `tensorrt-engine` and move it into the model directory, for example:
+
+   ```bash
+   Spark-TTS-0.5B/LLM/tensorrt-engine
+   ```
+
+Flash-TTS can then load the converted model for inference.
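For Spark-TTS, step 2 points at TensorRT-LLM's Qwen example. A sketch of what the conversion could look like, assuming the example-script layout linked in step 1 (`convert_checkpoint.py` plus `trtllm-build`); script paths and flags shift between TensorRT-LLM releases, so consult the linked docs for the exact invocation:

```bash
# Sketch only: follows the TensorRT-LLM Qwen example; paths and flags vary
# by TensorRT-LLM version (older releases keep it under examples/qwen).
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM/examples/models/core/qwen

# 1) Convert the Hugging Face checkpoint to TensorRT-LLM checkpoint format.
python convert_checkpoint.py \
    --model_dir /path/to/Spark-TTS-0.5B/LLM \
    --output_dir ./tllm_checkpoint \
    --dtype float16

# 2) Build the engine, writing it where Flash-TTS expects it (see step 3).
trtllm-build \
    --checkpoint_dir ./tllm_checkpoint \
    --output_dir /path/to/Spark-TTS-0.5B/LLM/tensorrt-engine
```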
+---
+
+### Backend Support Matrix
+
+| Inference Backend | Linux | Windows | macOS | Notes |
+|-------------------|:-----:|:-------:|:-----:|------------------------------------------------------|
+| `vllm`            |   ✅   |    ❌    |   ❌   | Linux-only, requires CUDA                            |
+| `sglang`          |   ✅   |    ❌    |   ❌   | Linux-only, supports most GPUs                       |
+| `tensorrt-llm`    |   ✅   |   ⚠️    |   ❌   | Windows supports Python 3.10 only, version ≤ 0.16.0  |
+| `llama-cpp`       |   ✅   |    ✅    |   ✅   | GGUF format supported, cross-platform                |
+| `mlx-lm`          |   ❌   |    ❌    |   ✅   | macOS only (Apple Silicon)                           |
+| `torch`           |   ✅   |    ✅    |   ✅   | Core dependency, supported on all platforms          |
+
+> ⚠️ **Notes**:
+>
+> * On Windows, **WSL2** is recommended for full Linux feature support.
+> * On macOS, `mlx-lm` is not available on non-Apple Silicon machines.

docs/en/get_started/quick_start.md

Lines changed: 1 addition & 0 deletions
@@ -36,6 +36,7 @@ flashtts infer \
 | `-b, --backend` | `str` |  | Yes | Inference backend: `llama-cpp, vllm, sglang, mlx-lm, torch` |
 | `--lang` | `str` | `None` | No | Language type for OrpheusTTS, e.g., `mandarin, english, french`, etc. |
 | `--snac_path` | `str` | `None` | No | Path to SNAC module for OrpheusTTS |
+| `--llm_tensorrt_path` | `str` | `None` | No | Path to the TensorRT model. Only effective when the backend is set to `tensorrt-llm`. If not provided, defaults to `{model_path}/tensorrt-engine` |
 | `--llm_device` | `str` | `auto` | No | Device for LLM computation: `cpu` or `cuda` |
 | `--tokenizer_device` | `str` | `auto` | No | Device for audio tokenizer |
 | `--detokenizer_device` | `str` | `auto` | No | Device for audio detokenizer |
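Combined with the new parameter, a hypothetical `flashtts infer` call on the `tensorrt-llm` backend could look like the following; the model-path flag (written as `-m` here) is assumed from the quick-start guide rather than shown in this diff:

```bash
# Hypothetical invocation: -m (model path) is assumed; when
# --llm_tensorrt_path is omitted, {model_path}/tensorrt-engine is used.
flashtts infer \
  -i "hello world." \
  -m Spark-TTS-0.5B \
  -b tensorrt-llm \
  --llm_tensorrt_path Spark-TTS-0.5B/LLM/tensorrt-engine
```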
