|
1 | | -# TH TTS |
| 1 | +# Thai TTS (TH TTS) |
2 | 2 |
|
3 | | -## How to run |
| 3 | +## Model Attribution |
| 4 | + |
| 5 | +All model weights are provided by [VIZINTZOR](https://huggingface.co/VIZINTZOR) via Hugging Face: |
| 6 | + |
| 7 | +- **VITS Thai Female/Male**: |
| 8 | + [MMS-TTS-THAI-FEMALEV2](https://huggingface.co/VIZINTZOR/MMS-TTS-THAI-FEMALEV2), |
| 9 | + [MMS-TTS-THAI-MALEV2](https://huggingface.co/VIZINTZOR/MMS-TTS-THAI-MALEV2) |
| 10 | +- **F5-TTS Thai**: |
|    11 | +  [F5-TTS-THAI](https://huggingface.co/VIZINTZOR/F5-TTS-THAI), |
| 12 | + [F5-TTS-TH-V2](https://huggingface.co/VIZINTZOR/F5-TTS-TH-V2) |
| 13 | + |
| 14 | +Please acknowledge and cite VIZINTZOR if you use these models in your work. |
| 15 | + |
| 16 | +--- |
| 17 | + |
| 18 | +## Recommended Model |
| 19 | + |
| 20 | +**For best quality and performance, use F5-TTS v1.** |
| 21 | + |
| 22 | +--- |
| 23 | + |
| 24 | +## How to Run |
| 25 | + |
| 26 | +You can run the server using either direct `uv` commands or the provided `entrypoint.sh` script (recommended for Docker and easy switching). |
| 27 | + |
| 28 | +### 1. Using `uv` Directly |
| 29 | + |
| 30 | +#### VITS Thai (Female/Male) |
4 | 31 |
|
5 | 32 | ```bash |
6 | | -uv run python src/wyoming_thai_vits.py --log-level DEBUG --host 0.0.0.0 --port 10200 \ |
| 33 | +uv run python src/wyoming_thai_vits.py --log-level INFO --host 0.0.0.0 --port 10200 \ |
7 | 34 | --model-id VIZINTZOR/MMS-TTS-THAI-FEMALEV2 |
8 | 35 |
|
9 | | -uv run python src/wyoming_thai_vits.py --log-level DEBUG --host 0.0.0.0 --port 10200 \ |
| 36 | +uv run python src/wyoming_thai_vits.py --log-level INFO --host 0.0.0.0 --port 10200 \ |
10 | 37 | --model-id VIZINTZOR/MMS-TTS-THAI-MALEV2 |
11 | 38 | ``` |
12 | 39 |
|
13 | | -## How to test |
14 | | - |
15 | | -### tool |
| 40 | +#### F5-TTS Thai v1 (**Recommended**) |
16 | 41 |
|
17 | 42 | ```bash |
18 | | -go install github.com/john-pettigrew/wyoming-cli@latest |
| 43 | +uv run python src/wyoming_thai_f5.py --log-level INFO --host 0.0.0.0 --port 10200 \ |
| 44 | + --model-version v1 |
19 | 45 | ``` |
20 | 46 |
|
21 | | -### info |
| 47 | +#### F5-TTS Thai v2 |
| 48 | + |
22 | 49 | ```bash |
23 | | -printf '{"type":"describe","data":{}}\n' | nc 127.0.0.1 10200 |
| 50 | +uv run python src/wyoming_thai_f5.py --log-level INFO --host 0.0.0.0 --port 10200 \ |
| 51 | + --model-version v2 |
24 | 52 | ``` |
25 | 53 |
|
26 | | -### synth |
27 | | -> Connect to HA seems to work much better, wyoming-cli only managed to get describe, so just let people in UFW |
| 54 | +### 2. Using `entrypoint.sh` (Recommended) |
| 55 | + |
| 56 | +Set the backend via `THTTS_BACKEND` environment variable: |
| 57 | + |
| 58 | +- `VITS` for VITS model |
| 59 | +- `F5_V1` for F5-TTS v1 (**recommended**) |
| 60 | +- `F5_V2` for F5-TTS v2 |
| 61 | + |
| 62 | +Example: |
| 63 | + |
28 | 64 | ```bash |
29 | | -sudo ufw allow 10200/tcp |
30 | | -sudo ufw delete allow 10200/tcp |
| 65 | +THTTS_BACKEND=F5_V1 ./entrypoint.sh |
31 | 66 | ``` |
32 | 67 |
|
33 | | -```bash |
34 | | -wyoming-cli tts -voice-name 'thai-female' -addr 'localhost:10200' -text 'สวัสดีชาวโลก' -output_file './hello.wav' |
| 68 | +You can override other parameters via environment variables (see below). |
| 69 | + |
| 70 | +--- |
| 71 | + |
| 72 | +## Environment Variables |
| 73 | + |
| 74 | +| Variable | Default Value | Description | |
| 75 | +|-----------------------|-----------------------------------------------|--------------------------------------------------| |
| 76 | +| `THTTS_BACKEND` | `VITS` | Model backend: `VITS`, `F5_V1`, or `F5_V2` | |
| 77 | +| `THTTS_HOST` | `0.0.0.0` | Bind address | |
| 78 | +| `THTTS_PORT` | `10200` | Port to listen on | |
| 79 | +| `THTTS_LOG_LEVEL` | `INFO` | Log level (`DEBUG`, `INFO`, etc.) | |
| 80 | +| `THTTS_MODEL` | `VIZINTZOR/MMS-TTS-THAI-FEMALEV2` | VITS model ID | |
| 81 | +| `THTTS_REF_AUDIO` | `hf_sample` | F5 reference audio path | |
| 82 | +| `THTTS_REF_TEXT` | *(empty)* | F5 reference transcript | |
| 83 | +| `THTTS_DEVICE` | `auto` | `auto`, `cpu`, or `cuda` | |
| 84 | +| `THTTS_SPEED` | `1.0` | F5 speech speed multiplier | |
| 85 | +| `THTTS_NFE_STEPS` | `32` | F5 denoising steps | |
| 86 | +| `THTTS_MAX_CONCURRENT`| `1` | Max concurrent synth requests | |
| 87 | +| `THTTS_CKPT_FILE` | *(auto-selected by backend)* | F5 checkpoint file path | |
| 88 | +| `THTTS_VOCAB_FILE` | *(auto-selected by backend)* | F5 vocab file path | |
| 89 | + |
| 90 | + |
|    91 | +### 3. Docker Compose (NVIDIA GPU) |
| 92 | + |
| 93 | +```yaml |
| 94 | +services: |
| 95 | + thtts: |
| 96 | + image: ghcr.io/zen3515/thtts:latest |
| 97 | + container_name: thtts |
| 98 | + restart: unless-stopped |
|    99 | +    shm_size: "2g" # adjust to your hardware |
| 100 | + environment: |
| 101 | + - THTTS_BACKEND=F5_V1 |
| 102 | + - THTTS_HOST=0.0.0.0 |
| 103 | + - THTTS_PORT=10200 |
| 104 | + - THTTS_LOG_LEVEL=INFO |
| 105 | + - THTTS_DEVICE=auto |
| 106 | + - NVIDIA_VISIBLE_DEVICES=all |
| 107 | + - NVIDIA_DRIVER_CAPABILITIES=compute,utility |
| 108 | + ports: |
| 109 | + - "10200:10200" |
| 110 | + deploy: |
| 111 | + resources: |
| 112 | + reservations: |
| 113 | + devices: |
| 114 | + - driver: nvidia |
| 115 | + count: all |
| 116 | + capabilities: [gpu] |
35 | 117 | ``` |
36 | 118 |
|
| 119 | +**Note:** |
| 120 | +- Make sure you have [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) installed. |
| 121 | +- Adjust the `THTTS_BACKEND` and other environment variables as needed. |
| 122 | + |
| 123 | +--- |
| 124 | + |
| 125 | +## How to Test |
| 126 | + |
| 127 | +### Query Info |
| 128 | + |
37 | 129 | ```bash |
38 | | -( printf '{"type":"synthesize","data":{"text":"สวัสดีครับ ยินดีที่ได้รู้จัก","voice":"thai-female"}}\n'; ) \ |
39 | | -| nc 127.0.0.1 10200 \ |
40 | | -| tee responses.ndjson \ |
41 | | -| jq -r 'select(.type=="audio-start") or select(.type=="audio-chunk") or select(.type=="audio-stop")' > audio_events.ndjson |
42 | | - |
43 | | -# Extract audio chunks (base64) -> raw PCM |
44 | | -jq -r 'select(.type=="audio-chunk") | .data.audio' audio_events.ndjson | base64 -d > out.pcm |
45 | | - |
46 | | -# Convert PCM (s16le, 22.05kHz, mono) -> WAV (use either ffmpeg or sox) |
47 | | -ffmpeg -f s16le -ar 22050 -ac 1 -i out.pcm out.wav -y |
48 | | -# or: |
49 | | -sox -t raw -r 22050 -e signed -b 16 -c 1 out.pcm out.wav |
50 | | -``` |
| 130 | +printf '{"type":"describe","data":{}}\n' | nc 127.0.0.1 10200 |
| 131 | +``` |
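The reply arrives as a single `info` event on one JSON line. As a small parsing sketch, assuming the standard Wyoming info shape (`data.tts[].voices[].name`); `list_voices` is an invented helper, not part of this repo:

```python
import json


def list_voices(info_line: str) -> list:
    """Return voice names from a Wyoming ``info`` event line.

    Assumes the usual Wyoming shape: data.tts[].voices[].name.
    """
    event = json.loads(info_line)
    if event.get("type") != "info":
        raise ValueError("expected an 'info' event, got %r" % event.get("type"))
    names = []
    for program in event.get("data", {}).get("tts", []):
        for voice in program.get("voices", []):
            names.append(voice.get("name"))
    return names
```

Piping the `nc` output above through this function should print the advertised voice names.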
| 132 | + |
| 133 | +### Synthesize Speech |
| 134 | + |
|   135 | +The easiest way to test synthesis is to connect the server to Home Assistant, whose Wyoming integration is the most spec-complete client available. |
| 136 | + |
| 137 | +--- |
| 138 | + |
| 139 | + |
| 140 | +## License |
| 141 | + |
| 142 | +See individual model pages on Hugging Face for license details. |