- RTX 4090, CUDA 12.8
- 200 ms TTFB (time to first byte) at FP16 using vLLM
- <160 ms TTFB at FP16 using TRT-LLM
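To reproduce the TTFB measurement, curl's `time_starttransfer` timer reports the time to first byte of a streamed response. A minimal sketch; the endpoint path and request body are assumptions here, so adjust them to whatever route the server actually exposes:

```bash
# %{time_starttransfer} is curl's time-to-first-byte, in seconds.
# Endpoint and payload are placeholders -- match them to your server's API.
curl -s -o /dev/null \
  -w "TTFB: %{time_starttransfer}s\n" \
  -X POST http://localhost:9090/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello there", "stream": true}'
```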
sudo apt-get -y install libopenmpi-dev
conda create -n trt python=3.10 && conda activate trt
or use a virtual env with Python 3.10
pip install -r requirements.txt
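Before moving on, it's worth confirming the environment resolved correctly (a quick sanity check, nothing here is specific to this repo):

```bash
# Confirm the env is on Python 3.10 and TensorRT-LLM imports cleanly.
python --version
python -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```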
- update start.sh with your Hugging Face token (see the sketch below)
bash start.sh
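start.sh itself is not reproduced here; the token edit is roughly the following, with HF_TOKEN as an assumed variable name, so match whatever name the script actually reads:

```bash
# Hypothetical sketch of the token line in start.sh -- use the variable
# name your copy of the script expects.
export HF_TOKEN="hf_xxx"                   # your Hugging Face access token
huggingface-cli login --token "$HF_TOKEN"  # optional: persist the login
```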
- start a pod with an RTX 4090 GPU and the image
runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04
- ensure port 9090 is open if you want to access the server externally
bash install.sh
or copy-paste the contents of install.sh into the container's entrypoint (see the Docker sketch below)
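For reference, the same setup can be reproduced outside RunPod with plain Docker. A sketch, assuming install.sh and start.sh live in the current directory:

```bash
# Run the same image locally: expose port 9090 and chain the install and
# start scripts as the container's command.
docker run --gpus all -p 9090:9090 \
  -v "$PWD":/workspace -w /workspace \
  runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04 \
  bash -c "bash install.sh && bash start.sh"
```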
The HF repo of the model, canopylabs/orpheus-3b-0.1-ft, contains optimiser / FSDP files that are not required for inference. TRT-LLM nevertheless downloads all of the files, so ensure you have enough storage (~100 GB) available on the pod / machine. Alternatively, stop the process midway (after the tokeniser and model safetensors are downloaded), clean up the extra files, set export TRANSFORMERS_OFFLINE=1 in start.sh, and start the process again.
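A sketch of that cleanup, assuming the default Hugging Face cache location; the optimiser / FSDP filename patterns are guesses, so inspect the snapshot directory before deleting anything:

```bash
# HF caches repos under ~/.cache/huggingface/hub; snapshot entries are
# symlinks into blobs/, so resolve each link before deleting to actually
# free space. Filename patterns below are assumptions -- check ls first.
REPO=~/.cache/huggingface/hub/models--canopylabs--orpheus-3b-0.1-ft
ls -lh "$REPO"/snapshots/*/

for f in "$REPO"/snapshots/*/optimizer* "$REPO"/snapshots/*/*fsdp*; do
  [ -e "$f" ] || continue
  rm -f "$(readlink -f "$f")" "$f"   # blob first, then the symlink
done

# Then force offline mode (the line to add to start.sh) and rerun:
export TRANSFORMERS_OFFLINE=1
```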