I've been searching for a way to efficiently run models across my GPUs (specifically two A4500s with NVLink). I started with llama.cpp, but the performance was disappointing even with --split-mode row. Feeling that I was leaving performance on the table, I began exploring alternatives.
My search led me to vLLM, and while I appreciate the vLLM team's efforts, I found the project lacking in stability. It seemed plagued by bugs, with some features working for certain models but not others, and I hit issues that kept every model I needed (Granite, Gemma, Llama, and Mistral) from working as expected. It felt like a never-ending game of whack-a-mole: each problem I solved just led to another popping up.
My fortunes changed when I discovered TGI. Running a model worked seamlessly on my first attempt, generation was fast and smooth, and tool calling worked out of the box with no extra configuration. This has been the most reliable and efficient way I've found to run LLMs. It's clear a lot of thought went into designing TGI, and the polish shows. Kudos to everyone who brought this project to life.
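For anyone with a similar two-GPU setup who wants to try it, this is roughly the kind of launch command I mean; the exact model ID and volume path are just placeholders, so adjust them for your own machine:

```bash
# Sketch of a TGI launch sharded across two GPUs via the official Docker image.
# --num-shard 2 splits the model across both cards; the container serves on port 80 internally.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v "$PWD/data:/data" \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.3 \
  --num-shard 2
```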