I've been searching for a way to efficiently run models across my GPUs (specifically two A4500s with NVLink). I started with llama.cpp, but the performance was disappointing even with --split-mode row. Feeling that I was leaving performance on the table, I began exploring alternatives.
My search led me to vLLM, and while I appreciate the vLLM team's efforts, I found the project lacking in stability. It seemed plagued by bugs, with some features working for certain models but not others, and I hit issues that kept every model I needed (Granite, Gemma, Llama, and Mistral) from working as expected. It felt like a never-ending game of whack-a-mole: each problem I solved just led to another popping up.
My fortunes changed when I discovered TGI. Running a model worked seamlessly on my first attempt, generation was fast and smooth, and tool calling worked out of the box with no extra configuration. This has been the most reliable and efficient way I've found to run LLMs. It's clear a lot of thought went into designing TGI, and the polish shows. Kudos to everyone who brought this project to life.
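For anyone with a similar two-GPU setup who wants to try it, this is roughly the kind of launch command I mean; the exact model ID and volume path are just placeholders, so adjust them for your own machine:

```bash
# Sketch of a TGI launch sharded across two GPUs via the official Docker image.
# --num-shard 2 splits the model across both cards; the container serves on port 80 internally.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v "$PWD/data:/data" \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.3 \
  --num-shard 2
```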