Has anyone in the community attempted to use Triton for streaming tokens, similar to OpenAI's ChatGPT?

I've come across the ModelStreamInfer method exposed in the grpc_service.proto interface, but it seems to respond with all the tokens at once rather than streaming them one by one.
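Worth noting: over ModelStreamInfer, a single request only yields multiple responses when the model uses Triton's decoupled transaction policy (`model_transaction_policy { decoupled: true }` in config.pbtxt); a non-decoupled model returns one aggregated response, which matches the behavior described above. Below is a minimal client sketch using the `tritonclient` Python package, assuming a decoupled model; the model name (`llama_7b`) and tensor names (`text_input`/`text_output`) are placeholders for whatever your model exposes:

```python
# Minimal sketch of a token-streaming Triton gRPC client, assuming the
# model is deployed with `model_transaction_policy { decoupled: true }`
# in its config.pbtxt. The model name ("llama_7b") and tensor names
# ("text_input"/"text_output") are placeholders for your own model.
import queue

import numpy as np
import tritonclient.grpc as grpcclient

responses = queue.Queue()

def on_response(result, error):
    # Invoked once per streamed response (one per generated token for a
    # decoupled model), not once per request.
    responses.put(error if error is not None else result.as_numpy("text_output"))

client = grpcclient.InferenceServerClient(url="localhost:8001")
client.start_stream(callback=on_response)

prompt = np.array([b"Why is the sky blue?"], dtype=np.object_)
text_input = grpcclient.InferInput("text_input", [1], "BYTES")
text_input.set_data_from_numpy(prompt)

client.async_stream_infer(model_name="llama_7b", inputs=[text_input])

# Drain tokens as they arrive; for this sketch we stop once the stream
# goes quiet (a real client would key off the model's final-response flag).
try:
    while True:
        print(responses.get(timeout=5.0))
except queue.Empty:
    pass
finally:
    client.stop_stream()
```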
Replies: 1 comment

If you just want an inference server solution for this, you can try https://github.com/autonomi-ai/nos/tree/main/examples/tutorials/03-llm-streaming-chat. Llama 7B works out of the box locally with a gRPC streaming interface and is pretty easy to set up.
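For reference, consuming any server-streaming gRPC chat endpoint from Python follows the pattern sketched below. The `chat_pb2`/`chat_pb2_grpc` modules, `ChatService` stub, and message fields are hypothetical stand-ins, not NOS's actual interface (the linked tutorial defines the real one); only the streaming-iteration pattern carries over:

```python
# Generic pattern for consuming a server-streaming gRPC chat endpoint in
# Python. The `chat_pb2`/`chat_pb2_grpc` modules, `ChatService` stub, and
# message fields are hypothetical placeholders; the linked NOS tutorial
# defines the actual service.
import grpc

import chat_pb2
import chat_pb2_grpc

def stream_chat(prompt: str, address: str = "localhost:50051") -> None:
    with grpc.insecure_channel(address) as channel:
        stub = chat_pb2_grpc.ChatServiceStub(channel)
        # A server-streaming RPC returns an iterator of response messages,
        # so each token can be printed the moment the server emits it.
        for response in stub.Chat(chat_pb2.ChatRequest(prompt=prompt)):
            print(response.token, end="", flush=True)

if __name__ == "__main__":
    stream_chat("Why is the sky blue?")
```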