Has anyone in the community attempted to use Triton for streaming tokens, similar to OpenAI's ChatGPT?

I've come across the ModelStreamInfer method exposed in the grpc_service.proto interface, but it seems to respond with all the tokens at once rather than streaming them one by one.
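Worth noting: over ModelStreamInfer, a single request only yields multiple responses when the model uses Triton's decoupled transaction policy (`model_transaction_policy { decoupled: true }` in config.pbtxt); a non-decoupled model returns one aggregated response, which matches the behavior described above. Below is a minimal client sketch using the `tritonclient` Python package, assuming a decoupled model; the model name (`llama_7b`) and tensor names (`text_input`/`text_output`) are placeholders for whatever your model exposes:

```python
# Minimal sketch of a token-streaming Triton gRPC client, assuming the
# model is deployed with `model_transaction_policy { decoupled: true }`
# in its config.pbtxt. The model name ("llama_7b") and tensor names
# ("text_input"/"text_output") are placeholders for your own model.
import queue

import numpy as np
import tritonclient.grpc as grpcclient

responses = queue.Queue()

def on_response(result, error):
    # Invoked once per streamed response (one per generated token for a
    # decoupled model), not once per request.
    responses.put(error if error is not None else result.as_numpy("text_output"))

client = grpcclient.InferenceServerClient(url="localhost:8001")
client.start_stream(callback=on_response)

prompt = np.array([b"Why is the sky blue?"], dtype=np.object_)
text_input = grpcclient.InferInput("text_input", [1], "BYTES")
text_input.set_data_from_numpy(prompt)

client.async_stream_infer(model_name="llama_7b", inputs=[text_input])

# Drain tokens as they arrive; for this sketch we stop once the stream
# goes quiet (a real client would key off the model's final-response flag).
try:
    while True:
        print(responses.get(timeout=5.0))
except queue.Empty:
    pass
finally:
    client.stop_stream()
```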
Replies: 1 comment

If you just want an inference server solution for this, you can try https://github.com/autonomi-ai/nos/tree/main/examples/tutorials/03-llm-streaming-chat. Llama 7B works out of the box locally with a gRPC streaming interface and is pretty easy to set up.
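For reference, consuming any server-streaming gRPC chat endpoint from Python follows the pattern sketched below. The `chat_pb2`/`chat_pb2_grpc` modules, `ChatService` stub, and message fields are hypothetical stand-ins, not NOS's actual interface (the linked tutorial defines the real one); only the streaming-iteration pattern carries over:

```python
# Generic pattern for consuming a server-streaming gRPC chat endpoint in
# Python. The `chat_pb2`/`chat_pb2_grpc` modules, `ChatService` stub, and
# message fields are hypothetical placeholders; the linked NOS tutorial
# defines the actual service.
import grpc

import chat_pb2
import chat_pb2_grpc

def stream_chat(prompt: str, address: str = "localhost:50051") -> None:
    with grpc.insecure_channel(address) as channel:
        stub = chat_pb2_grpc.ChatServiceStub(channel)
        # A server-streaming RPC returns an iterator of response messages,
        # so each token can be printed the moment the server emits it.
        for response in stub.Chat(chat_pb2.ChatRequest(prompt=prompt)):
            print(response.token, end="", flush=True)

if __name__ == "__main__":
    stream_chat("Why is the sky blue?")
```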