-
@raghotham Here's a simple Mermaid diagram representing my understanding of the topology you outlined above: a Llama Stack Server container, multiple Llama "mini-Stack" containers that we proxy to via the remote::passthrough provider, and a remote vLLM provider with its container thrown in for completeness. Does this represent the direction as proposed, as far as delegating actual inline provider implementations to remote containers while still allowing remote providers to be used directly within the original, user-facing Llama Stack Server container?

```mermaid
flowchart TD
    A(Incoming Inference API Request)
    B(Llama Stack Server)
    C{Internal Routing}
    D(remote::vllm provider)
    E(remote::passthrough provider)
    F(vLLM Server)
    G(Llama MiniStack Server)
    H(inline::meta-reference provider)
    I(remote::passthrough provider)
    J(Llama MiniStack Server)
    K(inline::meta-reference provider)
    subgraph lls [Llama Stack Container]
        A --> B --> C
        C -- Granite-3.2-8b-instruct --> D
        C -- Llama-3.2-3B-Instruct --> E
        C -- Llama-3.1-8B-Instruct --> I
    end
    D --> vllm
    subgraph vllm [vLLM Container]
        F
    end
    E --> llms_32
    subgraph llms_32 [3.2-3B model Container]
        G --> H
    end
    I --> llms_31
    subgraph llms_31 [3.1-8B model Container]
        J --> K
    end
```
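For concreteness, here is a rough sketch of how the routing in the diagram might be expressed in the user-facing server's run config. This is purely illustrative: the container URLs are made up and the field layout only approximates whatever schema Llama Stack actually uses.

```yaml
# Hypothetical run config sketch for the user-facing Llama Stack Server.
# Container URLs and field layout are illustrative assumptions, not the real schema.
providers:
  inference:
    - provider_id: vllm-granite
      provider_type: remote::vllm
      config:
        url: http://vllm-container:8000
    - provider_id: ministack-3b
      provider_type: remote::passthrough
      config:
        url: http://llama-ministack-3b:8321
    - provider_id: ministack-8b
      provider_type: remote::passthrough
      config:
        url: http://llama-ministack-8b:8321

models:
  - model_id: Granite-3.2-8b-instruct
    provider_id: vllm-granite
  - model_id: Llama-3.2-3B-Instruct
    provider_id: ministack-3b
  - model_id: Llama-3.1-8B-Instruct
    provider_id: ministack-8b
```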
-
@bbrowning Yes, thanks for the mermaid diagram. I think it might be even clearer to have an agent calling inference, to show how the data plane is separated from the control plane. Something like below (maybe there's a better way to render it). Thoughts?

```mermaid
flowchart TD
    A(Incoming Inference/Agent API Requests)
    B(Distro Server)
    C{Internal Routing}
    D(remote::vllm inference provider)
    aD(remote::vllm inference provider)
    E(remote::passthrough agent provider)
    F(vLLM Server)
    L(Llama Agent MiniStack Server)
    M(inline::meta-reference agent provider)
    A --> lls
    subgraph lls [Distro Container]
        B --> C
        C -- Granite-3.2-8b-instruct --> D
        C -- Agent --> E
    end
    D --> vllm
    subgraph agent [Agent Container]
        L --> M
        M --> aD
    end
    aD --> vllm
    subgraph vllm [vLLM Container]
        F
    end
    E --> agent
```
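To make the dependency-resolution point concrete: the Agent Container in this picture would carry its own small run config, resolving its Inference dependency directly against vLLM rather than calling back into the Distro Server. A minimal sketch, assuming made-up URLs and an approximate schema:

```yaml
# Hypothetical run config for the Agent MiniStack container.
# URLs and field names are illustrative assumptions.
providers:
  agents:
    - provider_id: meta-reference
      provider_type: inline::meta-reference
  inference:
    # The agent's inference dependency is resolved locally to this container,
    # pointing straight at vLLM -- no round trip through the Distro Server.
    - provider_id: vllm
      provider_type: remote::vllm
      config:
        url: http://vllm-container:8000
```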
-
One thing I am not seeing here, and that I think should be considered part of this effort, is the ability to store these servers' data externally. For better production deployments, we need to make our containers stateless: users should be able to store persistent data in an external database. Right now, if a container instance (like a Kubernetes pod) is terminated, data is lost unless a Persistent Volume is configured. Even with a PV, production setups need backups and sharding, and using existing database solutions for these features simplifies things. We should support connecting to external SQL databases, ideally all major providers, with MySQL support as a minimum.
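As a rough illustration of what this could look like in a distribution's config (the block name, `type` value, and connection fields here are assumptions, not the current Llama Stack schema):

```yaml
# Hypothetical sketch: pointing a distribution's persistent state at an
# external SQL database instead of local on-disk storage.
# All field names below are illustrative assumptions.
metadata_store:
  type: mysql
  host: db.internal.example.com
  port: 3306
  db: llama_stack
  user: llama_stack
  password: ${env.LLAMA_STACK_DB_PASSWORD}
```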
-
Is this meant to happen on the fly as requests are received, or is the idea that a few pre-decided sets of providers will be bundled up into sub-clusters at build time, based on expected use cases, and started at the same time as the main stack server? It's definitely an interesting idea and I would like to understand it better!
-
At this point, the intention is certainly for this to be static (pre-determined) and not on-the-fly. If you want to add providers (meaning code), you will need to do a re-deployment. Of course, over time one can make this fancier and allow for more dynamism.
-
Great initiative to become more modular while still enabling a local experience! I think the concept meant here by a control plane (essentially the routing part) is different from the concept of a control plane/data plane in Kubernetes.

In K8s, the control plane is for setting up and managing something that serves value to the end users of the workloads running on K8s. It doesn't run workloads but decides who goes where and keeps track of everything. I.e., the K8s API server and the scheduler are part of the control plane (they allow you to deploy your workloads), but kube-proxy and kubelet are not part of the control plane, as they are triggered to run the workloads; they are part of the data plane. This is about core K8s, but the concepts apply to platforms like LLS running on top of Kubernetes as well: controllers and operators reconciling CRDs for a specific application are part of the control plane, whereas the workload deployed on behalf of those CRDs is part of the data plane.

When mapping this to LLS, it's difficult to clearly separate those concerns, so it's hard to have clearly separated planes. For me, the control-plane of LLS is:

The data plane would be just every distribution deployed on Kubernetes, regardless of how it is internally structured. As a rule of thumb, everything on the data path (handling and responding to an HTTP request) of an API request is part of the data plane.

We could consider tighter integration into the K8s control plane by, e.g., not just running a static container image that contains a distribution, but allowing a more fine-granular deployment at the provider level. For example, one could imagine Kubernetes CRDs describing providers or distributions (a rough sketch of what such a resource might look like is below). Also, it's not really clear how to map this to a local experience (although you could, of course, evaluate such CRDs locally, outside the context of K8s).

TL;DR: I really love the approach of becoming more modular, opening the doors for external contributions that are managed outside of LLS, but I think we should be careful how we take well-established concepts in K8s and re-interpret them in the context of LLS, as it might lead to confusion.
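To illustrate the CRD idea, here is a purely hypothetical resource sketch; the API group, kind, and fields are invented for illustration and do not correspond to any existing LLS or K8s API:

```yaml
# Hypothetical provider-level CRD sketch -- group, kind, and fields are invented.
apiVersion: llamastack.example.io/v1alpha1
kind: Provider
metadata:
  name: agents-meta-reference
spec:
  api: agents
  providerType: inline::meta-reference
  # Dependencies an operator would wire up to other Provider resources.
  dependencies:
    inference: vllm-granite
    safety: llama-guard
```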
-
Problem Definition
One of Llama Stack's guiding principles has been to always allow for a simple single-node developer experience. We do this with "inline" providers, which are "linked" into the Stack server, so the functionality is available immediately without a complex deployment setup. This is rather convenient for getting started and for early iteration.
However, this is not scalable -- and certainly not advisable -- in production settings. The Stack server should act like a thin router, and complex functionality should be backed by independent containers. With remote providers this works by definition, as the implementation is hosted somewhere else. For inline providers (for example, the default agents implementation), though, this is tricky: it is not easy to separate them out and externalize them as independent containers. Beyond plain containerization, one must also allow their dependencies (for example, the agents API depends on the Inference and Safety APIs) to be correctly resolved without needing to go back to the central Stack server for each dependency.
Solution Sketch
We need the following primitives:
```yaml
{ provider_type: remote::passthrough, url: http://some_container/ }
```
The actual work involves:
This is similar to kube-proxy, which is built as part of Kubernetes.
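To make the sketch concrete, here is roughly how an inline agents provider might be swapped for the passthrough primitive in a distribution config, delegating the implementation to a separately deployed mini-Stack container. The container URL and surrounding layout are assumptions for illustration, not the exact schema.

```yaml
# Before: the agents implementation runs inline inside the Stack server.
providers:
  agents:
    - provider_id: meta-reference
      provider_type: inline::meta-reference

# After: the same API is delegated to an externalized mini-Stack container.
# The URL below is a made-up example.
providers:
  agents:
    - provider_id: agents-passthrough
      provider_type: remote::passthrough
      config:
        url: http://agents-ministack:8321
```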