-
@raghotham Here's a simple Mermaid diagram representing my understanding of the topology you outlined above: a Llama Stack Server container, multiple Llama "mini-Stack" containers that we proxy to via the remote::passthrough provider, and a remote vLLM provider with its container thrown in for completeness. Does this represent the direction as proposed, as far as delegating actual inline provider implementations to remote containers while still allowing remote providers to be used directly within the original, user-facing Llama Stack Server container?

```mermaid
flowchart TD
    A(Incoming Inference API Request)
    B(Llama Stack Server)
    C{Internal Routing}
    D(remote::vllm provider)
    E(remote::passthrough provider)
    F(vLLM Server)
    G(Llama MiniStack Server)
    H(inline::meta-reference provider)
    I(remote::passthrough provider)
    J(Llama MiniStack Server)
    K(inline::meta-reference provider)
    subgraph lls [Llama Stack Container]
        A --> B --> C
        C -- Granite-3.2-8b-instruct --> D
        C -- Llama-3.2-3B-Instruct --> E
        C -- Llama-3.1-8B-Instruct --> I
    end
    D --> vllm
    subgraph vllm [vLLM Container]
        F
    end
    E --> llms_32
    subgraph llms_32 [3.2-3B model Container]
        G --> H
    end
    I --> llms_31
    subgraph llms_31 [3.1-8B model Container]
        J --> K
    end
```
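For concreteness, here is a rough sketch of how the routing in the diagram might be expressed in the user-facing server's run config. This is purely illustrative: the container URLs are made up and the field layout only approximates whatever schema Llama Stack actually uses.

```yaml
# Hypothetical run config sketch for the user-facing Llama Stack Server.
# Container URLs and field layout are illustrative assumptions, not the real schema.
providers:
  inference:
    - provider_id: vllm-granite
      provider_type: remote::vllm
      config:
        url: http://vllm-container:8000
    - provider_id: ministack-3b
      provider_type: remote::passthrough
      config:
        url: http://llama-ministack-3b:8321
    - provider_id: ministack-8b
      provider_type: remote::passthrough
      config:
        url: http://llama-ministack-8b:8321

models:
  - model_id: Granite-3.2-8b-instruct
    provider_id: vllm-granite
  - model_id: Llama-3.2-3B-Instruct
    provider_id: ministack-3b
  - model_id: Llama-3.1-8B-Instruct
    provider_id: ministack-8b
```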
-
@bbrowning Yes, thanks for the mermaid diagram. I think it might be even clearer to have an agent calling inference, to show how the data plane is separated from the control plane. Something like below (maybe there's a better way to render it). Thoughts?

```mermaid
flowchart TD
    A(Incoming Inference/Agent API Requests)
    B(Distro Server)
    C{Internal Routing}
    D(remote::vllm inference provider)
    aD(remote::vllm inference provider)
    E(remote::passthrough agent provider)
    F(vLLM Server)
    L(Llama Agent MiniStack Server)
    M(inline::meta-reference agent provider)
    A --> lls
    subgraph lls [Distro Container]
        B --> C
        C -- Granite-3.2-8b-instruct --> D
        C -- Agent --> E
    end
    D --> vllm
    subgraph agent [Agent Container]
        L --> M
        M --> aD
    end
    aD --> vllm
    subgraph vllm [vLLM Container]
        F
    end
    E --> agent
```
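To make the dependency-resolution point concrete: the Agent Container in this picture would carry its own small run config, resolving its Inference dependency directly against vLLM rather than calling back into the Distro Server. A minimal sketch, assuming made-up URLs and an approximate schema:

```yaml
# Hypothetical run config for the Agent MiniStack container.
# URLs and field names are illustrative assumptions.
providers:
  agents:
    - provider_id: meta-reference
      provider_type: inline::meta-reference
  inference:
    # The agent's inference dependency is resolved locally to this container,
    # pointing straight at vLLM -- no round trip through the Distro Server.
    - provider_id: vllm
      provider_type: remote::vllm
      config:
        url: http://vllm-container:8000
```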
-
One thing I am not seeing here, and that I think should be considered part of this effort, is the ability to store these servers' data externally. For better production deployments, we need to make our containers stateless: users should be able to store persistent data in an external database. Right now, if a container instance (like a Kubernetes pod) is terminated, data is lost unless a Persistent Volume is configured. Even with a PV, production setups need backups and sharding, and using existing database solutions for these features simplifies things. We should support connecting to external SQL databases, ideally all major providers, with MySQL support as a minimum.
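As a rough illustration of what this could look like in a distribution's config (the block name, `type` value, and connection fields here are assumptions, not the current Llama Stack schema):

```yaml
# Hypothetical sketch: pointing a distribution's persistent state at an
# external SQL database instead of local on-disk storage.
# All field names below are illustrative assumptions.
metadata_store:
  type: mysql
  host: db.internal.example.com
  port: 3306
  db: llama_stack
  user: llama_stack
  password: ${env.LLAMA_STACK_DB_PASSWORD}
```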
-
Is this meant to happen on the fly as requests are received, or is the idea that a few pre-decided sets of providers will be bundled up into sub-clusters at build time, based on expected use cases, and started at the same time as the main stack server? It's definitely an interesting idea and I would like to understand it better!
-
At this point, the intention is certainly for this to be static (pre-determined) and not on-the-fly. If you want to add providers (meaning code), you will need to do a re-deployment. Of course, over time one can make this fancier and allow for more dynamism.
-
Great initiative to become more modular while still enabling a local experience! I think the concept meant here by a control plane (essentially the routing part) is different from the concept of a control plane/data plane in Kubernetes.

In K8s, the control plane is for setting up and managing something that serves value to the end users of the workloads running on K8s. It doesn't run workloads but decides who goes where and keeps track of everything. I.e., the K8s API server and the scheduler are part of the control plane (they allow you to deploy your workloads), but kube-proxy and kubelet are not part of the control plane, as they are triggered to run the workloads; they are part of the data plane. This is about core K8s, but the concepts apply to platforms like LLS running on top of Kubernetes as well: controllers and operators reconciling CRDs for a specific application are part of the control plane, whereas the workload deployed on behalf of those CRDs is part of the data plane.

When mapping this to LLS, it's difficult to clearly separate those concerns, so it's hard to have clearly separated planes. For me, the control-plane of LLS is:

The data plane would be just every distribution deployed on Kubernetes, regardless of how it is internally structured. As a rule of thumb, everything on the data path (handling and responding to an HTTP request) of an API request is part of the data plane.

We could consider tighter integration into the K8s control plane by, e.g., not just running a static container image that contains a distribution, but allowing a more fine-granular deployment at the provider level. For example, one could imagine Kubernetes CRDs describing providers or distributions (a rough sketch of what such a resource might look like is below). Also, it's not really clear how to map this to a local experience (although you could, of course, evaluate such CRDs locally, outside the context of K8s).

TL;DR: I really love the approach of becoming more modular, opening the doors for external contributions that are managed outside of LLS, but I think we should be careful how we take well-established concepts in K8s and re-interpret them in the context of LLS, as it might lead to confusion.
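To illustrate the CRD idea, here is a purely hypothetical resource sketch; the API group, kind, and fields are invented for illustration and do not correspond to any existing LLS or K8s API:

```yaml
# Hypothetical provider-level CRD sketch -- group, kind, and fields are invented.
apiVersion: llamastack.example.io/v1alpha1
kind: Provider
metadata:
  name: agents-meta-reference
spec:
  api: agents
  providerType: inline::meta-reference
  # Dependencies an operator would wire up to other Provider resources.
  dependencies:
    inference: vllm-granite
    safety: llama-guard
```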
-
Problem Definition
One of Llama Stack's guiding principles has been to always allow for a simple single-node developer experience. We do this with "inline" providers, which are "linked" into the Stack server, so the functionality is available immediately without a complex deployment setup. This is rather convenient for getting started and for early iteration.
However, this is not scalable -- and certainly not advisable -- in production settings. The Stack server should act like a thin router, and complex functionality should be backed by independent containers. With remote providers this works by definition, as the implementation is hosted somewhere else. For inline providers (for example, the default agents implementation), though, this is tricky: it is not easy to separate them out and externalize them as independent containers. Beyond plain containerization, one must also allow their dependencies (for example, the agents API depends on the Inference and Safety APIs) to be correctly resolved without needing to go back to the central Stack server for each dependency.
Solution Sketch
We need the following primitives:
```yaml
{ provider_type: remote::passthrough, url: http://some_container/ }
```
The actual work involves:
This is similar to kube-proxy, which is built as part of Kubernetes.
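To make the sketch concrete, here is roughly how an inline agents provider might be swapped for the passthrough primitive in a distribution config, delegating the implementation to a separately deployed mini-Stack container. The container URL and surrounding layout are assumptions for illustration, not the exact schema.

```yaml
# Before: the agents implementation runs inline inside the Stack server.
providers:
  agents:
    - provider_id: meta-reference
      provider_type: inline::meta-reference

# After: the same API is delegated to an externalized mini-Stack container.
# The URL below is a made-up example.
providers:
  agents:
    - provider_id: agents-passthrough
      provider_type: remote::passthrough
      config:
        url: http://agents-ministack:8321
```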