Idea/Feature Request: ramalama to run AI models: Adopt basetenlabs/truss model packaging system to create containers with the model + code + dependency #1165
Replies: 17 comments
-
I agree, but we also have this problem right now:
-
Point taken. You cannot force everyone to accept a single standard. @ericcurtin There seems to be demand for a simple model consumption mechanism to run your models, and things like Ollama and RamaLama seem to be living proof of that. So packaging a model with everything it needs to run using Truss might be simple and flexible enough for the community to pick up, and not just for the LLMs you would typically run with llama.cpp or vLLM, but also embedding, reranker, classifier models, and so on.
-
@bmahabirbu @benoitf WDYT?
-
I see this area exploding and crying out for a standard. Docker just came out with a new way of shipping models, using artifacts, I believe. If OCI standardized on something, or a major player stamped their approval on one approach, then we could go with a de facto standard. For now it is a mess.
-
Just FYI: what I propose is an attempt to standardize the packaging of a model together with code (or the runners) and its dependencies/environment into a single package. Containerization seems like a natural fit for this, and the proposed truss tool already does exactly that.
-
I care more about standardization of the format. Where do the model files sit? What are the annotations describing the model? Can we split the model files into different layers or different artifacts under the same OCI name? I have users who want to pull just the tensors without pulling the entire model.
-
This is an interesting idea. I don't know how the basetenlabs/truss packaging system works in its entirety, but it sounds extremely similar to how ai-lab-recipes packages and serves models configurable via a YAML file. As others have said, there are a lot of competing standards for packaging the models themselves, let alone configurable dependencies.

It would be really nice to have this kind of standard on the developer end for sharing configurations, but it becomes more complex when we talk about maintainability and community adoption in such a fast-growing field. We are at a point where the AI pipeline is becoming holistic in nature, so every part should be easily editable. How rigid do we want this standard to be before the community runs into limitations that prevent them from making necessary changes? We never know what the community will need or what technology gets thrown out in the process.

As a whole, I agree that containers are the way to go for something like this! However, I don't think it's our place to create/manage an all-in-one standard; rather, we should seek to create a method that's easy to change when new tech comes out. Container technology as it is right now works for our current development, but things will change as we move forward.

Thanks for bringing up the idea, these are just my thoughts! Maybe as things move forward, I'll have a clearer picture of how we can achieve this!
-
@rhatdan If you'd like to discuss model architecture and file format standardization, this issue is probably not the best place to do so; I suggest raising it with HuggingFace on their Discord server. This feature request is not about model standardization, but only about standardizing the packaging of whatever files, formats, and architectures there are, together with whichever code and dependencies the model developers deem necessary.
-
Imagine we use a container: we can pack the model into it, add the code to run the model, create an environment with Python (or anything else), and install attention modules and other dependencies the code needs. The container would not care about the model's file format, how many files it has, or its architecture; it would just pack everything. Suddenly, the container itself becomes a universal, shareable, reproducible form of delivering models that does not limit the model creators in their expressiveness and need for innovation; it lets them do their thing however they think is best. Yet for the consumer it is just one container with everything needed to run the model, without digging into the details of how to run it. And to make packaging easy, shareable, and reproducible, I propose using the existing basetenlabs/truss tool, which can take a YAML file and create such a container, so the burden on the model developers would be minimal and we can simply update the container if needed.
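To make this concrete, here is a rough sketch of what such a config.yaml could look like; the field names follow my reading of the truss documentation and may differ between truss versions, and the embedding model and pinned versions are only placeholders:

```yaml
# Illustrative truss config.yaml sketch; names and values are placeholders
# based on the truss docs, not a tested configuration.
model_name: minilm-embedder          # hypothetical name for this package
python_version: py311                # Python environment baked into the image
requirements:                        # pip dependencies installed at build time
  - sentence-transformers==2.7.0
  - torch==2.3.0
resources:                           # what the serving container needs
  cpu: "1"
  memory: 2Gi
  use_gpu: false
system_packages: []                  # OS-level packages, if the model needs any
```

truss builds a container image from this description plus the model code, so sharing the YAML (or the resulting image) is all a consumer would need.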
-
Well, that is where we differ: I don't think the model and the model runtime should be packaged together. The model should be usable with multiple model runtimes: whisper-serve, llama-serve, vllm, ...
-
Models can be mounted into containers so they can be used with different packaged runtimes.
-
Yes, but if you adopt truss it's very simple to have as many packages with different runtimes as you wish. The runtimes can be layers of your container image, so once you have downloaded one model packaged like this and need the same model with a different runtime/inference engine, the download is very fast: only the difference, the layer that is missing. When you decouple the model from the inference engine, you open up the potential for bugs where your model stops working. Also, when you think of models you mostly think of LLMs, while I was thinking of models in a broader sense, not necessarily LLMs, often requiring a custom inference runtime provided by the model developer. Embedding, reranking, classifier models and so on will never work with llama.cpp or vLLM. And even newer LLMs are often unsupported for a long time, while if they are packaged with the custom code the developer provided, you can package and start using them on day one, exposing a much larger user base to the new model without waiting (and hoping) for llama.cpp or vLLM to support it. Hope this makes sense.
-
First of all, I understand containers; I was one of the creators of Podman and the entire github.com/containers world. We don't use the "docker" adjective to describe container technologies either. :^) The point is that once you embed software into an OCI model image, you need to deal with CVEs and support headaches, like Red Hat only being able to support RHEL-based images and Ubuntu only wanting to support Ubuntu-based images. We don't want to have to update HUGE images every time a fundamental script has a CVE. The things you described above can also be done by separating the runtime from the models, without an explosion in the number of images necessary. If we package models with runtimes, you end up supporting (#MODELS * #RUNTIMES) images; when they are separate you support (#MODELS + #RUNTIMES) images. With, say, 50 models and 4 runtimes, that is 200 bundled images versus 54 separate ones.
-
^ The example we use: say we used a database instead of an AI model to retrieve information. We don't mix the database data into the mariadb container. I mean, you can, but the benefits don't seem worth it.
-
If containers are such a problem, why is the whole world packing apps into containers? The same applies to models: they can be packed and updated via containers perfectly well if done right. If you know containers and how image layers work, then even with a 1 TB container, when you update a small layer with a library to fix a CVE or to replace the inference engine, you know perfectly well you'll be downloading only that changed layer, a few megabytes (assuming you packed the model weights as the top layer, as it should be done). Then there is simply no problem: no problem downloading, no problem replacing the runtime, and no problem updating.

The reason you don't see packaging models as useful is 1) you are tech-savvy people who know how to take a model from HF and run it, and what's easy for you is not easy for most people; and 2) you are focusing on a subset of models, namely LLMs only. What most people complain about is that they just want a way to download a model (any model, not just an LLM) and run it like an EXE file on Windows. Currently, when they download a random model from HF, it doesn't run with llama.cpp: it's in some weird new format or some not-yet-supported architecture, or it requires some library or module that isn't released yet and has to be compiled, or it needs environment settings and files the developer forgot to include, or only explained in a way non-tech-savvy people do not understand. There's this unnecessary complexity and chaos that is part of the creative process of building a model, which is fine for tech-savvy people such as yourselves, but it can be hidden from the average person who simply needs to run the model. And I thought RamaLama was here to solve that. That's why I proposed this idea here.
-
In this space we typically rely on Podman Desktop, FWIW.
-
If someone wants to use truss to download and execute the model, that is fine with me, but it is not what the RamaLama project is trying to do. I am just pointing out the problems.
-
To run an AI model from HF.co we need to pack together the model weights, the code the developer provides to run the model, and the dependencies, and quite often also add something like an attention module. Doing this from scratch is cumbersome and time-consuming for every user. Instead, it can be captured in easily shareable and easily adjustable YAML config files or containers.
Solution you'd like:
Adopt the basetenlabs/truss packaging system: it describes a simple config.yaml with the model and its required dependencies, automatically creates the Dockerfile with the model and the environment, and automatically builds the container. Truss typically requires two main files: the config.yaml, which describes the container environment and where to download the model from, and the file with the code to run the model, which is stored in model/model.py. If done right, this could become a universal/standard model packaging system for ANY model that users can easily share and consume, not just LLMs or those you typically run with llama.cpp/vLLM.
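For illustration, the model/model.py side for something like an embedding model could look roughly like this; the Model class with load() and predict() follows my understanding of the truss interface and may differ between versions, and the specific sentence-transformers model is only a placeholder:

```python
# Sketch of a truss model/model.py; the class/method contract is based on my
# reading of the truss docs, and the embedding model is a placeholder.
from sentence_transformers import SentenceTransformer


class Model:
    def __init__(self, **kwargs):
        # truss passes build/runtime context (data dir, config, secrets) here.
        self._model = None

    def load(self):
        # Called once when the serving container starts: pull weights into memory.
        self._model = SentenceTransformer("all-MiniLM-L6-v2")

    def predict(self, model_input):
        # Called per request: turn a list of texts into embedding vectors.
        texts = model_input["texts"]
        return {"embeddings": self._model.encode(texts).tolist()}
```

The consumer never has to look at any of this; they would just pull and run the container.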
The goal is to provide a simple and COMPLETE packaging system for ALL and any possible model out there: easily shareable, modifiable, and with reproducible package builds.