Idea/Feature Request: ramalama to run AI models: Adopt basetenlabs/truss model packaging system to create containers with the model + code + dependency #1165
Replies: 17 comments
-
I agree, but we also have this problem right now:
-
Point taken. You cannot force everyone to accept a single standard. @ericcurtin There seems to be demand for a simple model consumption mechanism to run your models, and things like Ollama and RamaLama seem to be living proof of that. So packaging a model with everything it needs to run using Truss might be simple and flexible enough for the community to pick up, and not just for the LLMs you would typically run with llama.cpp or vLLM, but also embedding, reranker, classifier models, and so on.
-
@bmahabirbu @benoitf WDYT?
-
I see this area exploding and crying out for a standard. Docker just came out with a new way of shipping models, using artifacts, I believe. If OCI standardized on something, or a major player stamped their approval on one approach, then we could go with a de facto standard. For now it is a mess.
-
Just FYI: what I propose is an attempt to standardize the packaging of a model together with code (or the runners) and its dependencies/environment into a single package. Containerization seems like a natural fit for this, and the proposed truss tool already does exactly that.
-
I care more about standardization of the format. Where do the model files sit? What are the annotations describing the model? Can we split the model files into different layers or different artifacts under the same OCI name? I have users who want to pull just the tensors without pulling the entire model.
-
This is an interesting idea. I don't know how the basetenlabs/truss packaging system works in its entirety, but it sounds extremely similar to how ai-lab-recipes packages and serves models configurable via a YAML file. As others have said, there are a lot of competing standards for packaging the models themselves, let alone configurable dependencies.

It would be really nice to have this kind of standard on the developer end for sharing configurations, but it becomes more complex when we talk about maintainability and community adoption in such a fast-growing field. We are at a point where the AI pipeline is becoming holistic in nature, so every part should be easily editable. How rigid do we want this standard to be before the community runs into limitations that prevent them from making necessary changes? We never know what the community will need or what technology gets thrown out in the process.

As a whole, I agree that containers are the way to go for something like this! However, I don't think it's our place to create/manage an all-in-one standard; rather, we should seek to create a method that's easy to change when new tech comes out. Container technology as it is right now works for our current development, but things will change as we move forward.

Thanks for bringing up the idea, these are just my thoughts! Maybe as things move forward, I'll have a clearer picture of how we can achieve this!
-
@rhatdan If you'd like to discuss model architecture and file format standardization, this issue is probably not the best place to do so; I suggest raising it with HuggingFace on their Discord server. This feature request is not about model standardization, but only about standardizing the packaging of whatever files, formats, and architectures there are, together with whichever code and dependencies the model developers deem necessary.
-
Imagine we use a container: we can pack the model into it, add the code to run the model, create an environment with Python (or anything else), and install attention modules and other dependencies the code needs. The container would not care about the model's file format, how many files it has, or its architecture; it would just pack everything. Suddenly, the container itself becomes a universal, shareable, reproducible form of delivering models that does not limit the model creators in their expressiveness and need for innovation; it lets them do their thing however they think is best. Yet for the consumer it is just one container with everything needed to run the model, without digging into the details of how to run it. And to make packaging easy, shareable, and reproducible, I propose using the existing basetenlabs/truss tool, which can take a YAML file and create such a container, so the burden on the model developers would be minimal and we can simply update the container if needed.
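To make this concrete, here is a rough sketch of what such a config.yaml could look like; the field names follow my reading of the truss documentation and may differ between truss versions, and the embedding model and pinned versions are only placeholders:

```yaml
# Illustrative truss config.yaml sketch; names and values are placeholders
# based on the truss docs, not a tested configuration.
model_name: minilm-embedder          # hypothetical name for this package
python_version: py311                # Python environment baked into the image
requirements:                        # pip dependencies installed at build time
  - sentence-transformers==2.7.0
  - torch==2.3.0
resources:                           # what the serving container needs
  cpu: "1"
  memory: 2Gi
  use_gpu: false
system_packages: []                  # OS-level packages, if the model needs any
```

truss builds a container image from this description plus the model code, so sharing the YAML (or the resulting image) is all a consumer would need.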
-
Well, that is where we differ: I don't think the model and the model runtime should be packaged together. The model should be usable with multiple model runtimes: whisper-serve, llama-serve, vllm, ...
-
Models can be mounted into containers so they can be used with different packaged runtimes.
-
Yes, but if you adopt truss it's very simple to have as many packages with different runtimes as you wish. The runtimes can be layers of your container image, so once you have downloaded one model packaged like this and need the same model with a different runtime/inference engine, the download is very fast: only the difference, the layer that is missing. When you decouple the model from the inference engine, you open up the potential for bugs where your model stops working. Also, when you think of models you mostly think of LLMs, while I was thinking of models in a broader sense, not necessarily LLMs, often requiring a custom inference runtime provided by the model developer. Embedding, reranking, classifier models and so on will never work with llama.cpp or vLLM. And even newer LLMs are often unsupported for a long time, while if they are packaged with the custom code the developer provided, you can package and start using them on day one, exposing a much larger user base to the new model without waiting (and hoping) for llama.cpp or vLLM to support it. Hope this makes sense.
-
First of all, I understand containers; I was one of the creators of Podman and the entire github.com/containers world. We don't use the "docker" adjective to describe container technologies either. :^) The point is that once you embed software into an OCI model image, you need to deal with CVEs and support headaches, like Red Hat only being able to support RHEL-based images and Ubuntu only wanting to support Ubuntu-based images. We don't want to have to update HUGE images every time a fundamental script has a CVE. The things you described above can also be done by separating the runtime from the models, without an explosion in the number of images necessary. If we package models with runtimes, you end up supporting (#MODELS * #RUNTIMES) images; when they are separate you support (#MODELS + #RUNTIMES) images. With, say, 50 models and 4 runtimes, that is 200 bundled images versus 54 separate ones.
-
^ The example we use: say we used a database instead of an AI model to retrieve information. We don't mix the database data into the mariadb container. I mean, you can, but the benefits don't seem worth it.
-
If containers are such a problem, why is the whole world packing apps into containers? The same applies to models: they can be packed and updated via containers perfectly well if done right. If you know containers and how image layers work, then even with a 1 TB container, when you update a small layer with a library to fix a CVE or to replace the inference engine, you know perfectly well you'll be downloading only that changed layer, a few megabytes (assuming you packed the model weights as the top layer, as it should be done). Then there is simply no problem: no problem downloading, no problem replacing the runtime, and no problem updating.

The reason you don't see packaging models as useful is 1) you are tech-savvy people who know how to take a model from HF and run it, and what's easy for you is not easy for most people; and 2) you are focusing on a subset of models, namely LLMs only. What most people complain about is that they just want a way to download a model (any model, not just an LLM) and run it like an EXE file on Windows. Currently, when they download a random model from HF, it doesn't run with llama.cpp: it's in some weird new format or some not-yet-supported architecture, or it requires some library or module that isn't released yet and has to be compiled, or it needs environment settings and files the developer forgot to include, or only explained in a way non-tech-savvy people do not understand. There's this unnecessary complexity and chaos that is part of the creative process of building a model, which is fine for tech-savvy people such as yourselves, but it can be hidden from the average person who simply needs to run the model. And I thought RamaLama was here to solve that. That's why I proposed this idea here.
-
In this space we typically rely on Podman Desktop, FWIW.
-
If someone wants to use truss to download and execute the model, that is fine with me, but it is not what the RamaLama project is trying to do. I am just pointing out the problems.
-
To run an AI model from HF.co we need to pack together the model weights, the code the developer provides to run the model, and the dependencies, and quite often also add something like an attention module. Doing this from scratch is cumbersome and time-consuming for every user. Instead, it can be captured in easily shareable and easily adjustable YAML config files or containers.
Solution you'd like:
Adopt the basetenlabs/truss packaging system: it describes a simple config.yaml with the model and its required dependencies, automatically creates the Dockerfile with the model and the environment, and automatically builds the container. Truss typically requires two main files: the config.yaml, which describes the container environment and where to download the model from, and the file with the code to run the model, which is stored in model/model.py. If done right, this could become a universal/standard model packaging system for ANY model that users can easily share and consume, not just LLMs or those you typically run with llama.cpp/vLLM.
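For illustration, the model/model.py side for something like an embedding model could look roughly like this; the Model class with load() and predict() follows my understanding of the truss interface and may differ between versions, and the specific sentence-transformers model is only a placeholder:

```python
# Sketch of a truss model/model.py; the class/method contract is based on my
# reading of the truss docs, and the embedding model is a placeholder.
from sentence_transformers import SentenceTransformer


class Model:
    def __init__(self, **kwargs):
        # truss passes build/runtime context (data dir, config, secrets) here.
        self._model = None

    def load(self):
        # Called once when the serving container starts: pull weights into memory.
        self._model = SentenceTransformer("all-MiniLM-L6-v2")

    def predict(self, model_input):
        # Called per request: turn a list of texts into embedding vectors.
        texts = model_input["texts"]
        return {"embeddings": self._model.encode(texts).tolist()}
```

The consumer never has to look at any of this; they would just pull and run the container.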
The goal is to provide a simple and COMPLETE packaging system for ALL and any possible model out there: easily shareable, modifiable, and with reproducible package builds.