-
The source of the model manifest data is also interesting here. Inference libs are hardcoded in a way: they can only read specific data from the parameter dicts. This means that at least a part of the schema is already present (as code) in the inference API implementations. We could (and likely should) make use of this and generate some schema from the code itself.
-
### Interaction between Manifest and AC

We use dictionaries to interact with AC. In the code we access dictionary values directly. This means that it is very simple to make a change to the code which changes the schema. Calling any method on the dictionary can potentially do that:

```cpp
value = params.find("valueWittTypo");
```

This flexibility makes writing inference models easy, but keeping up with the schema becomes difficult.

Note that some parameters can have default values. E.g.

```cpp
auto antiprompts = Dict_optValueAt(params, "antoprompts", std::vector<std::string>{});
```

This adds a further complication: combined with a default value, a mistyped key (like "antoprompts" above) doesn't even fail at runtime; it silently yields the default.

Here are some ideas for how to keep the model manifest in sync with the code:

### 1. Write (property-based) tests

We can try to generate tests automatically from the model manifest description. Each test could run a function and be considered successful if the function doesn't return an error. We can generate parameter dictionaries based on the model manifest, run the function, and make sure that it finishes correctly. Then, we can also validate the result dictionary. This testing approach is called property-based testing; one of the first tools to implement it was QuickCheck for Haskell. A minimal sketch follows after the lists below.

Problems:
Work that needs to be done:
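As mentioned above, here is a minimal sketch of what a generated property-based test could look like. All names are hypothetical: `Schema`, `generateRandomDict`, `runOpSync`, and `validateAgainst` don't exist yet and would have to be written (or generated) as part of this work.

```cpp
#include <cassert>
#include <random>

// Schema, generateRandomDict, runOpSync, and validateAgainst are hypothetical.
void propertyTestRunOp(const Schema& paramsSchema, const Schema& resultSchema) {
    std::mt19937 rng(42); // fixed seed so a failing case is reproducible
    for (int i = 0; i < 1000; ++i) {
        // generate a random dict whose keys, types, and value ranges
        // conform to the params schema from the model manifest
        ac::Dict params = generateRandomDict(paramsSchema, rng);
        // hypothetical synchronous wrapper over runOp
        auto result = runOpSync("run", params);
        // property 1: a schema-conforming input never produces an error
        assert(!result.has_error());
        // property 2: the result dict conforms to the result schema
        assert(validateAgainst(resultSchema, result.value()));
    }
}
```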
### 2. Use proper types instead of dicts

We can still retain the dictionary interface to the outside world. However, internally we use proper types: we convert from dicts as the first step of executing a function and back to dicts just before we return the result. The types and the conversion/validation should be generated from the model manifest schema. This solution will solve both problems stated above. A sketch follows after the lists below.

Problems:
Work that needs to be done:
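For illustration, here is a sketch of the kind of generated type this idea implies, assuming `ac::Dict` has the json-like `at`/`get`/`find` interface seen in the snippets below; the struct and its field names are invented for the example.

```cpp
#include <string>
#include <vector>

// Hypothetical: would be code-generated from the manifest schema.
struct LlamaRunParams {
    std::string prompt;                   // required
    int maxTokens = 0;                    // required
    std::vector<std::string> antiprompts; // optional, defaults to empty

    // first step of executing the op: convert and validate in one place
    static LlamaRunParams fromDict(const ac::Dict& d) {
        LlamaRunParams p;
        p.prompt = d.at("prompt").get<std::string>(); // throws on a missing key
        p.maxTokens = d.at("max_tokens").get<int>();
        if (auto it = d.find("antiprompts"); it != d.end()) {
            p.antiprompts = it->get<std::vector<std::string>>();
        }
        return p;
    }
};
```

Inside the op the code would then touch `p.antiprompts` instead of a string literal, so a typo becomes a compile error instead of a silent schema change.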
### 3. Add validation to dicts

This might go beyond my current understanding of C++, but it probably can be done. Currently we write:

```cpp
instance->runOp("run", {{"prompt", prompt}, {"max_tokens", 20}, {"antiprompts", antiprompts}}, {
[&](ac::CallbackResult<void> result) {
if (result.has_error()) {
opError = std::move(result.error().text);
return;
}
latch->count_down();
},
[](std::string_view, ac::Dict result) {
std::cout << result.at("result").get<std::string_view>();
}
});
```
We could instead do:
```cpp
ac::ValidatedDict<LLAMA_INSTANCE_RUN_PARAMS_SCHEMA> dict({{"prompt", prompt}, {"max_tokens", 20}, {"antiprompts", antiprompts}});
instance->runOp("run", dict, {
[&](ac::CallbackResult<void> result) {
if (result.has_error()) {
opError = std::move(result.error().text);
return;
}
latch->count_down();
},
[](std::string_view, ac::ValidatedDict<LLAMA_INSTANCE_RUN_RESULT_SCHEMA> result) {
std::cout << result.at("result").get<std::string_view>();
}
});
```

In this example both the params and the result are validated against their respective schemas.

This way we can test the schema using our normal tests. Also, if we decide to use ValidatedDict in the core library, we can use a macro to remove any validation for the release build. But such optimisations are probably not necessary, because we don't expect validation to be a performance bottleneck for the ways the library will be used. This solution avoids rewriting the core inference code. A sketch of a possible ValidatedDict follows below.

Work to be done:
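For illustration, here is what a ValidatedDict could look like. It assumes each generated schema is a type with a static `validate(dict)` function that throws on mismatch; the class, the schema tag types, and the `AC_SKIP_DICT_VALIDATION` macro are all hypothetical.

```cpp
// Hypothetical sketch, not an existing API.
template <typename SchemaT>
class ValidatedDict {
public:
    explicit ValidatedDict(ac::Dict d)
        : m_dict(std::move(d))
    {
#ifndef AC_SKIP_DICT_VALIDATION // could be defined for release builds
        SchemaT::validate(m_dict); // throws if the dict doesn't match the schema
#endif
    }

    // forward the read interface to the underlying dict
    decltype(auto) at(const std::string& key) const { return m_dict.at(key); }
    const ac::Dict& dict() const { return m_dict; }

private:
    ac::Dict m_dict;
};
```

The macro guard mirrors the point above: validation runs in tests and debug builds, and can be compiled out entirely if it ever matters.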
### 4. Write schema from code?

This one is very hypothetical. In all other ideas the model manifest is the source of truth that the code should follow. Could we reverse this? We can try to use code-gen on the inference code to generate a part of the model manifest. Then this part must be merged with the human-written portions to create the final model manifest.

In order to do this we must be able to collect information from all occurrences of dict constructors or any other dict methods. For example, if we find the following code:

```cpp
void run(Dict params, ...) {
...
auto antiprompts = Dict_optValueAt(params, "antiprompts", std::vector<std::string>{});
}
```

we must be able to deduce that `params` may contain an optional "antiprompts" entry whose type is array of strings, with an empty array as the default.

We must be able to extract all such facts from the code and merge them to produce the final schema. The only way that I can think of doing this is to run example code or tests using a special Dict subclass which records all Dict ops that were called and produces a schema based on them. It would look something like:

```cpp
ac::SchemaWritingDict params({{"prompt", prompt}, {"max_tokens", 20}, {"antiprompts", antiprompts}});
instance->runOp("run", params, {
[&](ac::CallbackResult<void> result) {
if (result.has_error()) {
opError = std::move(result.error().text);
return;
}
latch->count_down();
},
[](std::string_view, ac::SchemaWritingDict result) {
// Add the discovered schema to the schema discovered so far by other runs. Use the schema discovered from write ops
result.mergeWriteSchemaWith("llama_instance_run_result.schema");
std::cout << result.at("result").get<std::string_view>();
}
});
// Add the discovered schema to the schema which was discovered so far by other runs. Use the schema discovered from read ops
params.mergeReadSchemaWith("llama_instance_run_params.schema");
```

The advantage of this approach is that the C++ programmers who will write the inference code get an easy-to-use way to generate schemas in their preferred language. The disadvantage is that this approach looks to be the craziest of all the current approaches.

Problems:

I don't know if this can be done. For example, how do we deal with cases where the existence of certain fields depends on the value of another field? Consider the following code:

```cpp
auto type = params.find("type");
if (type == "type1") {
auto val1 = params.find("param1");
...
} else {
auto val2 = params.find("param2");
...
}
```

Work to be done:
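For illustration, a rough sketch of what such a recording dict could look like, assuming the json-like `ac::Dict` interface used above; the class, `typeNameOf`, and the merge method are all hypothetical.

```cpp
#include <map>
#include <string>

// Hypothetical sketch: wraps a dict and records every key that is read,
// the C++ type it is read as, and whether a default was provided.
class SchemaWritingDict {
public:
    explicit SchemaWritingDict(ac::Dict d) : m_dict(std::move(d)) {}

    template <typename T>
    T valueAt(const std::string& key) {
        m_schema[key] = {typeNameOf<T>(), /*required=*/true}; // typeNameOf is hypothetical
        return m_dict.at(key).get<T>();
    }

    template <typename T>
    T optValueAt(const std::string& key, T defaultValue) {
        m_schema[key] = {typeNameOf<T>(), /*required=*/false};
        auto it = m_dict.find(key);
        return it == m_dict.end() ? std::move(defaultValue) : it->get<T>();
    }

    // merge the recorded facts into a schema file on disk (hypothetical)
    void mergeReadSchemaWith(const std::string& schemaFile);

private:
    struct Entry {
        std::string type;
        bool required;
    };
    std::map<std::string, Entry> m_schema;
    ac::Dict m_dict;
};
```

Note that this records only the keys touched on code paths that actually ran, which is exactly the conditional-fields problem above: the `type1`/`type2` branches would each need at least one run, and the merge step would have to express the result as a union of the two shapes.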
-
So far I'm favoring generating schema from code, though it doesn't need to be that seamless. We could add (hardcode) a formal mapping from dict entries to symbols; a sketch of what I mean follows below. It is true that the generated schemas will have to be merged with the model manifest separately.
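One hypothetical reading of such a mapping (`DictKey` and the namespace are invented for the example): every dict entry gets exactly one named symbol, and a code-gen tool, or even grep, can collect a schema from these definitions.

```cpp
#include <string>
#include <vector>

// Hypothetical: a typed tag tying a dict key to the type it is read as.
template <typename T>
struct DictKey {
    const char* name;
};

namespace llama::run_params {
constexpr DictKey<std::string> prompt{"prompt"};
constexpr DictKey<int> max_tokens{"max_tokens"};
constexpr DictKey<std::vector<std::string>> antiprompts{"antiprompts"};
}

// usage in the inference code (illustrative):
//   auto a = Dict_optValueAt(params, llama::run_params::antiprompts, {});
```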
-
I think that at this point we should focus on the actual model manifest structure rather than the source of the usage schemas. I think it's important to make a distinction between the two. The usage schema is not the model manifest, but just a part of it. I'll edit the titles of the existing issues and add new ones.
-
@iboB What do you envision to be the purpose of DRY? Is it just to simplify the work of manifest writers or is there anything else?
-
As part of this project, is it a business goal to incentivise users to create as many models (i.e. model manifests) as possible, like Huggingface or Ollama? Or do we rather expect that we will be creating the models and users will just use them (more like jan.ai)? I suppose we would prefer the first option, but I want to make sure.
-
Do we plan to include default prompts as part of a model, like ollama does? This can be a good hook to incentivise people to write models, because it is very easy to take a model, add a new prompt, and publish that. Ollama uses Go templates for its prompts.
-
A provider should be able to er... well... provide a model manifest. This is the list of available models and data about them.
This data includes:
There are questions about how to structure and present this data. For each item there is also always the question of whether it should be visible to the user. A potential resolution might be that we simply hide this. Nevertheless, the data should always be available to us.
### Data structure
The right structure is not obvious. We would ideally like to have some DRY in the manifest, but how much?
DRY can be achieved by having multiple tables which reference each other, or by coming up with a magical hierarchical structure which somehow minimizes repetition.
Here are some examples of repeated/reusable data:
The finer-grained the allowed invariants, the less benefit there is from tables referencing each other. An illustrative sketch of the tables option follows below.
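To make the tables option concrete, here is an illustrative sketch in C++ terms (all struct and field names are invented for the example): repeated data lives in its own table, and model entries reference it by id instead of repeating it inline.

```cpp
#include <string>
#include <vector>

// Hypothetical: a shared table of quantization descriptions, written once.
struct QuantizationInfo {
    std::string id; // referenced by ModelEntry::quantizationId
    int bits;
    std::string method;
};

struct ModelEntry {
    std::string name;
    std::string quantizationId; // reference into the quantization table
    std::vector<std::string> assetUrls;
};

struct Manifest {
    std::vector<QuantizationInfo> quantizations; // shared data, one row each
    std::vector<ModelEntry> models;              // many entries reference the same row
};
```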
### Provider specific data
Some data is provider specific. This includes: