## Add a new model architecture to `llama.cpp`

Adding a model requires a few steps:

1. Convert the model to GGUF
2. Define the model architecture in `llama.cpp`
3. Build the GGML graph implementation

After following these steps, you can open a PR.

It is also important to check that the examples and the main ggml backends (CUDA, Metal, CPU) work with the new architecture, especially:
- [main](../examples/main)
- [imatrix](../examples/imatrix)
- [quantize](../examples/quantize)
- [server](../examples/server)

### 1. Convert the model to GGUF

This step is done in Python with a `convert` script using the [gguf](https://pypi.org/project/gguf/) library.
Depending on the model architecture, you can use either [convert.py](../convert.py) or [convert-hf-to-gguf.py](../convert-hf-to-gguf.py).

The convert script reads the model configuration, tokenizer, tensor names and data, and converts them to GGUF metadata and tensors.
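
The convert scripts build on the `GGUFWriter` API from the `gguf` Python package. The following is only a minimal sketch of that underlying flow, not the actual script: the architecture name, hyperparameter values and tensor contents are placeholders.

```python
import numpy as np
import gguf

# Metadata (hyperparameters) first, then the tensor data.
writer = gguf.GGUFWriter("model.gguf", arch="mymodel")  # "mymodel" is a placeholder arch name

# Values normally read from the original model configuration (illustrative only).
writer.add_context_length(2048)
writer.add_embedding_length(4096)
writer.add_block_count(32)

# Tensor data normally converted from the original checkpoint (tiny placeholder here).
writer.add_tensor("token_embd.weight", np.zeros((8, 4), dtype=np.float32))

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```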

The required steps to implement for an HF model are:

1. Define the model in a new `Model` subclass registered with the `Model.register` decorator, for example:
| 27 | + |
| 28 | +```python |
| 29 | +@Model.register("MyModelForCausalLM") |
| 30 | +class MyModel(Model): |
| 31 | + model_arch = gguf.MODEL_ARCH.GROK |
| 32 | +``` |

2. Define the layout of the GGUF tensors in [constants.py](../gguf-py/gguf/constants.py)

Add an enum entry in `MODEL_ARCH`, the model's human-friendly name in `MODEL_ARCH_NAMES` and the GGUF tensor names in `MODEL_TENSORS`.

Example for the `falcon` model:
```python
    MODEL_ARCH.FALCON: [
        MODEL_TENSOR.TOKEN_EMBD,
        MODEL_TENSOR.OUTPUT_NORM,
        MODEL_TENSOR.OUTPUT,
        MODEL_TENSOR.ATTN_NORM,
        MODEL_TENSOR.ATTN_NORM_2,
        MODEL_TENSOR.ATTN_QKV,
        MODEL_TENSOR.ATTN_OUT,
        MODEL_TENSOR.FFN_DOWN,
        MODEL_TENSOR.FFN_UP,
    ]
```
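
The enum entry and the human-friendly name mentioned above are not part of the snippet. For a hypothetical new architecture they would look roughly like this (`MYMODEL` and `"mymodel"` are placeholder names; in practice you extend the existing enum and dict in [constants.py](../gguf-py/gguf/constants.py)):

```python
from enum import IntEnum, auto

class MODEL_ARCH(IntEnum):
    # ... existing architectures ...
    MYMODEL = auto()  # new enum entry for the architecture

MODEL_ARCH_NAMES: dict[MODEL_ARCH, str] = {
    # ... existing architectures ...
    MODEL_ARCH.MYMODEL: "mymodel",  # human-friendly name written into the GGUF metadata
}
```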

3. Map the original tensor names to the standardized equivalent in GGUF

As a general rule, before adding a new tensor name to GGUF, be sure the equivalent naming does not already exist.

Once you have found the GGUF tensor name equivalent, add it to the [tensor_mapping.py](../gguf-py/gguf/tensor_mapping.py) file.

If the tensor name is part of a repetitive layer/block, the keyword `bid` substitutes the block id.

Example for the normalization tensor in attention layers:

```python
block_mappings_cfg: dict[MODEL_TENSOR, tuple[str, ...]] = {
    # Attention norm
    MODEL_TENSOR.ATTN_NORM: (
        "gpt_neox.layers.{bid}.input_layernorm",  # gptneox
        "transformer.h.{bid}.ln_1",               # gpt2 gpt-j refact qwen
        "transformer.blocks.{bid}.norm_1",        # mpt
        ...
    )
}
```

`transformer.blocks.{bid}.norm_1` will be mapped to `blk.{bid}.attn_norm` in GGUF.

Depending on the model configuration, tokenizer, code and tensor layout, you will have to override some of the following methods (see the sketch below the list):
- `Model#set_gguf_parameters`
- `Model#set_vocab`
- `Model#write_tensors`
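
As an illustration, `Model#set_gguf_parameters` typically reads values from the HF `config.json` (exposed as `self.hparams`) and writes them as GGUF metadata through `self.gguf_writer`. The following is only a sketch: which config keys exist and which metadata must be written depend on the actual model, and the keys below are placeholders:

```python
@Model.register("MyModelForCausalLM")
class MyModel(Model):
    model_arch = gguf.MODEL_ARCH.GROK

    def set_gguf_parameters(self):
        super().set_gguf_parameters()  # writes the common parameters shared by most models
        # Model-specific metadata read from config.json (placeholder keys):
        self.gguf_writer.add_context_length(self.hparams["max_position_embeddings"])
        self.gguf_writer.add_head_count(self.hparams["num_attention_heads"])
        self.gguf_writer.add_layer_norm_rms_eps(self.hparams["rms_norm_eps"])
```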

NOTE: Tensor names must end with the `.weight` suffix; that is the convention, and several tools like `quantize` expect it when processing the weights.

### 2. Define the model architecture in `llama.cpp`

The model parameters and tensor layout must be defined in `llama.cpp`:
1. Define a new `llm_arch`
2. Define the tensor layout in `LLM_TENSOR_NAMES`
3. Add any non-standard metadata in `llm_load_hparams`
4. Create the tensors for inference in `llm_load_tensors`
5. If the model has a RoPE operation, add the rope type in `llama_rope_type`

NOTE: The dimensions in `ggml` are typically in the reverse order of the PyTorch dimensions.
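
For example, a token embedding matrix that PyTorch stores with shape `(n_vocab, n_embd)` is described in `ggml` with `ne[0] = n_embd` and `ne[1] = n_vocab`, because `ne[0]` is the innermost (fastest varying) dimension. A tiny illustration of the convention (the sizes are arbitrary):

```python
# PyTorch / Hugging Face convention: embedding matrix of shape (n_vocab, n_embd).
n_vocab, n_embd = 32000, 4096
torch_shape = (n_vocab, n_embd)

# ggml convention: ne[0] is the innermost dimension, so the order is reversed.
ggml_ne = tuple(reversed(torch_shape))

print(torch_shape)  # (32000, 4096)
print(ggml_ne)      # (4096, 32000)
```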

### 3. Build the GGML graph implementation

This is the most fun part: you have to provide the inference graph implementation of the new model architecture in `llama_build_graph`.

Have a look at existing implementations like `build_llama`, `build_dbrx` or `build_bert`.

When implementing a new graph, please note that the underlying `ggml` backends might not support all of its operations; support for missing backend operations can be added in another PR.

## GGUF specification

https://github.com/ggerganov/ggml/blob/master/docs/gguf.md

## Resources

- YaRN RoPE scaling https://github.com/ggerganov/llama.cpp/pull/2268
- support Baichuan series models https://github.com/ggerganov/llama.cpp/pull/3009
- support attention bias https://github.com/ggerganov/llama.cpp/pull/4283
- Mixtral support https://github.com/ggerganov/llama.cpp/pull/4406
- BERT embeddings https://github.com/ggerganov/llama.cpp/pull/5423
- Grok-1 support https://github.com/ggerganov/llama.cpp/pull/6204
- Command R Plus support https://github.com/ggerganov/llama.cpp/pull/6491
- support arch DBRX https://github.com/ggerganov/llama.cpp/pull/6515
- How to convert HuggingFace model to GGUF format https://github.com/ggerganov/llama.cpp/discussions/2948