
Commit 0aedae0

gabe-l-hart, compilade, ggerganov, and CISC authored
model : Granite Four (#13550)
* wip: llama : separate recurrent states from the KV cache
  This will be necessary to support Jamba (and other recurrent models mixed with Attention). Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.
* llama : use std::find for seq_nodes in llama_rs_cache
* llama : state checkpoints for recurrent models
* llama : correctly handle more edge cases for the rs cache
* llama : rename many llama_kv_cache_* functions
* llama : remove useless return value for some llama_cache_* functions
* llama : rethink recurrent state cell counts
* llama : begin work on support for variable GQA
  This will also be useful for Jamba if we consider the Mamba layers to have 0 KV heads.
* llama : gracefully fail when not finding hybrid slot
* llama : support Jamba
* llama : fix BERT inference without KV cache
* convert-hf : check for unprocessed Jamba experts
* convert-hf : support Mini-Jamba conversion
* llama : fix Jamba quantization sanity checks
* llama : sequence-length-aware batch splitting
* llama : use equal-sequence-length sub-batches for recurrent models
* ggml : simplify SSM-related operators
* llama : make recurrent state slot allocation contiguous
* llama : adapt internal uses of batches to llama_ubatch
* llama : fix batch split output count for embeddings
* llama : minimize swaps when reordering logits
  This reduces overhead when running hellaswag on thousands of sequences with very small 100k-param Mamba models.
* llama : fix edge case finding batch seq_id of split recurrent cell
  This otherwise was a problem when running the HellaSwag benchmark with small batch sizes, making it crash.
* llama : avoid copies for simple batch splits
* llama : use im2col and mul_mat to perform convolution for Mamba
  This removes the need for ggml_ssm_conv!!! But performance seems slightly worse on my system, especially for prompt processing. Maybe ggml_mul_mat isn't optimized for small row sizes? More performance testing is necessary until GGML_OP_SSM_CONV is removed.
* ggml : make ggml_ssm_scan not modify its source tensors
* llama : fix shared recurrent tail cell count for small ubatch sizes
  Otherwise it was impossible to run the 'parallel' example with '-ub 1' with a Mamba or Jamba model.
* llama : fix .base() compilation error on Windows
* llama : allow doing the equivalent of SSM_CONV with SUM_ROWS and MUL
* ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors
  The implementation already supported it, and this makes Mamba's conv step slightly faster.
* llama : rename llama_cache to llama_past
  This can be changed back later if the name change is wrong. I was renaming the functions anyway to generalize kv-cache-related functions to hybrid and recurrent model architectures. I think llama_past is a better name than llama_cache for a combined kv cache and recurrent state cache, because the states it contains pretty much always come before the newly-added ones for any particular sequence. Also 'llama_past_clear' sounds more obvious in what it does than 'llama_kv_cache_clear'. The future is what the models generate. (For embeddings, the kv cache isn't really used anyway.) Still, I'm open to better suggestions.
* examples : replace llama_kv_cache_seq_* with llama_past_seq_*
* mamba : fix non-contiguous usage of ggml_silu
* llama : initial Mamba-2 support
* ggml : SIMD ggml_ssm_scan for Mamba-2
* ggml : improve ggml_mul speed when masking recurrent states
* llama : support running Mamba-Codestral-7B-v0.1
* llama : fix Mamba-2 conv state saving
* ggml : make the ggml_mul fast broadcast path more consistently formatted
* llama : remove unused variable
* llama : add missing break
* convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present
  The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.
* llama : session saving and reloading for hybrid models
* convert_hf : fix Jamba conversion
* llama : fix mixed signedness comparison
* llama : use unused n_embd_k_gqa in k_shift
  This also slightly reduces the diff from the master branch.
* llama : begin renaming llama_past back to llama_kv_cache
* llama : avoid redundant state copy for Mamba 1 and 2
* metal : attempt to adapt SSM_SCAN for Mamba-2
* metal : fix SSM_SCAN pipeline scope
* metal : use log and exp instead of log1pf and expf in SSM_SCAN
* metal : remove unused arguments for SSM_SCAN
  The max index is 31, so trimming the arguments is necessary.
* metal : add back n_seqs to SSM_SCAN args
  Whoops, this is needed for the offset in the concatenated output.
* metal : fix SSM_SCAN state head offset
* metal : fix wrong number of tokens per sequence in SSM_SCAN
* ggml : remove unused fast broadcast path in GGML_MUL
  This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.
* ggml : avoid multiply by D in GGML_OP_SSM_SCAN
  This makes the weight buft detection in src/llama.cpp simpler.
* convert : transpose Mamba-2 A, D and reshape SSM_NORM
  This breaks existing conversions of Mamba-2 models to avoid some reshapes. Not sure if it's a good idea, but it makes the graph slightly cleaner.
* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
* convert : fix flake8 lint
* llama : remove implicit recurrent state rollbacks
* llama : partially apply clang-format style
* metal : fix confusion between ; and ,
* metal : add missing args for nb references in ssm_scan_f32_group
* metal : single-user mamba2 inference works
* kv-cache : remove const_cast when setting inputs for s_copy
  And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy.
* convert : avoid AutoConfig for Mamba and Mamba2 hparams
* kv-cache : allow context shift for recurrent models
* graph : fix recurrent state copies when avoiding copies
  Works, but using lambda functions might not be that clean.
* ggml : fix mamba2 ssm scan when compiled with SVE
* ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
* cuda : implement ssm scan for Mamba2
  There is still room for improvement, but it works!
* cuda : adapt Mamba1 ssm scan to shape changes from Mamba2
* feat: Add conversion for Bamba models
  This is borrowed and adapted from the original implementation #10810. Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add Granite 4 conversion
  This is a manual copy from my draft branch https://github.com/gabe-l-hart/llama.cpp/blob/GraniteFourDraft/convert_hf_to_gguf.py#L5076. Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Plumb bamba through llama-arch
  Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add bamba to llama_arch_is_hybrid_recurrent
  Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add optional mamba ssm_in bias tensor
  Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add template specialization for get_arr to load a vector<uint32_t> for layer index arr in hparams
  Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Use an explicit bool to determine mamba vs mamba2
  This allows other architectures like bamba and granitemoehybrid to use mamba2 without a growing architecture `if` statement inside the mamba implementation. Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Isolate mamba(2) and granite attention layer building in static methods
  This will allow these layer-builder methods to be used from other build structs without complex inheritance. Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use per-layer sizes in granite build_attention_layer
  Also no need to pass in the kv cache since it's already in the inp_attn. Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: First (broken) pass at end-to-end Bamba implementation
  It generates (garbage) tokens! Still lots of debugging to do. Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Only do Granite multipliers if set
  Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Pull granite ffn portion into a static function and reuse in hybrid
  Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat(py): Allow gguf duplicate keys if they match by value and type
  This is helpful for hybrid models that want to do gguf param setting by calling multiple parent classes without needing to make those parent classes try/except on every attempt to set a gguf value. Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor(py): Simplify granitemoehybrid conversion to use parents better
  Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add GRANITE_MOE_HYBRID through llama-arch
  Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Support GRANITE_MOE_HYBRID in llama-model
  This re-uses the Bamba code paths heavily and simply adds the missing parts for loading MoE and the shared expert. Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: Fix flake8 errors
  Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix recurrent cache get after rebase
  Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix hybrid granite implementation for signature changes in build_mamba*_layer
  Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Refactor relationship between non-hybrid classes and hybrid impl to use mixins
  The challenge here is to give both the non-hybrid classes (llm_build_mamba and llm_build_granite) AND the hybrid class (llm_build_hybrid_mamba) access to the same intermediate "base class" functionality (build_mamba*_layer, build_granite_attention_layer) without running into trouble with diamond inheritance of llm_graph_context. Due to the non-trivial initialization that happens in llm_graph_context, diamond inheritance results in multiple initializations of the common base which cause problems around the unique ptrs. I wanted to get away from `self->` everywhere, but this is still a bit cleaner than making those methods static, I think. Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Implement the full copy-paste version to duplicate the layer builders
  This follows the pattern where the type of input is pinned to the type of memory and that is used to dispatch to the correct version of `build_rs` / `build_attn`. There's a lot of code duplication that can hopefully be pulled into common functions in the graph later. Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Rename llm_build_hybrid_mamba -> llm_build_granite_hybrid
  I've gone back and forth a lot about how/if to try to implement reuse of the "child model" layer types for hybrid models. At the end of the day, I think hybrid models are their own beast, and even if their layers are inspired by other models, they should maintain control of their own layer building (in other words, the copy-paste method). Given that, the name should reflect that this is not a generic hybrid model builder, but rather a granite-specific hybrid model builder that can do MoE (granite 4) or dense (bamba). As part of this, I also cleaned up dangling comments from previous attempts at using static methods for reusability. Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* mamba : fix mismatched new and delete size for llm_build_mamba
  Subclasses of llm_graph_context cannot have extra fields, because the called destructor is not the one from the subclass. This otherwise would cause problems when running Mamba-(1|2) inference when compiled with -DGGML_SANITIZE_ADDRESS=ON.
* memory : correctly handle failure in apply()
  ggml-ci
* style: Remove TODO for adding first hybrid models to the switch
  Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix bad merge in tensor_mapping.py w/ SSM_NORM
  Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix bad merge resolution with variable renames/moves in llm_build_mamba
  Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* docs: Fix comment about duplicate key check
  Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Conform to standard way of initializing inp_out_ids
  Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* convert : fix jamba conv1d shape squeezing
* fix: Fix input initialization in granite_hybrid after removal of hybrid inputs
  Branch: GraniteFourWithJamba. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use llm_graph_context_mamba in llm_build_granite_hybrid
  Branch: GraniteFourWithJamba. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Refactor mamba2/granite/jamba/granite_hybrid relationships as mixins
  The key is for the mixin classes (llm_graph_context_mamba, llm_graph_context_granite) to use virtual inheritance from llm_graph_context. This allows the common members to exist only once in the class hierarchy. The downside is that llm_graph_context will be re-initialized once for each parent (ie 2x for single mixin, 3x for two mixins, etc...). Branch: GraniteFourWithJamba. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* graph : add back hybrid memory graph input
  But this time it contains the sub-cache graph inputs. This *should* make it easier to handle updating the inputs when caching the graph (eventually).
* model : add Jamba to Mamba-specific hparams printing
* fix: Fix input setup after upstream merge
  Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* jamba : remove redundant nullptr initializations
* model : remove unnecessary prefix for tensor loading constants
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* model : use ggml_swiglu_split for Mamba
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* feat: Add support for dense FFN in GraniteMoeHybrid
  This was already partially supported via reusing the granite ffn builder, and there may be models that leverage this architecture going forward. The naming is a bit odd, but in the transformers version, it reuses the same model class and simply has zero regular experts and a single shared expert (which is the same as a single dense FFN). Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add support for dense FFN tensor names on c++ side
  Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use child inputs for Falcon H1 after merge resolution
  Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove unnecessary prefix on tensor constants
  Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* model : make falcon-h1 use shared mamba2 layer builder
* memory : avoid referring to KV in recurrent cache logs
* fix: Revert order changes for Falcon H1 to stay consistent with upstream
  Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* gguf-py : avoid adding duplicate tensor mappings for Jamba
  Some of the tensor names are common with Llama4.
* refactor: Collapse Bamba and GraniteMoeHybrid into GraniteHybrid
  The only key difference is the use of rope, which is now set via rope_finetuned in the hparams. Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Remove use of diamond inheritance
  Per PR discussion, it's simpler to keep this with basic inheritance and not introduce the complexity of virtual inheritance and multiple inheritance: #13550 (comment). Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Log mamba params for Granite Hybrid
  Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove unused ssm_in_b
  Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Remove ATTENTION_LAYER_INDICES hparam in favor of n_head_kv
  This matches how recurrent vs attention heads are identified for Jamba. Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove unused template expansion for get_arr
  Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Review cleanup in convert_hf_to_gguf
  The gist is to be explicit about which base class is being used with the multiple inheritance setup. Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Undo hidden warnings about duplicate identical keys in add_key_value
  After further discussion, this encourages sloppy overwriting in the model converters. Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: If not using ROPE, context is "infinite"
  Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* doc: Add a comment outlining expected duplicate key warnings
  Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove unnecessary duplicate keys in converter
  Co-authored-by: Francis Couture-Harpin <git@compilade.net> (thanks for the sharp eyes and patience!) Branch: GraniteFour. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
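Several of the entries above converge on the same convention: instead of a dedicated ATTENTION_LAYER_INDICES hparam, attention layers are identified by a per-layer KV-head count, with recurrent (SSM) layers carrying 0 KV heads. The standalone Python sketch below only illustrates that convention; the helper name and the toy vector are made up here and are not part of the commit.

    # Hypothetical helper mirroring the per-layer head-count convention:
    # attention layers keep their real KV-head count, SSM layers store 0.
    def classify_layers(head_count_kv_vec: list[int]) -> tuple[list[int], list[int]]:
        attn = [i for i, n in enumerate(head_count_kv_vec) if n > 0]
        ssm = [i for i, n in enumerate(head_count_kv_vec) if n == 0]
        return attn, ssm

    # Toy 9-layer hybrid with attention every 3rd layer (offset 2):
    print(classify_layers([0, 0, 8, 0, 0, 8, 0, 0, 8]))
    # -> ([2, 5, 8], [0, 1, 3, 4, 6, 7])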
1 parent 6bdda13 commit 0aedae0

File tree: 6 files changed (728 additions, 144 deletions)


convert_hf_to_gguf.py

Lines changed: 146 additions & 19 deletions
@@ -4890,6 +4890,9 @@ def __init__(self, dir_model: Path, *args, **kwargs):
         with open(dir_model / "config.json", "r", encoding="utf-8") as f:
             hparams = json.load(f)
         super().__init__(dir_model, *args, hparams=hparams, **kwargs)
+        self.d_model = self.find_hparam(["hidden_size", "d_model", "dim"])
+        self.d_inner = self.find_hparam(["mamba_d_ssm", "intermediate_size", "d_inner"], optional=True) or 2 * self.d_model
+        self.n_group = self.find_hparam(["n_groups"], optional=True) or 1

     def set_vocab(self):
         vocab_size = self.hparams["vocab_size"]
@@ -4912,32 +4915,29 @@ def set_vocab(self):
             self._set_vocab_builtin("gpt-neox", vocab_size)

     def set_gguf_parameters(self):
-        d_model = self.find_hparam(["hidden_size", "d_model", "dim"])
-        d_conv = self.find_hparam(["conv_kernel", "d_conv"], optional=True) or 4
-        d_inner = self.find_hparam(["mamba_d_ssm", "intermediate_size", "d_inner"], optional=True) or 2 * d_model
-        d_state = self.find_hparam(["state_size", "d_state"], optional=True) or 128
-        head_dim = self.find_hparam(["mamba_d_head", "head_dim"], optional=True) or 64
-        n_group = self.find_hparam(["n_groups"], optional=True) or 1
+        d_conv = self.find_hparam(["conv_kernel", "d_conv"], optional=True) or 4
+        d_state = self.find_hparam(["state_size", "d_state"], optional=True) or 128
+        head_dim = self.find_hparam(["mamba_d_head", "head_dim"], optional=True) or 64

         rms_norm_eps = self.find_hparam(["layer_norm_epsilon", "rms_norm_eps"], optional=True) or 1e-5

         # Fail early for models which don't have a block expansion factor of 2
         # TODO: does this really matter?
         # skip the assertion for FalconH1 Model
         if self.model_arch != gguf.MODEL_ARCH.FALCON_H1:
-            assert d_inner == 2 * d_model
-            assert d_inner % head_dim == 0
+            assert self.d_inner == 2 * self.d_model
+            assert self.d_inner % head_dim == 0

         self.gguf_writer.add_context_length(2**20)  # arbitrary value; for those who use the default
-        self.gguf_writer.add_embedding_length(d_model)
+        self.gguf_writer.add_embedding_length(self.d_model)
         self.gguf_writer.add_feed_forward_length(0)  # unused, but seemingly required when loading
         self.gguf_writer.add_head_count(0)  # unused, but seemingly required when loading
         self.gguf_writer.add_block_count(self.block_count)
         self.gguf_writer.add_ssm_conv_kernel(d_conv)
-        self.gguf_writer.add_ssm_inner_size(d_inner)
+        self.gguf_writer.add_ssm_inner_size(self.d_inner)
         self.gguf_writer.add_ssm_state_size(d_state)
-        self.gguf_writer.add_ssm_time_step_rank(d_inner // head_dim)
-        self.gguf_writer.add_ssm_group_count(n_group)
+        self.gguf_writer.add_ssm_time_step_rank(self.d_inner // head_dim)
+        self.gguf_writer.add_ssm_group_count(self.n_group)
         self.gguf_writer.add_layer_norm_rms_eps(rms_norm_eps)
         self.gguf_writer.add_file_type(self.ftype)

@@ -4962,10 +4962,7 @@ def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iter
             # (D is also unsqueezed, but for more straightforward broadcast internally)
             data_torch = data_torch.reshape((*data_torch.shape, 1))
         elif self.match_model_tensor_name(new_name, gguf.MODEL_TENSOR.SSM_NORM, bid):
-            d_model = self.find_hparam(["hidden_size", "d_model", "dim"])
-            d_inner = self.find_hparam(["mamba_d_ssm", "intermediate_size", "d_inner"], optional=True) or 2 * d_model
-            n_group = self.hparams.get("n_groups", 1)
-            data_torch = data_torch.reshape((n_group, d_inner // n_group))
+            data_torch = data_torch.reshape((self.n_group, self.d_inner // self.n_group))

         if name.endswith(".A_log"):
             logger.debug("A_log --> A ==> " + new_name)
@@ -6452,18 +6449,148 @@ def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iter
                 (self.format_tensor_name(gguf.MODEL_TENSOR.FFN_UP_EXP, bid), up),
             ]

+        has_experts = bool(self.hparams.get('num_local_experts'))
+
         if name.endswith("shared_mlp.input_linear.weight"):
             ffn_dim = self.hparams["shared_intermediate_size"]
             assert data_torch.shape[-2] == 2 * ffn_dim, "Merged FFN tensor size must be 2 * shared_intermediate_size"
             gate, up = data_torch.split(ffn_dim, dim=-2)
+            if has_experts:
+                return [
+                    (self.format_tensor_name(gguf.MODEL_TENSOR.FFN_GATE_SHEXP, bid), gate),
+                    (self.format_tensor_name(gguf.MODEL_TENSOR.FFN_UP_SHEXP, bid), up),
+                ]
             return [
-                (self.format_tensor_name(gguf.MODEL_TENSOR.FFN_GATE_SHEXP, bid), gate),
-                (self.format_tensor_name(gguf.MODEL_TENSOR.FFN_UP_SHEXP, bid), up),
+                (self.format_tensor_name(gguf.MODEL_TENSOR.FFN_GATE, bid), gate),
+                (self.format_tensor_name(gguf.MODEL_TENSOR.FFN_UP, bid), up),
+            ]
+
+        if not has_experts and name.endswith("shared_mlp.output_linear.weight"):
+            return [
+                (self.format_tensor_name(gguf.MODEL_TENSOR.FFN_DOWN, bid), data_torch)
             ]

         return super().modify_tensors(data_torch, name, bid)


+@ModelBase.register("GraniteMoeHybridForCausalLM", "BambaForCausalLM")
+class GraniteHybridModel(Mamba2Model, GraniteMoeModel):
+    """GraniteHybrid is a hybrid SSM + Attention model that uses Mamba2 SSM
+    layers and optionally uses MoE w/ a shared expert"""
+    model_arch = gguf.MODEL_ARCH.GRANITE_HYBRID
+    undo_permute = True
+
+    def __init__(self, *args, **kwargs):
+
+        # Hybrid mamba models use a prefix for the mamba-specific params.
+        # TODO: Extend this if the prefix(es) need to be configurable
+        self.hparam_prefixes = ["mamba"]
+
+        super().__init__(*args, **kwargs)
+
+        # Lists of which layers use ssm vs attention
+        self._attn_layers = self.get_attn_layers()
+        self._ssm_layers = [
+            i for i in range(self.block_count)
+            if i not in self._attn_layers
+        ]
+
+        # n_group and d_inner are used during reshape_tensors for mamba2
+        self.d_model = self.find_hparam(["hidden_size", "d_model"])
+        self.n_group = self.find_hparam(["n_groups"])
+        self.d_inner = self.find_hparam(["expand"]) * self.d_model
+
+    def get_attn_layers(self):
+        # Explicit list of layer type names
+        if layer_types := self.hparams.get("layer_types"):
+            return [
+                i for i, typ in enumerate(layer_types)
+                if typ == "attention"
+            ]
+
+        # Layer types indicated by index or period
+        attn_layers = self.hparams.get("attn_layer_indices", [])
+        if not attn_layers:
+            attn_period = self.hparams.get("attn_layer_period")
+            assert attn_period, "Didn't find attn_layer_indices or attn_layer_period"
+            attn_offset = self.hparams.get("attn_layer_offset")
+            assert attn_offset is not None, "No attention layer offset set with attn_layer_period"
+            attn_layers = [
+                i for i in range(self.block_count)
+                if i % attn_period == attn_offset
+            ]
+        return attn_layers
+
+    def find_hparam(self, keys: Iterable[str], *args, **kwargs) -> Any:
+        prefixed = []
+        for pfx in self.hparam_prefixes:
+            prefixed.extend(
+                "_".join([pfx, k])
+                for k in keys
+            )
+        keys = list(keys) + prefixed
+        return Mamba2Model.find_hparam(self, keys, *args, **kwargs)
+
+    def modify_tensors(
+        self, data_torch: Tensor, name: str, bid: int | None
+    ) -> Iterable[tuple[str, Tensor]]:
+        if (
+            name.endswith("block_sparse_moe.input_linear.weight")
+            or "shared_mlp" in name
+        ):
+            return GraniteMoeModel.modify_tensors(self, data_torch, name, bid)
+
+        # Determine whether this is a mamba layer or an attention layer
+        if bid in self._ssm_layers:
+            return Mamba2Model.modify_tensors(self, data_torch, name, bid)
+        elif bid in self._attn_layers:
+            return GraniteMoeModel.modify_tensors(self, data_torch, name, bid)
+        return [(self.map_tensor_name(name), data_torch)]
+
+    def set_gguf_parameters(self):
+        """This method merges params from both parents and some that are
+        specific to this model. The result is some duplication of how the params
+        get set. The following warnings are expected during conversion:
+
+        WARNING:Duplicated key name 'granitehybrid.attention.head_count_kv'
+        WARNING:Duplicated key name 'granitehybrid.context_length'
+        """
+        GraniteMoeModel.set_gguf_parameters(self)
+
+        ## Mamba mixer params ##
+        self.gguf_writer.add_ssm_conv_kernel(self.find_hparam(["conv_kernel", "d_conv"]))
+        self.gguf_writer.add_ssm_state_size(self.find_hparam(["state_size", "d_state"]))
+        self.gguf_writer.add_ssm_group_count(self.n_group)
+        self.gguf_writer.add_ssm_inner_size(self.d_inner)
+        # NOTE: The mamba_dt_rank is _not_ the right field for how this is used
+        # in llama.cpp
+        self.gguf_writer.add_ssm_time_step_rank(self.find_hparam(["n_heads"]))
+
+        ## Attention params ##
+        head_count_kv = self.find_hparam(["num_key_value_heads", "n_head_kv"])
+        head_count_kv_vec = [
+            head_count_kv if i in self._attn_layers else 0 for i in range(self.block_count)
+        ]
+        if rope_dim := self.hparams.get("attn_rotary_emb"):
+            self.gguf_writer.add_rope_dimension_count(rope_dim)
+        self.gguf_writer.add_head_count_kv(head_count_kv_vec)
+
+        ## If Bamba, use rope, otherwise don't
+        use_rope = "BambaForCausalLM" in self.hparams["architectures"]
+        self.gguf_writer.add_rope_scaling_finetuned(use_rope)
+        if not use_rope:
+            self.gguf_writer.add_context_length(2**20)
+
+        ## Validation ##
+        d_head = self.find_hparam(["d_head"], optional=True) or 64
+        assert self.hparams.get("hidden_act") in [None, "silu"], "Only SILU activation supported"
+        assert self.d_inner % d_head == 0, f"SSM inner size {self.d_inner} not a multiple of head dim {d_head}"
+
+    def set_vocab(self):
+        self.hparams["pad_vocab_size_multiple"] = 8
+        Mamba2Model.set_vocab(self)
+
+
 @ModelBase.register("BailingMoeForCausalLM")
 class BailingMoeModel(TextModel):
     model_arch = gguf.MODEL_ARCH.BAILINGMOE
@@ -6687,7 +6814,7 @@ def __init__(self, *args, **kwargs):
         # Use Llama conversion for attention
         self._transformer_model_class = LlamaModel

-        # n_group and d_inner are used during reshape_tensors for mamaba2
+        # n_group and d_inner are used during reshape_tensors for mamba2
         self.n_group = self.find_hparam(["n_groups"])
         self.d_inner = self.find_hparam(["mamba_d_ssm"])
         self.d_head = self.find_hparam(["d_head"])
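As a quick illustration of the layer-placement logic that GraniteHybridModel.get_attn_layers() implements in the diff above, the sketch below re-derives attention layer indices from either an explicit layer_types list or a period/offset pair. The helper and the example hparams are hypothetical; only the key names come from the converter code.

    def attn_layer_indices(hparams: dict, block_count: int) -> list[int]:
        # explicit per-layer type names take precedence
        if layer_types := hparams.get("layer_types"):
            return [i for i, typ in enumerate(layer_types) if typ == "attention"]
        # otherwise an explicit index list, or a period/offset rule
        if indices := hparams.get("attn_layer_indices"):
            return list(indices)
        period = hparams["attn_layer_period"]
        offset = hparams["attn_layer_offset"]
        return [i for i in range(block_count) if i % period == offset]

    print(attn_layer_indices({"attn_layer_period": 4, "attn_layer_offset": 3}, 12))
    # -> [3, 7, 11]; every other block would go through the Mamba2 tensor path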

gguf-py/gguf/constants.py

Lines changed: 32 additions & 0 deletions
@@ -352,6 +352,7 @@ class MODEL_ARCH(IntEnum):
     EXAONE = auto()
     GRANITE = auto()
     GRANITE_MOE = auto()
+    GRANITE_HYBRID = auto()
     CHAMELEON = auto()
     WAVTOKENIZER_DEC = auto()
     PLM = auto()
@@ -661,6 +662,7 @@
     MODEL_ARCH.EXAONE: "exaone",
     MODEL_ARCH.GRANITE: "granite",
     MODEL_ARCH.GRANITE_MOE: "granitemoe",
+    MODEL_ARCH.GRANITE_HYBRID: "granitehybrid",
     MODEL_ARCH.CHAMELEON: "chameleon",
     MODEL_ARCH.WAVTOKENIZER_DEC: "wavtokenizer-dec",
     MODEL_ARCH.PLM: "plm",
@@ -2143,6 +2145,36 @@ class MODEL_TENSOR(IntEnum):
         MODEL_TENSOR.FFN_UP_SHEXP,
         MODEL_TENSOR.FFN_DOWN_SHEXP,
     ],
+    MODEL_ARCH.GRANITE_HYBRID: [
+        MODEL_TENSOR.TOKEN_EMBD,
+        MODEL_TENSOR.OUTPUT_NORM,
+        MODEL_TENSOR.OUTPUT,
+        MODEL_TENSOR.ATTN_NORM,
+        MODEL_TENSOR.SSM_IN,
+        MODEL_TENSOR.SSM_CONV1D,
+        MODEL_TENSOR.SSM_DT,
+        MODEL_TENSOR.SSM_A,
+        MODEL_TENSOR.SSM_D,
+        MODEL_TENSOR.SSM_NORM,
+        MODEL_TENSOR.SSM_OUT,
+        MODEL_TENSOR.ATTN_Q,
+        MODEL_TENSOR.ATTN_K,
+        MODEL_TENSOR.ATTN_V,
+        MODEL_TENSOR.ATTN_OUT,
+        MODEL_TENSOR.FFN_NORM,
+        # MoE
+        MODEL_TENSOR.FFN_GATE_INP,
+        MODEL_TENSOR.FFN_GATE_EXP,
+        MODEL_TENSOR.FFN_DOWN_EXP,
+        MODEL_TENSOR.FFN_UP_EXP,
+        MODEL_TENSOR.FFN_GATE_SHEXP,
+        MODEL_TENSOR.FFN_UP_SHEXP,
+        MODEL_TENSOR.FFN_DOWN_SHEXP,
+        # Dense
+        MODEL_TENSOR.FFN_GATE,
+        MODEL_TENSOR.FFN_DOWN,
+        MODEL_TENSOR.FFN_UP,
+    ],
     MODEL_ARCH.CHAMELEON: [
         MODEL_TENSOR.TOKEN_EMBD,
         MODEL_TENSOR.OUTPUT_NORM,
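The new enum entry, name string, and tensor list above are consumed through the gguf-py lookup tables. A small sanity-check sketch, assuming these hunks extend the package's MODEL_ARCH_NAMES and MODEL_TENSORS dictionaries (as their surrounding entries suggest):

    import gguf

    arch = gguf.MODEL_ARCH.GRANITE_HYBRID
    print(gguf.MODEL_ARCH_NAMES[arch])  # expected: "granitehybrid"
    # Both SSM and dense-FFN tensors should be listed as valid for this arch
    print(gguf.MODEL_TENSOR.SSM_IN in gguf.MODEL_TENSORS[arch])
    print(gguf.MODEL_TENSOR.FFN_GATE in gguf.MODEL_TENSORS[arch])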

gguf-py/gguf/tensor_mapping.py

Lines changed: 13 additions & 13 deletions
@@ -13,7 +13,7 @@ class TensorNameMap:
             "transformer.wte",  # gpt2 gpt-j mpt refact qwen dbrx jais exaone
             "transformer.word_embeddings",  # falcon
             "word_embeddings",  # bloom
-            "model.embed_tokens",  # llama-hf nemotron olmoe olmo2 rwkv6qwen2 glm4-0414
+            "model.embed_tokens",  # llama-hf nemotron olmoe olmo2 rwkv6qwen2 glm4-0414 granite-hybrid
             "tok_embeddings",  # llama-pth
             "embeddings.word_embeddings",  # bert nomic-bert
             "language_model.embedding.word_embeddings",  # persimmon
@@ -118,7 +118,7 @@ class TensorNameMap:
             "transformer.h.{bid}.input_layernorm",  # falcon7b
             "h.{bid}.input_layernorm",  # bloom
             "transformer.h.{bid}.ln_mlp",  # falcon40b
-            "model.layers.{bid}.input_layernorm",  # llama-hf nemotron olmoe phimoe
+            "model.layers.{bid}.input_layernorm",  # llama-hf nemotron olmoe phimoe granite-hybrid
             "layers.{bid}.attention_norm",  # llama-pth
             "language_model.encoder.layers.{bid}.input_layernorm",  # persimmon
             "model.layers.{bid}.ln1",  # yi
@@ -279,7 +279,7 @@
             "transformer.decoder_layer.{bid}.rms_norm_2",  # Grok
             "encoder.layers.{bid}.post_attention_layernorm",  # chatglm
             "transformer.layers.{bid}.ffn_norm",  # openelm
-            "model.layers.{bid}.pre_ff_layernorm",  # jamba
+            "model.layers.{bid}.pre_ff_layernorm",  # jamba granite-hybrid
             "model.layers.{bid}.pre_moe_layernorm",  # mini-jamba
             "model.layers.{bid}.post_attention_layernorm",  # llama4
             "transformer_encoder.{bid}.ffn_norm",  # neobert
@@ -349,7 +349,7 @@
             "model.layers.{bid}.residual_mlp.w3",  # arctic
             "encoder.layers.{bid}.mlp.dense_h_to_4h",  # chatglm
             "transformer.h.{bid}.mlp.c_fc_1",  # exaone
-            "model.layers.{bid}.feed_forward.up_proj",  # llama4 jamba
+            "model.layers.{bid}.feed_forward.up_proj",  # llama4 jamba granite-hybrid
             "transformer_encoder.{bid}.ffn.w12",  # neobert
         ),

@@ -389,7 +389,7 @@
             "transformer.h.{bid}.mlp.linear_1",  # refact
             "model.layers.{bid}.residual_mlp.w1",  # arctic
             "transformer.h.{bid}.mlp.c_fc_0",  # exaone
-            "model.layers.{bid}.feed_forward.gate_proj",  # llama4 jamba
+            "model.layers.{bid}.feed_forward.gate_proj",  # llama4 jamba granite-hybrid
         ),

         MODEL_TENSOR.FFN_GATE_EXP: (
@@ -435,7 +435,7 @@
             "encoder.layer.{bid}.mlp.down_layer",  # jina-bert-v2
             "encoder.layers.{bid}.mlp.dense_4h_to_h",  # chatglm
             "model.layers.h.{bid}.mlp.c_proj",  # exaone
-            "model.layers.{bid}.feed_forward.down_proj",  # llama4 jamba
+            "model.layers.{bid}.feed_forward.down_proj",  # llama4 jamba granite-hybrid
             "transformer_encoder.{bid}.ffn.w3",  # neobert
         ),

@@ -558,13 +558,13 @@
         MODEL_TENSOR.SSM_IN: (
             "model.layers.{bid}.in_proj",  # mamba-hf
             "backbone.layers.{bid}.mixer.in_proj",  # mamba
-            "model.layers.{bid}.mamba.in_proj",  # jamba falcon-h1
+            "model.layers.{bid}.mamba.in_proj",  # jamba falcon-h1 granite-hybrid
         ),

         MODEL_TENSOR.SSM_CONV1D: (
             "model.layers.{bid}.conv1d",  # mamba-hf
             "backbone.layers.{bid}.mixer.conv1d",  # mamba
-            "model.layers.{bid}.mamba.conv1d",  # jamba falcon-h1
+            "model.layers.{bid}.mamba.conv1d",  # jamba falcon-h1 granite-hybrid
         ),

         MODEL_TENSOR.SSM_X: (
@@ -576,7 +576,7 @@
         MODEL_TENSOR.SSM_DT: (
             "model.layers.{bid}.dt_proj",  # mamba-hf
             "backbone.layers.{bid}.mixer.dt_proj",  # mamba
-            "model.layers.{bid}.mamba.dt_proj",  # jamba falcon-h1
+            "model.layers.{bid}.mamba.dt_proj",  # jamba falcon-h1 granite-hybrid
         ),

         MODEL_TENSOR.SSM_DT_NORM: (
@@ -586,7 +586,7 @@
         MODEL_TENSOR.SSM_A: (
             "model.layers.{bid}.A_log",  # mamba-hf
             "backbone.layers.{bid}.mixer.A_log",  # mamba
-            "model.layers.{bid}.mamba.A_log",  # jamba falcon-h1
+            "model.layers.{bid}.mamba.A_log",  # jamba falcon-h1 granite-hybrid
         ),

         MODEL_TENSOR.SSM_B_NORM: (
@@ -602,18 +602,18 @@
         MODEL_TENSOR.SSM_D: (
             "model.layers.{bid}.D",  # mamba-hf
             "backbone.layers.{bid}.mixer.D",  # mamba
-            "model.layers.{bid}.mamba.D",  # jamba falcon-h1
+            "model.layers.{bid}.mamba.D",  # jamba falcon-h1 granite-hybrid
         ),

         MODEL_TENSOR.SSM_NORM: (
-            "model.layers.{bid}.mamba.norm",  # falcon-h1
+            "model.layers.{bid}.mamba.norm",  # falcon-h1 granite-hybrid
             "backbone.layers.{bid}.mixer.norm",  # mamba2
         ),

         MODEL_TENSOR.SSM_OUT: (
             "model.layers.{bid}.out_proj",  # mamba-hf
             "backbone.layers.{bid}.mixer.out_proj",  # mamba
-            "model.layers.{bid}.mamba.out_proj",  # jamba falcon-h1
+            "model.layers.{bid}.mamba.out_proj",  # jamba falcon-h1 granite-hybrid
         ),

         MODEL_TENSOR.TIME_MIX_W0: (
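With the granite-hybrid names registered above, the generic tensor-name mapping should resolve a checkpoint tensor for this architecture without any model-specific code. A hedged usage sketch (the block count and layer index are arbitrary, and the expected output assumes the usual "blk.{bid}.ssm_in" GGUF naming):

    import gguf

    tmap = gguf.get_tensor_name_map(gguf.MODEL_ARCH.GRANITE_HYBRID, n_blocks=40)
    name = tmap.get_name("model.layers.3.mamba.in_proj", try_suffixes=(".weight",))
    print(name)  # expected: "blk.3.ssm_in"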
