Official source code of HELM, a family of fully HypErbolic Large Language Models (LLMs) consisting of two variants:
- HELM-MiCE: hyperbolic LLMs with a Mixture of Experts (MoE) module where each expert operates in a distinct curvature space to learn fine-grained token geometry
- HELM-D: hyperbolic dense LLMs that better align with the hierarchical structures in token embedding distributions
The Mixture of Curvature Experts (MiCE) module is a hyperbolic MoE module that lets each expert operate in a distinct curvature space, so that the experts can collectively learn more fine-grained geometric structure in the token distributions. The routing is also specifically designed to reflect the geometric structure of the space. Please see our paper for technical details.
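As a rough illustration of the idea only (this is not the repository's implementation: the class and parameter names below are hypothetical, and Poincaré-ball maps are used for brevity where the paper works in the Lorentz model), a curvature-per-expert MoE layer might look like:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def expmap0(v, c, eps=1e-6):
    # Poincare-ball exponential map at the origin, curvature -c (illustrative stand-in)
    n = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(c.sqrt() * n) * v / (c.sqrt() * n)

def logmap0(y, c, eps=1e-6):
    # Poincare-ball logarithmic map at the origin, curvature -c
    n = y.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.atanh((c.sqrt() * n).clamp(max=1 - eps)) * y / (c.sqrt() * n)

class MiCESketch(nn.Module):
    """Toy mixture-of-curvature-experts layer: each expert owns a learnable
    curvature c_e and transforms tokens in its own curved space."""
    def __init__(self, dim, n_experts=4, k=2):
        super().__init__()
        self.k = k
        self.curv = nn.Parameter(torch.zeros(n_experts))  # softplus -> c_e > 0
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.gate = nn.Linear(dim, n_experts)

    def forward(self, x):  # x: (tokens, dim)
        w, idx = F.softmax(self.gate(x), dim=-1).topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            c = F.softplus(self.curv[e]) + 1e-4  # expert-specific curvature
            for slot in range(self.k):
                sel = idx[:, slot] == e
                if sel.any():
                    # lift tokens into expert e's curved space, transform, map back
                    y = logmap0(expert(expmap0(x[sel], c)), c)
                    out[sel] += w[sel, slot].unsqueeze(-1) * y
        return out
```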
The HMLA module, similar to Euclidean Multi-Head Latent Attention, is designed specifically for hyperbolic LLMs so that the model only needs to cache the latent keys and values during generation. By projecting the original keys and values to a lower-dimensional subspace, HMLA significantly reduces the memory footprint of the KV-cache.
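To make the memory saving concrete, here is a minimal Euclidean sketch of the latent-KV idea, with all hyperbolic operations and attention details omitted; names like `down_kv` are illustrative, not the repository's API:

```python
import torch
import torch.nn as nn

class LatentKVCacheSketch(nn.Module):
    """Caches one low-rank latent per token instead of full per-head K/V."""
    def __init__(self, dim, n_heads, head_dim, kv_lora_rank):
        super().__init__()
        self.down_kv = nn.Linear(dim, kv_lora_rank)              # compress K/V jointly
        self.up_k = nn.Linear(kv_lora_rank, n_heads * head_dim)  # expand to full keys
        self.up_v = nn.Linear(kv_lora_rank, n_heads * head_dim)  # expand to full values
        self.cache = []  # stores only kv_lora_rank floats per token

    def step(self, x_t):  # x_t: (batch, dim), one new token during generation
        self.cache.append(self.down_kv(x_t))
        latents = torch.stack(self.cache, dim=1)  # (batch, seq, kv_lora_rank)
        # full keys/values are reconstructed on the fly and never cached
        return self.up_k(latents), self.up_v(latents)
```

Per cached token this stores `kv_lora_rank` values instead of `2 * n_heads * head_dim`, which is where the savings come from.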
```bash
# [OPTIONAL] create a virtual environment
python3.10 -m venv helm_env
source helm_env/bin/activate

# install requirements
bash setup_training.sh
```
Training is handled solely via the train.py file. We use sample packing enabled by the LLM-Foundry library, with a packing ratio of 3.0. The models are trained on the English portion of the Wikipedia dataset, tokenized with the LLaMA3.1-8B tokenizer.
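For intuition, sample packing concatenates several short tokenized samples into one `max_seq_len` bin, and a packing ratio of 3.0 means roughly three raw samples per bin on average. A toy first-fit packer (LLM-Foundry's actual implementation is more sophisticated) might look like:

```python
def pack_samples(token_seqs, max_seq_len):
    """Greedy first-fit packing of tokenized samples into max_seq_len bins."""
    bins = []
    for seq in sorted(token_seqs, key=len, reverse=True):  # longest first
        for b in bins:
            if len(b) + len(seq) <= max_seq_len:  # first bin with enough room
                b.extend(seq)
                break
        else:
            bins.append(list(seq))  # no bin fits; open a new one
    return bins

# e.g. three samples of 700/600/500 tokens fit into a single 2048-token bin
print(len(pack_samples([[1] * 700, [2] * 600, [3] * 500], max_seq_len=2048)))  # 1
```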
To train the models with the default config, first prepare the dataset by running the following command:
```bash
python3 helm/utils/prep_data.py
```
Then the models can be trained using the scripts found in the example folder. For example, to train the 120M-parameter HELM-MiCE model as done in the paper, run:
```bash
bash example/train_mice_120M.sh
```
If you wish to train your own hyperbolic LLM, you can override the parameters from the command line, with the following options:
training_config:
- `--train`: if true, the MiCE model will return information for load balancing
- `--min_lr_ratio`: ratio between the final target learning rate and the initial learning rate
- `warm_up_ratio`: percentage of steps to use as warm-up
- `seed`: random seed
- `lr`: initial learning rate
- `weight_decay`: weight decay for the optimizer
- `optimizer`: which optimizer to use; can be any of [Adam, RiemannianAdam]
- `packing_ratio`: how many samples to pack into one bin for sample packing
- `gradient_accumulation_steps`: how many gradient accumulation steps to use with the accelerator
- `CHECKPOINT_DIR`: where to save the model
- `log_dir`: where to log training dynamics
- `data_path`: path to the data
- `model_name`: one of HELM_D or HELM_MiCE
- `find_unused_parameters`: whether the accelerator should find unused parameters
- `max_batch_size`: maximum batch size
- `max_seq_len`: maximum sequence length
- `project_emb`: if true, the model maps tokens to the space-like dimensions of Lorentz vectors
- `vocab_size`: vocabulary size of the tokenizer
model_config:
- `dim`: model dimension
- `inter_dim`: intermediate dimension for MLP layers
- `mice_inter_dim`: intermediate dimension for MoE layers
- `n_layers`: number of transformer layers
- `n_dense_layers`: number of dense layers in the model
- `n_heads`: number of attention heads
- `n_routed_experts`: number of routed experts for MiCE layers
- `n_shared_experts`: number of shared experts for MiCE layers
- `n_activated_experts`: number of activated experts in MiCE layers
- `n_expert_groups`: number of expert groups
- `n_limited_groups`: number of limited groups for MiCE routing
- `score_func`: scoring function for MiCE routing
- `route_scale`: scaling factor for routing scores
- `bias_update_speed`: how much to update the gating bias to ensure expert load balancing
- `seq_bal_alpha`: scaling for the sequence-wise load-balancing loss
- `train_curv`: if true, sets the curvatures of the experts as trainable
- `q_lora_rank`: LoRA rank for query projections
- `kv_lora_rank`: LoRA rank for key-value projections
- `qk_nope_head_dim`: dimension for query-key projections without positional embeddings
- `qk_rope_head_dim`: dimension for query-key projections with rotary embeddings
- `v_head_dim`: dimension for value projections
- `original_seq_len`: original sequence length
- `rope_theta`: base for rotary positional encoding
- `rope_factor`: scaling factor for extended sequence lengths
- `beta_fast`: fast beta correction factor
- `beta_slow`: slow beta correction factor
- `arch`: model architecture for HELM-D, given by `La_Wb_Ac`, where `a` is the number of layers, `b` is the model dimension, and `c` is the number of heads (e.g., `L12_W768_A12` denotes 12 layers, dimension 768, and 12 heads)
Reminder: set the access_token variable in the corresponding files to load the correct tokenizer.
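Assuming the tokenizer is pulled from Hugging Face (the LLaMA3.1-8B repository is gated), setting the token would look something like:

```python
from transformers import AutoTokenizer

access_token = "hf_..."  # your Hugging Face access token
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B", token=access_token)
```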
To reuse HELM modules, please check the ./helm folder. For example, the MiCE module is in the mice.py file.
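For instance, importing it might look like the following (the class name here is hypothetical; check helm/mice.py for the actual API):

```python
# Hypothetical import path and class name — see helm/mice.py for the real ones.
from helm.mice import MiCE
```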
This project heavily relies on the following libraries. We thank the authors for their awesome contributions:
- HyperCore
- Accelerate
- LLM Foundry
- LM Evaluation Harness (for evaluation results)
This project is licensed under the MIT License - see the LICENSE file for details.
You can find the full details regarding the model and modules in the paper here. Please cite it as follows:
```bibtex
@article{he2025helm,
  title={HELM: Hyperbolic Large Language Models via Mixture-of-Curvature Experts},
  author={He, Neil and Anand, Rishabh and Madhu, Hiren and Maatouk, Ali and Krishnaswamy, Smita and Tassiulas, Leandros and Yang, Menglin and Ying, Rex},
  journal={arXiv preprint arXiv:2505.24722},
  year={2025}
}
```