Feature Request: ERNIE MoE Model Support #568

@Downtown-Case

Description

New MoE series from Baidu: https://github.com/PaddlePaddle/ERNIE

...We designed a heterogeneous MoE structure, incorporated modality-isolated routing, and employed router orthogonal loss and multimodal token-balanced loss...

This bit caught my eye:

...For inference, we propose multi-expert parallel collaboration method and convolutional code quantization algorithm to achieve 4-bit/2-bit lossless quantization...

https://github.com/PaddlePaddle/ERNIE?tab=readme-ov-file#model-development

ERNIE-4.5-300B-A47B: BF16 / W4A16C16 / W8A16C16 / W4A8C8 / FP8 / 2Bits

https://huggingface.co/baidu/ERNIE-4.5-300B-A47B-2Bits-Paddle

2-bit QAT on a 300B? Now that's interesting.
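For reference, here is my reading of the variant naming above, assuming the usual W&lt;weight-bits&gt;A&lt;activation-bits&gt;C&lt;kv-cache-bits&gt; convention (not confirmed against Baidu's docs):

```python
# Hedged sketch: decode variant names like "W4A16C16" under the assumed
# W<weight bits> A<activation bits> C<KV-cache bits> naming convention.
import re

def parse_variant(name: str) -> dict:
    """Parse e.g. 'W4A16C16' into bit widths; other names pass through."""
    m = re.fullmatch(r"W(\d+)A(\d+)C(\d+)", name)
    if not m:
        return {"format": name}  # BF16, FP8, 2Bits, ...
    w, a, c = map(int, m.groups())
    return {"weights": w, "activations": a, "kv_cache": c}

for v in ["BF16", "W4A16C16", "W8A16C16", "W4A8C8", "FP8", "2Bits"]:
    print(v, parse_variant(v))
```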

I am leaving this as a drive-by request, as I still have other issues (like testing Hunyuan!) in my queue.

Related issue: ggml-org/llama.cpp#14408


Unrelated, but Huawei just dropped a 72B MoE trained on NPUs: https://huggingface.co/IntervitensInc/pangu-pro-moe-model

It seems to be specifically designed for even distribution across multiple devices:

We proposed a new type of Mixture of Grouped Experts (MoGE), which groups experts in the expert selection stage and constrains tokens to activate an equal number of experts in each group, thereby achieving natural load balancing between devices.
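Not code for this repo, but here's a minimal numpy sketch of what I understand grouped top-k routing to look like; group count, k per group, and all names are my own illustration, not Pangu's actual implementation:

```python
# Minimal sketch of grouped top-k routing in the spirit of MoGE (assumption:
# experts split evenly into groups, same number activated per group per token).
import numpy as np

def moge_route(router_logits: np.ndarray, num_groups: int, k_per_group: int):
    """router_logits: (num_tokens, num_experts) -> (expert_ids, weights)."""
    num_tokens, num_experts = router_logits.shape
    assert num_experts % num_groups == 0
    group_size = num_experts // num_groups

    # View logits as (tokens, groups, experts_per_group).
    grouped = router_logits.reshape(num_tokens, num_groups, group_size)

    # Top-k within each group, so every group contributes the same number of
    # active experts -- the "natural load balancing" the quote describes.
    topk_local = np.argsort(-grouped, axis=-1)[..., :k_per_group]

    # Convert local (within-group) indices back to global expert ids.
    offsets = (np.arange(num_groups) * group_size)[None, :, None]
    expert_ids = (topk_local + offsets).reshape(num_tokens, -1)

    # Softmax over the selected logits to get mixture weights.
    sel = np.take_along_axis(router_logits, expert_ids, axis=-1)
    w = np.exp(sel - sel.max(axis=-1, keepdims=True))
    return expert_ids, w / w.sum(axis=-1, keepdims=True)

# Example: 16 experts in 4 groups, 1 expert per group -> 4 active experts/token.
ids, weights = moge_route(np.random.randn(2, 16), num_groups=4, k_per_group=1)
print(ids, weights)
```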

LG is about to release EXAONE 4.0 as well: ggml-org/llama.cpp#14474

I can't keep up with any of this, lol.
