Feature Request: ERNIE MoE Model Support #568

@Downtown-Case

Description

New MoE series from Baidu: https://github.com/PaddlePaddle/ERNIE

...We designed a heterogeneous MoE structure, incorporated modality-isolated routing, and employed router orthogonal loss and multimodal token-balanced loss...

This bit caught my eye:

...For inference, we propose multi-expert parallel collaboration method and convolutional code quantization algorithm to achieve 4-bit/2-bit lossless quantization...

https://github.com/PaddlePaddle/ERNIE?tab=readme-ov-file#model-development

ERNIE-4.5-300B-A47B: BF16 / W4A16C16 / W8A16C16 / W4A8C8 / FP8 / 2Bits

https://huggingface.co/baidu/ERNIE-4.5-300B-A47B-2Bits-Paddle

2-bit QAT on a 300B? Now that's interesting.
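For reference, here is my reading of the variant naming above, assuming the usual W&lt;weight-bits&gt;A&lt;activation-bits&gt;C&lt;kv-cache-bits&gt; convention (not confirmed against Baidu's docs):

```python
# Hedged sketch: decode variant names like "W4A16C16" under the assumed
# W<weight bits> A<activation bits> C<KV-cache bits> naming convention.
import re

def parse_variant(name: str) -> dict:
    """Parse e.g. 'W4A16C16' into bit widths; other names pass through."""
    m = re.fullmatch(r"W(\d+)A(\d+)C(\d+)", name)
    if not m:
        return {"format": name}  # BF16, FP8, 2Bits, ...
    w, a, c = map(int, m.groups())
    return {"weights": w, "activations": a, "kv_cache": c}

for v in ["BF16", "W4A16C16", "W8A16C16", "W4A8C8", "FP8", "2Bits"]:
    print(v, parse_variant(v))
```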

I am leaving this as a drive-by request, as I still have other issues (like testing Hunyuan!) in my queue.

Related issue: ggml-org/llama.cpp#14408


Unrelated, but Huawei just dropped a 72B MoE trained on NPUs: https://huggingface.co/IntervitensInc/pangu-pro-moe-model

It seems to be specifically designed for even distribution across multiple devices:

We proposed a new type of Mixture of Grouped Experts (MoGE), which groups experts in the expert selection stage and constrains tokens to activate an equal number of experts in each group, thereby achieving natural load balancing between devices.
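Not code for this repo, but here's a minimal numpy sketch of what I understand grouped top-k routing to look like; group count, k per group, and all names are my own illustration, not Pangu's actual implementation:

```python
# Minimal sketch of grouped top-k routing in the spirit of MoGE (assumption:
# experts split evenly into groups, same number activated per group per token).
import numpy as np

def moge_route(router_logits: np.ndarray, num_groups: int, k_per_group: int):
    """router_logits: (num_tokens, num_experts) -> (expert_ids, weights)."""
    num_tokens, num_experts = router_logits.shape
    assert num_experts % num_groups == 0
    group_size = num_experts // num_groups

    # View logits as (tokens, groups, experts_per_group).
    grouped = router_logits.reshape(num_tokens, num_groups, group_size)

    # Top-k within each group, so every group contributes the same number of
    # active experts -- the "natural load balancing" the quote describes.
    topk_local = np.argsort(-grouped, axis=-1)[..., :k_per_group]

    # Convert local (within-group) indices back to global expert ids.
    offsets = (np.arange(num_groups) * group_size)[None, :, None]
    expert_ids = (topk_local + offsets).reshape(num_tokens, -1)

    # Softmax over the selected logits to get mixture weights.
    sel = np.take_along_axis(router_logits, expert_ids, axis=-1)
    w = np.exp(sel - sel.max(axis=-1, keepdims=True))
    return expert_ids, w / w.sum(axis=-1, keepdims=True)

# Example: 16 experts in 4 groups, 1 expert per group -> 4 active experts/token.
ids, weights = moge_route(np.random.randn(2, 16), num_groups=4, k_per_group=1)
print(ids, weights)
```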

LG is about to release EXAONE 4.0 as well: ggml-org/llama.cpp#14474

I can't keep up with any of this, lol.
