New MoE series from Baidu: https://github.com/PaddlePaddle/ERNIE
...We designed a heterogeneous MoE structure, incorporated modality-isolated routing, and employed router orthogonal loss and multimodal token-balanced loss...
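For my own notes, here's roughly what I picture "modality-isolated routing" plus a "router orthogonal loss" looking like. This is purely my reading of the wording, not Baidu's implementation; every name, shape, and hyperparameter below is made up for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityIsolatedRouter(nn.Module):
    """Hypothetical sketch: separate expert pools and routers per modality."""

    def __init__(self, d_model: int, n_text_experts: int, n_vision_experts: int, top_k: int = 2):
        super().__init__()
        # Separate router weights per modality, so text tokens never score vision experts.
        self.text_router = nn.Linear(d_model, n_text_experts, bias=False)
        self.vision_router = nn.Linear(d_model, n_vision_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor, is_vision: torch.Tensor):
        # x: (n_tokens, d_model); is_vision: (n_tokens,) bool mask.
        # Each modality is routed only within its own expert pool, so the two
        # modalities never compete for the same expert capacity.
        out = {}
        for name, router, mask in (
            ("text", self.text_router, ~is_vision),
            ("vision", self.vision_router, is_vision),
        ):
            if mask.any():
                logits = router(x[mask])                      # (n_mod_tokens, n_experts)
                vals, expert_ids = logits.topk(self.top_k, dim=-1)
                out[name] = (expert_ids, F.softmax(vals, dim=-1), mask)
        return out

    def orthogonal_loss(self) -> torch.Tensor:
        # One plausible reading of a "router orthogonal loss": push each router's
        # expert embeddings toward orthogonality (penalize off-diagonal entries of
        # the Gram matrix) so experts specialize on distinct directions.
        loss = self.text_router.weight.new_zeros(())
        for router in (self.text_router, self.vision_router):
            w = F.normalize(router.weight, dim=-1)            # (n_experts, d_model)
            gram = w @ w.t()
            off_diag = gram - torch.eye(gram.size(0), device=gram.device)
            loss = loss + off_diag.pow(2).mean()
        return loss
```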
This bit caught my eye:
...For inference, we propose multi-expert parallel collaboration method and convolutional code quantization algorithm to achieve 4-bit/2-bit lossless quantization...
https://github.com/PaddlePaddle/ERNIE?tab=readme-ov-file#model-development
ERNIE-4.5-300B-A47B: BF16 / W4A16C16 / W8A16C16 / W4A8C8 / FP8 / 2Bits
https://huggingface.co/baidu/ERNIE-4.5-300B-A47B-2Bits-Paddle
2-bit QAT on a 300B-parameter model? Now that's interesting.
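Back of the envelope, ignoring quantization scales/zero-points and whatever tensors stay at higher precision: 300e9 weights × 2 bits ≈ 75 GB, versus roughly 600 GB at BF16. If the "lossless" claim holds up at all, that puts the full model within reach of a single multi-GPU box or a big-RAM CPU rig.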
I'm leaving this as a drive-by request, as I still have other issues (like testing Hunyuan!) in my queue.
Related issue: ggml-org/llama.cpp#14408
Unrelated, but Huawei just dropped a 72B MoE trained on NPUs: https://huggingface.co/IntervitensInc/pangu-pro-moe-model
Seems to be specifically designed for even multi-device distribution:
We proposed a new type of Mixture of Grouped Experts (MoGE), which groups experts in the expert selection stage and constrains tokens to activate equal experts in each group, thereby achieving natural load balancing between devices.
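In case it helps anyone picture it, here's roughly how I read that grouped top-k idea: partition the experts into G groups (one group per device) and make every token pick the same number of experts from each group, so per-device load is equal by construction. Illustrative only, not Pangu's actual code; the group count and k-per-group below are made-up numbers:

```python
import torch
import torch.nn.functional as F

def moge_route(router_logits: torch.Tensor, n_groups: int, k_per_group: int):
    """router_logits: (n_tokens, n_experts); n_experts must be divisible by n_groups."""
    n_tokens, n_experts = router_logits.shape
    group_size = n_experts // n_groups
    # View logits as (tokens, groups, experts-per-group) and take top-k inside each group,
    # so every token activates exactly k_per_group experts from every group.
    grouped = router_logits.view(n_tokens, n_groups, group_size)
    topk_vals, topk_idx = grouped.topk(k_per_group, dim=-1)
    # Convert within-group indices back to global expert ids.
    group_offset = torch.arange(n_groups, device=router_logits.device) * group_size
    expert_ids = topk_idx + group_offset.view(1, n_groups, 1)
    # Normalize gate weights over the selected experts only.
    gates = F.softmax(topk_vals.view(n_tokens, -1), dim=-1)
    return expert_ids.view(n_tokens, -1), gates

# e.g. 64 experts split into 8 groups, 1 expert per group -> 8 active experts per token,
# exactly one landing on each device.
ids, gates = moge_route(torch.randn(4, 64), n_groups=8, k_per_group=1)
```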
LG is about to release EXAONE 4.0 as well: ggml-org/llama.cpp#14474
I can't keep up with any of this, lol.