add minimax-m1 doc #59
Merged
+144 −0
11 commits
2d4e63e  update minimax-m1.md (qscqesze)
649d291  update (qscqesze)
9298ff9  update (qscqesze)
c4aa24d  update (qscqesze)
ee05525  update (qscqesze)
b38ef16  change title (qscqesze)
ac3feb9  update (qscqesze)
c1d1b25  update (qscqesze)
3d51b62  Update _posts/2025-06-26-minimax-m1.md (youkaichao)
666503c  Update _posts/2025-06-26-minimax-m1.md (youkaichao)
0964d0d  Update _posts/2025-06-26-minimax-m1.md (youkaichao)
_posts/2025-06-26-minimax-m1.md
@@ -0,0 +1,144 @@
---
layout: post
title: "MiniMax-M1 Hybrid Architecture Meets vLLM: Long Context, Fast Inference"
author: "MiniMax"
benchmark-img: /assets/figures/minimax-m1/benchmark.png
moe-img: /assets/figures/minimax-m1/moe.png
lightning_attention-img: /assets/figures/minimax-m1/lightning_attention.png
---

This article explores how MiniMax-M1's hybrid architecture is efficiently supported in vLLM. We discuss the model's unique features, the challenges of efficient inference, and the technical solutions implemented in vLLM.

---

## Introduction

The rapid advancement of artificial intelligence has led to the emergence of increasingly powerful large language models (LLMs). [MiniMax-M1](https://arxiv.org/pdf/2506.13585), a popular open-source large-scale mixture-of-experts (MoE) reasoning model, has attracted significant attention since its release. Its innovative hybrid architecture points to the future of LLMs, enabling breakthroughs in long-context reasoning and complex task processing. Meanwhile, vLLM, a high-performance LLM inference and serving library, provides robust support for MiniMax-M1, making efficient deployment possible.

<img align="center" src="/assets/figures/minimax-m1/benchmark.png" alt="MiniMax-M1 Benchmark Performance" width="90%" height="90%">

* **Left:** Benchmark comparison of leading commercial and open-source models on tasks such as math, code, software engineering, tool use, and long-context understanding. MiniMax-M1 leads among open-source models.
* **Right:** Theoretical inference FLOPs scaling with token length. Compared to DeepSeek R1, MiniMax-M1 uses only 25% of the FLOPs when generating sequences of 100k tokens.

## Deploying MiniMax-M1 with vLLM

We recommend deploying **MiniMax-M1** using **vLLM** for optimal performance. Our tests demonstrate the following key benefits:

- Outstanding throughput
- Efficient and intelligent memory management
- Robust support for batched requests
- Deeply optimized backend performance

### Model Download

You can download the models from Hugging Face:

```bash
# Install the Hugging Face Hub CLI
pip install -U huggingface-hub

# Download the MiniMax-M1-40k model
huggingface-cli download MiniMaxAI/MiniMax-M1-40k

# For the 80k version, uncomment the following line:
# huggingface-cli download MiniMaxAI/MiniMax-M1-80k
```
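If you prefer to script the download, the same thing can be done from Python with the `huggingface_hub` library. This is a minimal sketch; the `local_dir` path below is only an example placeholder.

```python
# Minimal sketch: download MiniMax-M1-40k via the huggingface_hub Python API.
# The local_dir value is an example placeholder -- point it at your own storage path.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="MiniMaxAI/MiniMax-M1-40k",
    local_dir="/data/models/MiniMax-M1-40k",  # example path
)
```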

### Deployment

Below is a quick guide to deploying MiniMax-M1 with vLLM and Docker:

```bash
# Set environment variables
IMAGE=vllm/vllm-openai:latest
MODEL_DIR=<model storage path>
NAME=MiniMaxImage

# Docker run configuration
DOCKER_RUN_CMD="--network=host --privileged --ipc=host --ulimit memlock=-1 --rm --gpus all --ulimit stack=67108864"

# Start the container
sudo docker run -it \
    -v $MODEL_DIR:$MODEL_DIR \
    --name $NAME \
    $DOCKER_RUN_CMD \
    $IMAGE /bin/bash

# Launch the MiniMax-M1 service (the model path is passed as a positional argument)
export SAFETENSORS_FAST_GPU=1
export VLLM_USE_V1=0
vllm serve <model storage path> \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --quantization experts_int8 \
    --max-model-len 4096 \
    --dtype bfloat16
```
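Once the server is up, requests go through vLLM's OpenAI-compatible API. The snippet below is a minimal sketch that assumes the default listen address (`http://localhost:8000/v1`); the model name must match the model path passed to `vllm serve`.

```python
# Minimal sketch: query the vLLM OpenAI-compatible server started above.
# Assumes the default listen address (localhost:8000); adjust base_url if you changed it.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="<model storage path>",  # must match the model path passed to `vllm serve`
    messages=[{"role": "user", "content": "Summarize the benefits of linear attention."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```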

## MiniMax-M1 Hybrid Architecture Highlights

### Mixture-of-Experts (MoE)

MiniMax-M1 utilizes a Mixture-of-Experts (MoE) architecture with **456 billion total parameters**. During inference, a dynamic routing algorithm activates a sparse subset of experts (~45.9B parameters, or 10% of the total), based on the semantic characteristics of input tokens. This sparse activation is managed by a gating network that computes expert selection probabilities.

This approach significantly improves computational efficiency: in classification tasks, it reduces computational cost by up to 90% while maintaining accuracy comparable to dense models.
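To make the gating idea concrete, here is a small, self-contained sketch of top-k expert routing. It is purely illustrative: the expert count, top-k value, and layer shapes below are toy numbers, not MiniMax-M1's actual configuration.

```python
# Illustrative top-k MoE gating sketch (toy sizes, not MiniMax-M1's real router).
import torch
import torch.nn.functional as F

hidden_size, num_experts, top_k = 512, 8, 2         # toy configuration
tokens = torch.randn(4, hidden_size)                # a batch of 4 token embeddings
router = torch.nn.Linear(hidden_size, num_experts)  # gating network
experts = torch.nn.ModuleList(
    [torch.nn.Linear(hidden_size, hidden_size) for _ in range(num_experts)]
)

# The gating network scores all experts, but only the top_k are executed per token.
weights, idx = torch.topk(F.softmax(router(tokens), dim=-1), top_k, dim=-1)
weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over selected experts

out = torch.zeros_like(tokens)
for t in range(tokens.shape[0]):
    for slot in range(top_k):
        out[t] += weights[t, slot] * experts[int(idx[t, slot])](tokens[t])
```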

<figure>
<img align="center" src="/assets/figures/minimax-m1/moe.png" alt="MoE vs. Dense Comparison" width="90%" height="90%">
<figcaption style="text-align:center; font-style:italic;">
Isoflop Comparison: MoE vs. Dense on various benchmarks. Both models are trained on 1 trillion tokens. The gray dashed lines indicate the difference in computation required for the two models to achieve the same performance.
</figcaption>
</figure>

### Lightning Attention

**Lightning Attention** addresses the quadratic complexity bottleneck of traditional attention by introducing linearized approximation techniques. It transforms softmax attention into a **linear combination of matrix multiplications**, aided by dynamic memory tiling and gradient approximation.

In code completion benchmarks, Lightning Attention reduces memory usage by **83%** and inference latency by **67%** for 100k-token sequences.
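To see why linearizing attention changes the scaling, the toy PyTorch snippet below contrasts standard softmax attention with a simple (non-causal) linear-attention formulation. It is only a reference illustration of the regrouping trick, not the actual Lightning Attention kernel or its tiling scheme.

```python
# Toy contrast of quadratic vs. linearized attention (not the real Lightning Attention kernel).
import torch

n, d = 2048, 64                                   # sequence length, head dimension
q, k, v = (torch.randn(n, d) for _ in range(3))

# Standard softmax attention materializes an n x n score matrix: O(n^2 * d).
out_quadratic = torch.softmax(q @ k.T / d**0.5, dim=-1) @ v

# Linear attention with a positive feature map phi: regrouping
# (phi(q) @ phi(k).T) @ v  as  phi(q) @ (phi(k).T @ v)  costs only O(n * d^2).
phi = lambda x: torch.nn.functional.elu(x) + 1
kv_state = phi(k).T @ v                                  # d x d summary of keys and values
normalizer = phi(q) @ phi(k).sum(dim=0).unsqueeze(-1)    # per-query normalization, shape (n, 1)
out_linear = (phi(q) @ kv_state) / normalizer
```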

<figure>
<img align="center" src="/assets/figures/minimax-m1/lightning_attention.png" alt="Lightning Attention Algorithm" width="90%" height="90%">
<figcaption style="text-align:center; font-style:italic;">
Overview of the Lightning Attention algorithm, which reduces memory usage and latency for long sequences.
</figcaption>
</figure>

### Efficient Computation & Activation Strategy

Thanks to its hybrid architecture, MiniMax-M1 enables efficient computation and scalable inference. The Lightning Attention mechanism dramatically improves runtime performance, while the sparse expert activation strategy avoids unnecessary computation. This makes it feasible to achieve strong performance even with limited hardware resources.

To learn more about MiniMax-M1, please refer to [this paper](https://arxiv.org/pdf/2506.13585).

## Efficient Inference with vLLM

### Advanced Memory Management

vLLM introduces PagedAttention, a technique for managing attention key-value (KV) caches more efficiently. Instead of storing the KV cache contiguously, vLLM divides it into multiple memory pages, greatly reducing fragmentation and over-allocation. This keeps memory waste under 4%, compared to 60%-80% with traditional approaches.

Such efficient memory handling is crucial for models like MiniMax-M1 that support ultra-long context lengths, ensuring smooth and stable inference without running into memory bottlenecks.
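The bookkeeping idea behind paged KV-cache management can be sketched in a few lines. The class below is a deliberately simplified illustration of a block table, not vLLM's actual allocator.

```python
# Simplified sketch of paged KV-cache bookkeeping (not vLLM's actual allocator).
BLOCK_SIZE = 16  # tokens per KV-cache block

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # pool of free physical blocks
        self.block_tables = {}                      # request id -> list of physical block ids

    def append_token(self, request_id: str, position: int) -> int:
        """Return the physical block holding this token, allocating a block only when needed."""
        table = self.block_tables.setdefault(request_id, [])
        if position // BLOCK_SIZE >= len(table):    # previous block is full (or first token)
            table.append(self.free_blocks.pop())    # lazy, block-granular allocation
        return table[position // BLOCK_SIZE]

    def free(self, request_id: str) -> None:
        """Return all blocks of a finished request to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))

cache = PagedKVCache(num_blocks=1024)
for pos in range(40):        # a 40-token request occupies ceil(40 / 16) = 3 blocks, no more
    cache.append_token("req-0", pos)
cache.free("req-0")
```

Because blocks are allocated on demand and returned when a request finishes, waste is bounded by at most one partially filled block per sequence, which is what keeps fragmentation so low.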

### Deep Kernel-Level Optimizations

vLLM incorporates a wide range of CUDA kernel optimizations, including integrations with FlashAttention and FlashInfer, as well as support for quantization formats such as GPTQ, AWQ, INT4, INT8, and FP8.

These enhancements further boost the low-level computation efficiency of MiniMax-M1 inference. Quantization reduces memory and compute overhead with minimal accuracy loss, while FlashAttention accelerates the attention computation itself, resulting in significantly faster inference in real-world applications.
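For offline (batch) inference, the same settings can be passed to vLLM's Python API. The sketch below mirrors the serving flags used earlier; it assumes a machine with enough GPU memory for 8-way tensor parallelism, and you may still need the environment variables shown in the deployment section.

```python
# Sketch: offline inference with vLLM's Python API, mirroring the server flags above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="MiniMaxAI/MiniMax-M1-40k",
    tensor_parallel_size=8,
    trust_remote_code=True,
    quantization="experts_int8",   # same experts_int8 setting as the serving example
    max_model_len=4096,
    dtype="bfloat16",
)
outputs = llm.generate(
    ["Explain what a hybrid attention architecture is."],
    SamplingParams(temperature=1.0, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```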

### Lightning Attention in vLLM

As a cutting-edge attention mechanism, Lightning Attention is implemented in vLLM via Triton, leveraging Triton's flexibility and high-performance kernel programming model. A Triton-based execution path supports Lightning Attention's core computation logic, enabling seamless integration and deployment within the vLLM ecosystem.
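For readers new to Triton, the skeleton below shows the block-programming style such kernels are written in: program IDs, per-block offset ranges, and masked loads and stores. It is a trivial element-wise example, not the Lightning Attention kernel itself.

```python
# Trivial Triton kernel skeleton (element-wise scaling) -- not the Lightning Attention kernel,
# only an illustration of the block-programming model it is built with.
import torch
import triton
import triton.language as tl

@triton.jit
def scale_kernel(x_ptr, out_ptr, scale, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                            # each program instance handles one block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)  # element indices for this block
    mask = offsets < n_elements                            # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * scale, mask=mask)

x = torch.randn(10_000, device="cuda")
out = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE"]),)
scale_kernel[grid](x, out, 0.5, x.numel(), BLOCK_SIZE=1024)
```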

## Future Work

Looking ahead, further optimizations for hybrid architecture support are actively being explored within the vLLM community. Notably, the development of a hybrid allocator is expected to enable even more efficient memory management tailored to the unique requirements of models like MiniMax-M1.

In addition, full support for [vLLM v1](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html) is planned, with the hybrid model architecture expected to be migrated into the v1 framework. These advancements are anticipated to unlock further performance improvements and provide a more robust foundation for future developments.

## Conclusion

The hybrid architecture of MiniMax-M1 paves the way for the next generation of large language models, offering powerful capabilities in long-context reasoning and complex task inference. vLLM complements this with highly optimized memory handling, robust batch request management, and deeply tuned backend performance.

Together, MiniMax-M1 and vLLM form a strong foundation for efficient and scalable AI applications. As the ecosystem evolves, we anticipate this synergy will power more intelligent, responsive, and capable solutions across a wide range of use cases, including code generation, document analysis, and conversational AI.

## Acknowledgement

We would like to express our sincere gratitude to the vLLM community for their invaluable support and collaboration. In particular, we thank [Tyler Michael Smith](https://github.com/tlrmchlsmth), [Simon Mo](https://github.com/simon-mo), [Cyrus Leung](https://github.com/DarkLight1337), [Roger Wang](https://github.com/ywang96), [Zifeng Mo](https://github.com/Isotr0py) and [Kaichao You](https://github.com/youkaichao) for their significant contributions. We also appreciate the efforts of the MiniMax engineering team, especially [Gangying Qing](https://github.com/ZZBoom), [Jun Qing](https://github.com/qscqesze), and [Jiaren Cai](https://github.com/sriting), whose dedication made this work possible.
@tlrmchlsmth @simon-mo @DarkLight1337 @ywang96 @Isotr0py Can you please take a look and let us know if you have any suggestions? The blog post is planned to be released this week. Thanks!