
eLLM: Infer Qwen3-480B on a CPU in Real Time

eLLM can run full MoE models (Qwen3-480B) on a single-socket CPU-only server. It

  • Runs real-time short-sequence inference (~100 ms/token)
  • Runs lossless long-sequence inference on contexts of up to millions of tokens

How It Works:

  • Leverages the large memory capacity of CPU servers with a space-for-time design strategy to maximize inference efficiency
  • Introduces an elastic computation graph in place of dynamic graph construction, eliminating costly per-step preparation before computation (see the sketch after this list).
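The snippet below is a minimal sketch of the static-task-set idea: the task list is built once at startup and simply replayed for every decoding step, so nothing is constructed or allocated on the inference path. All names (`Task`, `build_static_graph`) are illustrative and are not eLLM's actual API.

```cpp
// Minimal sketch of a "static task set" replacing per-step dynamic graph
// construction. Names are illustrative, not eLLM's real interfaces.
#include <cstdio>
#include <functional>
#include <vector>

struct Task {
    int layer;                     // transformer layer this task belongs to
    std::function<void(int)> run;  // kernel to execute; argument is the token position
};

// Built once at startup: one task per layer covering the whole model.
// Nothing is constructed again on the inference path.
std::vector<Task> build_static_graph(int num_layers) {
    std::vector<Task> graph;
    graph.reserve(num_layers);
    for (int l = 0; l < num_layers; ++l) {
        graph.push_back({l, [l](int pos) {
            // Placeholder for attention + MoE kernels operating on
            // pre-allocated buffers for layer `l` at position `pos`.
            std::printf("layer %d, token position %d\n", l, pos);
        }});
    }
    return graph;
}

int main() {
    const int num_layers = 4;  // toy size
    std::vector<Task> graph = build_static_graph(num_layers);

    // Decoding loop: each new token replays the ready-made task set,
    // so no graph building or memory allocation happens per step.
    for (int pos = 0; pos < 3; ++pos)
        for (const Task& t : graph)
            t.run(pos);
    return 0;
}
```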

🌐 Language: English | 简体中文

✅ Call for Contributors & Funding

  • The project is under active development, with a minimal prototype (Qwen3-30B) expected to be released in about one month!
  • We are looking for volunteers — if you're interested, please contact lucienhuangfu@outlook.com.
  • We are looking for funding — if you're interested, please contact lucienhuangfu@outlook.com.

Key Capabilities:

  • Full MoE model loading with dynamic expert activation
  • Full storage of the million-token context (KV Cache); see the sizing estimate after this list
  • Standard attention inference (every token can reference the full context)
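To see why million-token contexts call for CPU-scale memory, here is a rough KV Cache sizing calculation. The model shape used below (62 layers, 8 KV heads, head dimension 128, fp16) is an assumption for illustration, not the published Qwen3-480B configuration.

```cpp
// Back-of-envelope KV Cache sizing for a million-token context.
// The model shape below is an assumption, not Qwen3-480B's exact config.
#include <cstdio>

int main() {
    const double layers   = 62;    // assumed number of transformer layers
    const double kv_heads = 8;     // assumed KV heads (GQA)
    const double head_dim = 128;   // assumed head dimension
    const double bytes    = 2;     // fp16 per element
    const double tokens   = 1e6;   // million-token context

    // Both K and V are stored, hence the factor of 2.
    double bytes_per_token = 2 * layers * kv_heads * head_dim * bytes;
    double total_gib = bytes_per_token * tokens / (1024.0 * 1024.0 * 1024.0);

    std::printf("KV cache: %.0f bytes/token, %.1f GiB for 1M tokens\n",
                bytes_per_token, total_gib);
    return 0;
}
```

Under these assumed shapes, a one-million-token context needs on the order of 240 GiB of KV Cache, which fits comfortably in TB-scale DRAM but not in a single GPU's HBM.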

Use Cases

Online Short-Text Inference

  • Search-based Q&A
  • Code completion
  • Chatbots

Offline Long-Text Inference (Deep Research)

  • Code auditing / high-risk vulnerability detection
  • Contract review / document compliance checking
  • Financial statement compliance checks
  • Literary creation / extended writing

Competitive Advantage: Private LLM Deployment Without GPUs

eLLM enables small and medium teams to deploy large models with lower costs and more flexible setups.

No Need for High-End GPU Servers

  • Single-socket CPU-only servers can run MoE-architecture LLMs
  • Requires only a general-purpose CPU supporting AVX-512 FP16 (a runtime feature-check sketch follows this list)
  • Memory can be expanded with DDR5
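A quick way to confirm that a machine qualifies is to query CPUID for the relevant feature bits. The sketch below checks the AVX-512 FP16 and AMX bits using the `<cpuid.h>` helper available in GCC/Clang on x86-64; the bit positions come from Intel's documentation.

```cpp
// Check that the host CPU exposes AVX-512 FP16 and AMX.
// GCC/Clang on x86-64 only.
#include <cpuid.h>
#include <cstdio>

int main() {
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        std::printf("CPUID leaf 7 not supported\n");
        return 1;
    }
    bool amx_bf16   = edx & (1u << 22);  // AMX-BF16
    bool avx512fp16 = edx & (1u << 23);  // AVX512_FP16
    bool amx_tile   = edx & (1u << 24);  // AMX-TILE

    std::printf("AVX-512 FP16: %s\n", avx512fp16 ? "yes" : "no");
    std::printf("AMX tile / BF16: %s / %s\n",
                amx_tile ? "yes" : "no", amx_bf16 ? "yes" : "no");
    return 0;
}
```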

Simple Deployment, Adaptable to Multiple Scenarios

  • Easily deploy to local servers / private cloud / edge nodes
  • Supports on-demand elastic computation, automatically freeing resources after tasks complete
  • Scales horizontally to handle high-concurrency inference workloads

Machine Comparison: CPU-Only Server vs GPU Server

Item                     CPU-Only Server     GPU Server (CPU / GPU)
Model                    Xeon 6900           Xeon 8480+ / H20
Memory Capacity (TB)     3                   2 / 0.141
Quantity                 1                   4 / 8
Total Price (10k RMB)    15                  150

Problems with Existing Solutions

  • High barrier, high cost
    • 🧠 High GPU inference barrier: the per-user cost of long-context inference is extremely high
    • 📦 Context limitations: GPU VRAM cannot hold the full long-text context
    • 🔀 Complex expert routing: requires synchronized expert routing, adding system complexity
  • Dynamic memory management and task generation consume additional CPU resources
    • On CPU-only servers, only a small share of the CPU is spent on this, leaving most CPU cycles for computation
    • On GPU servers, this work still occupies the host CPU, even though the GPUs themselves focus solely on computation
  • Performance bottlenecks grow super-linearly with sequence length
    • Dynamic memory allocation: pre-consumes memory in ways that are hard to control
    • Dynamic graph construction: high runtime overhead, low efficiency
    • Chunked KV cache: poor bandwidth utilization, lowering inference efficiency

💡 Why MoE is More Suitable for CPU-Only Inference

  • MoE is a “store big, compute small” architecture (see the back-of-envelope numbers after this list)
    • High storage demand: TB-level expert parameters must reside in memory
    • Low bandwidth demand: only activated experts are loaded
    • Low compute demand: only active paths are computed
  • CPU-only architecture matches MoE inference needs
    • Large storage capacity: easily holds all expert parameters
    • Low bandwidth capacity: only a few experts are activated per token, so memory bandwidth pressure stays low
      • MRDIMM: doubles memory bandwidth, keeping the compute pipelines fully fed
    • Low computation capacity: ideal for AMX matrix acceleration
      • AMX: matrix instruction extensions that deliver a multi-fold performance improvement
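The following back-of-envelope numbers illustrate the “store big, compute small” point. The parameter counts used (480B total, 35B active per token) and fp16 storage are assumptions for illustration only.

```cpp
// Rough illustration of the MoE "store big, compute small" property:
// the whole parameter set must be resident in memory, but each token only
// touches the activated experts. Counts below are assumptions.
#include <cstdio>

int main() {
    const double total_params  = 480e9;  // assumed total MoE parameters
    const double active_params = 35e9;   // assumed parameters active per token
    const double bytes_per_p   = 2;      // fp16 storage

    double resident_gb  = total_params  * bytes_per_p / 1e9;  // must sit in DRAM
    double per_token_gb = active_params * bytes_per_p / 1e9;  // moved per token

    std::printf("resident weights: %.0f GB (storage demand)\n", resident_gb);
    std::printf("weights read per token: %.0f GB (~%.1f%% of the model)\n",
                per_token_gb, 100.0 * active_params / total_params);
    return 0;
}
```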

Why eLLM Outperforms Existing Frameworks (vLLM, Llama.cpp)

Space-for-Time: Rebuilding CPU Inference Engines

🛠️ Static Resource Allocation

  • Static memory allocation + static graph compilation → static task set
  • Directly obtain ready-to-run tasks, freeing more CPU cycles for computation

⚡ Inference Latency Grows Linearly with Sequence Length

  • Static memory allocation: avoids fragmentation, achieves globally optimal layout
  • Contiguous memory layout: maximizes KV Cache bandwidth utilization (see the layout sketch after this list)
  • Dynamic task scheduling: supports irregular input
  • Dynamic expert activation: only computes necessary paths, saving compute
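Below is a minimal sketch of what a statically pre-allocated, contiguous KV Cache can look like: one flat buffer each for K and V, indexed by (layer, position), reserved once for the maximum context so the decode loop never allocates and attention can stream a layer's cache sequentially. The sizes are toy values, and this is an illustration of the idea, not eLLM's actual data layout.

```cpp
// Statically pre-allocated, contiguous KV cache indexed by (layer, position).
// Toy sizes; not eLLM's actual layout.
#include <cstddef>
#include <cstdio>
#include <vector>

struct KVCache {
    size_t layers, max_tokens, kv_dim;   // kv_dim = kv_heads * head_dim
    std::vector<float> k, v;             // one contiguous block each

    KVCache(size_t l, size_t t, size_t d)
        : layers(l), max_tokens(t), kv_dim(d),
          k(l * t * d), v(l * t * d) {}  // allocated once, up front

    // Layout: [layer][position][kv_dim] -> positions of one layer are adjacent,
    // so reading positions 0..n-1 of a layer is a single sequential sweep.
    float* k_at(size_t layer, size_t pos) { return &k[(layer * max_tokens + pos) * kv_dim]; }
    float* v_at(size_t layer, size_t pos) { return &v[(layer * max_tokens + pos) * kv_dim]; }
};

int main() {
    KVCache cache(/*layers=*/4, /*max_tokens=*/1024, /*kv_dim=*/256);
    // Write the K vector of token 0, layer 0 (placeholder values).
    float* k0 = cache.k_at(0, 0);
    for (size_t i = 0; i < cache.kv_dim; ++i) k0[i] = 0.0f;
    std::printf("KV cache bytes: %zu\n",
                (cache.k.size() + cache.v.size()) * sizeof(float));
    return 0;
}
```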

Performance

  • Preliminary tests indicate that it runs approximately 30% faster than Llama.cpp.
  • More detailed benchmarking results will be released soon.

Roadmap

  • Qwen (30B, 480B) (coming soon)
  • LLaMA 2 / 3
  • DeepSeek (coming soon)
  • gpt-oss

📄 Paper

If you're interested in the technical details, you can read our paper and cite:

@misc{huangfu2025ellm,
  title        = {eLLM: Achieving Lossless Million-Token LLM Inference on CPUs Faster Than GPUs},
  author       = {Huangfu, Yaguang},
  howpublished = {Preprint, ResearchGate},
  year         = {2025},
  url          = {https://www.researchgate.net/publication/393416965}
}

📜 License

This project is licensed under the Apache 2.0 License.
