
EvoLingua: A Scalable Mixture-of-Experts Language Model Framework

EvoLingua is an advanced, open-source PyTorch implementation of a large-scale Mixture-of-Experts (MoE) language model, inspired by cutting-edge NLP research. It integrates Multi-head Latent Attention (MLA) for efficient attention mechanisms, a custom MoE architecture with auxiliary-loss-free load balancing, and Multi-Token Prediction (MTP) for enhanced generative capabilities. Designed with modularity and scalability in mind, EvoLingua serves as a robust framework for researchers and engineers to explore state-of-the-art language modeling techniques, supporting FP8 precision, distributed training, and customizable configurations. The project aims to democratize access to these designs for academic and industrial exploration.


Features

  • Modular Architecture: Separated components (MLA, MoE, MTP) for easy experimentation and modification.
  • Scalability: Supports distributed training with PyTorch Distributed and DeepSpeed integration.
  • Efficiency: Simulated FP8 precision (full support with NVIDIA Transformer Engine) and low-rank attention compression.
  • Advanced Techniques: Incorporates auxiliary-loss-free load balancing and multi-token prediction.
  • Customizability: Configurable hyperparameters via config.py (an illustrative sketch follows this list).
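
As a rough illustration of that configurability, the sketch below shows what a hyperparameter dataclass in config.py could look like. The field names and default values here are assumptions for illustration, not the file's exact contents.

    # Illustrative sketch of a config.py; actual fields and defaults may differ.
    from dataclasses import dataclass

    @dataclass
    class EvoLinguaConfig:
        vocab_size: int = 32000       # tokenizer vocabulary size (assumed)
        hidden_dim: int = 1024        # model width
        n_layers: int = 12            # number of transformer blocks
        n_heads: int = 16             # attention heads
        kv_latent_dim: int = 128      # MLA latent size for the compressed KV cache
        n_shared_experts: int = 1     # experts applied to every token
        n_routed_experts: int = 16    # experts available to the router
        top_k_experts: int = 2        # routed experts activated per token
        mtp_depth: int = 1            # extra future tokens predicted by MTP
        max_seq_len: int = 4096       # context length for small-scale runs
        dtype: str = "fp16"           # "fp16", or "fp8" with Transformer Engine

    config = EvoLinguaConfig(n_layers=24)   # example override for a larger run

Scaling experiments then become a matter of editing fields such as n_routed_experts or max_seq_len rather than touching model code.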

Technical Overview

EvoLingua is designed to emulate the scale and complexity of modern large language models (LLMs), with an estimated parameter count of up to 1 billion (~10 million active per token) when fully scaled. It leverages:

  • Mixture-of-Experts (MoE): A hybrid of shared and routed experts, with dynamic load balancing to optimize compute efficiency (see the routing sketch after this list).
  • Multi-head Latent Attention (MLA): Compresses key-value caches for reduced memory footprint while maintaining performance.
  • Multi-Token Prediction (MTP): Predicts multiple future tokens in a single pass, improving generation speed and coherence.
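
To make the routing idea concrete, here is a minimal sketch of bias-adjusted top-k routing, the general mechanism behind auxiliary-loss-free load balancing. Class and variable names are illustrative assumptions and do not mirror EvoLingua's internal API.

    # Sketch: top-k expert routing balanced by a per-expert bias instead of an
    # auxiliary loss. The bias steers expert *selection* but not mixing weights.
    import torch
    import torch.nn as nn

    class BiasBalancedRouter(nn.Module):
        def __init__(self, hidden_dim, n_experts, top_k, bias_lr=1e-3):
            super().__init__()
            self.gate = nn.Linear(hidden_dim, n_experts, bias=False)
            self.register_buffer("expert_bias", torch.zeros(n_experts))
            self.top_k = top_k
            self.bias_lr = bias_lr

        def forward(self, x):                        # x: (num_tokens, hidden_dim)
            scores = torch.sigmoid(self.gate(x))     # per-expert affinities
            _, idx = torch.topk(scores + self.expert_bias, self.top_k, dim=-1)
            weights = torch.gather(scores, -1, idx)
            weights = weights / weights.sum(dim=-1, keepdim=True)

            if self.training:
                # Count tokens routed to each expert in this batch.
                load = torch.zeros_like(self.expert_bias)
                load.scatter_add_(0, idx.reshape(-1),
                                  torch.ones(idx.numel(), device=x.device))
                # Nudge overloaded experts down and underloaded ones up.
                self.expert_bias -= self.bias_lr * torch.sign(load - load.mean())
            return idx, weights                       # chosen experts and mix weights

In a full MoE block this router would dispatch tokens to their selected experts and combine the outputs with the returned weights; the sketch stops at routing because that is where the balancing trick lives.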

The model is built to handle long contexts (up to 128,000 tokens) and supports mixed-precision training (FP8/FP16) for high throughput on modern GPUs.
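
The memory savings behind that claim come from caching a small latent per token instead of full per-head keys and values. The sketch below shows the low-rank compression idea in isolation; it is not EvoLingua's actual MLA module (a real implementation typically adds decoupled positional handling and query compression), and the dimensions are placeholders.

    # Sketch: cache one compressed latent per token and expand it to K/V on demand.
    import torch
    import torch.nn as nn

    class CompressedKV(nn.Module):
        def __init__(self, hidden_dim=1024, kv_latent_dim=128, n_heads=16, head_dim=64):
            super().__init__()
            self.down = nn.Linear(hidden_dim, kv_latent_dim, bias=False)        # compress
            self.up_k = nn.Linear(kv_latent_dim, n_heads * head_dim, bias=False)
            self.up_v = nn.Linear(kv_latent_dim, n_heads * head_dim, bias=False)
            self.n_heads, self.head_dim = n_heads, head_dim

        def forward(self, h):                    # h: (batch, seq, hidden_dim)
            latent = self.down(h)                # (batch, seq, kv_latent_dim) -- cached
            b, s, _ = latent.shape
            k = self.up_k(latent).view(b, s, self.n_heads, self.head_dim)
            v = self.up_v(latent).view(b, s, self.n_heads, self.head_dim)
            return latent, k, v

With these placeholder sizes the cache stores 128 values per token instead of 2 * 16 * 64 = 2048, which is the kind of reduction that makes 128,000-token contexts tractable.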

Software Requirements

  • Operating System: Linux (Ubuntu 20.04+ recommended)
  • Python: 3.8 or higher
  • Dependencies (see requirements.txt):
    torch>=2.0.0
    transformer-engine>=0.12.0  # For FP8 precision (optional)
    deepspeed>=0.14.0          # For distributed training (optional)
    numpy>=1.23.0
    
  • CUDA: 11.8+ (for PyTorch GPU support)
  • NCCL: 2.10+ (for multi-GPU communication)
  • Optional Tools: torchrun or mpirun for distributed training; wandb for experiment tracking

Data Requirements

EvoLingua requires a large, diverse text corpus for pre-training and fine-tuning.

Evaluation

  • MMLU: Multi-task language understanding.
  • MATH-500: Mathematical problem-solving.
  • LiveCodeBench: Code generation.
  • Script: Add an evaluate.py with metrics like perplexity and accuracy (a hypothetical skeleton follows).
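
Because evaluate.py is described as a to-do, the following is only a hypothetical skeleton showing how a perplexity metric might be computed; the model and data-loader interfaces are assumptions.

    # Hypothetical evaluate.py skeleton: perplexity from average token cross-entropy.
    import math
    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def perplexity(model, data_loader, device="cuda"):
        model.eval()
        total_loss, total_tokens = 0.0, 0
        for batch in data_loader:                      # batch: (batch, seq) of token ids
            batch = batch.to(device)
            inputs, targets = batch[:, :-1], batch[:, 1:]
            logits = model(inputs)                     # (batch, seq - 1, vocab)
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   targets.reshape(-1), reduction="sum")
            total_loss += loss.item()
            total_tokens += targets.numel()
        return math.exp(total_loss / total_tokens)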

Contributing

We welcome contributions! Please:

  1. Fork the repository.
  2. Create a feature branch (git checkout -b feature-name).
  3. Submit a pull request with detailed changes.

Issues: Report bugs or suggest enhancements via GitHub Issues.

License

This project is licensed under the MIT License. See LICENSE for details.

Acknowledgements

EvoLingua is inspired by advancements in NLP research, including Mixture-of-Experts architectures and multi-token prediction strategies. Thanks to the open-source community for tools like PyTorch, DeepSpeed, and Transformer Engine.

Software Requirements (Expanded)

  • PyTorch: 2.0+ for native FP16 and distributed support.
  • Transformer Engine: Enables FP8 mixed precision, reducing memory use and boosting throughput (see the snippet after this list).
  • DeepSpeed: ZeRO-3 for parameter sharding, pipeline parallelism for layer distribution.
  • CUDA Toolkit: Must match the GPU architecture and the installed PyTorch build.
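
For the Transformer Engine item above, the snippet below shows the general pattern for running a layer in FP8 using Transformer Engine's documented fp8_autocast API. It is not taken from EvoLingua's training loop, and the recipe settings are just one reasonable choice.

    # Enabling FP8 compute with NVIDIA Transformer Engine (optional dependency).
    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common.recipe import DelayedScaling, Format

    layer = te.Linear(1024, 1024, bias=True).cuda()     # TE modules support FP8 matmuls
    recipe = DelayedScaling(fp8_format=Format.HYBRID)   # E4M3 forward, E5M2 backward

    x = torch.randn(8, 1024, device="cuda")
    with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
        y = layer(x)                                     # executed with FP8 scaling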

Running the Model

  • Small-Scale: Use the default config for testing on a single GPU.
  • Large-Scale:
    1. Prepare the dataset (e.g., convert it to PyTorch tensors with torch.save; a helper sketch follows these steps).
    2. Configure ds_config.json for DeepSpeed.
    3. Launch with:
      deepspeed --num_gpus=64 evolingua/train.py --deepspeed ds_config.json
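
For step 1, one possible preparation script is sketched below; the tokenizer object and file paths are placeholders rather than anything shipped with EvoLingua.

    # Possible step-1 helper: tokenize raw text and save it as fixed-length tensors.
    import torch

    def prepare_dataset(text_path, out_path, tokenizer, seq_len=4096):
        with open(text_path, "r", encoding="utf-8") as f:
            ids = tokenizer.encode(f.read())                 # list[int] of token ids
        n = (len(ids) // seq_len) * seq_len                  # drop the ragged tail
        data = torch.tensor(ids[:n], dtype=torch.long).view(-1, seq_len)
        torch.save(data, out_path)                           # e.g. "train.pt"
        return data.shape                                    # (num_sequences, seq_len)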
      
      
