I build scalable LLM training/inference systems across distributed multi-GPU setups (PyTorch DDP/FSDP, DeepSpeed ZeRO, Ray, MosaicML LLM Foundry) and decentralised swarms (Hivemind, Petals), and I write experiments up so others can reproduce them.
- Distributed Multi-GPU LLM Fine-Tuning (monorepo) — PyTorch (DDP, FSDP), DeepSpeed (ZeRO Offload, Pipeline Parallelism), Ray (Train, Tune), MosaicML; W&B/MLflow + HF Hub for fully traceable runs. Minimal FSDP sketch after this list.
- Hivemind fine-tuning (Qwen2-0.5B) — Internet-scale data parallelism over a DHT with fault tolerance; measured validation-loss reductions. Hivemind wiring sketch after this list.
- Petals (LLaMA-2-70B) — Decentralised inference + deep prompt-tuning via swarm model parallelism. Petals client sketch after this list.
- Overview: (meta repo) https://github.com/sparklerz/petals-llama2-70b
- Part 1: https://medium.com/@kannansarat9/part-1-inferencing-llama-2-70b-using-petals-swarm-model-parallelism-over-the-internet-a29de8f8aef3
- Part 2: https://medium.com/@kannansarat9/part-2-prompt-tuning-llama-2-70b-using-petals-model-parallelism-over-the-internet-89cdee667840
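
To give a flavour of the multi-GPU setup, here is a minimal FSDP sketch. The toy model, data, and hyperparameters are placeholders, not the monorepo's actual training code; launch it with `torchrun --nproc_per_node=<num_gpus>`.

```python
# Minimal FSDP sketch: toy model + random data, not the monorepo's training scripts.
# Launch with: torchrun --nproc_per_node=<num_gpus> fsdp_toy.py
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main():
    dist.init_process_group("nccl")                # torchrun sets RANK/WORLD_SIZE/LOCAL_RANK
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Sequential(                         # stand-in for the real LLM
        nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)
    ).cuda()
    model = FSDP(model, device_id=local_rank)      # shard params, grads, optimizer state across ranks
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):                            # toy training loop on random data
        x = torch.randn(8, 1024, device="cuda")
        loss = model(x).pow(2).mean()
        loss.backward()
        optim.step()
        optim.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Unlike DDP, which replicates the full model on every GPU, FSDP shards parameters, gradients, and optimizer state across ranks, which is what makes larger models fit.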
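
The Hivemind run boils down to wrapping a normal optimizer. The sketch below follows the Hivemind quickstart API; the `run_id`, batch sizes, and toy model are illustrative placeholders, not the Qwen2-0.5B experiment's actual settings.

```python
# Rough Hivemind wiring sketch; run_id, batch sizes and the toy model are placeholders.
import hivemind
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024)                      # stand-in for Qwen2-0.5B
base_opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

dht = hivemind.DHT(start=True)                     # first peer; others pass initial_peers=[...]
print("Join with initial_peers =", [str(a) for a in dht.get_visible_maddrs()])

opt = hivemind.Optimizer(
    dht=dht,
    run_id="qwen2_finetune",                       # peers with the same run_id train together
    batch_size_per_step=8,                         # samples this peer contributes per step
    target_batch_size=4096,                        # global batch size before peers average
    optimizer=base_opt,
    use_local_updates=True,                        # apply local steps, average state periodically
    matchmaking_time=3.0,
    averaging_timeout=10.0,
    verbose=True,
)

for _ in range(100):                               # toy loop on random data
    loss = model(torch.randn(8, 1024)).pow(2).mean()
    loss.backward()
    opt.step()
    opt.zero_grad()
```

Peers join and leave freely: the DHT handles discovery, and the averaging rounds tolerate stragglers and dropouts.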
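
The Petals write-ups reduce to a short client script. The sketch below follows the Petals examples (`AutoDistributedModelForCausalLM`, `tuning_mode="deep_ptune"`); it assumes a reachable swarm serving the model and HF access to the Llama 2 weights, and the prompt, prefix length, and hyperparameters are placeholders.

```python
# Petals client sketch: swarm inference + deep prompt tuning. Assumes a swarm serving
# the model is reachable and you have HF access to the (gated) Llama 2 weights.
import torch
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "meta-llama/Llama-2-70b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Inference: embeddings/head run locally, transformer blocks run on swarm peers.
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)
inputs = tokenizer("A recipe for distributed inference:", return_tensors="pt")["input_ids"]
print(tokenizer.decode(model.generate(inputs, max_new_tokens=32)[0]))

# Deep prompt tuning: only the trainable prefix prompts get gradients;
# the 70B weights stay frozen on the swarm.
model = AutoDistributedModelForCausalLM.from_pretrained(
    model_name, tuning_mode="deep_ptune", pre_seq_len=16
)
opt = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-3)
batch = tokenizer("Example training text", return_tensors="pt")["input_ids"]
loss = model(input_ids=batch, labels=batch).loss   # placeholder single-step update
loss.backward()
opt.step()
```
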
Medium: https://medium.com/@kannansarat9
DMs open: https://x.com/saratkannan