
⚡ Python-free Rust inference server — OpenAI-API compatible. GGUF + SafeTensors, hot model swap, auto-discovery, single binary. FREE now, FREE forever.


The 5MB Alternative to Ollama


Shimmy will be free forever. No asterisks. No "free for now." No pivot to paid.

Fast, reliable local AI inference. Shimmy provides OpenAI-compatible endpoints for GGUF models with comprehensive testing and automated quality assurance.

What is Shimmy?

Shimmy is a 5.1MB single-binary local inference server that provides OpenAI API-compatible endpoints for GGUF models. It's designed to be the invisible infrastructure that just works.

Metric                 Shimmy       Ollama
Binary Size            5.1MB 🏆     680MB
Startup Time           <100ms 🏆    5-10s
Memory Overhead        <50MB 🏆     200MB+
OpenAI Compatibility   100% 🏆      Partial
Port Management        Auto 🏆      Manual
Configuration          Zero 🏆      Manual

🎯 Perfect for Developers

  • Privacy: Your code stays on your machine
  • Cost: No per-token pricing, unlimited queries
  • Speed: Local inference = sub-second responses
  • Integration: Works with VSCode, Cursor, Continue.dev out of the box

BONUS: First-class LoRA adapter support - from training to production API in 30 seconds.

Quick Start (30 seconds)

Installation

# Install from crates.io (Linux, macOS, Windows)
cargo install shimmy

# Or download pre-built binary (Windows only)
curl -L -o shimmy.exe https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy.exe

⚠️ Windows Security Notice: Windows Defender may flag the binary as a false positive. This is common with unsigned Rust executables. Recommended: Use cargo install shimmy instead, or add an exclusion for shimmy.exe in Windows Defender.

Get Models

Shimmy auto-discovers models from:

  • Hugging Face cache: ~/.cache/huggingface/hub/
  • Ollama models: ~/.ollama/models/
  • Local directory: ./models/
  • Environment: SHIMMY_BASE_GGUF=path/to/model.gguf (see the sketch below)

# Download models that work out of the box
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-gguf --local-dir ./models/
huggingface-cli download bartowski/Llama-3.2-1B-Instruct-GGUF --local-dir ./models/
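
If auto-discovery doesn't pick up your model, you can point Shimmy at a single GGUF file directly. A minimal sketch using the SHIMMY_BASE_GGUF variable from the list above (the filename is a placeholder; substitute whatever you downloaded):

# Point shimmy at one specific GGUF file (placeholder filename)
export SHIMMY_BASE_GGUF=./models/Phi-3-mini-4k-instruct-q4.gguf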

Start Server

# Auto-allocates port to avoid conflicts
shimmy serve

# Or use manual port
shimmy serve --bind 127.0.0.1:11435

Point your AI tools to the displayed port - VSCode Copilot, Cursor, Continue.dev all work instantly!
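
Before wiring up an editor, you can sanity-check the server from a terminal. A minimal sketch against the /health endpoint documented in the API Reference below (substitute the port shimmy printed at startup):

# Replace 11435 with the port shown at startup
curl http://localhost:11435/health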

📦 Download & Install

Package Managers

  • crates.io: cargo install shimmy (Linux, macOS, Windows)

Direct Downloads

  • GitHub Releases: Latest binaries
  • Docker: docker pull shimmy/shimmy:latest (coming soon)

🍎 macOS Support

Full compatibility confirmed! Shimmy works flawlessly on macOS with Metal GPU acceleration.

# Install dependencies
brew install cmake rust

# Install shimmy
cargo install shimmy

✅ Verified working:

  • Intel and Apple Silicon Macs
  • Metal GPU acceleration (automatic)
  • Xcode 17+ compatibility
  • All LoRA adapter features

Integration Examples

VSCode Copilot

{
  "github.copilot.advanced": {
    "serverUrl": "http://localhost:11435"
  }
}

Continue.dev

{
  "models": [{
    "title": "Local Shimmy",
    "provider": "openai", 
    "model": "your-model-name",
    "apiBase": "http://localhost:11435/v1"
  }]
}

Cursor IDE

Works out of the box - just point to http://localhost:11435/v1

Why Shimmy Will Always Be Free

I built Shimmy because I was tired of 680MB binaries to run a 4GB model.

This is my commitment: Shimmy stays MIT licensed, forever. If you want to support development, sponsor it. If you don't, just build something cool with it.

Shimmy saves you time and money. If it's useful, consider sponsoring for $5/month — less than your Netflix subscription, infinitely more useful.

Performance Comparison

Tool        Binary Size   Startup Time   Memory Usage   OpenAI API
Shimmy      5.1MB         <100ms         50MB           100%
Ollama      680MB         5-10s          200MB+         Partial
llama.cpp   89MB          1-2s           100MB          None

API Reference

Endpoints

  • GET /health - Health check
  • POST /v1/chat/completions - OpenAI-compatible chat (see the example below)
  • GET /v1/models - List available models
  • POST /api/generate - Shimmy native API
  • GET /ws/generate - WebSocket streaming
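
Because the chat endpoint follows the OpenAI schema, any OpenAI client or plain curl can talk to it. A minimal sketch (the model name is a placeholder; use a name from shimmy list):

curl -X POST http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-name",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'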

CLI Commands

shimmy serve                    # Start server (auto port allocation)
shimmy serve --bind 127.0.0.1:8080  # Manual port binding
shimmy list                     # Show available models  
shimmy discover                 # Refresh model discovery
shimmy generate --name X --prompt "Hi"  # Test generation
shimmy probe model-name         # Verify model loads

Technical Architecture

  • Rust + Tokio: Memory-safe, async performance
  • llama.cpp backend: Industry-standard GGUF inference
  • OpenAI API compatibility: Drop-in replacement
  • Dynamic port management: Zero conflicts, auto-allocation
  • Zero-config auto-discovery: Just works™

Community & Support

Sponsors

See our amazing sponsors who make Shimmy possible! 🙏

Sponsorship Tiers:

  • $5/month: Coffee tier - My eternal gratitude + sponsor badge
  • $25/month: Bug prioritizer - Priority support + name in SPONSORS.md
  • $100/month: Corporate backer - Logo on README + monthly office hours
  • $500/month: Infrastructure partner - Direct support + roadmap input

Companies: Need invoicing? Email michaelallenkuykendall@gmail.com

Quality & Reliability

Shimmy maintains high code quality through comprehensive testing:

  • Comprehensive test suite with property-based testing
  • Automated CI/CD pipeline with quality gates
  • Runtime invariant checking for critical operations
  • Cross-platform compatibility testing

See our testing approach for technical details.


License & Philosophy

MIT License - forever and always.

Philosophy: Infrastructure should be invisible. Shimmy is infrastructure.

Testing Philosophy: Reliability through comprehensive validation and property-based testing.


Forever maintainer: Michael A. Kuykendall
Promise: This will never become a paid product
Mission: Making local AI development frictionless

"The best code is code you don't have to think about."
"The best tests are properties you can't break."
