ModelMuxer is an intelligent LLM routing service that optimizes cost and quality by automatically selecting the best provider and model for each request. It uses direct provider connections for optimal performance and reliability, providing advanced features like cost tracking, caching, and intelligent routing.
- Multi-Provider Support: Direct connections to OpenAI, Anthropic, Mistral, Google, Groq, Cohere, Together AI
- Intelligent Routing: Automatic provider/model selection based on request characteristics
- Cost Tracking: Real-time cost monitoring and budget management
- Caching: Response caching for improved performance and cost savings
- Enterprise Features: Multi-tenancy, policy enforcement, and compliance
- Observability: Comprehensive metrics, tracing, and monitoring
# Clone the repository
git clone https://github.com/your-org/modelmuxer.git
cd modelmuxer
# Install dependencies
poetry install
# Set up environment with direct provider API keys
cp .env.example .env
# Edit .env with your direct provider API keys:
# OPENAI_API_KEY=sk-...
# ANTHROPIC_API_KEY=sk-ant-...
# MISTRAL_API_KEY=...
# GOOGLE_API_KEY=...
# Run the server
poetry run python -m app.main
# Start the server
poetry run python -m app.main --mode basic
# Make a request
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-api-key" \
-d '{
"messages": [{"role": "user", "content": "Hello, how are you?"}],
"model": "gpt-3.5-turbo"
}'
Key configuration options in .env
:
# Direct Provider API Keys (Primary)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
MISTRAL_API_KEY=...
GOOGLE_API_KEY=...
GROQ_API_KEY=gsk_...
COHERE_API_KEY=...
TOGETHER_API_KEY=...
# Provider Configuration (Direct connections only)
PROVIDER_ADAPTERS_ENABLED=true
# Intent Classifier (Phase 1)
ROUTER_INTENT_CLASSIFIER_ENABLED=true
INTENT_LOW_CONFIDENCE=0.4
INTENT_MIN_CONF_FOR_DIRECT=0.7
# Test Mode
TEST_MODE=false
# Pricing and Cost Estimation
PRICE_TABLE_PATH=./scripts/data/prices.json
LATENCY_PRIORS_WINDOW_S=1800
ESTIMATOR_DEFAULT_TOKENS_IN=400
ESTIMATOR_DEFAULT_TOKENS_OUT=300
# Budget Thresholds
MAX_ESTIMATED_USD_PER_REQUEST=0.08
- Basic Mode: Direct provider routing with cost tracking
- Enhanced Mode: Advanced features with ML classification and caching
- Production Mode: Full enterprise features with monitoring and advanced routing
ModelMuxer uses direct provider connections exclusively, offering:
- Lower Latency: Direct API calls without proxy overhead
- Better Error Handling: Provider-specific error handling and retry logic
- Enhanced Control: Fine-grained configuration per provider
- Improved Observability: Detailed telemetry and circuit breaker patterns
At least one provider must be configured for ModelMuxer to function:
- OpenAI: Set
OPENAI_API_KEY=sk-...
for GPT models - Anthropic: Set
ANTHROPIC_API_KEY=sk-ant-...
for Claude models - Mistral: Set
MISTRAL_API_KEY=...
for Mistral models - Google: Set
GOOGLE_API_KEY=...
for Gemini models - Groq: Set
GROQ_API_KEY=gsk_...
for Groq models - Together AI: Set
TOGETHER_API_KEY=...
for Together AI models - Cohere: Set
COHERE_API_KEY=...
for Cohere models
If no providers are configured:
- The service will log a warning at startup
- Requests will fail with a 503 error
- Check your API key configuration and ensure keys are valid
POST /v1/chat/completions
Compatible with OpenAI's chat completions API. ModelMuxer will automatically route to the optimal provider.
GET /health
Returns service health status.
GET /metrics/prometheus
Prometheus metrics endpoint.
# Run all tests
poetry run pytest
# Run specific test categories
poetry run pytest tests/test_intent_classifier.py
poetry run pytest tests/test_routing.py
# Format code
poetry run black .
# Lint code
poetry run ruff check .
# Type checking
poetry run mypy .
ModelMuxer includes a comprehensive cost estimation and budget management system that helps control spending and optimize model selection based on cost constraints.
The system uses a centralized price table (scripts/data/prices.json
)
containing current market rates for all supported providers and models. The
price table format is:
{
"provider:model": {
"input_per_1k_usd": 2.5,
"output_per_1k_usd": 10.0
}
}
Prices are in USD per 1k tokens and use the mtoks = tokens/1000 formula. The system automatically loads and validates this price table on startup.
The system maintains latency priors for each model using a ring buffer of recent measurements. This provides p95 and p99 percentile estimates for ETA calculation, helping with both cost and performance optimization.
The budget gate enforces cost constraints before routing decisions:
- Pre-request Estimation: Estimates cost using token heuristics and current prices
- Budget Enforcement: Blocks requests that exceed
MAX_ESTIMATED_USD_PER_REQUEST
- Down-routing: Automatically selects cheaper models when budget allows
- Structured Errors: Returns HTTP 402 with detailed cost information when budget exceeded
Configure budget constraints and estimation parameters:
# Budget threshold (typical values: 0.05 conservative, 0.08 balanced, 0.15 permissive)
MAX_ESTIMATED_USD_PER_REQUEST=0.08
# Latency measurement window (30 minutes default)
LATENCY_PRIORS_WINDOW_S=1800
# Default token estimates when not provided
ESTIMATOR_DEFAULT_TOKENS_IN=400
ESTIMATOR_DEFAULT_TOKENS_OUT=300
ModelMuxer uses a standardized error response format for all API errors. All
error responses include an error
object with consistent structure:
{
"error": {
"message": "Human-readable error description",
"type": "error_category",
"code": "specific_error_code",
"details": {
// Additional error-specific information
}
}
}
{
"error": {
"message": "Budget exceeded: No models within budget limit of $0.08",
"type": "budget_exceeded",
"code": "insufficient_budget",
"details": {
"limit": 0.08,
"estimate": 0.12
}
}
}
{
"error": {
"message": "Invalid API key provided.",
"type": "authentication_error",
"code": "invalid_api_key",
"details": {}
}
}
{
"error": {
"message": "Rate limit exceeded: 100/100 requests per minute",
"type": "rate_limit_exceeded",
"code": "security_rate_limit",
"details": {
"current": 100,
"limit": 100,
"window": "minute"
}
}
}
{
"error": {
"message": "Provider error: OpenAI API returned 429",
"type": "provider_error",
"code": "provider_error",
"details": {}
}
}
{
"error": {
"message": "Provider openai is not available",
"type": "service_unavailable",
"code": "provider_unavailable",
"details": {}
}
}
The new cost estimation system works alongside the existing cost tracking system:
- Pre-request Estimation: New system estimates costs before routing
- Post-request Tracking: Existing system tracks actual costs after completion
- Telemetry Integration: Both systems contribute to monitoring and metrics
- Backward Compatibility: Existing cost tracking continues to work unchanged
When debug mode is enabled (SERVER_DEBUG=true
), the API includes additional
headers for observability:
X-Route-Decision
: Shows the selected provider and model (e.g.,openai:gpt-4o-mini
)X-Route-Estimate-USD
: Shows the estimated cost in USD (e.g.,0.000150
)
Note: These headers are non-contractual and may change without notice. They are intended for debugging and monitoring purposes only.
The Routing Mind intent classifier is the first building block for intelligent routing. It analyzes each request and tags it with a task label and confidence score before routing decisions are made.
- Lightweight Classification: Uses heuristics by default, with optional cheap LLM integration
- Deterministic Results: Test mode ensures reproducible behavior
- Feature Extraction: Extracts lexical and structural signals from prompts
- Telemetry Integration: OpenTelemetry spans and Prometheus metrics
The classifier supports 7 intent labels:
chat_lite
: Simple conversation and basic questionsdeep_reason
: Complex analysis, explanations, and reasoningcode_gen
: Code generation and programming tasksjson_extract
: JSON parsing and structured data extractiontranslation
: Language translation tasksvision
: Image analysis and OCR taskssafety_risk
: Potentially harmful content detection
# Enable/disable the classifier
ROUTER_INTENT_CLASSIFIER_ENABLED=true
# Confidence thresholds
INTENT_LOW_CONFIDENCE=0.4
INTENT_MIN_CONF_FOR_DIRECT=0.7
The classifier runs automatically on each request and attaches intent metadata to the response:
{
"router_metadata": {
"selected_provider": "openai",
"selected_model": "gpt-3.5-turbo",
"routing_reason": "Simple query detected",
"intent_label": "chat_lite",
"intent_confidence": 0.85,
"intent_signals": {
"token_length_est": 45.2,
"has_code_fence": false,
"has_programming_keywords": false,
"signals": {
"code": false,
"translation": false,
"vision": false,
"safety": false
}
}
}
}
The classifier includes comprehensive tests with a dataset of 60 labeled examples:
# Run intent classifier tests
poetry run pytest tests/test_intent_classifier.py -v
The test suite validates:
- Accuracy ≥ 80% on the labeled dataset
- Deterministic results in test mode
- Proper handling of disabled feature flag
- Confidence score validation
- Feature signal extraction
The intent classifier consists of:
- Feature Extraction (
app/core/features.py
): Extracts lexical and structural signals - Intent Classification (
app/core/intent.py
): Heuristic classification with LLM fallback - Integration (
app/main.py
): Wired into request flow with telemetry - Telemetry (
app/telemetry/metrics.py
): Prometheus counter and OpenTelemetry spans
- Cheap LLM integration for improved accuracy
- Dynamic confidence thresholds based on model performance
- Intent-aware routing decisions
- A/B testing framework for intent strategies
ModelMuxer provides comprehensive Prometheus metrics for monitoring routing decisions, cost estimation, and system performance.
modelmuxer_router_cost_estimate_usd_sum
: Total estimated costs by route, model, and budget status- Labels:
route
,model
,within_budget
(true/false) - The
within_budget
label helps analyze budget gating effectiveness
- Labels:
modelmuxer_router_budget_exceeded_total
: Budget exceeded events by route and reasonmodelmuxer_router_decision_latency_ms
: Router decision latency distributionmodelmuxer_provider_latency_seconds
: Provider response latency by provider and model
The within_budget
label in cost estimation metrics provides visibility into:
- How often models exceed budget thresholds
- Which models are most frequently down-routed due to cost
- Budget gating effectiveness across different request types
A pre-configured Grafana dashboard is available in
grafana/dashboard_modelmuxer.json
for visualizing these metrics.
The current latency tracking system (LatencyPriors
) is implemented as an
in-memory ring buffer that resets on application restart. This means:
- Limitation: All latency measurements are lost when the service restarts
- Impact: ETA estimates will fall back to defaults until new measurements are collected
- Workaround: For production deployments, consider implementing a Redis-backed version that persists measurements across restarts
- Future Enhancement: The interface is designed to be easily replaceable with a persistent backend
Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.
This project is licensed under the Business Source License 1.1 - see the LICENSE file for details.
For commercial licensing and enterprise support, contact:
- Email: licensing@modelmuxer.com
For support and questions:
- Create an issue on GitHub
- Check the documentation
- Review troubleshooting guide