Enterprise-grade orchestration platform for managing multiple local AI models with intelligent routing, RAG capabilities, and dynamic data ingestion.
- Overview
- System Architecture
- Features
- Models & Technologies
- Installation
- Usage
- API Reference
- Project Structure
- Performance
- Next Steps
- Contributing
This system addresses enterprise AI infrastructure challenges by providing intelligent model orchestration, retrieval-augmented generation (RAG), and dynamic knowledge base management. It is built for production scalability on a local-first architecture.
Key Capabilities:
- Routes queries to optimal AI models based on content analysis
- Manages concurrent requests with load balancing
- Maintains dynamic knowledge base through web crawling
- Provides REST APIs and web interfaces
- Monitors system health and performance metrics
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Web Crawlers │ │ RAG Pipeline │ │ Model Orchestra │
│ │ │ │ │ │
│ • DuckDuckGo │ │ • ChromaDB │ │ • Query Router │
│ • StackOverflow │────│ • Embeddings │────│ • Load Balancer │
│ • GitHub API │ │ • Retrieval │ │ • Health Monitor│
│ • RSS Feeds │ │ • Generation │ │ • Model Pool │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │ │
└───────────────────────┼───────────────────────┘
│
┌─────────────────────────────────────────────────┐
│ Interface Layer │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌──────────┐ │
│ │ FastAPI │ │ Streamlit │ │ Direct │ │
│ │ REST API │ │ Dashboard │ │ Python │ │
│ └─────────────┘ └─────────────┘ └──────────┘ │
└─────────────────────────────────────────────────┘
- Intelligent Routing: Automatically selects the optimal model based on query analysis (see the routing sketch after this list)
- Load Balancing: Handles concurrent requests with configurable limits (default: 3)
- Health Monitoring: Tracks model availability, response times, and error rates
- Auto-scaling: Models load/unload based on memory constraints and demand
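As a rough illustration of how keyword-based routing can work, here is a minimal sketch; the keyword lists are hypothetical, and the actual router lives in orchestration/core/router/model_router.py:

# Hypothetical keyword-to-category routing; the production router is more involved.
CATEGORY_KEYWORDS = {
    "coding": ["code", "python", "function", "debug"],
    "analysis": ["analyze", "compare", "evaluate"],
    "reasoning": ["prove", "why", "step by step"],
}
CATEGORY_MODELS = {
    "coding": "codellama:13b",
    "analysis": "mixtral:8x7b-instruct-v0.1-q4_0",
    "reasoning": "llama3.1:70b",
    "general": "llama3.1:8b",
}

def route(query: str) -> dict:
    """Match the query against category keywords; fall back to a general model."""
    q = query.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(kw in q for kw in keywords):
            return {"category": category, "selected_model": CATEGORY_MODELS[category]}
    return {"category": "general", "selected_model": CATEGORY_MODELS["general"]}

print(route("Write Python code for sorting"))  # -> codellama:13b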
- Vector Database: ChromaDB with persistent storage and semantic search (see the sketch after this list)
- Document Processing: Automated chunking and metadata extraction
- Context Enhancement: Retrieves relevant documents to augment model responses
- Multi-format Support: Text files, markdown, web content, API data
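At its core, ingestion and retrieval reduce to a few ChromaDB calls. A minimal sketch using the public chromadb API (the path, collection name, and documents are placeholders):

import chromadb

# Persistent client keeps embeddings on disk between runs (path is a placeholder)
client = chromadb.PersistentClient(path="./chroma_data")
collection = client.get_or_create_collection("knowledge_base")

# Chunked documents go in with IDs and source metadata
collection.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "RAG augments prompts with retrieved context.",
        "ChromaDB stores embeddings for semantic search.",
    ],
    metadatas=[{"source": "manual"}, {"source": "manual"}],
)

# The query text is embedded and matched against stored chunks
results = collection.query(query_texts=["What is RAG?"], n_results=2)
print(results["documents"][0])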
- Real-time Data: Pulls latest information from multiple sources
- API Integration: StackOverflow, GitHub, NewsAPI, Alpha Vantage (a crawl sketch follows this list)
- Search Capabilities: DuckDuckGo integration for current web content
- RSS Processing: Finance and tech news feeds
- Scheduled Updates: Configurable crawling intervals
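To illustrate the API-crawl pattern, here is a minimal sketch against GitHub's public search API; the project's own crawler in rag/crawler/api_crawler.py is more elaborate:

import requests

def crawl_github(topic: str, limit: int = 5) -> list:
    """Fetch repository summaries for a topic from GitHub's public search API."""
    resp = requests.get(
        "https://api.github.com/search/repositories",
        params={"q": topic, "sort": "stars", "per_page": limit},
        timeout=10,
    )
    resp.raise_for_status()
    return [
        {"name": item["full_name"], "description": item["description"]}
        for item in resp.json()["items"]
    ]

for doc in crawl_github("python machine learning"):
    print(doc["name"])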
- REST API: Production-ready endpoints with auto-documentation (a minimal endpoint sketch follows this list)
- Web Dashboard: Real-time monitoring and query interface
- Side-by-side Comparison: RAG vs non-RAG response analysis
- Database Explorer: Browse and search stored knowledge
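For a sense of how thin the interface layer can be, here is a minimal FastAPI sketch. The endpoint shape mirrors the API reference below, but the request default and response body are placeholders, not the actual schema:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Model Orchestration API")  # interactive docs served at /docs

class QueryRequest(BaseModel):
    query: str
    priority: str = "balanced"  # hypothetical default

@app.post("/orchestrate")
def orchestrate(req: QueryRequest) -> dict:
    # Placeholder body; the real endpoint delegates to the orchestrator
    return {"query": req.query, "selected_model": "llama3.1:8b"}

# Run with: uvicorn main:app --port 8001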
| Model | Size | Use Case | Category |
|---|---|---|---|
| neural-chat:7b-v3.3-q4_0 | 4.1GB | Fast responses, greetings | General |
| llama3.1:8b | 4.9GB | General-purpose queries | General |
| codellama:13b | 7.4GB | Programming tasks | Coding |
| mixtral:8x7b-instruct-v0.1-q4_0 | 26GB | Complex analysis | Analysis |
| llama3.1:70b | 42GB | Deep reasoning | Reasoning |
- Python 3.13 - Core language
- FastAPI - REST API framework
- Streamlit - Web dashboard
- ChromaDB - Vector database
- Sentence Transformers - Text embeddings
- BeautifulSoup - Web scraping
- Ollama - Local model management
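Since Ollama serves a local HTTP API on port 11434 by default, any model in the pool can be exercised directly. A minimal sketch (the prompt is arbitrary):

import requests

# Non-streaming generation via Ollama's local REST API
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1:8b", "prompt": "Say hello in one sentence.", "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])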
- GPU: RTX 5090 (24GB VRAM) or equivalent
- RAM: 141GB DDR5 (recommended)
- Storage: 100GB+ for models and data
- OS: Ubuntu 22.04+ (tested)
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# Download required models
ollama pull neural-chat:7b-v3.3-q4_0
ollama pull llama3.1:8b
ollama pull codellama:13b
ollama pull mixtral:8x7b-instruct-v0.1-q4_0
ollama pull llama3.1:70b
# Clone repository
git clone https://github.com/yourusername/ai-model-orchestration-system.git
cd ai-model-orchestration-system
# Create virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# For enhanced crawling capabilities
export NEWS_API_KEY="your_newsapi_key"
export ALPHA_VANTAGE_KEY="your_alphavantage_key"
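One way the crawler layer can use these keys is to skip unconfigured sources instead of failing. An illustrative sketch (the function name is hypothetical):

import os

def available_sources() -> list:
    """Return crawl sources, including keyed APIs only when their key is set."""
    sources = ["duckduckgo", "stackoverflow", "github", "rss"]  # no key required
    if os.getenv("NEWS_API_KEY"):
        sources.append("newsapi")
    if os.getenv("ALPHA_VANTAGE_KEY"):
        sources.append("alpha_vantage")
    return sources

print(available_sources())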
# Start model orchestration API
python api/orchestration_api.py
# Start RAG-enhanced API (separate terminal)
python api/rag_api.py
# Launch web dashboard (separate terminal)
streamlit run dashboard/rag_dashboard.py
from orchestration.core.orchestrator import ModelOrchestrator
orchestrator = ModelOrchestrator()
result = orchestrator.process_request_sync("Write Python code for sorting")
# Routes to codellama:13b automatically
from rag.retrieval.rag_orchestrator import RAGOrchestrator
rag = RAGOrchestrator()
result = rag.search_and_generate("What is machine learning?")
# Retrieves relevant docs + generates enhanced response
from rag.crawler.api_crawler import APICrawler
crawler = APICrawler()
result = crawler.comprehensive_crawl([
"artificial intelligence trends",
"python machine learning"
])
Orchestration API:
- POST /orchestrate - Submit a query for intelligent routing
- GET /system/status - Get system health and metrics
- GET /recommendations/{query} - Get routing recommendations

RAG API:
- POST /rag/query - RAG-enhanced query processing
- GET /rag/stats - Knowledge base statistics
curl -X POST "http://localhost:8001/orchestrate" \
-H "Content-Type: application/json" \
-d '{"query": "Explain quantum computing", "priority": "accuracy"}'
ai-model-orchestration-system/
├── orchestration/ # Model orchestration core
│ ├── core/
│ │ ├── pool/ # Model pool management
│ │ ├── router/ # Query routing logic
│ │ ├── balancer/ # Load balancing
│ │ └── orchestrator.py # Main orchestration
│ └── config/ # Configuration
├── rag/ # RAG system components
│ ├── vector_store/ # ChromaDB management
│ ├── ingestion/ # Document processing
│ ├── retrieval/ # RAG orchestration
│ ├── crawler/ # Web crawling
│ └── viewer/ # Data visualization
├── api/ # REST API endpoints
├── dashboard/ # Web interface
├── requirements.txt # Dependencies
└── README.md # Documentation
| Model | Avg Response Time | Concurrent Capacity | VRAM Usage |
|---|---|---|---|
| neural-chat:7b | 0.95s | 8 requests | 4.1GB |
| llama3.1:8b | 8.63s | 3 requests | 4.9GB |
| codellama:13b | 12.02s | 2 requests | 7.4GB |
- Routing Decision: ~0.003s average
- Document Retrieval: ~0.1s for 5 results
- Concurrent Load: Up to 3 simultaneous requests (see the semaphore sketch below)
- Knowledge Base: 100+ documents, growing via crawlers
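A concurrency cap like this is typically enforced with a semaphore around the model call. A minimal sketch (the actual balancer lives in orchestration/core/balancer/):

import asyncio

MAX_CONCURRENT = 3  # matches the default limit noted above
semaphore = asyncio.Semaphore(MAX_CONCURRENT)

async def guarded_call(query: str) -> str:
    # Requests beyond the cap wait here until a slot frees up
    async with semaphore:
        await asyncio.sleep(1.0)  # stand-in for the real model call
        return f"response to: {query}"

async def main():
    answers = await asyncio.gather(*(guarded_call(f"q{i}") for i in range(6)))
    print(len(answers), "responses")

asyncio.run(main())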
- Tool-calling capabilities
- Multi-step reasoning workflows
- Agent coordination framework
- Decision-making algorithms
- Advanced monitoring and alerting
- Multi-user authentication and authorization
- Rate limiting and quotas
- Audit logging
- Model Context Protocol (MCP) support
- Kubernetes deployment configurations
- External API integrations (OpenAI, Anthropic)
- Performance optimization
- Multi-modal support (images, audio)
- Fine-tuning pipeline integration
- A/B testing framework
- Advanced analytics dashboard
source venv/bin/activate
# Verify Python version
python --version
# Install core dependencies if not already done
pip install fastapi uvicorn streamlit requests pydantic beautifulsoup4 lxml
# Install development dependencies
pip install -r requirements-dev.txt
# Run tests
python -m pytest tests/
# Code formatting
black . && flake8 .
# Test model pool
python orchestration/core/pool/model_pool.py
# Test model router
python -c "
import sys
sys.path.append('.')
from orchestration.core.router.model_router import ModelRouter
router = ModelRouter()
decision = router.route_request('Write Python code for sorting')
print(f'Query routed to: {decision[\"selected_model\"]}')
print(f'Category: {decision[\"category\"]}')
"
- Follow existing code structure and patterns
- Include tests for new features
- Update documentation for API changes
- Respect rate limits for external APIs
This system demonstrates enterprise-level AI infrastructure management:
- Production Architecture: Handles concurrent users with intelligent load balancing
- Dynamic Knowledge: Automatically updates from 6+ data sources
- Cost Efficiency: No per-request inference fees, versus thousands of dollars per month for comparable cloud API usage
- Performance: Sub-second responses for simple queries, <15s for complex analysis
- Scalability: Modular design supports horizontal scaling
- Integration Ready: REST APIs enable integration with existing systems
The architecture addresses key enterprise AI challenges: model selection, resource management, knowledge currency, and system reliability.
MIT License - See LICENSE file for details.
Hardware Tested On: RTX 5090 (24GB VRAM), 141GB DDR5 RAM, Ubuntu 22.04
Last Updated: September 2025
Version: 1.0.0
# Set up the environment (see Contributing above for full setup)
cd ~/ai-model-orchestration-system
source venv/bin/activate
# Terminal 1 - Start orchestration API
python api/orchestration_api.py
You should see:
Load balancer started
INFO: Uvicorn running on http://0.0.0.0:8001
# Terminal 2 - Test the API
source venv/bin/activate
# Test system status
curl http://localhost:8001/system/status
# Test orchestration
curl -X POST "http://localhost:8001/orchestrate" \
-H "Content-Type: application/json" \
-d '{"query": "Write Python code for bubble sort"}'
# Terminal 3 - Start Streamlit dashboard (if API is working)
source venv/bin/activate
streamlit run dashboard/orchestration_dashboard.py
# Install RAG dependencies
pip install chromadb sentence-transformers
# Test ChromaDB
python rag/vector_store/chroma_manager.py
# Test RAG orchestration
python rag/retrieval/rag_orchestrator.py
- API Documentation: http://localhost:8001/docs
- System Status: http://localhost:8001/system/status
- Dashboard: http://localhost:8501 (if running Streamlit)
# Check if Ollama models are available
ollama list
# Test a simple orchestration request
python -c "
import sys
sys.path.append('.')
from orchestration.core.orchestrator import ModelOrchestrator
orchestrator = ModelOrchestrator()
result = orchestrator.process_request_sync('Hello, test the system')
print(f'Response from {result[\"model\"]}: {result[\"response\"][:100]}...')
"