AI Model Orchestration System

Enterprise-grade orchestration platform for managing multiple local AI models with intelligent routing, RAG capabilities, and dynamic data ingestion.

Overview

This system addresses enterprise AI infrastructure challenges by providing intelligent model orchestration, retrieval-augmented generation (RAG), and dynamic knowledge base management. Built for production scalability with local-first architecture.

Key Capabilities:

  • Routes queries to optimal AI models based on content analysis
  • Manages concurrent requests with load balancing
  • Maintains dynamic knowledge base through web crawling
  • Provides REST APIs and web interfaces
  • Monitors system health and performance metrics

System Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Web Crawlers  │    │  RAG Pipeline   │    │ Model Orchestra │
│                 │    │                 │    │                 │
│ • DuckDuckGo    │    │ • ChromaDB      │    │ • Query Router  │
│ • StackOverflow │────│ • Embeddings    │────│ • Load Balancer │
│ • GitHub API    │    │ • Retrieval     │    │ • Health Monitor│
│ • RSS Feeds     │    │ • Generation    │    │ • Model Pool    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
         ┌─────────────────────────────────────────────────┐
         │              Interface Layer                    │
         │                                                 │
         │  ┌─────────────┐  ┌─────────────┐  ┌──────────┐ │
         │  │ FastAPI     │  │ Streamlit   │  │ Direct   │ │
         │  │ REST API    │  │ Dashboard   │  │ Python   │ │
         │  └─────────────┘  └─────────────┘  └──────────┘ │
         └─────────────────────────────────────────────────┘

Features

Model Orchestration

  • Intelligent Routing: Automatically selects optimal models based on query analysis (see the sketch after this list)
  • Load Balancing: Handles concurrent requests with configurable limits (default: 3)
  • Health Monitoring: Tracks model availability, response times, and error rates
  • Auto-scaling: Models load/unload based on memory constraints and demand
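
A minimal sketch of the idea behind routing and the concurrency cap, for orientation. The model names and the default limit of 3 come from this README; the keyword lists and function names are illustrative assumptions, not the project's actual API:

# Hedged sketch: keyword-based routing plus a concurrency cap.
# Model names match the Ollama table below; everything else is illustrative.
import asyncio

CATEGORY_MODELS = {
    "coding": "codellama:13b",
    "analysis": "mixtral:8x7b-instruct",
    "reasoning": "llama3.1:70b",
    "general": "llama3.1:8b",
}
CATEGORY_KEYWORDS = {
    "coding": ("code", "python", "function", "bug"),
    "analysis": ("analyze", "compare", "evaluate"),
    "reasoning": ("why", "prove", "step by step"),
}

def route(query: str) -> str:
    """Pick a model by scanning the query for category keywords."""
    q = query.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(k in q for k in keywords):
            return CATEGORY_MODELS[category]
    return CATEGORY_MODELS["general"]

# The concurrency limit (default: 3) can be enforced with a semaphore.
semaphore = asyncio.Semaphore(3)

async def dispatch(query: str) -> str:
    async with semaphore:
        # ... call Ollama with the selected model here ...
        return route(query)

print(route("Write Python code for sorting"))  # -> codellama:13b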

RAG System

  • Vector Database: ChromaDB with persistent storage and semantic search (see the sketch after this list)
  • Document Processing: Automated chunking and metadata extraction
  • Context Enhancement: Retrieves relevant documents to augment model responses
  • Multi-format Support: Text files, markdown, web content, API data
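
To make the pipeline concrete, here is a minimal retrieval sketch using the stack listed under Technical Stack (ChromaDB + Sentence Transformers). The embedding model, collection name, and storage path are assumptions; the project's actual configuration lives under rag/vector_store/:

# Hedged sketch of ingest + semantic search; names and paths are illustrative.
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
client = chromadb.PersistentClient(path="./chroma_db")  # persistent storage
collection = client.get_or_create_collection("knowledge_base")

def add_document(doc_id: str, text: str, source: str) -> None:
    """Store one chunk with its embedding and source metadata."""
    collection.add(
        ids=[doc_id],
        embeddings=[embedder.encode(text).tolist()],
        documents=[text],
        metadatas=[{"source": source}],
    )

def retrieve(query: str, k: int = 5) -> list[str]:
    """Semantic search: return the k most similar stored chunks."""
    result = collection.query(
        query_embeddings=[embedder.encode(query).tolist()],
        n_results=k,
    )
    return result["documents"][0]

Retrieved chunks are then prepended to the model prompt to produce the context-enhanced response.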

Web Crawling

  • Real-time Data: Pulls latest information from multiple sources
  • API Integration: StackOverflow, GitHub, NewsAPI, Alpha Vantage
  • Search Capabilities: DuckDuckGo integration for current web content
  • RSS Processing: Finance and tech news feeds (see the sketch after this list)
  • Scheduled Updates: Configurable crawling intervals
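
As a rough illustration of the RSS path, the sketch below fetches one feed on a fixed interval using requests and BeautifulSoup (both in the stack below). The feed URL, interval, and handoff are placeholders; the real crawler (rag/crawler/) handles multiple sources and scheduling:

# Hedged sketch of a scheduled RSS crawl; URL and interval are placeholders.
import time
import requests
from bs4 import BeautifulSoup

FEED_URL = "https://example.com/tech-news.rss"  # placeholder feed
INTERVAL_SECONDS = 3600  # configurable crawl interval

def fetch_rss_items(url: str) -> list[dict]:
    """Fetch a feed and return title/link pairs for each <item>."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "xml")  # XML parser, requires lxml
    return [
        {"title": item.title.get_text(), "link": item.link.get_text()}
        for item in soup.find_all("item")
    ]

if __name__ == "__main__":
    while True:
        for entry in fetch_rss_items(FEED_URL):
            print(entry["title"], entry["link"])  # hand off to ingestion here
        time.sleep(INTERVAL_SECONDS)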

Interfaces

  • REST API: Production-ready endpoints with auto-generated documentation
  • Web Dashboard: Real-time monitoring and query interface
  • Side-by-side Comparison: RAG vs non-RAG response analysis
  • Database Explorer: Browse and search stored knowledge

Models & Technologies

AI Models (Ollama)

| Model                    | Size  | Use Case                  | Category  |
|--------------------------|-------|---------------------------|-----------|
| neural-chat:7b-v3.3-q4_0 | 4.1GB | Fast responses, greetings | General   |
| llama3.1:8b              | 4.9GB | General purpose queries   | General   |
| codellama:13b            | 7.4GB | Programming tasks         | Coding    |
| mixtral:8x7b-instruct    | 26GB  | Complex analysis          | Analysis  |
| llama3.1:70b             | 42GB  | Deep reasoning            | Reasoning |

Technical Stack

  • Python 3.13 - Core language
  • FastAPI - REST API framework
  • Streamlit - Web dashboard
  • ChromaDB - Vector database
  • Sentence Transformers - Text embeddings
  • BeautifulSoup - Web scraping
  • Ollama - Local model management

Hardware Requirements

  • GPU: RTX 5090 (24GB VRAM) or equivalent
  • RAM: 141GB DDR5 (recommended)
  • Storage: 100GB+ for models and data
  • OS: Ubuntu 22.04+ (tested)

Installation

Prerequisites

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Download required models
ollama pull neural-chat:7b-v3.3-q4_0
ollama pull llama3.1:8b
ollama pull codellama:13b
ollama pull mixtral:8x7b-instruct-v0.1-q4_0
ollama pull llama3.1:70b

Setup

# Clone repository
git clone https://github.com/yourusername/ai-model-orchestration-system.git
cd ai-model-orchestration-system

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Environment Variables (Optional)

# For enhanced crawling capabilities
export NEWS_API_KEY="your_newsapi_key"
export ALPHA_VANTAGE_KEY="your_alphavantage_key"
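
A hedged sketch of how optional keys like these can gate crawl sources, enabling a source only when its key is set. The source names and gating logic are illustrative, not the crawler's actual behavior:

# Illustrative only: enable keyed sources when their env var is present.
import os

enabled_sources = ["duckduckgo", "stackoverflow", "github", "rss"]
if os.getenv("NEWS_API_KEY"):
    enabled_sources.append("newsapi")
if os.getenv("ALPHA_VANTAGE_KEY"):
    enabled_sources.append("alpha_vantage")
print("Crawling sources:", enabled_sources)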

Usage

Start Core Services

# Start model orchestration API
python api/orchestration_api.py

# Start RAG-enhanced API (separate terminal)
python api/rag_api.py

# Launch web dashboard (separate terminal)
streamlit run dashboard/rag_dashboard.py

Basic Usage Examples

Query Routing

from orchestration.core.orchestrator import ModelOrchestrator

orchestrator = ModelOrchestrator()
result = orchestrator.process_request_sync("Write Python code for sorting")
# Routes to codellama:13b automatically

RAG Queries

from rag.retrieval.rag_orchestrator import RAGOrchestrator

rag = RAGOrchestrator()
result = rag.search_and_generate("What is machine learning?")
# Retrieves relevant docs + generates enhanced response

Web Crawling

from rag.crawler.api_crawler import APICrawler

crawler = APICrawler()
result = crawler.comprehensive_crawl([
    "artificial intelligence trends",
    "python machine learning"
])

API Reference

Model Orchestration Endpoints

  • POST /orchestrate - Submit query for intelligent routing
  • GET /system/status - Get system health and metrics
  • GET /recommendations/{query} - Get routing recommendations

RAG Endpoints

  • POST /rag/query - RAG-enhanced query processing (example below)
  • GET /rag/stats - Knowledge base statistics
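
Example RAG Query

The request schema for /rag/query is not documented here; assuming it mirrors the /orchestrate payload, a call might look like this (the port is also an assumption, since rag_api.py runs as a separate service):

# Port 8002 is an assumption; check rag_api.py for the actual port
curl -X POST "http://localhost:8002/rag/query" \
     -H "Content-Type: application/json" \
     -d '{"query": "What is machine learning?"}'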

Example API Call

curl -X POST "http://localhost:8001/orchestrate" \
     -H "Content-Type: application/json" \
     -d '{"query": "Explain quantum computing", "priority": "accuracy"}'

Project Structure

ai-model-orchestration-system/
├── orchestration/                 # Model orchestration core
│   ├── core/
│   │   ├── pool/                 # Model pool management
│   │   ├── router/               # Query routing logic
│   │   ├── balancer/             # Load balancing
│   │   └── orchestrator.py       # Main orchestration
│   └── config/                   # Configuration
├── rag/                          # RAG system components
│   ├── vector_store/             # ChromaDB management
│   ├── ingestion/                # Document processing
│   ├── retrieval/                # RAG orchestration
│   ├── crawler/                  # Web crawling
│   └── viewer/                   # Data visualization
├── api/                          # REST API endpoints
├── dashboard/                    # Web interface
├── requirements.txt              # Dependencies
└── README.md                     # Documentation

Performance

Benchmarks (RTX 5090, 141GB RAM)

| Model          | Avg Response Time | Concurrent Capacity | VRAM Usage |
|----------------|-------------------|---------------------|------------|
| neural-chat:7b | 0.95s             | 8 requests          | 4.1GB      |
| llama3.1:8b    | 8.63s             | 3 requests          | 4.9GB      |
| codellama:13b  | 12.02s            | 2 requests          | 7.4GB      |

System Metrics

  • Routing Decision: ~0.003s average
  • Document Retrieval: ~0.1s for 5 results
  • Concurrent Load: Up to 3 simultaneous requests
  • Knowledge Base: 100+ documents, growing via crawlers

Next Steps

Phase 1: Agentic AI (Current Priority)

  • Tool-calling capabilities
  • Multi-step reasoning workflows
  • Agent coordination framework
  • Decision-making algorithms

Phase 2: Enterprise Features

  • Advanced monitoring and alerting
  • Multi-user authentication and authorization
  • Rate limiting and quotas
  • Audit logging

Phase 3: Scaling & Integration

  • Model Context Protocol (MCP) support
  • Kubernetes deployment configurations
  • External API integrations (OpenAI, Anthropic)
  • Performance optimization

Phase 4: Advanced Capabilities

  • Multi-modal support (images, audio)
  • Fine-tuning pipeline integration
  • A/B testing framework
  • Advanced analytics dashboard

Contributing

Development Setup

source venv/bin/activate

# Verify Python version
python --version

# Install core dependencies if not already done
pip install fastapi uvicorn streamlit requests pydantic beautifulsoup4 lxml

# Install development dependencies
pip install -r requirements-dev.txt

# Run tests
python -m pytest tests/

# Code formatting
black . && flake8 .

Contribution Guidelines

  • Follow existing code structure and patterns
  • Include tests for new features
  • Update documentation for API changes
  • Respect rate limits for external APIs

Achievements Summary

This system demonstrates enterprise-level AI infrastructure management:

  • Production Architecture: Handles concurrent users with intelligent load balancing
  • Dynamic Knowledge: Automatically updates from 6+ data sources
  • Cost Efficiency: $0 operational costs vs $1000s/month for cloud alternatives
  • Performance: Sub-second responses for simple queries, <15s for complex analysis
  • Scalability: Modular design supports horizontal scaling
  • Integration Ready: REST APIs enable integration with existing systems

The architecture addresses key enterprise AI challenges: model selection, resource management, knowledge currency, and system reliability.

License

MIT License - See LICENSE file for details.


Hardware Tested On: RTX 5090 (24GB VRAM), 141GB DDR5 RAM, Ubuntu 22.04
Last Updated: September 2025
Version: 1.0.0

Run Locally

Step 1: Set up a fresh environment

cd ~/ai-model-orchestration-system
source venv/bin/activate

# Verify Python version
python --version

# Install core dependencies if not already done
pip install fastapi uvicorn streamlit requests pydantic beautifulsoup4 lxml

Step 2: Test core orchestration components

# Test model pool first
python orchestration/core/pool/model_pool.py

# Test model router
python -c "
import sys
sys.path.append('.')
from orchestration.core.router.model_router import ModelRouter

router = ModelRouter()
decision = router.route_request('Write Python code for sorting')
print(f'Query routed to: {decision[\"selected_model\"]}')
print(f'Category: {decision[\"category\"]}')
"

Step 3: Start Orchestration API

# Terminal 1 - Start orchestration API
python api/orchestration_api.py

You should see:

Load balancer started
INFO:     Uvicorn running on http://0.0.0.0:8001

Step 4: Test the API (in another terminal)

# Terminal 2 - Test the API
source venv/bin/activate

# Test system status
curl http://localhost:8001/system/status

# Test orchestration
curl -X POST "http://localhost:8001/orchestrate" \
     -H "Content-Type: application/json" \
     -d '{"query": "Write Python code for bubble sort"}'

Step 5: Start the dashboard

# Terminal 3 - Start Streamlit dashboard (if API is working)
source venv/bin/activate
streamlit run dashboard/orchestration_dashboard.py

Step 6: Test RAG System

# Install RAG dependencies
pip install chromadb sentence-transformers

# Test ChromaDB
python rag/vector_store/chroma_manager.py

# Test RAG orchestration
python rag/retrieval/rag_orchestrator.py

Step 7: Access your system

Quick verification commands:
# Check if Ollama models are available
ollama list

# Test a simple orchestration request
python -c "
import sys
sys.path.append('.')
from orchestration.core.orchestrator import ModelOrchestrator

orchestrator = ModelOrchestrator()
result = orchestrator.process_request_sync('Hello, test the system')
print(f'Response from {result[\"model\"]}: {result[\"response\"][:100]}...')
"
