DevAakash17/mlops-backend

MLOps Backend Service

A comprehensive backend service for managing user authentication, organization membership, cluster resource allocation, and priority-based preemptive deployment scheduling.

Features

  • User Authentication: JWT-based authentication with bcrypt password hashing
  • Organization Management: Invite code-based organization membership
  • Cluster Management: Create and manage clusters with AWS-style resource units
  • Deployment Management: Docker-based deployment management with resource allocation
  • Priority-based Scheduling: HIGH/MEDIUM/LOW priority with preemptive scheduling
  • Resource Optimization: Bin-packing to keep utilization high and preemption minimal
  • Queue Management: Redis-based persistent deployment queues

Technology Stack

  • Framework: FastAPI with async support
  • Database: PostgreSQL with SQLAlchemy ORM
  • Queue: Redis for deployment scheduling
  • Authentication: JWT tokens with bcrypt
  • Testing: pytest with async support

Installation

Prerequisites

  • Python 3.8+
  • PostgreSQL
  • Redis Server

Setup

  1. Clone the repository:

    git clone <repository-url>
    cd mlops_backend
  2. Install dependencies:

    pip install -r requirements.txt
  3. Setup PostgreSQL:

    # Create database
    createdb mlops_db
    
    # Update the connection string in app/config.py, or set the environment variable
    export DATABASE_URL="postgresql://username:password@localhost/mlops_db"
    
    # Or, without credentials:
    export DATABASE_URL="postgresql://localhost/mlops_db"
  4. Setup Redis:

    # Start Redis server
    redis-server
    
    # Or using Docker
    docker run -d -p 6379:6379 redis:alpine
  5. Configure environment variables (optional): Create a .env file:

    DATABASE_URL=postgresql://localhost/mlops_db
    REDIS_URL=redis://localhost:6379/0
    SECRET_KEY=your-secret-key-here
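The contents of app/config.py are not shown here; a minimal stdlib-only sketch of how these variables might be read (the names match the .env file above, the fallback defaults are assumptions):

```python
import os

# Read connection settings from the environment, falling back to local
# development defaults. Variable names match the .env file above; the
# default values are illustrative assumptions, not the project's actual
# configuration.
DATABASE_URL = os.getenv("DATABASE_URL", "postgresql://localhost/mlops_db")
REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379/0")
SECRET_KEY = os.getenv("SECRET_KEY", "change-me-in-production")
```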
    

Running the Application

Development Mode

# Run with auto-reload
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

Production Mode

# Run with gunicorn
gunicorn app.main:app -w 4 -k uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000

The API will be available at http://localhost:8000

API Documentation

Once the server is running, visit:

  • Interactive API docs: http://localhost:8000/docs
  • ReDoc documentation: http://localhost:8000/redoc

API Endpoints

Authentication

  • POST /auth/register - Register new user
  • POST /auth/login - Login user
  • POST /auth/join-organization - Join organization with invite code
  • GET /auth/me - Get current user info

Organizations

  • POST /organizations/ - Create organization
  • GET /organizations/me - Get user's organization

Clusters

  • POST /clusters/ - Create cluster
  • GET /clusters/ - List clusters
  • GET /clusters/{id} - Get cluster details
  • PUT /clusters/{id} - Update cluster
  • DELETE /clusters/{id} - Delete cluster
  • GET /clusters/{id}/resources - Get resource usage

Deployments

  • POST /deployments/ - Create deployment
  • GET /deployments/ - List deployments
  • GET /deployments/{id} - Get deployment details
  • PUT /deployments/{id} - Update deployment priority
  • DELETE /deployments/{id} - Cancel deployment
  • POST /deployments/{id}/start - Start deployment (simulation)
  • POST /deployments/{id}/complete - Complete deployment (simulation)
  • GET /deployments/queue/{cluster_id} - Get deployment queue
  • POST /deployments/queue/{cluster_id}/process - Process queue
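The service keeps deployment queues in Redis; the ordering it implies (priority first, FIFO within a priority level) can be illustrated with a stdlib priority queue. The tuple layout and function names here are assumptions:

```python
import heapq
import itertools

# Priority values as the service defines them: HIGH=1, MEDIUM=2, LOW=3.
# heapq pops the smallest tuple, so (priority, arrival_seq) yields HIGH
# before MEDIUM before LOW, and FIFO within a priority level.
_arrival = itertools.count()

def enqueue(queue, deployment_id, priority):
    heapq.heappush(queue, (priority, next(_arrival), deployment_id))

def dequeue(queue):
    _priority, _seq, deployment_id = heapq.heappop(queue)
    return deployment_id

queue = []
enqueue(queue, "batch-job", 3)     # LOW
enqueue(queue, "web-app", 1)       # HIGH
enqueue(queue, "etl-pipeline", 2)  # MEDIUM
```

In the real service this ordering lives in Redis so queued deployments survive restarts; the in-memory heap above only demonstrates the dequeue order.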

Resource Units (AWS-style)

  • RAM: Gigabytes (GB) - e.g., 1, 2, 4, 8, 16, 32
  • CPU: vCPUs - e.g., 1, 2, 4, 8, 16
  • GPU: Count - e.g., 0, 1, 2, 4, 8
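A deployment's resource request (the same fields that appear in the curl examples below) can be modeled as a small value type; the class and method names are assumptions based on the API payloads:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResourceRequest:
    """AWS-style resource units: RAM in GB, CPU in vCPUs, GPUs as a count."""
    ram_gb: float
    cpu_vcpus: float
    gpu_count: int

    def fits_within(self, free: "ResourceRequest") -> bool:
        # A request fits only if every dimension fits.
        return (self.ram_gb <= free.ram_gb
                and self.cpu_vcpus <= free.cpu_vcpus
                and self.gpu_count <= free.gpu_count)

free = ResourceRequest(ram_gb=64.0, cpu_vcpus=16.0, gpu_count=4)
request = ResourceRequest(ram_gb=4.0, cpu_vcpus=2.0, gpu_count=0)
```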

Priority Levels

  • HIGH (1): Highest priority; can preempt MEDIUM and LOW deployments
  • MEDIUM (2): Standard priority
  • LOW (3): Lowest priority; can be preempted
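These levels map naturally onto an IntEnum, where a lower number means higher priority. The enum name is an assumption, and note that the README only states that HIGH preempts MEDIUM/LOW; the "strictly higher priority may preempt" rule below is a generalization of that:

```python
from enum import IntEnum

class Priority(IntEnum):
    # Lower value = higher priority, matching the numbering above.
    HIGH = 1
    MEDIUM = 2
    LOW = 3

def can_preempt(incoming: Priority, running: Priority) -> bool:
    """Assumed rule: a deployment may preempt one of strictly lower priority."""
    return incoming < running
```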

Scheduling Algorithm

The service implements a Priority-based Preemptive Scheduler:

  1. Immediate Scheduling: If resources are available, deploy immediately
  2. Preemption: HIGH priority deployments can preempt MEDIUM/LOW priority ones
  3. Queueing: Deployments wait in Redis queue when resources unavailable
  4. Resource Optimization: A bin-packing approach selects a minimal set of deployments to preempt
  5. Automatic Requeuing: Preempted deployments are automatically requeued
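Step 4 can be sketched as a greedy bin-packing-style pass: reclaim just enough capacity by evicting lowest-priority deployments first. This simplified single-dimension (RAM-only) version, with all names assumed, is not the project's actual algorithm:

```python
def select_preemption_set(running, needed_ram_gb, free_ram_gb):
    """Pick a small set of running deployments to preempt so that free
    RAM plus reclaimed RAM covers the incoming request.

    `running` is a list of (deployment_id, priority, ram_gb) tuples,
    where a larger priority number means lower priority (LOW=3).
    Returns the ids to preempt, or None if preemption cannot help.
    """
    shortfall = needed_ram_gb - free_ram_gb
    if shortfall <= 0:
        return []  # enough free capacity already, nothing to preempt

    # Evict lowest-priority deployments first; within a priority level,
    # evict larger deployments first to keep the victim set small.
    victims = []
    for dep_id, priority, ram_gb in sorted(
            running, key=lambda d: (-d[1], -d[2])):
        victims.append(dep_id)
        shortfall -= ram_gb
        if shortfall <= 0:
            return victims
    return None  # even preempting everything would not fit
```

A production scheduler would also check CPU and GPU dimensions and skip victims whose priority is not lower than the incoming deployment's.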

Testing

Run the test suite:

# Run all tests
pytest

# Run specific test file
pytest tests/test_auth.py

# Run with coverage
pytest --cov=app tests/

Usage Examples

1. Register and Setup Organization

# Register user
curl -X POST "http://localhost:8000/auth/register" \
  -H "Content-Type: application/json" \
  -d '{
    "username": "admin",
    "email": "admin@company.com",
    "password": "securepassword"
  }'

# Login
curl -X POST "http://localhost:8000/auth/login" \
  -H "Content-Type: application/json" \
  -d '{
    "username": "admin",
    "password": "securepassword"
  }'

# Create organization
curl -X POST "http://localhost:8000/organizations/" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "My Company"
  }'

2. Create Cluster

curl -X POST "http://localhost:8000/clusters/" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Production Cluster",
    "total_ram_gb": 64.0,
    "total_cpu_vcpus": 16.0,
    "total_gpu_count": 4
  }'

3. Create Deployment

curl -X POST "http://localhost:8000/deployments/" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Web App Deployment",
    "cluster_id": 1,
    "docker_image": "nginx:latest",
    "required_ram_gb": 4.0,
    "required_cpu_vcpus": 2.0,
    "required_gpu_count": 0,
    "priority": "HIGH"
  }'

Health Check

curl http://localhost:8000/health

Architecture

The service follows a layered architecture:

  • API Layer: FastAPI routers handling HTTP requests
  • Service Layer: Business logic and orchestration
  • Model Layer: SQLAlchemy ORM models
  • Database Layer: PostgreSQL for persistent data
  • Queue Layer: Redis for deployment scheduling
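The layering can be illustrated with a small sketch: a router delegates to a service, which talks to storage, here faked with an in-memory dict. All names are assumptions, not the project's actual modules:

```python
# Storage-layer stand-in (the real service uses PostgreSQL via SQLAlchemy).
class InMemoryClusterRepo:
    def __init__(self):
        self._clusters = {}
        self._next_id = 1

    def add(self, data):
        cluster = {"id": self._next_id, **data}
        self._clusters[self._next_id] = cluster
        self._next_id += 1
        return cluster

# Service layer: business logic only, no HTTP concerns.
class ClusterService:
    def __init__(self, repo):
        self.repo = repo

    def create_cluster(self, name, total_ram_gb):
        if total_ram_gb <= 0:
            raise ValueError("cluster must have positive RAM")
        return self.repo.add({"name": name, "total_ram_gb": total_ram_gb})

# API-layer stand-in: in the real app a FastAPI router would call the
# service and serialize the result to JSON.
service = ClusterService(InMemoryClusterRepo())
cluster = service.create_cluster("Production Cluster", 64.0)
```

Keeping validation in the service layer is what lets the same logic be exercised directly by pytest without spinning up HTTP.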

Future Enhancements

The codebase is designed to be extensible for:

  • RBAC (Role-Based Access Control): User roles are already in the model
  • Multi-cloud Support: Abstract resource providers
  • Advanced Scheduling: Machine learning-based resource prediction
  • Monitoring: Integration with Prometheus/Grafana
  • Audit Logging: Track all operations for compliance

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make changes with tests
  4. Submit a pull request

License

[Add your license here]
