A backend service for managing user authentication, organization membership, cluster resource allocation, and priority-based preemptive deployment scheduling.
Features:
- User Authentication: JWT-based authentication with bcrypt password hashing
- Organization Management: Invite code-based organization membership
- Cluster Management: Create and manage clusters with AWS-style resource units
- Deployment Management: Docker-based deployment management with resource allocation
- Priority-based Scheduling: HIGH/MEDIUM/LOW priority with preemptive scheduling
- Resource Optimization: Efficient resource utilization and bin-packing algorithms
- Queue Management: Redis-based persistent deployment queues
Tech stack:
- Framework: FastAPI with async support
- Database: PostgreSQL with SQLAlchemy ORM
- Queue: Redis for deployment scheduling
- Authentication: JWT tokens with bcrypt
- Testing: pytest with async support
Prerequisites:
- Python 3.8+
- PostgreSQL
- Redis Server
- Clone the repository:
  git clone <repository-url>
  cd mlops_backend
- Install dependencies:
  pip install -r requirements.txt
- Set up PostgreSQL:
  # Create the database
  createdb mlops_db
  # Update the connection string in app/config.py, or set an environment variable:
  export DATABASE_URL="postgresql://username:password@localhost/mlops_db"
  # or
  export DATABASE_URL="postgresql://localhost/mlops_db"
- Set up Redis:
  # Start Redis server
  redis-server
  # Or using Docker
  docker run -d -p 6379:6379 redis:alpine
- Configure environment variables (optional): Create a .env file:
  DATABASE_URL=postgresql://localhost/mlops_db
  REDIS_URL=redis://localhost:6379/0
  SECRET_KEY=your-secret-key-here
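For illustration only, here is a minimal sketch of how app/config.py might read these values; the actual module may differ, and the defaults below simply mirror the values documented above.

```python
# Illustrative sketch of app/config.py; not the actual module.
import os

# Fall back to the documented defaults when the environment variables are unset.
DATABASE_URL = os.getenv("DATABASE_URL", "postgresql://localhost/mlops_db")
REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379/0")
SECRET_KEY = os.getenv("SECRET_KEY", "your-secret-key-here")
```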
Run the server:
# Run with auto-reload
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
# Run with gunicorn
gunicorn app.main:app -w 4 -k uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000
The API will be available at http://localhost:8000
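The server can also be started programmatically; the run.py launcher below is only an illustrative convenience, not a file in the repository.

```python
# run.py (illustrative): programmatic equivalent of the uvicorn command above.
import uvicorn

if __name__ == "__main__":
    uvicorn.run("app.main:app", host="0.0.0.0", port=8000, reload=True)
```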
Once the server is running, visit:
- Interactive API docs: http://localhost:8000/docs
- ReDoc documentation: http://localhost:8000/redoc
Authentication:
- `POST /auth/register` - Register new user
- `POST /auth/login` - Login user
- `POST /auth/join-organization` - Join organization with invite code
- `GET /auth/me` - Get current user info

Organizations:
- `POST /organizations/` - Create organization
- `GET /organizations/me` - Get user's organization

Clusters:
- `POST /clusters/` - Create cluster
- `GET /clusters/` - List clusters
- `GET /clusters/{id}` - Get cluster details
- `PUT /clusters/{id}` - Update cluster
- `DELETE /clusters/{id}` - Delete cluster
- `GET /clusters/{id}/resources` - Get resource usage

Deployments:
- `POST /deployments/` - Create deployment
- `GET /deployments/` - List deployments
- `GET /deployments/{id}` - Get deployment details
- `PUT /deployments/{id}` - Update deployment priority
- `DELETE /deployments/{id}` - Cancel deployment
- `POST /deployments/{id}/start` - Start deployment (simulation)
- `POST /deployments/{id}/complete` - Complete deployment (simulation)
- `GET /deployments/queue/{cluster_id}` - Get deployment queue
- `POST /deployments/queue/{cluster_id}/process` - Process queue
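As a quick illustration of the authentication flow, the Python snippet below registers a user, logs in, and calls a protected endpoint. It assumes the login response contains an access_token field; check the interactive docs for the actual response schema.

```python
# Illustrative client-side auth flow using the requests library.
import requests

BASE = "http://localhost:8000"

# Register a new user
requests.post(f"{BASE}/auth/register", json={
    "username": "admin",
    "email": "admin@company.com",
    "password": "securepassword",
})

# Log in and capture the JWT
login = requests.post(f"{BASE}/auth/login", json={
    "username": "admin",
    "password": "securepassword",
})
token = login.json()["access_token"]  # field name is an assumption; see /docs

# Call a protected endpoint with the Bearer token
me = requests.get(f"{BASE}/auth/me", headers={"Authorization": f"Bearer {token}"})
print(me.json())
```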
Clusters and deployments are sized in AWS-style resource units:
- RAM: Gigabytes (GB) - e.g., 1, 2, 4, 8, 16, 32
- CPU: vCPUs - e.g., 1, 2, 4, 8, 16
- GPU: Count - e.g., 0, 1, 2, 4, 8
Deployments have three priority levels:
- HIGH (1): Highest priority, can preempt lower priority deployments
- MEDIUM (2): Standard priority
- LOW (3): Lowest priority, can be preempted
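These units and priority names appear directly in the API payloads; for example, a create-deployment body looks like the following (it mirrors the curl example further below).

```python
# Example create-deployment payload using the units and priority names above.
deployment_request = {
    "name": "Web App Deployment",
    "cluster_id": 1,
    "docker_image": "nginx:latest",
    "required_ram_gb": 4.0,      # RAM in GB
    "required_cpu_vcpus": 2.0,   # vCPUs
    "required_gpu_count": 0,     # GPU count
    "priority": "HIGH",          # HIGH / MEDIUM / LOW
}
```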
The service implements a Priority-based Preemptive Scheduler:
- Immediate Scheduling: If resources are available, deploy immediately
- Preemption: HIGH priority deployments can preempt MEDIUM/LOW priority ones
- Queueing: Deployments wait in Redis queue when resources unavailable
- Resource Optimization: Minimal preemption set using bin-packing approach
- Automatic Requeuing: Preempted deployments are automatically requeued
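To make the flow concrete, here is a self-contained, in-memory sketch of the scheduling decision. It is illustrative only: the plain dataclasses and the greedy stand-in for the bin-packing step are assumptions, not the service's actual scheduler, which works against SQLAlchemy models and the Redis queue.

```python
# In-memory sketch of the priority-based preemptive scheduling decision.
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List


class Priority(IntEnum):
    HIGH = 1
    MEDIUM = 2
    LOW = 3


@dataclass
class Deployment:
    name: str
    ram_gb: float
    cpu_vcpus: float
    gpu_count: int
    priority: Priority


@dataclass
class Cluster:
    total_ram_gb: float
    total_cpu_vcpus: float
    total_gpu_count: int
    running: List[Deployment] = field(default_factory=list)

    def free(self):
        return (
            self.total_ram_gb - sum(d.ram_gb for d in self.running),
            self.total_cpu_vcpus - sum(d.cpu_vcpus for d in self.running),
            self.total_gpu_count - sum(d.gpu_count for d in self.running),
        )


def fits(d: Deployment, free) -> bool:
    ram, cpu, gpu = free
    return d.ram_gb <= ram and d.cpu_vcpus <= cpu and d.gpu_count <= gpu


def schedule(d: Deployment, cluster: Cluster, queue: List[Deployment]) -> str:
    # 1. Immediate scheduling when enough resources are free.
    if fits(d, cluster.free()):
        cluster.running.append(d)
        return "RUNNING"

    # 2. Preemption: HIGH priority may evict MEDIUM/LOW deployments.
    if d.priority == Priority.HIGH:
        evicted: List[Deployment] = []
        # Greedy stand-in for the minimal preemption set: evict the
        # lowest-priority, largest deployments first so few are disturbed.
        candidates = sorted(
            (r for r in cluster.running if r.priority > d.priority),
            key=lambda r: (-r.priority, -(r.ram_gb + r.cpu_vcpus + r.gpu_count)),
        )
        for victim in candidates:
            evicted.append(victim)
            cluster.running.remove(victim)
            if fits(d, cluster.free()):
                cluster.running.append(d)
                queue.extend(evicted)  # preempted deployments are requeued
                return "RUNNING"
        cluster.running.extend(evicted)  # not enough capacity: undo evictions

    # 3. Queueing: wait until resources become available.
    queue.append(d)
    return "QUEUED"
```

Real preemption also stops the running container and persists state changes; the sketch only captures the decision logic.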
Run the test suite:
# Run all tests
pytest
# Run specific test file
pytest tests/test_auth.py
# Run with coverage
pytest --cov=app tests/
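As an example of what a test could look like, the hypothetical snippet below exercises the health endpoint with FastAPI's TestClient; the actual tests under tests/ may be structured differently.

```python
# tests/test_health.py (hypothetical example, not an existing test file)
from fastapi.testclient import TestClient

from app.main import app

client = TestClient(app)


def test_health_check():
    # The /health endpoint is shown in the usage examples below;
    # assume it returns HTTP 200 when the service is up.
    response = client.get("/health")
    assert response.status_code == 200
```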
Example usage with curl:
# Register user
curl -X POST "http://localhost:8000/auth/register" \
-H "Content-Type: application/json" \
-d '{
"username": "admin",
"email": "admin@company.com",
"password": "securepassword"
}'
# Login
curl -X POST "http://localhost:8000/auth/login" \
-H "Content-Type: application/json" \
-d '{
"username": "admin",
"password": "securepassword"
}'
# Create organization
curl -X POST "http://localhost:8000/organizations/" \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "My Company"
}'
curl -X POST "http://localhost:8000/clusters/" \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "Production Cluster",
"total_ram_gb": 64.0,
"total_cpu_vcpus": 16.0,
"total_gpu_count": 4
}'
curl -X POST "http://localhost:8000/deployments/" \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "Web App Deployment",
"cluster_id": 1,
"docker_image": "nginx:latest",
"required_ram_gb": 4.0,
"required_cpu_vcpus": 2.0,
"required_gpu_count": 0,
"priority": "HIGH"
}'
# Health check
curl http://localhost:8000/health
The service follows a layered architecture:
- API Layer: FastAPI routers handling HTTP requests
- Service Layer: Business logic and orchestration
- Model Layer: SQLAlchemy ORM models
- Database Layer: PostgreSQL for persistent data
- Queue Layer: Redis for deployment scheduling
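As a rough illustration of how a request moves through these layers, the sketch below shows the general shape; the class, function, and route names are hypothetical and do not correspond to the actual modules.

```python
# Hypothetical layered-request sketch; names are illustrative only.
from fastapi import APIRouter, Depends

router = APIRouter()  # API layer: turns HTTP requests into service calls


class ClusterRepository:
    """Model/database layer stand-in: would normally query SQLAlchemy models."""

    def totals(self, cluster_id: int) -> dict:
        return {"total_ram_gb": 64.0, "used_ram_gb": 12.0}  # placeholder data


class ClusterService:
    """Service layer: business logic and orchestration."""

    def __init__(self, repo: ClusterRepository):
        self.repo = repo

    def resource_usage(self, cluster_id: int) -> dict:
        usage = self.repo.totals(cluster_id)
        usage["free_ram_gb"] = usage["total_ram_gb"] - usage["used_ram_gb"]
        return usage


def get_service() -> ClusterService:
    return ClusterService(ClusterRepository())


@router.get("/clusters/{cluster_id}/resources")
def get_resources(cluster_id: int, service: ClusterService = Depends(get_service)):
    # The API layer delegates to the service layer, which talks to the model layer.
    return service.resource_usage(cluster_id)
```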
The codebase is designed to be extensible for:
- RBAC (Role-Based Access Control): User roles are already in the model
- Multi-cloud Support: Abstract resource providers
- Advanced Scheduling: Machine learning-based resource prediction
- Monitoring: Integration with Prometheus/Grafana
- Audit Logging: Track all operations for compliance
To contribute:
- Fork the repository
- Create a feature branch
- Make changes with tests
- Submit a pull request
[Add your license here]