Complete infrastructure setup for Kubernetes-based ML workloads with automated local cluster provisioning and operator management
A production-ready, infrastructure-first solution that gets you from zero to a fully configured K8s ML environment in 15 minutes.
- Automated Cluster Setup: Kind clusters optimized for ML workloads on laptops/workstations
- Operator Management: Automated Kubeflow Training Operator installation and configuration
- Infrastructure as Code: Declarative cluster configurations and resource management
- Multi-Backend Support: Flexible infrastructure supporting various ML frameworks
- Resource Optimization: Memory and CPU configurations tuned for local development
- Fault-Tolerant Setup: Robust infrastructure with multiple fallback mechanisms
- Ready-to-Use Examples: Pre-configured distributed PyTorch training as a proof of concept
k8s-ml-lab/
├── bin/                                # Executable scripts
│   └── setup.sh                        # Infrastructure automation script
├── configs/                            # Infrastructure configurations
│   ├── kind-cluster-config.yaml        # Kind cluster specification
│   ├── pytorch-distributed-job.yaml    # Sample workload configuration
│   └── pytorch-test-job.yaml           # Test workload configuration
├── scripts/                            # Sample ML workloads
│   ├── distributed_mnist_training.py   # Distributed training example
│   ├── simple_single_pod_training.py   # Single-pod training example
│   └── test_mnist_model.py             # Model inference example
├── input/                              # Input datasets (auto-populated)
├── output/                             # Training outputs (auto-created)
├── examples/                           # Infrastructure examples and guides
│   ├── README.md                       # Comprehensive documentation
│   ├── 01-complete-workflow/           # Complete infrastructure + workload demo
│   ├── 02-existing-cluster/            # Existing cluster integration
│   ├── 03-custom-dataset/              # Custom workload configurations
│   ├── 04-gpu-training/                # GPU-enabled cluster setup
│   ├── 05-debugging/                   # Infrastructure debugging guide
│   └── 06-common-issues/               # Infrastructure troubleshooting
├── Makefile                            # Infrastructure automation commands
└── README.md                           # This file
- macOS 11+ or Linux (Ubuntu, Fedora, Debian, etc.)
- 8GB+ RAM (16GB recommended)
- 4+ CPU cores, 10GB free disk space
- Docker or Podman (container runtime)
# Clone repository
git clone https://github.com/<your-username>/k8s-ml-lab.git
cd k8s-ml-lab
# Automated infrastructure setup (recommended)
make setup # Complete setup: cluster + operators + training environment
# Alternative: Configure existing cluster
make use-existing # For EKS, GKE, AKS, minikube, etc.
# Comprehensive system verification (recommended first step)
make verify-system # Checks system requirements + all dependencies
# Check cluster status
make status
# Submit test workload
make submit-job
# View workload logs
make logs
# OR: Run complete end-to-end workflow
make run-e2e-workflow # Runs training + inference + testing automatically
# Test the model produced by the sample distributed training workload
python scripts/test_mnist_model.py
# OR: Use make command for easier testing
make inference # Test with built-in images
TEST_IMAGE=path/to/digit.png make inference # Test single custom image
TEST_IMAGES_DIR=my_digits/ make inference # Test directory of images
SUCCESS: Kind cluster 'ml-training-cluster' created
SUCCESS: Kubeflow Training Operator installed
SUCCESS: gloo backend initialized - Rank 0, World size 2
Rank 0: Using pre-downloaded MNIST dataset (60000 train, 10000 test)
✅ Infrastructure ready for ML workloads!
✅ Sample workload completed successfully!
Generated Infrastructure:
- Kubernetes cluster with ML-optimized configuration
- Kubeflow Training Operator for distributed workloads
- Persistent storage for datasets and models
- Network policies and resource quotas (see the sketch below)
- Sample workload demonstrating capabilities
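To make the quota item concrete, here is an illustrative ResourceQuota of the kind the setup applies. The namespace, quota name, and limits below are placeholders chosen for a small local cluster, not necessarily the values setup.sh installs:

```yaml
# Illustrative only: a ResourceQuota sized for a small local ML cluster.
# Name, namespace, and limits are placeholders, not the values used by setup.sh.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ml-training-quota
  namespace: default
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 12Gi
```

Capping requests and limits this way keeps a runaway training job from starving the machine that hosts the Kind nodes.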
The make run-e2e-workflow command automates the complete end-to-end workflow:
- Training Phase: Submits distributed PyTorch training job
- Monitoring Phase: Tracks job progress and collects logs
- Inference Phase: Tests trained model with sample images
- Results Phase: Generates performance reports and saves outputs
What it does:
- Creates and submits PyTorch distributed training job
- Monitors job completion and downloads training logs
- Extracts trained model from completed pods
- Runs inference tests on sample handwritten digit images
- Generates training metrics and accuracy reports
- Saves all outputs to the output/latest/ directory
Example output:
✅ Training job submitted and completed
✅ Model extracted: output/latest/trained-model.pth
✅ Inference tests passed: 8/10 correct predictions
✅ Training metrics saved: output/latest/training_metadata.txt
The make verify-system command performs a comprehensive check of your system's readiness:
System Requirements Check:
- Memory: minimum 8GB, recommended 16GB
- CPU: minimum 4 cores
- Disk space: minimum 10GB free
- Operating system and architecture detection
Dependencies Verification:
- Container runtime: Docker/Podman installation and status
- Python: version and availability
- kubectl: Kubernetes CLI installation and version
- kind: Kubernetes in Docker installation
- Python packages: PyTorch, torchvision, requests, PyYAML
Infrastructure Status:
- Kubernetes cluster accessibility
- Kubeflow Training Operator installation status
- Overall readiness assessment
Example output:
✅ Memory: 16GB (sufficient)
✅ CPU cores: 8 (sufficient)
✅ Disk space: 45GB free (sufficient)
✅ Docker: installed and running
✅ Python: 3.11.5
✅ kubectl: v1.28.0
✅ kind: v0.20.0
✅ Python dependencies: PyTorch, torchvision, requests, PyYAML installed
✅ Kubernetes cluster: accessible
✅ Kubeflow Training Operator: installed
✅ System verification completed - all dependencies are ready!
Use this command:
- Before starting any setup to identify missing dependencies
- After setup to confirm everything is working
- When troubleshooting issues
- As part of CI/CD pipeline validation (see the example job below)
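As an example of the last point, a minimal CI job can simply run the target and fail the pipeline if anything is missing. The GitHub Actions snippet below is a sketch: the workflow name and trigger are arbitrary, and it assumes the runner image (or earlier steps) provides the container runtime, kubectl, kind, and Python packages that verify-system checks for.

```yaml
# Sketch of a CI job that gates the pipeline on make verify-system.
# Prerequisites are assumed to come from the runner image or earlier setup steps.
name: verify-environment
on: [pull_request]

jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Verify system and dependencies
        run: make verify-system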
Complete Documentation - Detailed infrastructure guides, architecture, troubleshooting, and advanced configurations
- Setup Guide - Detailed installation and configuration
- Architecture - Infrastructure components and design
- Complete Workflow - End-to-end infrastructure + workload demo
- Existing Clusters - Integrate with EKS, GKE, AKS, etc.
- Custom Workloads - Configure your own ML workloads
- GPU Infrastructure - GPU-enabled cluster setup
- Debugging - Infrastructure debugging techniques
- Troubleshooting - Common infrastructure problems and solutions
# Infrastructure Management
make setup # Complete infrastructure setup (cluster + dependencies + training env)
make verify-system # Comprehensive system and dependency verification
make use-existing # Use existing cluster (skip cluster creation)
# Training & Workflows
make submit-job # Submit PyTorch distributed training job
make run-e2e-workflow # Run complete end-to-end workflow (training + inference + results)
make inference # Run model inference on test images (TEST_IMAGE=path or TEST_IMAGES_DIR=path)
make status # Show job status, pods, and recent events
make logs # View logs from master pod (real-time)
make restart # Restart training job (delete + submit)
# Debugging & Monitoring
make debug # Show comprehensive debugging information
# Cleanup
make cleanup # Clean up jobs and resources (keep cluster)
make cleanup-all # Delete entire Kind cluster and all resources
# Aliases (for compatibility)
make check-requirements # Alias for verify-system
make install-operator # Install Kubeflow training operator (standalone)
Scale Infrastructure:
# configs/pytorch-distributed-job.yaml
Worker:
  replicas: 3  # Scale workers from 1 to 3
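The Worker block above sits inside the PyTorchJob's pytorchReplicaSpecs. A trimmed manifest in the Kubeflow Training Operator v1 schema looks roughly like the sketch below; the job name and image are placeholders rather than the exact contents of configs/pytorch-distributed-job.yaml:

```yaml
# Sketch of a PyTorchJob with one master and three workers.
# Metadata and image values are placeholders.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-distributed-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch                        # the operator expects this container name
              image: your-registry/training:tag    # placeholder image
    Worker:
      replicas: 3                                  # scale workers here
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: your-registry/training:tag    # placeholder image
```

After changing the replica count, resubmitting the job (for example with make restart) should pick up the new value.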
Custom Cluster Configuration:
# configs/kind-cluster-config.yaml
nodes:
- role: control-plane
- role: worker
- role: worker # Add more workers
Configure Your Workloads:
# scripts/distributed_mnist_training.py
def load_dataset(rank):
    # Replace with your dataset; return both a training and a test dataset
    train_dataset = YourDataset('/input/your-data')
    test_dataset = YourDataset('/input/your-data')  # point this at a held-out test split
    return train_dataset, test_dataset
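The path /input/your-data above only works if your data is actually mounted into the training pods; how that happens is defined in the job manifest. One common pattern, sketched below with a hypothetical claim name, is a volume plus volumeMount on the training container:

```yaml
# Sketch: mount dataset files at /input inside the training container.
# The claim name "ml-input-data" is hypothetical; adjust to match your setup.
spec:
  containers:
    - name: pytorch
      volumeMounts:
        - name: input-data
          mountPath: /input
  volumes:
    - name: input-data
      persistentVolumeClaim:
        claimName: ml-input-data
```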
# Delete workloads only
make cleanup
# Delete entire infrastructure (Kind cluster)
make cleanup-all