
K8s ML Training Lab - Local Infrastructure Setup

πŸš€ Complete infrastructure setup for Kubernetes-based ML workloads with automated local cluster provisioning and operator management

A reproducible, infrastructure-first setup that takes you from zero to a fully configured K8s ML environment in 15 minutes.

🎯 What This Project Provides

βœ… Automated Cluster Setup: Kind clusters optimized for ML workloads on laptops/workstations
βœ… Operator Management: Automated Kubeflow Training Operator installation and configuration
βœ… Infrastructure as Code: Declarative cluster configurations and resource management
βœ… Multi-Backend Support: Flexible infrastructure supporting various ML frameworks
βœ… Resource Optimization: Memory and CPU configurations tuned for local development
βœ… Fault-Tolerant Setup: Robust infrastructure with multiple fallback mechanisms
βœ… Ready-to-Use Examples: Pre-configured distributed PyTorch training as proof-of-concept

πŸ“ Project Structure

k8s-ml-lab/
β”œβ”€β”€ bin/                                # Executable scripts
β”‚   └── setup.sh                       # Infrastructure automation script
β”œβ”€β”€ configs/                            # Infrastructure configurations
β”‚   β”œβ”€β”€ kind-cluster-config.yaml        # Kind cluster specification
β”‚   β”œβ”€β”€ pytorch-distributed-job.yaml    # Sample workload configuration
β”‚   └── pytorch-test-job.yaml           # Test workload configuration
β”œβ”€β”€ scripts/                            # Sample ML workloads
β”‚   β”œβ”€β”€ distributed_mnist_training.py   # Distributed training example
β”‚   β”œβ”€β”€ simple_single_pod_training.py   # Single-pod training example
β”‚   └── test_mnist_model.py             # Model inference example
β”œβ”€β”€ input/                              # Input datasets (auto-populated)
β”œβ”€β”€ output/                             # Training outputs (auto-created)
β”œβ”€β”€ examples/                           # Infrastructure examples and guides
β”‚   β”œβ”€β”€ README.md                       # πŸ“š Comprehensive documentation
β”‚   β”œβ”€β”€ 01-complete-workflow/           # Complete infrastructure + workload demo
β”‚   β”œβ”€β”€ 02-existing-cluster/            # Existing cluster integration
β”‚   β”œβ”€β”€ 03-custom-dataset/              # Custom workload configurations
β”‚   β”œβ”€β”€ 04-gpu-training/                # GPU-enabled cluster setup
β”‚   β”œβ”€β”€ 05-debugging/                   # Infrastructure debugging guide
β”‚   └── 06-common-issues/               # Infrastructure troubleshooting
β”œβ”€β”€ Makefile                           # Infrastructure automation commands
└── README.md                          # This file

πŸš€ Quick Infrastructure Setup (15 minutes)

Prerequisites

  • macOS 11+ or Linux (Ubuntu, Fedora, Debian, etc.)
  • 8GB+ RAM (16GB recommended)
  • 4+ CPU cores, 10GB free disk space
  • Docker or Podman (container runtime)

1. Infrastructure Setup

# Clone repository
git clone https://github.com/<your-username>/k8s-ml-lab.git
cd k8s-ml-lab

# Automated infrastructure setup (recommended)
make setup                    # Complete setup: cluster + operators + training environment

# Alternative: Configure existing cluster
make use-existing             # For EKS, GKE, AKS, minikube, etc.

2. Verify Infrastructure

# Comprehensive system verification (recommended first step)
make verify-system       # Checks system requirements + all dependencies

# Check cluster status
make status

# Submit test workload
make submit-job

# View workload logs
make logs

# OR: Run complete end-to-end workflow
make run-e2e-workflow    # Runs training + inference + testing automatically

3. Test ML Workload

# Test the sample distributed training workload
python scripts/test_mnist_model.py

# OR: Use make command for easier testing
make inference                                    # Test with built-in images
TEST_IMAGE=path/to/digit.png make inference       # Test single custom image
TEST_IMAGES_DIR=my_digits/ make inference         # Test directory of images

πŸ“Š Expected Infrastructure Results

SUCCESS: Kind cluster 'ml-training-cluster' created
SUCCESS: Kubeflow Training Operator installed
SUCCESS: gloo backend initialized - Rank 0, World size 2
Rank 0: Using pre-downloaded MNIST dataset (60000 train, 10000 test)
βœ… Infrastructure ready for ML workloads!
βœ… Sample workload completed successfully!
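
The "gloo backend initialized" line comes from PyTorch's distributed startup inside each training pod. As a rough sketch of what the sample script does at startup (assuming the standard MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE environment variables that the Kubeflow Training Operator injects into PyTorchJob pods):

# Sketch of the distributed startup performed by the sample workload.
# The Training Operator injects MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE
# into each pod; init_method="env://" tells PyTorch to read them.
import torch.distributed as dist

def init_distributed():
    # gloo is the CPU-friendly backend suited to local Kind clusters;
    # swap in "nccl" for GPU-enabled setups (see examples/04-gpu-training/).
    dist.init_process_group(backend="gloo", init_method="env://")
    rank, world_size = dist.get_rank(), dist.get_world_size()
    print(f"SUCCESS: gloo backend initialized - Rank {rank}, World size {world_size}")
    return rank, world_size

rank, world_size = init_distributed()
# ... build the model, wrap it in DistributedDataParallel, train ...
dist.destroy_process_group()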

Generated Infrastructure:

  • Kubernetes cluster with ML-optimized configuration
  • Kubeflow Training Operator for distributed workloads
  • Persistent storage for datasets and models
  • Network policies and resource quotas
  • Sample workload demonstrating capabilities

πŸ”„ End-to-End Workflow

The make run-e2e-workflow command automates the complete end-to-end workflow:

  1. Training Phase: Submits distributed PyTorch training job
  2. Monitoring Phase: Tracks job progress and collects logs
  3. Inference Phase: Tests trained model with sample images
  4. Results Phase: Generates performance reports and saves outputs

What it does:

  • Creates and submits PyTorch distributed training job
  • Monitors job completion and downloads training logs
  • Extracts trained model from completed pods
  • Runs inference tests on sample handwritten digit images
  • Generates training metrics and accuracy reports
  • Saves all outputs to output/latest/ directory

Example output:

βœ… Training job submitted and completed
βœ… Model extracted: output/latest/trained-model.pth
βœ… Inference tests passed: 8/10 correct predictions
βœ… Training metrics saved: output/latest/training_metadata.txt
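
The inference phase can also be reproduced by hand. A minimal sketch, assuming the checkpoint at output/latest/trained-model.pth is a plain state dict for the model class defined in scripts/distributed_mnist_training.py (the Net import below is an assumption; use whatever the class is actually named in that script):

# Hedged sketch: load the extracted checkpoint and classify one digit image.
import torch
from PIL import Image
from torchvision import transforms
from scripts.distributed_mnist_training import Net  # assumed class name

preprocess = transforms.Compose([
    transforms.Grayscale(),                      # MNIST models expect 1 channel
    transforms.Resize((28, 28)),
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),  # standard MNIST statistics
])

model = Net()
model.load_state_dict(torch.load("output/latest/trained-model.pth", map_location="cpu"))
model.eval()

image = preprocess(Image.open("my_digits/digit.png")).unsqueeze(0)  # add batch dim
with torch.no_grad():
    print("Predicted digit:", model(image).argmax(dim=1).item())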

πŸ” System Verification

The make verify-system command performs a comprehensive check of your system's readiness:

System Requirements Check:

  • Memory: minimum 8GB, recommended 16GB
  • CPU: minimum 4 cores
  • Disk space: minimum 10GB free
  • Operating system and architecture detection

Dependencies Verification:

  • Container runtime: Docker/Podman installation and status
  • Python: version and availability
  • kubectl: Kubernetes CLI installation and version
  • kind: Kubernetes in Docker installation
  • Python packages: PyTorch, torchvision, requests, PyYAML

Infrastructure Status:

  • Kubernetes cluster accessibility
  • Kubeflow Training Operator installation status
  • Overall readiness assessment

Example output:

βœ… Memory: 16GB (sufficient)
βœ… CPU cores: 8 (sufficient)
βœ… Disk space: 45GB free (sufficient)
βœ… Docker: installed and running
βœ… Python: 3.11.5
βœ… kubectl: v1.28.0
βœ… kind: v0.20.0
βœ… Python dependencies: PyTorch, torchvision, requests, PyYAML installed
βœ… Kubernetes cluster: accessible
βœ… Kubeflow Training Operator: installed
βœ… System verification completed - all dependencies are ready!

Use this command:

  • Before starting any setup to identify missing dependencies
  • After setup to confirm everything is working
  • When troubleshooting issues
  • As part of CI/CD pipeline validation

πŸ“š Documentation

πŸ‘‰ Complete Documentation (examples/README.md) - Detailed infrastructure guides, architecture, troubleshooting, and advanced configurations

Quick Links

  β€’ examples/01-complete-workflow/ - Complete infrastructure + workload demo
  β€’ examples/02-existing-cluster/ - Existing cluster integration
  β€’ examples/03-custom-dataset/ - Custom workload configurations
  β€’ examples/04-gpu-training/ - GPU-enabled cluster setup
  β€’ examples/05-debugging/ - Infrastructure debugging guide
  β€’ examples/06-common-issues/ - Infrastructure troubleshooting

πŸ”§ Common Infrastructure Commands

# Infrastructure Management
make setup               # Complete infrastructure setup (cluster + dependencies + training env)
make verify-system       # Comprehensive system and dependency verification
make use-existing        # Use existing cluster (skip cluster creation)

# Training & Workflows
make submit-job          # Submit PyTorch distributed training job
make run-e2e-workflow    # Run complete end-to-end workflow (training + inference + results)
make inference           # Run model inference on test images (TEST_IMAGE=path or TEST_IMAGES_DIR=path)
make status              # Show job status, pods, and recent events
make logs                # View logs from master pod (real-time)
make restart             # Restart training job (delete + submit)

# Debugging & Monitoring
make debug               # Show comprehensive debugging information

# Cleanup
make cleanup             # Clean up jobs and resources (keep cluster)
make cleanup-all         # Delete entire Kind cluster and all resources

# Aliases & standalone helpers
make check-requirements  # Alias for verify-system
make install-operator    # Install the Kubeflow Training Operator (standalone)

🎨 Quick Infrastructure Customization

Scale Infrastructure:

# configs/pytorch-distributed-job.yaml
Worker:
  replicas: 3  # Scale workers from 1 to 3

Custom Cluster Configuration:

# configs/kind-cluster-config.yaml
nodes:
- role: control-plane
- role: worker
- role: worker  # Add more workers

Configure Your Workloads:

# scripts/distributed_mnist_training.py
def load_dataset(rank):
    # Replace with your dataset (return both splits)
    train_dataset = YourDataset('/input/your-data/train')
    test_dataset = YourDataset('/input/your-data/test')
    return train_dataset, test_dataset
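
For reference, whatever replaces YourDataset just needs to implement the torch.utils.data.Dataset interface. A hypothetical minimal example for images laid out as /input/your-data/<label>/<image>.png (the layout and label scheme are assumptions; adapt them to your data):

# Hypothetical minimal Dataset: one subdirectory per class label.
import pathlib
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class YourDataset(Dataset):
    def __init__(self, root):
        self.samples = sorted(pathlib.Path(root).glob("*/*.png"))
        self.transform = transforms.Compose([
            transforms.Grayscale(),
            transforms.Resize((28, 28)),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path = self.samples[idx]
        label = int(path.parent.name)  # directory name doubles as the class label
        return self.transform(Image.open(path)), label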

🧹 Cleanup

# Delete workloads only
make cleanup

# Delete entire infrastructure (Kind cluster)
make cleanup-all
