This project demonstrates distributed deep learning training using Data Parallelism on a two-node Kubernetes cluster. We train a ResNet18 model on the CIFAR-10 dataset with PyTorch, containerized with Docker and orchestrated by Kubernetes.
This repository contains the report, slides, Jupyter notebook, and related code files for the final project of the 'Enhanced Techniques for Big Data Computing' course, offered as part of the M.Sc. Big Data Analytics programme at Ramakrishna Mission Vivekananda Educational and Research Institute, Belur.
- Model: ResNet18 (PyTorch)
- Dataset: CIFAR-10
- Parallelism: Data Parallelism with `DistributedDataParallel` (DDP); a minimal setup sketch follows this list
- Cluster: 2-node Kubernetes cluster (Master + Worker)
- Containerization: Docker
- Orchestration: Kubernetes
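The DDP wrapper referenced above handles gradient synchronization across the two nodes. Below is a minimal sketch of that setup, assuming the rendezvous variables (`MASTER_ADDR`, `MASTER_PORT`, `RANK`, `WORLD_SIZE`) are supplied by the launcher or the Kubernetes manifests:

```python
# Minimal DDP setup sketch (CPU-friendly "gloo" backend; use "nccl" on GPUs).
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torchvision.models import resnet18

dist.init_process_group(backend="gloo")  # reads RANK/WORLD_SIZE from the env
model = resnet18(num_classes=10)         # CIFAR-10 has 10 classes
model = DDP(model)                       # all-reduces gradients on each backward pass
```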
- Python 3.10+
- PyTorch
- Kubernetes (k3s)
- Docker
- Step 1: Set up the Kubernetes cluster
  - Use `kubeadm`, `k3s`, or `minikube` (multi-node) depending on preference.
  - One laptop/computer will be the master node, the other will be the worker node; a connectivity check is sketched below.
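Cluster bootstrapping itself is done with the shell tools above, so no Python is involved; still, a quick sanity check from the worker node can confirm the master is reachable on the rendezvous port before training starts (the address and port below are illustrative):

```python
# Hypothetical connectivity check; replace the master IP and port with yours.
import socket

with socket.create_connection(("192.168.1.10", 29500), timeout=5):
    print("master node reachable on the DDP rendezvous port")
```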
- Step 2: Data Parallelism
  - Use PyTorch `DistributedDataParallel` (DDP) or TensorFlow's `MultiWorkerMirroredStrategy`.
  - Split the CIFAR-10 dataset across the two nodes, as sketched below.
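For the dataset split, PyTorch's `DistributedSampler` is the standard mechanism: each rank draws a disjoint shard of CIFAR-10 per epoch. A minimal sketch, assuming `torchvision` is installed and using commonly cited CIFAR-10 normalization statistics:

```python
# Each of the two nodes sees half of CIFAR-10 per epoch via DistributedSampler.
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

transform = T.Compose([
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
sampler = DistributedSampler(train_set)          # shards by rank/world_size
loader = DataLoader(train_set, batch_size=128, sampler=sampler)

for epoch in range(10):
    sampler.set_epoch(epoch)  # re-shuffles the shard assignment each epoch
    for images, labels in loader:
        pass                  # forward/backward on the DDP-wrapped model
```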
- Step 3: Containerize the Training Code
  - Write a `Dockerfile` that installs dependencies, mounts the data, and runs training; a sketch of the entrypoint it would launch follows below.
  - Push the image to a container registry (local/private) or Docker Hub.
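The Dockerfile itself is a plain build file rather than Python, but its `CMD` would typically launch a training entrypoint along these lines (the flag names and the `/data` mount path are illustrative assumptions):

```python
# Hypothetical entrypoint a Dockerfile CMD might invoke,
# e.g. CMD ["python", "train.py", "--data-dir", "/data"].
import argparse

def main():
    parser = argparse.ArgumentParser(description="DDP training entrypoint")
    parser.add_argument("--data-dir", default="/data")    # mounted dataset volume
    parser.add_argument("--epochs", type=int, default=10)
    args = parser.parse_args()
    # Build the sampler, loader, and DDP model as sketched above, then train.
    print(f"training for {args.epochs} epochs on data under {args.data_dir}")

if __name__ == "__main__":
    main()
```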
- Step 4: Deploy on Kubernetes
  - Define a StatefulSet or Deployment for each training process.
  - Use headless Services + shared storage (NFS or an object store) if needed.
  - Use host networking or Service DNS for communication between the nodes; the rendezvous sketch below shows how the training script consumes this.
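On the Kubernetes side, the manifests (StatefulSet + headless Service) would inject the rendezvous settings as environment variables, with stable pod DNS names coming from the headless Service. A sketch of how the training script consumes them, assuming an illustrative Service named `trainer`:

```python
# Hypothetical rendezvous wiring; values normally come from the pod spec.
import os
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "trainer-0.trainer")  # headless-Service DNS
os.environ.setdefault("MASTER_PORT", "29500")
rank = int(os.environ["RANK"])              # e.g. derived from the pod ordinal
world_size = int(os.environ["WORLD_SIZE"])  # 2 for this two-node cluster

dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)
```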
This project was done by the team "Bhattacharya Brothers".
April 22, 2025