Efficient and Scalable Vision Model Training

Efficient and Scalable Vision Model Training is a benchmarking and workflow-oriented project for training deep vision models efficiently across multiple A100 or H100 GPUs with PyTorch’s DistributedDataParallel (DDP) on SLURM-managed clusters. Rather than being tied to a specific dataset or architecture, the project lets users plug in either the provided model and dataset configurations or their own. It demonstrates best practices for scalable training, including SLURM-native job management, environment setup, and optimized data loading. The goal is strong performance and near-linear scaling, with flexibility and reproducibility preserved across a variety of vision workloads on the Kempner AI cluster.
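
As a rough illustration of what SLURM-native job management can look like for DDP, the sketch below maps the per-task variables that SLURM sets (`SLURM_PROCID`, `SLURM_NTASKS`, `SLURM_LOCALID`) onto the variables that PyTorch's `env://` initialization reads. The helper name `slurm_to_torch_env` is hypothetical and not taken from this repository, and the sketch assumes the job script exports `MASTER_ADDR` (falling back to localhost otherwise).

```python
import os


def slurm_to_torch_env(env=None, default_port="29500"):
    """Translate SLURM task variables into the names torch.distributed expects.

    Hypothetical helper for illustration; the workflows in this repository
    may set these values differently.
    """
    env = os.environ if env is None else env
    return {
        # Global rank of this task across all nodes.
        "RANK": str(int(env.get("SLURM_PROCID", "0"))),
        # Total number of tasks (one per GPU in a typical DDP launch).
        "WORLD_SIZE": str(int(env.get("SLURM_NTASKS", "1"))),
        # Rank of this task on its own node; usually selects the GPU index.
        "LOCAL_RANK": str(int(env.get("SLURM_LOCALID", "0"))),
        # Assumption: the job script exports MASTER_ADDR for multi-node runs.
        "MASTER_ADDR": env.get("MASTER_ADDR", "127.0.0.1"),
        "MASTER_PORT": env.get("MASTER_PORT", default_port),
    }


if __name__ == "__main__":
    os.environ.update(slurm_to_torch_env())
    # With these set, a training script would typically continue with
    # torch.distributed.init_process_group(backend="nccl") and wrap the
    # model in torch.nn.parallel.DistributedDataParallel.
    print({k: os.environ[k] for k in ("RANK", "WORLD_SIZE", "LOCAL_RANK")})
```

Launched with `srun`, each task would run this once before creating the process group, so no separate launcher is needed to assign ranks.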

Available Workflows

| Workflow | Model | Dataset | Max Tested GPUs | Tags |
| --- | --- | --- | --- | --- |
| imagenet1k_resnet50 | ResNet-50 | ImageNet-1k | 64 | `A100`, `DDP` |
| imagenet1k_alexnet | AlexNet | ImageNet-1k | 4 | `A100`, `DDP` |
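
One common way to check whether a workflow scales near-linearly is to compare measured throughput against the ideal throughput extrapolated from a smaller run. The sketch below shows that calculation; the function name `scaling_efficiency` is hypothetical, and the numbers in the usage example are made up for illustration rather than measured results from these workflows.

```python
def scaling_efficiency(base_throughput, base_gpus, throughput, gpus):
    """Return measured throughput as a fraction of ideal linear scaling.

    A value of 1.0 means perfectly linear scaling relative to the baseline.
    """
    if min(base_throughput, base_gpus, throughput, gpus) <= 0:
        raise ValueError("throughputs and GPU counts must be positive")
    # Ideal throughput if every added GPU contributed as much as a baseline GPU.
    ideal = base_throughput * (gpus / base_gpus)
    return throughput / ideal


if __name__ == "__main__":
    # Illustrative numbers only (images/s), not benchmark results.
    eff = scaling_efficiency(base_throughput=1000.0, base_gpus=1,
                             throughput=3600.0, gpus=4)
    print(f"scaling efficiency: {eff:.2f}")
```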
