
Deep Learning Model Testbed

Author: Philipp Allgeuer

This repository provides flexible and user-extensible training scripts for computer vision (e.g. image classification) tasks on standard benchmarks. Designed as a stable and reproducible testbed, it allows rapid experimentation with different model architectures, datasets, optimizers, learning rate schedules, gradient accumulation settings, and other training strategies and hyperparameters. The goal is to streamline the process of testing new training ideas and techniques in a controlled environment, enabling fair comparisons and insightful evaluations of their impact on model performance. It includes seamless integration with Weights & Biases (wandb) for powerful experiment tracking, live monitoring, and in-depth analysis of training runs and results.

Getting Started

Follow the instructions from Installation to ensure that you have a Python environment that can run the code (e.g. in the form of a running Docker container or a local sandboxed Python environment) and is logged in to wandb via an API key. This involves ensuring that you have a suitable NVIDIA Driver installed (for GPU/CUDA acceleration). Then you can, for example, train a wide variety of classification models using commands like:

./train_cls.py --dataset MNIST --model fcnet --epochs 50 --batch_size 64 --amp
./train_cls.py --dataset FashionMNIST --model fcnet --epochs 50 --batch_size 64 --amp
./train_cls.py --dataset CIFAR10 --model wide2_resnet14_g3 --epochs 80 --batch_size 64 --amp
./train_cls.py --dataset CIFAR100 --model resnet34-4 --epochs 120 --batch_size 64
./train_cls.py --dataset TinyImageNet --model resnet101-8 --epochs 120 --batch_size 32
./train_cls.py --dataset Imagenette --model efficientnet_v2_s --epochs 120 --batch_size 32
./train_cls.py --dataset Imagewoof --model resnext50_32x4d --epochs 300 --batch_size 32 --lr_scale 3
./train_cls.py --dataset Food101 --model swin_t --epochs 120 --warmup_epochs 20 --batch_size 32 --optimizer adamw
./train_cls.py --dataset ImageNet1K --model swin_t --epochs 80 --batch_size 32
./train_cls.py --dataset iNaturalist --model efficientnet_v2_s --epochs 80 --batch_size 32

You can always refer to the script help for documentation on the available arguments:

./train_cls.py --help

You can view the live progress and results of all your training runs in your wandb workspace, e.g. https://wandb.ai/pallgeuer/model_testbed if your username is pallgeuer.

Run Commands

In addition to the basic commands provided in Getting Started, here are some further run commands and advice:

  • Dataset Caching: The training speed of small- to medium-sized datasets can be increased (for local installations) by forcing them to be cached in RAM. For instance:

    mkdir /dev/shm/Datasets && cp -r ~/Datasets/{MNIST,FashionMNIST,CIFAR,TinyImageNet,Imagenette,Imagewoof} /dev/shm/Datasets
    ls -lAh /dev/shm/Datasets
    export DATASET_PATH=/dev/shm/Datasets
    # <-- Perform the required training runs
    # UNDO AFTER TRAINING: rm -rf /dev/shm/Datasets
  • Train classification model sweep: You can use wandb to automatically perform custom sweeps across sets of hyperparameters:

    wandb sweep sweep/cls_SWEEPNAME.yaml     # <-- CAUTION: Replace with the required sweep file name
    export WANDB_DIR="$(pwd)/log"            # <-- Explicitly needed for sweeps (even though this is automatically handled for individual manual runs of ./train_cls.py)
    <WANDB AGENT COMMAND FROM ABOVE OUTPUT>  # [--count NUM]
  • Summarize training results: You can use the wandb API to programmatically retrieve and filter training run results, and present them in tabular form. For example:

    results/mean_stats_table.py --project model_testbed --any_tags cifar10_act_func --metric valid_top1_max --group_by act_func
  • Create custom wandb plots per dataset: Go to your wandb workspace in the browser, and:

    • Create a results section:
      Go to project page
      'Add a section' at bottom
      Drag to top spot
      Rename via '...' to 'Results'
      
    • Create a summary plot:
      Add panel -> Custom chart -> Scatter plot (nominal x, field sorted)
      Remove limit with X
      Add filters with ... -> filters -> Dropdown
        Filter by State = finished, Sweep != null, and then every single config that can possibly make a difference (including Dataset)
      Summary variables: params, valid_top1_max
      Config variables: act_func, model
      Chart fields: groupKey = act_func, x = model, y = valid_top1_max, order = params, title = {DATASET} model E{EPOCHS} B{BATCH_SIZE}
      OK -> Duplicate panel -> Change valid_top1_max to train_top1
      
  • Running in tmux: On remote machines you can run training runs inside a tmux session (for convenience and the ability to disconnect/reconnect):

    tmux new -s mtestbed
      Ctrl+B,"
      Pane 0:
        SET UP AS YOU NORMALLY WOULD
        git pull
        START YOUR RUN / SWEEP (Middle click doesn't work => Use Ctrl+Shift+V or Ctrl+B,] if pasting from within tmux via mouse select)
      Pane 1:
        nvitop
        Q
      Ctrl+B,D
    tmux attach -d -t mtestbed
    

Custom Wandb Charts

You can create custom wandb charts for easier training results monitoring and analysis. Here are some example recipes:

  • Add a section for custom charts:

    Go to wandb main project page of concern
    Scroll all the way down and click 'Add a section'
    Click on the ellipsis at the top right of the new section and rename the section to 'Custom Charts' or whatever is desired
    
  • Add a custom chart:

    In the custom charts section click 'Add Panel -> Custom chart'
    Select the type of plot you want on the top left (start with 'Scatter plot' if you're not sure)
    Update the data query:
      Click the appearing X above 'limit:' to make sure all data you are looking for will be plotted (just be careful in selecting your data now though)
      Click the ellipsis and add 'filters:'
        Click the down arrow on '# filters' and configure as many filters as appropriate
        Note that all data in the project (including all data from all sweeps and individual runs) is queried, so best be specific/explicit about all possible configs
        Filters: State == finished, Sweep (explicit sweep, or != null), All configs that could possibly affect the output y-value (even if there is currently only one choice)
        Note that at the time of writing there is a bug where you cannot expand the filters again once you have closed them (the only options are to completely delete the filters with the X and recreate them all, or to edit the filters in the Vega-Lite editor, which seems to be more stable)
      Select the keys that you're interested in from the various categories available (use the drop-down arrows, plus buttons, and click between the quotation marks)
      Categories:
        config: Single values corresponding to run config parameters
        summary: Single final logged values of a run, i.e. as presented in the run table
        history*: Series of logged data for a particular run
    Select the available chart fields from the keys you just selected, and hopefully you can see a suitable preview of the chart
    Use groupKeys if you don't want each run to be a separate data series, but instead for example all runs with a particular config parameter value
    
  • Customize the definition of scatter plots:

    If the x-axis data is not quantitative, then change encoding.x.type to 'nominal' (hover for more options and explanations):
      "encoding": {
        "x": {"field": "${field:x}", "type": "nominal"},
    If you don't need 0 to be shown on an axis, e.g. the y-axis, then add encoding.y.scale.zero = false:
      "encoding": {
        "y": {"field": "groupedY", "type": "quantitative", "axis": {"title": "${field:y}"}, "scale": {"zero": false}},
    
  • For totally custom Vega-Lite plots refer to the vega directory

Installation

These installation instructions assume that Ubuntu is being used, but can likely easily be adapted to any other OS.

Begin by cloning the repository:

TESTBED=~/Code/model_testbed  # <-- CAUTION: Adjust this path to the desired directory of the model testbed code (the scripts will be placed directly inside this directory)
cd "$(dirname "$TESTBED")" && git clone https://github.com/pallgeuer/model_testbed.git "$(basename "$TESTBED")"

Docker Container

The safest and most sandboxed way of running the model testbed code is using a Docker container. For that, Docker needs to be installed, either via Docker Engine or via Docker Desktop, which bundles Docker Engine as one of its components. An NVIDIA driver then needs to be installed in order to enable system-wide GPU support, and the NVIDIA Container Toolkit is further required to allow Docker containers to actually use the available GPU acceleration.
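
Once these components are installed, you can sanity check that Docker containers can access the GPU (a minimal check; the ubuntu image is just an arbitrary small test image, and nvidia-smi is injected into the container by the NVIDIA Container Toolkit):

docker --version                              # <-- Verify that Docker Engine is installed
nvidia-smi                                    # <-- Verify that the NVIDIA driver is installed and working
docker run --rm --gpus all ubuntu nvidia-smi  # <-- Verify that containers can access GPU acceleration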

You can either pull a pre-built model testbed Docker image from Docker Hub, or easily build one yourself from source. For the following instructions we assume that PyTorch 2.7.0 with CUDA 12.8 is desired.

Make sure the TESTBED bash variable from Installation is defined, as it is required by the following bash commands!

Option 1: Pull a Docker Image from Docker Hub

You can conveniently pull a pre-built model testbed Docker image from the pallgeuer/model_testbed Docker Hub page:

docker pull pallgeuer/model_testbed:2.7.0-cu128

You can then verify the image has been successfully pulled:

docker image ls
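
Optionally, as a quick smoke test of the pulled image (a minimal sketch, assuming that python is available on the PATH inside the image, as described in Run the Docker Image):

docker run --rm pallgeuer/model_testbed:2.7.0-cu128 python --version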

You can now immediately proceed to running the Docker image.

Option 2: Build a Docker Image from Source

You can build a model testbed Docker image (based on docker/Dockerfile) as follows:

cd "$TESTBED"
CUDA_VERSION=12.8  # <-- Desired CUDA version
PYTORCH=2.7.0      # <-- Desired PyTorch version
docker buildx build -t "model_testbed:${PYTORCH}-cu${CUDA_VERSION//./}" --build-arg CUDA_VERSION="${CUDA_VERSION}" --build-arg PYTORCH="${PYTORCH}" --load -f docker/Dockerfile .

The model_testbed:2.7.0-cu128 Docker image is now available:

docker image ls

When you are done building Docker images for now, it is good to clear the often very large build cache:

docker builder prune --all

Here are some other useful commands related to the cleaning up of Docker images and containers:

docker ps -a                               # <-- Show all currently existing containers
docker stop NAME                           # <-- Stop a running container of the given name
docker rm NAME                             # <-- Remove/delete a container of the given name
docker image rm model_testbed:2.7.0-cu128  # <-- Removes/deletes the specified Docker image
docker image prune                         # <-- Removes unused dangling images (images with no tags and not referenced by any container)
docker system prune                        # <-- Removes all unused containers, networks, images (dangling only), and build cache

Run the Docker Image

Assuming a model_testbed:2.7.0-cu128 or pallgeuer/model_testbed:2.7.0-cu128 Docker image is locally available (CAUTION: Adjust the exact name in the commands below) as per the options above, we can launch a Docker container interactively (refer to Download Datasets for DATASET_PATH and how to download datasets) using:

export WANDB_API_KEY=...        # <-- CAUTION: Get your wandb API key from https://wandb.ai/authorize
export DATASET_PATH=~/Datasets  # <-- CAUTION: Adjust this path to the directory containing the downloaded datasets
docker run --name model_testbed --rm -it --gpus all --network host --env WANDB_API_KEY -v "${TESTBED}:/code" -v "${DATASET_PATH}:/datasets:ro" model_testbed:2.7.0-cu128 bash

Inside the running Docker container we can then execute arbitrary commands interactively:

pwd                    # <-- Shows the current working directory (initially /code)
ls -lAh /code          # <-- Shows the mounted model testbed code directory
ls -lAh /datasets      # <-- Shows the read-only mounted datasets directory
nvidia-smi             # <-- Verifies access to GPU acceleration via NVIDIA Container Toolkit
which python           # <-- Shows the path to the default Python interpreter (by default the custom one in /env)
python --version       # <-- Shows the Python version
./train_cls.py --help  # <-- Show the help for a training script

You can alternatively directly launch a training script from the docker run command line as follows (only the bash part has changed, and the same variables are required as before, e.g. TESTBED, DATASET_PATH, WANDB_API_KEY):

docker run --name model_testbed --rm -it --gpus all --network host --env WANDB_API_KEY -v "${TESTBED}:/code" -v "${DATASET_PATH}:/datasets:ro" model_testbed:2.7.0-cu128 ./train_cls.py --help

You can use something like --gpus "device=1" on multi-GPU machines to select which GPU is used (see the example after the following commands). The --rm option automatically removes/deletes a container when it finishes/exits/stops. Otherwise you can use:

docker ps -a               # <-- Show all currently existing containers
docker stop model_testbed  # <-- Stop a running container of the given name
docker rm model_testbed    # <-- Remove/delete a container of the given name
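
For example, to launch a training run on only the second GPU (a sketch that combines the docker run command from above with one of the example training commands from Getting Started):

docker run --name model_testbed --rm -it --gpus "device=1" --network host --env WANDB_API_KEY -v "${TESTBED}:/code" -v "${DATASET_PATH}:/datasets:ro" model_testbed:2.7.0-cu128 ./train_cls.py --dataset CIFAR10 --model wide2_resnet14_g3 --epochs 80 --batch_size 64 --amp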

Local Installation

As an alternative to Docker, you can use either venv or conda on your local system to create an isolated Python environment that can run the model testbed code. For the following instructions we assume Python 3.12 as the desired Python version.

Make sure the TESTBED bash variable from Installation is defined, as it is required by the following bash commands!

Option 1: Using venv

First, verify whether we currently have Python 3.12 already installed (e.g. the Ubuntu 24.04 system Python interpreter):

which python3.12  # <-- If this responds with a path (e.g. /usr/bin/python3.12) then a Python 3.12 interpreter was found

If no Python 3.12 interpreter is available, then we install one system-wide, which we can subsequently use to create our own customized sandboxed Python 3.12 environment:

grep -R "deadsnakes/ppa" /etc/apt/sources.list /etc/apt/sources.list.d/  # <-- Check whether the deadsnakes PPA is already added (can skip next command if so)
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install python3.12 python3.12-venv
which python3.12  # <-- Should now respond with a Python interpreter path

We create a new sandboxed Python 3.12 environment for the model testbed as follows:

cd "$TESTBED"
python3.12 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
deactivate

In future, in order to use the Python environment you then just do:

unset PYTHONPATH  # <-- Prevent any external Python packages from accidentally bleeding into the sandboxed environment (e.g. ROS)
cd "$TESTBED" && source .venv/bin/activate

Option 2: Using conda

Install Miniconda if you do not have either Miniconda or Anaconda yet, and check whether the libmamba solver is enabled:

conda config --show solver  # <-- If this prints 'solver: libmamba' then the libmamba solver is enabled

If the libmamba solver is not enabled, then enable it (this is strongly recommended for all modern uses of conda, and is the default anyway for new installs):

conda install -n base conda-libmamba-solver
conda config --set solver libmamba

Now create a conda environment model_testbed for the project:

conda create -y -n model_testbed python=3.12

Add activation and deactivation scripts to the conda environment that automatically unset and restore the PYTHONPATH environment variable respectively, to prevent any external Python packages from accidentally bleeding into the sandboxed environment (e.g. ROS):

CONDA_ENV_DIR="$(readlink -e "$(dirname "$CONDA_EXE")/../envs/model_testbed")"
mkdir -p "$CONDA_ENV_DIR/etc/conda/activate.d"
mkdir -p "$CONDA_ENV_DIR/etc/conda/deactivate.d"
cat << 'EOM' > "$CONDA_ENV_DIR/etc/conda/activate.d/pythonpath.sh"
#!/bin/sh
# The environment name is available under $CONDA_DEFAULT_ENV
if [ -n "$PYTHONPATH" ]; then
    export SUPPRESSED_PYTHONPATH="$PYTHONPATH"
    unset PYTHONPATH
fi
# EOF
EOM
cat << 'EOM' > "$CONDA_ENV_DIR/etc/conda/deactivate.d/pythonpath.sh"
#!/bin/sh
# The environment name is available under $CONDA_DEFAULT_ENV
if [ -n "$SUPPRESSED_PYTHONPATH" ]; then
    export PYTHONPATH="$SUPPRESSED_PYTHONPATH"
    unset SUPPRESSED_PYTHONPATH
fi
# EOF
EOM
chmod +x "$CONDA_ENV_DIR/etc/conda/activate.d/pythonpath.sh" "$CONDA_ENV_DIR/etc/conda/deactivate.d/pythonpath.sh"
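
You can quickly verify that the hooks behave as intended (a minimal check, where /tmp/example_pythonpath is just a placeholder value):

export PYTHONPATH=/tmp/example_pythonpath  # <-- Placeholder for an externally set PYTHONPATH (e.g. from ROS)
conda activate model_testbed
echo "${PYTHONPATH:-<unset>}"              # <-- Expect: <unset>
conda deactivate
echo "${PYTHONPATH:-<unset>}"              # <-- Expect: /tmp/example_pythonpath
unset PYTHONPATH                           # <-- Clean up the placeholder again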

Especially on older Ubuntu versions it is often advantageous to have newer versions of the compiler runtime libraries available (this allows some conda packages to be installed that would otherwise be deemed incompatible):

conda activate model_testbed
conda update -y -c conda-forge libgcc-ng libgomp libstdcxx-ng
conda deactivate

In future, in order to use the Python environment you then just do:

cd "$TESTBED" && conda activate model_testbed

Install Dependencies

Assuming you have created a sandboxed Python environment as per the options above, we now install the package dependencies of the model testbed code:

# <-- CAUTION: Change to the TESTBED directory and activate the created Python environment!
pip install --index-url https://download.pytorch.org/whl/cu128 torch==2.7.0 torchvision numpy
pip install matplotlib wandb

For PyTorch (torch and torchvision), an NVIDIA driver is required in order to enable GPU (i.e. CUDA) support.
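
You can then verify that PyTorch is installed correctly and can see the GPU (torch.cuda.is_available() prints False if no suitable NVIDIA driver is found):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"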

To complete the installation, we ensure that Weights & Biases (wandb) is logged in locally on the command line with your account (create an account if you don't have one yet):

wandb login  # <-- Refer to https://wandb.ai/authorize

This will let you very conveniently monitor the progress of training runs and summarize their results, all in an online web interface that you can access remotely (i.e. even when you are away from the machine that is actually performing the training).

At this point you can launch any model testbed script, like for example (refer to Download Datasets for DATASET_PATH and how to download datasets):

# <-- CAUTION: Change to the TESTBED directory and activate the created Python environment!
export DATASET_PATH=~/Datasets  # <-- CAUTION: Adjust this path to the directory containing the downloaded datasets
./train_cls.py --help

You can use something like export CUDA_VISIBLE_DEVICES=1 on multi-GPU machines to select which GPU is used.
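
For example (reusing one of the training commands from Getting Started):

export CUDA_VISIBLE_DEVICES=1  # <-- Restrict PyTorch to only see the second GPU
./train_cls.py --dataset CIFAR10 --model wide2_resnet14_g3 --epochs 80 --batch_size 64 --amp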

NVIDIA Driver

In order to train deep learning models on GPU devices using CUDA, an appropriate NVIDIA driver is required. Each NVIDIA driver has a maximum CUDA version that it supports, which for a currently installed NVIDIA driver can be checked using:

nvidia-smi  # <-- Maximum supported CUDA version is at top-right, i.e. the 'CUDA Version' field

If you don't have an NVIDIA driver installed, or if you plan to use a CUDA version newer than this (even if only inside a Docker container), then you need to update your NVIDIA graphics driver first.

If you have an NVIDIA graphics card, especially in a laptop setting, it can happen that the default graphics driver does not work out of the box. This can cause weird screen effects when attempting to boot, and failure to properly load the graphical interface. One workaround that often helps is to temporarily specify the nomodeset kernel boot option until you can install a graphics driver that works properly (e.g. a suitable NVIDIA one). When at the GRUB screen during boot, instead of pressing Enter to launch Ubuntu, have it highlighted and press e and then add the keyword nomodeset after quiet splash. Then press F10 to boot. Note that this is a temporary change for a single boot only.

The best place to get an overview of the currently released Unix NVIDIA drivers is the Unix Driver Archive, including the link there to their Linux AMD64 Display Driver Archive. Production Branch NVIDIA drivers are recommended unless you have a good reason otherwise. You can also search for recommended/certified NVIDIA driver versions for your particular GPU on the main NVIDIA Drivers page (following NVIDIA's own advice, avoid installing drivers directly from here though). As a last check, you may consider looking up the minimum driver requirements in the release notes of the specific CUDA version you wish to use (for example the CUDA 12.8 Release Notes).

Before installing an NVIDIA graphics driver, check what driver is actually currently being used:

lsmod | egrep 'nouveau|nvidia'

Even if this does not show an NVIDIA driver currently in use, you should still be extraordinarily careful to make sure that not a single package belonging to another NVIDIA driver is currently installed before continuing. This happens by accident more often than you would think, and can prevent the new driver from working. So first check what is installed (this will show more than just NVIDIA packages, but gives the best chance of finding all currently installed NVIDIA packages):

dpkg -l | egrep -i 'nvidia|[4-9][0-9][05]\.[0-9]{2,3}\.[0-9]{2,3}'

If a line begins with hi then the corresponding package is installed and on hold, whereas ii just means installed.
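
For example, to list only the packages that are currently installed and on hold (equivalent information is also available via apt-mark, shown for comparison):

dpkg -l | awk '$1 == "hi" {print $2}'  # <-- List only installed-and-held packages
apt-mark showhold                      # <-- Alternative way of listing held packages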

Purge-uninstall all existing NVIDIA driver packages before installing a new NVIDIA driver version!

First, for convenience, collect a superset list of the installed packages that are possibly part of the NVIDIA driver installation (the second command simply extracts the second column from the first command):

dpkg -l | egrep -i 'nvidia|[4-9][0-9][05]\.[0-9]{2,3}\.[0-9]{2,3}'
dpkg -l | egrep -i 'nvidia|[4-9][0-9][05]\.[0-9]{2,3}\.[0-9]{2,3}' | awk '{print $2}' | paste -sd' '

CAREFUL: Edit this space-separated list of packages to include only the packages installed as part of any currently installed NVIDIA display drivers. This in general includes every package with an NVIDIA driver version in the version column, as well as nvidia-prime and screen-resolution-extra. Note that the nvidia-container-toolkit* and libnvidia-container* packages are not part of the NVIDIA display driver.

We now purge these old NVIDIA driver packages:

OLDPACKAGES=(...)  # <-- CAUTION: Space-separated list of NVIDIA packages to purge
sudo apt-mark unhold "${OLDPACKAGES[@]}"
apt-mark showhold
sudo apt purge "${OLDPACKAGES[@]}"
dpkg -l | egrep -i 'nvidia|[4-9][0-9][05]\.[0-9]{2,3}\.[0-9]{2,3}'

Note that when uninstalling these packages, it is possible that other packages that explicitly depend on them also get removed (e.g. psensor). This is not easily avoidable; you can simply reinstall such packages after you have installed the new driver. If you want to avoid them being purged (not just removed), however, then explicitly remove those packages manually beforehand. If very many packages depend on an NVIDIA driver component package you want to remove, e.g. libxnvctrl0:amd64, then you can instead just manually update the package version without completely uninstalling it (but this will skip the purging of config files):

# Example forced update command
sudo apt install --reinstall libxnvctrl0:amd64=570.148.08-1ubuntu1

We now install a new NVIDIA driver using the CUDA network repository installation method. Start by verifying (using the link) that the most current installation instructions still use cuda-keyring version 1.1-1. If not, update the commands below as appropriate. We first check whether the required keyring and network repository are already present:

dpkg -l | grep cuda-keyring                             # <-- Check if the CUDA keyring package is installed (expected version is 1.1-1)
ls -lAh /usr/share/keyrings/cuda-*                      # <-- Check if the CUDA GPG keyring file is present
ls -lAh /etc/apt/{sources.list.d,preferences.d}/cuda-*  # <-- Check if the CUDA network repository APT source and preferences files are present

If any of these are missing or outdated, then add the CUDA network repository to APT (these commands are safe even if the network repository is actually already present):

wget -P /tmp/ https://developer.download.nvidia.com/compute/cuda/repos/"ubuntu$(lsb_release -rs | tr -d '.')/$(uname -m)"/cuda-keyring_1.1-1_all.deb && sudo dpkg -i /tmp/cuda-keyring_1.1-1_all.deb && rm -v /tmp/cuda-keyring_1.1-1_all.deb

Update the local APT package index from remote repositories, in particular the CUDA network repository:

sudo apt update
apt-cache search 'nvidia-driver-' | sort | egrep '^nvidia-driver-[0-9]+(-[a-z-]+)?'

Now, assuming we are interested in installing the latest NVIDIA driver version from the production branch 570, we can check what latest version is available:

apt-cache policy nvidia-driver-570

We can then proceed to install the NVIDIA driver:

sudo apt install build-essential  # <-- Required, but usually already installed on all but very freshly installed Ubuntu systems
sudo apt install nvidia-driver-570
dpkg -l | egrep -i 'nvidia|[4-9][0-9][05]\.[0-9]{2,3}\.[0-9]{2,3}'

Be very careful to sanity check the exact versions of all installed packages: unexpected version mismatches happen more often than you would expect, and can cause a driver not to work! Use a command like the following if there are version mismatches:

sudo apt install --reinstall nvidia-modprobe=570.148.08-1ubuntu1 nvidia-settings=570.148.08-1ubuntu1

It is highly recommended to hold (i.e. fix) the exact versions of all installed NVIDIA driver components:

dpkg -l | egrep -i 'nvidia|[4-9][0-9][05]\.[0-9]{2,3}\.[0-9]{2,3}' | awk '{print $2}' | paste -sd' '
PACKAGES=(...)  # <-- CAUTION: Space-separated list of just-installed NVIDIA packages based on the command above
sudo apt-mark hold "${PACKAGES[@]}"
apt-mark showhold

Reboot the computer (without adding nomodeset in case you were doing that so far) and check that the NVIDIA driver works:

lsmod | egrep 'nouveau|nvidia'
nvidia-smi

You can also perform a graphical test of the graphics card and driver using:

sudo apt install mesa-utils
__GL_SYNC_TO_VBLANK=0 vblank_mode=0 glxgears

For a longer and more involved test, do:

sudo apt install glmark2
glmark2

Keep in mind that the frame rates and scores achieved by these tests are only a rough indication of the computational power of your GPU.

CPU / GPU Monitoring

During training you can monitor CPU performance using htop (installed via apt), and GPU performance using nvidia-smi (installed as part of the NVIDIA driver):

htop
watch -n 0.4 nvidia-smi

A more convenient and informative way to interactively monitor both CPU and GPU performance at the same time is using nvitop:

# Install pipx
python3 -m pip install --user pipx  # <-- To upgrade pipx in future: python3 -m pip install --user --upgrade pipx
~/.local/bin/pipx completions

# Manually add/append the following lines to the bashrc
vim ~/.bashrc
    function AddToPath() { echo "$PATH" | /bin/grep -Eq "(^|:)$1($|:)" || export PATH="$1${PATH:+:${PATH}}"; }
    AddToPath ~/.local/bin
    eval "$(register-python-argcomplete pipx)"

# Install nvitop
pipx install nvitop  # <-- Installs into venv: ~/.local/pipx/venvs/nvitop

You can then run nvitop as:

nvitop                  # <-- Available globally (also: nvisel)
ssh -t HOSTNAME nvitop  # <-- Run nvitop via SSH on a remote machine that has nvitop installed

Download Datasets

To train a model using the model testbed, you need to download associated benchmark datasets (refer to info.txt). Below are concise instructions for automatically downloading and preparing a variety of datasets from the command line.

Start by setting the environment variable DATASET_PATH that specifies where the datasets should be downloaded to:

export DATASET_PATH=~/Datasets  # <-- CAUTION: Adjust this path to your desired dataset directory
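
It is also worth making sure that this directory actually exists before downloading anything into it:

mkdir -p "$DATASET_PATH"
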
  • MNIST: http://yann.lecun.com/exdb/mnist

    MNIST_ROOT="$DATASET_PATH"/MNIST
    mkdir "$MNIST_ROOT"
    wget -P "$MNIST_ROOT/raw" http://yann.lecun.com/exdb/mnist/{train-images-idx3-ubyte.gz,train-labels-idx1-ubyte.gz,t10k-images-idx3-ubyte.gz,t10k-labels-idx1-ubyte.gz}
    for gz in "$MNIST_ROOT/raw"/*.gz; do gunzip -k "$gz"; done
  • FashionMNIST: https://github.com/zalandoresearch/fashion-mnist

    FASHION_MNIST_ROOT="$DATASET_PATH"/FashionMNIST
    mkdir "$FASHION_MNIST_ROOT"
    wget -P "$FASHION_MNIST_ROOT/raw" http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/{train-images-idx3-ubyte.gz,train-labels-idx1-ubyte.gz,t10k-images-idx3-ubyte.gz,t10k-labels-idx1-ubyte.gz}
    for gz in "$FASHION_MNIST_ROOT/raw"/*.gz; do gunzip -k "$gz"; done
  • CIFAR10: https://www.cs.toronto.edu/~kriz/cifar.html

    CIFAR_ROOT="$DATASET_PATH"/CIFAR
    wget -P "$CIFAR_ROOT" https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
    tar -xf "$CIFAR_ROOT"/cifar-10-python.tar.gz -C "$CIFAR_ROOT" && rm "$CIFAR_ROOT"/cifar-10-python.tar.gz
  • CIFAR100: https://www.cs.toronto.edu/~kriz/cifar.html

    CIFAR_ROOT="$DATASET_PATH"/CIFAR
    wget -P "$CIFAR_ROOT" https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz
    tar -xf "$CIFAR_ROOT"/cifar-100-python.tar.gz -C "$CIFAR_ROOT" && rm "$CIFAR_ROOT"/cifar-100-python.tar.gz
    rm -rf "$CIFAR_ROOT"/cifar-100-python/file.txt~
  • Tiny ImageNet: https://www.kaggle.com/c/tiny-imagenet

    TINY_IMAGENET_ROOT="$DATASET_PATH"/TinyImageNet
    wget -P "$TINY_IMAGENET_ROOT" https://image-net.org/data/tiny-imagenet-200.zip
    unzip -q "$TINY_IMAGENET_ROOT"/tiny-imagenet-200.zip -d "$TINY_IMAGENET_ROOT" && rm "$TINY_IMAGENET_ROOT"/tiny-imagenet-200.zip
    rm -r "$TINY_IMAGENET_ROOT"/tiny-imagenet-200/test
    for wnid in "$TINY_IMAGENET_ROOT"/tiny-imagenet-200/train/*; do mv "$wnid"/images/* "$wnid/"; done
    rm -r "$TINY_IMAGENET_ROOT"/tiny-imagenet-200/train/*/images "$TINY_IMAGENET_ROOT"/tiny-imagenet-200/train/*/*.txt
    VALDIR="$TINY_IMAGENET_ROOT"/tiny-imagenet-200/val
    cut -f1,2 < "$VALDIR/val_annotations.txt" | while read file wnid; do mkdir -p "$VALDIR/$wnid"; mv "$VALDIR/images/$file" "$VALDIR/$wnid/"; done
    rm -r "$VALDIR"/{images,val_annotations.txt}
  • Imagenette: https://github.com/fastai/imagenette

    IMAGENETTE_ROOT="$DATASET_PATH"/Imagenette
    wget -P "$IMAGENETTE_ROOT" https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-320.tgz
    tar -xf "$IMAGENETTE_ROOT"/imagenette2-320.tgz -C "$IMAGENETTE_ROOT" && rm "$IMAGENETTE_ROOT"/imagenette2-320.tgz
  • Imagewoof: https://github.com/fastai/imagenette

    IMAGEWOOF_ROOT="$DATASET_PATH"/Imagewoof
    wget -P "$IMAGEWOOF_ROOT" https://s3.amazonaws.com/fast-ai-imageclas/imagewoof2-320.tgz
    tar -xf "$IMAGEWOOF_ROOT"/imagewoof2-320.tgz -C "$IMAGEWOOF_ROOT" && rm "$IMAGEWOOF_ROOT"/imagewoof2-320.tgz
  • Food-101: https://data.vision.ee.ethz.ch/cvl/datasets_extra/food-101

    FOOD_ROOT="$DATASET_PATH"/Food101
    wget -P "$FOOD_ROOT" http://data.vision.ee.ethz.ch/cvl/food-101.tar.gz
    tar -xf "$FOOD_ROOT"/food-101.tar.gz -C "$FOOD_ROOT" && rm "$FOOD_ROOT"/food-101.tar.gz
  • ImageNet-1K: https://www.image-net.org/challenges/LSVRC

    IMAGENET1K_ROOT="$DATASET_PATH"/ImageNet1K
    
    # Option 1: Intended way
    xdg-open https://image-net.org/login.php
      Log in (you need to sign up with an academic email address and wait to be accepted)
      Click 'Download' at the top
      Click on '2012' in the section ILSVRC
      Download 'Training images (Task 1 & 2)' -> Can alternatively just try the wget below
      Download 'Validation images (all tasks)' -> Can alternatively just try the wget below
    
    # Option 2: Direct wget
    mkdir "$IMAGENET1K_ROOT"
    wget -P "$IMAGENET1K_ROOT/ILSVRC-CLS" https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_train.tar
    mkdir "$IMAGENET1K_ROOT/ILSVRC-CLS/val" && tar -xf "$IMAGENET1K_ROOT"/ILSVRC-CLS/ILSVRC2012_img_val.tar -C "$IMAGENET1K_ROOT/ILSVRC-CLS/val" && rm "$IMAGENET1K_ROOT"/ILSVRC-CLS/ILSVRC2012_img_val.tar
    (cd "$IMAGENET1K_ROOT/ILSVRC-CLS/val"; wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash; )
    find "$IMAGENET1K_ROOT/ILSVRC-CLS/val" -name "*.JPEG" | wc -l  # Should be 50000
    mkdir "$IMAGENET1K_ROOT/ILSVRC-CLS/train" && tar -xf "$IMAGENET1K_ROOT"/ILSVRC-CLS/ILSVRC2012_img_train.tar -C "$IMAGENET1K_ROOT/ILSVRC-CLS/train" && rm "$IMAGENET1K_ROOT"/ILSVRC-CLS/ILSVRC2012_img_train.tar
    for class_tar in "$IMAGENET1K_ROOT/ILSVRC-CLS/train"/*.tar; do class_dir="${class_tar%.tar}"; mkdir "$class_dir" && tar -xf "$class_tar" -C "$class_dir" && rm "$class_tar"; done
    find "$IMAGENET1K_ROOT/ILSVRC-CLS/train" -name "*.JPEG" | wc -l  # Should be 1281167
  • iNaturalist: https://github.com/visipedia/inat_comp/tree/master/2021

    INATURALIST_ROOT="$DATASET_PATH"/iNaturalist
    # <-- CAUTION: Activate a Python interpreter that has torchvision installed!
    python -c "import torchvision.datasets"$'\n'"[torchvision.datasets.INaturalist('$INATURALIST_ROOT', version, download=True) for version in ('2021_train', '2021_train_mini', '2021_valid')]"
    rm -f "$INATURALIST_ROOT"/2021_{train,train_mini,valid}.tgz
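
Finally, after downloading the desired datasets, you can check what is now present:

ls -lAh "$DATASET_PATH"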

Citation

This repository is licensed under the GNU GPL v3 license. Please give appropriate attribution to Philipp Allgeuer and this repository if you use it for your own purposes, publications, theses, reports, or derivative works. Thanks!
