PointPillars Waymo distributed training model weights and docs (PR #585)

ssheorey · ssheorey · commit 66131df0f8db · 2023-02-27T13:15:52.000-08:00
- Training speedup with GPUs.
- Add model weight file + metrics.
- Script for running training with SLURM.
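
For reviewers, the multi-node rank bookkeeping this change adds to `scripts/run_pipeline.py` can be sketched in isolation as follows. This is an illustrative sketch, not the code in the diff: the helper names `resolve_node_rank`, `global_rank` and `world_size` are hypothetical, and it assumes every node uses the same number of GPUs, as the SLURM script requires.

```python
import os


def resolve_node_rank(value, environ=None):
    """Return the node rank as an int.

    `value` is either a literal such as "0" or the name of an environment
    variable such as "SLURM_NODEID" (hypothetical helper for illustration).
    """
    environ = os.environ if environ is None else environ
    try:
        return int(value)
    except ValueError:  # not a number: treat it as an env var name
        return int(environ[value])


def global_rank(node_rank, gpus_per_node, local_rank):
    """Global worker rank, assuming the same GPU count on every node."""
    return node_rank * gpus_per_node + local_rank


def world_size(nodes, gpus_per_node):
    """Total number of workers across all nodes."""
    return nodes * gpus_per_node


if __name__ == "__main__":
    # Simulate the second of two nodes, each with four GPUs.
    fake_env = {"SLURM_NODEID": "1"}
    node_rank = resolve_node_rank("SLURM_NODEID", fake_env)
    for local_rank in range(4):
        print(local_rank, global_rank(node_rank, 4, local_rank),
              world_size(2, 4))
```

With two nodes and four GPUs per node, the workers on node 1 receive global ranks 4 to 7 out of a world size of 8.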
diff --git a/README.md b/README.md
@@ -46,7 +46,9 @@ respective requirements files:
 ```bash
 # To install a compatible version of TensorFlow
 pip install -r requirements-tensorflow.txt
-# To install a compatible version of PyTorch with CUDA
+# To install a compatible version of PyTorch
+pip install -r requirements-torch.txt
+# To install a compatible version of PyTorch with CUDA on Linux
 pip install -r requirements-torch-cuda.txt
 ```
 
@@ -338,15 +340,17 @@ The table shows the available models and datasets for the segmentation task and
 For the task of object detection, we measure the performance of different methods using the mean average precision (mAP) for bird's eye view (BEV) and 3D.
 The table shows the available models and datasets for the object detection task and the respective scores. Each score links to the respective weight file.
 For the evaluation, the models were evaluated using the validation subset, according to KITTI's validation criteria. The models were trained for three classes (car, pedestrian and cyclist). The calculated values are the mean value over the mAP of all classes for all difficulty levels.
+For the Waymo dataset, the models were trained on three classes (pedestrian, vehicle, cyclist).
 
 
-| Model / Dataset    | KITTI [BEV / 3D] @ 0.70|
-|--------------------|---------------|
-| PointPillars (tf)    | [61.6 / 55.2](https://storage.googleapis.com/open3d-releases/model-zoo/pointpillars_kitti_202012221652utc.zip) |
-| PointPillars (torch) | [61.2 / 52.8](https://storage.googleapis.com/open3d-releases/model-zoo/pointpillars_kitti_202012221652utc.pth)   |
-| PointRCNN (tf)       | [78.2 / 65.9](https://storage.googleapis.com/open3d-releases/model-zoo/pointrcnn_kitti_202105071146utc.zip) |
-| PointRCNN (torch)    | [78.2 / 65.9](https://storage.googleapis.com/open3d-releases/model-zoo/pointrcnn_kitti_202105071146utc.pth) |
+| Model / Dataset    | KITTI [BEV / 3D] @ 0.70| Waymo [BEV / 3D] @ 0.50 |
+|--------------------|------------------------|------------------|
+| PointPillars (tf)    | [61.6 / 55.2](https://storage.googleapis.com/open3d-releases/model-zoo/pointpillars_kitti_202012221652utc.zip) | - |
+| PointPillars (torch) | [61.2 / 52.8](https://storage.googleapis.com/open3d-releases/model-zoo/pointpillars_kitti_202012221652utc.pth)  | avg: 61.01 / 48.30 \| [best: 61.47 / 57.55](https://storage.googleapis.com/open3d-releases/model-zoo/pointpillars_waymo_202211200158utc_seed2_gpu16.pth) [^wpp-train] |
+| PointRCNN (tf)       | [78.2 / 65.9](https://storage.googleapis.com/open3d-releases/model-zoo/pointrcnn_kitti_202105071146utc.zip) | - |
+| PointRCNN (torch)    | [78.2 / 65.9](https://storage.googleapis.com/open3d-releases/model-zoo/pointrcnn_kitti_202105071146utc.pth) | - |
 
+[^wpp-train]: The avg. metrics are the average of three sets of training runs with 4, 8, 16 and 32 GPUs. Training was halted after 30 epochs. A model checkpoint is available for the best training run.
 
 #### Training PointRCNN
 
@@ -402,6 +406,7 @@ For downloading these datasets visit the respective webpages and have a look at
 * [Visualize custom data](docs/howtos.md#visualize-custom-data)
 * [Adding a new model](docs/howtos.md#adding-a-new-model)
 * [Adding a new dataset](docs/howtos.md#adding-a-new-dataset)
+* [Distributed training](docs/howtos.md#distributed-training)
 * [Visualize and compare input data, ground truth and results in TensorBoard](docs/tensorboard.md)
 * [Inference with Intel OpenVINO](docs/openvino.md)
 
diff --git a/ci/run_ci.sh b/ci/run_ci.sh
@@ -50,8 +50,7 @@ cmake -DBUNDLE_OPEN3D_ML=ON \
     -DGLIBCXX_USE_CXX11_ABI=OFF \
     -DBUILD_TENSORFLOW_OPS=ON \
     -DBUILD_PYTORCH_OPS=ON \
-    -DBUILD_GUI=OFF \
-    -DBUILD_RPC_INTERFACE=OFF \
+    -DBUILD_GUI=ON \
     -DBUILD_UNIT_TESTS=OFF \
     -DBUILD_BENCHMARKS=OFF \
     -DBUILD_EXAMPLES=OFF \
diff --git a/docs/howtos.md b/docs/howtos.md
@@ -243,3 +243,18 @@ import open3d.ml.torch as ml3d
 model = ml3d.models.MyModel()
 dataset = ml3d.datasets.MyDataset()
 ```
+
+## Distributed training (preview)
+
+Open3D-ML currently supports distributed training with PyTorch for object detection on Waymo with the PointPillars model. More comprehensive support for semantic segmentation models will follow shortly.
+
+Distributed training uses the PyTorch Distributed Data Parallel (DDP) module and can be used to distribute training across multiple compute nodes, each with multiple GPUs. The chart below shows the per-epoch runtime of sample runs and the speedup obtained with an increasing number of GPUs. The training was run on a cluster containing 4 nodes with 8 RTX 3090 GPUs each.
+
+- Dataset: Waymo v1.3
+- Model: PointPillars
+- GPU: RTX 3090
+- Batch size: 4 per GPU
+
+![PointPillars training on Waymo per epoch training time with number of GPUs](https://user-images.githubusercontent.com/41028320/220750523-57075575-8cc7-4e40-99b0-a4e79995f1ec.png)
+
+See [`scripts/train_scripts/pointpillars_waymo.sh`](../scripts/train_scripts/pointpillars_waymo.sh) for an example SLURM training script for distributed training on two nodes, using four GPUs on each node. The remaining configuration is read from the config file [`pointpillars_waymo.yml`](../ml3d/configs/pointpillars_waymo.yml).
diff --git a/ml3d/tf/modules/pointnet.py b/ml3d/tf/modules/pointnet.py
@@ -93,7 +93,7 @@ def __init__(self):
     def call(self, xyz, features=None, new_xyz=None, training=True):
         r"""
         :param xyz: (B, N, 3) tensor of the xyz coordinates of the features
-        :param features: (B, N, C) tensor of the descriptors of the the features
+        :param features: (B, N, C) tensor of the descriptors of the features
         :param new_xyz:
         :return:
             new_xyz: (B, npoint, 3) tensor of the new features' xyz
diff --git a/ml3d/tf/utils/pointnet/pointnet2_modules.py b/ml3d/tf/utils/pointnet/pointnet2_modules.py
@@ -17,7 +17,7 @@ def __init__(self):
     def call(self, xyz, features=None, new_xyz=None, training=True):
         r"""
         :param xyz: (B, N, 3) tensor of the xyz coordinates of the features
-        :param features: (B, N, C) tensor of the descriptors of the the features
+        :param features: (B, N, C) tensor of the descriptors of the features
         :param new_xyz:
         :return:
             new_xyz: (B, npoint, 3) tensor of the new features' xyz
diff --git a/ml3d/torch/modules/pointnet.py b/ml3d/torch/modules/pointnet.py
@@ -122,7 +122,7 @@ def forward(self,
                 new_xyz=None) -> (torch.Tensor, torch.Tensor):
         r"""
         :param xyz: (B, N, 3) tensor of the xyz coordinates of the features
-        :param features: (B, N, C) tensor of the descriptors of the the features
+        :param features: (B, N, C) tensor of the descriptors of the features
         :param new_xyz:
         :return:
             new_xyz: (B, npoint, 3) tensor of the new features' xyz
diff --git a/ml3d/torch/utils/pointnet/pointnet2_modules.py b/ml3d/torch/utils/pointnet/pointnet2_modules.py
@@ -50,7 +50,7 @@ def forward(self,
         r"""Forward.
 
         :param xyz: (B, N, 3) tensor of the xyz coordinates of the features
-        :param features: (B, N, C) tensor of the descriptors of the the features
+        :param features: (B, N, C) tensor of the descriptors of the features
         :param new_xyz:
         :return:
             new_xyz: (B, npoint, 3) tensor of the new features' xyz
diff --git a/ml3d/utils/builder.py b/ml3d/utils/builder.py
@@ -14,19 +14,19 @@ def build_network(cfg):
     return build(cfg, NETWORK)
 
 
-def convert_device_name(framework, device_ids):
+def convert_device_name(device_type, device_ids):
     """Convert device to either cpu or cuda."""
     gpu_names = ["gpu", "cuda"]
     cpu_names = ["cpu"]
-    if framework not in cpu_names + gpu_names:
+    if device_type not in cpu_names + gpu_names:
         raise KeyError("the device should either "
-                       "be cuda or cpu but got {}".format(framework))
+                       "be cuda or cpu but got {}".format(device_type))
     assert type(device_ids) is list
     device_ids_new = []
     for device in device_ids:
         device_ids_new.append(int(device))
 
-    if framework in gpu_names:
+    if device_type in gpu_names:
         return "cuda", device_ids_new
     else:
         return "cpu", device_ids_new
diff --git a/scripts/run_pipeline.py b/scripts/run_pipeline.py
@@ -1,14 +1,15 @@
-import numpy as np
+import os
 import argparse
 import logging
 import sys
-import yaml
+from pathlib import Path
 import pprint
-import os
+import yaml
+import numpy as np
 import torch.distributed as dist
 from torch import multiprocessing
 
-from pathlib import Path
+import open3d.ml as _ml3d
 
 
 def parse_args():
@@ -42,7 +43,14 @@ def parse_args():
     parser.add_argument('--batch_size', help='batch size', default=None)
     parser.add_argument('--main_log_dir',
                         help='the dir to save logs and models')
-    parser.add_argument('--seed', help='random seed', default=0)
+    parser.add_argument('--seed', help='random seed', default=0, type=int)
+    parser.add_argument('--nodes', help='number of nodes', default=1, type=int)
+    parser.add_argument('--node_rank',
+                        help='rank of this node, default: 0. To read it from'
+                        ' the environment, enter the name of an env var, e.g.'
+                        ' "SLURM_NODEID".',
+                        default="0",
+                        type=str)
     parser.add_argument(
         '--host',
         help='Host for distributed training, default: localhost',
@@ -57,6 +65,10 @@ def parse_args():
         default='gloo')
 
     args, unknown = parser.parse_known_args()
+    try:
+        args.node_rank = int(args.node_rank)
+    except ValueError:  # str => get from environment
+        args.node_rank = int(os.environ[args.node_rank])
 
     parser_extra = argparse.ArgumentParser(description='Extra arguments')
     for arg in unknown:
@@ -73,9 +85,6 @@ def parse_args():
     return args, vars(args_extra)
 
 
-import open3d.ml as _ml3d
-
-
 def main():
     cmd_line = ' '.join(sys.argv[:])
     args, extra_dict = parse_args()
@@ -103,6 +112,10 @@ def main():
                 if device == 'cpu':
                     tf.config.set_visible_devices([], 'GPU')
                 elif device == 'cuda':
+                    if len(args.device_ids) > 1:
+                        raise NotImplementedError(
+                            "Multi-GPU training with TensorFlow is not yet implemented."
+                        )
                     tf.config.set_visible_devices(gpus[0], 'GPU')
                 else:
                     idx = device.split(':')[1]
@@ -152,8 +165,8 @@ def main():
         cfg_dict_model['seed'] = rng
         cfg_dict_pipeline['seed'] = rng
 
-    with open(Path(__file__).parent / 'README.md', 'r') as f:
-        readme = f.read()
+    with open(Path(__file__).parent / 'README.md', 'r') as freadme:
+        readme = freadme.read()
 
     cfg_tb = {
         'readme': readme,
@@ -197,9 +210,10 @@ def cleanup():
     dist.destroy_process_group()
 
 
-def main_worker(rank, Dataset, Model, Pipeline, cfg_dict_dataset,
+def main_worker(local_rank, Dataset, Model, Pipeline, cfg_dict_dataset,
                 cfg_dict_model, cfg_dict_pipeline, args):
-    world_size = len(args.device_ids)
+    rank = args.node_rank * len(args.device_ids) + local_rank
+    world_size = args.nodes * len(args.device_ids)
     setup(rank, world_size, args)
 
     cfg_dict_dataset['rank'] = rank
@@ -211,8 +225,10 @@ def main_worker(rank, Dataset, Model, Pipeline, cfg_dict_dataset,
     cfg_dict_model['seed'] = rng
     cfg_dict_pipeline['seed'] = rng
 
-    device = f"cuda:{args.device_ids[rank]}"
-    print(f"rank = {rank}, world_size = {world_size}, gpu = {device}")
+    device = f"cuda:{args.device_ids[local_rank]}"
+    print(
+        f"local_rank = {local_rank}, rank = {rank}, world_size = {world_size},"
+        f" gpu = {device}")
 
     cfg_dict_model['device'] = device
     cfg_dict_pipeline['device'] = device
diff --git a/scripts/train_scripts/kpconv_kitti.sh b/scripts/train_scripts/kpconv_kitti.sh
@@ -4,7 +4,7 @@
 #SBATCH --gres=gpu:1 
 
 if [ "$#" -ne 2 ]; then
-    echo "Please, provide the the training framework: torch/tf and dataset path"
+    echo "Please, provide the training framework: torch/tf and dataset path"
     exit 1
 fi
 
diff --git a/scripts/train_scripts/kpconv_paris.sh b/scripts/train_scripts/kpconv_paris.sh
@@ -4,7 +4,7 @@
 #SBATCH --gres=gpu:1 
 
 if [ "$#" -ne 2 ]; then
-    echo "Please, provide the the training framework: torch/tf and dataset path"
+    echo "Please, provide the training framework: torch/tf and dataset path"
     exit 1
 fi
 
diff --git a/scripts/train_scripts/kpconv_s3dis.sh b/scripts/train_scripts/kpconv_s3dis.sh
@@ -4,7 +4,7 @@
 #SBATCH --gres=gpu:1 
 
 if [ "$#" -ne 2 ]; then
-    echo "Please, provide the the training framework: torch/tf and dataset path"
+    echo "Please, provide the training framework: torch/tf and dataset path"
     exit 1
 fi
 
diff --git a/scripts/train_scripts/kpconv_semantic3d.sh b/scripts/train_scripts/kpconv_semantic3d.sh
@@ -4,7 +4,7 @@
 #SBATCH --gres=gpu:1 
 
 if [ "$#" -ne 2 ]; then
-    echo "Please, provide the the training framework: torch/tf and dataset path"
+    echo "Please, provide the training framework: torch/tf and dataset path"
     exit 1
 fi
 
diff --git a/scripts/train_scripts/kpconv_toronto.sh b/scripts/train_scripts/kpconv_toronto.sh
@@ -4,7 +4,7 @@
 #SBATCH --gres=gpu:1 
 
 if [ "$#" -ne 2 ]; then
-    echo "Please, provide the the training framework: torch/tf and dataset path"
+    echo "Please, provide the training framework: torch/tf and dataset path"
     exit 1
 fi
 
diff --git a/scripts/train_scripts/pointpillars_kitti.sh b/scripts/train_scripts/pointpillars_kitti.sh
@@ -4,7 +4,7 @@
 #SBATCH --gres=gpu:1 
 
 if [ "$#" -ne 2 ]; then
-    echo "Please, provide the the training framework: torch/tf and dataset path"
+    echo "Please, provide the training framework: torch/tf and dataset path"
     exit 1
 fi
 
diff --git a/scripts/train_scripts/pointpillars_waymo.sh b/scripts/train_scripts/pointpillars_waymo.sh
@@ -0,0 +1,38 @@
+#!/bin/bash
+# Launch this script on each training node, e.g. with SLURM. This example
+# launches 8 workers total: SLURM launches this on 2 nodes and run_pipeline.py
+# launches 4 workers on each node.
+#SBATCH --job-name=objdet_2x4
+#SBATCH --gres=gpu:4
+#SBATCH --cpus-per-task=48      # scale by n_gpus_per_node
+#SBATCH --nodes=2
+#SBATCH --ntasks-per-node=1
+#SBATCH --mem=384G              # scale by n_gpus_per_node
+#SBATCH --output="./slurm-%x-%j-%N.log"
+
+if [ "$#" -ne 2 ]; then
+    echo "Please, provide the training framework: torch/tf and dataset path."
+    exit 1
+fi
+
+# Use launch node as master, if not set
+export MASTER_ADDR=${MASTER_ADDR:-$SLURMD_NODENAME}
+export MASTER_PORT=${MASTER_PORT:-29500}
+# Use all available GPUs, if not set. Must be the same for ALL nodes.
+export DEVICE_IDS=${DEVICE_IDS:-$(nvidia-smi --list-gpus | cut -f2 -d' ' | tr ':\n' ' ')}
+export NODE_RANK=${NODE_RANK:-SLURM_NODEID} # Pass name of env var
+
+echo Started at: $(date)
+pushd ../..
+# Launch on each node
+srun -l python scripts/run_pipeline.py "$1" -c ml3d/configs/pointpillars_waymo.yml \
+    --dataset_path "$2" --pipeline ObjectDetection \
+    --pipeline.num_workers 0 --pipeline.pin_memory False \
+    --pipeline.batch_size 4 --device_ids $DEVICE_IDS \
+    --backend nccl \
+    --nodes $SLURM_JOB_NUM_NODES \
+    --node_rank "$NODE_RANK" \
+    --host "$MASTER_ADDR" --port "$MASTER_PORT"
+
+echo Completed at: $(date)
+popd
diff --git a/scripts/train_scripts/randlanet_kitti.sh b/scripts/train_scripts/randlanet_kitti.sh
@@ -4,7 +4,7 @@
 #SBATCH --gres=gpu:1 
 
 if [ "$#" -ne 2 ]; then
-    echo "Please, provide the the training framework: torch/tf and dataset path"
+    echo "Please, provide the training framework: torch/tf and dataset path"
     exit 1
 fi
 
diff --git a/scripts/train_scripts/randlanet_paris.sh b/scripts/train_scripts/randlanet_paris.sh
@@ -4,7 +4,7 @@
 #SBATCH --gres=gpu:1 
 
 if [ "$#" -ne 2 ]; then
-    echo "Please, provide the the training framework: torch/tf and dataset path"
+    echo "Please, provide the training framework: torch/tf and dataset path"
     exit 1
 fi
 
diff --git a/scripts/train_scripts/randlanet_s3dis.sh b/scripts/train_scripts/randlanet_s3dis.sh
@@ -4,7 +4,7 @@
 #SBATCH --gres=gpu:1 
 
 if [ "$#" -ne 2 ]; then
-    echo "Please, provide the the training framework: torch/tf and dataset path"
+    echo "Please, provide the training framework: torch/tf and dataset path"
     exit 1
 fi
 
diff --git a/scripts/train_scripts/randlanet_semantic3d.sh b/scripts/train_scripts/randlanet_semantic3d.sh
@@ -5,7 +5,7 @@
 
 
 if [ "$#" -ne 2 ]; then
-    echo "Please, provide the the training framework: torch/tf and dataset path"
+    echo "Please, provide the training framework: torch/tf and dataset path"
     exit 1
 fi
 
diff --git a/scripts/train_scripts/randlanet_toronto.sh b/scripts/train_scripts/randlanet_toronto.sh
@@ -4,7 +4,7 @@
 #SBATCH --gres=gpu:1 
 
 if [ "$#" -ne 2 ]; then
-    echo "Please, provide the the training framework: torch/tf and dataset path"
+    echo "Please, provide the training framework: torch/tf and dataset path"
     exit 1
 fi