Each process should probably save its checkpoints to its own directory, e.g. `model_dir/<hvd.local_rank()>/`. I think multiple processes are currently racing to write checkpoints to the same file.
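
Something along these lines might work. This is only a rough sketch: `per_rank_model_dir` is a hypothetical helper, not something in the codebase, and it takes the rank as a plain integer so it can be tried without Horovod installed.

```python
import os


def per_rank_model_dir(model_dir: str, rank: int) -> str:
    """Return (and create) a checkpoint directory private to one process.

    Hypothetical helper: the name and signature are illustrative only.
    """
    path = os.path.join(model_dir, str(rank))
    # exist_ok=True makes this safe to call again after a restart.
    os.makedirs(path, exist_ok=True)
    return path


# Intended use under Horovod (assumes hvd.init() has already been called;
# the TensorFlow binding is an assumption, use the one for your framework):
#   import horovod.tensorflow as hvd
#   ckpt_dir = per_rank_model_dir("model_dir", hvd.local_rank())
```

One caveat: `hvd.local_rank()` is only unique within a single node, so if several nodes write to a shared filesystem, `hvd.rank()` is probably the safer choice for the subdirectory name.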