Training with pipeline parallelism example #1517
Unanswered
hpc-unex asked this question in Community | Q&A
Hi, Unex
Hello!
I'm trying to train the pipeline parallelism example (https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/pipeline_parallel) with CIFAR10 and ResNet50.
I'm running on a single node with 2 GPUs, but something seems to be wrong with the execution: the accuracy stays at around 10% from epoch 0 onward. I'm using the resnet.py file from the repository, with the only change being that the processes are launched with MPI:
import os
import colossalai

# Launch ColossalAI with the MPI backend, taking rank information from the
# Open MPI environment variables set by mpirun.
colossalai.launch(config=CONFIG,
                  host=None,
                  port=None,
                  backend='mpi',
                  rank=int(os.environ['OMPI_COMM_WORLD_RANK']),
                  world_size=int(os.environ['OMPI_COMM_WORLD_SIZE']),
                  local_rank=int(os.environ['OMPI_COMM_WORLD_LOCAL_RANK']),
                  seed=opt.manualSeed)
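For reference, ColossalAI also ships a launch_from_openmpi helper that reads the same OMPI_COMM_WORLD_* variables internally; below is a minimal sketch of using it, assuming that helper is available in the installed version. The host/port values are placeholders for the master address, and CONFIG is the pipeline config from the example.

# Sketch only: assumes the installed ColossalAI exposes launch_from_openmpi;
# 'localhost'/29500 are placeholder values for the torch.distributed
# initialisation address, not values taken from the example.
import colossalai

colossalai.launch_from_openmpi(config=CONFIG,
                               host='localhost',
                               port=29500,
                               backend='mpi',
                               seed=opt.manualSeed)

Either way, the script is started under Open MPI with something like mpirun -np 2 python resnet.py.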
This launch configuration has been tested with data and model parallelism and works correctly. Any ideas? Has anyone tested that example?
Thanks!