Replies: 1 comment
Hi, if you try a newer container (e.g. >= 24.05) there shouldn't be an issue. If there is, please re-open this ticket. Apologies for the late response.
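For example, pulling and starting a newer framework container could look roughly like this (the `24.05` tag is only an example, so check NGC for the exact tags that are available, and adjust the mounted host path to your setup):

```bash
# Pull a newer NeMo framework container; the tag below is only an example,
# check NGC (nvcr.io) for the exact >= 24.05 tag that is available.
docker pull nvcr.io/nvidia/nemo:24.05

# Start it with GPU access; the mounted host path is a placeholder.
docker run --gpus all -it --rm \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /path/to/your/workspace:/workspace \
  nvcr.io/nvidia/nemo:24.05 bash
```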
I am trying to run an example that looks similar to this one: https://github.com/NVIDIA/NeMo/blob/48b8204d57e59c8790aaa6eaa20384b046b1a574/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py

I am using the Docker container `nvcr.io/nvidia/nemo:24.01.framework` with the `torchrun` command. My initial model is a Mistral 7B converted to NeMo format. Execution looks like the screenshot below (from the L4 run).

When I use an L4 GPU everything is OK (although it eventually ends in a CUDA OOM), and an H100 also works. However, when I switch to an A100 80GB, initialization hangs before checkpoint loading: I never see the blue "Loading distributed checkpoint ..." message that the L4 run prints. Any ideas how to fix this?
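For reference, the launch is roughly of the following form inside the container; the model path, dataset files, and override values below are illustrative placeholders rather than my exact settings:

```bash
# Illustrative single-node launch of the finetuning example inside the container.
# Model path, dataset files, and override values are placeholders, not the exact
# values from the failing run.
torchrun --nproc_per_node=1 \
  examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
  trainer.devices=1 \
  trainer.num_nodes=1 \
  trainer.precision=bf16 \
  model.restore_from_path=/workspace/models/mistral-7b.nemo \
  model.data.train_ds.file_names=[/workspace/data/train.jsonl] \
  model.data.validation_ds.file_names=[/workspace/data/val.jsonl]
```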
