Replies: 1 comment
Hi, if you try a newer container (e.g. >= 24.05) there shouldn't be an issue. If there is, please re-open this ticket. Apologies for the late response.
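For example, pulling and starting a newer framework container could look roughly like this (the `24.05` tag is only an example, so check NGC for the exact tags that are available, and adjust the mounted host path to your setup):

```bash
# Pull a newer NeMo framework container; the tag below is only an example,
# check NGC (nvcr.io) for the exact >= 24.05 tag that is available.
docker pull nvcr.io/nvidia/nemo:24.05

# Start it with GPU access; the mounted host path is a placeholder.
docker run --gpus all -it --rm \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /path/to/your/workspace:/workspace \
  nvcr.io/nvidia/nemo:24.05 bash
```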
I am trying to run an example that looks similar to this one: https://github.com/NVIDIA/NeMo/blob/48b8204d57e59c8790aaa6eaa20384b046b1a574/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py

I am using the Docker container `nvcr.io/nvidia/nemo:24.01.framework` with the `torchrun` command. My initial model is a Mistral 7B converted to NeMo format. Execution looks like the screenshot below (from the L4 run).

When I use an L4 GPU everything is OK (although it eventually ends in a CUDA OOM), and an H100 also works. However, when I switch to an A100 80GB, initialization hangs before checkpoint loading: I never see the blue "Loading distributed checkpoint ..." message that the L4 run prints. Any ideas how to fix this?
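For reference, the launch is roughly of the following form inside the container; the model path, dataset files, and override values below are illustrative placeholders rather than my exact settings:

```bash
# Illustrative single-node launch of the finetuning example inside the container.
# Model path, dataset files, and override values are placeholders, not the exact
# values from the failing run.
torchrun --nproc_per_node=1 \
  examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
  trainer.devices=1 \
  trainer.num_nodes=1 \
  trainer.precision=bf16 \
  model.restore_from_path=/workspace/models/mistral-7b.nemo \
  model.data.train_ds.file_names=[/workspace/data/train.jsonl] \
  model.data.validation_ds.file_names=[/workspace/data/val.jsonl]
```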
