@nkvetsinski
Contributor

Issue #, if available:

Description of changes:

Noticed that Neuron tests were failing:

torch.distributed.run: [WARNING] *****************************************
torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
torch.distributed.run: [WARNING] *****************************************
F external/xla/xla/parse_flags_from_env.cc:224] Unknown flags in XLA_FLAGS: --xla_gpu_simplify_all_fp_conversions=false --xla_gpu_force_compilation_parallelism=8
F external/xla/xla/parse_flags_from_env.cc:224] Unknown flags in XLA_FLAGS: --xla_gpu_simplify_all_fp_conversions=false --xla_gpu_force_compilation_parallelism=8
torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 12) of binary: /usr/local/bin/python3.10
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
===================================================
tests/testNeuronSingleAllReduce.py FAILED
Failures:
[1]:
time : 2024-09-18_02:41:46
host : neuronx-single-node-8q4fw
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 13)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 13
---------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-09-18_02:41:46
host : neuronx-single-node-8q4fw
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 12)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 12
===================================================

I looked at a coredump from one of the runs, which pointed toward updating the SDK. The tests pass with the versions from this PR.
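
For anyone triaging a similar failure, the fatal lines in the log above already name the flags that the XLA build rejected. The following is a minimal sketch, not part of this PR, of pulling those flags out of captured log text. The helper name `unknown_xla_flags` and the regular expression are hypothetical and assume the message format shown in the log.

```python
import re

# Assumed format of the fatal line, copied from the log in this PR description.
UNKNOWN_FLAGS_RE = re.compile(r"Unknown flags in XLA_FLAGS:\s*(.+)$")


def unknown_xla_flags(log_text):
    """Return the distinct flags reported as unknown, in first-seen order."""
    seen = []
    for line in log_text.splitlines():
        match = UNKNOWN_FLAGS_RE.search(line)
        if not match:
            continue
        # The rejected flags are whitespace-separated after the colon.
        for flag in match.group(1).split():
            if flag not in seen:
                seen.append(flag)
    return seen


# Sample input: the fatal line from the log above (it appears once per rank).
sample = (
    "F external/xla/xla/parse_flags_from_env.cc:224] Unknown flags in XLA_FLAGS: "
    "--xla_gpu_simplify_all_fp_conversions=false "
    "--xla_gpu_force_compilation_parallelism=8\n"
) * 2

print(unknown_xla_flags(sample))
```

Duplicated lines from multiple ranks collapse into one entry per flag, which makes it easier to compare the rejected flags against the SDK versions in use.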

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@cartermckinnon cartermckinnon merged commit b35d508 into aws:main Sep 20, 2024
5 checks passed