Error when running run_gemini.sh under ColossalAI/examples/language/gpt/gemini/ #3649
Replies: 1 comment
You need a newer Colossal-AI: the demo asserts that the installed version is at least 0.2.0, so upgrade via pip or install from source.
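For reference, the check that fails in the demo is a `packaging.version` comparison. A minimal sketch of the same logic (the installed-version strings below are illustrative, not read from a real environment):

```python
from packaging import version

required = "0.2.0"

# Hypothetical installed versions; anything that parses below 0.2.0
# would trip the demo's assert.
for installed in ("0.1.10", "0.2.5", "0.10.0"):
    ok = version.parse(installed) >= version.parse(required)
    # Note that 0.10.0 compares greater than 0.2.0: these are real
    # version comparisons, not lexicographic string comparisons.
    print(installed, "meets requirement:", ok)
```

This is why the assertion uses `version.parse` rather than comparing the raw strings.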
An error occurred when I tried to run the script run_gemini.sh under ColossalAI/examples/language/gpt/gemini.
Log:
please install Colossal-AI from https://www.colossalai.org/download or from source
Traceback (most recent call last):
  File "/workspace/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 352, in <module>
    main()
  File "/workspace/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 185, in main
    assert version.parse(CAI_VERSION) >= version.parse("0.2.0")
AssertionError
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 61) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
./train_gpt_demo.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time      : 2023-04-26_09:31:39
  host      : c95a3316a474
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 61)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
CMD:
bash run_gemini.sh
Any ideas about this?
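To see which version the failing environment actually has, you can query the installed package metadata directly. A generic sketch using the standard library (not specific to this repo):

```python
from importlib.metadata import PackageNotFoundError
from importlib.metadata import version as pkg_version

try:
    # Reports the version that pip installed for the package.
    print("colossalai:", pkg_version("colossalai"))
except PackageNotFoundError:
    print("colossalai is not installed in this environment")
```

If this prints a version below 0.2.0 (or "not installed"), the assertion in train_gpt_demo.py will fail as shown in the log above.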