No module named 'titans' #1455
Unanswered
pidan-bilibili
asked this question in
Community | Q&A
Replies: 1 comment
-
pip install titans |
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
-
I followed instructions on the tutorials, and tried to train ResNet(on google colab), and it shows the following error.
/content/ColossalAI/ColossalAI-Examples/image/resnet# colossalai run --nproc_per_node 1 train.py
Colossalai should be built with cuda extension to use the FP16 optimizer
If you want to activate cuda mode for MoE, please install with cuda_ext!
Colossalai should be built with cuda extension to use the FP16 optimizer
If you want to activate cuda mode for MoE, please install with cuda_ext!
Traceback (most recent call last):
File "train.py", line 15, in
from titans.utils import barrier_context
ModuleNotFoundError: No module named 'titans'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 819) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 755, in run
)(*cmd_args)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2022-08-15_00:42:03
host : 22c9ca7d0d85
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 819)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Error: failed to run torchrun --nproc_per_node=1 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train.py on 127.0.0.1
Beta Was this translation helpful? Give feedback.
All reactions