CUDA error when increasing number of training epochs #13684
-
Description

CUDA error when increasing the number of training epochs.

Environment info (Required)

mxnet-cu90==1.4.0b20181207

Package used (Python/R/Scala/Julia): Python

Error Message:

src/engine/threaded_engine_perdevice.cc:99: Ignore CUDA Error
[05:09:49] /home/ubuntu/mxnet-distro/mxnet-build/3rdparty/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess CUDA: initialization error
Stack trace returned 10 entries:

Minimum reproducible example

http://en.diveintodeeplearning.org/chapter_computer-vision/image-augmentation.html

Steps to reproduce

Copy the example code from the page above into example.py and run it.
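For context, here is a minimal sketch of the failing ordering (my own schematic, not the exact chapter code; batch size, transform, and worker count are placeholders): the GPU context is initialized first, and the DataLoader then forks worker processes that inherit it.

```python
# Schematic of the failing ordering (not the exact d2l chapter code).
import mxnet as mx
from mxnet import gluon
from mxnet.gluon.data.vision import CIFAR10, transforms

# 1. Touching the GPU first initializes the CUDA context in the parent process.
ctx = mx.gpu(0)
_ = mx.nd.zeros((1,), ctx=ctx)

# 2. A DataLoader with num_workers > 0 then forks worker processes, which
#    inherit the initialized CUDA state and eventually fail with
#    "Check failed: e == cudaSuccess CUDA: initialization error".
dataset = CIFAR10(train=True).transform_first(transforms.ToTensor())
train_iter = gluon.data.DataLoader(dataset, batch_size=256,
                                   shuffle=True, num_workers=4)

for epoch in range(10):
    for X, y in train_iter:
        X = X.as_in_context(ctx)
```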
What have you tried to solve it?
Then the error disappears. However, I think this just hides the problem rather than solving it.
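The post does not say what was actually changed, so the following is only a guess at a workaround matching this description (an assumption on my part, reusing the names from the sketch above): disabling DataLoader worker processes avoids the fork entirely, which makes the error go away without fixing the underlying ordering.

```python
# Hypothetical workaround (assumption; the original step is not stated):
# num_workers=0 keeps data loading in the main process, so nothing is forked
# after the CUDA context exists -- the error disappears, but only because the
# problematic fork never happens.
train_iter = gluon.data.DataLoader(dataset, batch_size=256,
                                   shuffle=True, num_workers=0)
```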
Replies: 7 comments 4 replies
-
Hey, this is the MXNet Label Bot.
-
@mxnet-label-bot add [Cuda, Windows]
-
Reproduced the bug on Ubuntu 16.04 with 2× TITAN X and mxnet-cu90==1.5.0b20181218. The error occurs when calling
-
Any resolution on this error?
-
@ndeepesh this is caused by the same CUDA fork problem we discussed in #18734. The way to solve it is to fork first, before initializing the GPU context. In this example, the fork happens in the data loader in `load_cifar10`, and the GPU initialization happens with `try_all_gpus`. Reordering them should solve the problem.
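To make the suggested reordering concrete, here is a minimal sketch (my own schematic of what `load_cifar10` and `try_all_gpus` roughly do, not the d2l chapter's code): build the data loader before anything touches CUDA, then initialize the GPU.

```python
import mxnet as mx
from mxnet import gluon
from mxnet.gluon.data.vision import CIFAR10, transforms

# 1. Build the data loader first. Per the explanation above, the fork of the
#    worker processes happens here, while no CUDA context exists yet.
dataset = CIFAR10(train=True).transform_first(transforms.ToTensor())
train_iter = gluon.data.DataLoader(dataset, batch_size=256,
                                   shuffle=True, num_workers=4)

# 2. Only now touch the GPU (roughly what try_all_gpus does by allocating a
#    small array on each visible device).
ctx = mx.gpu(0)
_ = mx.nd.zeros((1,), ctx=ctx)

# 3. Train as usual; the workers were forked before CUDA was initialized.
for X, y in train_iter:
    X = X.as_in_context(ctx)
```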
-
@szha I am still seeing the same warning messages. (It is not failing the training, though; not sure if this is expected.)
-
@szha Also, will it be an issue if we use a DataLoader like this in Gluon - https://github.com/apache/incubator-mxnet/blob/e297471c45a185d152cad1668dbb62e277fe6d62/python/mxnet/gluon/data/dataloader.py#L307 - where worker processes are created at each epoch (since `_MultiWorkerIter` is returned from the `__iter__` method)? In such cases, forking before CUDA context initialization would be difficult, right?
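As a plain-Python illustration of the concern in this question (a toy sketch, not MXNet's DataLoader and not a maintainer's answer): if a loader forks fresh workers inside `__iter__`, only the first epoch's fork can precede GPU initialization, so fork-before-init cannot be maintained for later epochs.

```python
import multiprocessing as mp

class PerEpochLoader:
    """Toy loader that forks a new worker pool on every __iter__ call."""
    def __init__(self, data, num_workers=2):
        self._data = list(data)
        self._num_workers = num_workers

    def __iter__(self):
        # The fork happens here, once per epoch -- from epoch 1 onward the
        # parent may already hold a CUDA context that the workers inherit.
        with mp.Pool(self._num_workers) as pool:
            yield from pool.imap(abs, self._data)

if __name__ == "__main__":
    loader = PerEpochLoader([-1, 2, -3])
    for epoch in range(3):
        for sample in loader:
            pass
        # ... a GPU training step here would initialize the CUDA context in
        # the parent, affecting every fork in the following epochs ...
```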