
CUDA error when increasing number of training epochs #13684

Answered by szha
astonzhang asked this question in Q&A

@ndeepesh this is caused by the same CUDA fork problem we discussed in #18734. The way to solve it is to fork first, before initializing the GPU context. In this example, the fork happens in the data loader inside load_cifar10, and the GPU initialization happens in try_all_gpus. Reordering them should solve the problem.

def train_with_data_aug(train_augs, test_augs, lr=0.001):
    batch_size = 256
    # Build the data loaders first so their worker processes are forked
    # before any CUDA context exists in the parent process.
    train_iter = load_cifar10(True, train_augs, batch_size)
    test_iter = load_cifar10(False, test_augs, batch_size)
    # Only now initialize the GPU context(s) and place the network on them.
    ctx, net = try_all_gpus(), gb.resnet18(10)
    net.initialize(ctx=ctx, init=init.Xavier())
    trainer = gluon.Trainer(net.collect_params(), 'adam',
                            {'learning_rate': lr})
    # ... rest of the training loop as in the original example
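
For anyone hitting this outside the book's helpers, here is a minimal sketch of the same fork-before-GPU-init ordering using plain Gluon APIs. It assumes MXNet with gluon; safe_startup and the dataset/transform choices are illustrative, not part of the original example.

import mxnet as mx
from mxnet import gluon, nd
from mxnet.gluon.data.vision import CIFAR10, transforms

def safe_startup(batch_size=256, num_workers=4):
    # 1) Create the DataLoader first: with num_workers > 0 its worker
    #    processes are forked here, while the parent process still has
    #    no CUDA context.
    dataset = CIFAR10(train=True).transform_first(transforms.ToTensor())
    train_iter = gluon.data.DataLoader(dataset, batch_size, shuffle=True,
                                       num_workers=num_workers)
    # 2) Only now touch the GPU; the CUDA context is created lazily on
    #    first use, i.e. after the workers have already been forked.
    ctx = mx.gpu(0)
    nd.zeros((1,), ctx=ctx).wait_to_read()
    return train_iter, ctx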


This discussion was converted from issue #13684 on October 03, 2020 22:06.