
CUDA error when increasing number of training epochs #13684

Answered by szha
astonzhang asked this question in Q&A

@ndeepesh this is caused by the same CUDA fork problem we discussed in #18734. The way to solve it is to fork first, before initializing the GPU context. In this example, the fork happens in the data loader inside load_cifar10, and the GPU initialization happens in try_all_gpus. Reordering them should solve the problem.

def train_with_data_aug(train_augs, test_augs, lr=0.001):
    batch_size = 256
    # Build the data loaders first so their worker processes are forked
    # before any CUDA context exists in the parent process.
    train_iter = load_cifar10(True, train_augs, batch_size)
    test_iter = load_cifar10(False, test_augs, batch_size)
    # Only now initialize the GPU context(s) and place the network on them.
    ctx, net = try_all_gpus(), gb.resnet18(10)
    net.initialize(ctx=ctx, init=init.Xavier())
    trainer = gluon.Trainer(net.collect_params(), 'adam',
                            {'learning_rate': lr})
    # ... rest of the training loop as in the original example
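
For anyone hitting this outside the book's helpers, here is a minimal sketch of the same fork-before-GPU-init ordering using plain Gluon APIs. It assumes MXNet with gluon; safe_startup and the dataset/transform choices are illustrative, not part of the original example.

import mxnet as mx
from mxnet import gluon, nd
from mxnet.gluon.data.vision import CIFAR10, transforms

def safe_startup(batch_size=256, num_workers=4):
    # 1) Create the DataLoader first: with num_workers > 0 its worker
    #    processes are forked here, while the parent process still has
    #    no CUDA context.
    dataset = CIFAR10(train=True).transform_first(transforms.ToTensor())
    train_iter = gluon.data.DataLoader(dataset, batch_size, shuffle=True,
                                       num_workers=num_workers)
    # 2) Only now touch the GPU; the CUDA context is created lazily on
    #    first use, i.e. after the workers have already been forked.
    ctx = mx.gpu(0)
    nd.zeros((1,), ctx=ctx).wait_to_read()
    return train_iter, ctx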


This discussion was converted from issue #13684 on October 03, 2020 22:06.