There is a sentence in your paper:

> We train our models using stochastic gradient descent (SGD) with 0.9 **Nesterov momentum** and 10^-4 weight decay.

However, at line 77 of `train_imagenet.py`, `nesterov=True` is not passed to `torch.optim.SGD()`, so the default (`nesterov=False`) applies and plain momentum is used. Was Nesterov momentum actually used for the ImageNet models?
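To make the difference concrete, here is a minimal sketch of a single SGD update with and without the Nesterov flag, following the update rule as I understand `torch.optim.SGD` to apply it (with dampening 0). It is pure Python on scalars rather than the actual training code; `sgd_step` and its learning rate are hypothetical, chosen only for illustration.

```python
def sgd_step(p, grad, buf, lr=0.1, momentum=0.9, weight_decay=1e-4, nesterov=False):
    """One scalar SGD step; returns the updated parameter and momentum buffer.

    Hypothetical helper for illustration, not code from train_imagenet.py.
    """
    # L2 weight decay is folded into the gradient.
    g = grad + weight_decay * p
    # The momentum buffer accumulates past gradients.
    buf = momentum * buf + g
    # Classical momentum steps along the buffer; Nesterov adds a look-ahead term.
    step = g + momentum * buf if nesterov else buf
    return p - lr * step, buf


# Same starting point, only the flag differs (weight decay off for clarity).
p_plain, _ = sgd_step(1.0, 1.0, 0.0, weight_decay=0.0, nesterov=False)
p_nesterov, _ = sgd_step(1.0, 1.0, 0.0, weight_decay=0.0, nesterov=True)
print(p_plain, p_nesterov)
```

The two variants already diverge on the first step, so whether the flag was set matters for reproducing the reported results. If Nesterov momentum was intended, I would expect the optimizer call to include `momentum=0.9, weight_decay=1e-4, nesterov=True`.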