Your network computes a probability distribution over the target-language vocabulary, so the whole vector must be passed to the loss, not only its first element. Indexing with [0] makes no sense here and raises a shape-mismatch error at runtime. How could you train your model with this criterion?
loss += criterion(decoder_output[0], target_var[di])
It should be
loss += criterion(decoder_output, target_var[di])
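To make the fix concrete, here is a minimal runnable sketch of one decoding step, assuming an NLLLoss criterion and a decoder that emits log-probabilities of shape (1, vocab_size), as in the standard PyTorch seq2seq tutorial. The tensor values are made up for illustration:

```python
import torch
import torch.nn as nn

vocab_size = 10
criterion = nn.NLLLoss()  # expects (batch, vocab_size) log-probs plus a (batch,) index target

# Stand-in for what the decoder produces at one time step:
# a batch of log-probabilities over the vocabulary, shape (1, vocab_size)
decoder_output = torch.log_softmax(torch.randn(1, vocab_size), dim=1)

# Stand-in for target_var[di]: the gold token index for this step, shape (1,)
target = torch.tensor([3])

# Wrong: decoder_output[0] strips the batch dimension, leaving a 1-D tensor
# that no longer matches the (1,) target -> shape-mismatch error at runtime.
# loss = criterion(decoder_output[0], target)

# Right: pass the whole distribution; the criterion picks out the log-prob
# of the target index itself.
loss = criterion(decoder_output, target)
print(loss.item())
```

Inside the training loop you would accumulate this per-step loss across the target sequence (loss += criterion(decoder_output, target_var[di])) and call backward() once after the loop.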