Input branches

Hi, it seems that your models have two input branches, one for the words and one for the image descriptor. In the paper, however, there is a single input: the image descriptor is fed to the LSTM at the first time step, and the (embedded) words are fed at the following time steps.
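To make the difference concrete, here is a minimal sketch of how I understand the single-input setup. It is plain Python and not tied to this repo or any framework. All names (`project`, `build_single_input_sequence`, `W_img`, `W_emb`) and all sizes are hypothetical, and I am assuming the image descriptor is linearly projected to the word-embedding size so that it can occupy the first time step.

```python
import random


def project(vector, matrix):
    """Multiply a length-n vector by an n x m matrix given as nested lists."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]


def build_single_input_sequence(image_descriptor, word_ids, W_img, W_emb):
    """Return one sequence for the LSTM: image step first, then word steps."""
    # Step 0: the image descriptor, projected to the embedding size
    # (assumption: a linear projection is used to match dimensions).
    image_step = project(image_descriptor, W_img)
    # Steps 1..T: embedding lookup for each word id.
    word_steps = [W_emb[i] for i in word_ids]
    return [image_step] + word_steps


random.seed(0)
feat_dim, embed_dim, vocab_size = 8, 5, 10  # hypothetical sizes

W_img = [[random.gauss(0, 1) for _ in range(embed_dim)] for _ in range(feat_dim)]
W_emb = [[random.gauss(0, 1) for _ in range(embed_dim)] for _ in range(vocab_size)]
image_descriptor = [random.gauss(0, 1) for _ in range(feat_dim)]
word_ids = [1, 5, 3]

sequence = build_single_input_sequence(image_descriptor, word_ids, W_img, W_emb)
print(len(sequence), len(sequence[0]))  # 4 5
```

With three words the LSTM would see four time steps of the same width, so there is no separate image branch to merge later.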