In this work I propose a set of experiments on the image captioning task for the Romanian language, starting from an automatically translated and manually revised version of the "Flickr30k" dataset, which I prepared together with two other students. The architectures used in these experiments include Convolutional Networks, Recurrent Networks, and both Vision and Text Transformers. The results are not on par with state-of-the-art work, but one of the proposed models generates reasonably accurate captions.
Dataset available here: https://github.com/dima331453/Flickr30k-Romanian