This repo contains practice code on vision transformers, taken from different sources but with some customizations to aid understanding. It primarily contains Google Colab notebooks, so it can be used by anyone who wants to quickly fine-tune a pre-trained ViT model.
For more information on vision transformers, please refer to the original paper: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (https://arxiv.org/abs/2010.11929). The repo currently has a single .ipynb notebook, which shows how to use a custom patch embedding layer with a pre-trained ViT on the CIFAR-10 dataset.
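For illustration, here is a minimal sketch of what swapping in a custom patch embedding layer might look like, assuming PyTorch and the timm library; the class name `CustomPatchEmbed` and all parameter values are illustrative and may differ from the notebook's actual code.

```python
import torch
import torch.nn as nn
import timm

class CustomPatchEmbed(nn.Module):
    """Splits an image into non-overlapping patches and linearly embeds each one."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to slicing out patches
        # and applying a shared linear projection to each.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                      # (B, embed_dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, embed_dim)

# Load a pre-trained ViT with a fresh 10-class head for CIFAR-10,
# then replace its patch embedding with the custom layer.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=10)
model.patch_embed = CustomPatchEmbed()

# CIFAR-10 images (32x32) would be resized to 224x224 before the forward pass.
dummy = torch.randn(1, 3, 224, 224)
print(model(dummy).shape)  # torch.Size([1, 10])
```

Note that the swapped-in embedding layer is randomly initialized, so it would typically be fine-tuned along with (or instead of) the rest of the pre-trained backbone.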
Feel free to ask me questions or point out errors.