This repository contains notebooks for testing a pre-trained Vision Transformer and a pre-trained XResnet50 on the ImageWoof dataset, using the Adam and Ranger optimizers.
The objective of this project is to clearly compare the performance of a pre-trained Vision Transformer (here, ViT-Large) and a pre-trained XResnet50 when they are fine-tuned with different optimizers (here, Adam and Ranger) on the ImageWoof dataset. This project is aimed at people who want to use state-of-the-art pre-trained vision models but have limited computational resources and depend on online environments such as Google Colab.
NB: For this project, ViT-Large was used because I (Prakash Pandey) wanted a state-of-the-art model, and because training the ViT-Huge model threw a 'CUDA Out of Memory' error even with batch size = 1 on Google Colab. ViT-Large was therefore the 'deepest' vision transformer that could be trained on Google Colab.
The ImageWoof dataset is a subset of 10 classes from Imagenet that aren't so easy to classify, since they're all dog breeds. The breeds are: Australian terrier, Border terrier, Samoyed, Beagle, Shih-Tzu, English foxhound, Rhodesian ridgeback, Dingo, Golden retriever, Old English sheepdog.
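Below is a minimal fastai sketch of one way to download ImageWoof and build the dataloaders; the image size and batch size are placeholder values and not necessarily the ones used in the notebooks.

```python
from fastai.vision.all import *

# Download ImageWoof (fastai hosts it at URLs.IMAGEWOOF).
path = untar_data(URLs.IMAGEWOOF)

# The archive ships 'train' and 'val' folders, with one sub-folder per breed.
dls = ImageDataLoaders.from_folder(
    path, train='train', valid='val',
    item_tfms=Resize(224),                          # 224x224 crops suit both models
    batch_tfms=Normalize.from_stats(*imagenet_stats),
    bs=16,                                          # small batch size for Colab GPU memory
)
```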
The Vision Transformer, introduced here, has several variants, and ViT-Large is one of them: it comprises 24 layers and 307M parameters. For this project, I have used a pre-trained ViT-Large model.
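One common way to obtain a pre-trained ViT-Large is through the timm library; the sketch below is only an illustration of that approach, and the exact checkpoint name used in the notebooks may differ.

```python
import timm

# Pre-trained ViT-Large (24 transformer layers, ~307M parameters),
# with a fresh 10-way classification head for the ImageWoof classes.
model = timm.create_model('vit_large_patch16_224', pretrained=True, num_classes=10)
print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")
```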
This is a pre-trained Resnet50 model with tricks from the 'Bag of Tricks for Image Classification with Convolutional Neural Networks' paper. A few other tricks are used as well (see the sketch after this list):
- Mish - A new activation function that has shown fantastic results
- Self-Attention - Bringing in ideas from GANs into image classification
- MaxBlurPool - Better generalization
- Flatten + Anneal Scheduling - a flat learning rate followed by cosine annealing (Mikhail Grankin)
- Label Smoothing Cross Entropy - soft targets (were you close?) rather than a hard yes or no
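The sketch below shows how these tricks can be wired together with fastai; it is an illustration only. The epoch count and learning rate are placeholders, the pre-trained weight loading and the MaxBlurPool swap done in the notebooks are omitted, and `dls` refers to the ImageWoof dataloaders from the sketch above.

```python
from fastai.vision.all import *

# XResNet-50 with two of the tricks fastai exposes directly:
# Mish activation and self-attention.
model = xresnet50(act_cls=Mish, sa=True, n_out=10)

learn = Learner(
    dls, model,                                  # `dls` built from ImageWoof as sketched above
    loss_func=LabelSmoothingCrossEntropy(),      # soft "were you close" targets, not hard 0/1
    metrics=accuracy,
)

# "Flatten + Anneal" schedule: hold the learning rate flat, then cosine-anneal it.
learn.fit_flat_cos(5, lr=1e-3)
```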
Adam is a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement. The method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients; the name Adam is derived from adaptive moment estimation.
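For reference, here is a minimal sketch of the update rule described above, using the default hyper-parameters from the Adam paper; the notebooks themselves rely on the library implementation rather than hand-written code.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: per-parameter step sizes from running moment estimates."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean of the gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (uncentred variance)
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```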
Ranger is an optimizer based on two separate papers (see the sketch after this list):
- On the Variance of the Adaptive Learning Rate and Beyond (RAdam)
- Lookahead Optimizer: k steps forward, 1 step back
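fastai ships a `ranger` optimizer function that combines exactly these two ideas (RAdam wrapped in Lookahead), so switching optimizers is a one-argument change on the Learner. The sketch below assumes the `dls` and `model` objects from the earlier sketches; the epoch count and learning rate are placeholders.

```python
from fastai.vision.all import *

# Same training setup as before, but with Ranger (RAdam + Lookahead)
# instead of the default Adam.
learn = Learner(
    dls, model,
    opt_func=ranger,
    loss_func=LabelSmoothingCrossEntropy(),
    metrics=accuracy,
)
learn.fit_flat_cos(5, lr=1e-3)
```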
After fine-tuning on the ImageWoof dataset, the models achieved the following accuracies:

| Model (pre-trained) | Adam | Ranger |
| --- | --- | --- |
| ViT-Large | 81.29% | 27.28% |
| XResnet50 | 34.69% | 43.72% |

In short, the pre-trained ViT-Large model performed far better when fine-tuned with Adam than with Ranger, whereas the pre-trained XResnet50 model performed better with Ranger than with Adam.