
ViT-Large vs XResnet50 (using Adam and Ranger optimizers)

This repository contains notebooks for testing a pre-trained Vision Transformer and a pre-trained XResnet50 on the ImageWoof dataset, using the Adam and Ranger optimizers.

Acknowledgements

  1. The Walk with fastai course on ImageWoof
  2. Ross Wightman's repository for vision transformer
  3. fastai

The Objective

The objective of this project is a clear comparison of the performance of a pre-trained Vision Transformer (here, ViT-Large) and a pre-trained XResnet50 when they are fine-tuned with different optimizers (here, Adam and Ranger) on the ImageWoof dataset. This project is helpful to people who want to use state-of-the-art pre-trained vision models but have limited computational resources and depend on online environments such as Google Colab.

NB: ViT-Large was used because I (Prakash Pandey) wanted a state-of-the-art model, and because training the ViT-Huge model threw a 'CUDA out of memory' error on Google Colab even with a batch size of 1. So ViT-Large turned out to be the 'deepest' vision transformer that could be trained on Google Colab.

The ImageWoof dataset

The ImageWoof dataset is a subset of 10 classes from Imagenet that aren't so easy to classify, since they're all dog breeds. The breeds are: Australian terrier, Border terrier, Samoyed, Beagle, Shih-Tzu, English foxhound, Rhodesian ridgeback, Dingo, Golden retriever, Old English sheepdog.
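
For reference, here is a minimal sketch (not the repository's exact notebook code) of loading ImageWoof with fastai; the image size and batch size below are assumptions:

```python
# A minimal sketch of loading ImageWoof with fastai; image size and batch
# size are assumptions, not the repository's exact settings.
from fastai.vision.all import *

path = untar_data(URLs.IMAGEWOOF)    # downloads the 10-class dog-breed subset of ImageNet

dls = ImageDataLoaders.from_folder(
    path, train='train', valid='val',
    item_tfms=Resize(224),           # assumed input size (matches ViT-Large's 224x224 input)
    batch_tfms=aug_transforms(),     # fastai's default augmentations
    bs=32,                           # assumed batch size
)
dls.show_batch(max_n=9)
```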

The Models

ViT-Large:

The Vision Transformer, introduced here, has several variants, and ViT-Large is one of them: it comprises 24 layers and about 307M parameters. For this project, I used a pre-trained ViT-Large model.
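
As a rough sketch of how such a model can be created and fine-tuned with fastai (the exact timm model variant, learning rate, and epoch count are assumptions, not the notebooks' settings):

```python
# A minimal sketch, assuming Ross Wightman's timm library; the exact ViT-Large
# variant, learning rate, and epoch count are assumptions.
import timm
from fastai.vision.all import Learner, Adam, accuracy

vit = timm.create_model('vit_large_patch16_224', pretrained=True, num_classes=10)
learn = Learner(dls, vit, opt_func=Adam, metrics=accuracy)   # dls from the ImageWoof sketch above
learn.fit_one_cycle(5, 1e-4)                                 # assumed schedule and learning rate
```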

XResnet50:

This is a pre-trained Resnet50 model with modifications based on the Bag of Tricks paper. A few other tricks are used as well (see the sketch after this list):

  1. Mish - A new activation function that has shown fantastic results
  2. Self-Attention - Bringing in ideas from GAN's into image classification
  3. MaxBlurPool - Better generalization
  4. Flatten + Anneal Scheduling - Mikhail Grankin
  5. Label Smoothing Cross Entropy - soft targets ('were you close?') rather than a hard yes or no
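
Here is a minimal sketch of an XResnet50 with some of these tricks, using fastai's built-in pieces; MaxBlurPool and pre-trained weight loading are omitted, and the hyperparameters are assumptions rather than the notebooks' settings:

```python
# A minimal sketch using fastai's xresnet50 with Mish, self-attention,
# label smoothing, and a flat-then-anneal schedule; MaxBlurPool is omitted
# and the hyperparameters are assumptions. The notebooks use a pre-trained
# model; weight loading is not shown in this sketch.
from fastai.vision.all import *

model = xresnet50(n_out=10, act_cls=Mish, sa=True)   # Mish activation + self-attention layer
learn = Learner(
    dls, model,                                      # dls from the ImageWoof sketch above
    loss_func=LabelSmoothingCrossEntropy(),          # soft targets instead of hard 0/1 labels
    metrics=accuracy,
)
learn.fit_flat_cos(5, 1e-3)                          # flat LR followed by cosine annealing
```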

The Optimizers

Adam:

Adam is a method for efficient stochastic optimization that only requires first-order gradients and has small memory requirements. The method computes individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients; the name Adam is derived from adaptive moment estimation.
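
To make 'adaptive moment estimation' concrete, here is a toy, single-parameter version of the update rule (an illustration only, not library code); the hyperparameter values are the defaults suggested in the Adam paper:

```python
# Toy, single-parameter illustration of Adam's update rule (not library code);
# lr, beta1, beta2, and eps are the defaults suggested in the Adam paper.
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```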

Ranger:

Ranger is an optimizer that combines two separate papers (see the usage sketch after this list):

  1. On the Variance of the Adaptive Learning Rate and Beyond (RAdam)
  2. Lookahead Optimizer: k steps forward, 1 step back
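
fastai ships a `ranger` convenience function (RAdam wrapped in Lookahead), so a usage sketch looks like the following; the learning rate and epoch count are assumptions:

```python
# A minimal sketch using fastai's built-in `ranger` (RAdam wrapped in Lookahead);
# learning rate and epoch count are assumptions.
from fastai.vision.all import *

learn = Learner(dls, model, opt_func=ranger, metrics=accuracy)
learn.fit_flat_cos(5, 4e-3)   # Ranger is typically paired with a flat-then-anneal schedule
```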

The Results

1. Model-based comparison:

A. Using Adam:

The pre-trained ViT-Large model achieved an accuracy of 81.29%, whereas the pre-trained XResnet50 model achieved 34.69%, both on the ImageWoof dataset.

B. Using Ranger:

The pre-trained ViT-Large model achieved an accuracy of 27.28%, whereas the pre-trained XResnet50 model achieved 43.72%, both on the ImageWoof dataset.

2. Optimizer-based comparison:

A. Using the pre-trained ViT-Large model:

The pre-trained ViT-Large model achieved an accuracy of 81.29% with Adam, but only 27.28% with Ranger, both on the ImageWoof dataset.

B. Using the pre-trained XResnet50 model:

The pre-trained XResnet50 model achieved an accuracy of 34.69% with Adam, but 43.72% with Ranger, both on the ImageWoof dataset.

Clearly, the best combination of model and optimizer tested here is ViT-Large + Adam.
