This page lists the MaskAlign model weights. CLIP-L/14* denotes CLIP-L/14 with 196 × 196 input images, which keeps the teacher's feature map the same size as the student model's. PT epochs and FT Acc. denote the number of pre-training epochs and the fine-tuning accuracy on ImageNet-1K, respectively.
Model | Teacher Model | PT epochs | Link | FT Acc. (%) |
---|---|---|---|---|
ViT-B/16 | CLIP-B/16 | 200 | gdrive | 85.4 |
ViT-L/16 | CLIP-B/16 | 200 | gdrive | 86.5 |
ViT-L/16 | CLIP-L/14* | 200 | gdrive | 87.4 |
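
The feature-map matching behind CLIP-L/14* can be checked with a little arithmetic. The sketch below is illustrative only: it assumes the student ViT takes 224 × 224 inputs (this page does not state the student resolution), and `patch_grid` is a hypothetical helper, not part of the MaskAlign code.

```python
def patch_grid(image_size: int, patch_size: int) -> int:
    """Return the number of patches along one side of a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return image_size // patch_size


# Assumption: the student (ViT-B/16 or ViT-L/16) is fed 224 x 224 images.
student_side = patch_grid(224, 16)      # patches per side for a /16 student

# CLIP-L/14 at 196 x 196, i.e. the CLIP-L/14* setting from the table.
teacher_side_196 = patch_grid(196, 14)

# CLIP-L/14 at 224 x 224, for comparison.
teacher_side_224 = patch_grid(224, 14)

print(student_side, teacher_side_196, teacher_side_224)
```

Under that assumption, the student and the 196 × 196 teacher both produce the same patch grid, while a 224 × 224 input to CLIP-L/14 would give a larger grid that no longer lines up token-for-token with the student's.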