This repo contains code and weights for A Spitting Image: Modular Superpixel Tokenization in Vision Transformers, accepted for MELEX, ECCVW 2024.
For an introduction to our work, visit the project webpage.
We are working on releasing this package on PyPI. In the meantime, it can be installed directly from GitHub:
# HTTPS
pip install git+https://github.com/dsb-ifi/SPiT.git
# SSH
pip install git+ssh://git@github.com/dsb-ifi/SPiT.git
To load a Superpixel Transformer model, we suggest using the wrapper:
from spit import load_model
model = load_model.load_SPiT_B16(grad=True, pretrained=True)
This will load the model and download the pretrained weights, which are stored in your local torch.hub
directory. If you would rather download the full weights manually, please use:
Model | Link | MD5 |
---|---|---|
SPiT-S16 | Manual Download | 8e899c846a75c51e1c18538db92efddf |
SPiT-S16 (w. grad.) | Manual Download | e49be7009c639c0ccda4bd68ed34e5af |
SPiT-B16 | Manual Download | 9d3483a4c6fdaf603ee6528824d48803 |
SPiT-B16 (w. grad.) | Manual Download | 9394072a5d488977b1af05c02aa0d13c |
ViT-S16 | Manual Download | 73af132e4bb1405b510a5eb2ea74cf22 |
ViT-S16 (w. grad.) | Manual Download | b8e4f1f219c3baef47fc465eaef9e0d4 |
ViT-B16 | Manual Download | ce45dcbec70d61d1c9f944e1899247f1 |
ViT-B16 (w. grad.) | Manual Download | 1caa683ecd885347208b0db58118bf40 |
RViT-S16 | Coming Soon | |
RViT-S16 (w. grad.) | Coming Soon | |
RViT-B16 | Manual Download | 18c13af67d10f407c3321eb1ca5eb568 |
RViT-B16 (w. grad.) | Manual Download | 50d25403adfd5a12d7cb07f7ebfced97 |
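
The MD5 column above can be used to check a manually downloaded file before loading it. The following is a minimal sketch using only the Python standard library; the helper names `md5sum` and `verify_weights` are our own for illustration and are not part of the `spit` package.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_weights(path, expected_md5):
    """Compare a downloaded file against the MD5 value listed in the table."""
    return md5sum(path) == expected_md5.strip().lower()
```

For example, `verify_weights(path, "9d3483a4c6fdaf603ee6528824d48803")` should return `True` for an intact SPiT-B16 download, where `path` is wherever you saved the file.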
We provide a Jupyter notebook as a sandbox for loading, evaluating, and extracting segmentations for the models. Examples will be updated along with new releases and updates for the project repo.
Currently, the code features some slight modifications to streamline use of the RViT models. The original RViT models sampled partitions from a dataset of pre-computed Voronoi tessellations for training and evaluation. This is impractical for deployment, and we have yet to implement a CUDA kernel for computing Voronoi tessellations with lower memory overhead.
However, we have developed a fast implementation for generating tessellations with PCA trees [1], which mimic Voronoi tessellations relatively well and can be computed on the fly. There are, however, still some minor issues with the small-capacity RViT models. Consequently, the RViT-B16 models will perform marginally differently from the results reported in the paper. We appreciate the reader's patience with regard to this matter.
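
To give a rough intuition for how a PCA tree partitions the plane, here is a generic sketch in pure Python: each node splits its points at the median along their direction of largest variance. This is an illustration of the general idea only, not the implementation used in this repo; the function names, the 2-D point representation, and the median split rule are all assumptions.

```python
import math


def principal_axis(points):
    """Unit vector along the direction of largest variance of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points)
    syy = sum((p[1] - my) ** 2 for p in points)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points)
    # Closed-form orientation of the leading eigenvector of a 2x2 scatter matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return math.cos(theta), math.sin(theta)


def pca_tree_labels(points, depth):
    """Label each point with a cell index from recursive median splits."""
    cells = []

    def split(indices, level):
        # Stop at the requested depth or when a cell can no longer be split.
        if level == depth or len(indices) < 2:
            cells.append(indices)
            return
        ax, ay = principal_axis([points[i] for i in indices])
        # Order points by their projection onto the principal axis.
        order = sorted(indices, key=lambda i: points[i][0] * ax + points[i][1] * ay)
        half = len(order) // 2
        split(order[:half], level + 1)
        split(order[half:], level + 1)

    split(list(range(len(points))), 0)
    labels = [0] * len(points)
    for cell_id, members in enumerate(cells):
        for i in members:
            labels[i] = cell_id
    return labels
```

With `depth=d` and enough points, this yields up to `2**d` cells of roughly equal size, each bounded by straight cuts, which is loosely what makes such partitions resemble Voronoi cells.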
Note that the RViT models are inherently stochastic, so different runs can yield different results. SPiT models can also yield slightly different results between runs, due to nondeterministic behaviour in CUDA kernels.
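
Given this run-to-run variation, one reasonable way to compare against reported numbers is to summarize several runs rather than rely on a single one. The sketch below assumes a hypothetical `evaluate(seed)` callable that returns a scalar score; `toy_evaluate` is a stand-in for illustration and does not correspond to anything in this repo.

```python
import random
import statistics


def summarize_runs(evaluate, seeds):
    """Run `evaluate(seed)` once per seed; return (mean, sample std) of the scores."""
    scores = [evaluate(seed) for seed in seeds]
    # The sample standard deviation needs at least two scores.
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return statistics.mean(scores), spread


def toy_evaluate(seed):
    """Hypothetical stand-in for a stochastic evaluation returning a noisy score."""
    rng = random.Random(seed)
    return 0.80 + rng.uniform(-0.01, 0.01)
```

Calling `summarize_runs(toy_evaluate, range(5))` returns the mean and spread over five seeds; in practice, `toy_evaluate` would be replaced by your own evaluation loop.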
[1] Refinements to nearest-neighbor searching in
- Include foundational code and model weights.
- Add manual links with MD5 hash for manual weight download.
- Add module for loading models, and provide example notebook.
- Create temporary solution to on-line Voronoi tessellation.
- Add standalone train and eval scripts.
- Add CUDA kernels for on-line Voronoi tessellations.
- Add example for extracting attribution maps with Att.Flow and Proto.PCA.
- Add example for computing sufficiency and comprehensiveness.
- Add assets for computed attribution maps for XAI experiments.
- Add code and examples for salient segmentation.
- Add code and examples for feature correspondences.
If you find our work useful, please consider citing it:
@inproceedings{Aasan2024,
title={A Spitting Image: Modular Superpixel Tokenization in Vision Transformers},
author={Aasan, Marius and Kolbj\o{}rnsen, Odd and Schistad Solberg, Anne and Ram\'irez Rivera, Ad\'in},
booktitle={{CVF/ECCV} More Exploration, Less Exploitation Workshop ({MELEX} {ECCVW})},
year={2024}
}