Telling Left from Right identifies the problem of geometry-aware semantic correspondence and analyzes the geometric understanding of the pretrained features of foundation models, i.e., Stable Diffusion and DINOv2. We propose several techniques to improve it with a lightweight post-processor and achieve state-of-the-art performance on multiple benchmarks, e.g., 85.6 PCK@0.10 on SPair-71k. We also introduce a large-scale and challenging semantic correspondence benchmark based on the AP-10k dataset, for both pretraining and evaluation.
This repository is the official implementation of the paper:
Telling Left from Right: Identifying Geometry-Aware Semantic Correspondence Junyi Zhang, Charles Herrmann, Junhwa Hur, Eric Chen, Varun Jampani, Deqing Sun*, Ming-Hsuan Yang* CVPR, 2024. [Project Page] [Paper]
To install the required dependencies, use the following commands:
conda create -n geo-aware python=3.9
conda activate geo-aware
conda install pytorch=1.13.1 torchvision=0.14.1 pytorch-cuda=11.6 -c pytorch -c nvidia
conda install -c "nvidia/label/cuda-11.6.1" libcusolver-dev
git clone git@github.com:Junyi42/GeoAware-SC.git
cd GeoAware-SC
pip install -e .
PS: There are some common issues when installing Mask2Former. You might find this issue helpful if you encounter any problems.
(Optional) You may want to install xformers for an efficient transformer implementation, which can significantly reduce VRAM consumption:
pip install xformers==0.0.16
(Optional) You may also want to install SAM to extract the instance masks for the adaptive pose alignment technique:
pip install git+https://github.com/facebookresearch/segment-anything.git
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
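After installation, it can help to check which of the required and optional packages are actually importable before starting a long run. The following is a minimal sketch using only the standard library; the import names `xformers` and `segment_anything` are assumed to match the packages installed above, and `check_packages` is a hypothetical helper, not part of this repository.

```python
import importlib.util


def check_packages(names):
    """Return {module_name: bool}, True when the top-level module can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Import names are assumptions based on the install commands above.
status = check_packages(["torch", "torchvision", "xformers", "segment_anything"])
for name, found in status.items():
    print(f"{name:20s} {'found' if found else 'MISSING'}")
```

A missing optional package (xformers or SAM) only disables the corresponding optional feature; a missing `torch` or `torchvision` means the environment setup did not complete.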
We provide the scripts to download the datasets in the data folder. To download specific datasets, use the following commands:
- SPair-71k:
bash data/prepare_spair.sh
- PF-Pascal:
bash data/prepare_pfpascal.sh
- AP-10k:
First, download the original dataset with the following command:
bash data/prepare_ap10k.sh
Then, you could preprocess the dataset to get the correspondence benchmark by running the notebook prepare_ap10k.ipynb.
We also provide sample visualizations of the benchmark in the last cell of the notebook.
To enable efficient training and evaluation, we pre-extract the feature maps of the datasets. (Note this step is not necessary if you only want to run the demo.) To do so, run the following commands:
- SPair-71k:
python preprocess_map.py ./data/SPair-71k/JPEGImages
Substitute the path ./data/SPair-71k/JPEGImages with ./data/PF-Pascal/JPEGImages or ./data/AP-10k/JPEGImages for the PF-Pascal and AP-10k datasets, respectively.
For the SPair-71k dataset, it takes roughly 2 hours to extract the feature maps (for both the original and flipped images) on a single RTX 3090 GPU, and consumes around 90GB of disk space.
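Because feature extraction is slow and disk-hungry, it may be convenient to queue all three datasets from one script and check free space first. The sketch below is an assumption-laden convenience wrapper, not part of the repository: `build_commands`, `free_gb`, and `run_all` are hypothetical helpers, and the 90 GB check reuses the SPair-71k figure quoted above (the requirements for the other datasets are not stated).

```python
import shutil
import subprocess
import sys

# Dataset image folders named in the instructions above.
IMAGE_DIRS = [
    "./data/SPair-71k/JPEGImages",
    "./data/PF-Pascal/JPEGImages",
    "./data/AP-10k/JPEGImages",
]


def build_commands(image_dirs, script="preprocess_map.py"):
    """Build one `python preprocess_map.py <dir>` command per dataset folder."""
    return [[sys.executable, script, d] for d in image_dirs]


def free_gb(path="."):
    """Free disk space at `path` in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def run_all(image_dirs, dry_run=True, min_free_gb=90):
    """Print (and, unless dry_run, execute) the extraction command for each dataset."""
    if free_gb(".") < min_free_gb:
        print(f"Warning: less than {min_free_gb} GB free on this disk.")
    for cmd in build_commands(image_dirs):
        print("Command:", " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stops at the first failing dataset
```

Calling `run_all(IMAGE_DIRS)` only prints the commands; pass `dry_run=False` from the repository root to actually run them.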
For the default adaptive pose alignment method which requires the source instance mask, we also pre-extract the masks of the dataset for efficiency. To do so, run the following commands:
- SPair-71k:
python preprocess_mask_sam.py ./data/SPair-71k/JPEGImages
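To illustrate why the source instance mask is useful for pose alignment, here is a toy sketch of the general idea: among several pose variants of the source (e.g., original and flipped), pick the one whose masked descriptors are, on average, closest to their nearest neighbors in the target. This is a simplified reading of the technique described in the paper, not the repository's implementation; all function names are hypothetical, and descriptors are assumed to be L2-normalized.

```python
def cosine_distance(a, b):
    """1 - dot product; assumes both vectors are already L2-normalized."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def masked_matching_distance(src_desc, src_mask, tgt_desc):
    """Mean nearest-neighbor distance from masked source descriptors to the target."""
    dists = [
        min(cosine_distance(s, t) for t in tgt_desc)
        for s, keep in zip(src_desc, src_mask)
        if keep  # only descriptors inside the instance mask contribute
    ]
    return sum(dists) / len(dists)


def pick_pose_variant(variants, tgt_desc):
    """variants: {name: (descriptors, mask)}; return the name with the smallest distance."""
    return min(
        variants,
        key=lambda name: masked_matching_distance(*variants[name], tgt_desc),
    )


# Toy data: 2-D unit descriptors; the third source location is background (mask False).
target = [[0.0, 1.0], [0.6, 0.8]]
variants = {
    "original": ([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], [True, True, False]),
    "flipped": ([[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]], [True, True, False]),
}
print(pick_pose_variant(variants, target))  # the flipped variant matches the target better
```

In practice the descriptors of each variant would come from running the feature extractor on the transformed image, and the mask from the pre-extracted SAM output.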
We provide the implementation of extracting and visualizing the post-processed features (with an interactive demo!) in the get_processed_feat.ipynb file. It includes example image pairs from both dataset images and your own images.
Extracting the post-processed features can be as simple as running the following script:
import torch
import torch.nn.functional as F
from PIL import Image
from utils.utils_correspondence import resize
from model_utils.extractor_sd import load_model, process_features_and_mask
from model_utils.extractor_dino import ViTExtractor
from model_utils.projection_network import AggregationNetwork
# load the pretrained weights
num_patches = 60
sd_model, sd_aug = load_model(diffusion_ver='v1-5', image_size=num_patches*16, num_timesteps=50)
extractor_vit = ViTExtractor('dinov2_vitb14', stride=14, device='cuda')
aggre_net = AggregationNetwork(feature_dims=[640,1280,1280,768], projection_dim=768, device='cuda')
aggre_net.load_pretrained_weights(torch.load('results_spair_cvpr/best_856.PTH'))
def get_processed_features(sd_model, sd_aug, aggre_net, extractor_vit, num_patches, img):
    # extract stable diffusion features
    img_sd_input = resize(img, target_res=num_patches*16, resize=True, to_pil=True)
    features_sd = process_features_and_mask(sd_model, sd_aug, img_sd_input, mask=False, raw=True)
    del features_sd['s2']
    # extract dinov2 features
    img_dino_input = resize(img, target_res=num_patches*14, resize=True, to_pil=True)
    img_batch = (extractor_vit.preprocess_pil(img_dino_input)).cuda()
    features_dino = extractor_vit.extract_descriptors(img_batch, layer=11, facet='token')
    features_dino = features_dino.permute(0, 1, 3, 2).reshape(1, -1, num_patches, num_patches)
    # aggregate the features and apply post-processing
    desc_gathered = torch.cat([
        features_sd['s3'],
        F.interpolate(features_sd['s4'], size=(num_patches, num_patches), mode='bilinear', align_corners=False),
        F.interpolate(features_sd['s5'], size=(num_patches, num_patches), mode='bilinear', align_corners=False),
        features_dino], dim=1)
    desc = aggre_net(desc_gathered) # 1, 768, 60, 60
    # normalize the descriptors
    norms_desc = torch.linalg.norm(desc, dim=1, keepdim=True)
    desc = desc / (norms_desc + 1e-8)
    return desc
img = Image.open('path/to/image').convert('RGB')
feat1 = get_processed_features(sd_model, sd_aug, aggre_net, extractor_vit, num_patches, img)
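Once descriptors are available for two images, a correspondence for a source location can be found as the target location with the highest cosine similarity. The sketch below shows this nearest-neighbor lookup on a toy 2x2 grid with plain Python lists so that it runs without a GPU; `nearest_neighbor` and `index_to_xy` are hypothetical helpers, and with the real 1x768x60x60 tensors the same lookup would normally be done with a matrix product over the flattened maps.

```python
def nearest_neighbor(query, candidates):
    """Index of the candidate with the highest dot product.

    For L2-normalized descriptors the dot product equals the cosine similarity.
    """
    scores = [sum(q * c for q, c in zip(query, cand)) for cand in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


def index_to_xy(index, num_patches):
    """Convert a flat, row-major patch index into (x, y) grid coordinates."""
    return index % num_patches, index // num_patches


# Toy target map: a 2x2 grid of 3-D unit descriptors, flattened row by row.
target_desc = [
    [1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0], [0.6, 0.8, 0.0],
]
source_query = [0.0, 0.0, 1.0]  # descriptor at the source keypoint

best = nearest_neighbor(source_query, target_desc)
print(best, index_to_xy(best, num_patches=2))  # 2 (0, 1)
```

Grid coordinates still need to be scaled back to pixel coordinates of the resized input image before comparing against annotated keypoints.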
We provide the implementation of the global pose awareness analysis in the pose_awareness_test.py file.
You could run the following command to analyze the global pose awareness of the pretrained features:
- DINOv2 features:
python pose_awareness_test.py --ONLY_DINO 1
- Stable Diffusion features:
python pose_awareness_test.py --FUSE_DINO 0
- Fused features:
python pose_awareness_test.py
To train on the AP-10k dataset, run:
python pck_train.py --config configs/train_ap10k.yaml
Some important parameters in pck_train.py include:
- --TRAIN_DATASET: set to ap10k or pascal to train on the AP-10k or PF-Pascal dataset; the default training dataset is SPair-71k.
- --EVAL_DATASET: set to True to evaluate on the AP-10k/PF-Pascal dataset; the default evaluation dataset is SPair-71k. For the AP-10k dataset, you can also specify --AP10K_EVAL_SUBSET to evaluate on the different test settings (e.g., cross-species, cross-family).
- --COMPUTE_GEOAWARE_METRICS: set to True to evaluate on the geometry-aware semantic correspondence subset (more details in Paper Sec. 3.1).
- --DENSE_OBJ: set to 1 to enable the dense training objective (more details in Paper Sec. 4.2).
- --PAIR_AUGMENT: set to True to enable the pose-variant pair augmentation (more details in Paper Sec. 4.3).
- --SOFT_EVAL: set to True to enable the window soft-argmax evaluation (more details in Paper Sec. 4.4).
- --LOAD: the path to the pretrained weight.
- --TOTAL_SAVE_RESULT: the number of samples for which to save qualitative results; set to 0 to disable saving and speed up the evaluation process.
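For readers unfamiliar with the window soft-argmax used by --SOFT_EVAL, the following toy sketch shows the general idea: locate the hard argmax of a similarity map, then take a softmax-weighted average of coordinates inside a small window around it to obtain a sub-pixel estimate. This is an illustrative sketch, not the repository's implementation; the function name and the `window` and `temperature` defaults are assumptions.

```python
import math


def window_soft_argmax(sim, window=3, temperature=0.05):
    """Sub-pixel (x, y) peak estimate for a 2D similarity map (rows = y, cols = x)."""
    h, w = len(sim), len(sim[0])
    # Step 1: hard argmax over the whole map.
    y0, x0 = max(
        ((y, x) for y in range(h) for x in range(w)),
        key=lambda p: sim[p[0]][p[1]],
    )
    # Step 2: softmax restricted to a window around the peak (clipped at borders).
    r = window // 2
    cells = [
        (y, x)
        for y in range(max(0, y0 - r), min(h, y0 + r + 1))
        for x in range(max(0, x0 - r), min(w, x0 + r + 1))
    ]
    peak = sim[y0][x0]  # subtracted for numerical stability
    weights = [math.exp((sim[y][x] - peak) / temperature) for y, x in cells]
    total = sum(weights)
    # Step 3: expected coordinates under the windowed softmax.
    x_hat = sum(wt * x for wt, (y, x) in zip(weights, cells)) / total
    y_hat = sum(wt * y for wt, (y, x) in zip(weights, cells)) / total
    return x_hat, y_hat


# The peak is at x=1, but the strong neighbor at x=2 pulls the estimate to the right.
sim_map = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.98, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
print(window_soft_argmax(sim_map))
```

Restricting the softmax to a window keeps distant, spurious similarity peaks from dragging the estimate away from the main peak, which a soft-argmax over the full map would allow.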
Please refer to the pck_train.py file for more details. You may find samples of qualitative results in the results_ap10k folder.
To train on the SPair-71k dataset, run:
python pck_train.py --config configs/train_spair.yaml
Note that TEST_SAMPLE is the number of sampled pairs for each category for evaluation, which is set to 20 by default. Set it to 0 to use all the samples in the evaluation set.
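The effect of TEST_SAMPLE can be pictured with a small sketch: keep at most N pairs per category, with 0 meaning "keep everything". How the repository actually selects the pairs (random or first N, seeded or not) is not stated here, so the seeded random choice and the helper name below are assumptions for illustration only.

```python
import random


def sample_pairs_per_category(pairs_by_category, test_sample=20, seed=0):
    """Keep at most `test_sample` pairs per category; 0 keeps every pair."""
    rng = random.Random(seed)  # fixed seed so repeated runs pick the same pairs
    sampled = {}
    for category, pairs in pairs_by_category.items():
        if test_sample == 0 or len(pairs) <= test_sample:
            sampled[category] = list(pairs)
        else:
            sampled[category] = rng.sample(pairs, test_sample)
    return sampled


# Toy evaluation set: pair ids grouped by category.
toy_pairs = {"cat": list(range(50)), "dog": list(range(5))}
subset = sample_pairs_per_category(toy_pairs, test_sample=20)
print({k: len(v) for k, v in subset.items()})  # {'cat': 20, 'dog': 5}
```

Numbers obtained with a sampled subset are faster to compute but noisier than those on the full evaluation set, so full-set results are the ones to compare against published tables.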
To train on the PF-Pascal dataset, run:
python pck_train.py --config configs/train_pascal.yaml
To evaluate in the zero-shot (unsupervised) setting on SPair-71k, run:
python pck_train.py --config configs/eval_zero_shot_spair.yaml
Some important parameters here include:
- --DO_EVAL: set to True to start the evaluation process.
- --DUMMY_NET: set to True to use the identical post-processing network for the unsupervised setting.
- --ADAPT_FLIP: set to True to enable the test-time adaptive pose alignment method (more details in Paper Sec. 4.1). By default, it requires the source instance mask to align the pose. You can also bypass the mask by enabling --MUTUAL_NN, which uses the mutual nearest neighbor distance for the pose alignment.
- --KPT_RESULT: set to True to get the results under the per-keypoint setting (as in Tab. 2 of our paper). By default, the per-image result is evaluated (as in Tab. 3 of our paper). Please refer to the caption of Tab. 2 for more details, and to Sec. 5.2 of the DIFT paper for the difference between the two settings.
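The difference between the per-image and per-keypoint settings is easiest to see on a toy example. The sketch below assumes the common convention that a keypoint counts as correct when its prediction lies within alpha * max(h, w) of the ground truth, where h and w are the object bounding-box size; the helper names and data are hypothetical and not taken from pck_train.py.

```python
import math


def correct_flags(pred, gt, threshold):
    """One boolean per keypoint: is the prediction within `threshold` pixels?"""
    return [math.dist(p, g) <= threshold for p, g in zip(pred, gt)]


def pck_per_image(images):
    """Average of per-image accuracies; every image counts equally."""
    accs = []
    for pred, gt, thr in images:
        flags = correct_flags(pred, gt, thr)
        accs.append(sum(flags) / len(flags))
    return sum(accs) / len(accs)


def pck_per_keypoint(images):
    """Correct keypoints over all keypoints; every keypoint counts equally."""
    flags = [f for pred, gt, thr in images for f in correct_flags(pred, gt, thr)]
    return sum(flags) / len(flags)


thr = 0.10 * max(100, 80)  # alpha = 0.10, bounding box of 100 x 80 pixels
images = [
    # (predicted keypoints, ground-truth keypoints, threshold)
    ([(10, 10)], [(12, 11)], thr),                      # 1 of 1 correct
    ([(0, 0), (50, 50), (90, 20), (30, 70)],
     [(3, 4), (80, 50), (60, 20), (30, 40)], thr),      # 1 of 4 correct
]
print(pck_per_image(images))     # 0.625
print(pck_per_keypoint(images))  # 0.4
```

Images with few keypoints weigh more heavily in the per-image average, which is why the two settings can report noticeably different numbers for the same predictions.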
To run the evaluation on SPair-71k, use:
python pck_train.py --config configs/eval_spair.yaml
You could also replace the --config value with configs/eval_pascal.yaml or configs/eval_ap10k.yaml to evaluate on the PF-Pascal or AP-10k dataset.
We provide several pretrained weights to reproduce the results in the paper. You can find the pretrained weights in the results_spair folder.
If you find our work useful, please cite:
@inproceedings{zhang2024telling,
title={Telling Left from Right: Identifying Geometry-Aware Semantic Correspondence},
author={Zhang, Junyi and Herrmann, Charles and Hur, Junhwa and Chen, Eric and Jampani, Varun and Sun, Deqing and Yang, Ming-Hsuan},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2024}
}
Our code is largely based on the following open-source projects: A Tale of Two Features, Diffusion Hyperfeatures, DIFT, DenseMatching, and SFNet. Our heartfelt gratitude goes to the developers of these resources!