Unofficial implementation of the paper "What does CLIP know about a red circle? Visual prompt engineering for VLMs".
Prerequisites: install CLIP by following the installation instructions in the CLIP repo.
As an example, we match pan_1.png and pan_2.png against the texts ["an image of the handle of a pan", "an image of the cooking area of a pan"]. Running the script gives the final score matrix [[0.6423, 0.3527], [0.3517, 0.6433]]. The highest scores lie on the diagonal, pairing pan_1.png with the first text and pan_2.png with the second.
We borrow the optimal transport function from SuperGlue.
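
To give a rough idea of what an optimal transport step does to a score matrix, here is a minimal sketch of log-space Sinkhorn normalization in plain Python. It is not the SuperGlue function itself: SuperGlue's version works on PyTorch tensors and adds a dustbin row and column, both of which this sketch leaves out. The function names `logsumexp` and `sinkhorn`, the iteration count, and the input logits are illustrative assumptions, not values taken from this repo.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sinkhorn(scores, iters=50):
    """Normalize a square score matrix so rows and columns each sum to ~1.

    `scores` is a list of lists of raw similarity logits (hypothetical input).
    Illustrative sketch only: no dustbin row/column, uniform marginals.
    """
    n_rows, n_cols = len(scores), len(scores[0])
    u = [0.0] * n_rows  # row potentials (log-space scaling factors)
    v = [0.0] * n_cols  # column potentials
    for _ in range(iters):
        # Rescale rows so each row of exp(scores + u + v) sums to 1.
        u = [-logsumexp([scores[i][j] + v[j] for j in range(n_cols)])
             for i in range(n_rows)]
        # Rescale columns so each column sums to 1.
        v = [-logsumexp([scores[i][j] + u[i] for i in range(n_rows)])
             for j in range(n_cols)]
    return [[math.exp(scores[i][j] + u[i] + v[j]) for j in range(n_cols)]
            for i in range(n_rows)]


# Made-up logits for two images and two texts (not produced by CLIP).
example_scores = [[2.0, 1.0], [0.5, 1.5]]
for row in sinkhorn(example_scores):
    print([round(p, 4) for p in row])
```

Unlike a per-row softmax, this normalization balances rows and columns jointly, so one text cannot absorb most of the probability mass from every image.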