Interpolate clip #31900
Conversation
Thanks for adding this!
Just a small comment on the tests.
@slow
def test_inference_interpolate_pos_encoding(self):
    # ViT models have an `interpolate_pos_encoding` argument in their forward method,
Suggested change:
- # ViT models have an `interpolate_pos_encoding` argument in their forward method,
+ # XCLIP models have an `interpolate_pos_encoding` argument in their forward method,
    # to visualize self-attention on higher resolution images.
    model = XCLIPModel.from_pretrained("microsoft/xclip-base-patch32").to(torch_device)

    image_processor = XCLIPProcessor.from_pretrained("microsoft/xclip-base-patch32", size=480)
The returned object is a processor, not an image processor.
Suggested change:
- image_processor = XCLIPProcessor.from_pretrained("microsoft/xclip-base-patch32", size=480)
+ processor = XCLIPProcessor.from_pretrained("microsoft/xclip-base-patch32", size=480)
    with torch.no_grad():
        outputs = model(**inputs, interpolate_pos_encoding=True)
Can you add a check in the test that an error is raised if interpolate_pos_encoding=False?
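A minimal sketch of how such a check could look, reusing the model and inputs built in the test above and assuming the forward pass raises a ValueError for the 480-pixel input when interpolation is disabled (the exact exception type is an assumption until the guard is implemented):

# Hypothetical addition to the test: without interpolation the position
# embeddings no longer match the higher-resolution input, so the forward
# pass is expected to fail.
with self.assertRaises(ValueError):
    with torch.no_grad():
        model(**inputs, interpolate_pos_encoding=False)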
    # forward pass
    with torch.no_grad():
        outputs = model(**inputs, interpolate_pos_encoding=True)
Same here re checking the failing case.
Hi @amyeroberts, I am getting the same results from the model with interpolate_pos_encoding=True or False, in both x_clip and kosmos2. No error, just exactly the same tensor for outputs.vision_model_output.last_hidden_state. I've tried different sizes for the shortest_edge parameter.
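For what it's worth, this is the quick check I'd use to confirm the two outputs are identical (a debugging sketch reusing the model and inputs built above, not part of the PR):

with torch.no_grad():
    out_interp = model(**inputs, interpolate_pos_encoding=True)
    out_plain = model(**inputs, interpolate_pos_encoding=False)

# True here means interpolate_pos_encoding currently has no effect on the vision tower.
print(
    torch.allclose(
        out_interp.vision_model_output.last_hidden_state,
        out_plain.vision_model_output.last_hidden_state,
    )
)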
I think the reason is that logic like this:

if not interpolate_pos_encoding and (height != self.image_size[0] or width != self.image_size[1]):
    raise ValueError(
        f"Input image size ({height}*{width}) doesn't match model"
        f" ({self.image_size[0]}*{self.image_size[1]})."
    )

is not implemented in the modeling_kosmos2.py file, nor in the corresponding modeling_x_clip.py.
Happy to implement it.
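To make the placement concrete, here is a minimal, self-contained sketch of where such a guard typically sits in a vision embeddings module; the class and attribute names below are illustrative, not the actual transformers code:

import torch
import torch.nn as nn

class VisionEmbeddingsSketch(nn.Module):
    # Illustrative stand-in for the vision embeddings of x_clip / kosmos2:
    # patch embedding only, position embeddings omitted for brevity.
    def __init__(self, image_size=224, patch_size=32, embed_dim=768):
        super().__init__()
        self.image_size = image_size
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, pixel_values, interpolate_pos_encoding=False):
        _, _, height, width = pixel_values.shape
        # Refuse a non-default resolution unless interpolation was explicitly requested.
        if not interpolate_pos_encoding and (height != self.image_size or width != self.image_size):
            raise ValueError(
                f"Input image size ({height}*{width}) doesn't match model"
                f" ({self.image_size}*{self.image_size})."
            )
        return self.patch_embed(pixel_values).flatten(2).transpose(1, 2)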
Part of the issue, in the case of the kosmos2 model, is that

processor = AutoProcessor.from_pretrained(
    "microsoft/kosmos-2-patch14-224", padding_side="left", size={"shortest_edge": 480}
)
image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
inputs = processor(text="what's in the image", images=image, return_tensors="pt").to(torch_device)

is not changing the image size. It still returns a 224x224 image, even when you reduce shortest_edge to less than 224.
I need to investigate further while I learn more about the library.
I've found out what was missing:

# the default image size of pretrained kosmos_2 is 224 x 224
processor = AutoProcessor.from_pretrained(
    "microsoft/kosmos-2-patch14-224", size={"shortest_edge": 180}, crop_size={"height": 180, "width": 180}
)

Now I have implemented the ValueError in the modeling_kosmos2.py file.
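For reference, here is a sketch of the end-to-end check I expect to pass once the guard is in: 180x180 inputs are accepted with interpolation and rejected without it. The interpolate_pos_encoding argument is the one added by this PR, and the AutoModel loading plus the exact error behaviour are assumptions on my side:

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Resize and centre-crop to 180x180 so the input no longer matches the 224x224 pretraining size.
processor = AutoProcessor.from_pretrained(
    "microsoft/kosmos-2-patch14-224", size={"shortest_edge": 180}, crop_size={"height": 180, "width": 180}
)
model = AutoModel.from_pretrained("microsoft/kosmos-2-patch14-224")

image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
inputs = processor(text="what's in the image", images=image, return_tensors="pt")

with torch.no_grad():
    # Interpolating the position encodings lets the 180x180 input through ...
    outputs = model(**inputs, interpolate_pos_encoding=True)
    # ... while without interpolation the new guard should raise.
    try:
        model(**inputs, interpolate_pos_encoding=False)
    except ValueError as err:
        print(err)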
@amyeroberts I've implemented changes in all CLIP family models to account for the use case when interpolate_pos_encoding=False and the input image size doesn't match the model's.
Force-pushed from 060857f to db5a20a.
I've attempted to sync with the huggingface:main branch, as there are some failing tests due to being out of sync, but it is not working. When running the tests I see I am still missing this test and the get_set_embeddings one even after fetching. @amyeroberts help would be appreciated.
Closing this one in favour of #32600.
This PR addresses the suggestions of @amyeroberts in the existing PR #30783.