Interpolate clip #31900

Closed

wants to merge 10 commits into from
Conversation

manuelsh (Contributor)

This PR addresses the suggestions of @amyeroberts in the existing PR #30783.

manuelsh marked this pull request as ready for review July 10, 2024 20:49
manuelsh mentioned this pull request Jul 10, 2024
amyeroberts (Collaborator) left a comment

Thanks for adding this!

Just a small comment on the tests.


@slow
def test_inference_interpolate_pos_encoding(self):
    # ViT models have an `interpolate_pos_encoding` argument in their forward method,
Collaborator:

Suggested change:
- # ViT models have an `interpolate_pos_encoding` argument in their forward method,
+ # XCLIP models have an `interpolate_pos_encoding` argument in their forward method,

    # to visualize self-attention on higher resolution images.
    model = XCLIPModel.from_pretrained("microsoft/xclip-base-patch32").to(torch_device)

    image_processor = XCLIPProcessor.from_pretrained("microsoft/xclip-base-patch32", size=480)
Collaborator:

The returned object is a processor, not an image processor.

Suggested change:
- image_processor = XCLIPProcessor.from_pretrained("microsoft/xclip-base-patch32", size=480)
+ processor = XCLIPProcessor.from_pretrained("microsoft/xclip-base-patch32", size=480)

Comment on lines 749 to 748

    with torch.no_grad():
        outputs = model(**inputs, interpolate_pos_encoding=True)
Collaborator:

Can you add a check to the test that an error is raised if interpolate_pos_encoding=False?
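
A minimal sketch of what such a check could look like, assuming the model raises a ValueError when the input resolution differs from the pretrained size and interpolate_pos_encoding=False:

    # hypothetical sketch: interpolation disabled + mismatched resolution should error
    with self.assertRaises(ValueError):
        with torch.no_grad():
            model(**inputs, interpolate_pos_encoding=False)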


    # forward pass
    with torch.no_grad():
        outputs = model(**inputs, interpolate_pos_encoding=True)
Collaborator:

Same comment here about checking the failing case.

manuelsh (Contributor, Author) Aug 3, 2024

Hi @amyeroberts, I am getting the same results from the model with interpolate_pos_encoding=True or False, in both x_clip and kosmos2: no error, just the exact same tensor for outputs.vision_model_output.last_hidden_state. I've tried different sizes for the shortest_edge parameter.
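
For reference, a minimal sketch of the comparison described above, with illustrative variable names and assuming inputs comes from the processor as in the snippets below:

    # sketch: run the model with and without interpolation and compare the vision outputs
    with torch.no_grad():
        out_interp = model(**inputs, interpolate_pos_encoding=True)
        out_plain = model(**inputs, interpolate_pos_encoding=False)

    # per the observation above, this prints True: the tensors are identical
    print(
        torch.allclose(
            out_interp.vision_model_output.last_hidden_state,
            out_plain.vision_model_output.last_hidden_state,
        )
    )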

manuelsh (Contributor, Author) Aug 3, 2024

I think the reason is that logic like this:

if not interpolate_pos_encoding and (height != self.image_size[0] or width != self.image_size[1]):
    raise ValueError(
        f"Input image size ({height}*{width}) doesn't match model"
        f" ({self.image_size[0]}*{self.image_size[1]})."
    )

is not implemented in the modeling_kosmos2.py file, nor in the corresponding code in modeling_x_clip.py.

Happy to implement it.

manuelsh (Contributor, Author) Aug 4, 2024

Part of the issue, in the case of the kosmos2 model, originates because

        processor = AutoProcessor.from_pretrained(
            "microsoft/kosmos-2-patch14-224", padding_side="left", size={"shortest_edge": 480}
        )
        image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
        inputs = processor(text="what's in the image", images=image, return_tensors="pt").to(torch_device)

is not changing the image size. It is still returning a 224x224 image, even when you reduce the shortest_edge to less than 224.

Need to investigate further while I learn more about the library.
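
A quick sanity check for this is to look at the shape of the returned pixel values (a sketch, assuming the processor output exposes a pixel_values tensor):

        # sketch: inspect the spatial size the processor actually produced
        print(inputs["pixel_values"].shape)
        # per the observation above, this still reports 224x224
        # regardless of the shortest_edge value passed to the processor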

manuelsh (Contributor, Author)

I've found out what was missing:

        # default image size of pretrained kosmos_2 is 224 x 224
        processor = AutoProcessor.from_pretrained(
            "microsoft/kosmos-2-patch14-224", size={"shortest_edge": 180}, crop_size={"height": 180, "width": 180}
        )

Now, I have implemented the ValueError in the modeling_kosmos2.py file.

manuelsh (Contributor, Author) commented Aug 4, 2024

@amyeroberts I've implemented changes in all CLIP-family models to handle the case where interpolate_pos_encoding=False and the resolution differs from the pretrained model's, and I've added the respective tests.

manuelsh closed this Aug 4, 2024
manuelsh reopened this Aug 4, 2024
manuelsh closed this Aug 4, 2024
manuelsh mentioned this pull request Aug 4, 2024
manuelsh (Contributor, Author) commented Aug 4, 2024

I've attempted to sync with the huggingface:main branch with

git fetch upstream
git rebase upstream/main

as there are some failing tests due to the branch being out of sync, but it is not working.

When running make repo-consistency I get:

python utils/check_copies.py
Traceback (most recent call last):
  File "/usr/src/app/transformers/utils/check_copies.py", line 1106, in <module>
    check_copies(args.fix_and_overwrite, args.file)
  File "/usr/src/app/transformers/utils/check_copies.py", line 852, in check_copies
    new_diffs = is_copy_consistent(filename, overwrite, buffer)
  File "/usr/src/app/transformers/utils/check_copies.py", line 675, in is_copy_consistent
    target_lines, theoretical_code, theoretical_code_splits = find_code_and_splits(
  File "/usr/src/app/transformers/utils/check_copies.py", line 521, in find_code_and_splits
    code, (lines, target_start_index, target_end_index) = find_code_in_transformers(
  File "/usr/src/app/transformers/utils/check_copies.py", line 456, in find_code_in_transformers
    raise ValueError(f" {object_name} does not match any function or class in {module}.")
ValueError:  models.clip.test_modeling_clip.CLIPModelTest.test_model_get_set_embeddings does not match any function or class in models/clip/test_modeling_clip.

and I see that this test and the get_set_embeddings method are still missing even after fetching.

@amyeroberts, help would be appreciated.

manuelsh reopened this Aug 4, 2024
manuelsh (Contributor, Author)
Closing this one in favour of #32600

manuelsh closed this Aug 13, 2024