Community contribution: enable dynamic resolution input for more vision models. #30579
Comments
I can take Clip and Blip2.
Some heads up here; people have complained about the fact that
I can work on vit_mae and tvp
Thanks for the heads up @NielsRogge!
OK, that's good to know. If many models have this it's a good reason to spend some time to figure out a solution! The most important thing is that it will work with a standard forward / backwards pass - if that's working we should be able to find a way to integrate if it's a wanted feature.
Agreed
Yes so the problem is that the However, there's a workaround: https://discuss.huggingface.co/t/fine-tuning-vit-with-more-patches-higher-resolution/18731/4?u=nielsr
…lip vision model This commit introduces the `interpolate_pos_encoding` function to the `altclip` classes. It allows for high resolution images to be processed without image resizing. partially solves Issue huggingface#30579
`interpolate_pos_encoding` function to the `altclip` vision models. It allows high-resolution images to be fed into the model for fine-tuning irrespective of the pre-trained image configuration. issue huggingface#30579
I can work on deit!
I'd like to work on vivit
Hi ashavinni
I can work on chinese_clip. Will keep the team posted in the next few days. If I get more free time and there are remaining ones by then, happy to help out on additional tasks.
Working on detr, a bit tricky. Will explain in the PR.
Actually, I can also take bridgetower as well. They will come in as separate PRs though. Shouldn't be more complicated than chinese_clip.
How do you manage this with Or avoid `make fix-copies` altogether before sending a PR?
I will work on Swin, since DeiT is already implemented.
I will work on owlvit.
@nileshkokane01 This is a good point - I'll update the list of models to indicate which models are "grouped" together. In the case of e.g. the CLIP family, there should just be one PR opened for adding the feature to CLIP and the models which are copied from it. The steps would be:
@nileshkokane01 @amyeroberts In that case, I will refrain from working on Update: oh nice thank you Amy for updating the description to group them
I can take
@amyeroberts Doesn't transformers/src/transformers/models/idefics2/modeling_idefics2.py Lines 139 to 149 in cf7bed9
For example, the following sample script:

```python
import torch
import requests
from PIL import Image
from transformers import Idefics2Processor, Idefics2ForConditionalGeneration

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Download a sample high-resolution image
url = "https://upload.wikimedia.org/wikipedia/commons/c/cc/ESC_large_ISS022_ISS022-E-11387-edit_01.JPG"
images = [Image.open(requests.get(url, stream=True).raw)]

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What's on the image?"},
        {"type": "image"},
    ],
}]

processor = Idefics2Processor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)
# Instead of the default 980, allow the largest edge to be 1500
processor.image_processor.size["longest_edge"] = 1500

model = Idefics2ForConditionalGeneration.from_pretrained("HuggingFaceM4/idefics2-8b").to(device)

text = processor.apply_chat_template(messages)
inputs = processor(text=text, images=images, return_tensors="pt", padding=True)
# Move every input tensor to the same device as the model
for k, v in inputs.items():
    inputs[k] = v.to(device)

print("Input image shape:", inputs["pixel_values"].shape)

# Plain forward pass; no gradients needed for this check
with torch.no_grad():
    out = model(**inputs)

print("Finished!")
```

executes without errors and prints the following:

```
Loading checkpoint shards: 100%|████████████████████████| 7/7 [00:03<00:00, 2.26it/s]
Input image shape: torch.Size([1, 1, 3, 994, 1500])
Finished!
```
Since all CLIP-like models can just borrow the changes made to the CLIP model, I will take tvp instead of altclip.
@zafstojano Indeed! That's what I get for doing a quick grep and not double checking. Thanks for showing an example to verify. I'll take it off the list
Opened a PR (#30722) addressing this issue for the BLIP family of models (BLIP, BLIP2, InstructBLIP).
@amyeroberts I would like to work on DETR. Is anyone working on it?
I'm almost done. Was busy with work in the past 2 weeks.
I'll be working on grounding_dino and hopefully I will have a PR soon.
@MightyStud Thanks for picking a model and working to add this feature! After reviewing #30921, I realised that this isn't something we can add for models with backbones, which includes Grounding DINO and DETR-related models. I've updated the list to reflect this.
@amyeroberts Aha, thanks for letting me know, I'd like to work on swin2sr then since I already allocated time this week.
Hi @amyeroberts Can I try out beit or data2vec?
@OmarManzoor Certainly!
@amyeroberts Is there any model that I can work on in this task?
@kishore-s-15 There is currently no open PR for deit
Thanks, @amyeroberts, I would love to work on it. Could you assign it to me?
@amyeroberts I have opened a PR (#31131) for deit
@amyeroberts Are there any models I can work on for this task?
@jacksoncamp42 All models currently have open PRs. If you're interested in adding features to vision models, another way to contribute would be enabling
@amyeroberts Thanks for the suggestion. Unfortunately, I currently don't have access to a multi-GPU environment. Is there another area or feature that I can contribute to without needing a multi-GPU setup?
@jacksoncamp42 Anyone in the community is welcome to tackle any issue within the library. For people who are contributing for the first time, we suggest looking for issues with the
CLIP family models have been tackled (and merged) here: #32600
@amyeroberts, can you please have a look at PR #34268, which adds interpolation to the owlvit models? Thanks
Feature request
Some of our models interpolate their positional embeddings, enabling pretrained checkpoints to be used on different input resolutions. For example, here in ViT.
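To make the idea concrete, here is a minimal PyTorch sketch of interpolating ViT-style position embeddings to a new patch grid. It is not the library's implementation: the helper name `interpolate_pos_embed`, the tensor sizes, and the assumption of a leading class token plus a square pretrained grid are all illustrative.

```python
import torch
import torch.nn.functional as F


def interpolate_pos_embed(pos_embed: torch.Tensor, new_grid: tuple[int, int]) -> torch.Tensor:
    """Resize position embeddings of shape (1, 1 + N, D) to a new patch grid.

    Hypothetical helper. Assumes the first token is a class token and the
    remaining N tokens form a square grid (N = g * g).
    """
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    num_patches, dim = patch_pos.shape[1], patch_pos.shape[2]
    old_grid = int(num_patches ** 0.5)
    if old_grid * old_grid != num_patches:
        raise ValueError("expected a square grid of patch position embeddings")
    # (1, N, D) -> (1, D, g, g) so that spatial interpolation can be applied
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=new_grid, mode="bicubic", align_corners=False)
    # (1, D, h, w) -> (1, h * w, D)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid[0] * new_grid[1], dim)
    # The class-token embedding is kept unchanged
    return torch.cat([cls_pos, patch_pos], dim=1)


# Pretrained on a 14x14 patch grid, now used on a 20x30 grid (made-up sizes)
pos_embed = torch.randn(1, 1 + 14 * 14, 64)
resized = interpolate_pos_embed(pos_embed, new_grid=(20, 30))
print(resized.shape)
```

The patch embeddings are treated as a small 2D feature map and resized, while the class-token embedding is passed through untouched, so the resized table lines up with the longer token sequence produced by a larger image.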
fixes clip interpolate #30783
adding positional encoder changes and tests #32600
Motivation
Let's add this to more models, to leverage existing checkpoints for new cases!
Your contribution
For anyone who would like to contribute, please comment on the issue, claiming a model you'd like to work on and share a link to the PR.
Each PR should:
`interpolate_pos_encoding` method
There was a PR opened to add this to CLIP models, which is now inactive, but useful for reference of the changes to make: #27457
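As a rough illustration of the kind of check a PR could include, the sketch below runs a tiny, randomly initialised ViT on an image larger than its configured size, using the `interpolate_pos_encoding` argument that ViT's forward pass already accepts. The config values and input size are made up to keep the example fast; this is not the project's actual test.

```python
import torch
from transformers import ViTConfig, ViTModel

# Tiny random ViT; sizes are hypothetical and chosen only for speed
config = ViTConfig(
    image_size=32,
    patch_size=8,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)
model = ViTModel(config).eval()

# The input is 64x48 rather than the configured 32x32
pixel_values = torch.randn(1, 3, 64, 48)
with torch.no_grad():
    outputs = model(pixel_values, interpolate_pos_encoding=True)

# One class token plus one token per 8x8 patch
expected_tokens = 1 + (64 // 8) * (48 // 8)
print(outputs.last_hidden_state.shape, expected_tokens)
```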
Once the PR is ready, you can ping me for review 🤗