
Does GroundingDINO support batched inference? #32206

Closed
royvelich opened this issue Jul 25, 2024 · 9 comments · May be fixed by #32490

@royvelich

It seems like grounding-dino states in the documentation that it can take a batch of images, but when I try to do so, I get an error, as specified here - https://discuss.huggingface.co/t/how-to-perform-batch-inference-on-groundingdino-model/90940.
Is it supposed to work?

@qubvel
Member

qubvel commented Jul 25, 2024

Hi @royvelich, thanks for the question! It would be nice to have a minimal reproducible example and your environment info 🙂

I was able to run a batched inference with the following env and code:

- `transformers` version: 4.44.0.dev0
- Platform: Linux-6.5.0-1020-aws-x86_64-with-glibc2.35
- Python version: 3.10.12
- PyTorch version (GPU?): 2.4.0+cu118 (True)
- GPU type: NVIDIA A10G
```python
import requests

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-tiny"
device = "cuda"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)

image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)
images = [image, image]
texts = [
    "a cat. a remote control.",
    "a cat. a remote control. a sofa.",
]

inputs = processor(images=images, text=texts, padding=True, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

w, h = image.size
results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.4,
    text_threshold=0.3,
    target_sizes=[(h, w), (h, w)],
)
print(results)
```
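For reference, `post_process_grounded_object_detection` returns one dict per image (with `scores`, `labels`, and `boxes` entries, if I recall the current API correctly). A minimal sketch of unpacking the batch, using plain lists in place of the real tensors so it runs without downloading the model:

```python
# Illustrative structure only: mimics the per-image dicts returned by the
# post-processor (the values here are made up, and lists stand in for tensors).
results = [
    {"scores": [0.51, 0.46],
     "labels": ["a cat", "a remote control"],
     "boxes": [[12, 52, 320, 470], [40, 70, 180, 120]]},
    {"scores": [0.48],
     "labels": ["a sofa"],
     "boxes": [[0, 0, 640, 480]]},
]

detections = []
for image_idx, result in enumerate(results):
    # Each image in the batch gets its own dict; iterate its detections together.
    for score, label, box in zip(result["scores"], result["labels"], result["boxes"]):
        detections.append((image_idx, label, score, box))
        print(f"image {image_idx}: {label} ({score:.2f}) at {box}")
```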

@royvelich
Author

@qubvel When I supply a batch of images, should all the images have the same resolution?

@qubvel
Member

qubvel commented Aug 5, 2024

Hi @royvelich, the processor will take care of this. The example above works even if we pass images of different sizes to the processor:

```python
...
images = [image.resize((512, 256)), image.resize((256, 256))]
...
inputs = processor(images=images, text=texts, padding=True, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)
...
```
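Under the hood, the image processor pads every image in the batch to a common size and returns a mask marking the valid pixels. A toy sketch of that idea (pure Python lists, not the actual transformers implementation):

```python
# Conceptual sketch of batch padding: every "image" is padded to the largest
# height/width in the batch, and a mask records which pixels are real.
def pad_batch(images, pad_value=0):
    max_h = max(len(img) for img in images)
    max_w = max(len(row) for img in images for row in img)
    padded, masks = [], []
    for img in images:
        h, w = len(img), len(img[0])
        # Pad each row to max_w, then append all-padding rows up to max_h.
        padded.append([row + [pad_value] * (max_w - w) for row in img]
                      + [[pad_value] * max_w] * (max_h - h))
        # 1 marks a valid pixel, 0 marks padding (rows are read-only here).
        masks.append([[1] * w + [0] * (max_w - w)] * h
                     + [[0] * max_w] * (max_h - h))
    return padded, masks

imgs = [[[1, 2], [3, 4]], [[5, 6, 7]]]  # a 2x2 and a 1x3 "image"
padded, masks = pad_batch(imgs)
print(len(padded[0]), len(padded[0][0]))  # both images are now 2x3
```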

@royvelich
Author

@qubvel
Hi,
Thanks for your support. For some reason, when I run `outputs = model(**inputs)`, I get the following error:

TypeError: GroundingDinoForObjectDetection.forward() missing 1 required positional argument: 'input_ids'

I checked, and `inputs.input_ids` does not exist.

Do you have any idea what I should do?

Thanks!

@royvelich
Author


Wait, let me check something. It works in your example, but I get this error when I integrate it into my project.
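The usual cause of that error (an assumption, since the failing call isn't shown) is calling the processor with images only: `input_ids` comes from tokenizing the text prompt, so it is missing when no `text=` argument is passed. A schematic comparison, mocked with a plain function so it runs without the model:

```python
# Mock of the processor's behavior: input_ids appears in the output
# only when a text prompt is supplied (token ids here are fake).
def mock_processor(images=None, text=None):
    out = {}
    if images is not None:
        out["pixel_values"] = images
    if text is not None:
        out["input_ids"] = [[101, 102]] * len(text)
    return out

with_text = mock_processor(images=["img"], text=["a cat."])
without_text = mock_processor(images=["img"])
print("input_ids" in with_text, "input_ids" in without_text)
```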

@royvelich
Author

@qubvel
Ok, now it works. But I wonder - previously, I used a pipeline for running grounding-dino:

```python
detector = pipeline(model="IDEA-Research/grounding-dino-tiny", task="zero-shot-object-detection", device=device)
```

Can we work in batches there as well? Also, it looks like the boxes that the pipeline returns are different from the boxes that I get using your code (using the same images/labels/hyper-parameters). Is it just a different format?

@qubvel
Member

qubvel commented Aug 6, 2024

Hi @royvelich, indeed, there is a bug in the zero-shot object-detection pipeline; it was reported previously in this issue:

@royvelich
Author

This one should work for the pipeline:

```python
import requests
from PIL import Image
from transformers import pipeline

device = "cuda"

detector = pipeline(model="IDEA-Research/grounding-dino-tiny", task="zero-shot-object-detection", device=device)

image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)
images = [image, image]
texts = [
    "a cat. a remote control.",
    "a cat. a remote control. a sofa.",
]

data = [{'image': image, 'candidate_labels': text} for image, text in zip(images, texts)]

results = detector(data)

print(results)
```
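On the earlier question about differing box formats: as far as I know, the pipeline reports each box as a dict of corner coordinates (`xmin`/`ymin`/`xmax`/`ymax`), while the post-processing path returns `[x_min, y_min, x_max, y_max]` tensors, both in absolute pixels. If that holds, converting between the two is a one-liner; a hedged sketch:

```python
# Assumes the pipeline's dict-style box format; converts it to the
# [x_min, y_min, x_max, y_max] list layout used by the post-processor.
def box_dict_to_xyxy(box):
    return [box["xmin"], box["ymin"], box["xmax"], box["ymax"]]

print(box_dict_to_xyxy({"xmin": 12, "ymin": 52, "xmax": 320, "ymax": 470}))
```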


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
