Questions about location tokens #4
Hi, your work is great! But I am confused about the location tokens you used in the Decoder. Could you provide more details about them?
Comments
Same here. I am trying to figure out what they are. Are they a fixed grid or some learnable parameter?
I think it appears to be the same grid-like structure used in Deformable DETR: basically a uniform grid across image coordinates, where each grid centre is used as an anchor from which the model regresses the deviation of the correct bbox.
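For anyone trying to picture this, here is a minimal sketch of that anchor-grid idea (not the authors' code): a uniform grid of reference points in normalized image coordinates, and a head that regresses deviations from each grid centre, loosely in the spirit of Deformable DETR's reference points. The function names and the simplified decoding below are my own assumptions.

```python
# Hypothetical sketch: uniform anchor grid + regression of deviations.
# Not taken from the VisionLLM repo; names and decoding are illustrative.
import torch

def make_reference_grid(num_points_per_side: int) -> torch.Tensor:
    """Uniform grid of reference points in normalized [0, 1] image coords."""
    step = 1.0 / num_points_per_side
    centers = torch.arange(num_points_per_side, dtype=torch.float32) * step + step / 2
    cy, cx = torch.meshgrid(centers, centers, indexing="ij")
    return torch.stack([cx.flatten(), cy.flatten()], dim=-1)  # (N, 2)

def decode_boxes(reference_points: torch.Tensor, deltas: torch.Tensor) -> torch.Tensor:
    """Each grid centre acts as an anchor; the head predicts deviations from it."""
    # deltas: (N, 4) raw offsets for (cx, cy, w, h); simplified decoding
    cxcy = reference_points + deltas[:, :2]   # shift the anchor centre
    wh = deltas[:, 2:].sigmoid()              # width/height mapped into [0, 1]
    return torch.cat([cxcy, wh], dim=-1)      # normalized (cx, cy, w, h) boxes

anchors = make_reference_grid(num_points_per_side=10)  # 100 anchor points
deltas = torch.zeros(anchors.shape[0], 4)               # stand-in for the head's output
boxes = decode_boxes(anchors, deltas)                    # (100, 4)
```

With zero deltas each box collapses onto its grid centre, which is the sense in which the grid centres act as anchors.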
Hi, I have a similar question. If VisionLLM uses a Deformable-DETR-like decoder and the object queries act as positional anchors, then Hungarian matching is required to assign GT boxes to object queries. However, the authors don't mention that in the paper. What do you think the training details of these object queries might be?
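I don't know the authors' actual recipe, but as a sketch of what DETR-style assignment usually looks like, here is a toy Hungarian matching step using scipy's linear_sum_assignment. The L1-only cost matrix is a simplification (real matchers typically also add classification and GIoU costs), and all names are illustrative.

```python
# Hypothetical sketch of Hungarian matching between object queries and GT boxes.
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_boxes: torch.Tensor, gt_boxes: torch.Tensor):
    """pred_boxes: (num_queries, 4), gt_boxes: (num_gt, 4), both normalized cxcywh."""
    # Cost matrix: pairwise L1 distance between every prediction and every GT box.
    cost = torch.cdist(pred_boxes, gt_boxes, p=1)             # (num_queries, num_gt)
    query_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    # Matched pairs: query query_idx[k] is supervised by GT gt_idx[k];
    # unmatched queries would be trained toward the "no object" class.
    return query_idx, gt_idx

pred = torch.rand(100, 4)   # e.g. 100 object queries
gt = torch.rand(5, 4)       # e.g. 5 ground-truth boxes
matched_queries, matched_gts = hungarian_match(pred, gt)
```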
Also, same here. Maybe this link will be helpful: https://openreview.net/forum?id=Vx1JadlOIt&noteId=616Bhd6O5S, and the attached image may be helpful too.