Hi, thanks for your excellent work.

I am reading the code of this project, but the difference between clip_text_features and clip_word_tokens confuses me. My understanding is that clip_text_features are the features from the CLIP text encoder for the whole text, while clip_word_tokens are the features for a particular class name (using end_token_ids as the index). So clip_text_features can represent the text feature for the bag of regions, while clip_word_tokens represents the text feature for a single proposal. Do I understand this correctly?
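To make sure I am reading it right, here is a minimal sketch (my own code, not the project's) of how I understand the two kinds of indexing; the shapes, dummy token ids, and end_token_ids values are all made up:

import torch

# Per-token output of the CLIP text encoder, shape (N, L, C):
# N prompts, L tokens, C channels (made-up sizes).
N, L, C = 4, 77, 512
token_feats = torch.randn(N, L, C)

# Sentence-level feature: CLIP takes it at the end-of-text (EOT) token,
# which has the largest token id, hence the argmax over token ids.
text_ids = torch.randint(0, 49408, (N, L))  # dummy token ids
clip_text_features = token_feats[torch.arange(N), text_ids.argmax(dim=-1)]

# Word-level feature: index the class-name position inside each prompt
# instead (end_token_ids is assumed to hold those positions).
end_token_ids = torch.tensor([5, 6, 5, 7])  # hypothetical positions
clip_word_tokens = token_feats[torch.arange(N), end_token_ids]

print(clip_text_features.shape, clip_word_tokens.shape)  # (4, 512) (4, 512)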
More importantly, the implementation of clip_word_tokens confuses me, in these lines:
def forward(self, x, return_tokens=False, cls_indices=None, attn_masks=None):
    # `tokens` is only populated when return_tokens=True
    att, tokens = self.attention(self.ln_1(x), return_tokens, attn_masks=attn_masks)
    if return_tokens:
        assert cls_indices is not None
        if not isinstance(cls_indices, int):
            assert len(cls_indices) == x.shape[1]  # x: LNC
        # pick, for each of the N sequences, the embedding at cls_indices -> (N, C)
        cls_tokens = x[cls_indices, torch.arange(x.shape[1])]
        # broadcast that embedding over all L positions as the token-branch residual
        tokens = cls_tokens[None] + tokens
        tokens = tokens + self.mlp(self.ln_2(tokens))
        # the main branch keeps the standard pre-norm residual structure
        x = x + att
        x = x + self.mlp(self.ln_2(x))
        return x, tokens
    else:
        assert tokens is None
        x = x + att
        # x = x + self.attention(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x, None
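Here is how I parse the shape manipulation in the token branch; the sizes are made up, and tokens below is just a stand-in for whatever self.attention returns:

import torch

# x follows the LNC layout noted in the code: L tokens, N sequences, C channels.
L, N, C = 77, 4, 512
x = torch.randn(L, N, C)
tokens = torch.randn(L, N, C)  # stand-in for the attention's `tokens` output

# Advanced indexing pairs cls_indices[i] with batch position i, giving (N, C):
# one embedding per sequence, taken at that sequence's class-name position.
cls_indices = torch.tensor([5, 6, 5, 7])  # hypothetical positions
cls_tokens = x[cls_indices, torch.arange(N)]
print(cls_tokens.shape)  # torch.Size([4, 512])

# cls_tokens[None] has shape (1, N, C) and broadcasts over the L dimension,
# so the same embedding is added as a residual to every token position.
out = cls_tokens[None] + tokens
print(out.shape)  # torch.Size([77, 4, 512])

If that reading is right, the residual for the token branch is the class-name embedding broadcast to every position, rather than x itself as in the standard pre-norm block, and that substitution is exactly what I would like the author to explain.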
Could the author provide some explanation or a paper reference for this implementation? That would really help me a lot, thanks!