What is the default token, UNK(2) or PAD(0)? #27

Closed
shiyegao opened this issue Jul 3, 2024 · 2 comments

shiyegao commented Jul 3, 2024

The constants module (from esm.utils.constants import esm3 as C) defines two candidate tokens for the SS8 track:

SS8_UNK_TOKEN = 2
SS8_PAD_TOKEN = 0
  1. If we call ESM3.forward(sequence_tokens=xxx) directly, ss8_tokens defaults to 2 (UNK).
  2. However, if we call ESM3.generate(xxx), ss8_tokens is set to 0 (PAD) by default_protein_tensor.

Although the model may have learned during training to treat PAD tokens like UNK tokens, I wonder which token is the better placeholder for the missing SS8 track when extracting embeddings from the sequence alone.

Also, is this inconsistency a bug? If so, will it be fixed later?

shiyegao changed the title from "Is default UNK or PAD" to "What is the default token, UNK(2) or PAD(0)?" on Jul 3, 2024
ebetica (Contributor) commented Jul 9, 2024

The default ss8_token should be 0 for an all-masked sequence. Will fix in the next release.
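
Until both code paths agree, one possible workaround is to build the SS8 track explicitly rather than rely on either default. The following is only a minimal sketch in plain Python: the helper explicit_ss8_track and the placeholder sequence ids are hypothetical and not part of esm, and the two constants are copied from the values quoted in this issue.

```python
from typing import List

# Token ids as quoted in this issue (from esm.utils.constants.esm3).
SS8_UNK_TOKEN = 2
SS8_PAD_TOKEN = 0


def explicit_ss8_track(num_tokens: int, fill: int = SS8_PAD_TOKEN) -> List[int]:
    """Hypothetical helper: return an SS8 track of length num_tokens filled with one id."""
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    return [fill] * num_tokens


# Placeholder ids standing in for a tokenized sequence (not real ESM3 tokens).
sequence_tokens = [0, 5, 9, 12, 7, 2]

# Match the SS8 track length to the sequence track and fill it with PAD (0).
ss8_tokens = explicit_ss8_track(len(sequence_tokens))
print(ss8_tokens)  # [0, 0, 0, 0, 0, 0]
```

With the real model, such a track would presumably be converted to a tensor shaped like sequence_tokens and passed as ss8_tokens alongside it. That usage is an assumption based on the wording of this issue, so check it against the installed esm version.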

ebetica (Contributor) commented Jul 18, 2024

This should be fixed on main; expect a pip release within the next week.

ebetica closed this as completed on Jul 18, 2024