What is the default token, UNK(2) or PAD(0)? #27

Closed

shiyegao opened this issue Jul 3, 2024 · 2 comments

shiyegao commented Jul 3, 2024

The constants module (from esm.utils.constants import esm3 as C) defines two special SS8 tokens:

SS8_UNK_TOKEN = 2
SS8_PAD_TOKEN = 0

If we just call ESM3.forward(sequence_tokens=xxx), ss8_tokens defaults to 2 (UNK).
However, if we call ESM3.generate(xxx), ss8_tokens is 0 (PAD), which comes from default_protein_tensor.

Although training may have taught the model to treat PAD tokens like UNK tokens, I wonder which of the two is the better "none" token when extracting embeddings from the sequence alone.
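To make the two options concrete, here is a minimal sketch of building the SS8 track explicitly instead of relying on either default. It uses plain Python lists so it runs without esm installed. The helper explicit_ss8_tokens and the placeholder sequence ids are hypothetical, for illustration only; the two token ids are the ones quoted above.

```python
# Token ids copied from the constants quoted above (esm.utils.constants.esm3).
SS8_UNK_TOKEN = 2
SS8_PAD_TOKEN = 0


def explicit_ss8_tokens(sequence_tokens, fill=SS8_PAD_TOKEN):
    """Return an SS8 track with one `fill` id per sequence position.

    Hypothetical helper, not part of esm. `sequence_tokens` is a batch
    given as a list of lists of token ids.
    """
    return [[fill] * len(row) for row in sequence_tokens]


# Placeholder ids standing in for a tokenized sequence (batch of 1, length 5).
sequence_tokens = [[0, 5, 6, 7, 2]]

ss8_as_pad = explicit_ss8_tokens(sequence_tokens)
ss8_as_unk = explicit_ss8_tokens(sequence_tokens, fill=SS8_UNK_TOKEN)

print(ss8_as_pad)  # [[0, 0, 0, 0, 0]]
print(ss8_as_unk)  # [[2, 2, 2, 2, 2]]
```

With real tensors the equivalent would presumably be torch.full_like(sequence_tokens, fill), passed as ss8_tokens alongside sequence_tokens. That assumes forward accepts the track as a keyword argument, which I have not verified against every release.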

Also, is this inconsistency a bug? If so, will it be fixed in a later release?

shiyegao changed the title ~~Is default UNK or PAD~~ What is the default token, UNK(2) or PAD(0)?

Contributor

ebetica commented Jul 9, 2024

The default ss8_token should be 0 for an all-masked sequence. Will fix in the next release.

ebetica self-assigned this

santiag0m mentioned this issue

About Generating Protein Sequence Embeddings with Your Model #2

Closed

Contributor

ebetica commented Jul 18, 2024

This should be fixed on main. Expect a release to pip within the next week.

ebetica closed this as completed