
About Generating Protein Sequence Embeddings with Your Model #2

Closed
BlenderWang9487 opened this issue Jun 25, 2024 · 9 comments

@BlenderWang9487
Contributor

Hi!

Thank you for your great work. I would like to ask whether your model can be used solely for generating protein sequence embeddings. For example, given a protein sequence, is there a function that produces its embedding for downstream tasks such as similarity search or property prediction with a simple linear head?

If so, do you have an example script that I can refer to? Or is there a best practice for generating such embeddings?

Thank you!

@santiag0m
Contributor

Sure, you can get embeddings out of ESM3!

from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, SamplingConfig
from esm.utils.constants.models import ESM3_OPEN_SMALL


client = ESM3.from_pretrained(ESM3_OPEN_SMALL, device="cuda")

# Peptidase S1A, chymotrypsin family: https://www.ebi.ac.uk/interpro/structure/PDB/1utn/
protein = ESMProtein(
    sequence=(
        "FIFLALLGAAVAFPVDDDDKIVGGYTCGANTVPYQVSLNSGYHFCGGSLINSQWVVSAAHCYKSGIQVRLGEDNINVVEG"
        "NEQFISASKSIVHPSYNSNTLNNDIMLIKLKSAASLNSRVASISLPTSCASAGTQCLISGWGNTKSSGTSYPDVLKCLKAP"
        "ILSDSSCKSAYPGQITSNMFCAGYLEGGKDSCQGDSGGPVVCSGKLQGIVSWGSGCAQKNKPGVYTKVCNYVSWIKQTIASN"
    )
)
protein_tensor = client.encode(protein)

output = client.forward_and_sample(
    protein_tensor, SamplingConfig(return_per_residue_embeddings=True)
)
print(output.per_residue_embedding.shape)

If you have a PDB file, you can also load the protein directly:

protein = ESMProtein.from_pdb("./1utn.pdb")
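
If what you need downstream is a single fixed-size vector per protein (e.g., for similarity search or a linear head), a common approach is to mean-pool the per-residue embeddings. A minimal sketch, using output from the snippet above; note that, depending on the tokenization, the first and last positions may be special BOS/EOS tokens you may want to slice off first:

# Mean-pool over the residue axis to get one vector per protein.
# If the tokenization adds BOS/EOS tokens, slice them off first,
# e.g. output.per_residue_embedding[1:-1].
sequence_embedding = output.per_residue_embedding.mean(dim=0)
print(sequence_embedding.shape)  # (embedding_dim,)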

@BlenderWang9487
Contributor Author

Got it! 👍

@santiag0m santiag0m pinned this issue Jun 26, 2024
@fulacse

fulacse commented Jun 28, 2024

> Sure, you can get embeddings out of ESM3! [...]

If I understand correctly, the output shape is [num_amino_acids, dim_embed]. I want to process multiple proteins at a time to get a tensor of shape [batch_size, num_amino_acids, dim_embed]. How can I make a batch?

@ddofer

ddofer commented Jul 3, 2024

I'd suggest adding this as an example, or even to the README; it's going to be a recurring question.
(Ideally running on a large set of sequences, and with the overtrained small model?)
Thanks!

@vuhongai

vuhongai commented Jul 6, 2024

> Sure, you can get embeddings out of ESM3! [...]

When the input is sequence-only, as in this example, should the per_residue_embedding calculated by forward_and_sample differ from the embeddings taken directly from the forward pass? If I understood the code here correctly, they should not differ; however, I see that they are very different. Maybe I missed something; can anyone please help me understand why?

Here's my example:

import torch

from esm.models.esm3 import ESM3
from esm.sdk.api import (
    ESMProtein,
    SamplingConfig
)
from esm.utils.constants.models import ESM3_OPEN_SMALL

model = ESM3.from_pretrained(ESM3_OPEN_SMALL, device="cuda")

protein = ESMProtein(
    sequence="FIFLALLGAAVAFPVDDDDKIVGGYTCGANTVPYQVSLNSGYHFCGGSLINSQWVVSAAHCYKSGIQVRLGEDNINVVEG"
)
protein_tensor = model.encode(protein)

with torch.no_grad():
    # Per-residue embeddings straight from the model forward pass
    output1 = model(sequence_tokens=protein_tensor.sequence[None]).embeddings.detach().cpu().numpy()[0]
    # Per-residue embeddings from the forward_and_sample endpoint
    output2 = model.forward_and_sample(
        protein_tensor, SamplingConfig(return_per_residue_embeddings=True)
    ).per_residue_embedding.detach().cpu().numpy()
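
To quantify the gap between the two (assuming both arrays come out with the same shape):

import numpy as np

# How far apart are the two embedding paths?
print(np.abs(output1 - output2).max())
print(np.allclose(output1, output2, atol=1e-4))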

@zas97

zas97 commented Jul 8, 2024

Is there a way to add padding to the sequence before generating the embeddings?

@santiag0m
Contributor

@zas97 check here: #28 (comment)

@santiag0m santiag0m pinned this issue Jul 9, 2024
@santiag0m
Contributor

@vuhongai We use the model forward inside the forward_and_sample endpoint. It is likely that the difference you are seeing comes from how we initialize the defaults.

We are working on it (see #27)
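
One way to see the effect of those defaults is to pass every encoded track to the forward explicitly. A hedged sketch, continuing from the snippet above; the keyword and attribute names below (structure_tokens, sasa_tokens, protein_tensor.structure, protein_tensor.sasa) are assumptions based on the open-source code and may differ across versions:

# Hedged sketch: feed the encoded tracks explicitly instead of relying on the
# forward's internal defaults. Keyword/attribute names are assumptions based
# on the open-source ESM3 code and may differ in your installed version.
with torch.no_grad():
    output3 = model(
        sequence_tokens=protein_tensor.sequence[None],
        structure_tokens=None if protein_tensor.structure is None else protein_tensor.structure[None],
        sasa_tokens=None if protein_tensor.sasa is None else protein_tensor.sasa[None],
    ).embeddings.detach().cpu().numpy()[0]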

@Emalude

Emalude commented Oct 2, 2024

> If I understand correctly, the output shape is [num_amino_acids, dim_embed]. I want to process multiple proteins at a time to get a tensor of shape [batch_size, num_amino_acids, dim_embed]. How can I make a batch?

Any update on this? Right now the only way I can do it is to call the model on each ESMProtein object separately, which I guess is much slower than batch processing.
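
In the meantime, one workaround is to pad the encoded sequence tokens to a common length and run a single forward pass. A minimal sketch, assuming the pad token id can be read from EsmSequenceTokenizer and that the open model's forward accepts a [batch, length] sequence_tokens tensor (both based on the open-source ESM3 code; treat them as assumptions):

import torch

from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein
from esm.tokenization import EsmSequenceTokenizer
from esm.utils.constants.models import ESM3_OPEN_SMALL

model = ESM3.from_pretrained(ESM3_OPEN_SMALL, device="cuda")
tokenizer = EsmSequenceTokenizer()  # assumption: same tokenizer the model uses

sequences = ["FIFLALLGAAVAF", "IVGGYTCGANTVPYQVSLNSGYHFCGG"]
token_lists = [model.encode(ESMProtein(sequence=s)).sequence for s in sequences]

# Right-pad every token tensor to the longest sequence in the batch.
max_len = max(t.shape[0] for t in token_lists)
batch = torch.full(
    (len(token_lists), max_len), tokenizer.pad_token_id, dtype=torch.long, device="cuda"
)
for i, t in enumerate(token_lists):
    batch[i, : t.shape[0]] = t

with torch.no_grad():
    embeddings = model(sequence_tokens=batch).embeddings  # [batch, max_len, dim]
print(embeddings.shape)

Embeddings at padded positions (and at any BOS/EOS tokens) are not meaningful, so mask them out before pooling or comparing proteins of different lengths.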
