
About Generating Protein Sequence Embeddings with Your Model #2

Closed
BlenderWang9487 opened this issue Jun 25, 2024 · 9 comments

@BlenderWang9487
Contributor

Hi!

Thank you for your great work. I would like to ask whether your model can be used solely for generating protein sequence embeddings. For example, given a protein sequence, is there a function that produces its embedding for downstream tasks such as similarity search or property prediction with a simple linear head?

If so, do you have an example script that I can refer to? Or is there a best practice for generating such embeddings?

Thank you!

@santiag0m
Contributor

Sure, you can get embeddings out of ESM3!

from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, SamplingConfig
from esm.utils.constants.models import ESM3_OPEN_SMALL


client = ESM3.from_pretrained(ESM3_OPEN_SMALL, device="cuda")

# Peptidase S1A, chymotrypsin family: https://www.ebi.ac.uk/interpro/structure/PDB/1utn/
protein = ESMProtein(
    sequence=(
        "FIFLALLGAAVAFPVDDDDKIVGGYTCGANTVPYQVSLNSGYHFCGGSLINSQWVVSAAHCYKSGIQVRLGEDNINVVEG"
        "NEQFISASKSIVHPSYNSNTLNNDIMLIKLKSAASLNSRVASISLPTSCASAGTQCLISGWGNTKSSGTSYPDVLKCLKAP"
        "ILSDSSCKSAYPGQITSNMFCAGYLEGGKDSCQGDSGGPVVCSGKLQGIVSWGSGCAQKNKPGVYTKVCNYVSWIKQTIASN"
    )
)
protein_tensor = client.encode(protein)

output = client.forward_and_sample(
    protein_tensor, SamplingConfig(return_per_residue_embeddings=True)
)
print(output.per_residue_embedding.shape)

If you have a PDB file, you can also load the protein directly:

protein = ESMProtein.from_pdb("./1utn.pdb")
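
If what you need downstream is a single fixed-size vector per protein (e.g., for similarity search or a linear head), a common approach is to mean-pool the per-residue embeddings. A minimal sketch, using output from the snippet above; note that, depending on the tokenization, the first and last positions may be special BOS/EOS tokens you may want to slice off first:

# Mean-pool over the residue axis to get one vector per protein.
# If the tokenization adds BOS/EOS tokens, slice them off first,
# e.g. output.per_residue_embedding[1:-1].
sequence_embedding = output.per_residue_embedding.mean(dim=0)
print(sequence_embedding.shape)  # (embedding_dim,)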

@BlenderWang9487
Contributor Author

Got it! 👍

@santiag0m santiag0m pinned this issue Jun 26, 2024
@fulacse

fulacse commented Jun 28, 2024

> Sure, you can get embeddings out of ESM3! [...]

If I understand correctly, the output shape is [num_amino_acids, dim_embed]. I want to process multiple proteins at a time to get a tensor of shape [batch_size, num_amino_acids, dim_embed]. How can I make a batch?

@ddofer

ddofer commented Jul 3, 2024

I'd suggest adding this as an example, or even to the README; it's going to be a recurring question.
(Ideally running on a large set of sequences, and with the overtrained small model?)
Thanks!

@vuhongai

vuhongai commented Jul 6, 2024

> Sure, you can get embeddings out of ESM3! [...]

When the input is sequence-only, as in this example, should the per_residue_embedding calculated by forward_and_sample differ from the embeddings taken directly from the forward pass? If I understood the code here correctly, they should not differ; however, I see that they are very different. Maybe I missed something; can anyone please help me understand why?

Here's my example:

import torch

from esm.models.esm3 import ESM3
from esm.sdk.api import (
    ESMProtein,
    SamplingConfig
)
from esm.utils.constants.models import ESM3_OPEN_SMALL

model = ESM3.from_pretrained(ESM3_OPEN_SMALL, device="cuda")

protein = ESMProtein(
    sequence="FIFLALLGAAVAFPVDDDDKIVGGYTCGANTVPYQVSLNSGYHFCGGSLINSQWVVSAAHCYKSGIQVRLGEDNINVVEG"
)
protein_tensor = model.encode(protein)

with torch.no_grad():
    # Per-residue embeddings straight from the model forward pass
    output1 = model(sequence_tokens=protein_tensor.sequence[None]).embeddings.detach().cpu().numpy()[0]
    # Per-residue embeddings from the forward_and_sample endpoint
    output2 = model.forward_and_sample(
        protein_tensor, SamplingConfig(return_per_residue_embeddings=True)
    ).per_residue_embedding.detach().cpu().numpy()
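
To quantify the gap between the two (assuming both arrays come out with the same shape):

import numpy as np

# How far apart are the two embedding paths?
print(np.abs(output1 - output2).max())
print(np.allclose(output1, output2, atol=1e-4))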

@zas97

zas97 commented Jul 8, 2024

Is there a way to add padding to the sequence before generating the embeddings?

@santiag0m
Contributor

@zas97 check here: #28 (comment)

@santiag0m santiag0m pinned this issue Jul 9, 2024
@santiag0m
Contributor

@vuhongai We use the model forward inside the forward_and_sample endpoint. It is likely that the difference you are seeing comes from how we initialize the defaults.

We are working on it (see #27)
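
One way to see the effect of those defaults is to pass every encoded track to the forward explicitly. A hedged sketch, continuing from the snippet above; the keyword and attribute names below (structure_tokens, sasa_tokens, protein_tensor.structure, protein_tensor.sasa) are assumptions based on the open-source code and may differ across versions:

# Hedged sketch: feed the encoded tracks explicitly instead of relying on the
# forward's internal defaults. Keyword/attribute names are assumptions based
# on the open-source ESM3 code and may differ in your installed version.
with torch.no_grad():
    output3 = model(
        sequence_tokens=protein_tensor.sequence[None],
        structure_tokens=None if protein_tensor.structure is None else protein_tensor.structure[None],
        sasa_tokens=None if protein_tensor.sasa is None else protein_tensor.sasa[None],
    ).embeddings.detach().cpu().numpy()[0]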

@Emalude

Emalude commented Oct 2, 2024

> If I understand correctly, the output shape is [num_amino_acids, dim_embed]. I want to process multiple proteins at a time to get a tensor of shape [batch_size, num_amino_acids, dim_embed]. How can I make a batch?

Any update on this? Right now the only way I can do it is to call the model on each ESMProtein object separately, which I guess is much slower than batch processing.
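
In the meantime, one workaround is to pad the encoded sequence tokens to a common length and run a single forward pass. A minimal sketch, assuming the pad token id can be read from EsmSequenceTokenizer and that the open model's forward accepts a [batch, length] sequence_tokens tensor (both based on the open-source ESM3 code; treat them as assumptions):

import torch

from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein
from esm.tokenization import EsmSequenceTokenizer
from esm.utils.constants.models import ESM3_OPEN_SMALL

model = ESM3.from_pretrained(ESM3_OPEN_SMALL, device="cuda")
tokenizer = EsmSequenceTokenizer()  # assumption: same tokenizer the model uses

sequences = ["FIFLALLGAAVAF", "IVGGYTCGANTVPYQVSLNSGYHFCGG"]
token_lists = [model.encode(ESMProtein(sequence=s)).sequence for s in sequences]

# Right-pad every token tensor to the longest sequence in the batch.
max_len = max(t.shape[0] for t in token_lists)
batch = torch.full(
    (len(token_lists), max_len), tokenizer.pad_token_id, dtype=torch.long, device="cuda"
)
for i, t in enumerate(token_lists):
    batch[i, : t.shape[0]] = t

with torch.no_grad():
    embeddings = model(sequence_tokens=batch).embeddings  # [batch, max_len, dim]
print(embeddings.shape)

Embeddings at padded positions (and at any BOS/EOS tokens) are not meaningful, so mask them out before pooling or comparing proteins of different lengths.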
