About Generating Protein Sequence Embeddings with Your Model #2
Comments
Sure, you can get embeddings out of ESM3!

```python
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, SamplingConfig
from esm.utils.constants.models import ESM3_OPEN_SMALL

client = ESM3.from_pretrained(ESM3_OPEN_SMALL, device="cuda")

# Peptidase S1A, chymotrypsin family: https://www.ebi.ac.uk/interpro/structure/PDB/1utn/
protein = ESMProtein(
    sequence=(
        "FIFLALLGAAVAFPVDDDDKIVGGYTCGANTVPYQVSLNSGYHFCGGSLINSQWVVSAAHCYKSGIQVRLGEDNINVVEG"
        "NEQFISASKSIVHPSYNSNTLNNDIMLIKLKSAASLNSRVASISLPTSCASAGTQCLISGWGNTKSSGTSYPDVLKCLKAP"
        "ILSDSSCKSAYPGQITSNMFCAGYLEGGKDSCQGDSGGPVVCSGKLQGIVSWGSGCAQKNKPGVYTKVCNYVSWIKQTIASN"
    )
)

protein_tensor = client.encode(protein)
output = client.forward_and_sample(
    protein_tensor, SamplingConfig(return_per_residue_embeddings=True)
)
print(output.per_residue_embedding.shape)
```

If you have a PDB file you can also load the protein directly:

```python
protein = ESMProtein.from_pdb("./1utn.pdb")
```
Got it! 👍
If I understand correctly, the output shape is [num_amino_acids, dim_embed]. I want to process multiple proteins at a time to get a tensor of shape [batch_size, num_amino_acids, dim_embed]. How can I make a batch?
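In the absence of an official batch API in the SDK, one workaround is to run the model once per protein and then pad the resulting per-residue embeddings to a common length yourself. A minimal sketch of the padding step (the `pad_embeddings` helper is hypothetical, not part of the esm package), assuming each `forward_and_sample` call produced a `[L_i, dim_embed]` array:

```python
import numpy as np

def pad_embeddings(embeddings, pad_value=0.0):
    """Stack variable-length [L_i, D] per-residue embeddings into a
    [batch_size, max_len, D] array, zero-padding shorter sequences.
    Also returns a boolean mask marking the real (non-pad) positions."""
    max_len = max(e.shape[0] for e in embeddings)
    dim = embeddings[0].shape[1]
    batch = np.full((len(embeddings), max_len, dim), pad_value,
                    dtype=embeddings[0].dtype)
    mask = np.zeros((len(embeddings), max_len), dtype=bool)
    for i, e in enumerate(embeddings):
        batch[i, : e.shape[0]] = e
        mask[i, : e.shape[0]] = True
    return batch, mask

# Dummy per-residue embeddings for two proteins of lengths 5 and 8:
embs = [np.random.rand(5, 16), np.random.rand(8, 16)]
batch, mask = pad_embeddings(embs)
print(batch.shape)  # (2, 8, 16)
```

The mask lets a downstream head (e.g. mean pooling or a linear probe) ignore the padded positions.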
I'd suggest adding this as an example, or even to the README; it's going to be a recurring question.
When the input is sequence only, as in this example, should the per_residue_embedding computed by forward_and_sample differ from the embedding returned directly by the forward pass? If I understood the code correctly, they should not be different, right? However, I see that they are very different. Maybe I missed something; can anyone please help me understand why? Here's my example:
Is there a way to add padding to the sequences before generating the embeddings?
@zas97 check here: #28 (comment)
Any update on this? Right now, the only way I can do this is by calling the model separately on each ESMProtein object, which I assume is much slower than batch processing.
Hi!
Thank you for your great work. I would like to ask whether your model can be used solely for generating protein sequence embeddings. For example, given a protein sequence, is there a function that produces its embedding for downstream tasks such as similarity search or property prediction with a simple linear head?
If so, do you have an example script I can refer to? Or is there a best practice for generating such embeddings?
Thank you!