Training data for single chain model #120

Open
LivC93 opened this issue Nov 27, 2024 · 0 comments

LivC93 commented Nov 27, 2024

Hi, thank you for sharing the code and the weights with the community and congrats on your results and subsequent work.

I wanted to ask about the single chain model. You state in your paper:

"We first sought to improve performance of the model on recovering the amino acid sequences of native monomeric proteins given their backbone structures, using as training and validation sets 19.7k high resolution single-chain structures from the PDB split based on the CATH protein classification"

Could you make this list of 19.7k PDB IDs available? If not, could you clarify the following points:

  1. What does "high resolution" mean exactly? Is it the same resolution cutoff as for the rest of the paper, 3.5 Å?
  2. Are there any sequence length constraints? For the multi-chain model you state: less than 10,000 residues.
  3. Is the cutoff date the same as for the rest of the paper, Aug 02, 2021?
  4. Are there any other filters in place, such as discarding chains that have too many missing residues or too high a coil content?

Using the guidelines in your paper I end up with over 60,000 distinct PDB IDs. I am not sure how to reach your 19.7k set.
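
To make the question concrete, here is a minimal sketch of the kind of filtering the points above describe. It is only an illustration: the `ChainRecord` fields, the function names, and the 10% missing-residue threshold are hypothetical, and the 3.5 Å, 10,000-residue, and Aug 02, 2021 defaults are the values quoted above for the rest of the paper, which may not be the ones used for the single-chain set.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ChainRecord:
    """Hypothetical per-chain metadata; field names are assumptions."""
    pdb_id: str
    resolution: float        # resolution in angstroms
    length: int              # number of residues in the chain
    deposited: date          # deposition date of the entry
    missing_fraction: float  # fraction of residues without coordinates


def passes_filters(
    rec: ChainRecord,
    max_resolution: float = 3.5,         # cutoff quoted for the rest of the paper
    max_length: int = 10_000,            # multi-chain limit quoted above
    cutoff: date = date(2021, 8, 2),     # date cutoff quoted above
    max_missing: float = 0.10,           # assumed threshold, not from the paper
) -> bool:
    """Return True if the chain satisfies every filter."""
    return (
        rec.resolution <= max_resolution
        and rec.length < max_length
        and rec.deposited <= cutoff
        and rec.missing_fraction <= max_missing
    )


def distinct_pdb_ids(records, **filter_kwargs):
    """Collect the distinct PDB IDs of all chains that pass the filters."""
    return sorted({r.pdb_id for r in records if passes_filters(r, **filter_kwargs)})
```

Tightening any of these thresholds (for example `max_resolution`) shrinks the resulting set, which is why knowing the exact values matters for reproducing the 19.7k split.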

Thank you
