Question about Post-Translational Modifications (PTMs) in Protein Prediction #54

Open
wtni-gidle opened this issue Nov 15, 2024 · 5 comments
Labels
bug Something isn't working question Further information is requested


@wtni-gidle

Thanks for providing the AF3 source!
To test AlphaFold3 with the 7BBV example provided by the AlphaFold Server, I used the following JSON file as input:

{
    "name": "7BBV",
    "modelSeeds": [1],
    "sequences": [
        {
            "protein": {
                "id": "A",
                "sequence": "TPTPTIQEDGSPALIAKRASVTESCNIGYASTNGGTTGGKGGATTTVSTLAQFTKAAESSGKLNIVVKGKISGGAKVRVQSDKTIIGQKGSELVGTGLYINKVKNVIVRNMKISKVKDSNGDAIGIQASKNVWVDHCDLSSDLKSGKDYYDGLLDITHGSDWVTVSNTFLHDHFKASLIGHTDSNAKEDKGKLHVTYANNYWYNVNSRNPSVRFGTVHIYNNYYLEVGSSAVNTRMGAQVRVESTVFDKSTKNGIISVDSKEKGYATVGDISWGSSTNTAPKGTLGSSNIPYSYNLYGKNNVKARVYGTAGQTLGFAAASFLEQKLISEEDLNSAVDHHHHHH",
                "modifications": [
                    {"ptmType": "MAN", "ptmPosition": 22},
                    {"ptmType": "MAN", "ptmPosition": 44},
                    {"ptmType": "MAN", "ptmPosition": 45},
                    {"ptmType": "MAN", "ptmPosition": 46},
                    {"ptmType": "MAN", "ptmPosition": 48},
                    {"ptmType": "MAN", "ptmPosition": 54}
                ]
            }
        },
        {
            "ligand": {
                "id": "B",
                "ccdCodes": ["CA"]
            }
        }

    ],
    "dialect": "alphafold3",
    "version": 1
}

However, during the run_inference stage, the following error occurred:

ValueError: First MSA sequence TPTPTIQEDGSPALIAKRASVTESCNIGYASTNGGTTGGKGGATTTVSTLAQFTKAAESSGKLNIVVKGKISGGAKVRVQSDKTIIGQKGSELVGTGLYINKVKNVIVRNMKISKVKDSNGDAIGIQASKNVWVDHCDLSSDLKSGKDYYDGLLDITHGSDWVTVSNTFLHDHFKASLIGHTDSNAKEDKGKLHVTYANNYWYNVNSRNPSVRFGTVHIYNNYYLEVGSSAVNTRMGAQVRVESTVFDKSTKNGIISVDSKEKGYATVGDISWGSSTNTAPKGTLGSSNIPYSYNLYGKNNVKARVYGTAGQTLGFAAASFLEQKLISEEDLNSAVDHHHHHH is not the query_sequence='TPTPTIQEDGSPALIAKRASVXESCNIGYASTNGGTTGGKGGAXXXVXTLAQFXKAAESSGKLNIVVKGKISGGAKVRVQSDKTIIGQKGSELVGTGLYINKVKNVIVRNMKISKVKDSNGDAIGIQASKNVWVDHCDLSSDLKSGKDYYDGLLDITHGSDWVTVSNTFLHDHFKASLIGHTDSNAKEDKGKLHVTYANNYWYNVNSRNPSVRFGTVHIYNNYYLEVGSSAVNTRMGAQVRVESTVFDKSTKNGIISVDSKEKGYATVGDISWGSSTNTAPKGTLGSSNIPYSYNLYGKNNVKARVYGTAGQTLGFAAASFLEQKLISEEDLNSAVDHHHHHH'

Upon inspecting the error, I observed that:

  1. Modified residues in the query sequence are converted to X.
  2. However, the first MSA sequence retains the original, unmodified residues.
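
To make the two observations above concrete, here is a minimal sketch that lists where the two sequences from the error message differ. The helper name `mismatch_positions` is made up for illustration, and only the first 60 residues of each sequence are used to keep the example short.

```python
def mismatch_positions(msa_first: str, query: str) -> list[int]:
    """Return the 1-based positions where two equal-length sequences differ."""
    if len(msa_first) != len(query):
        raise ValueError("sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(msa_first, query), start=1) if a != b]


# First 60 residues of each sequence from the error message (truncated here),
# written in chunks of 10 so positions are easy to count.
msa_prefix = (
    "TPTPTIQEDG" "SPALIAKRAS" "VTESCNIGYA" "STNGGTTGGK" "GGATTTVSTL" "AQFTKAAESS"
)
query_prefix = (
    "TPTPTIQEDG" "SPALIAKRAS" "VXESCNIGYA" "STNGGTTGGK" "GGAXXXVXTL" "AQFXKAAESS"
)

positions = mismatch_positions(msa_prefix, query_prefix)
print(positions)  # [22, 44, 45, 46, 48, 54]
```

The differing positions are exactly the `ptmPosition` values in the input JSON, which is what points at the modifications as the cause.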

Is it necessary for me to modify the input file, or is there a bug in the source code?

Thanks.

@wtni-gidle
Author

I spent some time analyzing the source code and found that:

During the run_inference stage, the query_sequence is generated as follows:

if chain_type in mmcif_names.POLYMER_CHAIN_TYPES:
    sequence = substruct.chain_single_letter_sequence()[b_chain_id]

This relies on struct:

struct = fold_input.to_structure(ccd=ccd)

The sequence stored in struct is built as follows:

match chain:
    case ProteinChain():
        sequences.append('(' + ')('.join(chain.to_ccd_sequence()) + ')')

to_ccd_sequence replaces the original residues with the CCD codes given in the PTMs:

for ptm_code, ptm_index in self.ptms:
    ccd_coded_seq[ptm_index - 1] = ptm_code

This means that the query_sequence is not exactly the same as the original input. It first replaces the original residues based on the provided modifications. Then, the 'actual' sequence is extracted using the chain_single_letter_sequence function.

In the chain_single_letter_sequence function, the residue_names.CCD_NAME_TO_ONE_LETTER dictionary is used to convert CCD residues to their corresponding one-letter codes.

chain_res = string_array.remap(
    chain_res,
    mapping=residue_names.CCD_NAME_TO_ONE_LETTER,
    inplace=False,
    default_value=unknown_default,
)

So, if I understand correctly, if a CCD code from the modifications is present in the CCD_NAME_TO_ONE_LETTER dictionary and the converted residue matches the original residue, there will be no sequence mismatch error. However, if the converted residue does not match the original (i.e. CCD code is transformed into X, N, or some other residue), this would result in the same error as mentioned in the issue.
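
The round trip described above can be imitated in a few lines. This is only a sketch under stated assumptions: `ONE_TO_THREE` and the local `CCD_NAME_TO_ONE_LETTER` are tiny hypothetical stand-ins for the real AlphaFold 3 tables, `derive_query_sequence` is a made-up name, and the HY3 -> P and P1L -> C entries are taken from the later comment in this thread.

```python
# Hypothetical, heavily truncated stand-ins for the real AlphaFold 3 tables.
ONE_TO_THREE = {
    "A": "ALA", "C": "CYS", "D": "ASP", "E": "GLU", "N": "ASN",
    "P": "PRO", "S": "SER", "T": "THR", "V": "VAL",
}
CCD_NAME_TO_ONE_LETTER = {three: one for one, three in ONE_TO_THREE.items()}
# Modified residues that map back to a parent residue (values from this thread).
CCD_NAME_TO_ONE_LETTER.update({"HY3": "P", "P1L": "C"})


def derive_query_sequence(sequence, ptms, unknown="X"):
    """Imitate the round trip: one-letter -> CCD codes -> apply PTMs -> one-letter."""
    ccd_seq = [ONE_TO_THREE[res] for res in sequence]
    for ptm_code, ptm_index in ptms:
        ccd_seq[ptm_index - 1] = ptm_code  # PTM positions are 1-based
    # Codes missing from the table fall back to the unknown letter.
    return "".join(CCD_NAME_TO_ONE_LETTER.get(code, unknown) for code in ccd_seq)


# MAN has no one-letter code, so the residue becomes X and no longer matches.
print(derive_query_sequence("ASVTES", [("MAN", 4)]))             # ASVXES
# HY3 and P1L map back to P and C, so this input round-trips unchanged.
print(derive_query_sequence("SPADC", [("HY3", 2), ("P1L", 5)]))  # SPADC
```

In this toy model a mismatch appears whenever the derived string differs from the input, either because the code is unmapped (X) or because it maps to a different parent residue.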

I then checked CCD_NAME_TO_ONE_LETTER and found that it has no single-letter entry for MAN. When I ran the example given in https://github.com/google-deepmind/alphafold3/blob/main/docs/input.md#protein, it completed successfully, and the CCD code used there does have a corresponding single letter in CCD_NAME_TO_ONE_LETTER. These experiments appear to confirm my analysis.

Therefore, I think there should be a restriction on which CCD codes can be used in PTM (e.g. they must be recorded in CCD_NAME_TO_ONE_LETTER), or the source code should be improved to better handle the identification of query_sequence.
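
As an illustration of the first option, a pre-flight check could look roughly like the sketch below. `check_ptms` is a hypothetical helper, not an existing AlphaFold 3 function, and the mapping passed in is assumed to have the same shape as CCD_NAME_TO_ONE_LETTER (CCD code -> one-letter code).

```python
def check_ptms(sequence, ptms, ccd_to_one_letter):
    """Return a list of human-readable problems; an empty list means the PTMs look safe."""
    problems = []
    for ptm_code, ptm_index in ptms:
        if not 1 <= ptm_index <= len(sequence):
            problems.append(f"{ptm_code}@{ptm_index}: position out of range")
            continue
        original = sequence[ptm_index - 1]
        mapped = ccd_to_one_letter.get(ptm_code)
        if mapped is None:
            problems.append(f"{ptm_code}@{ptm_index}: no one-letter code")
        elif mapped != original:
            problems.append(
                f"{ptm_code}@{ptm_index}: maps to {mapped}, but sequence has {original}"
            )
    return problems


# Tiny stand-in mapping for the example; the real table is much larger.
mapping = {"HY3": "P", "P1L": "C"}
print(check_ptms("ASVTES", [("MAN", 4)], mapping))
print(check_ptms("SNADE", [("HY3", 2), ("P1L", 5)], mapping))
print(check_ptms("SPADC", [("HY3", 2), ("P1L", 5)], mapping))
```

Running such a check before inference would turn the late MSA mismatch into an early, explicit message about the offending modification.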

Please let me know if I’m wrong.

Thanks.

@joshabramson
Collaborator

"MAN" is a glycan and should be defined as a bonded ligand, see https://github.com/google-deepmind/alphafold3/blob/main/docs/input.md#bonds

Please note that converting AlphaFold-Server JSONs containing glycans is not currently supported, see https://github.com/google-deepmind/alphafold3/blob/main/docs/input.md#glycans

For PDB examples where a cif already exists, one can create the input json from the cif using from_mmcif in the folding_input class: https://github.com/google-deepmind/alphafold3/blob/main/src/alphafold3/common/folding_input.py#L795C7-L795C17 (we will add this info to the input docs soon)
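
For readers who want to try the bonded-ligand route, here is a rough sketch of how such an input could be generated. It only follows the `bondedAtomPairs` layout described in the linked input docs; the helper name `glycans_as_bonded_ligands`, the ligand chain IDs, and the atom names (OG1 for Thr, OG for Ser, C1 for the sugar) are assumptions for illustration and should be checked against the actual structure and input.md before use.

```python
import json
import string

# Assumed side-chain oxygen names for O-linked sites (Ser/Thr only in this sketch).
SIDE_CHAIN_OXYGEN = {"S": "OG", "T": "OG1"}


def glycans_as_bonded_ligands(name, protein_id, sequence, sites,
                              ccd_code="MAN", glycan_atom="C1"):
    """Build an input dict with one single-residue ligand bonded to each site."""
    # Hand out single-letter ligand IDs, skipping the protein chain ID.
    free_ids = (c for c in string.ascii_uppercase if c != protein_id)
    sequences = [{"protein": {"id": protein_id, "sequence": sequence}}]
    bonds = []
    for position in sites:  # 1-based residue positions
        residue = sequence[position - 1]
        if residue not in SIDE_CHAIN_OXYGEN:
            raise ValueError(f"position {position} is {residue}, expected S or T")
        ligand_id = next(free_ids)
        sequences.append({"ligand": {"id": ligand_id, "ccdCodes": [ccd_code]}})
        bonds.append([
            [protein_id, position, SIDE_CHAIN_OXYGEN[residue]],
            [ligand_id, 1, glycan_atom],
        ])
    return {
        "name": name,
        "modelSeeds": [1],
        "sequences": sequences,
        "bondedAtomPairs": bonds,
        "dialect": "alphafold3",
        "version": 1,
    }


# Short toy sequence with a Thr at position 4 and a Ser at position 6.
fold_input = glycans_as_bonded_ligands("glycan_sketch", "A", "ASVTES", [4, 6])
print(json.dumps(fold_input, indent=2))
```

The protein entry carries no `modifications` here, so the query sequence stays identical to the input and the MSA check is not affected.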

@joshabramson
Collaborator

On your follow-up message: thanks for digging into the code!

> I think there should be a restriction on which CCD codes can be used in PTM (e.g. they must be recorded in CCD_NAME_TO_ONE_LETTER), or the source code should be improved to better handle the identification of query_sequence

Good find - we will look into this.

@Augustin-Zidek Augustin-Zidek added question Further information is requested bug Something isn't working labels Nov 18, 2024
@zhaisilong

zhaisilong commented Nov 28, 2024

@wtni-gidle I ran into a similar issue while working on my modified sequence. Following your guidance, I successfully ran a modified example. Here’s the case:

If you want to apply modifications at positions 2 and 5, changing them to HY3 and P1L for a sequence like "S(N)AD(E)VTVGKFYATFLIQEYFRKFKKRKEQGLVGKPS" (from PDB entry 6DAE, chain C), you might encounter issues. Specifically, the query sequence derived after applying the modifications would not match the input sequence.

To resolve this, you can refer to the reversed mapping dictionary from CCD_NAME_TO_ONE_LETTER. This dictionary shows that HY3 maps back to P and P1L maps back to C. Thus, the correct input sequence should be "S(P)AD(C)VTVGKFYATFLIQEYFRKFKKRKEQGLVGKPS". When provided in this format, AF3 works correctly.
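
The workaround above can be automated with a small helper. This is a sketch only: `align_sequence_to_ptms` is a made-up name, and the two-entry mapping stands in for the real CCD_NAME_TO_ONE_LETTER table, using the HY3 -> P and P1L -> C values quoted above.

```python
def align_sequence_to_ptms(sequence, ptms, ccd_to_one_letter):
    """Rewrite residues at PTM positions to the one-letter code each PTM maps to."""
    residues = list(sequence)
    for ptm_code, ptm_index in ptms:
        if ptm_code not in ccd_to_one_letter:
            # Codes without a one-letter entry (e.g. glycans) cannot be fixed this way.
            raise ValueError(f"{ptm_code} has no one-letter code")
        residues[ptm_index - 1] = ccd_to_one_letter[ptm_code]  # 1-based positions
    return "".join(residues)


mapping = {"HY3": "P", "P1L": "C"}  # stand-in for the real table
ptms = [("HY3", 2), ("P1L", 5)]

fixed = align_sequence_to_ptms("SNADEVTVGKFYATFLIQEYFRKFKKRKEQGLVGKPS", ptms, mapping)
print(fixed)  # SPADCVTVGKFYATFLIQEYFRKFKKRKEQGLVGKPS

# Matching "modifications" entries in the input JSON format used earlier in this thread.
modifications = [{"ptmType": code, "ptmPosition": pos} for code, pos in ptms]
print(modifications)
```

Note that this changes the sequence used for the MSA search as well, since the parent residue replaces whatever was written at that position.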

@Daniel-Halvey

I am just looking at this for the first time today. I am confident we will have an elegant solution for the X mutation in run_inference in due time. I am just good at typing creative solutions; my Old Man is an M.D. Let me load up PyCharm.
