Question about Post-Translational Modifications (PTMs) in Protein Prediction #54

wtni-gidle · 2024-11-15T18:13:45Z

Thanks for providing the AF3 source!
To test AlphaFold3 using the example 7BBV provided by AlphaFold3 server, I used the following JSON file as input:

{
    "name": "7BBV",
    "modelSeeds": [1],
    "sequences": [
        {
            "protein": {
                "id": "A",
                "sequence": "TPTPTIQEDGSPALIAKRASVTESCNIGYASTNGGTTGGKGGATTTVSTLAQFTKAAESSGKLNIVVKGKISGGAKVRVQSDKTIIGQKGSELVGTGLYINKVKNVIVRNMKISKVKDSNGDAIGIQASKNVWVDHCDLSSDLKSGKDYYDGLLDITHGSDWVTVSNTFLHDHFKASLIGHTDSNAKEDKGKLHVTYANNYWYNVNSRNPSVRFGTVHIYNNYYLEVGSSAVNTRMGAQVRVESTVFDKSTKNGIISVDSKEKGYATVGDISWGSSTNTAPKGTLGSSNIPYSYNLYGKNNVKARVYGTAGQTLGFAAASFLEQKLISEEDLNSAVDHHHHHH",
                "modifications": [
                    {"ptmType": "MAN", "ptmPosition": 22},
                    {"ptmType": "MAN", "ptmPosition": 44},
                    {"ptmType": "MAN", "ptmPosition": 45},
                    {"ptmType": "MAN", "ptmPosition": 46},
                    {"ptmType": "MAN", "ptmPosition": 48},
                    {"ptmType": "MAN", "ptmPosition": 54}
                ]
            }
        },
        {
            "ligand": {
                "id": "B",
                "ccdCodes": ["CA"]
            }
        }

    ],
    "dialect": "alphafold3",
    "version": 1
}

However, during the run_inference stage, the following error occurred:

ValueError: First MSA sequence TPTPTIQEDGSPALIAKRASVTESCNIGYASTNGGTTGGKGGATTTVSTLAQFTKAAESSGKLNIVVKGKISGGAKVRVQSDKTIIGQKGSELVGTGLYINKVKNVIVRNMKISKVKDSNGDAIGIQASKNVWVDHCDLSSDLKSGKDYYDGLLDITHGSDWVTVSNTFLHDHFKASLIGHTDSNAKEDKGKLHVTYANNYWYNVNSRNPSVRFGTVHIYNNYYLEVGSSAVNTRMGAQVRVESTVFDKSTKNGIISVDSKEKGYATVGDISWGSSTNTAPKGTLGSSNIPYSYNLYGKNNVKARVYGTAGQTLGFAAASFLEQKLISEEDLNSAVDHHHHHH is not the query_sequence='TPTPTIQEDGSPALIAKRASVXESCNIGYASTNGGTTGGKGGAXXXVXTLAQFXKAAESSGKLNIVVKGKISGGAKVRVQSDKTIIGQKGSELVGTGLYINKVKNVIVRNMKISKVKDSNGDAIGIQASKNVWVDHCDLSSDLKSGKDYYDGLLDITHGSDWVTVSNTFLHDHFKASLIGHTDSNAKEDKGKLHVTYANNYWYNVNSRNPSVRFGTVHIYNNYYLEVGSSAVNTRMGAQVRVESTVFDKSTKNGIISVDSKEKGYATVGDISWGSSTNTAPKGTLGSSNIPYSYNLYGKNNVKARVYGTAGQTLGFAAASFLEQKLISEEDLNSAVDHHHHHH'

Upon inspecting the error, I observed that:

Modified residues in the query sequence are converted to X.
However, the first MSA retains the original unmodified sequence.

Is it necessary for me to modify the input file, or is there a bug in the source code?

Thanks.


wtni-gidle · 2024-11-16T14:39:00Z

I spent some time analyzing the source code and found that:

During the run_inference stage, the query_sequence is generated as follows:

alphafold3/src/alphafold3/model/features.py

Lines 444 to 445 in 2ffe43f

    
if chain_type in mmcif_names.POLYMER_CHAIN_TYPES:
  sequence = substruct.chain_single_letter_sequence()[b_chain_id]

This relies on struct:

alphafold3/src/alphafold3/model/pipeline/pipeline.py

Line 166 in 2ffe43f

struct = fold_input.to_structure(ccd=ccd)

The sequence stored in struct is built here:

alphafold3/src/alphafold3/common/folding_input.py

Lines 912 to 914 in 2ffe43f

    
match chain:
  case ProteinChain():
    sequences.append('(' + ')('.join(chain.to_ccd_sequence()) + ')')

to_ccd_sequence replaces each modified residue with the CCD code given in the corresponding PTM entry:

alphafold3/src/alphafold3/common/folding_input.py

Lines 235 to 236 in 2ffe43f

    
for ptm_code, ptm_index in self.ptms:
  ccd_coded_seq[ptm_index - 1] = ptm_code

This means that the query_sequence is not exactly the same as the original input. It first replaces the original residues based on the provided modifications. Then, the 'actual' sequence is extracted using the chain_single_letter_sequence function.

In the chain_single_letter_sequence function, the residue_names.CCD_NAME_TO_ONE_LETTER dictionary is used to convert CCD residues to their corresponding one-letter codes.

alphafold3/src/alphafold3/structure/structure.py

Lines 1940 to 1945 in 2ffe43f

    
chain_res = string_array.remap(
    chain_res,
    mapping=residue_names.CCD_NAME_TO_ONE_LETTER,
    inplace=False,
    default_value=unknown_default,
)

So, if I understand correctly, there is no sequence mismatch error only when the CCD code from the modifications is present in the CCD_NAME_TO_ONE_LETTER dictionary and the converted residue matches the original residue. If the converted residue does not match the original (i.e. the CCD code is transformed into X, N, or some other residue), the result is the same error as reported in this issue.
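The behaviour described above can be reproduced with a minimal sketch. The two dictionaries below are small hypothetical stand-ins, not the real residue_names.CCD_NAME_TO_ONE_LETTER table, and query_sequence is an illustrative helper rather than an AlphaFold 3 function:

```python
# Toy stand-ins (assumption): the real table lives in alphafold3's residue_names module.
TOY_CCD_TO_ONE_LETTER = {"ALA": "A", "SER": "S", "THR": "T", "VAL": "V", "SEP": "S"}
ONE_TO_THREE = {"A": "ALA", "S": "SER", "T": "THR", "V": "VAL"}


def query_sequence(sequence, ptms, unknown="X"):
    """Mimic the two steps: apply PTM CCD codes, then map back to one letter."""
    ccd_seq = [ONE_TO_THREE[aa] for aa in sequence]
    for ptm_code, ptm_index in ptms:
        ccd_seq[ptm_index - 1] = ptm_code  # PTM positions are 1-based
    # Codes missing from the table fall back to the unknown letter.
    return "".join(TOY_CCD_TO_ONE_LETTER.get(code, unknown) for code in ccd_seq)


original = "ASVT"
# SEP is in the toy table and maps back to S, so the round trip is lossless.
print(query_sequence(original, [("SEP", 2)]) == original)
# MAN is not in the toy table, so position 4 becomes X and the comparison fails.
print(query_sequence(original, [("MAN", 4)]) == original)
```

Under these assumptions, only the second case would trip a check that compares the first MSA sequence with the query sequence.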

I then checked CCD_NAME_TO_ONE_LETTER and found that it has no single-letter entry for MAN. When I ran the example given in https://github.com/google-deepmind/alphafold3/blob/main/docs/input.md#protein, it ran successfully, and the CCD code used there does have a corresponding single letter in CCD_NAME_TO_ONE_LETTER. Both observations are consistent with my explanation.

Therefore, I think there should be a restriction on which CCD codes can be used in PTM (e.g. they must be recorded in CCD_NAME_TO_ONE_LETTER), or the source code should be improved to better handle the identification of query_sequence.
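One way to picture the proposed restriction is a pre-flight check that runs before inference. This is only a sketch: find_mismatched_ptms is a hypothetical helper, and the mapping passed in is a one-entry placeholder for CCD_NAME_TO_ONE_LETTER.

```python
def find_mismatched_ptms(sequence, ptms, ccd_to_one_letter):
    """Return PTMs whose CCD code does not map back to the original residue.

    Each result is (ccd_code, position, original_residue, mapped_letter), where
    mapped_letter is None if the code is absent from the mapping.
    """
    problems = []
    for ccd_code, position in ptms:
        original = sequence[position - 1]  # positions are 1-based
        mapped = ccd_to_one_letter.get(ccd_code)
        if mapped != original:
            problems.append((ccd_code, position, original, mapped))
    return problems


# Placeholder mapping (assumption): the real table is much larger.
toy_mapping = {"SEP": "S"}
print(find_mismatched_ptms("ASVT", [("SEP", 2), ("MAN", 4)], toy_mapping))
```

A check like this could reject the input with a clear message instead of failing later on the MSA comparison.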

Please let me know if I’m wrong.

Thanks.

joshabramson · 2024-11-18T10:02:33Z

"MAN" is a glycan and should be defined as a bonded ligand, see https://github.com/google-deepmind/alphafold3/blob/main/docs/input.md#bonds

Please note that converting AlphaFold-Server JSONs containing glycans is not currently supported, see https://github.com/google-deepmind/alphafold3/blob/main/docs/input.md#glycans

For PDB examples where a cif already exists, one can create the input json from the cif using from_mmcif in the folding_input class: https://github.com/google-deepmind/alphafold3/blob/main/src/alphafold3/common/folding_input.py#L795C7-L795C17 (we will add this info to the input docs soon)
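As a rough illustration of the bonded-ligand form, the sketch below builds an input dictionary with one MAN ligand per glycosylated residue and a bondedAtomPairs entry for each. It rests on several assumptions: the helper name is hypothetical, the atom names (OG for Ser, OG1 for Thr, C1 for MAN) are guesses at the O-glycosidic bond and should be checked against the structure, and any other ligands from the original input (such as the calcium ion) would need to be added back with non-clashing ids.

```python
import json

# Assumed side-chain oxygen atom names for O-linked glycosylation sites.
SIDE_CHAIN_OXYGEN = {"S": "OG", "T": "OG1"}


def mannose_as_bonded_ligands(name, sequence, positions, seed=1):
    """Build an AF3-style input dict with each MAN attached via bondedAtomPairs."""
    sequences = [{"protein": {"id": "A", "sequence": sequence}}]
    bonds = []
    for offset, position in enumerate(positions):
        ligand_id = chr(ord("B") + offset)  # B, C, D, ... one chain per glycan
        residue = sequence[position - 1]    # positions are 1-based
        sequences.append({"ligand": {"id": ligand_id, "ccdCodes": ["MAN"]}})
        bonds.append(
            [["A", position, SIDE_CHAIN_OXYGEN[residue]], [ligand_id, 1, "C1"]]
        )
    return {
        "name": name,
        "modelSeeds": [seed],
        "sequences": sequences,
        "bondedAtomPairs": bonds,
        "dialect": "alphafold3",
        "version": 1,
    }


# Short toy sequence; for 7BBV the positions would be 22, 44, 45, 46, 48 and 54.
print(json.dumps(mannose_as_bonded_ligands("toy", "ASVT", [2, 4]), indent=2))
```

The protein sequence stays unmodified in this form, so the query sequence and the first MSA sequence should agree.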

joshabramson · 2024-11-18T10:05:19Z

On your follow-up message: thanks for digging into the code!

> I think there should be a restriction on which CCD codes can be used in PTM (e.g. they must be recorded in CCD_NAME_TO_ONE_LETTER), or the source code should be improved to better handle the identification of query_sequence

Good find - we will look into this.

zhaisilong · 2024-11-28T13:47:35Z

@wtni-gidle I encountered a similar question while working on my modified sequence. Following your guidance, I successfully ran a modified example. Here’s the case:

Suppose you want to apply modifications at positions 2 and 5, changing them to HY3 and P1L, for a sequence like "S(N)AD(E)VTVGKFYATFLIQEYFRKFKKRKEQGLVGKPS" (from PDB entry 6DAE, chain C; parentheses mark the modified positions). Supplying the sequence this way causes problems: the residues at those positions (N and E) are not the ones the two CCD codes map back to, so the resulting sequence is incorrect.

To resolve this, you can refer to the reversed mapping dictionary from CCD_NAME_TO_ONE_LETTER. This dictionary shows that HY3 maps back to P and P1L maps back to C. Thus, the correct input sequence should be "S(P)AD(C)VTVGKFYATFLIQEYFRKFKKRKEQGLVGKPS". When provided in this format, AF3 works correctly.
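The substitution described here can be sketched as follows. The mapping contains only the two entries reported in this comment (HY3 to P, P1L to C), and sequence_for_ptms is a hypothetical helper, not part of AlphaFold 3:

```python
# Entries as reported above; in practice they would be looked up in CCD_NAME_TO_ONE_LETTER.
PTM_TO_ONE_LETTER = {"HY3": "P", "P1L": "C"}


def sequence_for_ptms(sequence, ptms, mapping):
    """Replace each modified position with the letter its CCD code maps to."""
    residues = list(sequence)
    for ccd_code, position in ptms:
        residues[position - 1] = mapping[ccd_code]  # positions are 1-based
    return "".join(residues)


raw = "SNADEVTVGKFYATFLIQEYFRKFKKRKEQGLVGKPS"
fixed = sequence_for_ptms(raw, [("HY3", 2), ("P1L", 5)], PTM_TO_ONE_LETTER)
print(fixed)  # SPADCVTVGKFYATFLIQEYFRKFKKRKEQGLVGKPS
```

The result is the sequence to put in the "sequence" field, with HY3 and P1L still listed under "modifications" at positions 2 and 5.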

Daniel-Halvey · 2024-11-30T14:13:46Z

I am looking at this for the first time today. I am confident we will have an elegant solution for the X substitution in run_inference in due time. I am just good at typing creative solutions; my Old Man is an M.D. Let me load up PyCharm.

Augustin-Zidek added the labels question (Further information is requested) and bug (Something isn't working) · Nov 18, 2024

Augustin-Zidek mentioned this issue Nov 29, 2024

First MSA sequence is not the {query_sequence=}' #132

Open
