With multiple ligand copies (SMILES), sometimes get "Failed to construct RDKit reference structure" #102

Open
smg3d opened this issue Nov 21, 2024 · 4 comments
Labels
question Further information is requested third party tool Issue with a third party tool

Comments

@smg3d

smg3d commented Nov 21, 2024

Input is one protein + N copies of the same ligand.

Depending on the value of N (40, 50, 60, 80, 100, ..., 200), I get between 1 and 6 RDKit warnings while the SMILES reference structures are being constructed. The warning looks like this:

I1120 02:11:55.910477 140432297644032 features.py:1499] Success constructing SMILES reference structure for: LIG_BS
I1120 02:11:55.964899 140432297644032 features.py:1499] Success constructing SMILES reference structure for: LIG_BT
W1120 02:11:55.997912 140432297644032 features.py:1519] Failed to construct RDKit reference structure for: LIG_BT
W1120 02:11:55.998116 140432297644032 features.py:1558] All ref positions unknown for: LIG_BT
I1120 02:11:56.001026 140432297644032 features.py:1499] Success constructing SMILES reference structure for: LIG_BU
I1120 02:11:56.066736 140432297644032 features.py:1499] Success constructing SMILES reference structure for: LIG_BV
I1120 02:11:56.152411 140432297644032 features.py:1499] Success constructing SMILES reference structure for: LIG_BW

Also, whenever I get an RDKit warning, I get the following messages as well (the number of lines equals the number of atoms in the ligand):

I1120 02:11:56.518429 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.518555 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.518632 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.518705 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.518779 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.518856 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.518932 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.519007 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.519084 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.519158 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.519235 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.519312 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.519390 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.519466 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.519541 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.519618 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.519696 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.519773 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.519850 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.519926 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.520001 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.520076 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.520153 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.520230 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.520307 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.520385 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.520462 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.520538 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.520613 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.520689 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.520765 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.
I1120 02:11:56.520841 140432297644032 features.py:2042] Found identical coordinates: Assigning as colinear.

Structure inference proceeds without any warning or error, and the ligand that triggered the RDKit warning still has coordinates in the output:

HETATM 2861 C C21 . LIG_BS BS 34 .   ? -20.036 8.644   -2.704  1.00 37.82 1   BS 1 
HETATM 2862 C C22 . LIG_BS BS 34 .   ? -19.076 9.331   -1.829  1.00 38.11 1   BS 1 
HETATM 2863 O O7  . LIG_BS BS 34 .   ? -20.886 7.213   -9.300  1.00 38.66 1   BS 1 
HETATM 2864 O O8  . LIG_BS BS 34 .   ? -25.223 9.323   -13.970 1.00 40.94 1   BS 1 
HETATM 2865 O O1  . LIG_BT BT 35 .   ? -5.460  -5.732  7.422   1.00 50.92 1   BT 1 
HETATM 2866 P P1  . LIG_BT BT 35 .   ? -4.092  -5.278  7.721   1.00 56.34 1   BT 1 
HETATM 2867 O O2  . LIG_BT BT 35 .   ? -3.877  -5.308  9.260   1.00 40.80 1   BT 1 
HETATM 2868 C C1  . LIG_BT BT 35 .   ? -3.126  -6.282  9.781   1.00 51.54 1   BT 1 
HETATM 2869 C C2  . LIG_BT BT 35 .   ? -3.604  -6.512  11.152  1.00 51.66 1   BT 1 
HETATM 2870 N N1  . LIG_BT BT 35 .   ? -2.710  -7.340  11.881  1.00 35.27 1   BT 1 
HETATM 2871 C C3  . LIG_BT BT 35 .   ? -1.487  -6.620  12.127  1.00 45.03 1   BT 1 
HETATM 2872 C C4  . LIG_BT BT 35 .   ? -2.420  -8.536  11.153  1.00 42.59 1   BT 1 
HETATM 2873 C C5  . LIG_BT BT 35 .   ? -3.315  -7.687  13.119  1.00 43.89 1   BT 1 
HETATM 2874 O O3  . LIG_BT BT 35 .   ? -3.882  -3.765  7.361   1.00 35.57 1   BT 1 
HETATM 2875 C C6  . LIG_BT BT 35 .   ? -4.932  -2.899  7.372   1.00 40.89 1   BT 1 
HETATM 2876 C C7  . LIG_BT BT 35 .   ? -4.675  -1.847  6.347   1.00 42.96 1   BT 1 
HETATM 2877 O O4  . LIG_BT BT 35 .   ? -5.889  -1.204  6.084   1.00 37.67 1   BT 1 
HETATM 2878 C C8  . LIG_BT BT 35 .   ? -5.872  0.119   6.255   1.00 41.91 1   BT 1 
HETATM 2879 C C9  . LIG_BT BT 35 .   ? -6.749  0.847   5.345   1.00 32.44 1   BT 1 
HETATM 2880 C C10 . LIG_BT BT 35 .   ? -6.126  2.123   4.969   1.00 35.99 1   BT 1 
HETATM 2881 C C11 . LIG_BT BT 35 .   ? -6.872  2.717   3.827   1.00 34.30 1   BT 1 
HETATM 2882 C C12 . LIG_BT BT 35 .   ? -6.222  3.988   3.425   1.00 34.18 1   BT 1 
HETATM 2883 C C13 . LIG_BT BT 35 .   ? -6.926  4.564   2.248   1.00 32.43 1   BT 1 
HETATM 2884 C C14 . LIG_BT BT 35 .   ? -6.304  5.835   1.829   1.00 30.69 1   BT 1 
HETATM 2885 O O5  . LIG_BT BT 35 .   ? -5.224  0.634   7.068   1.00 39.24 1   BT 1 
HETATM 2886 C C15 . LIG_BT BT 35 .   ? -4.220  -2.458  5.068   1.00 46.10 1   BT 1 
HETATM 2887 O O6  . LIG_BT BT 35 .   ? -3.778  -1.454  4.222   1.00 45.43 1   BT 1 
HETATM 2888 C C16 . LIG_BT BT 35 .   ? -2.489  -1.455  4.020   1.00 45.46 1   BT 1 
HETATM 2889 C C17 . LIG_BT BT 35 .   ? -1.932  -0.427  3.122   1.00 40.81 1   BT 1 
HETATM 2890 C C18 . LIG_BT BT 35 .   ? -2.939  0.165   2.216   1.00 35.12 1   BT 1 
HETATM 2891 C C19 . LIG_BT BT 35 .   ? -2.288  1.214   1.384   1.00 36.52 1   BT 1 
HETATM 2892 C C20 . LIG_BT BT 35 .   ? -3.278  1.842   0.482   1.00 34.07 1   BT 1 
HETATM 2893 C C21 . LIG_BT BT 35 .   ? -2.619  2.921   -0.314  1.00 33.31 1   BT 1 
HETATM 2894 C C22 . LIG_BT BT 35 .   ? -3.600  3.558   -1.222  1.00 31.37 1   BT 1 
HETATM 2895 O O7  . LIG_BT BT 35 .   ? -1.770  -2.254  4.519   1.00 41.62 1   BT 1 
HETATM 2896 O O8  . LIG_BT BT 35 .   ? -3.037  -6.058  7.053   1.00 41.36 1   BT 1 
HETATM 2897 O O1  . LIG_BU BU 36 .   ? 18.988  0.042   0.142   1.00 43.12 1   BU 1 
HETATM 2898 P P1  . LIG_BU BU 36 .   ? 19.127  -1.382  -0.125  1.00 52.35 1   BU 1 
HETATM 2899 O O2  . LIG_BU BU 36 .   ? 20.431  -1.602  -0.914  1.00 34.40 1   BU 1 

However, all metrics related to that ligand are null in summary_confidences.json:

 "chain_pair_iptm": [
  [
   0.78,
...
   0.02,
   null,                            <=== diagonal for LIG_BT
   0.03,
...
  ],
 "chain_pair_pae_min": [
...
  [                                   <===  LIG_BT
   null,
   null,
...
   null,
   null
  ],
...
 ],
 "chain_ptm": [
  0.78,
...
  0.31,
  null,                           <===  LIG_BT
  0.33,
...
 ],
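For anyone who wants to spot affected ligands automatically, here is a minimal Python sketch that lists the chains whose chain_ptm entry is null. The key name comes from the excerpt above; the function name and the four-element example list are illustrative assumptions, not part of AlphaFold 3.

```python
import json
from typing import List


def chains_with_null_ptm(summary: dict) -> List[int]:
    """Return the indices of chains whose chain_ptm value is null (None)."""
    return [i for i, value in enumerate(summary.get("chain_ptm", []))
            if value is None]


# Illustrative stand-in for the contents of summary_confidences.json.
example = json.loads('{"chain_ptm": [0.78, 0.31, null, 0.33]}')
null_chains = chains_with_null_ptm(example)
print(null_chains)  # [2]
```

The returned indices follow the chain order used in the file, so they can be matched against the chain IDs of the input.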

The number of problematic ligands varies between runs with different ligands, and sometimes between different seeds within the same run, e.g.:

# SEED 1
lrat_dhpc-11/gra1342-27320233.out:W1120 22:23:15.401729 46919536050176 features.py:1519] Failed to construct RDKit reference structure for: LIG_FI
lrat_dhpc-11/gra1342-27320233.out:W1120 22:23:17.371930 46919536050176 features.py:1519] Failed to construct RDKit reference structure for: LIG_DL
lrat_dhpc-11/gra1342-27320233.out:W1120 22:23:19.033136 46919536050176 features.py:1519] Failed to construct RDKit reference structure for: LIG_FN
lrat_dhpc-11/gra1342-27320233.out:W1120 22:23:23.795290 46919536050176 features.py:1519] Failed to construct RDKit reference structure for: LIG_GT
lrat_dhpc-11/gra1342-27320233.out:W1120 22:23:33.568039 46919536050176 features.py:1519] Failed to construct RDKit reference structure for: LIG_GF
lrat_dhpc-11/gra1342-27320233.out:W1120 22:23:38.974108 46919536050176 features.py:1519] Failed to construct RDKit reference structure for: LIG_GM

#SEED 2
lrat_dhpc-11/gra1342-27320233.out:W1120 22:24:17.748973 46919536050176 features.py:1519] Failed to construct RDKit reference structure for: LIG_GF
lrat_dhpc-11/gra1342-27320233.out:W1120 22:24:23.128701 46919536050176 features.py:1519] Failed to construct RDKit reference structure for: LIG_GM
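A quick way to pull the affected ligand IDs out of such logs is a regular expression over the warning text. The sketch below assumes the message format shown above; the function name is hypothetical.

```python
import re
from typing import Iterable, List

# Matches the warning text shown in the logs and captures the ligand ID.
FAILED = re.compile(r"Failed to construct RDKit reference structure for: (\S+)")


def failed_ligands(log_lines: Iterable[str]) -> List[str]:
    """Collect ligand IDs from 'Failed to construct RDKit reference structure' lines."""
    found = []
    for line in log_lines:
        match = FAILED.search(line)
        if match:
            found.append(match.group(1))
    return found


sample = [
    "I1120 02:11:56.001026 140432297644032 features.py:1499] Success constructing SMILES reference structure for: LIG_BU",
    "W1120 22:24:17.748973 46919536050176 features.py:1519] Failed to construct RDKit reference structure for: LIG_GF",
    "W1120 22:24:23.128701 46919536050176 features.py:1519] Failed to construct RDKit reference structure for: LIG_GM",
]
print(failed_ligands(sample))
```

The log lines themselves carry no seed marker, so grouping by seed still has to be done by hand or from the timestamps.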

Across 30+ runs with N >= 40, every run produced at least one warning (with the associated null metrics).
For all runs with N <= 30, there was no RDKit warning.

The structure of the problematic ligand appears normal.

If it weren't for the null metrics associated with that ligand, I would not worry. Maybe all is fine and this is only a problem in the metrics computation routine, triggered when something is "wrong" with that ligand at the start (i.e. Found identical coordinates: Assigning as colinear.).

@Augustin-Zidek Augustin-Zidek added the bug Something isn't working label Nov 22, 2024
@joshabramson
Collaborator

joshabramson commented Nov 22, 2024

Thanks for the report. This happens when RDKit fails to generate a conformer for some random seeds and the CCD CIF defining the ligand input provides no fallback idealised coordinates. You can work around this by adding idealised coordinates.

When there are no conformer coordinates, we cannot generate frames for PAE, and without a frame we give up on generating a confidence. That is behavior we could change, however: we had single-atom ions in mind for that case (where there were no frames in training either), and full ligands should be fine, as the frames aren't actually used at inference time. But perhaps, given there are no reference coordinates, it's better to have nans here, so that users can see from the output that something is different in these cases (likely not as good a prediction).

@smg3d
Author

smg3d commented Nov 25, 2024

I have had little experience with RDKit, but it certainly fails on a lot of the SMILES and ccdCodes I have tried lately. If I understand correctly, even with ccdCodes that have coordinates, AF3 first tries to generate initial molecular coordinates with RDKit, fails (quite often in my recent attempts), and only then uses the ccdCodes coordinates? Is this an initial attempt to generate a random molecular conformation for the ligands?

@joshabramson
Collaborator

You are correct: the code first tries to generate a conformer for a ligand, and if that fails it looks for coordinates in the CCD input (see the code at "if not atom_names:"). It does this for one ligand at a time as part of the data pipeline.
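The fallback order described in this thread can be summarised in a short sketch. This is not the AlphaFold 3 implementation: the function name, argument names, and source labels are all hypothetical, and only the order (RDKit conformer first, CCD idealised coordinates second, otherwise unknown) comes from the discussion.

```python
from typing import Optional, Sequence, Tuple

Coords = Sequence[Tuple[float, float, float]]


def choose_reference_positions(
    conformer: Optional[Coords],
    ccd_ideal: Optional[Coords],
) -> Tuple[Optional[Coords], str]:
    """Pick reference positions for one ligand, mirroring the described order."""
    if conformer is not None:
        # RDKit produced a conformer for this seed.
        return conformer, "rdkit_conformer"
    if ccd_ideal is not None:
        # Conformer generation failed; fall back to idealised CCD coordinates.
        return ccd_ideal, "ccd_ideal"
    # Neither source is available: reference positions stay unknown,
    # which is the case that ends in null confidence metrics.
    return None, "unknown"
```

A ligand given only as SMILES has no CCD coordinates to fall back on, which is consistent with the "All ref positions unknown" warning in the logs above.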

@Augustin-Zidek Augustin-Zidek added question Further information is requested third party tool Issue with a third party tool and removed bug Something isn't working labels Nov 26, 2024
@smg3d
Author

smg3d commented Nov 27, 2024

This is a band-aid fix, but by setting params.maxIterations in get_reference() of features.py to a large enough value, I can eliminate all the "Failed to construct RDKit reference structure" warnings, and with them the nulls in the confidence files. At 1e4 I still get several failed constructions, at 1e5 10% fail, and at 1e6 none fail. It is not necessarily a great fix, as it significantly increases the featurization time, but it provides confidences for all components if that is important for the project.
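One way to limit the extra featurization cost would be to raise the iteration budget only for ligands that actually fail. The sketch below is an assumption-laden illustration, not the code in features.py: both function names are hypothetical, the budgets simply reuse the three values reported above, and the RDKit wiring (ETKDGv3 parameters with randomSeed and maxIterations, and EmbedMolecule returning -1 on failure) is imported lazily so the retry logic can be read and tested on its own.

```python
from typing import Callable, Iterable, Optional


def embed_with_escalating_budget(
    embed: Callable[[int], int],
    budgets: Iterable[int] = (10_000, 100_000, 1_000_000),
) -> Optional[int]:
    """Try embed(max_iterations) with growing budgets.

    Returns the first budget that succeeds, or None if all fail.
    """
    for budget in budgets:
        # RDKit's EmbedMolecule returns -1 when no conformer was generated.
        if embed(budget) != -1:
            return budget
    return None


def rdkit_embedder(smiles: str, seed: int) -> Callable[[int], int]:
    """Build an embed(max_iterations) callable backed by RDKit (assumed setup)."""
    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError("RDKit could not parse the SMILES string")
    mol = Chem.AddHs(mol)

    def embed(max_iterations: int) -> int:
        params = AllChem.ETKDGv3()
        params.randomSeed = seed
        params.maxIterations = max_iterations
        return AllChem.EmbedMolecule(mol, params)

    return embed
```

With RDKit installed, embed_with_escalating_budget(rdkit_embedder(smiles, seed)) would return the smallest of the three budgets that yields a conformer, so ligands that embed easily never pay for the 1e6 setting.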
