Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

T008: Current query misses existing RCSB ligands #248

Closed
BJWiley233 opened this issue Jun 24, 2022 · 6 comments · Fixed by #259
Closed

T008: Current query misses existing RCSB ligands #248

BJWiley233 opened this issue Jun 24, 2022 · 6 comments · Fixed by #259
Assignees
Labels
bug Something isn't working

Comments

@BJWiley233
Copy link

BJWiley233 commented Jun 24, 2022

Hi,

For talktorial 8 for accessing RCSB I think RCSB is having issue with not having all the ligands under the field ["rcsb_entry_info"]["nonpolymer_bound_components"]. For instance I am using MAP2K1 but I'm for sure it will affect other proteins and for entry https://data.rcsb.org/rest/v1/core/entry/3ZLS there is only ["NA"] value for nonpolymer_bound_components when the structure has '92P' as the largest ligand. I found an alternative under ["pdbx_vrpt_summary"]["restypes_notchecked_for_bond_angle_geometry"] so the following should work better as an alternative in talktorial 8.

nonpolymers = info.get("pdbx_vrpt_summary", {}).get('restypes_notchecked_for_bond_angle_geometry',[])

This is also kind of interesting to find out that most ligands are not checked for the bond angle geometry. All the more reason to use TinkerTools/poltype2 for ligand geometry optimization! 😄

@dominiquesydow
Copy link
Collaborator

dominiquesydow commented Aug 1, 2022

Hi @BJWiley233,

Thanks a lot for raising this issue, 3ZLS is an interesting case!

As a general remark: Compared to what we are currently doing in T008, I have the feeling there should be a nicer way to check for all non-polymer residues in a PDB entry. For example, parsing all non-polymer _chem_comp entries in a https://files.rcsb.org/view/{pdb_id}.cif file - looking for a good CIF parser (EDIT: PDBeCIF).

But --- building on what we already have in T008 and what you suggested here, we could use both queries and filter the results by ligand size (to extract solvents and ions). Ligand size could be a user-defined value here.


# Example PDBs
pdb_id = "5UG9"
pdb_id = "3ZLS"

def get_ligands(pdb_id, ligand_min_size=100):
    """
    RCSB has not provided a new endpoint for ligand information yet. As a
    workaround we are obtaining extra information from ligand-expo.rcsb.org,
    using HTML parsing. Check Talktorial T011 for more info on this technique!
    """

    info = pypdb.get_info(pdb_id)
    # Extract ligands (not-so-nice workaround)
    # - marked as non-polymer in the `rcsb_entry_info` field
    # - since not all non-polymers are listed in the `rcsb_entry_info` field, 
    #   also look for residues with unchecked bond angle geometry
    #   (see discussion at https://github.com/volkamerlab/teachopencadd/issues/248)
    _nonpolymers1 = info.get("rcsb_entry_info", {}).get("nonpolymer_bound_components", [])
    _nonpolymers2 = info.get("pdbx_vrpt_summary", {}).get("restypes_notchecked_for_bond_angle_geometry",[])
    nonpolymers = list(set(_nonpolymers1 + _nonpolymers2))
    
    # Extract ligand annotations from ligand-expo.rcsb.org
    ligands = {}
    for ligand_expo_id in nonpolymers:
        url = f"http://ligand-expo.rcsb.org/reports/{ligand_expo_id[0]}/{ligand_expo_id}/"
        print(url)
        r = requests.get(url)
        r.raise_for_status()
        html = BeautifulSoup(r.text)
        info = {}
        for table in html.find_all("table"):
            for row in table.find_all("tr"):
                cells = row.find_all("td")
                if len(cells) != 2:
                    continue
                key, value = cells
                if key.string and key.string.strip():
                    info[key.string.strip()] = "".join(value.find_all(string=True))
        # Only keep ligands that 
        # - are of component type non-polymer
        # - have a MW of more than 100 Da
        if info["Component type"].lower() == "non-polymer" and float(info["Molecular weight"].split()[0]) >= ligand_min_size:
            ligands[ligand_expo_id] = info
    print(_nonpolymers1)
    print(_nonpolymers2)
    print(ligands.keys())
    return ligands

get_ligands(pdb_id)

For reference:
_pdbx_vrpt_summary.restypes_notchecked_for_bond_angle_geometry

@dominiquesydow dominiquesydow self-assigned this Aug 1, 2022
@dominiquesydow dominiquesydow changed the title RCSB API T008: Current query misses existing RCSB ligands Aug 1, 2022
@dominiquesydow dominiquesydow added the bug Something isn't working label Aug 1, 2022
@BJWiley233
Copy link
Author

BJWiley233 commented Aug 1, 2022

I have actually been using graphQL based on Rachel Green's advice at RSCB. It works well and you don't need to make it a single line to work and it should return all the ligands in the cif entry:

import urllib3
http = urllib3.PoolManager()

query = '''{
      entry(entry_id: "%s") {
        nonpolymer_entities {
          pdbx_entity_nonpoly {
            comp_id
            name
            rcsb_prd_id
          }
        }
      }
    }''' % pdb_id

query_url='https://data.rcsb.org/graphql?query=%s' % query
r = http.request('GET', query_url, preload_content=False)
r.status
my_json = json.loads(r.data)

dominiquesydow added a commit that referenced this issue Aug 2, 2022
Thanks @BJWiley233 and Rachel Green for the fix.
@dominiquesydow dominiquesydow linked a pull request Aug 2, 2022 that will close this issue
4 tasks
@dominiquesydow
Copy link
Collaborator

Thanks, @BJWiley233, using GraphQL is a great idea! Will incorporate the changes to T008 (#259).

@khanmf
Copy link

khanmf commented Aug 5, 2022

Hi @dominiquesydow ,

When I am running the same coddes for pdb_id = '3POZ', it is giving the following error:

Traceback (most recent call last):
File "PDB_Query.py", line 428, in
ligand_id, properties = max(ligands.items(), key=lambda kv: kv[1]["Molecular weight"])
ValueError: max() arg is an empty sequence

How this could be an empty sequence when there are two bound ligands 03P and SO4? Out of which 03P is the bigger one.

@dominiquesydow
Copy link
Collaborator

Did you pull the latest changes from the master branch already?

The updated get_ligand function

get_ligands("3POZ")

returns the following:

{'SO4': {'Name': 'SULFATE ION',
  'Formula': 'O4 S',
  'Formal charge': -2,
  'Molecular weight': 96.063,
  'Component type': 'NON-POLYMER',
  'Atom count': 5,
  'Chiral atom count': 0,
  'Bond count': 4,
  'Aromatic bond count': 0,
  'Systematic name (ACDLabs)': 'sulfate',
  'Systematic name (OpenEye OEToolkits)': 'sulfate',
  'Stereo SMILES (CACTVS)': '[O-][S]([O-])(=O)=O',
  'SMILES (CACTVS)': '[O-][S]([O-])(=O)=O',
  'Stereo SMILES (OpenEye)': '[O-]S(=O)(=O)[O-]',
  'InChI descriptor': 'InChI=1S/H2O4S/c1-5(2,3)4/h(H2,1,2,3,4)/p-2',
  'InChIKey descriptor': 'QAOWNCQODCNURD-UHFFFAOYSA-L',
  'Last modified': '2011-06-04',
  'Created': '1999-07-08',
  'Release status': 'REL',
  'Replaces': 'SUL',
  'Model PDB code': '1BXO',
  'Processing site': 'RCSB'},
 '03P': {'Name': 'N-{2-[4-({3-chloro-4-[3-(trifluoromethyl)phenoxy]phenyl}amino)-5H-pyrrolo[3,2-d]pyrimidin-5-yl]ethyl}-3-hydroxy-3-methylbutanamide',
  'Synonyms': 'TAK-285',
  'Formula': 'C26 H25 Cl F3 N5 O3',
  'Formal charge': 0,
  'Molecular weight': 547.957,
  'Component type': 'NON-POLYMER',
  'Atom count': 63,
  'Chiral atom count': 0,
  'Bond count': 66,
  'Aromatic bond count': 22,
  'Systematic name (ACDLabs)': 'N-{2-[4-({3-chloro-4-[3-(trifluoromethyl)phenoxy]phenyl}amino)-5H-pyrrolo[3,2-d]pyrimidin-5-yl]ethyl}-3-hydroxy-3-methylbutanamide',
  'Systematic name (OpenEye OEToolkits)': 'N-[2-[4-[[3-chloranyl-4-[3-(trifluoromethyl)phenoxy]phenyl]amino]pyrrolo[3,2-d]pyrimidin-5-yl]ethyl]-3-methyl-3-oxidanyl-butanamide',
  'Stereo SMILES (CACTVS)': 'CC(C)(O)CC(=O)NCCn1ccc2ncnc(Nc3ccc(Oc4cccc(c4)C(F)(F)F)c(Cl)c3)c12',
  'SMILES (CACTVS)': 'CC(C)(O)CC(=O)NCCn1ccc2ncnc(Nc3ccc(Oc4cccc(c4)C(F)(F)F)c(Cl)c3)c12',
  'Stereo SMILES (OpenEye)': 'CC(C)(CC(=O)NCCn1ccc2c1c(ncn2)Nc3ccc(c(c3)Cl)Oc4cccc(c4)C(F)(F)F)O',
  'InChI descriptor': 'InChI=1S/C26H25ClF3N5O3/c1-25(2,37)14-22(36)31-9-11-35-10-8-20-23(35)24(33-15-32-20)34-17-6-7-21(19(27)13-17)38-18-5-3-4-16(12-18)26(28,29)30/h3-8,10,12-13,15,37H,9,11,14H2,1-2H3,(H,31,36)(H,32,33,34)',
  'InChIKey descriptor': 'ZYQXEVJIFYIBHZ-UHFFFAOYSA-N',
  'Last modified': '2021-03-01',
  'Created': '2010-11-30',
  'Release status': 'REL',
  'Model PDB code': '3POZ',
  'Processing site': 'RCSB'}}

@khanmf
Copy link

khanmf commented Aug 5, 2022

@dominiquesydow Thanks. I haven't seen the recent changes in the codes. Its working fine, now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants