Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read Partial Charges from mol2 and SDF #258

Merged
merged 20 commits into from
Dec 6, 2024

Conversation

rwxayheee
Copy link
Contributor

@rwxayheee rwxayheee commented Nov 30, 2024

This is for #224.

Chem.MolFromMol2File in RDKit parses partial charges into atom property name _TriposPartialCharge. This PR simply reads the charges from atom property at the point init_atom.

An additional attribute, charge_propname, is added to MoleculePreparation, and it needs to be used with option charge_model = "read". The default is _TriposPartialCharge.
The corresponding argument read_charges_from_prop is also added to the from_mol method for RDKitMoleculeSetup to be accessed at a lower level. It must not be conflicting with argument compute_gasteiger_charges (formerly assign_charges).
Some error messages for conflict handling are added.

Notes:

  • The total charge from the mol2 file will be preserved (i.e. if the total charge is non-integer at the beginning, the sum of the output charges will be the same).
  • In addition to _TriposPartialCharge, any other named RDKit Atom attribute can be used as the source of charges.
  • The RDKit parser needs the MOL2 file to have valid SYBYL atom types, which is not the case for some definition MOL2 files in/from Amber/AmberTools.

Python usage example:
212163792.mol2.txt

from meeko import MoleculePreparation
from meeko import PDBQTWriterLegacy
from rdkit import Chem

input_filename = "212163792.mol2"
output_filename = "212163792.pdbqt"
mol = Chem.MolFromMol2File(input_filename, removeHs=False)
mk_prep = MoleculePreparation(charge_model="read", charge_atom_prop="_TriposPartialCharge")
molsetup_list = mk_prep(mol)
molsetup = molsetup_list[0]
pdbqt_string = PDBQTWriterLegacy.write_string(molsetup)

with open(output_filename, "w") as f:
    f.write(pdbqt_string[0])

@rwxayheee rwxayheee requested a review from diogomart November 30, 2024 21:54
@rwxayheee rwxayheee changed the base branch from release to develop December 1, 2024 02:46
@rwxayheee rwxayheee changed the base branch from develop to release December 2, 2024 22:41
@rwxayheee rwxayheee changed the base branch from release to develop December 2, 2024 22:41
Copy link
Contributor

@diogomart diogomart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Thanks!

Minor requests:

  • If mol2 files are the only source of partial charges in rdkit atom properties (as far as I know they are, but I could be wrong), set default of charge_propname to "_TriposPartialCharge" instead of None in MoleculePreparation.__init__ and remove the warning in lines 168-173 of preparation.py.
  • in molsetup.py, keep the warning and default of read_charges_from_prop to None exactly as they are. Maybe some user is building a molsetup manually and these warnings can be helpful.
  • suggestion: rename charge_propname in MoleculePreparation to charge_atom_prop, and read_charges_from_prop also to charge_atom_prop. Makes it easier to find connections in the code with grep.

@diogomart
Copy link
Contributor

Another thing is to add "read" choice to mk_prepare_ligand.py --charge_model option.

raise Warning triggers an error, we should either raise an error or use warnings.warn (needs import warnings)

@diogomart
Copy link
Contributor

If any other property name was to be used, it would use the MOL/SDF format, as MOL2 charges would always be parsed into the _TriposPartialCharge atom property. Looks like SDF can store atom properties:
openforcefield/openff-toolkit#250
rdkit/rdkit#2297
so I guess we could expose also --charge_atom_prop in mk_prepare_ligand.py, and even go to the extent of having mk_prepare_ligand.py set charge_atom_prop="PartialCharge" if the input is SDF, and charge_atom_prop="_TriposPartialCharge" if MOL2. I'm not sure anymore about setting the default charge_atom_prop to "_TriposPartialCharge".

@rwxayheee
Copy link
Contributor Author

Hi @diogomart
Thanks for looking at this PR.

  • About the default property name

The None was to explicitly handle possible conflicts of charge_model/compute_gasteiger_charges(formerly assign_charge): when a valid property name is given, all other built-in charge models will be considered conflicting. I kind of want to keep the default to be None, because when charge_model is read, a default property name becomes "_TriposPartialCharge". https://github.com/forlilab/Meeko/pull/258/files#diff-8fe710b00c5629fbd7adc003157cd2d4a1e204a15eb4fb0f38221f7970300712R167-R169 is where that happens.
Therefore, the default property name is "_TriposPartialCharge" when there isn't a conflicting charge model.

  • About the two classes that both use the property name keyword arguments

MoleculePreparation is on a higher level than RDKitMoleculeSetup (keyword arguments flow from MoleculePreparation to RDKitMoleculeSetup). Both of them can be used on a molecule with manually setup charges (?). So I set up the default values for both in the same way. But maybe that's redundant.. I copied and pasted most of the argument handling part.

  • About the different keyword argument names

I did this on purpose because they have different dominance over other options. read_charges_from_prop means to read AND to read from a specified property. Whereas charge_propname only specifies the property name, but whether to read depends on charge_model (and this is also for conflict handling, but maybe this is all unnecessary..). I can change the names to make tracking easier.

  • About SDF support
    The PR is to support charges from MOL2, but I will try reading charges from SDF. There are other sources to read charges from (Amber, QChem files) but as long as users can store them into a custom atom property, we just need the property name.

Tomorrow I will work on:

  • Keyword name changes
  • Warning and CLI options
  • Parsing charges from SDF, usually it doesn't store partial charges (only formal charges) in the molecule section, but the charges could be parsed from the property field (??? I don't know)

@diogomart
Copy link
Contributor

Ok, all of that sounds good to me. SDF is already supported, it's just a different atom property name.

@rwxayheee
Copy link
Contributor Author

rwxayheee commented Dec 4, 2024

Ok, all of that sounds good to me. SDF is already supported, it's just a different atom property name.

(If you happen to have one) could you provide an SDF with partial charges? Do people write partial charges in property field or somewhere else? The SDF on PubChem usually doesn’t come with partial charges. But other software / packages might be able to write them in custom fields. The charges will be parsed as molecule property but not atom property. This PR currently needs charges in atom property

no worry though, I can look that up from some other databases tomorrow.

@diogomart
Copy link
Contributor

They can actually be parsed to an atom property (I learned that today), see my post above, the link to an rdkit PR 2297 has an example SDF with charges.

@rwxayheee
Copy link
Contributor Author

Ahhhh!!! I see, thanks! I will look into that

@rwxayheee rwxayheee changed the title Read Partial Charges from mol2 Read Partial Charges from mol2 and SDF Dec 4, 2024
@rwxayheee
Copy link
Contributor Author

Summary of added commits:

  • Changed default atom property name to PartialCharge

  • Changed keyword and variable name charge_propname to charge_atom_prop

  • Use warnings.warn instead of raise Warning to not interrupts the flow of execution

  • Added argument --charge_from_prop to mk_prepare_ligand.py
    Unlike how other arguments are passed, their changes are made in the main function, and right before the config right before MoleculePreparation.from_config because we need access ext (extension of input files) to determine the right property name.

Expected CLI usage:

# switch property name based on extension
mk_prepare_ligand.py -i 212163792.mol2 --charge_model read 

# customize property name 
mk_prepare_ligand.py -i 212163792.mol2 --charge_model read --charge_from_prop "_TriposPartialCharge"

The following will not work:

mk_prepare_ligand.py -i 212163792.mol2 --charge_from_prop "_TriposPartialCharge"

Because the default charge_model in mk_prepare_ligand.py is "gasteiger".

I tested with a few Python and command-line script examples. @diogomart thanks again for your suggestions, and please let me know what you think!

@@ -78,6 +78,11 @@ def cmd_lineparser():
"--name_from_prop",
help="set molecule name from RDKit/SDF property",
)
io_group.add_argument(
"--charge_from_prop",
nargs="*",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for nargs


preparator.charge_model = "read"
if args.charge_from_prop:
preparator.charge_atom_prop = args.charge_from_prop[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for [0] after nargs above is removed

preparator.charge_atom_prop = args.charge_from_prop[0]
else:
if ext=="mol2":
preparator.charge_atom_prop = "_TriposPartialCharge"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is inside the if args.charge_from_prop is not None so i think this never gets executed

@rwxayheee
Copy link
Contributor Author

Hi @diogomart
Thanks for looking at this. Sorry for not getting this right. I missed an "if" statement after I rewrote something else. But I noticed another problem.

    if args.charge_from_prop is not None: 
        if args.charge_model != "read": 
            print(f'--charge_from_prop must be used with --charge_model "read", but the current charge_model is "{args.charge_model}". ')
            sys.exit(1)
        config["charge_atom_prop"] = args.charge_from_prop if args.charge_from_prop else "_TriposPartialCharge" if ext == "mol2" else None
    else:
        if args.charge_model == "read": 
            if ext=="mol2": 
                config["charge_atom_prop"] = "_TriposPartialCharge"

Here I wanted to override config["charge_atom_prop"] according to arguments and ext (extension of input), which is only accessible in the main function without further changes. But it basically ignores the values from the original config and it's not right to do that. I will fix it a little later today. Sorry for my confusion!

@rwxayheee
Copy link
Contributor Author

0568941 should fix the current issues that are specific to the added arguments.

There might still be a small problem with how command line arguments override config, specifically those with a default value for all other command-line arguments, like charge_model. These default default values always override the config file (?), but the overriding should happen only when the arguments are explicitly provided.

Something like the following might work:

    # command line arguments override config
    explicitly_provided_args = {
        arg.lstrip('-').replace('-', '_') for arg in remaining_argv if arg.startswith('--')
    }

    for key in config:
        if key in explicitly_provided_args:
            config[key] = args.__dict__[key]

To handle arguments that are provided in the short-form, a mapping can be created, like:

    # Mapping arguments short-form to long-form 
    short_to_long = {}

    for action in parser._actions:
        option_strings = action.option_strings
        
        short_option = next((opt for opt in option_strings if opt.startswith('-') and not opt.startswith('--')), None)
        long_option = next((opt for opt in option_strings if opt.startswith('--')), None)
        
        if short_option and long_option:
            short_to_long[short_option.lstrip('-')] = long_option.lstrip('--')

    explicitly_provided_args = explicitly_provided_args | {
        short_to_long[arg.strip('-')] for arg in remaining_argv if arg.startswith('-') and not arg.startswith('--')
    }

But maybe it's all unnecessary. @diogomart Could you please confirm that we want to keep the config option, and should make the changes to only allow argument overriding config for arguments that are explicitly provided? Does the current type of overriding align with what we want, and it's not necessary to change?

I will come back and take a closer look later this evening. Thank you for your time and kind advice in advance.

@rwxayheee
Copy link
Contributor Author

rwxayheee commented Dec 6, 2024

The charge parsing should be working now, hopefully.. Also fixed some small problems with command line arguments defaults unexpectedly overriding config options. The changes are, specifically:

  • When reading options from config file, check if the keyword is accepted by MoleculePreparation. (Not necessaary; removed in 81151af)

  • Removed one redundant assignment from args to config

config["load_atom_params"] = args.load_atom_params

Because config should already be synced after

config[key] = args.__dict__[key]
  • args.add_atom_types_json and config["add_atom_types"] are handled differently (from other args) because they have different names. The script used to ignore the existing assignments in config

  • config["charge_atom_prop"] is handled differently because it might depend on file extension, which function cmd_lineparser doesn't have access to.

  • Cleaned up the unused backend = "rdkit" blocks; redirect error messages to Standard Error

Copy link
Contributor

@diogomart diogomart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great. Thank you for all the clean up. Great catch with the add_atom_types option.

We do want to keep the option to pass a JSON config file, and have any command line arguments override those in the config file, but only if they are explicitly passed. This is what is implemented, and based on some testing it is working as expected.

I have only minor requests, which are in-lined. Thanks again!

@@ -52,7 +51,11 @@ def cmd_lineparser():
if confargs.config_file is not None:
with open(confargs.config_file) as f:
c = json.load(f)
config.update(c)
if any(key not in config for key in c):
print(f"Error: Got unsupported keyword ({key}) for MoleculePreparation from the config file ({confargs.config_file}). ",
Copy link
Contributor

@diogomart diogomart Dec 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

key is not defined in the f string (that's why we iterate over the keys in MoleculePreparation.from_config)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not sure we need to test the keys here, MoleculePreparation.from_config does a check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right on both. I made the unnecessary changes but I forgot to revert.

@@ -78,6 +81,10 @@ def cmd_lineparser():
"--name_from_prop",
help="set molecule name from RDKit/SDF property",
)
io_group.add_argument(
"--charge_atom_prop",
help="set atom partial charges from an RDKit atom property based on the input file. The default is 'PartialCharge' for SDF and '_TriposPartialCharge' for MOL2 unless overriden by a user defined property name. ",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move this next to --charge_model in the Molecule Preparation group. While it is certainly related to I/O, being next to charge_model is more important, and charge_model belongs in molecule preparation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

# command line arguments override config
for key in config:
if key in args.__dict__:
for key in args.__dict__:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would like the "command line arguments override config" comment back :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unnecessary, I would revert that

@rwxayheee
Copy link
Contributor Author

Hi @diogomart thanks again for the comments. Sorry this took so many commits; I hope things are ok now. I did not understand the overriding at first, so I did too much rewrite that was unnecessary.

@diogomart
Copy link
Contributor

Thanks for the edits. The overriding is done in a complicated way... couldn't come up with a better one. I just pushed a commit to add a couple comments and moved a block of code to make it more understandable. Sorry for the lack of comments!

@diogomart diogomart self-requested a review December 6, 2024 22:42
@rwxayheee
Copy link
Contributor Author

Hi @diogomart Thanks so much for the comments and for your kind support, as always, merging!

@rwxayheee rwxayheee merged commit efc78f5 into forlilab:develop Dec 6, 2024
1 check passed
@rwxayheee rwxayheee deleted the charges_from_mol2 branch December 6, 2024 23:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants