Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

For MPcules: Molecule Trajectory and graph hashes #2945

Merged
merged 36 commits into from
Apr 25, 2023

Conversation

espottesmith
Copy link
Contributor

@espottesmith espottesmith commented Apr 19, 2023

Summary

We are nearing the inclusion of a large dataset of molecules and molecular properties to the Materials Project. This PR adds some features to pymatgen which will be useful for the molecules data pipeline in emmet, namely graph hashes and a Trajectory class that works for Molecules.

The graph hashing code, which is currently in emmet but is being moved to pymatgen because we expect it to be useful for more general users (rather than only developers), was taken (following licenses, I hope) from networkx. We took the code directly because we want to have access to a stable version of the hashing algorithm (networkx has made subtle changes to the algorithm in the past, leading to the same graphs producing different hashes).

The MoleculeOptimizeTrajectory class is basically a copy-paste of the existing Trajectory class, designed for Molecule objects rather than Structure objects. This will be important if we include geometry optimization trajectories in the new MP dataset, and/or if we want to visualize atomic motion (e.g. molecular normal modes of vibration).

Checklist

  • Doc strings have been added in the Google docstring format. Run pydocstyle on your code.
  • Type annotations are highly encouraged. Run mypy path/to/file.py to type check your code.
  • Tests have been added for any new functionality or bug fixes.
  • All linting and tests pass.

Note that the CI system will run all the above checks. But it will be much more efficient if you already fix most errors prior to submitting the PR. We highly recommended installing pre-commit hooks. Simply Run

@janosh
Copy link
Member

janosh commented Apr 19, 2023

Thanks @espottesmith! Could you briefly explain the major differences between Trajectory and MoleculeOptimizeTrajectory? I understand latter is meant to handle Molecules but I'm curious what is preventing an object agnostic Trajectory class that handles both Structures and Mols?

Also, are you anticipating a MoleculeMDTrajectory class or similar with different needs? If not, why not call it MoleculeTrajectory and call the other StructureTrajectory with an alias for Trajectory for backwards compat?

@espottesmith
Copy link
Contributor Author

There's nothing preventing an agnostic Trajectory class, in principle. It just felt easier to make two versions, one that had to consider Lattices and another that doesn't. This is the same approach taken previously, e.g. for StructureGraph vs. MoleculeGraph.

I don't care much about the name. MoleculeTrajectory and StructureTrajectory are fine.

@janosh
Copy link
Member

janosh commented Apr 19, 2023

If there's nothing preventing it, my preference is for an agnostic Trajectory class as that minimizes LoC and will prevent API drift that could occur between MoleculeTrajectory and StructureTrajectory over time. But this is just one opinion. Happy to be overruled by other maintainers or arguments.

@espottesmith
Copy link
Contributor Author

I would respectfully say that if that's a design pattern that you or others want to move towards, then y'all should have at it. As I said, this pattern is easier to write (in my opinion and experience), and as that implies, writing a unified interface would mean a nontrivial amount of additional work (that I don't want to do if I don't have to).

@shyuep
Copy link
Member

shyuep commented Apr 19, 2023

I agree with @janosh that it is preferable to have a single unified interface with a consistent API. Trajectories are trajectories. Whether there is a periodic boundary condition shouldn't make any difference. And I would prefer if the original contributor of the PR, i.e., @espottesmith makes the change.

@espottesmith
Copy link
Contributor Author

The interface is as unified as it reasonably can be.

@janosh
Copy link
Member

janosh commented Apr 20, 2023

Thanks a lot, Evan! I really appreciate you going the extra mile!

Btw, different topic but wanted to mention I could get behind dropping mypy. Sounds like you'd be in favor? 🤣
In my personal experience, the pain mypy adds to developing can be too high for the number of coding errors it helps avoid. If @shyuep is not against, I'm happy to make the change.

@shyuep
Copy link
Member

shyuep commented Apr 20, 2023

For mypy, I think the general approach should be to not bother with type annotations where it is not necessary but to add them where you want a more stringent control of the types being passed. If you don't annotate, mypy will not check?

@janosh
Copy link
Member

janosh commented Apr 20, 2023

If you don't annotate, mypy will not check?

That's true.

But unlike most (all?) other languages, I think the main beneficiaries of type annotations in Python are actually the users, not the developers. Tools like Sphinx automatically integrate type annos into static docs and IDEs display them on hover and when passing arguments like in this example:

Screenshot 2023-04-20 at 07 45 48

So from that perspective, the more type hints the better. But the more you use them, the more mypy can be a nuisance.

@shyuep
Copy link
Member

shyuep commented Apr 20, 2023

Philosophically, I see type annotations as doing two things: 1. forcing developers to be more rigorous about what they intend, and 2. helping users use the code as intended. I generally think that they are most useful for primitive types, e.g., saying a generic variable name like mode can be a string, int or something else. It becomes messy when the types are built on objects with a hierarchy or even more complex things like lists or dicts.

It is definitely more effort than just not doing type annotations, but in the long run, it will be better. I am more fussy about type annotations for something like pymatgen/core but less for pymatgen/analysis. The more users who use a particular code, the more important it is that the types are properly specified.

@espottesmith
Copy link
Contributor Author

Stepping away from the mypy discussion, does anyone know why one of the tests failed? It failed during the dependency install, but this PR doesn't touch any of pymatgen's dependencies (at least, as far as I can remember).

@janosh
Copy link
Member

janosh commented Apr 20, 2023

Yeah failure is unrelated to this PR. Not sure why phonopy suddenly fails to install on Mac py3.8. Not holding back this PR. Haven't had time to re-review yet.

janosh added 2 commits April 22, 2023 07:30
can we rely on heuristic if lattice is None, we're handling molecules?
@janosh
Copy link
Member

janosh commented Apr 22, 2023

@espottesmith One more question. Do we need the new use_molecule kwarg? I just tried removing it and seeing if relying on the heuristic "if lattice is None = we're handling molecules" can work. The tests pass all still pass but maybe there's some subtlety that we're not checking for? Let me know if anything seems off to you in 4c9e24c?

@janosh janosh added awaiting user Needs more information from OP. enhancement A new feature or improvement to an existing one and removed awaiting user Needs more information from OP. labels Apr 25, 2023
@janosh janosh merged commit a553f37 into materialsproject:master Apr 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement to an existing one
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants