Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

One-off error in QM9 data? #24

Closed
tothemoonandback27 opened this issue Feb 28, 2022 · 1 comment
Closed

One-off error in QM9 data? #24

tothemoonandback27 opened this issue Feb 28, 2022 · 1 comment

Comments

@tothemoonandback27
Copy link

Hi @gasteigerjo, I'm looking at the file https://github.com/klicperajo/dimenet/blob/master/data/qm9_eV.npz and am trying to recreate it from the original raw QM9 data. When I look at the list of uncharacterized molecules, i.e. the ones that failed to converge in DFT, I think there might be a one-off error. So the QM9 data set indexing starts at 1 for how they label molecules. The first uncharacterized index is 58. So e.g. the value for U0 in dsgdb9nsd_000058.xyz is -242.19573 Ha and after subtracting the isolated atom energies and converting to eV you'd get -34.008354871077934 eV. This matches the 58-th entry (index 57) in the npz data files you uploaded: -34.008354871077934. But this is the index that should be excluded because it's one of the unconverged molecules. This leads me to believe there might be a one-off error in your data creation process as this will likely be repeated for all of the 3054 unconverged molecules? If not, maybe you could let me know where I'm thinking wrong here?

@gasteigerjo
Copy link
Owner

gasteigerjo commented Aug 1, 2023

Hi!

Thank you for your interest and for checking this! And sorry for the very late response.

I think you are indeed correct. An easier way of checking this is by looking at the coordinates, which need no transformation. And indeed, we find:

Our index (zero-based) Corresponding original index (one-based)
57 58
58 60
59 61
60 63
77 80

Indices 58, 61, and 80 are the first 3 molecules that should have been filtered out, but they are not. Instead, the next indices are removed in our dataset: 59, 62, etc. There is clearly an off-by-one bug in filtering.

I think this should only have a detrimental effect on the results reported in our papers, so those should be fine as an upper bound on performance. I will add this error to the known issues and add some warnings in the repo.

Thank you again for looking at this more closely and reporting the bug!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants