doc: remove reference to alignn (#4)
* doc: remove reference to alignn

* data/wbm/readme.md explain meaning of numbers in material_ids

add table listing number of materials per substitution step before and after cleaning

Co-authored-by: Janosh Riebesell <[email protected]>
CompRhys and janosh authored Dec 10, 2022
1 parent c761f95 commit 4572d4c
Showing 3 changed files with 14 additions and 3 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/test.yml
@@ -45,4 +45,4 @@ jobs:
twine upload --skip-existing dist/*
env:
TWINE_USERNAME: janosh
- TWINE_PASSWORD: ${{ secrets.TWINE_PASSWORD }}
+ TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN }}
13 changes: 12 additions & 1 deletion data/wbm/readme.md
@@ -4,10 +4,14 @@

The resulting novel structures were relaxed using MP-compatible VASP inputs (i.e. using `pymatgen`'s `MPRelaxSet`) and identical POTCARs in an attempt to create a database of Materials Project compatible novel crystals. Any degradation in model performance from training to test set should therefore largely be a result of extrapolation error rather than covariate shift in the underlying data.
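
For illustration, here is a minimal sketch of how MP-compatible VASP inputs can be generated with `pymatgen`'s `MPRelaxSet`. This is not the authors' exact workflow, just the API named above: the structure is a toy stand-in, and writing a POTCAR requires a locally configured pseudopotential library.

```python
from pymatgen.core import Lattice, Structure
from pymatgen.io.vasp.sets import MPRelaxSet

# toy rock-salt structure standing in for a WBM candidate crystal
structure = Structure.from_spacegroup(
    "Fm-3m", Lattice.cubic(5.69), ["Na", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]]
)

# MPRelaxSet encodes Materials Project's standard relaxation settings
relax_set = MPRelaxSet(structure)
relax_set.write_input("nacl_relax/")  # writes INCAR, KPOINTS, POSCAR, POTCAR
```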

- The authors performed 5 rounds of elemental substitution in total, each time relaxing generated structures and adding those found to lie on the convex hull back to the source pool. In total, ~20k or close to 10% were found to lie on the Materials Project convex hull.
+ The authors performed 5 rounds of elemental substitution in total, each time relaxing all generated structures and adding those found to lie on the convex hull back to the source pool. In total, ~20k or close to 10% were found to lie on the Materials Project convex hull.

Since repeated substitutions should, on average, increase chemical dissimilarity, the 5 iterations of this data-generation process are a unique and compelling feature, as they allow out-of-distribution testing. We can check how model performance degrades when asked to predict on structures increasingly dissimilar from the training set (which is restricted to the MP 2022 database release or earlier for all models in this benchmark).

## About the IDs

As you may have guessed, the first integer in each material ID following the prefix `wbm-` ranges from 1 to 5 and indicates the substitution iteration. Each iteration contains a varying number of materials, indexed by the second integer. Note that the second integer is not strictly consecutive: a small number of materials (~0.2%) were removed by the data-processing steps detailed below, so don't be surprised to find an ID like `wbm-3-70804` followed by one whose index jumps by more than 1.
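
As a quick sketch (with made-up column names and error values), the substitution step can be split out of each ID and used to group a model's errors by iteration:

```python
import pandas as pd

# hypothetical results frame: WBM IDs plus a model's absolute errors (eV/atom)
df = pd.DataFrame(
    {
        "material_id": ["wbm-1-24", "wbm-3-70804", "wbm-5-12"],
        "abs_error": [0.03, 0.11, 0.07],
    }
)

# first integer after the "wbm-" prefix = substitution iteration (1-5)
df["step"] = df.material_id.str.split("-").str[1].astype(int)

# error as a function of iteration, i.e. increasing dissimilarity from MP
print(df.groupby("step")["abs_error"].mean())
```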

## Data processing steps

The full set of processing steps used to curate the WBM test set from the raw data files (downloaded from the URLs listed below) can be found in [`data/wbm/fetch_process_wbm_dataset.py`](https://github.com/janosh/matbench-discovery/blob/site/data/wbm/fetch_process_wbm_dataset.py). Processing involved
@@ -20,6 +24,13 @@
- apply the [`MaterialsProject2020Compatibility`](https://pymatgen.org/pymatgen.entries.compatibility.html#pymatgen.entries.compatibility.MaterialsProject2020Compatibility) energy correction scheme to the formation energies
- compute energy to the convex hull constructed from all MP `ComputedStructureEntries` queried on 2022-09-16 ([database release 2021.05.13](https://docs.materialsproject.org/changes/database-versions#v2021.05.13))
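
The last two steps might look roughly like the following `pymatgen` sketch (placeholder entry objects; the real processing lives in `fetch_process_wbm_dataset.py` and uses the full set of MP `ComputedStructureEntries` described above):

```python
from pymatgen.analysis.phase_diagram import PhaseDiagram
from pymatgen.entries.compatibility import MaterialsProject2020Compatibility

def e_above_mp_hull(wbm_entry, mp_entries):
    """Energy above the MP convex hull (eV/atom) for one WBM ComputedStructureEntry.

    mp_entries must cover the entry's chemical system for the hull to be meaningful.
    """
    compat = MaterialsProject2020Compatibility()
    wbm_entry = compat.process_entry(wbm_entry)  # apply MP2020 energy corrections
    mp_entries = compat.process_entries(mp_entries)

    phase_diagram = PhaseDiagram(mp_entries)
    return phase_diagram.get_e_above_hull(wbm_entry)
```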

The number of materials in each substitution step before and after processing:

| step | 1 | 2 | 3 | 4 | 5 | total |
| ---- | ------ | ------ | ------ | ------ | ------ | ------- |
| pre | 61,848 | 52,800 | 79,205 | 40,328 | 23,308 | 257,487 |
| post | 61,466 | 52,755 | 79,160 | 40,314 | 23,268 | 256,963 |
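
From the totals column, 257,487 - 256,963 = 524 materials, i.e. roughly 0.2% of the raw set, were dropped across all 5 steps, consistent with the figure quoted in the ID section above.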

Invoking that script with `python fetch_process_wbm_dataset.py` will auto-download and regenerate the WBM test set files from scratch. If you find anything questionable in the released test set or inconsistencies between the files on GitHub and the output of that script, please [raise an issue](https://github.com/janosh/matbench-discovery/issues).

## Links to WBM data files
2 changes: 1 addition & 1 deletion readme.md
@@ -30,7 +30,7 @@

This project aims to complement Matbench using the **WBM dataset** published in [Predicting stable crystalline compounds using chemical similarity](https://nature.com/articles/s41524-020-00481-6). They generated ~250k structures with chemical-similarity-based elemental substitution and relaxed all of them. ~20k or 10% were found to lie on the Materials Project convex hull. They did 5 iterations of this substitution process. This is a unique and compelling feature of the dataset as it allows out-of-distribution testing. We can look at how a model performs when asked to predict on structures increasingly different from the training set (which is restricted to MP for all models in this benchmark at the moment), since repeated substitutions should, on average, increase chemical dissimilarity.

- A good set of baseline models would be CGCNN, Wren and Voronoi tessellation combined with a random forest. In addition to CGCNN, Wren and Voronoi plus RF, this benchmark includes ALIGNN (current Matbench SOTA), BOWSR and M3GNet to see how many of the 20k stable structures each of these models recovers and how their performance changes as a function of iteration number, i.e. how well they extrapolate. Like Matbench, future model submission to this benchmark can be added via PRs to this repo.
+ A good set of baseline models would be CGCNN, Wren and Voronoi tessellation combined with a random forest. In addition to CGCNN, Wren and Voronoi plus RF, this benchmark includes BOWSR and M3GNet to see how many of the 20k stable structures each of these models recovers and how their performance changes as a function of iteration number, i.e. how well they extrapolate. Like Matbench, future model submissions to this benchmark can be added via PRs to this repo.

Our goal with this site is to serve as an interactive dashboard that makes it easy for researchers to compare the performance of different energy models on metrics like precision, recall and discovery enrichment and find the model that best suits their needs. You can then make an informed decision about which model to pick by trading off the compute savings from a higher hit rate against the more complete discovery of your materials space of interest that comes with higher recall.
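
As a rough illustration of those classification metrics (not this repo's exact implementation; the array values and the 0 eV/atom stability threshold are assumptions), precision and recall for a stability classifier could be computed like so:

```python
import numpy as np

# made-up DFT and model-predicted energies above the convex hull (eV/atom)
e_above_hull_dft = np.array([-0.02, 0.01, 0.15, -0.05, 0.30])
e_above_hull_pred = np.array([-0.01, -0.03, 0.20, 0.02, 0.25])

true_stable = e_above_hull_dft <= 0   # ground-truth stable materials
pred_stable = e_above_hull_pred <= 0  # materials the model flags as stable

true_pos = (true_stable & pred_stable).sum()
precision = true_pos / pred_stable.sum()  # how many flagged materials are really stable
recall = true_pos / true_stable.sum()     # how many stable materials the model found
print(f"precision={precision:.2f}, recall={recall:.2f}")
```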

