Merge pull request #100 from openforcefield/dataset-standards

Dataset submission guidelines/standards
openforcefield · Jun 5, 2020 · 1474845
2 parents 261fb8f + ce2cb89
commit 1474845
Showing 2 changed files with 66 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -30,3 +30,6 @@ Data generation and submission scripts for the QCArchive ecosystem.
 * `2019-07-02 VEHICLe optimization dataset`: source files for `Open Force Field VEHICLe optimization dataset 1.0.0` (`OptimizationDataset`) for the VEHICLe dataset (heteroaromatic rings of the future) (@jchodera)
 * `2019-07-05 OpenFF NCI250K Boron 1`: source files for `OpenMM NCI250K Boron 1` (`OptimizationDataset`) where small boron-containing compounds are extracted from the [NCI250K](https://cactus.nci.nih.gov/download/nci/) (@jchodera)
 * `2019-09-07-Pfizer-discrepancy-optimization-dataset-1`: source files for `Pfizer discrepancy optimization dataset 1` (@jchodera)
+
+## Guidelines and standards for submitting new datasets
+* See [STANDARDS.md](./STANDARDS.md)
diff --git a/STANDARDS.md b/STANDARDS.md
@@ -0,0 +1,63 @@
+This file outlines the standards and requirements needed for submitting a dataset to QCArchive.
+Following these standards ensures a consistent data model for downstream processes.
+
+# Required fields 
+
+Current list:
+* Ensure all submissions include CMILES identifiers; the most important is the mapped SMILES with explicit hydrogens.
+* Ensure Wiberg bond orders (WBO) are requested for all submissions by including the flag `wiberg_lowdin_indices` in the SCF properties list.
+
+# Best practices
+* If any calculations are to be redone from another collection, reuse the old input (coordinates, atom ordering, etc.). The calculation is then not run again; new references to the old results are created in the database instead, which helps keep the cost of the calculations down.
+
+# Dataset naming and versioning
+
+Each dataset shall be versioned.
+- The naming of a dataset should have the following structure:
+
+    `"OpenFF <descriptive and uniquely-identifying name> v<version number>"`
+
+- The first submission of a dataset will have the version `"v1.0"`.
+
+- A dataset with the suffix `"-beta"` is not to be used for production work.
+
+- A minor version change (e.g. `"v1.1"`) means that cosmetic problems or minor additions were addressed, for example:
+    - a misspelling
+    - the addition of, e.g., Wiberg bond orders
+
+# Tags indicate status
+
+A tag `"complete"` indicates that a dataset is completed as far as OpenFF is concerned.
+This means that any errors remaining are known to be acceptable or impossible to fix.
+It also means that no additional work is being done on the dataset to get it to completion.
+
+A tag `"inflight"` indicates that a dataset is not completed as far as OpenFF is concerned.
+This means that any errors remaining are being actively addressed.
+
+All datasets should also feature an `"openff"` tag.
+
+# Force Field Releases 
+
+When a new force field is released, a dataset corresponding to all results used for the force field fitting should be created.
+This gives a single reference for these data instead of many references.
+The format of these dataset names is:
+
+    `"OpenFF Force Field <friendly name> <ff version>"`
+
+# Group
+
+The dataset's group should be set to `"OpenFF"`.
+
+# Molecule validation
+
+* See ["Molecule submission checklist"](https://github.com/openforcefield/qcsubmit/issues/9)
+
+# Standard functions and modules for entry preparation
+
+* [QCSubmit](https://github.com/openforcefield/qcsubmit)
+
+# Related/ongoing discussions
+
+## Required fields
+
+* See ["Fields that should be required for OpenFF submissions"](https://github.com/openforcefield/qcsubmit/issues/3)
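
The conventions that STANDARDS.md adds in this pull request (dataset name, tags, group, and the WBO request) could be checked before submission with something like the sketch below. It is only an illustration and is not part of the PR or of QCSubmit: `check_dataset`, `NAME_PATTERN`, and `STATUS_TAGS` are hypothetical names, and it assumes that versions are written `v<major>.<minor>`, that `-beta` directly follows the version, and that a dataset carries exactly one of the two status tags. Force field release datasets use a different name format and are not covered here.

```python
import re

# Assumed shape: "OpenFF <descriptive name> v<major>.<minor>", optionally ending in "-beta".
NAME_PATTERN = re.compile(
    r"^OpenFF (?P<name>\S.*?) v(?P<major>\d+)\.(?P<minor>\d+)(?P<beta>-beta)?$"
)

# Status tags described in STANDARDS.md; treating them as mutually exclusive is an assumption.
STATUS_TAGS = {"complete", "inflight"}


def check_dataset(name, tags, group, scf_properties):
    """Return a list of problems; an empty list means every check passed."""
    problems = []

    # Naming and versioning convention.
    if NAME_PATTERN.match(name) is None:
        problems.append(f"name {name!r} does not match 'OpenFF <name> v<version>'")

    # Every dataset should carry the "openff" tag plus one status tag.
    tag_set = set(tags)
    if "openff" not in tag_set:
        problems.append("missing 'openff' tag")
    if len(tag_set & STATUS_TAGS) != 1:
        problems.append("expected exactly one of 'complete' or 'inflight'")

    # The dataset's group should be "OpenFF".
    if group != "OpenFF":
        problems.append(f"group is {group!r}, expected 'OpenFF'")

    # Wiberg bond orders are requested through the SCF properties list.
    if "wiberg_lowdin_indices" not in scf_properties:
        problems.append("'wiberg_lowdin_indices' not in the SCF properties list")

    return problems
```

For instance, `check_dataset("OpenFF Example Set v1.0", ["openff", "inflight"], "OpenFF", ["wiberg_lowdin_indices"])` returns an empty list; the dataset name in that call is made up for illustration.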