Merge pull request #100 from openforcefield/dataset-standards
Dataset submission guidelines/standards
dotsdl authored Jun 5, 2020
2 parents 261fb8f + ce2cb89 commit 1474845
Showing 2 changed files with 66 additions and 0 deletions.
3 changes: 3 additions & 0 deletions README.md
@@ -30,3 +30,6 @@ Data generation and submission scripts for the QCArchive ecosystem.
* `2019-07-02 VEHICLe optimization dataset`: source files for `Open Force Field VEHICLe optimization dataset 1.0.0` (`OptimizationDataset`) for the VEHICLe dataset (heteroaromatic rings of the future) (@jchodera)
* `2019-07-05 OpenFF NCI250K Boron 1`: source files for `OpenMM NCI250K Boron 1` (`OptimizationDataset`) where small boron-containing compounds are extracted from the [NCI250K](https://cactus.nci.nih.gov/download/nci/) (@jchodera)
* `2019-09-07-Pfizer-discrepancy-optimization-dataset-1`: source files for `Pfizer discrepancy optimization dataset 1` (@jchodera)

## Guidelines and standards for submitting new datasets
* See [STANDARDS.md](./STANDARDS.md)
63 changes: 63 additions & 0 deletions STANDARDS.md
@@ -0,0 +1,63 @@
This file outlines the standards and requirements for submitting a dataset to QCArchive.
Following them ensures that we have a consistent data model for downstream processes.

# Required fields

Current list:
* Ensure every submission includes CMILES identifiers; the most important is the mapped SMILES with explicit hydrogens.
* Ensure Wiberg bond orders (WBO) are requested for every submission by including the flag `wiberg_lowdin_indices` in the SCF properties list.
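
The two requirements above can be checked before submission. The following is a minimal sketch, not part of QCSubmit: the function `missing_required_fields` is hypothetical, and it assumes the mapped SMILES is stored under the CMILES key `canonical_isomeric_explicit_hydrogen_mapped_smiles` and that the SCF properties are available as a plain list of strings.

```python
# Assumed CMILES key for the mapped SMILES with explicit hydrogens.
MAPPED_SMILES_KEY = "canonical_isomeric_explicit_hydrogen_mapped_smiles"
# Flag named in the standards for requesting Wiberg bond orders.
WBO_FLAG = "wiberg_lowdin_indices"


def missing_required_fields(attributes, scf_properties):
    """Return a list of human-readable problems; an empty list means OK.

    `attributes` is a dict of CMILES identifiers for one entry and
    `scf_properties` is the list of requested SCF properties
    (both shapes are assumptions made for this sketch).
    """
    problems = []
    if not attributes.get(MAPPED_SMILES_KEY):
        problems.append(f"missing CMILES entry: {MAPPED_SMILES_KEY}")
    if WBO_FLAG not in scf_properties:
        problems.append(f"'{WBO_FLAG}' not in SCF properties")
    return problems


# Example: a water entry with a mapped SMILES and WBO requested.
entry = {MAPPED_SMILES_KEY: "[H:2][O:1][H:3]"}
print(missing_required_fields(entry, ["dipole", WBO_FLAG]))
```

Returning a list of problems rather than raising on the first one lets a submitter see every missing field in a single pass.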

# Best practices
* If any calculations from another collection are to be redone, re-use the old input (coordinates, atom ordering, etc.). The database then creates new references to the existing results instead of running the calculations again, which helps keep the cost of the calculations down.

# Dataset naming and versioning

Each dataset shall be versioned.
- The naming of a dataset should have the following structure:

`"OpenFF <descriptive and uniquely-identifying name> v<version number>"`

- The first submission of a dataset will have the version `"v1.0"`.

- A dataset with the suffix `"-beta"` is not to be used for production work.

- A minor version change (e.g. `"v1.1"`) means that cosmetic problems were fixed or minor additions were made, for example:
    - a misspelling was corrected
    - a property such as Wiberg bond orders was added
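
The naming convention above can be checked mechanically. The sketch below is illustrative only: the helper `parse_dataset_name` is hypothetical, the version is assumed to be `<major>.<minor>` as in the `"v1.0"` and `"v1.1"` examples, and the `"-beta"` suffix is assumed to follow the version number directly, which the standard does not spell out.

```python
import re

# "OpenFF <name> v<major>.<minor>" with an optional "-beta" suffix.
# The position of "-beta" and the two-part version are assumptions.
NAME_PATTERN = re.compile(
    r"^OpenFF (?P<name>\S.*?) v(?P<major>\d+)\.(?P<minor>\d+)(?P<beta>-beta)?$"
)


def parse_dataset_name(dataset_name):
    """Split a dataset name into its parts, or return None if it does not
    follow the convention."""
    match = NAME_PATTERN.match(dataset_name)
    if match is None:
        return None
    return {
        "name": match["name"],
        "version": (int(match["major"]), int(match["minor"])),
        "beta": match["beta"] is not None,
    }


# "Example Set" is a made-up dataset name used only for illustration.
print(parse_dataset_name("OpenFF Example Set v1.0"))
print(parse_dataset_name("Example Set 1.0"))  # does not match: None
```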

# Tags indicate status

A tag `"complete"` indicates that a dataset is completed as far as OpenFF is concerned.
This means that any errors remaining are known to be acceptable or impossible to fix.
It also means that no additional work is being done on the dataset to get it to completion.

A tag `"inflight"` indicates that a dataset is not completed as far as OpenFF is concerned.
This means that any errors remaining are being actively addressed.

All datasets should also feature an `"openff"` tag.
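
A tag check along these lines could be run before a dataset is submitted or updated. This is a sketch under stated assumptions: the function `tag_problems` is hypothetical, and the rule that a dataset carries exactly one of `"complete"` or `"inflight"` is an interpretation of the text above, not something it states outright.

```python
STATUS_TAGS = {"complete", "inflight"}
REQUIRED_TAG = "openff"


def tag_problems(tags):
    """Return a list of problems with a dataset's tags; empty means OK."""
    tag_set = set(tags)
    problems = []
    if REQUIRED_TAG not in tag_set:
        problems.append(f"missing required tag '{REQUIRED_TAG}'")
    # Assumption: "complete" and "inflight" are mutually exclusive and
    # one of them is always present.
    status = tag_set & STATUS_TAGS
    if len(status) != 1:
        problems.append(
            "expected exactly one status tag out of "
            f"{sorted(STATUS_TAGS)}, found {sorted(status)}"
        )
    return problems


print(tag_problems(["openff", "inflight"]))
print(tag_problems(["complete", "inflight"]))
```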

# Force Field Releases

When a new force field is released, a dataset corresponding to all results used for the force field fitting should be created.
This gives a single reference for these data instead of many references.
The format of these dataset names is:

`"OpenFF Force Field <friendly name> <ff version>"`

# Group

The dataset's group should be set to `"OpenFF"`.

# Molecule validation

* See ["Molecule submission checklist"](https://github.com/openforcefield/qcsubmit/issues/9)

# Standard functions and modules for entry preparation

* QCSubmit (https://github.com/openforcefield/qcsubmit)

# Related/ongoing discussions

## Required fields

* See ["Fields that should be required for OpenFF submissions"](https://github.com/openforcefield/qcsubmit/issues/3)
