Formalize YODA principles #2

Open
asmacdo opened this issue Jan 24, 2025 · 0 comments
asmacdo commented Jan 24, 2025

YODA has also been proposed as a standard/best practice for ReproNim (ReproNim/repronim.org#206).

IMO, YODA should clearly separate the principles from the suggestions, and should be fully decoupled from DataLad.

"Standards speak" would need to be expanded and explained to make sense to the unfamiliar, but this is what I have in mind for the formal bit. What do you think, @yarikoptic?

YODA IDEALS:

  • "YODA compliant datasets" contain well-defined, portable computational environments to compute analysis results.
  • "YODA compliant datasets" preserve provenance of the computational procedures that produce or alter derivative data.
  • "YODA compliant datasets" strive for reproducibility.

YODA PRINCIPLES:

  • All assets essential to replicate computational execution MUST be included
  • All assets essential to replicate computational execution MUST be version controlled
  • All assets essential to replicate computational execution SHOULD be version controlled using the
    same version control system
  • All assets essential to replicate computational execution MAY be linked (as a subdataset) or included directly in the dataset
  • Provenance of all modifications to the assets MUST be annotated
  • Dataset structure SHOULD accommodate domain standards
  • Assets SHOULD be organized in a modular structure
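To make the modularity principle concrete, here is a minimal sketch of checking a dataset for expected top-level components. The component names (`inputs`, `code`, `envs`) are illustrative assumptions, not part of the principles; a real layout may instead follow a domain standard such as BIDS, with locations discovered from configuration.

```python
from pathlib import Path
import tempfile

# Hypothetical top-level components of a YODA-style dataset; the
# principles do not mandate these names -- layout is flexible and
# may follow a domain standard instead.
REQUIRED = ("inputs", "code", "envs")

def missing_components(dataset_root):
    """Return the required components absent from a dataset directory."""
    root = Path(dataset_root)
    return [name for name in REQUIRED if not (root / name).is_dir()]

# Demo against a throwaway directory containing only "code".
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "code").mkdir()
    print(missing_components(tmp))  # -> ['inputs', 'envs']
```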

YODA ASSETS:

(This part could probably be left out of the formal section and discussed in the detailed explanation)

MUST:

  • input data
  • analysis code/scripts (upstream or custom)
  • computational environments (e.g. as container images)
  • documentation

SHOULD:

  • Test scripts
  • Automation

NOTES

Original Organigram: https://f1000research.com/posters/7-1965

Top level

Track all input data, code, and computational environments needed to produce analysis outputs in
version controlled datasets — and reproducibility you will achieve!

Learn control you must.
Size matters not!

- Subdataset references in a dataset are
  extremely lightweight yet guarantee data identity via cryptographic hashes.  Subdatasets can be
  detached without losing this information, yielding massively improved storage efficiency and
  reduced archive costs.
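The identity guarantee comes down to content hashing, as in git: identical bytes always produce the same key, and any change produces a different one. A minimal illustration using Python's standard hashlib (this is not DataLad's actual key scheme, just the underlying idea):

```python
import hashlib

def content_key(data: bytes) -> str:
    """Content-addressed identifier: identical bytes -> identical key."""
    return hashlib.sha256(data).hexdigest()

a = content_key(b"subject-01 timeseries")
b = content_key(b"subject-01 timeseries")
c = content_key(b"subject-01 timeseries (edited)")
print(a == b)  # True: the key pins the exact content
print(a == c)  # False: any change yields a different identity
```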

- Publicly shared data compliant with a common standard are an optimal element in a modular study
  setup. From mid-2018, OpenNeuro (previously OpenFMRI) will offer DataLad datasets for direct
  download.

Principles

*P1* Use well-defined, portable computational environments to compute analysis results

*P2* Exhaustively track ALL analysis inputs in the same version control system
as the computed results, including:
- input data
- custom analysis code/scripts
- required computational environments (e.g. as container images)

*P3* Structure study elements (data, code, environments) in modular
components to facilitate reuse within or outside the context of the
original study

Dataset Layout

Dataset structure is fully flexible to be able to accommodate domain standards (e.g. BIDS).  Element
location/name can be discovered from configuration.

Required (3rd-party) code repositories can be referenced as subdatasets, just like datasets with data
files. The repository state is an unambiguous version record.

Images of containerized computational environments are tracked in version control just like any
other data file. Actual storage can be local or in the cloud.

Any input data is referenced via the dataset that contains it. The dataset state provides an
unambiguous version specification for any data dependency.

DataLad can obtain required subdataset content on demand. Only content elements actually required
for an analysis are present; the directory structure is expanded recursively as needed.

Test scripts can be used to check analysis code, verify data integrity, and assess computational
reproducibility.
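A minimal integrity-check sketch along these lines; the manifest contents and file path are hypothetical. A real study would record the checksums when the data were first ingested:

```python
import hashlib

# Hypothetical manifest of expected checksums, recorded at ingestion time.
MANIFEST = {
    "inputs/sub-01.dat": hashlib.sha256(b"raw measurements").hexdigest(),
}

def verify(relpath: str, content: bytes) -> bool:
    """Integrity check: does the content still match the recorded hash?"""
    return hashlib.sha256(content).hexdigest() == MANIFEST[relpath]

print(verify("inputs/sub-01.dat", b"raw measurements"))  # True
print(verify("inputs/sub-01.dat", b"tampered bytes"))    # False
```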

DataLad Handbook

https://handbook.datalad.org/en/latest/basics/101-127-yoda.html

Principles

P1: One thing, one dataset
P2: Record where you got it from, and where it is now
P3: Record what you did to it, and with what