Formalize YODA principles #2

Open
asmacdo opened this issue Jan 24, 2025 · 0 comments
asmacdo commented Jan 24, 2025

YODA has also been proposed as a standard/best practice for ReproNim (ReproNim/repronim.org#206).

IMO, YODA should clearly separate the principles from the suggestions, and should be fully decoupled from DataLad.

"Standards speak" would need to be expanded and explained to make sense to the unfamiliar, but this is what I have in mind for the formal bit. What do you think, @yarikoptic?

YODA IDEALS:

  • "YODA compliant datasets" contain well-defined, portable computational environments to compute analysis results.
  • "YODA compliant datasets" preserve provenance of the computational procedures that produce or alter derivative data.
  • "YODA compliant datasets" strive for reproducibility.

YODA PRINCIPLES:

  • All assets essential to replicate computational execution MUST be included
  • All assets essential to replicate computational execution MUST be version controlled
  • All assets essential to replicate computational execution SHOULD be version controlled using the
    same version control system
  • All assets essential to replicate computational execution MAY be linked (as a subdataset) or included directly in the dataset
  • Provenance of all modifications to the assets MUST be annotated
  • Dataset structure SHOULD accommodate domain standards
  • Assets SHOULD be organized in a modular structure
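To make the modularity principle concrete, here is a minimal sketch of checking a dataset for expected top-level components. The component names (`inputs`, `code`, `envs`) are illustrative assumptions, not part of the principles; a real layout may instead follow a domain standard such as BIDS, with locations discovered from configuration.

```python
from pathlib import Path
import tempfile

# Hypothetical top-level components of a YODA-style dataset; the
# principles do not mandate these names -- layout is flexible and
# may follow a domain standard instead.
REQUIRED = ("inputs", "code", "envs")

def missing_components(dataset_root):
    """Return the required components absent from a dataset directory."""
    root = Path(dataset_root)
    return [name for name in REQUIRED if not (root / name).is_dir()]

# Demo against a throwaway directory containing only "code".
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "code").mkdir()
    print(missing_components(tmp))  # -> ['inputs', 'envs']
```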

YODA ASSETS:

(This part could probably be left out of the formal section and discussed in the detailed explanation)

MUST:

  • input data
  • analysis code/scripts (upstream or custom)
  • computational environments (e.g. as container images)
  • documentation

SHOULD:

  • Test scripts
  • Automation

NOTES

Original Organigram: https://f1000research.com/posters/7-1965

Top level

Track all input data, code, and computational environments needed to produce analysis outputs in
version controlled datasets — and reproducibility you will achieve!

Learn control you must.
Size matters not!

- Subdataset references in a dataset are
  extremely lightweight yet guarantee data identity via cryptographic hashes.  Subdatasets can be
  detached without losing this information, yielding massively improved storage efficiency and
  reduced archive costs.
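The identity guarantee comes down to content hashing, as in git: identical bytes always produce the same key, and any change produces a different one. A minimal illustration using Python's standard hashlib (this is not DataLad's actual key scheme, just the underlying idea):

```python
import hashlib

def content_key(data: bytes) -> str:
    """Content-addressed identifier: identical bytes -> identical key."""
    return hashlib.sha256(data).hexdigest()

a = content_key(b"subject-01 timeseries")
b = content_key(b"subject-01 timeseries")
c = content_key(b"subject-01 timeseries (edited)")
print(a == b)  # True: the key pins the exact content
print(a == c)  # False: any change yields a different identity
```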

- Publicly shared data compliant with a common standard are an optimal element in a modular study
  setup. From mid-2018, OpenNeuro (previously OpenFMRI) will offer DataLad datasets for direct
  download.

Principles

*P1* Use well-defined, portable computational environments to compute analysis results

*P2* Exhaustively track ALL analysis inputs in the same version control system
as the computed results, including:
- input data
- custom analysis code/scripts
- required computational environments (e.g. as container images)

*P3* Structure study elements (data, code, environments) in modular
components to facilitate reuse within or outside the context of the
original study

Dataset Layout

Dataset structure is fully flexible to be able to accommodate domain standards (e.g. BIDS).  Element
location/name can be discovered from configuration.

Required (3rd-party) code repositories can be referenced as subdatasets, just like datasets with data
files. The repository state is an unambiguous version record.

Images of containerized computational environments are tracked in version control just like any
other data file. Actual storage can be local or in the cloud.

Any input data is referenced via the dataset that contains it. The dataset state provides an
unambiguous version specification for any data dependency.

DataLad can obtain required subdataset content on demand. Only content elements actually required
for an analysis are present; the directory structure is expanded recursively as needed.

Test scripts can be used to check analysis code, verify data integrity, and assess computational
reproducibility.
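A minimal integrity-check sketch along these lines; the manifest contents and file path are hypothetical. A real study would record the checksums when the data were first ingested:

```python
import hashlib

# Hypothetical manifest of expected checksums, recorded at ingestion time.
MANIFEST = {
    "inputs/sub-01.dat": hashlib.sha256(b"raw measurements").hexdigest(),
}

def verify(relpath: str, content: bytes) -> bool:
    """Integrity check: does the content still match the recorded hash?"""
    return hashlib.sha256(content).hexdigest() == MANIFEST[relpath]

print(verify("inputs/sub-01.dat", b"raw measurements"))  # True
print(verify("inputs/sub-01.dat", b"tampered bytes"))    # False
```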

DataLad Handbook

https://handbook.datalad.org/en/latest/basics/101-127-yoda.html

Principles

P1: One thing, one dataset
P2: Record where you got it from, and where it is now
P3: Record what you did to it, and with what