List of modules #9

mslw · 2021-10-21T13:20:58Z

During planning, we arrived at four modules, which would introduce the following (datalad commands are listed, but they also represent concepts):

Module 1 (day 1, first half) [Content tracking with datalad]

datalad help
datalad create
datalad save
datalad run (moved from module 3)
binary vs. text (git-annex remarks: datalad get, datalad drop?)
git log
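
For illustration only (not part of the plan above), the Module 1 commands might fit together in one short session roughly as follows. The dataset name, file names, and commit messages are made up, and the sketch assumes that datalad, git-annex, and a Git identity are already set up; it skips itself otherwise.

```shell
#!/bin/sh
# Illustrative sketch: dataset name, file names, and messages are made up.
# Assumes datalad, git-annex, and a configured Git identity.
# Built-in help is available for any command, e.g.: datalad create --help

if command -v datalad >/dev/null 2>&1; then
    demo_dir=$(mktemp -d)
    cd "${demo_dir}"

    # Create a new, empty dataset and enter it.
    datalad create my-dataset
    cd my-dataset

    # Add a file and record the change.
    echo "Project notes" > notes.txt
    datalad save -m "Add project notes"

    # Run a command and record it together with the file it produces.
    datalad run -m "Count lines in notes" "wc -l notes.txt > line_count.txt"

    # Inspect the resulting history.
    git log --oneline
else
    echo "datalad not found, skipping the demo"
fi
```

The history printed at the end should contain the dataset creation, the manual save, and the recorded run, which is one way to tie the "content tracking" theme together.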

Module 2 (day 1, second half) [Structuring data]

Naming (structured names, not leaking personal data)
Data organisation (files/directories, tabular data, binary data sidecar metadata strategy)
Data modularity
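
As a rough sketch of what "structured names" and a "sidecar metadata strategy" could look like in practice: the key-value naming pattern, the file extensions, and the metadata fields below are assumptions chosen for illustration, not a convention agreed on in this issue.

```shell
#!/bin/sh
# Hypothetical layout: all names and metadata fields are illustrative.
demo_dir=$(mktemp -d)

# Structured name built from key-value pairs. The subject is identified by
# a pseudonymous ID, never by a personal name or date of birth.
subject_id="01"
task="rest"
stem="sub-${subject_id}_task-${task}_recording"

mkdir -p "${demo_dir}/sub-${subject_id}"
data_file="${demo_dir}/sub-${subject_id}/${stem}.dat"
sidecar_file="${demo_dir}/sub-${subject_id}/${stem}.json"

# Placeholder for a binary data file (here: 16 zero bytes).
head -c 16 /dev/zero > "${data_file}"

# Sidecar: a small text file with the same stem that describes the binary.
cat > "${sidecar_file}" <<EOF
{
  "task": "${task}",
  "sampling_frequency_hz": 100
}
EOF

ls "${demo_dir}/sub-${subject_id}"
```

Because the binary file and its sidecar share a stem, they sort next to each other and can be matched by a script without opening the binary.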

Module 3 (day 2, first half) [Dataset management]

datalad create (subdatasets)
~~datalad run~~ (moved to module 1)
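
A minimal sketch of the subdataset use of datalad create, assuming datalad, git-annex, and a Git identity are available. The dataset names are hypothetical.

```shell
#!/bin/sh
# Illustrative sketch: dataset names are made up.
if command -v datalad >/dev/null 2>&1; then
    demo_dir=$(mktemp -d)
    cd "${demo_dir}"

    # Create the top-level (super)dataset and enter it.
    datalad create my-project
    cd my-project

    # Create a new dataset inside it; "-d ." registers it as a subdataset
    # of the dataset in the current directory.
    datalad create -d . raw-data

    # List the subdatasets registered in the superdataset.
    datalad subdatasets
else
    echo "datalad not found, skipping the demo"
fi
```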

Module 4 (day 2, second half) [Remote cooperation]

datalad clone
datalad get
datalad drop
datalad push
datalad update
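
A possible sketch of how the Module 4 commands relate to each other, with a second local directory standing in for the "remote" dataset. All names are hypothetical, the `--how merge` option assumes a reasonably recent DataLad release, and datalad push is only shown as a comment because it needs a sibling with write access that this sketch does not set up.

```shell
#!/bin/sh
# Illustrative sketch: dataset and file names are made up.
if command -v datalad >/dev/null 2>&1; then
    demo_dir=$(mktemp -d)
    cd "${demo_dir}"

    # A "remote" dataset, simulated here with a local directory.
    datalad create origin-dataset
    (
        cd origin-dataset
        head -c 1024 /dev/zero > data.bin
        datalad save -m "Add a binary file"
    )

    # Clone it; annexed file content is not transferred yet.
    datalad clone origin-dataset my-clone
    cd my-clone

    # Retrieve the file content on demand, then free the space again.
    datalad get data.bin
    datalad drop data.bin

    # Fetch and merge changes made in the original dataset since cloning.
    datalad update -s origin --how merge

    # Publishing changes requires a sibling you can write to, for example:
    # datalad push --to <sibling-name>
else
    echo "datalad not found, skipping the demo"
fi
```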

Later ideas include:

carving out time for Q&A (from module 3?) and general introduction / motivation (from or before module 1)
having a separate, more technical, module with a concise tour of core DataLad commands (not included in the workshop, but available as a reference)

jsheunis · 2021-10-27T22:40:19Z

I'm not sure if there's space in the course to cover this, but I think a useful part of such training would be a walk-through of how to use DataLad in practice for evolving research datasets. What I've found is that the existing DataLad resources are great for understanding its core capabilities and commands, and for learning which options are available for data hosting and more (and how to configure/implement them). But I think there's a lack of good training resources to help people with the practical challenges of deciding when and why to implement DataLad in a specific way, or when to do which steps (and what to be cognisant of when doing so), especially at varying stages of the life cycle of research data.

Practical examples that my limited experience allows me to think of atm:

Half of my data is raw data, the other half is organised according to some standard structure. Should I first datalad create -f all the raw data, then rerun the standardisation scripts on all data? Or should I create a datalad dataset from the data once everything is standardised?
If I don't yet know where my data will be hosted, and currently it lives on some compute cluster, how should I approach creating the dataset and saving content to an annex?
Are there recommendations for giving people access to restricted data that are structured as datalad datasets?
What if I'm going to do a lot of manipulation of the data that's just going to add incomprehensible history to the datalad dataset? Shouldn't I just create the datalad dataset once the data is in a standardised and clean state?
How long does it take datalad to create a dataset from, say, 1 TB of data? Is this something to take into account when I create the dataset?

Perhaps something to discuss further if others also feel there's a need for something like this.

mslw · 2021-12-14T17:02:57Z

Sorry for keeping this silent for a while. For me it's definitely worth discussing further - especially in terms of whether we want some of the more general issues discussed within the last (4th) module of the "core" workshop, whether we want to create an additional module to talk about these things (potentially to be used for more advanced workshops), or whether to put this content elsewhere. The main question for me is whether discussing these questions will be informative to people who have just learned the very basics of DataLad.

jsheunis · 2021-12-14T18:53:55Z

Good point. I think they might first be confronted with such challenges once they have had time to work with datalad or have tried it on their own large dataset. Perhaps a more useful way to structure this type of lesson is via a practical walk-through, where the questions (and subsequent answers) present themselves logically and chronologically through the storyline. Combining this approach with the above-mentioned questions feels more suited to an advanced audience, but I think such an approach (i.e. a practical walk-through with a challenge+solution-based storyline) could also work well for beginner topics. Could be an idea for the 4th module, whatever the topic ends up being.

mslw added the "content discussion" label (Discussion regarding course content) Oct 21, 2021

mslw changed the title ~~Modules~~ List of modules Oct 21, 2021

mslw mentioned this issue Oct 21, 2021

Dataset management module contents #10

Closed
