List of modules #9

Open
mslw opened this issue Oct 21, 2021 · 3 comments
Labels
content discussion Discussion regarding course content

Comments

@mslw
Contributor

mslw commented Oct 21, 2021

During planning, we arrived at four modules, which would introduce the following (datalad commands are listed, but they also represent concepts):

Module 1 (day 1, first half) [Content tracking with datalad]

  • datalad help
  • datalad create
  • datalad save
  • datalad run (moved from module 3)
  • binary vs. text (git-annex remarks: datalad get, datalad drop?)
  • git log

Module 2 (day 1, second half) [Structuring data]

  • Naming (structured names, not leaking personal data)
  • Data organisation (files/directories, tabular data, binary data sidecar metadata strategy)
  • Data modularity
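
A minimal sketch of what "structured names" and a sidecar metadata file could look like in practice. The entity keys (`sub`, `ses`, `task`), their order, and the helper names `build_name` and `sidecar_name` are hypothetical and chosen only for illustration, loosely in the spirit of key-value naming schemes such as BIDS; the planning notes above do not prescribe a specific convention.

```python
import re

# Assumed entity order for this sketch; not taken from the workshop plan.
ENTITY_ORDER = ("sub", "ses", "task")
# Values are restricted to letters and digits so names stay easy to parse.
VALUE_PATTERN = re.compile(r"[A-Za-z0-9]+")


def build_name(entities, suffix, extension):
    """Build a structured file name such as 'sub-01_task-rest_bold.nii.gz'."""
    unknown = set(entities) - set(ENTITY_ORDER)
    if unknown:
        raise ValueError(f"unknown entities: {sorted(unknown)}")
    parts = []
    for key in ENTITY_ORDER:
        if key not in entities:
            continue
        value = str(entities[key])
        if not VALUE_PATTERN.fullmatch(value):
            raise ValueError(f"{key} value {value!r} must be alphanumeric")
        parts.append(f"{key}-{value}")
    parts.append(suffix)
    return "_".join(parts) + extension


def sidecar_name(data_name):
    """Return the JSON sidecar name for a data file (same stem, '.json')."""
    stem = data_name.split(".", 1)[0]
    return stem + ".json"


name = build_name({"sub": "01", "task": "rest"}, "bold", ".nii.gz")
print(name)
print(sidecar_name(name))
```

The pattern check only enforces the shape of a name. Keeping personal data out of file names still depends on using pseudonymous identifiers (for example `01` rather than a participant's name) as the entity values.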

Module 3 (day 2, first half) [Dataset management]

  • datalad create (subdatasets)
  • datalad run (moved to module 1)

Module 4 (day 2, second half) [Remote cooperation]

  • datalad clone
  • datalad get
  • datalad drop
  • datalad push
  • datalad update

Later ideas include:

  • carving out time for Q&A (from module 3?) and general introduction / motivation (from or before module 1)
  • having a separate, more technical, module with a concise tour of core DataLad commands (not included in the workshop, but available as a reference)
@mslw mslw added the content discussion Discussion regarding course content label Oct 21, 2021
@mslw mslw changed the title from "Modules" to "List of modules" Oct 21, 2021
@jsheunis
Contributor

I'm not sure if there's space in the course to cover this, but I think a useful part of such training would be a walk-through of how to use DataLad in practice for evolving research datasets. What I've found is that the existing DataLad resources are great for understanding its core capabilities and commands, and for knowing which options are available for data hosting and more (and how to configure/implement them). But I think there's a lack of good training resources to help people with the practical challenges of deciding when and why to implement DataLad in a specific way, or when to do which steps (and what to be cognisant of when doing so), especially at varying stages of the life cycle of research data.

Practical examples that my limited experience allows me to think of atm:

  • Half of my data is raw data, the other half is organised according to some standard structure. Should I first datalad create -f all the raw data, then rerun the standardisation scripts on all data? Or should I create a datalad dataset from the data once everything is standardised?
  • If I don't yet know where my data will be hosted, and currently it lives on some compute cluster, how should I approach creating the dataset and saving content to an annex?
  • Are there recommendations for giving people access to restricted data that are structured as datalad datasets?
  • What if I'm going to do a lot of manipulation of the data that will just add incomprehensible history to the datalad dataset? Shouldn't I just create the datalad dataset once the data is in a standardised and clean state?
  • How long does it take datalad to create a dataset from, say, 1 TB of data? Is this something to take into account when I create the dataset?

Perhaps something to discuss further if others also feel there's a need for something like this.

@mslw
Contributor Author

mslw commented Dec 14, 2021

Sorry for keeping this silent for a while. For me it's definitely worth discussing further, especially in terms of whether we want some of the more general issues discussed within the last (4th) module of the "core" workshop, whether we want to create an additional module to talk about these things (potentially to be used for more advanced workshops), or whether we should put this content elsewhere. The main question for me is whether a discussion of these questions will be informative to people who have just learned the very basics of DataLad.

@jsheunis
Contributor

Good point. I think they might first be confronted with such challenges once they have had time to work with datalad or have tried it on their own large dataset. Perhaps a more useful way to structure this type of lesson is via a practical walk-through, where the questions (and subsequent answers) present themselves logically and chronologically through the storyline. The combination of this approach with the above-mentioned questions feels like it might be better suited to a more advanced audience, but I think such an approach (i.e. a practical walk-through with a challenge+solution-based storyline) could also work well for beginner topics. Could be an idea for the 4th module, whatever the topic ends up being.
