The R package as the unit of reproducible research #31
Comments
Great issue and well summarized! Just to note, this is clearly very closely tied to #11 and probably #6 as well. #11 includes a list of challenges various of us have encountered when trying this. Having tried to practice this for the past five years, I find some of the biggest challenges are as much conceptual as infrastructural. This only gets more difficult when new work builds on existing work, or when one investigation branches into separate ones. When is an idea/line of investigation ready to be a package? When to start a new package vs. continue with an existing one? Should I branch an existing repo to explore a new direction? Is 1 package : 1 paper the ideal? I have plenty of examples of these variations in my own GitHub account and would love to chat through some of this decision tree using concrete examples if anyone is interested. Meanwhile, the tooling has definitely gotten better. Just a quick comment on your 5. "The data is too big": Use |
Yes, I'd love to set some time aside to chat through this. |
I will confess to being skeptical that an R package = the natural unit of reproducible research. I'm talking about packaging and documenting a specific data analysis, such as for a publication. The main goal of a package is to provide functions for reuse in diverse contexts. The main goal of an analysis is to turn a set of inputs into a set of outputs. As far as re-purposing existing tools, I find |
Re: @gaborcsardi's proposal … we'd have to really give vignettes more love in the workflow/tools. |
@jennybc I agree that it is not natural, but maybe with some tools we can make it (more) natural. This is exactly what I wanted to discuss. A good R-based build system is such a tool, for example. Something along the lines of https://github.com/richfitz/remake or the Grunt JavaScript project. My premise is that R packages do provide a lot of things that you need for a "reproducible research project". Let's discuss what is missing, to see if it is reasonable to create it. @cboettig I completely agree that the conceptual challenges are at least as big. These will never go away entirely, but good tools can help direct researchers towards good practices. |
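To make the "Make-like" idea in the comment above concrete, here is a minimal sketch in base R of the core rule such a build system relies on: rebuild a target only when it is missing or older than one of its inputs. The helper names `needs_rebuild` and `build_target` are hypothetical and are not part of remake or Grunt; this only illustrates the general technique, not how either tool is implemented.

```r
# Hypothetical helper (not from remake): decide whether `target` is stale.
# A target is stale if it does not exist or if any dependency is newer.
needs_rebuild <- function(target, deps) {
  if (!file.exists(target)) return(TRUE)
  missing_deps <- deps[!file.exists(deps)]
  if (length(missing_deps) > 0) {
    stop("missing dependency: ", paste(missing_deps, collapse = ", "))
  }
  any(file.mtime(deps) > file.mtime(target))
}

# Hypothetical helper: run `rule(target, deps)` only when the target is stale.
# Returns TRUE (invisibly) if the rule ran, FALSE if the target was up to date.
build_target <- function(target, deps, rule) {
  if (needs_rebuild(target, deps)) {
    rule(target, deps)
    invisible(TRUE)
  } else {
    invisible(FALSE)
  }
}
```

A call such as `build_target("clean.csv", "raw.csv", rule)` would then run `rule` on the first invocation and skip it afterwards until `raw.csv` changes. A real build system adds a dependency graph on top of this rule and often uses content hashes rather than timestamps.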
I do love the idea of an R-based build system and the Make-like aspects of I should learn something about Grunt …. |
Thanks for this discussion, @gaborcsardi. One missing piece here is automating documentation for data. @cboettig |
Agree with @sckott, but if the data are too large and potentially dynamic, pointing to DOI'd snapshots (à la DataONE?) is an important thing to consider too. |
@jread-usgs In the future we should also be able to just do this with Dat. Such that the package contains a dat remote and appropriate metadata. Then a user can just dat clone when retrieving the raw data. This is planned functionality for rDat. |
@karthik excited for that to be a reality. cool. |
All good ideas. Quick comments:
|
This topic/thread has sparked an interesting discussion in our office, and I wanted to bring up a point that @lawinslow made that I had missed: often the point of packages is to abstract elements of the data processing, which is in direct conflict with the concepts of reproducible research. Of course the guts would be available in the source, but the points of emphasis may be at odds for the two concepts. The discussion about scripts vs. functions that @jennybc brought up also supports this point. Just another thing to consider as part of the high-level discussion "R packages as reproducible research". |
@jread-usgs I was probably unclear about a lot of things. I didn't mean to say that all R packages are research. I agree that packages are abstractions; in a sense, all of programming is about making abstractions. For the purposes of the research, at any given stage of the project (i.e. at a git commit), these abstractions correspond to some specific implementations. (At least if you want to execute them, you need implementations.) And that is all you need to make the research reproducible. As for scripts, I think they are fine. We can just put them in |
I made some notes and put them up at https://docs.google.com/document/d/1EnQzDe1gp-j_bdg8WXZjFr9mFk2-DutgJxNZhRZQ3bU/edit?usp=sharing |
Hi all. I've just created the repository we discussed yesterday. The README can host our notes from yesterday and developing thoughts: https://github.com/ropensci/rrrpkg I'm about to take @hadley's notes and dump them in. I may also take a pass through, in case I can add anything. Please feel free to add more via PR or ask me if you want to be a collaborator. |
In my opinion a piece of reproducible research needs at least the following ingredients:
If you think about it, this is more or less the description of an R package in a git repository:
/vignettes, /inst
/man
/tests
/data (OK, not that simple, because data can be big, so maybe into another data package, or a database.)
Imports on them in DESCRIPTION
/vignettes, /inst
SystemDependencies in DESCRIPTION
So do we have everything to use R packages as research units? Probably not, but we are really close, I think. In my opinion we would need:
devtools is great for developing packages in general. Some more specialized tools that go further towards this particular use of packages would be nice.
If this does make sense to you (or the opposite :), I'll be happy to chat about it more.
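As an illustration of the layout described in this post, the following base-R sketch creates the directories it names and writes a minimal DESCRIPTION file. Everything specific here is an assumption made for the example: the helper name `create_research_skeleton`, the package name, and the field values are placeholders. The post refers to "SystemDependencies"; the sketch uses `SystemRequirements`, which is the standard DESCRIPTION field for declaring system-level dependencies. An `R/` directory is added because that is where a package's functions live.

```r
# Hypothetical helper: lay out a research-compendium-style package skeleton.
# All names and field values below are placeholders for illustration.
create_research_skeleton <- function(path, name = "mypaper") {
  root <- file.path(path, name)
  # Directories mentioned in the post, plus R/ for the package's functions.
  dirs <- c("R", "man", "tests", "data", "vignettes", "inst")
  for (d in dirs) {
    dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
  }
  # Minimal DESCRIPTION: R package dependencies go in Imports,
  # system-level dependencies in SystemRequirements.
  desc <- data.frame(
    Package = name,
    Title = "Analysis Accompanying a Paper",
    Version = "0.0.1",
    Imports = "stats",
    SystemRequirements = "GNU make",
    stringsAsFactors = FALSE
  )
  write.dcf(desc, file = file.path(root, "DESCRIPTION"))
  invisible(root)
}
```

Calling `create_research_skeleton(tempdir(), "demo")` produces the directory tree and a DESCRIPTION that `read.dcf()` can parse. This is not a complete, installable package; it only shows where each ingredient would go.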