Design rationale
This document aims to answer the following questions regarding the design of the modules package. For questions regarding the implementation, refer to the wiki page on Specification.
- Why? Why not use/write packages?
- Why do I manually need to assign the loaded module to a variable?
- Why are nested names accessed via $?
While using R for exploratory data analysis as well as writing more robust
analysis code, I have experienced the R mechanism of clumsily sourcing lots
of files to be a big hindrance. In fact, just adding a few helper functions to
make using source less painful naturally evolved into an incomplete ad-hoc
implementation of modules.
The standard answer to this problem is “write a package”. But in the humble opinion of this person, R packages fall short in several regards, which this package (the irony is not lost on me) strives to rectify.
Writing packages incurs a non-trivial overhead. Packages need to live in their own folder hierarchy (and, importantly, cannot be nested), and they require the specification of some meta information, much of which is simply irrelevant unless there is an immediate interest in publishing the package (such as the author name and contact, and licensing information). While it’s all right to thus encourage publication, realistically most code, even if reused internally, is never published.
Last but not least, packages, before they can be used in code, need to be built and installed. And this needs to be repeated every time a single line of code is changed in the package. This is fine when developing a package in isolation; not so much when developing it in tandem with a bigger code base.
devtools improves this workflow but, as a commenter on Stack Overflow has pointed out,
devtools […] reduces the packaging effort from X to X/5, but X/5 in R is still significant. In sensible interpreted languages X equals zero!
A direct consequence of this is that many people do end up sourcing all
their code, and copying it between projects, and not putting their reusable code
into a package. At best this is a lost opportunity. At worst you struggle
keeping helper files between different projects in sync, which I’ve seen happen
a lot.
Modular code often naturally forms recursive hierarchies. Most languages recognise this and allow modules to be nested (just think of Python’s or Java’s packages). R is the only widely used modern language (that I can think of) which has a flat package hierarchy.
Allowing hierarchical nesting encourages users to organise project code into small, reusable modules from the outset. Even if these modules never get reused, they still improve the maintainability of the project.
R’s packaging mechanism encourages huge, monolithic packages chock full of unrelated functions. CRAN has plenty of such packages. Without pointing fingers, let me give, as an example, the otherwise tremendously helpful agricolae package, whose description reads
Statistical Procedures for Agricultural Research
… I know projects which use this package because it includes a function to generate a consensus tree via bootstrapping. The projects in question have no relation whatsoever to agricultural research – and yet they resort to using a package whose name hints at its purpose, simply because of low cohesion.
R’s packages fundamentally bias development towards bad software engineering practices.
R packages provide namespaces and a mechanism for shielding client code from imports in the packages themselves. Nevertheless, there are situations where name clashes occur, because not all packages use namespaces (correctly). R 3.0.0 has allegedly solved this (by requiring use of namespaces) but I can still reproducibly generate a name clash with at least one package.
Why does import force the user to write
module = import('module')
where the module name is redundant, instead of
import('module')
with the latter call automatically defining the required variable in the calling
code? R definitely makes this possible (reload does it). However, several
reasons speak against it. It’s potentially destructive (inasmuch as it may
inadvertently overwrite an existing variable), and it makes the function rely
entirely on side effects, something which R code should always be wary of. It
also makes it less obvious how to define an alias for the imported module in
user code. As it is, the user can simply alias a module by assigning it to a
different name, e.g. m = import('module').
Granted, both unload and reload violate this. However, both are actually
safe because they only change the variable explicitly passed to them, and they
shouldn’t be used in most code anyway (their purpose is for use in interactive
sessions while developing modules).
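
To make the trade-off concrete, here is a minimal base R sketch of the two designs. It does not use the package itself: load_module, load_module_implicitly and reload_sketch are hypothetical stand-ins invented for this illustration, and a bare environment plays the role of a loaded module. The actual import, unload and reload may well be implemented differently.

```r
# All names below are hypothetical stand-ins, not the package's functions.

# Variant 1: return the module; the caller decides where to store it.
load_module = function (name) {
    module = new.env()
    module$name = name
    module
}

# Variant 2: define a variable called `name` in the caller's frame,
# purely as a side effect.
load_module_implicitly = function (name) {
    caller = parent.frame()
    assign(name, load_module(name), envir = caller)
    invisible(NULL)
}

# Rebind only the variable that was explicitly passed in.
reload_sketch = function (module) {
    caller = parent.frame()
    variable = substitute(module)
    stopifnot(is.name(variable))
    assign(as.character(variable), load_module(module$name), envir = caller)
    invisible(NULL)
}

helpers = c(1, 2, 3)

# Explicit assignment leaves `helpers` alone, and choosing an alias is
# just a matter of choosing a different variable name.
h = load_module('helpers')
still_numeric = is.numeric(helpers)

# Implicit assignment silently replaces the existing numeric vector.
load_module_implicitly('helpers')
overwritten = is.environment(helpers)

# Only `h` is rebound here; the function touches no other variable.
old_h = h
reload_sketch(h)
rebound = ! identical(old_h, h)
```

In this sketch the implicit variant clobbers helpers without any warning, whereas reload_sketch can only ever change the one variable the user named in the call.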
Module objects are environments and, as such, allow any form of access that
normal environments allow. This notably includes access of objects via the $
operator. This differs from R packages, where objects can be explicitly
addressed with the package::object syntax. For now, this syntax is not
supported for modules because it is ambiguous when a module name shadows a
package.
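
The access forms mentioned above can be tried out with plain environments. The following is a sketch only: the environments are built by hand rather than loaded with import, and the names module, tools, double and answer are made up for the example.

```r
# A hand-built environment standing in for a module object (assumption:
# real module objects behave like ordinary environments, as stated above).
module = new.env()
module$answer = 42
module$double = function (x) 2 * x

# A nested environment standing in for a nested module.
module$tools = new.env()
module$tools$triple = function (x) 3 * x

# Access via `$`, including nested names.
doubled = module$double(21)
tripled = module$tools$triple(2)

# Other forms of environment access work as well.
via_brackets = module[['answer']]
via_get = get('answer', envir = module)
exported = ls(module)

# `::` treats its left-hand side as a package name, not as a variable.
# With a module stored in a variable called `stats`, the call below
# still reaches the stats package rather than the module.
stats = module
package_median = stats::median(c(1, 2, 3))
```

The last two lines show the kind of clash the paragraph refers to: once a module variable shares its name with a package, stats::median and stats$answer address two different things.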