Are versioned rocker images always guaranteed to contain the same R package versions? #201

Closed
myoung3 opened this issue Jul 28, 2021 · 7 comments
Labels
documentation Improvements or additions to documentation enhancement New feature or request

Comments

@myoung3
Contributor

myoung3 commented Jul 28, 2021

I was previously under the impression that the most recent version-tagged rocker image (e.g. currently geospatial:4.1.0) is not stable with respect to R package versions. I got this (mistaken?) idea from this quote:

Note that the MRAN date associated with the current release (e.g., 3.4.2 at the time of writing) will continue to advance on the Docker-hub image until the next R release.
https://journal.r-project.org/archive/2017/RJ-2017-065/RJ-2017-065.pdf

I was interpreting this quote to mean that the most recent version-tagged rocker image (e.g. 4.1.0) might have its installed R package versions change between the day 4.1.0 was released on rocker and the day the next version (i.e. 4.1.1) is released, because the built-in MRAN date stored in 4.1.0 can change as 4.1.0 is updated during that period.

I now realize that rocker-versioned2 no longer uses MRAN so that quote doesn't even apply anymore. Nevertheless, I'm still uncertain of how package version freezing/pinning works for the versioned images. It would be great if there were documentation on this (maybe in the readme?). Some documentation on the following specific topics would be great:

  • Which builds (e.g. versioned vs. latest) have stable, pinned R packages. The readme mentions that RStudio Package Manager is used for installing packages, but it doesn't mention anywhere that R package versions are guaranteed to be constant over time within a specific versioned rocker image. It would be a good idea to especially call out whether the most recent versioned rocker images are stable over time with respect to R package versions, given the above quote that confused me.
  • How exactly package versions are pinned. It would be great to give a general description of how RStudio Package Manager is used to pin versions (i.e. what is the process for determining which package versions end up in a versioned rocker image?).
  • Recommendations/best practices for how users might extend the rocker images by building on top of them with even more R packages (with pinned versions) in a way that's reproducible and consistent with how the packages are specified in the versioned rocker images (i.e. should we use devtools::install_version, or some other approach?).

Posting this issue on @eddelbuettel's recommendation after I said some things about rocker on twitter that made it apparent I was confused about how this all worked.

Thanks!

@cboettig
Member

@myoung3 Apologies for the confusion.

Non-current versions (i.e. tags < 4.1.0 at this time) are frozen, precisely as you say. It is true that since R 4.0.0 we have moved to RStudio Package Manager (RSPM) instead of MRAN, but RSPM provides version-locked snapshots of CRAN in much the same way, e.g. https://github.com/rocker-org/rocker-versioned2/blob/master/stacks/core-4.0.5.json#L15 (though with the added benefit that we get binary installs for Ubuntu this way). RSPM snapshots are taken a few times a week instead of daily, so we get the closest version that was still built with the locked version of R.

The version of RStudio follows a similar policy: it rolls forward in the current/latest tag and is frozen thereafter once that tag is no longer the latest. Meanwhile, system libraries (i.e. from apt-get) all come from the Ubuntu upstream LTS distribution (as you probably know, these are effectively frozen on release, though they receive security updates thereafter). We avoid introducing PPAs or rolling-version apt sources, which ensures a stable and predictable set of libraries and compilers (importantly, this also ensures that Rocker compilers and libs match those used by RStudio's RSPM in building the binary images -- there are actually a few special cases where RSPM uses a PPA source, and we seek to match that for compatibility).

Recommendations for users really depend on the user. The simplest thing is to rely on a frozen version: because we set the RSPM lock through the default CRAN mirror, any further use of install.packages() will install additional packages not already included in Rocker images from the same frozen snapshot. This mimics the workflow of a user who generally keeps packages up-to-date to CRAN during normal development and then wants to preserve that environment frozen in time for a particular project with minimal fuss.

However, more advanced users might prefer setting a different snapshot date, i.e. locking to the date their analysis was published, rather than locking to the time of an R release. This is easily done by re-defining the environmental variable, CRAN, to the appropriate date. Other users may prefer to use Docker with packrat or renv, which allows a user to preserve an arbitrary collection of packages, i.e. including older and newer versions of different packages that did not all coexist on CRAN at the same time.
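
As a rough illustration of the snapshot-date approach, the sketch below builds a dated repository URL and prints the R options() line that would make it the default. The URL layout (https://packagemanager.rstudio.com/cran/__linux__/<distro>/<date>), the focal default, and the helper names snapshot_url and repos_option_line are assumptions for illustration, not taken from the Rocker build scripts; check the RSPM setup page for the exact URL that matches your image.

```shell
#!/bin/sh
# Hypothetical helpers; the URL layout below is an assumption about RSPM.

# Build a snapshot URL for a given date (YYYY-MM-DD) and Ubuntu codename.
snapshot_url() {
    date="$1"
    distro="${2:-focal}"
    # Reject anything that is not shaped like an ISO date.
    case "$date" in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
        *) echo "expected YYYY-MM-DD, got: $date" >&2; return 1 ;;
    esac
    printf 'https://packagemanager.rstudio.com/cran/__linux__/%s/%s\n' "$distro" "$date"
}

# Print the R code that sets this URL as the default CRAN repository.
repos_option_line() {
    printf "options(repos = c(CRAN = '%s'))\n" "$1"
}

# Example: lock to the (illustrative) date an analysis was published.
url="$(snapshot_url 2021-07-29)" && repos_option_line "$url"
```

Placing the printed line in a project-level .Rprofile, or running it at the start of a session, should make later install.packages() calls resolve against that date, since a user profile is read after Rprofile.site.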

PRs to documentation are most welcome! We tried to describe most of this in some detail in our R-Journal pub, https://doi.org/10.32614/RJ-2017-065. Though that pre-dates the move to RSPM & Ubuntu-LTS images, we have tried to preserve the core principles. (It also pre-dates the introduction of CUDA-based images, which add a whole 'nother wrinkle because they are sensitive to kernel driver versions external to the container)!

@myoung3
Contributor Author

myoung3 commented Jul 29, 2021

Thanks @cboettig!

I've written up a draft of what could go in the documentation (see below). In writing this up, I've made a couple of discoveries:

  • CRAN is not an environment variable, but an R option set in Rprofile.site
  • In the most recent version-tagged images (currently 4.1.0), this CRAN option isn't set to a specific date that rolls forward with each push of the image. It's actually set to "latest". This means that if anyone downloads the most recent version-tagged image, packages installed using that image are not reproducible. Perhaps this is what you mean when you say "This mimics the workflow of a user who generally keeps packages up-to-date to CRAN during normal development and then wants to preserve that environment frozen in time for a particular project with minimal fuss", but it requires a user to consciously redownload 4.1.0 once it's finalized (and potentially deal with the consequences of package versions which have changed from their development environment).

Here's my draft of what could go in the documentation. Let me know what you think:

  • With respect to reproducibility, there are effectively three categories of rocker image tags: devel, the current version-tagged images, and all previous version-tagged images.
  • Rocker images are built with an R option (CRAN, set in Rprofile.site) that points to RStudio Package Manager. For devel and the current version-tagged images, this option is set to "latest", meaning the most recent versions of packages are downloaded at build time from RStudio Package Manager. By contrast, for all previous (non-current) version-tagged images, the CRAN option is set to RStudio Package Manager's archive of CRAN on a specific date.
  • This suggests that the most recent version-tagged images and the devel tag should not be used for work where package versions need to be frozen. For example, as of 7/29/2021 the most recent version-tagged rocker images are :4.1.0. The most recent version-tagged images are not stable with respect to R package versions in two senses: 1) the package versions baked into the image will continue to change on Docker Hub (since these images are periodically rebuilt) until the release of the subsequent version-tagged image (e.g. 4.1.1); 2) any packages installed during runtime of these images (current version-tag and devel) are the most recent package versions available (since the CRAN option is set to latest).
  • On the other hand, all version-tagged images previous to the current version-tagged image are stable with respect to R package versions. Therefore, as of 7/29/2021, the most recent images that are stable with respect to R packages are the 4.0.5 images.
  • A similar rolling approach is used for the RStudio Server version: the RStudio version baked into the images on Docker Hub is subject to change in the most recent version-tagged rocker images, but is stable in previous version-tagged images.
  • All rocker images are subject to periodic security updates to the operating system. As of <insert version number>, rocker images are built on an Ubuntu LTS version. The major LTS version a rocker version-tagged image was built on is frozen from the day a version-tagged image is released, but the minor LTS version is periodically updated (i.e. to include security updates).
  • Meanwhile, system libraries (i.e. from apt-get) all come from the Ubuntu upstream LTS distribution. We avoid introducing PPAs or rolling-version apt sources, which ensures a stable and predictable set of libraries and compilers (importantly, this also ensures that Rocker compilers and libs match those used by RStudio's RSPM in building the binary images -- there are actually a few special cases where RSPM uses a PPA source, and we seek to match that for compatibility).
  • If you wish to reproducibly build an image based on rocker with additional R packages, or reproducibly install additional packages during runtime, using install.packages() will suffice as long as your image is (A) version-tagged (as opposed to devel) and (B) downloaded when it was no longer the most recent version-tagged image on Docker Hub. You can verify that package installation is reproducible by typing getOption("repos"). If you see that CRAN is set to RStudio Package Manager with a specific date appended, packages will install reproducibly. If you see that CRAN is set to RStudio Package Manager with "latest" appended, then package installation in this image is not reproducible.
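
The getOption("repos") check in the last bullet can also be scripted. A minimal sketch, assuming the repository URL ends in either "latest" or a YYYY-MM-DD date (an assumption about the RSPM URL layout) and using a hypothetical helper name classify_repo:

```shell
#!/bin/sh
# Hypothetical helper: classify a CRAN repository URL by its last path segment.
# Assumes RSPM-style URLs ending in either "latest" or a YYYY-MM-DD date.
classify_repo() {
    last="${1##*/}"   # strip everything up to and including the final slash
    case "$last" in
        latest) echo "rolling" ;;
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) echo "pinned" ;;
        *) echo "unknown" ;;
    esac
}

# Example URLs (illustrative values only)
classify_repo "https://packagemanager.rstudio.com/cran/__linux__/focal/latest"      # prints: rolling
classify_repo "https://packagemanager.rstudio.com/cran/__linux__/focal/2021-05-17"  # prints: pinned
```

Inside a container, the URL could be obtained with Rscript -e 'cat(getOption("repos")[["CRAN"]])' and passed to the function; anything reported as "rolling" or "unknown" would need a closer look before relying on install.packages() for reproducibility.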

@eitsupi
Member

eitsupi commented Jul 30, 2021

@myoung3 Hi, thanks for working on this.
I read and understood the README in rocker-org/rocker-versioned about using past snapshots of CRAN, but there was no explanation about RStudio version fixing, so I read the Dockerfile and noticed the rule.
(Recently, a script I wrote automatically updates CRAN and RStudio version number. #164)

It would be nice to have an explanation in the README of this repository.

CRAN is not an environment variable, but an R option set in Rprofile.site

A while ago, some scripts would install packages from the CRAN env var, but this was removed in #195 for consistency between scripts and to support arm64 builds.
The CRAN env var, currently specified in the rocker/r-ver Dockerfiles, is written to Rprofile.site in this line and will be used as the default repository for many images.

## Add a default CRAN mirror
echo "options(repos = c(CRAN = '${CRAN}'), download.file.method = 'libcurl')" >> ${R_HOME}/etc/Rprofile.site
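
To see what this line produces without building an image, the sketch below runs it against a temporary directory standing in for ${R_HOME}. The temporary directory and the example CRAN value are assumptions for illustration; in the real build both variables come from the Dockerfile.

```shell
#!/bin/sh
# Stand-in for the R installation directory (assumption: a throwaway temp dir).
R_HOME="$(mktemp -d)"
mkdir -p "${R_HOME}/etc"

# Example value only; the Dockerfile sets the real one.
CRAN="https://packagemanager.rstudio.com/cran/__linux__/focal/2021-05-17"

## Add a default CRAN mirror (same line as in the build script)
echo "options(repos = c(CRAN = '${CRAN}'), download.file.method = 'libcurl')" >> "${R_HOME}/etc/Rprofile.site"

# Show the R code that would be evaluated at R startup.
cat "${R_HOME}/etc/Rprofile.site"
```

The shell expands ${CRAN} at the moment the line is written, so R itself only ever sees the resulting repos option, not the environment variable.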

@myoung3
Contributor Author

myoung3 commented Jul 30, 2021

Right, so I guess strictly speaking CRAN is an environment variable, just not one that's used directly by R. I can clarify this in the text. I should also make it clearer that the name of the option is "repos", which is a named vector containing an element named "CRAN".

@eddelbuettel
Member

Yes, it is simply "overloaded", which may look confusing at first. The repos field in options() has a named entry CRAN as per base R; our Dockerfile also has an environment variable with the same name, as it aims at the same goal: controlling where packages come from, which in this case also defines a point in time.

@eitsupi
Member

eitsupi commented Apr 2, 2022

@myoung3 The README and wiki have been updated significantly, and while there is room to add more detail on how to make changes to CRAN, etc., I think the basic instructions are in place.

@eitsupi eitsupi added enhancement New feature or request documentation Improvements or additions to documentation labels Apr 2, 2022
@eitsupi
Member

eitsupi commented Jun 4, 2022

I think the wiki page (https://github.com/rocker-org/rocker-versioned2/wiki/Versions) solved this issue.
So I'm closing this now.

@eitsupi eitsupi closed this as completed Jun 4, 2022