CRAN Task View: Computational Environments and Reproducibility

Maintainer: Daniel Nüst
Contact: daniel.nuest at uni-muenster.de
Version: 2019-01-11

This Task View contains information about controlling and documenting computational environments in R. The base version of R does not provide features to easily manage different versions of R or collections of packages, so a number of approaches and packages exist to simplify computational environments for the sake of development, testing, bug-fixing, and reproducibility. The ReproducibleResearch Task View provides further discussion of packages around scientific reproducibility.

If you have any comments or suggestions for additions or improvements for this Task View, go to GitHub and submit an issue, or make some changes and submit a pull request. If you can’t contribute on GitHub, send Daniel an email. If you have an issue with one of the packages discussed below, please contact the maintainer of that package.

Contributors: @nuest (https://github.com/nuest/), @jdblischak (https://github.com/jdblischak/)

Virtual Machines and Containers

Virtual machines (VMs) are a straightforward way to encapsulate a runtime environment together with the actual data and code.

“In computing, a virtual machine (VM) is an emulation of a computer system. Virtual machines are based on computer architectures and provide functionality of a physical computer.” (Source: https://en.wikipedia.org/wiki/Virtual_machine)

For many users, their advantage is the graphical user interface they provide. VirtualBox is a Free and Open Source (FOSS) virtualization product that you can install on most operating systems (OS). The OS it is installed on is then known as the “host” OS, and VirtualBox supports a number of “guest” operating systems, many of which can install and run R just as on a non-virtual OS. VMs must be booted like a regular OS and have a virtual disk, which you can archive or share with collaborators, although its file size can be large. VMs can share directories, network, and other devices with their host.

Containers are an effective way to apply virtual environments at the system level. Compared to virtual machines, their most important advantages for controlled computational environments are performance and transparency. Containers share the host’s operating system kernel, can “boot” within milliseconds, and add negligible computational overhead. They can be created with the help of scripts or “recipes”, which are simple text files. These recipes can be included in code repositories and easily shared online.

The most widespread container solution is Docker. It is available for recent and common operating systems. Dockerfiles are the recipes from which Docker images are built; running an image creates a Docker container. Using Docker requires some proficiency with a command line interface (CLI). The Docker Hub is an image repository with a large number of pre-built images for different use cases.
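The path from recipe to image to container can be sketched on the command line as follows. This is a minimal illustration under stated assumptions, not a recommended setup: the image name `my-analysis`, the script `analysis.R`, and the base image tag `r-base:4.3.2` are all hypothetical choices. The Docker commands are wrapped in a function so that nothing is built until it is called on a machine with a running Docker daemon.

```shell
#!/bin/sh
# Sketch of the Dockerfile -> image -> container workflow.
# All names (my-analysis, analysis.R, the base image tag) are assumptions.

workdir="$(mktemp -d)"       # scratch directory standing in for a project
image_name="my-analysis"     # hypothetical image name

# A tiny R script that the container will run.
cat > "$workdir/analysis.R" <<'EOF'
print(sessionInfo())
EOF

# The Dockerfile is the plain-text recipe.
cat > "$workdir/Dockerfile" <<'EOF'
FROM r-base:4.3.2
COPY analysis.R /home/analysis.R
CMD ["Rscript", "/home/analysis.R"]
EOF

# Build the image from the recipe, then start a container from the image.
# Not called here, because it needs a running Docker daemon.
build_and_run() {
  docker build -t "$image_name" "$workdir" &&
    docker run --rm "$image_name"
}

echo "Recipe written to $workdir/Dockerfile"
```

Calling `build_and_run` afterwards would build the image and run the script in a fresh container, which is removed again on exit because of `--rm`. Since the recipe is a small text file, it can be committed next to the analysis code.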

Rocker

The Rocker project provides a number of Docker images for R, including the official r-base image. All Rocker images are available on the Docker Hub. Rocker images are a stable and widespread tool for running R in local and cloud environments and have established useful best practices around containers with R.

Bioconductor provides a collection of images based on rocker/rstudio.

Since images can extend existing ones, using a suitable Rocker image as a base for your own computations is a very good approach to control your computational environment. The simplest way is to run the rocker/rstudio container and work with the RStudio IDE in your web browser. Alternatively, you can develop your analysis on your computer and “package” it in a container only when preparing for a software release or scientific publication.
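As a rough illustration of the simplest route, the following sketch wraps the command for starting RStudio Server from the rocker/rstudio image. The password placeholder, the shared host directory, and the mount point `/home/rstudio/project` are assumptions chosen for illustration. The image serves RStudio on port 8787, and the default user name is `rstudio`.

```shell
#!/bin/sh
# Sketch: start RStudio Server from the rocker/rstudio image.
# The password, shared directory, and mount point are placeholders.

port=8787                 # host port mapped to RStudio Server's port 8787
password="change-me"      # placeholder; choose your own
project_dir="$(pwd)"      # host directory to make visible inside the container

# Not called here, because it needs a running Docker daemon.
start_rstudio() {
  docker run --rm \
    -p "$port:8787" \
    -e PASSWORD="$password" \
    -v "$project_dir:/home/rstudio/project" \
    rocker/rstudio
}

echo "After calling start_rstudio, open http://localhost:$port and log in as user 'rstudio'."
```

Mounting the project directory keeps the analysis files on the host, so they survive after the container is stopped and removed.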

Other R distributions and operating systems

Docker images for R distributions other than the “regular” one, and for operating systems other than the Debian-based Rocker images, are available on Docker Hub, though none match Rocker’s level of maturity and features.

Tools for working with containers

Docker

Deployment

Another way to share a well-defined computational environment is to set up R on a server.

Interactive development environments

Apps and APIs

Package management

Structure, templates and workflows

A good project structure is essential for being certain about the computational environment actually used, which includes locally defined functions and data, not just the packages used or the R version.

Tracking and provenance

A computational environment evolves as an analysis is developed. These packages help to observe these changes, complementing code versioning systems, which are always recommended.

CRAN packages:

Related links: