ERC specification

An Executable Research Compendium (ERC) is a packaging convention for computational research. It provides a well-defined structure for data, code, text, documentation, and user interface controls for a piece of research and is suitable for long-term archival. As such it can also be perceived as a digital object or asset.

This is a draft specification. If you have comments or suggestions please file them in the issue tracker. If you have explicit changes please fork the git repo and submit a pull request.

Version

The version of this specification is 1. This version is under development!

Table of contents

Notational conventions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" are to be interpreted as described in RFC 2119.

The key words "unspecified", "undefined", and "implementation-defined" are to be interpreted as described in the rationale for the C99 standard.

Purpose

This specification defines a structure to carry and execute a computational scientific analysis (cf. computational science).

These analyses typically reside in a workspace on a researcher's computer, which comprises data, code, third-party software or libraries, and outputs such as plots. Code and libraries are required in executable form to re-do an analysis. Research is only put into context by a textual publication, which is published in scholarly communication. The text comes in two forms: one that is machine readable, and another that is suitable for being read by humans. The latter is often derived, or "rendered", from the former and can be static, visual, or even interactive.

Putting all of these elements in a self-contained bundle allows understanding, reproducing, transferring, archiving, and validating computational research. The ERC specification defines metadata and file structures to support these actions.

Fundamental design concepts

The ERC specification is inspired by two approaches to improve development and operation of software. First, "convention over configuration", e.g. as practiced in the Java build tool Maven. Second, "DevOps", see Wikipedia or Boettiger.

Another core goal is simplicity. This specification should not re-do something which already exists (if it is an open specification or tool). It must be possible to create a valid and working ERC manually.

The final important notion is that of nested containers. We acknowledge well-defined standards for packaging a set of files, and different approaches to create an executable code package. Therefore an ERC comprises one or more containers but is itself subject to being put into a container. We distinguish between the inner or "runtime" container and the outer container, which is used for transfer of complete ERCs and for validation that is not content-aware.

In addition, this specification may be extended or limited further by so-called extensions. Extensions MAY add any additional structure to an ERC or change defaults. But they MUST NOT interfere with this specification, e.g. by changing the meaning of a configuration field.

How to use an ERC

The steps to (re-)run the analysis contained in an ERC are as follows:

This way, ERCs allow computational reproducibility based on the original code and data.

ERC structure

Base directory

An ERC MUST have a base directory. All paths within this document are relative to this base directory.

The base directory MUST contain an ERC configuration file.

Besides the files mentioned in this specification, the base directory MAY contain any other files and directories.

Main document & display file

An ERC MUST have a main document, i.e. the file which contains the text and instructions being the basis for the scientific publication describing the packaged analysis.

The main document's name SHOULD be main with an appropriate file extension and media type. For example, if the main document is R Markdown, then the file extension should be .Rmd and the media type text/markdown.

An ERC MUST have a display file, i.e. the file which is shown to users first when they open an ERC in a supporting platform or tool.

The display file's name SHOULD be view with an appropriate file extension and media type. For example, if the display file is Hypertext Markup Language (HTML), then the file extension should be .htm or .html and the media type text/html.

Typically, the _display file_ is "rendered" from the main file, which follows the [literate programming paradigm](https://en.wikipedia.org/wiki/Literate_programming).

Nested runtime

The embedding of a representation of the original runtime environment, in which an analysis was conducted, is crucial for supporting reproducible computations. This section defines two such representations. First, an executable image. Second, a manifest documenting the image's contents.

The format of these representations is undefined here and can be stated more precisely in an extension to this specification.

A concrete runtime extension may choose to (a) embed the runtime environment in the image, or (b) to rely on constructing the runtime environment from the manifest.

Runtime environment or image

The base directory SHOULD contain a runnable image, e.g. a "binary", of the original analysis environment that can be used to re-run the packaged analysis using suitable software.

The image file MAY be compressed. It SHOULD be named image with an appropriate file extension, such as .tar, .tar.gz or .bin, and have an appropriate media type, e.g. application/vnd.oci.image.layer.tar+gzip.

The name of the image file MUST be given in the ERC configuration file under the node image under the root-level node execution.

The output of the image execution can be shown to the user to convey detailed information on progress or errors.

Runtime manifest

The base directory MUST contain a complete, self-consistent manifest of the runtime image's contents.

This manifest MUST be in a machine-readable format that allows a respective tool to create the runtime image.

The name of the manifest file MUST be given in the ERC configuration file under the node manifest under the root-level node execution.
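
The requirements on the image and manifest nodes can be sketched as a small check. This is a minimal illustration, assuming the configuration has already been parsed into a Python dictionary and that execution is a mapping; the function name `check_runtime_files` is hypothetical. Reporting a missing image as a warning rather than an error is one reading of the SHOULD above, not a rule of this specification.

```python
from pathlib import Path


def check_runtime_files(config, base_dir):
    """Return (errors, warnings) for the image and manifest nodes under `execution`."""
    execution = config.get("execution") or {}
    base = Path(base_dir)
    errors, warnings = [], []

    # The manifest is mandatory: its name must be configured and the file must exist.
    manifest = execution.get("manifest")
    if not manifest:
        errors.append("execution.manifest is not set")
    elif not (base / manifest).is_file():
        errors.append(f"manifest file not found: {manifest}")

    # The image is only recommended, so problems are reported as warnings (assumption).
    image = execution.get("image")
    if not image:
        warnings.append("execution.image is not set")
    elif not (base / image).is_file():
        warnings.append(f"image file not found: {image}")

    return errors, warnings
```

A concrete runtime extension would likely add format-specific checks on top of this.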

ERC configuration file

The ERC configuration file is the reproducibility manifest for an ERC. It defines the main entry points for actions performed on an ERC and core metadata elements.

Name, format, and encoding

The filename MUST be erc.yml and it MUST be located in the base directory. The contents MUST be valid YAML 1.2. The file MUST be encoded in UTF-8 and MUST NOT contain a byte-order mark (BOM).
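
As a minimal sketch of the rules above, the following Python function checks that erc.yml exists in the base directory, decodes as UTF-8, and does not start with a BOM. The function name `check_config_encoding` is hypothetical and not part of this specification. The sketch does not verify that the contents are valid YAML 1.2, which would require a YAML parser.

```python
from pathlib import Path

UTF8_BOM = b"\xef\xbb\xbf"


def check_config_encoding(base_dir):
    """Return a list of problems found with <base_dir>/erc.yml (empty if none)."""
    config_path = Path(base_dir) / "erc.yml"
    if not config_path.is_file():
        return ["erc.yml is missing from the base directory"]

    raw = config_path.read_bytes()
    problems = []
    if raw.startswith(UTF8_BOM):
        problems.append("erc.yml starts with a byte-order mark")
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as error:
        problems.append(f"erc.yml is not valid UTF-8: {error}")
    return problems
```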

Basic fields

The first document content of this file MUST contain the following string nodes at the root level.

Example:

id: b9b0099e-9f8d-4a33-8acf-cb0c062efaec
spec_version: 1

The main and display file can be defined in root-level nodes named main and display respectively:

id: b9b0099e-9f8d-4a33-8acf-cb0c062efaec
spec_version: 1
main: the_paper_document.rmd
display: view.html
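
A tool could locate the main document and display file roughly as follows. This is a sketch under assumptions: the configuration is already parsed into a dictionary, and falling back to the conventional names main and view when a node is absent is an illustrative choice in the spirit of "convention over configuration", not a rule stated here. The name `resolve_entry_file` is hypothetical.

```python
from pathlib import Path


def resolve_entry_file(config, base_dir, node, default_stem):
    """Return the file named by `node`, else the first `<default_stem>.*` file, else None."""
    base = Path(base_dir)
    if node in config:
        # An explicitly configured file name wins, but the file must exist.
        candidate = base / str(config[node])
        return candidate if candidate.is_file() else None
    # Assumed fallback: look for the conventional file name with any extension.
    matches = sorted(p for p in base.glob(default_stem + ".*") if p.is_file())
    return matches[0] if matches else None
```

For the example above, `resolve_entry_file(config, base_dir, "main", "main")` would return the path to the_paper_document.rmd if that file exists.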

Control statements

The configuration file MUST contain statements to control the runtime container.

These statements MUST be in an array under the root-level node execution in the ERC configuration file in the order in which they must be executed.

Implementations SHOULD support a list of bash commands as control statements. These commands are given as a list under the node cmd under the root-level node execution. If extensions use non-bash commands, they MUST define their own nodes under the execution node and SHOULD define defaults.

The execution statements MAY ensure that the re-computation is independent of the environment, which may differ depending on the host of the execution environment. For example, the time zone could be fixed via an environment variable TZ=CET, so that differences in the output formatting of timestamps do not break checking. This is in addition to ERC authors handling such parameters at a script level.

Example control statements:

id: b9b0099e-9f8d-4a33-8acf-cb0c062efaec
spec_version: 1
execution:
  cmd:
    - "./prepare.sh --input my_data"
    - "./execute.sh --output results --iterations 3"
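
How an implementation might process such a list can be sketched in Python. This is an illustration only: it runs the commands directly on the host with bash, whereas a real implementation would run them in the runtime container. The function name `run_control_statements` and the `extra_env` parameter are hypothetical, and stopping at the first failing command is an assumption.

```python
import os
import subprocess


def run_control_statements(config, base_dir, extra_env=None):
    """Run each command under execution.cmd in order; stop at the first failure."""
    commands = config.get("execution", {}).get("cmd", [])
    if isinstance(commands, str):
        commands = [commands]  # tolerate a single command given as a string

    # Fix environment parameters such as TZ on top of the host environment.
    env = dict(os.environ, **(extra_env or {}))

    results = []
    for command in commands:
        completed = subprocess.run(
            ["bash", "-c", command],
            cwd=base_dir,
            env=env,
            capture_output=True,
            text=True,
        )
        results.append((command, completed.returncode, completed.stdout))
        if completed.returncode != 0:
            break  # later statements depend on earlier ones (assumption)
    return results
```

Calling `run_control_statements(config, base_dir, {"TZ": "CET"})` would fix the time zone for all commands, as suggested above.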

License metadata

The file erc.yml MUST contain a root-level node licenses with licensing information for the code, data, and text contained. Each of these three has distinct requirements, hence different licenses need to be applied.

The node licenses MUST have three children: code, data, text.

There is currently no mechanism to define the licenses of the used libraries, as manual creation would be tedious. Tools for automatic creation of ERCs may add such detailed licensing information and define an extension to the ERC.

The content of each of these child nodes MUST be one of the following

Example for global licenses:

id: b9b0099e-9f8d-4a33-8acf-cb0c062efaec
spec_version: 1
licenses:
  code: Apache-2.0
  data: ODbL-1.0
  text: CC0-1.0

Example using specific licenses for files:

---
id: b9b0099e-9f8d-4a33-8acf-cb0c062efaec
spec_version: 1
licenses:
  code:
    others_lib.bin: MIT
    my_code.c: GPL-3.0
  data: 
    facts.csv: ODbL-1.0
  text:
    README.md: CC0-1.0
    paper/chapter01.doc: CC-BY-4.0
    paper/chapter02.tex: CC-BY-4.0

It is not possible to assign one license to a directory and override that assignment for a single file within that directory, nor is it possible to use globs or regular expressions.
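
The structure of the licenses node can be checked with a short sketch. It assumes, based only on the two examples above, that each child is either a single license string or a mapping from file paths to license strings; the function name `check_licenses` is hypothetical. It rejects paths containing glob characters, but it cannot detect regular expressions in general and does not verify that the license identifiers are known.

```python
GLOB_CHARACTERS = set("*?[]")
REQUIRED_CHILDREN = ("code", "data", "text")


def check_licenses(config):
    """Return a list of problems with the `licenses` node (empty if none)."""
    licenses = config.get("licenses")
    if not isinstance(licenses, dict):
        return ["root-level node 'licenses' is missing or not a mapping"]

    problems = []
    for child in REQUIRED_CHILDREN:
        value = licenses.get(child)
        if isinstance(value, str) and value.strip():
            continue  # one license for all files of this kind
        if isinstance(value, dict) and value:
            for path, license_id in value.items():
                if GLOB_CHARACTERS & set(str(path)):
                    problems.append(f"licenses.{child}: pattern not allowed: {path}")
                if not isinstance(license_id, str) or not license_id.strip():
                    problems.append(f"licenses.{child}: no license given for {path}")
            continue
        problems.append(f"licenses.{child} is missing or empty")
    return problems
```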

Comprehensive example of erc.yml

The following example shows all possible fields of the core specification with example values.

id: b9b0099e-9f8d-4a33-8acf-cb0c062efaec
spec_version: 1
main: the_paper_document.rmd
display: view.html
execution:
  cmd:
    - "Rscript -e 'rmarkdown::render(input = \"paper.Rmd\", output_format = \"html\")'"
licenses: # licenses that the author chooses for their files
  code:
    others_lib.bin: MIT
    my_code.c: GPL-3.0
  data:
    facts.csv: ODbL-1.0
  text:
    README.md: CC0-1.0
    paper/chapter01.doc: CC-BY-4.0
    paper/chapter02.tex: CC-BY-4.0

The path to the ERC configuration file subsequently MUST be <path-to-bag>/data/erc.yml.

Content metadata

Metadata elements under development

The following JSON template illustrates the current properties. It SHOULD be filled out as completely as possible.

{
    "author": [{
        "name": null,
        "affiliation": [],
        "orcid": null
    }],
    "community": "o2r",
    "depends": [{
        "identifier": null,
        "version": null,
        "packageSystem": null
    }],
    "description": null,
    "ercIdentifier": null,
    "file": {
        "filename": null,
        "filepath": null,
        "mimetype": null
    },
    "generatedBy": null,
    "interaction": {
        "interactive": false,
        "ui_binding": {
            "purpose": null,
            "widget": null,
            "code": {
                "filename": null,
                "function": null,
                "variable": null,
                "shinyInputFunction": null,
                "shinyRenderFunction": null,
                "functionParameter": {
                    "name": null,
                    "label": null,
                    "min": null,
                    "max": null,
                    "value": null,
                    "step": null
                }
            }
        }
    },
    "keywords": [],
    "license": null,
    "paperLanguage": [],
    "paperSource": null,
    "publicationDate": null,
    "recordDateCreated": null,
    "softwarePaperCitation": null,
    "spatial": {
        "files": [],
        "union": []
    },
    "temporal": {
        "begin": null,
        "end": null
    },
    "title": null
}

The path to the o2r metadata file MUST be <path-to-bag>/data/metadata.json.
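
To see how completely a metadata file has been filled out, a tool could list the properties that are still empty. The following sketch treats null values and empty arrays as unfilled, which is an assumption for illustration; the function names are hypothetical.

```python
import json


def load_metadata(path):
    """Read a metadata.json file into Python data structures."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def unfilled_properties(value, prefix=""):
    """Yield dotted paths of properties that are null or empty arrays."""
    if value is None or value == []:
        yield prefix
    elif isinstance(value, dict):
        for key, child in value.items():
            yield from unfilled_properties(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from unfilled_properties(child, f"{prefix}[{index}]")
```

Applied to the unmodified template, `list(unfilled_properties(load_metadata(path)))` would name nearly every property, for example `author[0].name` and `temporal.begin`.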

Description of metadata properties

Defining explanations on the concept of each metadata element in use.

ERC checking

Procedure

A core feature of ERCs is to compare the output of an ERC execution with the original outputs. Therefore checking an ERC always comprises two core steps: the execution and the comparison.

The method of the comparison is unspecified. The files included in the comparison are the comparison set. An implementation MUST communicate the comparison set to the user as part of a check.
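
Since the method is left open, the following is only one possible sketch of the comparison step: it compares md5 hashes of the files in the comparison set between the original outputs and a re-run, and returns the comparison set together with the result so that it can be communicated to the user. The function names and the shape of the returned report are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of(path):
    """Return the md5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_outputs(original_dir, rerun_dir, comparison_set):
    """Compare the files in comparison_set (relative paths) between two directories."""
    differences = []
    for relative in sorted(comparison_set):
        original = Path(original_dir) / relative
        rerun = Path(rerun_dir) / relative
        if not original.is_file() or not rerun.is_file():
            differences.append((relative, "missing"))
        elif md5_of(original) != md5_of(rerun):
            differences.append((relative, "changed"))
    return {
        "comparison_set": sorted(comparison_set),  # reported as part of the check
        "differences": differences,
        "passed": not differences,
    }
```

Hash comparison is strict: any byte-level difference, such as an embedded timestamp, counts as a change.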

Comparison set file

The ERC MAY contain a file named .ercignore in the base directory to define the comparison set.

Its purpose is to provide a way to efficiently exclude files and directories from checking. If this file is present, any files and directories within the outer container which match the patterns within the file .ercignore will be excluded from the checking process. The check MUST NOT fail when files listed in .ercignore are failing comparison.

The file MUST be UTF-8 (without BOM) encoded. The newline-separated patterns in the file MUST be Unix shell globs. For the purposes of matching, the root of the context is the ERC's base directory.

Lines starting with # are treated as comments and MUST be ignored by implementations.

Example .ercignore file:

# comment
.erc
*/temp*
data-old/*

If using md5 file hashes for comparison, the set could include plain text files, for example the `text/*` [media types](https://en.wikipedia.org/wiki/Media_type) (see [IANA's full list of media types](https://www.iana.org/assignments/media-types/media-types.xhtml)). Of course, the comparison set should include files which contain results of an analysis.
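
Deriving the comparison set from an .ercignore file can be sketched as follows. The function names are hypothetical. Two simplifications are assumptions of this sketch: matching uses Python's `fnmatch`, whose `*` also matches the path separator, unlike a typical shell glob, and a pattern that matches a directory is taken to exclude everything below that directory.

```python
import fnmatch
from pathlib import Path


def read_ignore_patterns(base_dir):
    """Return the patterns from <base_dir>/.ercignore, skipping comments and blank lines."""
    ignore_file = Path(base_dir) / ".ercignore"
    if not ignore_file.is_file():
        return []
    patterns = []
    for line in ignore_file.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns


def is_ignored(relative, patterns):
    """Check a relative POSIX path and each of its parent directories against the patterns."""
    parts = relative.split("/")
    candidates = ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]
    return any(fnmatch.fnmatchcase(c, p) for c in candidates for p in patterns)


def comparison_set(base_dir, patterns):
    """Return the relative paths of all files below base_dir that are not ignored."""
    base = Path(base_dir)
    kept = []
    for path in sorted(base.rglob("*")):
        if path.is_file():
            relative = path.relative_to(base).as_posix()
            if not is_ignored(relative, patterns):
                kept.append(relative)
    return kept
```

With the example file above, a file such as data-old/a.csv would be left out of the comparison set, while view.html would stay in it.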

Security considerations

Why are ERCs not a security risk?