Metadata workflows

This document describes the internal processes that handle metadata for ERC. For information on the ERC metadata model as part of the o2r web API, please see the API specification.

The remainder of this document describes who handles metadata when and how within the o2r architecture.

Files vs. database

In all workflows, metadata files are created inside the ERC in a dedicated subdirectory, .erc, which holds different kinds, formats, or versions of metadata. For ease of access via the web API, the same information is also stored in the database.

The files in the compendium are always the normative source of information.
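
Because the files are normative, mirroring only ever goes one way: from file to database. The following is a minimal sketch of that direction, not the actual service code. The function name `mirror_to_database` and the plain `dict` standing in for the database are assumptions made for illustration; only the file name .erc/metadata_raw.json and the field path `<compendium>.metadata.raw` come from this document.

```python
import json
from pathlib import Path


def mirror_to_database(erc_dir: Path, database: dict, compendium_id: str) -> None:
    """Overwrite the database copy of the raw metadata with the file content.

    `database` is a plain dict used as a stand-in for the real database
    (an assumption for this sketch).
    """
    raw_file = erc_dir / ".erc" / "metadata_raw.json"
    raw = json.loads(raw_file.read_text(encoding="utf-8"))
    # The file wins: whatever was stored before is replaced, never merged.
    database.setdefault(compendium_id, {}).setdefault("metadata", {})["raw"] = raw
```

Replacing rather than merging means a stale database value can never survive a mirror run.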

The term brokering means the translation from schema-less to schema-specific metadata, as well as inter-schema mappings. The brokering output is then stored in respective files and mirrored to the database by the reproducibility service.
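
To make the idea of brokering concrete, here is a small sketch of a mapping-driven translation. It is an illustration under assumptions, not how metabroker.py is implemented: the mapping layout (`EXAMPLE_MAP`), the field names, and the helpers `lookup` and `broker` are all hypothetical.

```python
from typing import Any

# Hypothetical mapping: key in the target schema -> dotted path in the source
# metadata. The real mapping files may be structured differently.
EXAMPLE_MAP = {
    "title": "title",
    "creators": "author",
    "publication_date": "publication.date",
}


def lookup(source: dict[str, Any], dotted_path: str) -> Any:
    """Follow a dotted path through nested dicts; return None if a part is missing."""
    current: Any = source
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def broker(source: dict[str, Any], mapping: dict[str, str]) -> dict[str, Any]:
    """Translate source metadata into the target layout.

    Source fields without a mapping entry are dropped, and mapping entries
    whose source value is missing are skipped.
    """
    result: dict[str, Any] = {}
    for target_key, source_path in mapping.items():
        value = lookup(source, source_path)
        if value is not None:
            result[target_key] = value
    return result
```

Keeping the mapping as data rather than code would allow one translation routine to serve several target schemas, each with its own mapping file.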

Metadata extraction and brokering during creation

muncher is the main CRUD component for compendia. It controls the creation workflow.

The creation from the metadata perspective is as follows:

  1. init stores the files for a new ERC in a directory.
  2. extract uses metaextract.py (docs) to analyse the incoming ERC and creates new files with raw metadata for each of the scanned files. Currently, the following file types are considered: .r, .rmd, netcdf, "bagit.txt". Future releases of the extractor are likely to also consider .tex, .json (geojson), .jp2, .tiff and more. The raw metadata is schema-less and non-semantic. The processed files compete to be the best representative of the working directory's meta information, i.e. there is only one main output, ideally the most complete set of metadata. By default, the competing candidates are also preserved in .erc/metadata_raw_<filename>.json, where filename is an identifier based on the original source file.
     - output file: .erc/metadata_raw.json
     - database field: <compendium>.metadata.raw
  3. broker uses metabroker.py (docs) to translate the raw metadata in json to o2r metadata in json that is compliant with the o2r json-schema.
     - output file: .erc/metadata_o2r_X.json (where X is the version number as set in the o2r-map.json mapping file, e.g. 1)
     - database field: <compendium>.metadata.o2r
  4. (harvest TBD; will connect to a third-party database endpoint via OAI-PMH to gather additional information for enriching the o2r metadata collected via extraction)
  5. save stores the new ERC in the database, including the aforementioned metadata fields.
  6. user check provides an interactive form to the uploading user to review and edit the suggested metadata. Suggestions are based on the o2r metadata. The check workflow is handled in the web client project.
  7. update updates the metadata in both database and file with the user's edits. This step creates valid o2r metadata. The metadata update includes all brokering to the configured metadata formats, meaning the brokered metadata is always up to date and based on the same source, the o2r metadata.
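
The competition between candidate files in the extract step can be sketched as follows. This is a rough illustration, not the logic of metaextract.py: the scoring rule (count of non-empty fields), the function names, and the example source identifiers are assumptions. Only the output file names follow the pattern described above.

```python
import json
from pathlib import Path


def completeness(metadata: dict) -> int:
    """Count fields that carry a non-empty value (hypothetical scoring rule)."""
    return sum(1 for value in metadata.values() if value not in (None, "", [], {}))


def write_raw_metadata(erc_dir: Path, candidates: dict[str, dict]) -> str:
    """Write one file per candidate plus the main output.

    `candidates` maps a source identifier to the raw metadata extracted from
    that file. Returns the identifier of the winning candidate. Raises
    ValueError if `candidates` is empty.
    """
    meta_dir = erc_dir / ".erc"
    meta_dir.mkdir(parents=True, exist_ok=True)

    # Preserve every competing candidate next to the main output.
    for source_name, metadata in candidates.items():
        path = meta_dir / f"metadata_raw_{source_name}.json"
        path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")

    # The most complete candidate becomes the single main output.
    best = max(candidates, key=lambda name: completeness(candidates[name]))
    main_file = meta_dir / "metadata_raw.json"
    main_file.write_text(json.dumps(candidates[best], indent=2), encoding="utf-8")
    return best
```

For example, calling `write_raw_metadata(Path("my_erc"), {"main_rmd": {...}, "bagit_txt": {...}})` would leave three files in `my_erc/.erc`: one per candidate and the main metadata_raw.json.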

By design, there is no metadata brokering during shipments or job executions. Because not all information can be brokered automatically, the metadata required by shipping destinations is mandatory in the o2r metadata model. This limits user involvement to a single point, the metadata update. In the same vein, all validation takes place during metadata updates, because that is the only time a user can react to validation errors.
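
A minimal sketch of "validate at update time" is shown below. It assumes a simple required-fields check in place of full validation against the o2r json-schema, and the field names in `REQUIRED_FIELDS` as well as the function names are hypothetical. The point is the control flow: errors are returned to the caller while the user is still present, and nothing is written unless the metadata passes.

```python
import json
from pathlib import Path

# Hypothetical required fields; the actual o2r schema defines its own
# mandatory elements.
REQUIRED_FIELDS = ("title", "creators", "license")


def validate(metadata: dict) -> list[str]:
    """Return one message per required field that is missing or empty."""
    return [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if not metadata.get(name)
    ]


def update_metadata(metadata_file: Path, edited: dict) -> list[str]:
    """Write the user's edits only if they validate; return the errors otherwise."""
    errors = validate(edited)
    if not errors:
        metadata_file.parent.mkdir(parents=True, exist_ok=True)
        metadata_file.write_text(json.dumps(edited, indent=2), encoding="utf-8")
    return errors
```

An empty return value means the file was written; a non-empty list can be shown to the user in the edit form.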

Metadata for shipments

The shipper uses the metadata stored in the ERC directory .erc to start a shipment of data or metadata to third-party repositories. It does not do any updating, brokering, or validation.

Metadata mappings

[//]: # (datacite DataCite Metadata Schema 4.1 xml for metadata export)
[//]: # (datacite DataCite Metadata Schema 3.1 xml (still in wide spread use for OAI-PMH))
[//]: # (ORCID (TBD) XML for orcid-works xml for adding ERC as works to an ORCID profile)
[//]: # (CRIS (TBD) (local adaptation of the CERIF model xml ...)

| destination | model | format(s) | description |
| --- | --- | --- | --- |
| b2share | using o2r schema for the o2r community depositions on b2share | json | ... |
| codemeta | codemeta 2.0-rc | json ld | ... |
| zenodo | Deposition metadata | json | for storing full ERC in the Zenodo data repository; Zenodo also publishes metadata on DataCite |