The following o2r thesis topics are open, ongoing, or have been completed in the context of o2r.
Please contact o2r.projekt@uni-muenster.de
if you are a student (ifgi BSc, ifgi MSc, MSc GeoTech, computer science @ WWU, …) interested in one of the open topics or in helping to solve other important challenges of scholarly publishing and scholarly communication.
Open reproducible research (ORR) is the practice of publishing the source code and the datasets that are needed to produce the computational results reported in a paper. Since geoscientific articles often include geostatistical analyses and spatiotemporal data, reproducibility should be a cornerstone of the computational geosciences but is rarely realized. In addition, the current way of publishing scientific outcomes, i.e. as static PDFs, does not adequately report on computational aspects. Thus, readers cannot fully understand how the authors came to the conclusions and how robust these are to changes in the analysis. Consequently, it is difficult for reviewers to follow the analysis steps, and for other researchers to reuse existing materials. This dissertation has two overarching goals to tackle these issues: First, it aims at assisting authors in adhering to ORR principles to ensure high scientific standards. The second goal is to reap the benefits that come with papers supplemented by code and data. To achieve these objectives, this thesis reports on five complementary studies to collect and analyze qualitative (i.e. through interviews and a focus group) and quantitative data (i.e. through surveys and reproducibility studies). Based on these studies, this work provides four key contributions: First, it identifies obstacles that prevented geoscientists from publishing ORR. To overcome these barriers, this dissertation suggests concrete and directly applicable strategies. One of these strategies is the executable research compendium (ERC) which encapsulates the paper, code, data, and the entire software environment needed to produce the computational results. Based on that, this work presents the design and implementation of an ERC-based workflow. 
It allows authors to convey their computational methods and results by also providing interactive access to code and data, and readers to deeply investigate the computational analysis while reading the actual article, e.g. by changing the parameters of the analysis. Finally, this work describes the concept of a binding, which connects those code lines and data subsets that produce a specific result, e.g. a figure or number. By also considering user interface widgets (e.g. a slider), this approach allows readers to interactively manipulate the parameters of the analysis to see how these changes affect the result. To conclude, the contributions include (i) a set of obstacles which prevent geoscientists from publishing ORR, (ii) concepts and tools to overcome the identified barriers, (iii) incentives and opportunities that come with attached code and data, and (iv) solutions to realize the incentives, which eventually result in a greater amount of open and reproducible research.
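To make the idea of a binding more tangible, the following Python sketch models one as a small data structure. It is only an illustration under assumptions: the class names `Binding` and `Parameter`, their fields, and the file names are hypothetical and are not taken from the dissertation or the o2r software.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Parameter:
    """A value readers may change, shown e.g. as a slider (hypothetical model)."""
    name: str
    default: float
    minimum: float
    maximum: float
    widget: str = "slider"

    def validate(self, value: float) -> float:
        # Reject values outside the range the widget would offer.
        if not self.minimum <= value <= self.maximum:
            raise ValueError(
                f"{self.name} must be in [{self.minimum}, {self.maximum}]"
            )
        return value


@dataclass
class Binding:
    """Links one result to the code lines and data subset that produce it."""
    result: str                          # e.g. "Figure 1"
    code_file: str                       # script or notebook file
    code_lines: List[Tuple[int, int]]    # inclusive, 1-based line ranges
    data_file: str
    data_columns: List[str]              # subset of the dataset that is used
    parameters: List[Parameter] = field(default_factory=list)

    def extract_code(self, source: str) -> str:
        """Return only the source lines that belong to this result."""
        lines = source.splitlines()
        selected: List[str] = []
        for start, end in self.code_lines:
            selected.extend(lines[start - 1:end])
        return "\n".join(selected)


source = "x <- 1\ny <- 2\nz <- x + y\nplot(z)"
binding = Binding(
    result="Figure 1",
    code_file="analysis.R",
    code_lines=[(1, 1), (3, 4)],
    data_file="data.csv",
    data_columns=["value"],
    parameters=[Parameter("threshold", 0.5, 0.0, 1.0)],
)
print(binding.extract_code(source))
```

Storing line ranges instead of copies of the code keeps the binding small and lets a viewer highlight the relevant lines in the original file.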
Reproducibility of computational research, i.e., research based on code and data, poses enormous challenges to all branches of science. In this dissertation, technologies and practices are developed to increase reproducibility and to connect it better with the process of scholarly communication with a particular focus on geography, geosciences, and GIScience. Based on containerisation, this body of work creates a platform that connects existing academic infrastructures with a newly established executable research compendium (ERC). It is shown how the ERC can improve transparency, understandability, reproducibility, and reusability of research outcomes, e.g., for peer review, by capturing all parts of a workflow for computational research. The core part of the ERC platform is software that can automatically capture the computing environment, requiring authors only to create computational notebooks, which are digital documents that combine text and analysis code. The work further investigates how containerisation can be applied independent of ERCs to package complex workflows using the example of remote sensing, to support data science in general, and to facilitate diverse use cases within the R language community. Based on these technical foundations, the work concludes that functioning practical solutions exist for making reproducibility possible through infrastructure and making reproducibility easy through user experience. Several downstream applications built on top of ERCs provide novel ways to discover and inspect the next generation of publications.
To understand why reproducible research has not been widely adopted and to contribute to the propagation of reproducible research practices, the dissertation continues to investigate the state of reproducibility in GIScience and develops and demonstrates workflows that can better integrate the execution of computational analyses into peer review procedures.
We make recommendations for how to (re)introduce reproducible research into peer reviewing and how to make practices to achieve the highest possible reproducibility normative, rewarding, and, ultimately, required in science. These recommendations rest upon over 100 GIScience papers which were assessed as irreproducible, the experiences from over 30 successful reproductions of workflows across diverse scientific fields, and the lessons learned from implementing the ERC.
Besides continuing the development of the contributed concepts and infrastructure, the dissertation points out broader topics of future work, such as surveying practices for code execution during peer review of manuscripts, or reproduction and replication studies of the fundamental works in the considered scientific disciplines. The technical and social barriers to higher reproducibility are strongly intertwined with other transformations in academia, and, therefore, improving reproducibility meets similar challenges around culture change and sustainability. However, we clearly show that reproducible research is achievable today using the newly developed infrastructures and practices. The transferability of cross-disciplinary lessons facilitates the establishment of reproducible research practices and, more than other transformations, the movement towards greater reproducibility can draw from accessible and convincing arguments both for individual researchers as well as for their communities.
The comparison of thematic maps is a common task in Geosciences which can be demanding depending on the map visualizations’ level of comparability. However, through reproducible research the underlying data for creating maps in scientific publications is accessible and therefore visualizations can be changed. By providing combinations of visualizations that facilitate thematic map comparison, scientists are able to compare maps more effectively. The aim of this thesis is to examine and establish parameters that enhance thematic map comparability. By integrating these parameters into a parameter-set that enables thematic map conversions, comparisons can be facilitated. To achieve this, a study was conducted to derive guidelines for enhancing the comparability of choropleth and proportional symbol maps and the parameter-set was developed. Necessary metadata can be collected through automated metadata extraction, which avoids additional efforts for both author and user. A prototype was implemented that includes all these features and was integrated into the o2r project.
“Which publications covering land use in Brazil are available?” is an example of a question one could ask a search engine for scientific publications. Expressing this with an appropriate search term would only find a subset of the potential results. This is due to the text-based search index, which can only find the papers that include the country name “Brazil”, but no papers that mention a local name like “Rio de Janeiro” in the text. A number of excellent search engines exist for scientific publications, but they are mainly focused on text-based content. As a consequence, searching for content with a spatial relation is a sophisticated task. In this thesis, an automatic approach to extract location information from digital scientific publications is developed. It allows parsing spatial information from figures and texts. Geographical names and coordinates can be automatically extracted from texts. Additionally, maps can be detected from the figures contained in publications. These maps can then be analyzed using three approaches with regard to the spatial information they provide: (1) World maps can be recognized and (2) the spatial extent of maps can be parsed from coordinates and their corresponding axes. (3) Lastly, of course, geographical names offer important information. All methods are embedded in a framework designed to support further development. The key challenge is to analyze the textual and graphical content and combine the information retrieved from these sources into a single spatial reference. This challenge is addressed by combining existing tools for subtasks such as named entity recognition and image classification with newly developed methods into an extensive library for spatial analysis tasks. The proposed solution is thoroughly evaluated on a real-world annotated data set.
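As a rough illustration of the text-based part of such an approach, the Python sketch below extracts coordinate pairs from a sentence and merges them into one bounding box. It is a minimal example under assumptions, not the method of the thesis: the regular expression only covers decimal degrees with hemisphere letters, and the function names are hypothetical.

```python
import re

# Decimal degrees followed by a hemisphere letter, e.g. "51.96° N, 7.62° E".
# Assumption: other notations (degrees/minutes/seconds, signed values) are ignored.
COORD = re.compile(
    r"(?P<lat>\d{1,2}(?:\.\d+)?)\s*°?\s*(?P<lat_h>[NS])\b"
    r"\s*[,;]?\s*"
    r"(?P<lon>\d{1,3}(?:\.\d+)?)\s*°?\s*(?P<lon_h>[EW])\b"
)


def extract_coordinates(text):
    """Return a list of (lat, lon) tuples in signed decimal degrees."""
    points = []
    for match in COORD.finditer(text):
        lat = float(match.group("lat"))
        lon = float(match.group("lon"))
        if match.group("lat_h") == "S":
            lat = -lat
        if match.group("lon_h") == "W":
            lon = -lon
        # Discard values that cannot be valid coordinates.
        if -90 <= lat <= 90 and -180 <= lon <= 180:
            points.append((lat, lon))
    return points


def bounding_box(points):
    """Combine points into one extent: (min_lat, min_lon, max_lat, max_lon)."""
    if not points:
        return None
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    return (min(lats), min(lons), max(lats), max(lons))


text = "Sites are located at 51.96° N, 7.62° E and at 22.9 S, 43.2 W."
points = extract_coordinates(text)
print(points)
print(bounding_box(points))
```

A bounding box is only one possible way to express a single spatial reference. Geographical names found by named entity recognition would need a gazetteer lookup before they could be merged in the same way.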
Recommender systems are a feature often used to present related items to users in different areas such as shopping, job networks or research websites. They incentivize the user to explore the available set of items in the system. Ideally the user gets a better picture of the available items and it helps them find what they searched for. With the rise of computational research, publications become more sophisticated and complicated. They have to deal with multiple software components in the creation process to analyze the data and create visualizations. However, research is also becoming increasingly computational and controlled, offering potential for analysis and collaboration. This change makes it possible to analyze computational research publications for more than just the document text: The source code, the dataset and metadata can contain important information helping to determine similar papers. These could form sections such as spatially similar or computationally similar suggestions. This thesis explores methods of analyzing supplementary information for similarities to find related publications. The focus is set on (1) spatial files which are often part of research papers in the geosciences and on (2) source code used to process the data. Two methods for spatial and source code analysis are implemented in a prototype: a geohash-based spatial index similarity and an information retrieval (IR) method to analyze the source code. Both methods rely on the search database Elasticsearch. To evaluate this type of similarity, the prototype is used to analyze the similarity of a test dataset containing research papers with supplementary information.
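The following Python sketch illustrates the general idea behind a geohash-based spatial similarity. It is a simplified stand-in and not the prototype's implementation, which relies on Elasticsearch. The encoder follows the standard geohash scheme, while the shared-prefix score in `prefix_similarity` is an assumption made for illustration.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash_encode(lat, lon, precision=7):
    """Encode a point as a geohash string with `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate between longitude and latitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits form one base32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


def prefix_similarity(hash_a, hash_b):
    """Share of leading characters two geohashes have in common (0.0 to 1.0)."""
    if not hash_a or not hash_b:
        return 0.0
    shared = 0
    for a, b in zip(hash_a, hash_b):
        if a != b:
            break
        shared += 1
    return shared / max(len(hash_a), len(hash_b))


print(geohash_encode(57.64911, 10.40744, 5))
print(prefix_similarity("u4pru", "u4prx"))
```

A longer shared prefix means that two points fall into the same, smaller grid cell. One known limitation of prefix comparison is that nearby points on opposite sides of a cell boundary can have very different hashes.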
R is a programming language for statistical computation and graphics. Besides the commonly used GNU R, there are alternative R implementations that claim to have advantages over GNU R. Unfortunately, it is not clear how geospatial R packages behave on these implementations, since these packages often rely on libraries which are installed at the system level. System libraries also depend on the platform on which R is running. To close this gap, this research explores the compatibility of geospatial R packages with different R implementations and platforms. It also aims to determine which R implementation and platform has the best performance. To make the exploration easier, container technology is used to install system dependencies and R implementations. All system dependencies from sysreqsdb are installed for geospatial R packages. The exploration shows that not all R implementations are compatible with geospatial packages. The problems found can be grouped into three categories: System Dependencies, Unsupported Implementation, and Running Time Error. GNU R and Microsoft R Open (MRO) are the only R implementations that are compatible with geospatial R packages. A benchmarking R package called altRnative was created to run the benchmarks across the successful combinations. The benchmark results show that GNU R performs slightly better (1.2x) than MRO regardless of the platform. Reproducibility self-assessment (https://osf.io/j97zp/): 3, 3, 3, 3, 3 (input data, preprocessing, methods, computational environment, results).
Spatio-temporal research results are usually published in a static format, for example, as PDF. In this format, the results are not directly linked to their spatial reference, which makes it difficult for the user to understand them. To improve the user’s understanding, we link these results with the real world. To achieve this, we use the emerging concept of Augmented Reality, which makes it possible to integrate the results into the view of the user and to display the results on site. The results are often calculated from a specified dataset. To ensure that the data used for the application yields the same result as presented in the article, the outcome must be reproducible. The goal is to combine reproducibility and Augmented Reality to convey spatio-temporal results. We answered the research question of how to create an Augmented Reality application from a reproducible article. To this end, we performed a literature review and developed a concept which provides a guideline and explains the important steps: extracting the data used to calculate the results, designing the app, deciding which types of visualization and devices fit best for the result, and implementing the application. To show the feasibility of the concept, we created an application to convey the results of one scientific article. This application was evaluated in an expert user study, with the goal of determining whether the application is understandable and easy to use. Furthermore, the general interest in using Augmented Reality applications to inspect spatio-temporal results was investigated. The results of our research show that it is possible to convey spatio-temporal results through Augmented Reality and that the results are displayed in an understandable way. Overall, Augmented Reality is an interesting approach to display results from scientific articles, which should be explored in more depth in further research.
Many scientific articles are related to specific regions of the Earth. The connection is often implicit, although geospatial metadata has been shown to have positive effects, such as detecting biases in research coverage or enhancing discovery of research. Scholarly communication platforms lack an explicit modeling of geospatial metadata. In this work, we report a novel approach to integrate well-defined geospatial metadata into Open Journal Systems (OJS). Authors can create complex geometries to represent the related location(s) or region(s) for their submission and define the relevant time period. They are assisted by an interactive map and a gazetteer to capture high quality coordinates as well as a matching textual description with high usability. The geospatial metadata is published within the article pages using semantic tags, integrated in standardized publication metadata, and shown on maps. Thereby, the geoOJS plugin facilitates indexing by search engines, can improve accessibility, and provides a foundation for more powerful map-based discovery of research articles across journals.
Computational research introduces challenges when it comes to reproducibility, i.e. re-doing an analysis with the same data and code. A current research project at ifgi developed a new approach called Executable Research Compendia (ERC, see https://doi.org/10.1045/january2017-nuest) to solve some of these challenges. ERCs contain everything needed to run an analysis: data, code, and runtime environment. They can therefore be executed “offline” in a sandbox environment. An open challenge is handling big datasets and reducing data duplication. While the idea of putting “everything” into the ERC is useful in many cases, once the dataset becomes very large it is not feasible to replicate it completely for the sake of reproducibility/transparency and to some extent for archival. This thesis will create a concept for allowing ERCs to communicate with specific data repositories (e.g. PANGAEA, GFZ Data Services), extending previous work (https://doi.org/10.5281/zenodo.1478542). The new approach should let ERCs “break out” of their sandbox environments in a controlled and transparent fashion, while at the same time more explicitly configuring the actions allowed for a container (e.g. using AppArmor). Since trust is highly important in research applications, the communication with remote services must be exposed to users in a useful and understandable fashion. Users who evaluate other scientists' ERCs must know which third party repositories are used and how. The concept must be (i) implemented in a prototype using Docker containerization technology and discussed from the viewpoints of security, scalability, and transparency, and (ii) demonstrated with ERCs based on different geoscience data repositories, e.g. Sentinel Hub, and processing infrastructures, e.g. openEO or WPS, including an approach for authentication.
Furthermore, the thesis could evaluate how to define the sandbox more explicitly, and whether the communication between ERC and remote service can be captured and cached as an additional backup, so that future executions may re-use that backup.
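One conceivable starting point for capturing and re-using the communication with a remote service is a record-and-replay cache, sketched below in Python. This is only an assumption-laden illustration and not part of the existing ERC tooling: the class `ReplayCache`, the `fetch` callable, and the example URL are hypothetical.

```python
import hashlib
import json


class ReplayCache:
    """Records responses of remote requests and replays them later."""

    def __init__(self, fetch):
        # fetch: callable(url, params) -> response body as str (hypothetical)
        self._fetch = fetch
        self._store = {}
        self.log = []  # transparent record of all communication

    @staticmethod
    def _key(url, params):
        # A canonical JSON form makes the key independent of parameter order.
        canonical = json.dumps({"url": url, "params": params}, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def request(self, url, params, offline=False):
        key = self._key(url, params)
        if key in self._store:
            self.log.append(("cache", url))
            return self._store[key]
        if offline:
            # Inside a sandbox without network access only replays can succeed.
            raise LookupError(f"no cached response for {url}")
        body = self._fetch(url, params)
        self._store[key] = body
        self.log.append(("remote", url))
        return body


def fake_fetch(url, params):
    # Stand-in for a real HTTP client so that the example runs offline.
    return "payload for id " + str(params["id"])


cache = ReplayCache(fake_fetch)
print(cache.request("https://data.example.org/api", {"id": 1}))
print(cache.request("https://data.example.org/api", {"id": 1}, offline=True))
print(cache.log)
```

The log lists which repositories were contacted and which answers came from the backup, which relates to the transparency requirement described above. A real prototype would also need to persist the store inside the ERC and handle authentication.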
The Executable Research Compendium (ERC) developed by Opening Reproducible Research (https://o2r.info) provides an approach for solving the challenges of reproducibility and reusability of scholarly research in the geosciences. With the help of a UI and a reproducibility service, academic authors can easily create a snapshot of their research workflow at the time of submitting it for peer review. However, the ERC is currently limited to workflows based on R and using R Markdown. In this work, the student will explore what changes need to be made to the ERC specification to support geospatial workflows in Python. How can users of common open source GIS (e.g. QGIS) be supported in packaging their scripted workflow in an ERC, ideally without any programming? How can suitable virtual environments (i.e. Dockerfiles) be derived from the workspace metadata (e.g. QGIS project file) and workflow (e.g. created with QGIS Process Modeller)?
The student should have an interest in working with an existing codebase and multiple programming languages. Some experience with Node.js, Python, and Docker is useful.
Geospatial research workflows today are completely digitised. More and more observational data is available in open repositories and analysed with open software. The processing is increasingly moved to scalable cloud infrastructures, which often build on free and open source software. A downside is that the digitisation, data volume, online processing, and complexity of software environments make it harder for researchers to understand, reproduce, and build upon each other's work. The executable research compendium (ERC) developed by the project Opening Reproducible Research (o2r, https://o2r.info) attempts to lower the barriers for inspecting and manipulating published research workflows in the geosciences. In this thesis, the student will explore how common geospatial web services can be packaged as part of a research compendium. These web services should comprise data services (e.g. a WFS, SOS, WCS), view services (e.g. a WMS) and processing services (e.g. WPS, openEO). The student will re-use published workflows or develop two to three example workflows based on these web services. The ERC and supporting tools currently only support a single Docker container for capturing the runtime environment. A core challenge of this work will be to find out how multiple containers may be created and executed without additional burden on the author of a research paper during the creation of an ERC, and without introducing security risks in the networking between the containers.
Affinity to command line tools and solving hard problems using existing software tools is a prerequisite for this thesis. The student will get to know a variety of common geospatial web services (e.g. Geoserver, SciDB, 52N WPS, pyWPS) and Docker during the course of this work.
Open Science, particularly reproducible research, is crucial to increase the impact of science in society, as well as to accelerate scientific discovery. Digital notebooks are a useful tool for more effective and sustainable communication of research workflows. Based on the literate programming paradigm, they integrate text and code in a coherent manner. However, while a notebook approach is easily connected with script-based GIS workflows, e.g. using R, it has not yet been connected with popular GIS featuring a graphical user interface (GUI), such as ILWIS (Integrated Land and Water Information System, https://52north.org/software/software-projects/ilwis/) or QGIS (https://qgis.org/). In GUI-based GIS it is difficult for scientists to fully describe the steps that were carried out to achieve the reported results in a regular scientific article.
Therefore, this thesis investigates how to facilitate packaging and sharing workflows created in a GUI-based GIS based on common notebook technologies, such as Jupyter (https://jupyter.org/) or R Markdown (https://rmarkdown.rstudio.com/). The student makes the GIS kernels accessible in the code cells of these notebooks, comparing different approaches, e.g., the ILWIS Python API vs. the ILWIS scripting language. Furthermore, a seamless transfer of workflows created in the GUI into notebooks will be designed and evaluated, e.g. in the form of a Jupyter extension, a QGIS plug-in, or an RStudio Add-in. Two reproducible workflows illustrate the implemented approach.
The student should have experience in working with existing codebases of open source software and be open to complex programming challenges. An affinity for working with multiple programming languages and prior experience with Python, R, and optionally C++ are beneficial.
Cite this page as "📖 Theses" (2019) in Opening Reproducible Research: a research project website and blog. Daniel Nüst, Marc Schutzeichel, Markus Konkol (eds). Zenodo. doi:10.5281/zenodo.1485437