Introduction to EOSC-Life WP2

Introduction to EOSC-Life workpackage 2

1- EOSC-Life

EOSC-Life is an ESFRI cluster project involving the 13 biomedical research infrastructures (BMS RIs) whose goal is to create an open, digital and collaborative space for biological and medical research in the European Open Science Cloud.

2- Purpose and objective of EOSC-Life WP2

An increasing number of studies describe workflows involving methods and tools from different domains of the life sciences. A high-level view of such workflows and the corresponding publications is shown in Figure 1. Running this type of workflow requires the tools within each workflow to be interoperable. The purpose of EOSC-Life WP2 is therefore to make software tools from the 13 BMS RIs interoperable in the EOSC-Life cloud environment. The objective is to implement a software stack that makes tool interoperability possible and to showcase the implemented solutions in related science demonstrators.


Figure 1. Examples of workflows from publications in which the workflow is composed of tools from domains covered by different BMS RIs.

3- Scope of EOSC-Life WP2

WP2 is concerned with computational tools and workflows that process biomedical data. For our purpose, the following definitions apply:

  • A workflow describes a set of computational tasks and their relationships.

  • A tool is a piece of software used by researchers to carry out a computational task.

  • A command-line tool is a piece of software that runs as a non-interactive program that automatically terminates upon task completion.

  • A workflow management system or workflow engine is software in charge of executing and monitoring a workflow. Workflow engines are meant to automate the running of workflows; as such, they are generally not suitable for interactive tasks and often only support the running of command-line tools. A workflow engine generally includes components for task execution (e.g. running a given piece of software, handling failures), task scheduling (e.g. running tasks in parallel), resource provisioning, metadata and provenance management (where, when and with what parameters and input was a task run?) and data management (is the input data available? where should the output be sent?).

  • Registries are catalogues that can be queried manually or automatically to locate and obtain tools and workflows.

  • Tool interoperability in the context of EOSC-Life means that a workflow is able to access and use resources and tools from the different domains of the life sciences represented by the BMS RIs.

In WP2, workflows are specified in terms of tools and manage the flow of data between them. Note that a tool is not limited to implementing an atomic task; it can itself implement a workflow.
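
The definitions above can be illustrated with a minimal Python sketch. It assumes a toy workflow of three hypothetical tasks (`fetch`, `align`, `report`) and is not modelled on any particular workflow engine. It only shows a workflow as a set of tasks plus their relationships, and an executor that runs each task after the tasks it depends on while keeping a basic provenance log of what ran and with which inputs.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical tasks: each receives the outputs of its upstream tasks.
def fetch(inputs):
    return "raw-data"

def align(inputs):
    return f"aligned({inputs['fetch']})"

def report(inputs):
    return f"report({inputs['align']})"

TASKS = {"fetch": fetch, "align": align, "report": report}

# The workflow itself: each task mapped to the set of tasks it depends on.
WORKFLOW = {"fetch": set(), "align": {"fetch"}, "report": {"align"}}

def run_workflow(workflow, tasks):
    """Run tasks in dependency order; return outputs and a provenance log."""
    outputs, provenance = {}, []
    for name in TopologicalSorter(workflow).static_order():
        upstream = {dep: outputs[dep] for dep in workflow[name]}
        outputs[name] = tasks[name](upstream)
        provenance.append({"task": name, "inputs": upstream})
    return outputs, provenance

outputs, provenance = run_workflow(WORKFLOW, TASKS)
print(outputs["report"])
```

A real workflow engine adds scheduling, resource provisioning, failure handling and data management on top of this ordering step.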

For an introduction to the value of workflow management systems, see https://www.nature.com/articles/d41586-019-02619-z

WP2 focuses on the part of the software stack required to implement workflows, namely tool packaging, containerisation, workflow management systems and other relevant platforms such as code notebooks. It does not cover the provision and integration of cloud infrastructures; cloud deployment is done in cooperation with EOSC-Life WP7.
To maximise use of WP2 resources and promote interoperability, WP2 will focus on a limited number of components and build upon resources already available.
To promote findability and reusability, WP2 will unify tool and workflow descriptions using structured data, and provide a workflow registry that leverages current resources.
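
As an illustration of describing a tool with structured data, the sketch below builds a JSON-LD record using the schema.org `SoftwareApplication` type. The choice of schema.org, the helper `describe_tool`, the tool name, the URL and all field values are assumptions made for this example; the markup actually adopted by WP2 is not specified here.

```python
import json

def describe_tool(name, url, version, licence):
    """Return a schema.org-style JSON-LD description of a tool (illustrative)."""
    return {
        "@context": "https://schema.org",
        "@type": "SoftwareApplication",
        "name": name,
        "url": url,
        "softwareVersion": version,
        "license": licence,
    }

# Hypothetical tool; every value below is a placeholder.
record = describe_tool(
    name="example-aligner",
    url="https://example.org/example-aligner",
    version="1.0.0",
    licence="https://spdx.org/licenses/MIT.html",
)
print(json.dumps(record, indent=2))
```

A registry could index records of this kind so that tools and workflows can be located by automated queries as well as by hand.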

4- Assumptions and constraints in WP2

  • WP2 assumes that the required data sources are accessible in the cloud environment in which the tools are run. Making data accessible is the responsibility of the relevant BMS RIs and within the remit of EOSC-Life WP1.

  • Some software depends on specific resources (e.g. GPU, RAM, shared file systems); we therefore assume that the cloud providers make the necessary resources available.

  • Priority is given to dealing with tools for the project’s science demonstrators that allow expansion of the catalogue of interoperable tools.

  • Existing catalogues and repositories of workflows, containers and tools will be leveraged, not replaced, using common markup and a federated registry architecture.

  • Controlled access to sensitive data needs to be maintained throughout a workflow. It is the user’s responsibility to ensure applicable policies are followed. WP2 will generally assume that tools and workflows access public data.

  • Workflow management systems are not designed for interactive tasks. Where possible, interactive tasks should be run upstream of automated workflows, or workflows should be split so that the interactive task becomes an intermediate step outside the control of the workflow management system. In the latter case, tasks normally handled by the workflow management system (e.g. resource allocation, data provenance management) are left to the user. Should workflow management systems enable interactive tasks as part of a workflow in the future, WP2 will revise its assessment.

  • The graphical user interface (GUI) of most software relies on a desktop environment. WP2 will therefore explore remote desktop solutions.

  • Some areas of the life sciences rely on software that cannot be made interoperable with other EOSC-Life tools (e.g. tools that run only on Windows operating systems). Where possible, WP2 will try to find suitable alternative tools.

  • Containers need to be built starting from a base that has known package versions, corresponding source code and compatible licences. To minimise licensing issues, containers should be built with software under free and open source licences. WP2 will not deal with commercial software unless licensing issues can be resolved.

5- Current WP2 roadmap

Reviews of online materials and publications related to the activities of the BMS RIs, as well as informal discussions with individual researchers within some of the RIs (including during the project kick-off meeting), identified a range of tools and workflow systems in common use. This was complemented by a survey of the EOSC-Life science demonstrators. On this basis, WP2 has developed an initial technical roadmap which highlights technologies and standards that can be readily supported within the project. These include the Linux operating system, the Conda package manager, Singularity (and/or Docker) for containerisation, CWL for describing data analysis workflows, Nextflow for running workflows on the command line and the Galaxy platform as a web-based UI for building and running data analysis workflows. In addition, there is growing interest in the use of RStudio and Jupyter notebooks. To build on existing efforts and expertise, WP2 will aim to use these tools or ensure compatibility with them.

       5.1- Tool packaging and distribution

Conda
Conda is a cross-platform package and environment manager used to install and manage software packages and their dependencies. Bioconda is a channel for the Conda package manager specialising in bioinformatics software. Through continuous integration, Bioconda packages are also made available as Docker and Singularity containers.
https://docs.conda.io/en/latest/
https://bioconda.github.io/

Docker
Docker is a software container platform that supports a virtual research environment able to run on any cloud resource. Docker excels at running applications on VM or cloud infrastructure. However, it can be subject to privilege-escalation attacks. Compliant with the Open Container Initiative.
https://www.docker.com/

Singularity
Singularity is a software container platform that supports a virtual research environment able to run on any cloud resource. Singularity has greater adoption in HPC because it integrates with many resource managers and is well suited to command-line applications and to direct access to devices such as GPUs or MPI hardware. Containers run as the invoking user, without elevated privileges. Tools are available to convert Docker containers to Singularity. Compliant with the Open Container Initiative.
https://sylabs.io/

The Open Container Initiative (OCI) develops open industry standards for container formats and runtimes.
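
To make the Docker and Singularity comparison concrete, the following sketch builds (but does not run) the command line for executing the same command-line tool under either runtime. The helper `container_command` and the image name are placeholders for illustration. `docker run --rm` and `singularity exec docker://...` are standard invocations, but the options actually needed will depend on the local installation.

```python
import shlex

def container_command(runtime, image, tool_args):
    """Build the argv for running a command-line tool inside a container.

    Assumes a Docker-format image reference such as 'ubuntu:22.04'.
    Singularity can use Docker images directly via the docker:// prefix.
    """
    if runtime == "docker":
        return ["docker", "run", "--rm", image, *tool_args]
    if runtime == "singularity":
        return ["singularity", "exec", f"docker://{image}", *tool_args]
    raise ValueError(f"unsupported runtime: {runtime}")

# Placeholder image and command, for illustration only.
for runtime in ("docker", "singularity"):
    argv = container_command(runtime, "ubuntu:22.04", ["echo", "hello"])
    print(shlex.join(argv))
```

Keeping the runtime as a parameter reflects the roadmap's "Singularity (and/or Docker)" position: the same image can serve both cloud and HPC environments.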

       5.2- Workflows specification and management systems

Workflow management system agnostic description and interoperability

CWL
The Common Workflow Language (CWL) is selected as the standard for describing tools and workflows so that they can be executed by multiple CWL-aware workflow engines, such as cwltool (the CWL reference runner), Toil and Arvados. ELIXIR has invested in the support of CWL. CWL is also used by the EU’s BioExcel2 Centre of Excellence for Biomolecular modelling, and by the IBISBA ESFRI for Industrial Biotechnology. The CWL project participates in the GA4GH Task Execution API (a minimal common API for submitting a single job to a remote execution endpoint) and the GA4GH Workflow Execution API (a minimal common API for submitting workflow requests to workflow execution systems in a standardised way).
http://www.commonwl.org
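
A CWL description is a YAML or JSON document. As a hedged illustration, the sketch below builds a minimal `CommandLineTool` description wrapping `echo` as a Python dictionary and serialises it to JSON. The variable names are arbitrary, and actually running the description would require a CWL-aware engine, which is assumed rather than shown.

```python
import json

# Minimal CWL v1.0 CommandLineTool wrapping `echo` with one string input.
echo_tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "baseCommand": "echo",
    "inputs": {
        "message": {
            "type": "string",
            # Place the value as the first positional argument.
            "inputBinding": {"position": 1},
        }
    },
    "outputs": [],
}

cwl_text = json.dumps(echo_tool, indent=2)
print(cwl_text)
```

Because the description names only the command, its inputs and its outputs, the same document can be handed to any engine that implements the CWL standard.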

Workflow management systems (WFMS)

EOSC-Life aims to provide an environment that supports a wide range of workflow management systems for its RI developers and users.
Some workflow systems have been identified as meriting dedicated attention.

Galaxy
Galaxy is a web-based scalable platform for running biomedical data analysis tasks. In Galaxy, workflows are built by selecting, from a web interface, a series of operations to apply to the data. The saved history of applied operations constitutes a shareable and reusable workflow.
Tools are made available to Galaxy by writing a wrapper script and a description, and can be distributed via the Galaxy ToolShed. The recommended best practice for managing tool dependencies is to use (Bio)conda. Galaxy can also use containers (Docker, Singularity) to run jobs, and can run workflows on remote resources using its Pulsar system.
Galaxy is integrating support for CWL. As a first step, export of ‘abstract CWL’ will be developed, which can serve as metadata (suited e.g. for inclusion in a workflow registry) but is not executable CWL.
ELIXIR runs an EU-wide Galaxy installation, hosted and managed by ELIXIR Germany.
https://usegalaxy.eu/
https://galaxyproject.org/

KNIME
KNIME is a platform for running data analysis workflows. Originally run as a desktop application, it can also be run in headless mode on the command line. Workflows are built using graphical dataflow programming, in which individual tasks are represented by nodes in a directed graph whose edges represent the flow of data. Adding a node to KNIME normally requires developing a Java plugin, although there are mechanisms to include other languages and tools.
KNIME currently does not support CWL other than by wrapping whole KNIME workflows as a CWL step within a container. Further work is ongoing.
https://www.knime.com/

Nextflow
Nextflow is a platform for data-driven computational pipelines executable from the command line and from executable notebooks. It is specifically designed for scalable and reproducible scientific workflows using software containers, and allows the adaptation of pipelines written in the most common scripting languages. Its domain-specific language (DSL) simplifies the implementation and deployment of complex parallel and reactive workflows on clouds and clusters.
Nextflow is an open source platform around which a commercialisation activity has started.
Nextflow has not yet committed to supporting CWL.
https://www.nextflow.io/

Executable (Notebook) Environments

Jupyter
Jupyter is a web-based computational notebook used for literate programming and supporting several programming languages. Workflows are composed by interactive programming in a selected language and consist of documented code chunks and their output.
Data provenance is not tracked by default, although plugins such as Verdant or external tools such as noWorkflow can be used for this, and experimental systems such as ProvBook exist.
https://jupyter.org/

RStudio
RStudio is a web-based computational notebook dedicated to the R statistical software environment. Workflows are composed by interactive programming in the R language and consist of documented code chunks and their output.
Data provenance can be tracked using various R packages such as RDataTracker, adapr, recordr or repo.
https://www.rstudio.com/