Figure 1: A typical workflow can be represented as static text, a Jupyter notebook and nbdocker.

With the generation of diverse and complex big data, computational methods development and data analyses have become integral to research. Analytical protocols typically involve the execution of a series of computational tasks that depend on code, input parameters, the computing environment, and software installation and setup, none of which is easily captured by a static text description.

Data science notebook systems such as Jupyter notebooks allow executable live code to be included in line with documentation. All modifiable code cells in a notebook must be in the same language, but Jupyter has kernels supporting over 100 different programming languages. The integration of editable code with the scientific rationale and narrative facilitates the documentation, dissemination and adoption of computational methodologies. As a result, Jupyter notebooks have become extremely popular across a wide variety of scientific disciplines.

A major drawback of Jupyter notebooks is that they are not autonomous. Although the code in code cells is modifiable and executable, execution often requires the installation of additional software, libraries, frameworks and packages by the user. Another limitation of Jupyter notebooks is that each notebook is limited to one kernel supporting a single programming language.
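The one-kernel limit is visible in the notebook file itself. The sketch below builds a minimal notebook in the nbformat 4 JSON layout and shows that the kernel is declared once, in the notebook-level metadata, rather than per cell. The cell contents and the helper function names are made up for illustration.

```python
import json

# A minimal notebook in the nbformat 4 JSON layout.
# The cell contents are invented for illustration.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {
        # One kernelspec for the whole notebook: every code cell uses it.
        "kernelspec": {
            "name": "python3",
            "display_name": "Python 3",
            "language": "python",
        }
    },
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": "Download the data."},
        {
            "cell_type": "code",
            "metadata": {},
            "source": "print('hello')",
            "outputs": [],
            "execution_count": None,
        },
    ],
}


def kernel_language(nb: dict) -> str:
    """Return the language of the single kernel declared for the notebook."""
    return nb["metadata"]["kernelspec"]["language"]


def code_cell_count(nb: dict) -> int:
    """Count the code cells, all of which run on that one kernel."""
    return sum(1 for cell in nb["cells"] if cell["cell_type"] == "code")


# This string is what would be written to an .ipynb file.
serialized = json.dumps(notebook, indent=1)
```

Because there is no per-cell kernel field, a workflow that mixes Python, a compiled binary and R cannot be expressed as ordinary code cells in one notebook without some extra mechanism.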

One approach to these problems is to use software containers such as Docker containers to encapsulate each computing environment. Docker containers wrap the executables and scripts inside a custom software environment, avoiding conflicts between different components and thus eliminating the need for users to install and manage all the software dependencies. Dockerized components are isolated and modular, and will yield identical results regardless of the platform of execution.
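As a rough illustration of what encapsulating a step looks like in practice, the sketch below assembles a `docker run` command line for a single task. The helper names `docker_run_command` and `run_step` are our own for this example and are not part of nbdocker or Docker; the host path is a placeholder. By default nothing is executed, so the code can be tried on a machine without Docker.

```python
import shlex
import subprocess
from typing import Dict, List, Optional


def docker_run_command(
    image: str,
    command: List[str],
    volumes: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Build the argument list for `docker run` without executing it."""
    argv = ["docker", "run", "--rm"]  # --rm removes the container when it exits
    for host_path, container_path in (volumes or {}).items():
        # -v mounts a host directory inside the container
        argv += ["-v", f"{host_path}:{container_path}"]
    return argv + [image] + command


def run_step(argv: List[str], dry_run: bool = True) -> str:
    """Return the shell-quoted command; execute it only when dry_run is False."""
    printable = shlex.join(argv)
    if not dry_run:
        # Requires a working Docker installation on the host.
        subprocess.run(argv, check=True)
    return printable


# Placeholder host path; the image is the official Python image on Docker Hub.
argv = docker_run_command("python:3", ["python", "--version"], {"/tmp/data": "/data"})
print(run_step(argv))
```

The only thing the host needs is Docker itself; the interpreter and its libraries come from the image.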

Our research team presents “nbdocker”, a Python/JavaScript extension to Jupyter notebooks that allows different Docker containers to be executed inside a notebook in the same manner regardless of the kernel used. nbdocker integrates a Docker management user interface (UI) into Jupyter, and the user can embed a set of Docker commands as clickable buttons inside markdown (documentation) cells. Specifically, nbdocker provides a point-and-click UI to pull a Docker image from a registry such as DockerHub or use a local image, keep a record of running Docker containers, and document and execute a Docker container from the history. The user can also check the status of running containers with a single click.
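To make the idea of a container history concrete, here is a minimal sketch of the kind of record such a UI might keep. This is not nbdocker's actual data model: the class names, fields and status strings are assumptions chosen for illustration, and the image name is a placeholder.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContainerRecord:
    """One entry in the history (hypothetical structure)."""
    image: str
    command: List[str]
    status: str = "created"  # assumed values: "created", "running", "exited"


@dataclass
class ContainerHistory:
    """In-memory list of containers that have been launched."""
    records: List[ContainerRecord] = field(default_factory=list)

    def add(self, image: str, command: List[str]) -> int:
        """Record a container and return its index in the history."""
        self.records.append(ContainerRecord(image, list(command)))
        return len(self.records) - 1

    def set_status(self, index: int, status: str) -> None:
        self.records[index].status = status

    def running(self) -> List[ContainerRecord]:
        """Entries currently marked as running."""
        return [r for r in self.records if r.status == "running"]

    def rerun_argv(self, index: int) -> List[str]:
        """Rebuild a `docker run` argument list from a past entry."""
        rec = self.records[index]
        return ["docker", "run", "--rm", rec.image, *rec.command]


history = ContainerHistory()
first = history.add("example/aligner", ["align", "reads.fastq"])  # placeholder image
history.set_status(first, "running")
```

Keeping the image and the exact command together is what allows a past step to be documented and repeated later with a single action.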

We illustrate the utility of nbdocker with a bioinformatics workflow. Specifically, we use an established RNA sequencing processing workflow that consists of the following steps: (1) download data files using Python and shell scripts; (2) execute a C++ binary to align the reads to the reference and compute the abundance of the transcripts; (3) determine differentially expressed genes using R and Bioconductor packages. The figure above and video below contrast how this workflow can be represented as static text, a Jupyter notebook, and nbdocker.
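The three steps above can be sketched as a small plan in which each step names the container image it would run in and the files it reads and writes. Only the three stages come from the workflow described here; the image names, file names and the `missing_inputs` helper are placeholders invented for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    name: str
    image: str                # hypothetical image name
    inputs: Tuple[str, ...]   # files the step reads
    outputs: Tuple[str, ...]  # files the step writes


# Placeholder images and file names for the three stages.
WORKFLOW = [
    Step("download", "example/downloader", (), ("reads.fastq", "reference.fa")),
    Step("align_and_quantify", "example/aligner",
         ("reads.fastq", "reference.fa"), ("abundance.tsv",)),
    Step("differential_expression", "example/r-bioconductor",
         ("abundance.tsv",), ("de_genes.csv",)),
]


def missing_inputs(steps: List[Step]) -> List[Tuple[str, str]]:
    """Return (step name, file) pairs for inputs that no earlier step produces."""
    available = set()
    problems = []
    for step in steps:
        for needed in step.inputs:
            if needed not in available:
                problems.append((step.name, needed))
        available.update(step.outputs)
    return problems


print(missing_inputs(WORKFLOW))
```

Each step runs in its own environment (Python and shell, a compiled binary, R), and the only coupling between steps is the files they exchange, which is why the order of execution matters.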

YouTube video (https://www.youtube.com/watch?v=H_s7_A8qb_0).

Research team members include:

  • Ka Yee Yeung, Associate Professor, Institute of Technology, UW-Tacoma (affiliate of the eScience Institute). Email: kayee@uw.edu
  • Ling-Hong Hung, Research Scientist, Institute of Technology, UW-Tacoma. Email: lhhung@uw.edu
  • Jiaming Hu, Master’s student, Institute of Technology, UW-Tacoma (recently graduated). Email: huj22@uw.edu

Additional Reading

A preprint of our manuscript is available from bioRxiv (https://www.biorxiv.org/content/early/2018/05/02/309567).

GitHub repository: https://github.com/BioDepot/nbdocker