With the generation of diverse and complex big data, computational methods development and data analyses have become integral to research. Analytical protocols typically involve the execution of a series of computational tasks that depend on code, input parameters, the computing environment, and software installation and setup, none of which is easily captured in a static text description.
Data science notebook systems such as Jupyter notebooks allow executable live code to be included in line with documentation. All modifiable code cells in a notebook must be in the same language, but Jupyter has kernels supporting over 100 different programming languages. The integration of editable code with the scientific rationale and narrative facilitates the documentation, dissemination and adoption of computational methodologies. As a result, Jupyter notebooks have become extremely popular across a wide variety of scientific disciplines.
A major drawback of Jupyter notebooks is that they are not self-contained. Although the code in code cells is modifiable and executable, execution often requires the user to install additional software, libraries, frameworks and packages. Another limitation is that each notebook is restricted to one kernel, and thus to a single programming language.
One approach to these problems is to use software containers such as Docker containers to encapsulate each computing environment. Docker containers wrap the executables and scripts inside a custom software environment, avoiding conflicts between different components and thus eliminating the need for users to install and manage all the software dependencies. Dockerized components are isolated and modular, and are designed to yield identical results regardless of the platform of execution.
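To make the idea of wrapping a tool in a container concrete, here is a minimal Python sketch that builds a `docker run` command line. It is not nbdocker's interface: the helper name `docker_run_command`, the image `example/aligner:1.0` and the `align` command are hypothetical placeholders. Only the standard Docker CLI flags `--rm`, `-v` and `-w` are assumed.

```python
import shlex
from pathlib import Path


def docker_run_command(image, command, host_dir, container_dir="/data"):
    """Build the argv for running `command` inside `image`.

    host_dir is bind-mounted at container_dir so that inputs and outputs
    remain on the host after the container exits.
    """
    host_path = Path(host_dir).resolve()
    return [
        "docker", "run",
        "--rm",                                # remove the container when it exits
        "-v", f"{host_path}:{container_dir}",  # share a data directory with the host
        "-w", container_dir,                   # run the command from that directory
        image,
        *command,
    ]


if __name__ == "__main__":
    # The image name and command below are hypothetical placeholders.
    argv = docker_run_command("example/aligner:1.0", ["align", "reads.fastq"], ".")
    print(shlex.join(argv))
    # To actually execute it (requires Docker to be installed):
    #   import subprocess; subprocess.run(argv, check=True)
```

Because the tool and its dependencies live in the image, the host only needs Docker itself. The script prints the command rather than running it, so it can be inspected without a Docker installation.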
We illustrate the utility of nbdocker with a bioinformatics workflow. Specifically, we use an established RNA sequencing processing workflow that consists of the following steps: (1) download data files using Python and shell scripts; (2) execute a C++ binary to align the reads to the reference and compute the abundance of the transcripts; (3) determine differentially expressed genes using R and Bioconductor packages. The figure above and video below contrast how this workflow can be represented as static text, a Jupyter notebook, and nbdocker.
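The three steps above can be sketched as an ordered list in which each step carries its own container image, so that Python, C++ and R tools coexist in one workflow. This is an illustrative outline only, not the authors' implementation: every image name, script name and file path is a hypothetical placeholder, and the commands are printed rather than executed.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    name: str
    language: str       # language of the tool used in this step
    image: str          # hypothetical Docker image name
    command: List[str]  # hypothetical command line run inside the container


# All image, script and file names below are illustrative placeholders.
WORKFLOW = [
    Step("download", "Python/shell", "example/downloader:1.0",
         ["python", "download.py", "--out", "/data/reads"]),
    Step("align_and_quantify", "C++", "example/aligner:1.0",
         ["aligner", "--reads", "/data/reads", "--out", "/data/abundance"]),
    Step("differential_expression", "R", "example/bioconductor:1.0",
         ["Rscript", "de_analysis.R", "/data/abundance", "/data/de_genes.csv"]),
]


def plan(workflow, host_dir):
    """Return one `docker run` argv per step, in execution order.

    Every step mounts the same host directory at /data, so the output
    of one step is visible as the input of the next.
    """
    return [
        ["docker", "run", "--rm", "-v", f"{host_dir}:/data", step.image, *step.command]
        for step in workflow
    ]


if __name__ == "__main__":
    for step, argv in zip(WORKFLOW, plan(WORKFLOW, "/path/to/project")):
        print(f"[{step.name}: {step.language}] {' '.join(argv)}")
```

The shared `/data` mount is one simple way to pass files between steps. It is a design choice made for this sketch and not something the source specifies.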
YouTube video (https://www.youtube.com/watch?v=H_s7_A8qb_0).
Research team members include:
- Ka Yee Yeung, Associate Professor, Institute of Technology, UW-Tacoma (affiliate of the eScience Institute). Email: firstname.lastname@example.org
- Ling-Hong Hung, Research Scientist, Institute of Technology, UW-Tacoma. Email: email@example.com
- Jiaming Hu, Master’s student, Institute of Technology, UW-Tacoma (recently graduated). Email: firstname.lastname@example.org
A preprint of our manuscript is available from bioRxiv (https://www.biorxiv.org/content/early/2018/05/02/309567).
GitHub repository: https://github.com/BioDepot/nbdocker