Phylo2Vec: Accelerating Phylogenetic Research 

Partners: Samir Bhatt, Neil Alexandre Scheidwasser, and Frederik Mølkjær Andersen

SSEC Engineers: Don Setiawan, Madeline Gordon, and Ayush Nag

Research Goals and Domain

A phylogenetic tree is a diagram that illustrates the shared evolutionary history of a group of species. It is typically represented as a bifurcating (binary) tree, a structure in which each internal (non-leaf) node splits into exactly two branches. Beyond evolutionary biology, bifurcating trees can also represent hierarchical relationships in other fields, for example the similarities between languages.

Software Problem

Because phylogenetic trees are so widely used, a large body of highly optimized, well-documented, and well-organized software exists for working with them. However, the standard input format for representing a phylogenetic tree in this software is a string. Reading and writing trees as strings is inefficient when moving them between tools for tree operations, especially at scale. Phylo2Vec instead represents phylogenetic trees as integer vectors. This representation requires one-sixth of the storage of the string format and thus enables more efficient tree operations.
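
To make the contrast between a string and an integer vector concrete, here is a minimal Python sketch. It assumes the constraint described in the Phylo2Vec literature, namely that a tree with n leaves is encoded by n - 1 integers and that the entry at position i (counting from 0) lies between 0 and 2i. The function names are hypothetical and are not the phylo2vec API, the example vector is not claimed to encode the example string, and the byte counts depend on the chosen integer width and leaf labels, so this toy comparison is not meant to reproduce the storage figure quoted above.

```python
from array import array


def is_valid_vector(v):
    """Check the assumed constraint 0 <= v[i] <= 2*i for every position i."""
    return all(isinstance(x, int) and 0 <= x <= 2 * i for i, x in enumerate(v))


def count_vectors(n_leaves):
    """Count the vectors allowed by the assumed constraint for n_leaves leaves.

    Position i has 2*i + 1 allowed values, so the count is the product of
    (2*i + 1) over the n_leaves - 1 positions.
    """
    total = 1
    for i in range(n_leaves - 1):
        total *= 2 * i + 1
    return total


# Illustrative inputs only: a parenthesized string for a 4-leaf tree and an
# arbitrary valid vector of length 3 (not claimed to encode the same tree).
tree_string = "((0,1),(2,3));"
v = [0, 1, 2]

string_bytes = len(tree_string.encode("ascii"))
vector_bytes = len(array("B", v).tobytes())  # one unsigned byte per entry

print(is_valid_vector(v), count_vectors(4), string_bytes, vector_bytes)
```

Under this assumption, the number of valid vectors for n leaves equals the number of rooted binary trees on n labelled leaves, which is why a fixed-length integer vector can stand in for the tree.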

Software Solution

SSEC is working with researchers from the University of Copenhagen and Imperial College London to refactor and speed up the phylo2vec package. The team will first implement the core tree vector representation and operations in Rust, then provide bindings to Python and R. This will make the current phylo2vec Python package faster and more efficient, and will introduce a new package for the R community.

Impact

By adding standard packaging conventions and detailed documentation, and by enabling GitHub Codespaces, SSEC aims to make the software more accessible. Bringing phylo2vec into the open-source Scientific Python and R ecosystems will allow for sustained contributions and greater engagement. Features such as CI/CD and tutorial notebooks will support robust code that is easily shared, understood, and reproduced among researchers.

Related Repositories

phylo2vec