Partners: Samir Bhatt, Neil Alexandre Scheidwasser, and Frederik Mølkjær Andersen
SSEC Engineers: Don Setiawan, Madeline Gordon, and Ayush Nag
Research Goals and Domain
A phylogenetic tree is a diagram that illustrates the shared evolutionary history of a group of species. It is typically represented as a bifurcating (binary) tree: a structure in which each internal node, that is, every node that is not a leaf, splits into exactly two branches. Beyond evolutionary biology, bifurcating trees can also represent hierarchical relationships in other fields, such as tracing the similarities between languages.
Software Problem
Given the importance of the phylogenetic tree data structure, a wide range of software exists for working with it, much of it highly optimized, documented, and organized. However, a fundamental challenge remains: the current standard input format for representing a phylogenetic tree is a string. Reading and writing these strings across different software to perform tree operations is inefficient, especially as scale increases. Phylo2vec instead represents phylogenetic trees as integer vectors. This representation requires six times less storage than the string format and enables more efficient tree operations.
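To make the idea of an integer-vector encoding more concrete, the following is a minimal Python sketch. It relies on an assumption that is not stated above and is taken from our reading of the Phylo2Vec publication: a tree with n leaves maps to a vector v of length n - 1 whose entries satisfy 0 <= v[i] <= 2i. The helper names are hypothetical and are not the package's API; consult the phylo2vec documentation for the real interface.

```python
import random


def random_tree_vector(n_leaves, seed=None):
    """Draw a random vector under the ASSUMED constraint 0 <= v[i] <= 2*i.

    Hypothetical helper for illustration only, not the phylo2vec API.
    """
    rng = random.Random(seed)
    return [rng.randint(0, 2 * i) for i in range(n_leaves - 1)]


def is_valid_vector(v):
    """Check the assumed per-position bound on every entry."""
    return all(isinstance(x, int) and 0 <= x <= 2 * i for i, x in enumerate(v))


def count_vectors(n_leaves):
    """Number of distinct vectors allowed by the assumed bound.

    Position i has 2*i + 1 choices, so the total is 1 * 3 * 5 * ...,
    which equals the number of rooted bifurcating trees on n labelled leaves.
    """
    total = 1
    for i in range(n_leaves - 1):
        total *= 2 * i + 1
    return total


if __name__ == "__main__":
    v = random_tree_vector(6, seed=0)
    print(v, is_valid_vector(v))
    print(count_vectors(4))
```

Under this assumption, checking whether a vector encodes a tree is a single pass over a flat list of integers, whereas validating a string requires parsing nested parentheses and labels.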
Software Solution
SSEC worked with researchers from the University of Copenhagen and Imperial College London to refactor and speed up the phylo2vec package. The team first implemented the core tree vector representation and operations in Rust and then provided bindings to Python and R. This made the existing Phylo2vec Python package faster and more efficient, and it introduced a new package for the R community.
Impact
The core component was rewritten in Rust for better performance and memory efficiency, while the Python and R APIs were maintained. In addition, replacing a loop with a Fenwick tree reduced the overall time complexity from O(n²) to O(n log n).
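The general pattern behind this kind of speed-up can be sketched in Python. The example below is illustrative only and is not the computation used in phylo2vec: it assumes a hypothetical task (for each element of a permutation of 0..n-1, count how many earlier elements are smaller) and shows how a Fenwick tree replaces an inner loop, turning an O(n²) scan into O(n log n).

```python
class FenwickTree:
    """Binary indexed tree supporting point updates and prefix sums in O(log n)."""

    def __init__(self, size):
        self.size = size
        self.tree = [0] * (size + 1)  # 1-based internal indexing

    def add(self, index, delta):
        """Add delta to the element at 0-based position index."""
        i = index + 1
        while i <= self.size:
            self.tree[i] += delta
            i += i & -i  # move to the next node that covers this position

    def prefix_sum(self, count):
        """Return the sum of the first `count` elements (positions 0..count-1)."""
        total = 0
        i = count
        while i > 0:
            total += self.tree[i]
            i -= i & -i  # strip the lowest set bit
        return total


def smaller_before_naive(values):
    """O(n^2): for each element, rescan every earlier element."""
    return [sum(1 for j in range(i) if values[j] < values[i])
            for i in range(len(values))]


def smaller_before_fenwick(values):
    """O(n log n): same result, assuming values is a permutation of 0..n-1."""
    seen = FenwickTree(len(values))
    result = []
    for v in values:
        result.append(seen.prefix_sum(v))  # how many seen values are below v
        seen.add(v, 1)                     # mark v as seen
    return result


if __name__ == "__main__":
    data = [3, 0, 4, 1, 2]
    print(smaller_before_naive(data))
    print(smaller_before_fenwick(data))
```

Both functions return the same counts; the Fenwick version does a logarithmic number of steps per element instead of rescanning the whole prefix.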
By adding standard packaging conventions and detailed documentation, and by enabling GitHub Codespaces, SSEC made the software more accessible. Bringing phylo2vec into the open-source Scientific Python and R ecosystems will allow for sustained contributions and greater engagement. Features such as CI/CD and tutorial notebooks will help keep the code robust and easy to share, understand, and reproduce among researchers.