A web-based interface for rare disease lumping and splitting predictions with LumpIt

Project Leads: Shirin Khanam, Jessica Chong, and Allison Marcello, UW Pediatrics

Data Science Lead: Bernease Herman

Rare genetic disorders affect 263–446 million people, or ~3.5–5.9% of the worldwide population, and the vast majority of these people have a Mendelian condition (MC). Over 4,500 genes underlie one or more of the 6,000 MCs described to date, and ~25% of these genes underlie two or more MCs. However, there is currently no quantitative method for distinguishing between MCs caused by variants in the same gene. Instead, researchers and clinicians define Mendelian conditions manually and subjectively, based on an arbitrary selection of perceived shared clinical features. Conditions can be retroactively merged or separated through a process called “lumping and splitting.”

As a result, the number of distinct rare diseases is unknown. More importantly, the lack of objective approaches for determining when two proposed disease entities are sufficiently distinct to constitute separate diseases limits the accuracy of the information (e.g., natural history, anticipatory guidance) that clinicians provide to families with a likely pathogenic or pathogenic variant in one of these genes. We developed a machine learning tool, LumpIt, that can predict expert lumping and splitting decisions. Adoption of LumpIt by clinicians and researchers in rare disease will improve the precision of clinical diagnosis of MCs and accelerate the discovery and delineation of new MCs.

Image Caption: After applying LumpIt to each disease pair for a single gene, we predict that there are in reality only two disease entities: one comprising the clinical findings associated with diseases A, B, and C together, and a separate entity consisting of disease D.
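
The step from pairwise predictions to disease entities described in the caption can be illustrated with a minimal sketch. It assumes that each "lump" prediction is treated as a link between two diseases and that linked diseases are grouped transitively; the description above does not state how LumpIt itself combines pairwise predictions, so this grouping rule, the function name group_entities, and the example input are all hypothetical.

```python
def group_entities(diseases, lumped_pairs):
    """Group diseases into entities, given the pairs predicted to be lumped.

    Assumption (not from the LumpIt description): lumping is transitive,
    so entities are the connected components of the "lump" graph.
    """
    # Union-find: every disease starts as its own entity.
    parent = {d: d for d in diseases}

    def find(d):
        # Follow parent links to the representative, shortening the path.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    # Merge the entities of every pair predicted to be lumped.
    for a, b in lumped_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect diseases by their representative.
    groups = {}
    for d in diseases:
        groups.setdefault(find(d), []).append(d)
    return sorted(groups.values())


# Hypothetical pairwise predictions mirroring the caption:
# A, B, and C are lumped with each other; D is split from all three.
diseases = ["A", "B", "C", "D"]
lumped_pairs = [("A", "B"), ("B", "C"), ("A", "C")]
entities = group_entities(diseases, lumped_pairs)
print(entities)  # [['A', 'B', 'C'], ['D']]
```

Under this assumption, a single "lump" prediction is enough to place two diseases in the same entity, so conflicting pairwise predictions (for example, A lumped with B and B with C, but A split from C) would still produce one entity; a real interface might flag such cases for expert review instead.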