Protein Design Pipeline: Standardizing Reproducibility

Partners: Andrew Hunt, Magnus Bauer, Jasper Butcher, Rohith Krishna, Saman Salike

SSEC Research Engineer: Anant Mittal

Research Goals and Domain

The University of Washington Institute for Protein Design (IPD), led by Nobel Prize-winning Director David Baker, is a cutting-edge research center that creates entirely new proteins to address challenges in medicine, technology, and sustainability through the application of novel artificial intelligence tools and experimental science. Applications of protein design include the development of new therapeutics, vaccines, and medical diagnostics, as well as novel technologies for advanced manufacturing, environmental remediation, and more.

The UW Scientific Software Engineering Center (SSEC) at the eScience Institute is collaborating with the IPD to build fluid workflows to optimize computational protein design campaigns, integrating state-of-the-art models for molecular design and structure prediction together with advanced filtering techniques. This project seeks to accelerate research within the IPD and improve the accessibility and reproducibility of AI-assisted protein design tools for researchers worldwide.

Software Problem

Current computational protein design workflows use fragmented interfaces across iterative prediction and filtering steps, requiring repetitive and ad hoc integration between essential subtasks. This fragmentation limits third-party adoption, because running the workflows requires substantial software infrastructure and specialized tooling knowledge that many research teams lack. Current design processes also face software engineering bottlenecks: the millions of files produced per campaign create significant data management and storage challenges. Without standardized configuration and modularity, users struggle to customize workflows, test software accuracy, benchmark individual models and improvements, and maintain reproducibility.

Software Solution

SSEC will extend the existing Python-based pipeline with a command-line interface and improve support for containerized deployment with Apptainer, using Slurm to manage workloads on the IPD’s high-performance computing infrastructure. This command-line interface will introduce a unified configuration system, a pilot mode for spot-checking designs, and modular block execution at each step in the design process. Data interoperability will be improved by migrating the intermediate data files to a relational database, enabling faster, more intuitive analysis. The system will support dynamic workflow customization and integration of third-party tools. Finally, the software will adopt research software engineering best practices to enhance usability and maintainability.
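To make the ideas of a unified configuration, a pilot mode, and modular block execution more concrete, the following is a minimal Python sketch. It is not the IPD pipeline's actual interface. The block names (`generate`, `filter`), the JSON configuration layout, the `pilot_size` setting, and the `--pilot` and `--block` flags are all hypothetical, and the toy scoring formula stands in for real design and prediction models.

```python
import argparse
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical block registry: each block takes the current designs plus its
# own parameters and returns a new list of designs.
Block = Callable[[List[dict], dict], List[dict]]
BLOCKS: Dict[str, Block] = {}


def register(name: str):
    """Decorator that adds a function to the block registry under `name`."""
    def wrap(fn: Block) -> Block:
        BLOCKS[name] = fn
        return fn
    return wrap


@register("generate")
def generate(designs: List[dict], params: dict) -> List[dict]:
    # Placeholder for a design model: emits toy candidates with fake scores.
    n = params.get("num_designs", 8)
    return [{"id": i, "score": (i * 37 % 100) / 100} for i in range(n)]


@register("filter")
def filter_block(designs: List[dict], params: dict) -> List[dict]:
    # Placeholder for a filtering step: keeps candidates above a threshold.
    threshold = params.get("min_score", 0.5)
    return [d for d in designs if d["score"] >= threshold]


@dataclass
class PipelineConfig:
    """Single configuration object shared by every block (assumed layout)."""
    blocks: List[str]
    params: Dict[str, dict] = field(default_factory=dict)
    pilot_size: int = 3


def load_config(text: str) -> PipelineConfig:
    raw = json.loads(text)
    return PipelineConfig(
        blocks=raw["blocks"],
        params=raw.get("params", {}),
        pilot_size=raw.get("pilot_size", 3),
    )


def run_pipeline(config: PipelineConfig, pilot: bool = False,
                 only: Optional[List[str]] = None) -> List[dict]:
    """Run the configured blocks in order.

    In pilot mode, the output of each block is truncated to `pilot_size`
    designs so a campaign can be spot-checked cheaply. `only` restricts the
    run to the named blocks.
    """
    designs: List[dict] = []
    for name in config.blocks:
        if only is not None and name not in only:
            continue
        designs = BLOCKS[name](designs, config.params.get(name, {}))
        if pilot:
            designs = designs[: config.pilot_size]
    return designs


def main(argv: Optional[List[str]] = None) -> None:
    """Hypothetical command-line entry point."""
    parser = argparse.ArgumentParser(prog="design-pipeline")
    parser.add_argument("config", help="path to a JSON configuration file")
    parser.add_argument("--pilot", action="store_true",
                        help="process only a small sample of designs")
    parser.add_argument("--block", action="append",
                        help="run only the named block (repeatable)")
    args = parser.parse_args(argv)
    with open(args.config) as fh:
        config = load_config(fh.read())
    for design in run_pipeline(config, pilot=args.pilot, only=args.block):
        print(json.dumps(design))
```

In this sketch, one configuration file drives every block, so a run can be repeated or altered by editing a single document. A third-party tool could be added by registering another function under a new block name.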

Impact

This project aims to reduce the number of intermediate files and the temporary storage required per protein design campaign by four orders of magnitude (roughly 10,000-fold). It will also shorten processing time for the computational design of millions of candidate molecules and improve benchmarking, configurability, and reproducibility. The pipeline will improve block interoperability, enabling scientists to run bespoke blocks and compare model performance effectively. This system will serve as the foundation for a future open-source release and broader adoption within and beyond the IPD. As a result, this collaboration will increase transparency and bolster relationships between machine learning model developers and protein science experts in one of science’s most promising fields.
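As a rough illustration of how consolidating per-design files into a relational database can cut file counts by about four orders of magnitude, the sketch below stores 10,000 toy design records in a single SQLite database instead of 10,000 separate files. The project's actual database engine and schema are not specified here. SQLite, the `designs` table, its columns, and the synthetic scores are assumptions chosen only to keep the example self-contained.

```python
import sqlite3
from typing import List, Tuple


def store_designs(conn: sqlite3.Connection,
                  designs: List[Tuple[str, str, float]]) -> None:
    """Insert (design_id, block, score) rows into a hypothetical table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS designs ("
        "design_id TEXT PRIMARY KEY, block TEXT NOT NULL, score REAL)"
    )
    conn.executemany("INSERT INTO designs VALUES (?, ?, ?)", designs)
    conn.commit()


def top_designs(conn: sqlite3.Connection, block: str,
                limit: int) -> List[Tuple[str, float]]:
    """Return the highest-scoring designs for one block with a single query."""
    rows = conn.execute(
        "SELECT design_id, score FROM designs "
        "WHERE block = ? ORDER BY score DESC LIMIT ?",
        (block, limit),
    )
    return rows.fetchall()


# An in-memory database keeps the example side-effect free; a file path
# passed to sqlite3.connect would give one database file on disk.
conn = sqlite3.connect(":memory:")
records = [(f"design_{i:05d}", "filter", (i * 37 % 100) / 100)
           for i in range(10_000)]
store_designs(conn, records)
best = top_designs(conn, "filter", 3)
```

With the records in one table, questions such as "which designs scored highest after filtering" become a single query rather than a scan over thousands of files.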
