Partners: Andrew Hunt, Magnus Bauer, Jasper Butcher, Rohith Krishna, Saman Salike
SSEC Research Engineer: Anant Mittal
Research Goals and Domain
The University of Washington Institute for Protein Design (IPD), led by Nobel Prize-winning Director David Baker, is a cutting-edge research center that creates entirely new proteins to address challenges in medicine, technology, and sustainability through the application of novel artificial intelligence tools and experimental science. Applications of protein design include the development of new therapeutics, vaccines, and medical diagnostics, as well as novel technologies for advanced manufacturing, environmental remediation, and more.
The UW Scientific Software Engineering Center (SSEC) at the eScience Institute is collaborating with the IPD to build fluid workflows to optimize computational protein design campaigns, integrating state-of-the-art models for molecular design and structure prediction together with advanced filtering techniques. This project seeks to accelerate research within the IPD and improve the accessibility and reproducibility of AI-assisted protein design tools for researchers worldwide.
Software Problem
Current computational protein design workflows use fragmented interfaces across iterative prediction and filtering steps, requiring repetitive and ad hoc integration between essential subtasks. This limits third-party adoption, because running these workflows requires substantial software infrastructure and specialized tooling knowledge that many research teams lack. Current design processes also face software engineering bottlenecks: the millions of files produced per campaign create significant data management and storage challenges. Without standardized configuration and modularity, users struggle to customize workflows, test software accuracy, benchmark individual models and improvements, and maintain reproducibility.
Software Solution
SSEC will extend the existing Python-based pipeline with a command-line interface and improve support for containerized deployment with Apptainer, using Slurm to manage workloads on the IPD’s high-performance computing infrastructure. This command-line interface will introduce a unified configuration system, a pilot mode for spot-checking designs, and modular block execution at each step in the design process. Data interoperability will be improved by migrating the intermediary data files to a relational database, enabling faster, more intuitive analysis. The system will support dynamic workflow customization and integration of third-party tools. Finally, the software will adopt research software engineering best practices to enhance usability and maintainability.
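To make the ideas of a unified configuration, a pilot mode, modular blocks, and database-backed intermediate data more concrete, here is a minimal Python sketch. It is not the project's actual implementation: the configuration keys, block names, table schema, and the choice of SQLite are all assumptions made for illustration, and the "design" and "filter" blocks are stand-ins that generate and threshold random scores rather than calling real models.

```python
import random
import sqlite3

# Hypothetical unified configuration; every key here is illustrative only.
CONFIG = {
    "pilot": True,        # spot-check a small sample instead of a full run
    "pilot_size": 3,      # number of candidates generated in pilot mode
    "full_size": 1000,    # number of candidates generated otherwise
    "seed": 0,
    "blocks": ["design", "filter"],   # blocks run in this order
    "filter": {"min_score": 0.5},
}

SCHEMA = """
CREATE TABLE IF NOT EXISTS designs (
    id     INTEGER PRIMARY KEY,
    name   TEXT UNIQUE,
    score  REAL,
    passed INTEGER DEFAULT 0
)
"""


def design_block(conn, config):
    """Stand-in for a design model: writes one row per candidate, not one file."""
    rng = random.Random(config["seed"])
    n = config["pilot_size"] if config["pilot"] else config["full_size"]
    rows = [(f"design_{i}", rng.random()) for i in range(n)]
    conn.executemany("INSERT INTO designs (name, score) VALUES (?, ?)", rows)


def filter_block(conn, config):
    """Stand-in for a filtering step: flags rows at or above a score threshold."""
    threshold = config["filter"]["min_score"]
    conn.execute("UPDATE designs SET passed = (score >= ?)", (threshold,))


# Registry of available blocks; a third-party block would be added here.
BLOCKS = {"design": design_block, "filter": filter_block}


def run_pipeline(config, db_path=":memory:"):
    """Run the configured blocks in order against a single database."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    for name in config["blocks"]:
        BLOCKS[name](conn, config)
        conn.commit()  # each block's results are persisted before the next runs
    return conn


if __name__ == "__main__":
    conn = run_pipeline(CONFIG)
    total, passed = conn.execute(
        "SELECT COUNT(*), SUM(passed) FROM designs"
    ).fetchone()
    print(f"{passed} of {total} pilot designs passed the filter")
```

In this sketch, switching `"pilot"` to `False` or reordering `"blocks"` changes the run without touching the block code, and a single SQL query replaces a scan over per-candidate files. A production pipeline would likely need a richer schema and a database suited to concurrent writes from many Slurm jobs.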
Impact
This project aims to achieve a four-order-of-magnitude (roughly 10,000-fold) reduction in intermediate files and temporary storage requirements per protein design campaign. It will also shorten processing time for the computational design of millions of candidate molecules and improve benchmarking, configurability, and reproducibility. The pipeline will improve block interoperability, enabling scientists to run bespoke blocks and compare model performance effectively. This system will serve as the foundation for a future open-source release and broader adoption within and beyond the IPD. As a result, this collaboration will increase transparency and bolster relationships between machine learning model developers and protein science experts in one of science’s most promising fields.

