Oceanography is witnessing a data explosion: newly deployed instruments collect high-frequency, multi-parameter datasets both remotely, with satellites, and in situ, with instruments such as the SeaFlow cytometer. As a result, oceanographers have begun to use large-scale statistical machine learning tools to analyze these datasets, accelerating scientific discovery in marine microbial ecology.
The SeaFlow cytometer continuously profiles microbial populations across thousands of kilometers of the ocean surface during research cruises, which can last several weeks. In contrast to conventional flow cytometers that sit on laboratory benches running a single experiment at a time, SeaFlow allows one to seamlessly aggregate samples across multiple data collection campaigns, thus facilitating the identification of regions showing coherent features in microbial populations.
SeaFlow continuously samples surface seawater, measuring up to 18,000 cells per second and aggregating the measurements into three-minute windows. This produces a series of measurements of the optical properties of small microbial cells (less than 10 microns, or 0.001 cm, in diameter).
Each three-minute window is made up of multivariate measurements describing the optical properties of these tiny microbes. After several weeks at sea, thousands of three-minute samples have been collected across sometimes highly variable environmental conditions, representing hundreds of gigabytes of data.
The ‘multiple change-point detection’ method developed by the University of Washington team can detect changes in scatter and fluorescence properties in these large datasets within seconds by using a dynamic programming algorithm. The detection method hinges upon the concept of kernel embedding of distributions, which is a way to summarize the samples collected during each three-minute window as an element in an infinite-dimensional Hilbert space.
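The two ideas in the paragraph above, kernel embedding of each window and a dynamic programming search over segmentations, can be illustrated with a minimal sketch. This is not the team's implementation: the Gaussian kernel, the fixed bandwidth, the assumption that the number of change-points is known in advance, and all function names are choices made here for illustration only. Each window is represented by its kernel mean embedding, the inner products between embeddings form a Gram matrix, and dynamic programming finds the segmentation that minimizes the within-segment scatter in the Hilbert space.

```python
import numpy as np

def mean_embedding_gram(windows, bandwidth=1.0):
    """Gram matrix of inner products between kernel mean embeddings.

    windows: list of (n_i, d) arrays, one per three-minute window.
    Assumes a Gaussian kernel; the bandwidth is an illustrative default.
    """
    n = len(windows)
    gram = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            diff = windows[i][:, None, :] - windows[j][None, :, :]
            sq_dist = (diff ** 2).sum(axis=-1)
            # Average kernel value over all cross pairs of samples
            gram[i, j] = gram[j, i] = np.exp(-sq_dist / (2 * bandwidth ** 2)).mean()
    return gram

def segment_costs(gram):
    """cost[a, b]: scatter of embeddings a..b-1 around their segment mean."""
    n = gram.shape[0]
    prefix = np.zeros((n + 1, n + 1))
    prefix[1:, 1:] = gram.cumsum(axis=0).cumsum(axis=1)  # 2-D prefix sums
    diag = np.concatenate([[0.0], np.cumsum(np.diag(gram))])
    cost = np.full((n + 1, n + 1), np.inf)
    for a in range(n):
        for b in range(a + 1, n + 1):
            block = prefix[b, b] - prefix[a, b] - prefix[b, a] + prefix[a, a]
            cost[a, b] = diag[b] - diag[a] - block / (b - a)
    return cost

def detect_change_points(gram, n_changes):
    """Indices where a new segment starts, for a fixed number of changes."""
    n = gram.shape[0]
    cost = segment_costs(gram)
    n_segments = n_changes + 1
    best = np.full((n_segments + 1, n + 1), np.inf)
    back = np.zeros((n_segments + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for k in range(1, n_segments + 1):
        for b in range(k, n + 1):
            # Last segment covers windows a..b-1; try every admissible start a
            candidates = best[k - 1, k - 1:b] + cost[k - 1:b, b]
            offset = int(np.argmin(candidates))
            best[k, b] = candidates[offset]
            back[k, b] = offset + k - 1
    # Backtrack from the end to recover the segment boundaries
    change_points, b = [], n
    for k in range(n_segments, 1, -1):
        b = int(back[k, b])
        change_points.append(b)
    return sorted(change_points)

# Synthetic demo: six windows whose distribution shifts after the third
rng = np.random.default_rng(0)
demo_windows = [rng.normal(0.0, 1.0, size=(50, 2)) for _ in range(3)]
demo_windows += [rng.normal(5.0, 1.0, size=(50, 2)) for _ in range(3)]
print(detect_change_points(mean_embedding_gram(demo_windows), n_changes=1))
```

The Gram matrix is the expensive step in this sketch; once it is computed, the dynamic program only touches window-level quantities, which is consistent with the search itself being fast even when each window holds many cells.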
By analyzing data from 16 cruises, the team has found that changes in the optical properties of microbial populations often coincide with changes in the temperature and salinity of the surface ocean. Across these cruises, which span a range of ocean environments and vary in length (200 to 8,000 km), the estimated biological change-points lie within five km of the nearest temperature and salinity change-point approximately 30% of the time.
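The coincidence statistic described above can be sketched as a nearest-neighbor comparison. The snippet assumes each change-point is expressed as an along-track distance in kilometers from the start of a cruise; that representation, the function name, and the example numbers are illustrative assumptions, not values from the study.

```python
import numpy as np

def fraction_within(bio_km, phys_km, tolerance_km=5.0):
    """Share of biological change-points lying within tolerance_km of the
    nearest physical (temperature and salinity) change-point."""
    bio = np.asarray(bio_km, dtype=float)
    phys = np.asarray(phys_km, dtype=float)
    # Distance from each biological change-point to its nearest physical one
    nearest = np.abs(bio[:, None] - phys[None, :]).min(axis=1)
    return float((nearest <= tolerance_km).mean())

# Made-up positions: only the first biological change-point has a
# physical change-point within 5 km
print(fraction_within([10.0, 50.0, 120.0], [12.0, 200.0]))
```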
The current survey of marine microbial diversity and biogeography across the globe is sparse and heterogeneous. Some areas have been investigated much more thoroughly than others, where less is known about microbial populations and their dynamics. Undersampled regions, where fewer observations have been made, can benefit from the knowledge extracted from areas with a greater number of observations.
The team plans to use statistical learning methods to apply insights from well-sampled regions to similar biogeographical regions with fewer observations. This progress will allow them to better enumerate the mechanisms that drive microbial population shifts.
Research team members include:
Corinne Jones, PhD student, Department of Statistics, University of Washington
Sophie Clayton, postdoctoral research fellow, School of Oceanography, University of Washington
François Ribalet, senior research scientist, School of Oceanography, University of Washington
Zaid Harchaoui, assistant professor, Department of Statistics; data science fellow, eScience Institute, University of Washington
E. Virginia Armbrust, director, School of Oceanography, University of Washington