Scalable Manifold Learning for Large Astronomical Survey Data

Project lead: Marina Meila, UW Department of Statistics

eScience Liaison: Jake VanderPlas, Director of Research – Physical Sciences, UW eScience Institute

Manifold learning (ML), also known as non-linear dimension reduction, finds a non-linear representation of high-dimensional data with a small number of parameters. ML is data intensive: it has been shown statistically that the estimation accuracy depends asymptotically on the sample size $N$ like $N^{1/(\alpha d + \beta)}$, so ML requires large amounts of data when the intrinsic dimension $d$ is larger than a few. On the other hand, manifold learning fully realizes its potential for scientific discovery in very large multi-dimensional data sets representing partially known physical systems (e.g. spectra of galaxies), where there is reason to believe that the data can be modeled by a small set of parameters.
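To make the scaling concrete: if accuracy grows like N raised to the power 1/(α d + β), then improving accuracy by a given factor requires multiplying N by that factor raised to the power (α d + β). The short Python sketch below tabulates this for a few intrinsic dimensions. The values α = 1 and β = 2 are placeholders chosen only for illustration; the text above does not specify them, and the function name is hypothetical.

```python
def sample_multiplier(gain, d, alpha=1.0, beta=2.0):
    """Factor by which N must grow to improve accuracy by `gain`,
    assuming accuracy scales like N ** (1 / (alpha * d + beta)).

    alpha and beta are illustrative placeholders, not values from the project.
    """
    return gain ** (alpha * d + beta)


# How much more data is needed to double the accuracy at each intrinsic dimension d?
multipliers = {d: sample_multiplier(2.0, d) for d in (1, 2, 5, 10)}
for d, m in multipliers.items():
    print(f"d = {d:2d}: N must grow by a factor of {m:,.0f}")
```

Under these assumed constants, doubling the accuracy takes 8 times more data at d = 1 but 4096 times more at d = 10, which is why the data requirement climbs so quickly with the intrinsic dimension.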

We therefore implemented a software suite that enables scientists and methodologists alike to scale a broad class of manifold learning methods to very large data sets. In particular, the software can be used to analyze spectroscopic data from the Sloan Digital Sky Survey (SDSS), as well as data from other astronomical surveys. The software is written in Python and builds upon the existing scikit-learn library for machine learning. Our project demonstrates, contrary to common belief, that with careful implementation ML can be made tractable on large data sets.
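As a rough illustration of the kind of computation involved, the sketch below runs one manifold learning method from stock scikit-learn on synthetic data. It is not the project's software or its API: `SpectralEmbedding` and `make_swiss_roll` are standard scikit-learn calls, the swiss roll stands in for real survey data, and the sample size and neighbor count are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import SpectralEmbedding

# Synthetic stand-in for survey data: points on a 2-D sheet rolled up in 3-D.
X, t = make_swiss_roll(n_samples=1000, noise=0.05, random_state=0)

# A k-nearest-neighbor affinity keeps the graph sparse, which is one of the
# ingredients that lets spectral methods cope with larger sample sizes.
embedder = SpectralEmbedding(
    n_components=2,
    affinity="nearest_neighbors",
    n_neighbors=10,
    random_state=0,
)
Y = embedder.fit_transform(X)

print(X.shape, "->", Y.shape)
```

Each row of `Y` gives two coordinates for the corresponding three-dimensional point in `X`, which is the "small number of parameters" representation that manifold learning aims for.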

See the project GitHub here.