Schematic of the proposed cloud-computing platform. A) The current paradigm, in which scientific computations are performed locally on duplicated datasets. B) The proposed toolset, which enables scientists to work entirely in the cloud via an API and scalable interactive computing resources.

The University of Washington, along with collaborators from the National Center for Atmospheric Research (NCAR), Anaconda, and Element84, has just been awarded a $1.5 million grant from the National Aeronautics and Space Administration (NASA) to develop new approaches for using satellite observations of Earth. The team will work with the Pangeo Project, a community effort for big data in the geosciences, to develop state-of-the-art open-source tools for cloud-based data analysis.

This team brings together software developers and research scientists to address emerging challenges in working with increasingly large and complex satellite data products. NASA has a long history of providing these data to the research community so that researchers can explore the complex workings of our planet. Typically, a researcher will download data from a NASA repository and carry out the analysis on a laptop or workstation. As satellite technologies improve, data are being collected at increasingly fine resolutions, creating ever larger datasets that take more time and resources to explore. By 2025, NASA estimates that it will be storing upwards of 250 petabytes of data on the commercial cloud.

This project will demonstrate a new approach, one in which researchers avoid moving data and focus instead on building tools for data analysis in a shared computing environment. The Pangeo Project provides the technological and social framework for achieving this shift.

On the technical side, the Pangeo community is exploring new software that breaks down the processing and analysis of large datasets into smaller, more manageable “chunks.” Distributed computing tools are then used to send those chunks to many different computing “workers” that can be created or destroyed in a short amount of time. At the center of it all is a scheduler that orchestrates the efficient distribution of computing tasks across many workers. The datasets themselves are stored alongside the computing infrastructure, allowing for faster computing and eliminating the need to move data. To date, this computing architecture has been tested on both institutional and commercial cloud computing systems.
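The chunk, worker, and scheduler pattern described above can be illustrated with a small sketch. This is not Pangeo's actual software: it uses only Python's standard library, with a thread pool standing in for the scheduler and its workers, and the function names (`split_into_chunks`, `partial_sum_and_count`, `chunked_mean`) and the sample data are hypothetical choices made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, chunk_size):
    """Break a large sequence into smaller, fixed-size chunks."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def partial_sum_and_count(chunk):
    """Work done by one worker: reduce a single chunk to a small summary."""
    return sum(chunk), len(chunk)


def chunked_mean(values, chunk_size=1000, n_workers=4):
    """Compute a mean by combining per-chunk summaries from a pool of workers."""
    chunks = split_into_chunks(values, chunk_size)
    # The executor plays the role of the scheduler here (an assumption for
    # illustration): it hands each chunk to whichever worker is free.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum_and_count, chunks))
    # Only the small per-chunk summaries are combined, never the raw chunks.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


if __name__ == "__main__":
    readings = list(range(10_000))  # stand-in for a large dataset
    print(chunked_mean(readings))
```

The point of the design is that each worker returns only a small summary of its chunk, so the final step combines a handful of numbers rather than moving the full dataset to one place.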

What do these technologies look like to a typical user? Pangeo is building a series of computing environments that have a set of tools for a variety of scientific disciplines such as climatology, oceanography and hydrology. A user can log in to a centralized hub that creates an instance of a “Jupyter Notebook”, a user-friendly, web-based scripting environment. Users can then issue commands to work with the data, and save key results back to their local computers.

On the social side, this work involves creating an inclusive and welcoming community that works together to build tools and educate scientists about these new approaches. We plan to host a series of events to offer training and provide a space for hacking on projects together. These efforts will help shift scientific culture toward open and reproducible software practices.

In a sense, this project team represents a microcosm of the larger Pangeo community: industry partners Anaconda (Matthew Rocklin) and Element84 (Dan Pilone) are contributing expertise in software development and in connecting NASA services to the community, while the NCAR (Ethan Gutmann, Joe Hamman) and UW (Scott Henderson, Amanda Tan, Rob Fatland and Anthony Arendt) teams are exploring scientific use cases and developing educational and community-building tools.

Further information can be found in the blog posts “Cloud Native geoprocessing of Earth Observation satellite data with Pangeo” by Scott Henderson and “Pangeo applications for NASA Earth Observing Data” by Joe Hamman.

Research contact: Anthony Arendt at arendta(at)uw.edu.