Andrew Connolly: Large Scale Data Processing and Astronomy: Mashups, Widgets, and Custom Configurable Data Applications
Department of Astronomy
University of Washington
Andrew Connolly is an Associate Professor in the UW Department of Astronomy and a three-year veteran of the Survey Science Group (SSG). Before coming to the UW, Connolly was one of the technical leads in the development of Google Sky, a feature of Google Earth that uses a combination of visual data to allow the user to explore the universe.
Connolly’s current focus is on how to use large datasets/large data streams in astronomy. A range of questions frame his work:
“How do you take data and process it, to how do you run analysis tools on it, to how do you do the visualizations, and how do you do the science with these big data streams?”
Connolly and the SSG are currently preparing for the data stream from the Large Synoptic Survey Telescope (LSST). Now under construction, the LSST will survey half the sky every three nights, producing about 60TB of data every week.
Connolly described two goals of the SSG:
“One of the goals of our group is to take that data flow [from the LSST] and process it in almost real time, which involves comparing images from one night to the next night and looking for changes. Within 60 seconds of those data being taken, we want to send out an alert to the astronomy community that there is a variable source (could be a supernova, asteroid, etc.) and maybe somebody should take a look. If you can send the alert out quickly, another telescope can train on the object and see what the transient event is.”
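The night-to-night comparison Connolly describes can be illustrated with a small sketch. This is not the LSST pipeline; it assumes two already aligned images of the same field stored as nested lists, and the function name `find_transients` and the `n_sigma` threshold are hypothetical choices made for illustration.

```python
from statistics import median


def find_transients(previous, current, n_sigma=5.0):
    """Return (row, col, delta) for pixels whose brightness changed significantly.

    Assumes both images are aligned and have the same shape.
    """
    diffs = [
        (r, c, current[r][c] - previous[r][c])
        for r in range(len(current))
        for c in range(len(current[r]))
    ]
    values = [delta for _, _, delta in diffs]
    center = median(values)
    # Median absolute deviation gives a noise estimate that a few bright
    # transients cannot skew; 1.4826 scales it to a Gaussian sigma.
    mad = median(abs(v - center) for v in values)
    sigma = 1.4826 * mad
    if sigma == 0:
        sigma = 1.0  # fallback for perfectly flat difference images
    return [(r, c, d) for r, c, d in diffs if abs(d - center) > n_sigma * sigma]
```

A real alert system would also have to handle image registration, varying seeing, and moving objects; the sketch only shows the core idea of subtracting one night from the next and flagging outliers.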
The second main goal of the SSG is enabling astronomers to effectively use the resulting large catalog of data.
“Once we have detected all the transient events, they are transferred to a national supercomputing center, then transferred into an archive which will grow at about 1PB of catalog data or 6PB of imaging data every year.”
Connolly thinks the challenge of processing 20TB of data a night is manageable but enabling astronomers to interact with the eventual 5PB+ catalogs of data is slightly more daunting. He described how astronomy data can be used for thousands of different projects to answer lots of questions, including those about the formation and structure of our solar system, how the galaxy formed, stars in nearby and distant galaxies, and the formation and evolution of the universe.
Problem: Working with Millions of Disparate Data Sources
According to Connolly, many astronomers are accustomed to analyzing data on their workstations, but with the sheer amount of data that will soon be available, that model will no longer be viable.
“Many astronomers are not up to date on the latest data structures, algorithms, and statistical techniques that are available. While they are used to working with a few hundred or a few thousand sources, the idea of scaling up to millions or hundreds of millions of sources - people don’t necessarily think about that when designing their individual analysis tools. To enable the science that is going to come out of the LSST, we need to build scalable tools.”
As an example of the kind of tool the SSG is working on, Connolly mentions Hadoop, an open-source framework for distributed, parallel data processing.
“One thing we have been working on is taking Hadoop and using it as an image processing framework allowing the astronomer to work on how they want to analyze the data and helping them to spread that out across thousands of processes.”
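The pattern behind this can be sketched without Hadoop itself. The example below emulates the map, shuffle, and reduce steps locally in plain Python for one hypothetical task: averaging repeated exposures of the same sky tile. The record layout (`tile`, `pixels`) and the function names are assumptions for illustration, not part of Hadoop or of the SSG's code.

```python
from collections import defaultdict


def map_exposure(exposure):
    """Map step: key each exposure by the sky tile it covers."""
    return exposure["tile"], exposure["pixels"]


def reduce_tile(tile, pixel_lists):
    """Reduce step: average all exposures of one tile, pixel by pixel."""
    n = len(pixel_lists)
    stacked = [sum(column) / n for column in zip(*pixel_lists)]
    return tile, stacked


def run_map_reduce(exposures):
    """Local stand-in for the framework: map, group by key, then reduce."""
    grouped = defaultdict(list)
    for tile, pixels in map(map_exposure, exposures):
        grouped[tile].append(pixels)  # the "shuffle" phase
    return dict(reduce_tile(tile, lists) for tile, lists in grouped.items())
```

The astronomer writes only the two small functions; in a cluster setting the framework, rather than `run_map_reduce`, would distribute the map and reduce calls across many machines.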
Many Data Streams, Multiple Research Needs
According to Connolly, the amount of data available to astronomers is large but scattered across different sites around the country.
“We have a lot of streams of data in astronomy available today…throughout the electromagnetic (EM) spectrum (infrared, gamma, etc.). We’ve done large imaging surveys of the sky in about 81 separate wavelengths or frequencies. So how do we take all these separate data sets spread throughout the country and make them available to individual astronomers?”
Though it might be possible to build analysis toolkits that work with all the available data, the problem is that particular research questions require specific analysis methods. There is no one-size-fits-all solution.
“You could build incredibly rich and powerful analysis toolkits…that would enable you to take data from an X-ray survey and crossmatch it with an optical survey and overlay those sources on the sky and see how they correlate. The problem is that not everyone wants to interact with the data in the same way…one person is interested in very distant galaxies and the other may be interested in solar system objects that are moving. How do you enable a user to come in and pull together lots of common applications and bring them together and build their own mashups or toolkits?”
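The crossmatch Connolly mentions, pairing sources from an X-ray survey with sources from an optical survey, can be sketched as a nearest-neighbor search on the sky. The catalog layout (dictionaries with `id`, `ra`, and `dec` in degrees), the function names, and the 2-arcsecond default radius are assumptions for illustration only.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees, using the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    # Clamp to guard against rounding pushing sqrt(a) just above 1.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def crossmatch(xray_sources, optical_sources, radius_arcsec=2.0):
    """Pair each X-ray source with its nearest optical source within the radius.

    Returns a list of (xray_id, optical_id, separation_arcsec) tuples.
    """
    radius_deg = radius_arcsec / 3600.0
    matches = []
    for x in xray_sources:
        nearest, nearest_sep = None, radius_deg
        for o in optical_sources:
            sep = angular_separation_deg(x["ra"], x["dec"], o["ra"], o["dec"])
            if sep <= nearest_sep:
                nearest, nearest_sep = o, sep
        if nearest is not None:
            matches.append((x["id"], nearest["id"], nearest_sep * 3600.0))
    return matches
```

This brute-force loop compares every pair, which is fine for a few thousand sources but would not scale to the hundreds of millions discussed above; at that size a spatial index or a partitioned, parallel approach would be needed.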
Connolly and the SSG are working on just this type of solution.
Solution: Astronomy Widgets
Using iGoogle gadgets as an analogy, Connolly describes the idea of Astronomy Widgets:
“Could we build the same sort of thing for astronomy but one step further where each of these gadgets talks to the others? We could develop a framework where we have lots of little gadgets for astronomy….one can view images in the sky…another one where you can type the name of an object, and it will return the coordinates, one will query databases. All of these gadgets or widgets will talk to each other.”
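One common way to let independent gadgets talk to each other is a publish/subscribe message bus. The sketch below is not the SSG's framework; the class names, the `coordinates` topic, and the one-entry lookup table (approximate coordinates for M31) are hypothetical and only illustrate the idea of a name-lookup widget driving an image viewer.

```python
from collections import defaultdict


class WidgetBus:
    """Minimal publish/subscribe hub that widgets share."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, payload):
        # Deliver the message to every widget listening on this topic.
        for handler in self._subscribers.get(topic, []):
            handler(payload)


class NameResolverWidget:
    """Turns an object name into coordinates and announces them."""

    CATALOG = {"M31": (10.68, 41.27)}  # toy table: approximate RA, Dec in degrees

    def __init__(self, bus):
        self.bus = bus

    def resolve(self, name):
        ra, dec = self.CATALOG[name]
        self.bus.publish("coordinates", {"name": name, "ra": ra, "dec": dec})


class ImageViewerWidget:
    """Recenters itself whenever any widget publishes coordinates."""

    def __init__(self, bus):
        self.center = None
        bus.subscribe("coordinates", self.on_coordinates)

    def on_coordinates(self, message):
        self.center = (message["ra"], message["dec"])
```

Because the widgets know only the bus and the message format, a database-query widget could subscribe to the same topic without either existing widget changing.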
Astronomers will be able to use this iGoogle-like astronomy dashboard to query large datasets like the one resulting from the LSST.
“The goal behind all of this is enabling science with big datasets and making it easier for an astronomer so they can focus on the physics of the problem and not the computational load. We want to try to educate them in how to do the computing of the data structure but we also try to make it easy for people.”
Lessons Learned: Education is Important
When asked about the use of cloud computing for his work, Connolly replied:
“Today, the way we work with large datasets, we keep them in databases but people download the data from the database to their workstation and analyze it there. It’s clear when you move to TB and PBs, that won't scale. It was clear that we needed to move the processing to the data which drove us towards parallel computing and the ability to run it on the cloud was just a natural next step.”
In closing, Connolly advocated for increased efforts in educating the next generation of astronomers:
“Moving forward, as we bring in these new technologies, we have to train the next generation -- students in undergrad and graduate programs. We have to educate the student population in how to use these data sets because they are ultimately the ones that are going to make the breakthroughs.”