Keith Wiley: Astronomical Image Processing with Hadoop

 

Keith Wiley
Astronomy Department
University of Washington

 

 

Background

Astronomers use large telescopes to survey the sky over prolonged periods of time. By capturing multiple images of the same area and combining them in a process called coaddition, astronomers can pick out faint objects for study. The Sloan Digital Sky Survey (SDSS, 1999-2005) recorded 1/4 of the sky and produced approximately 80 TB of data. The Large Synoptic Survey Telescope (LSST), currently under construction, will capture 1/2 of the sky over a period of 10 years. The LSST will bring in 30 TB of data every night, for a total of 60 PB over its 10-year operation. A major challenge for dedicated sky surveys such as these is how to organize and process all the resulting data.

Keith Wiley, a computer scientist at the University of Washington, states that, in addition to storage for millions of image files, sky surveys like the SDSS and LSST require a high-throughput data reduction pipeline and sophisticated off-line data analysis tools.

How does Image Coaddition Work?

Also called signal averaging, image coaddition is the process of aligning and stacking multiple images of the same area of the sky. The end product is a single image with lower noise, in which fainter objects become visible for study. One way to perform coaddition on the massive amount of data produced by a sky survey is to use Hadoop, a massively parallel data-processing system.
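The noise reduction from stacking can be illustrated with a small simulation. This is a minimal sketch, not Wiley's software: it assumes the exposures are already aligned, models the sky as a flat background with Gaussian noise, and uses hypothetical helper names (coadd, noisy_exposure, pixel_stdev).

```python
import random
import statistics

def coadd(images):
    """Average equally sized, already aligned 2-D images pixel by pixel."""
    n = len(images)
    rows, cols = len(images[0]), len(images[0][0])
    return [[sum(img[r][c] for img in images) / n for c in range(cols)]
            for r in range(rows)]

def noisy_exposure(rng, rows, cols, sky_level, noise_sigma):
    """Simulate one exposure: flat sky plus Gaussian noise (illustrative only)."""
    return [[sky_level + rng.gauss(0.0, noise_sigma) for _ in range(cols)]
            for _ in range(rows)]

def pixel_stdev(image):
    """Standard deviation of all pixel values, used here as a noise estimate."""
    return statistics.pstdev([v for row in image for v in row])

rng = random.Random(42)
exposures = [noisy_exposure(rng, 32, 32, sky_level=100.0, noise_sigma=10.0)
             for _ in range(25)]
stacked = coadd(exposures)

single_noise = pixel_stdev(exposures[0])
stacked_noise = pixel_stdev(stacked)
print(f"noise in one exposure:    {single_noise:.2f}")
print(f"noise in 25-image coadd:  {stacked_noise:.2f}")
```

For independent noise, averaging N exposures reduces the noise by roughly a factor of the square root of N, so the 25-image coadd should come out about five times cleaner than a single exposure.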

What is Hadoop?

See an overview of Hadoop.

Problem: Using Hadoop to Help with Image Coaddition

Image data is loaded into the Hadoop Distributed File System (HDFS), which spreads the data across a cluster of hundreds or even thousands of computers. A program then runs on the cluster, taking a large set of images and stacking them to create a final, coadded image. The output is then written back to HDFS.
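One plausible way to picture such a job is as a map step that selects images and a reduce step that stacks them. The sketch below simulates that flow in plain Python and does not use Hadoop itself; the record layout, the region labels, and the function names (map_image, reduce_stack, run_job) are assumptions made for illustration, not details taken from Wiley's implementation.

```python
from collections import defaultdict

def map_image(record, query_region):
    """Mapper: emit (region, pixels) if the image belongs to the requested coadd."""
    if record["region"] == query_region:
        yield record["region"], record["pixels"]

def reduce_stack(region, pixel_lists):
    """Reducer: average all pixel lists that share the same key."""
    n = len(pixel_lists)
    length = len(pixel_lists[0])
    return region, [sum(p[i] for p in pixel_lists) / n for i in range(length)]

def run_job(records, query_region):
    """Run map, group by key, then reduce, all in one process."""
    grouped = defaultdict(list)  # stands in for Hadoop's shuffle phase
    for record in records:
        for key, value in map_image(record, query_region):
            grouped[key].append(value)
    return dict(reduce_stack(k, v) for k, v in grouped.items())

# Toy input: two images of region "A" and one of region "B".
records = [
    {"region": "A", "pixels": [1.0, 2.0, 3.0]},
    {"region": "A", "pixels": [3.0, 2.0, 1.0]},
    {"region": "B", "pixels": [9.0, 9.0, 9.0]},
]
result = run_job(records, "A")
print(result)
```

On a real cluster the map calls would run in parallel on the machines that hold the image data, and the result would be written back to HDFS rather than returned as a dictionary.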

Prefiltering

Reducing the amount of input data can help Hadoop run faster. In the case of image coaddition, many images in the dataset do not apply to a specific coadd because they are either the wrong color or cover a region of the sky which is not of interest to the researcher. 

Eliminating these images as input to the Hadoop job speeds up the overall process. Therefore, Wiley's coaddition software prefilters the images on the basis of color and sky location, resulting in a 7x speedup over the unfiltered dataset.
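A prefilter of this kind can be sketched as a simple test on each image's metadata. The example below is an assumption-laden illustration: the ImageMeta fields, the rectangular RA/Dec bounding boxes, and the function names are hypothetical, and the overlap test ignores the wraparound of right ascension at 360 degrees.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ImageMeta:
    path: str
    band: str       # filter ("color") the image was taken in
    ra_min: float   # bounding box of the image on the sky, in degrees
    ra_max: float
    dec_min: float
    dec_max: float

def overlaps(img, ra_min, ra_max, dec_min, dec_max):
    """True if the image's bounding box intersects the query box."""
    return (img.ra_min <= ra_max and img.ra_max >= ra_min
            and img.dec_min <= dec_max and img.dec_max >= dec_min)

def prefilter(images, band, ra_min, ra_max, dec_min, dec_max):
    """Keep only images in the requested band that touch the query region."""
    return [img for img in images
            if img.band == band
            and overlaps(img, ra_min, ra_max, dec_min, dec_max)]

catalog = [
    ImageMeta("a.fits", "r", 10.0, 10.5, -1.0, -0.5),  # right band, in region
    ImageMeta("b.fits", "g", 10.0, 10.5, -1.0, -0.5),  # wrong band
    ImageMeta("c.fits", "r", 50.0, 50.5, 20.0, 20.5),  # right band, elsewhere
]
selected = prefilter(catalog, band="r",
                     ra_min=10.2, ra_max=11.0, dec_min=-0.8, dec_max=0.0)
print([img.path for img in selected])
```

Only the paths that survive the filter would be passed to the Hadoop job as input, so the cluster never reads the irrelevant images.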

Sequence Files

For any given amount of data (for example, 100,000 image files containing 600 GB of data), Hadoop works better if those files are grouped into a small number of large files as opposed to an enormous number of tiny files. In Hadoop, these groupings are called Sequence Files. Using sequence files enables Wiley to reduce the total number of files from 100,000 to about 1,000, which yields a further 5x speedup over prefiltering alone (or 35x over the unmodified dataset).
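The idea of packing can be sketched in a few lines. This example only groups (name, payload) pairs in memory to show the bookkeeping; it does not write Hadoop's actual SequenceFile format, and the pack function, the file names, and the choice of 100 files per pack are assumptions for illustration.

```python
def pack(files, files_per_pack):
    """Group (name, payload) pairs into packs of at most files_per_pack records.

    Each pack stands in for one large container file whose records are
    keyed by the original file name.
    """
    packs = []
    for start in range(0, len(files), files_per_pack):
        packs.append(files[start:start + files_per_pack])
    return packs

# 100,000 tiny placeholder "images" (8 bytes each instead of real pixel data).
files = [(f"image_{i:06d}.fits", b"\x00" * 8) for i in range(100_000)]
packs = pack(files, files_per_pack=100)

print(f"{len(files)} small files packed into {len(packs)} large files")
```

Because each record keeps its original file name as the key, an individual image can still be identified inside a pack while the file system only has to track the much smaller number of container files.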

Using both prefiltering and sequence files together, Wiley is able to achieve another 2x speedup over sequence files alone, or about 70x over the original dataset.

SQL Prefiltering

Wiley also tried a second method of prefiltering, which uses a SQL database. All the image color and sky coverage data was loaded into a database. Before the images are processed, the database is queried in SQL for the desired color and sky coverage bounds, so that only the relevant images are sent to Hadoop. The results showed only marginal speed improvements for SQL compared to sequence file prefiltering. For larger databases, it is predicted that SQL would outperform standard prefiltering.
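The same selection can be expressed as a SQL query over a metadata table. The sketch below uses Python's built-in sqlite3 module with an in-memory database purely for illustration; the source does not say which database Wiley used, and the table name, column names, and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE images (
        path    TEXT PRIMARY KEY,
        band    TEXT,
        ra_min  REAL, ra_max  REAL,
        dec_min REAL, dec_max REAL
    )
""")
conn.executemany(
    "INSERT INTO images VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("a.fits", "r", 10.0, 10.5, -1.0, -0.5),  # right band, in region
        ("b.fits", "g", 10.0, 10.5, -1.0, -0.5),  # wrong band
        ("c.fits", "r", 50.0, 50.5, 20.0, 20.5),  # right band, elsewhere
    ],
)

# Query bounds for the desired coadd (degrees); RA wraparound is ignored.
q_ra_min, q_ra_max, q_dec_min, q_dec_max = 10.2, 11.0, -0.8, 0.0

query = """
    SELECT path FROM images
    WHERE band = ?
      AND ra_min  <= ? AND ra_max  >= ?
      AND dec_min <= ? AND dec_max >= ?
    ORDER BY path
"""
params = ("r", q_ra_max, q_ra_min, q_dec_max, q_dec_min)
selected_paths = [row[0] for row in conn.execute(query, params)]
print(selected_paths)
```

The list of paths returned by the query would then be handed to the Hadoop job as its input. With indexes on the band and coordinate columns, a database can answer such a query without scanning every row, which is one possible reason to expect better scaling on larger catalogs.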

Lessons Learned: Large Files Essential

Using Hadoop to aid in astronomical image coaddition from large sky surveys produced the following conclusions:

  • Packing many small files into a few large files is essential.
  • Structured packing and associated prefiltering offer significant gains by reducing the mapper load.
  • SQL prefiltering of structured sequence files performs comparably to driver prefiltering, but we anticipate superior performance on larger databases.

Learn More

UW Astronomy Survey Science Group

Sloan Digital Sky Survey

Large Synoptic Survey Telescope

Hadoop MapReduce