Computational Methods
The advance of eScience depends on the improvement of existing computational methods as well as the development of new approaches. The following areas have been identified as key to the success of the eScience endeavor.
Data Management
All of science is reducing to a problem of data management! Hypotheses are increasingly tested by evaluating queries over massive disk farms – in ferro experiments – rather than relying solely on in situ, in vitro, and in silico techniques as a primary means of scientific discovery. This trend towards data-intensive science can be attributed to advances in data acquisition technology: high-throughput lab techniques, remote sensing platforms, and, in the case of in silico experiments, high-resolution computational modeling. Traditionally, each data acquisition activity was designed to test an individual hypothesis, but technology now allows researchers to collect data rather wantonly – to "download the world" – exchanging the problem of how to extract information from the environment for one of how to extract information from a database.
Unfortunately, the infrastructure to design and conduct in ferro experiments has not kept pace with our collective ability to gather data, leading to an unprecedented situation: Data analysis is now the bottleneck to discovery.
At the eScience Institute, we are attacking the data management problem at all scales by facilitating a climb up the data management technology ladder – coalescing spreadsheets into relational databases, migrating relational databases to the cloud, and developing novel algorithms for querying massive cloud-based datasets.
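As a rough illustration of the first rung of that ladder, the sketch below loads two spreadsheet-style CSV exports into a relational database and then answers a question with a single query. It uses Python's standard sqlite3 module as a stand-in for a full database server; the table names, column names, and sample values are hypothetical and are not taken from any eScience Institute project.

```python
import csv
import io
import sqlite3

# Hypothetical spreadsheet exports (in practice, CSV files saved from a spreadsheet).
STATIONS_CSV = """\
station_id,depth_m
S1,5
S2,40
"""
READINGS_CSV = """\
station_id,salinity_psu
S1,28.0
S1,30.0
S2,33.0
"""


def build_database():
    """Coalesce the two 'spreadsheets' into one in-memory relational database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stations (station_id TEXT PRIMARY KEY, depth_m REAL)")
    conn.execute("CREATE TABLE readings (station_id TEXT, salinity_psu REAL)")
    # csv.DictReader yields one dict per row, which matches the named placeholders.
    conn.executemany(
        "INSERT INTO stations VALUES (:station_id, :depth_m)",
        csv.DictReader(io.StringIO(STATIONS_CSV)),
    )
    conn.executemany(
        "INSERT INTO readings VALUES (:station_id, :salinity_psu)",
        csv.DictReader(io.StringIO(READINGS_CSV)),
    )
    return conn


def mean_salinity_by_station(conn):
    """Join the two tables and average the readings per station, ordered by depth."""
    query = """
        SELECT s.station_id, s.depth_m, AVG(r.salinity_psu)
        FROM stations AS s
        JOIN readings AS r ON r.station_id = s.station_id
        GROUP BY s.station_id, s.depth_m
        ORDER BY s.depth_m
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in mean_salinity_by_station(build_database()):
        print(row)
```

Once the data sit in tables, a question such as "are deeper stations saltier?" becomes a query rather than a manual pass over several worksheets, which is the sense in which a hypothesis is tested in ferro.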
Parallel Programming Abstractions
Immense quantities of data require immense computational power for analysis. Parallel data analysis tools are, generally, still in their infancy and must be advanced to address eScience problems. Leveraging recent developments in microprocessor and networking technologies is key to achieving these goals.
- Hadoop is an open source implementation of Google's MapReduce programming model for simplifying parallel data processing. Pig is an abstraction layer over the map and reduce primitives offered by Hadoop that provides relational algebra operators (filter, join, groupby, union, and more) and a richer execution model than vanilla Hadoop. Unlike vanilla MapReduce, Pig provides binary operators allowing joint processing of two related datasets.
- Dryad is a high-performance general-purpose distributed computing engine that is designed to simplify the task of implementing data-intensive distributed applications on clusters of Windows Server computers. The Dryad model offers an improvement over the MapReduce programming model, supporting arbitrary multi-step computations, a set of operators derived from the relational algebra, algebraic cost-based optimization, and language integration via DryadLINQ. DryadLINQ integrates Dryad directly into the .NET framework, providing language-level type safety, debugging tools, and all the standard editing features of Visual Studio. With DryadLINQ, developers implement Dryad applications in managed code by using an extended version of the LINQ programming model and API, and a DryadLINQ provider handles the details of executing the queries as Dryad jobs.
The eScience Institute has a 10-node Dryad cluster available for exploration and development by researchers and scientists who are interested in understanding how Dryad might help solve problems in their areas. Contact info escience.washington.edu for more information about using Dryad.
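To make the map and reduce primitives mentioned above concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop, Pig, or Dryad code, and every function name in it is hypothetical. It shows a word count written as one mapper and one reducer, followed by a join of two datasets written the way it is commonly done on plain MapReduce: each record is tagged with its source so that a single reducer can pair them up. A relational join operator, as offered by higher-level layers, hides that bookkeeping.

```python
from collections import defaultdict
from itertools import chain


def map_reduce(records, mapper, reducer):
    """Single-process stand-in for the model: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):  # sorted only to make the output deterministic
        output.extend(reducer(key, groups[key]))
    return output


# --- Word count: the classic one-input job ---------------------------------
def count_mapper(line):
    for word in line.split():
        yield word.lower(), 1


def count_reducer(word, counts):
    yield word, sum(counts)


# --- Join of two datasets, emulated with source tags -----------------------
def join_mapper(tagged_record):
    source, (key, payload) = tagged_record
    yield key, (source, payload)


def join_reducer(key, tagged_values):
    samples = [p for source, p in tagged_values if source == "samples"]
    sites = [p for source, p in tagged_values if source == "sites"]
    for sample in samples:
        for site in sites:
            yield key, sample, site


def join(samples, sites):
    """Inner join of two lists of (key, payload) pairs on their key."""
    tagged = chain(
        (("samples", record) for record in samples),
        (("sites", record) for record in sites),
    )
    return map_reduce(tagged, join_mapper, join_reducer)


if __name__ == "__main__":
    print(map_reduce(["a b a", "B"], count_mapper, count_reducer))
    print(join([("s1", "x"), ("s2", "y")], [("s1", "lake")]))
```

In a real cluster the grouping step is a distributed shuffle and the mappers and reducers run on many machines; the sketch keeps only the shape of the computation.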
Scientific Workflow
As data volumes grow, the only way scientists can interact with their data is with computer programs. The days of personal scrutiny and manual validation of all data are gone -- now programs do the job for us. Programming is traditionally a specialized skill, however. How do we empower scientists to author, share, and reuse programs without first requiring years of training as a programmer?
This question is actively studied in the eScience and database communities under the heading scientific workflow. Scientific workflow systems such as Taverna, Trident, VisTrails, and Kepler are all designed to raise the level of abstraction for creating and sharing robust, reproducible computational experiments and data processing pipelines. These systems offer a variety of language features that set them apart from a general-purpose programming language: provenance management, visual "boxes-and-arrows" programming, automatic task-parallel execution models, seamless integration with external tools and services, a rich library of domain-specific toolkits, and more.
At the eScience Institute, we participate in the workflow community and are assessing the value of these tools for specific science problems at the UW.
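Two of the features listed above, the "boxes-and-arrows" structure and provenance management, can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the API of Taverna, Trident, VisTrails, or Kepler: the Workflow class, its methods, and the example tasks are all hypothetical. Tasks are the boxes, the declared inputs are the arrows, and each run records which task consumed which upstream results.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class Workflow:
    """Toy workflow: named tasks wired into a dependency graph, with a run log."""

    def __init__(self):
        self.tasks = {}       # task name -> (function, list of upstream task names)
        self.provenance = []  # one record per executed task, in execution order

    def add_task(self, name, func, inputs=()):
        self.tasks[name] = (func, list(inputs))

    def run(self):
        # Map each task to the set of tasks it depends on, then run in dependency order.
        graph = {name: set(inputs) for name, (_, inputs) in self.tasks.items()}
        results = {}
        for name in TopologicalSorter(graph).static_order():
            func, inputs = self.tasks[name]
            results[name] = func(*(results[upstream] for upstream in inputs))
            self.provenance.append(
                {"task": name, "inputs": inputs, "output": results[name]}
            )
        return results


def build_example():
    """Hypothetical three-step pipeline: load -> clean -> mean."""
    wf = Workflow()
    wf.add_task("load", lambda: [3.0, 1.0, 2.0])
    wf.add_task("clean", lambda values: sorted(values), inputs=["load"])
    wf.add_task("mean", lambda values: sum(values) / len(values), inputs=["clean"])
    return wf


if __name__ == "__main__":
    wf = build_example()
    print(wf.run()["mean"])
    for record in wf.provenance:
        print(record)
```

The sketch runs tasks one at a time. Because the dependency graph is explicit, tasks with no path between them could in principle be dispatched concurrently, which is the idea behind the task-parallel execution models mentioned above.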
Machine Learning & Data Mining
Once data have been acquired from simulations, large-output devices, and sensor networks, knowledge must be extracted from them. Traditional methods requiring focused human attention at every step of the process simply do not scale to petabyte data sets. The NSF has recognized the importance of technologies relevant to these problems with its Cyber-Enabled Discovery and Innovation (CDI) program.
Visualization
Automated pattern recognition and knowledge extraction from immense data sets are essential, but enabling the unparalleled human capacity for pattern recognition is equally important. Next-generation visualization methods must be developed that provide productive ways of navigating data with fantastic resolution and unheard-of dynamic range. These methods encompass hardware platforms, software tools, and the interface between them.
Notes on Bandwidth-Intensive Visualization
