Understanding Cloud Computing for Research and Teaching

Overview: Cloud Computing for eScience, Research, and Teaching

Cloud computing refers to computing facilities provided on demand, served over the Internet from shared data centers that exploit enormous economies of scale. Rather than purchasing a cluster of computers, finding space in your local lab, hiring an administrator, and then letting the facility sit idle when not needed, you can outsource your computing to remote facilities in the cloud, and pay only for what you use. This guide provides an overview of cloud computing, and discusses some benefits cloud computing services offer for both research and teaching.

Cloud computing is cost-competitive for a wide variety of workloads and application scenarios, such as 24/7 Web applications, large-scale data processing, and "bursty" or "spiky" CPU-intensive workloads. For the first time, renting 1000 computers for one hour costs the same as renting one computer for 1000 hours.
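
The arithmetic behind that last claim can be sketched in a few lines. This is a minimal illustration that assumes simple pay-per-use pricing with a flat hourly rate per machine; the rate used here (10 cents per machine-hour) and the function name are hypothetical and do not reflect any provider's actual prices.

```python
# Hypothetical flat price per machine-hour, in cents (an assumption for illustration,
# not a real provider's rate).
HOURLY_RATE_CENTS = 10


def rental_cost_cents(machines: int, hours: int, rate_cents: int = HOURLY_RATE_CENTS) -> int:
    """Cost under simple pay-per-use pricing: machines x hours x hourly rate."""
    return machines * hours * rate_cents


# A one-hour burst on 1000 machines versus 1000 hours on a single machine.
burst = rental_cost_cents(machines=1000, hours=1)
serial = rental_cost_cents(machines=1, hours=1000)

print(burst, serial, burst == serial)
```

Both cases consume 1000 machine-hours, so the bill is the same; the difference is that the burst delivers results in one hour instead of roughly six weeks.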
 
Cloud computing also offers a low barrier to entry for system administration, providing a simple interface to manage multiple computers. However, the key advantage of cloud computing is elasticity. Typically, a cloud consists of a dynamically assigned group of computers that can scale up quickly at your request. Further, these computers are virtual machines that you can configure with few constraints. This extreme flexibility offers transformative new ways to use computing in research:

  • The weekend before a paper deadline, your students spin up hundreds of computers to finish the experiments in a few hours. When the paper is submitted, these facilities can be switched off and then cost you nothing. 
  • In the classroom, each student uses their own isolated working environment in which they install their own software, run jobs without contention, and break things without risk. 
  • For reproducible research, your post-doc saves the virtual machine they used to perform the analysis, makes it public, and cites it in the paper. Anyone can then swipe their credit card, run the virtual machine, and reproduce or extend the analysis themselves. 
  • For collaboration, your team can create a temporary shared development workspace in the cloud, avoiding the security risk of giving outside collaborators access to university equipment.

Cloud Computing "-as-a-Service"

The term cloud computing may be applied to products that fall broadly into three categories:

  • Software-as-a-Service (SaaS): Applications served over the Internet, like Google Docs.
  • Platform-as-a-Service (PaaS): Specialized APIs for building applications on the Internet, like Google App Engine or Force.com.
  • Infrastructure-as-a-Service (IaaS): Low-level services for basic storage and computing. A variety of services are now available: Amazon Web Services, Windows Azure, and now Google Compute Engine.

Figure 1 illustrates the relationship between these categories of cloud computing services. The figure places several example services into three vertical sections representing IaaS, PaaS, and SaaS.

The examples toward the left of the figure are less constrained but provide less automation; the user must assemble these services into a useful application. For example, Amazon's Elastic Compute Cloud (EC2) and Simple Storage Service (S3) are very general services that can be used in the context of many different applications.

The examples at the right of the figure provide capabilities "out of the box," but are constrained in how they can be used. For example, Google Docs provides word processing, spreadsheet, and presentation applications over the Web. Researchers will likely find IaaS, and to a lesser extent, PaaS, most useful, while SaaS tools are broadly useful for individual work and collaboration. You can also learn more about UW's adoption of SaaS tools.

Cloud Computing as Data-Intensive Scalable Computing

The term cloud computing is also sometimes applied to a class of distributed data-processing platforms that rely on neither shared memory nor shared storage. Instead, they exploit clusters of cheap PCs organized in a "shared-nothing" configuration.

MapReduce, a programming model implemented in an open source project called Hadoop, has become a popular choice of platform for data-intensive scalable computing and is frequently associated with the term cloud computing. MapReduce allows a scalable, distributed program to be expressed in terms of two relatively simple serial functions: a Map function that is applied independently to each piece of the input and emits intermediate key-value pairs, and a Reduce function that combines the intermediate values sharing a key into the final output. A wide variety of scalable, parallel algorithms have been shown to be expressible using just these two functions. Further, MapReduce programs are fault-tolerant; if a computer fails during processing, only the failed task needs to be restarted, rather than the entire job.
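
To make the two functions concrete, here is a minimal single-process sketch of the MapReduce model applied to word counting. It is plain Python and does not use Hadoop's API; the names map_words, reduce_counts, and run_mapreduce are hypothetical, and the grouping step stands in for the shuffle that a real framework performs across machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one line of input."""
    for word in line.lower().split():
        yield (word, 1)


def reduce_counts(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce: combine all values emitted for one key into a single total."""
    return (word, sum(counts))


def run_mapreduce(lines: List[str]) -> Dict[str, int]:
    """Run map over every line, group the pairs by key, then reduce each group."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for line in lines:                      # each line could be mapped on a different machine
        for key, value in map_words(line):
            groups[key].append(value)       # stand-in for the distributed shuffle step
    return dict(reduce_counts(key, values) for key, values in groups.items())


word_counts = run_mapreduce(["the cloud scales", "the cloud is elastic"])
print(word_counts)
```

Because map_words looks at one line at a time and reduce_counts looks at one key at a time, a framework is free to spread both phases over many machines and to rerun any single call that fails.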

The success of MapReduce has spawned an entire ecosystem of related systems and tools, all of which have a relationship to the "big data" side of cloud computing.

Facebook created Hive, which allows SQL queries to be compiled to a series of MapReduce jobs, providing a very lightweight mechanism for large-scale query processing. Yahoo created Pig, a language with similar expressive power to SQL but a different syntax more familiar to programmers accustomed to imperative programming languages such as C. Microsoft Research has produced a shared-nothing system called Dryad that is far more general than MapReduce.
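
The idea of compiling SQL to MapReduce can be illustrated with a single aggregate query. The sketch below is an assumption-laden illustration, not the plan any real query compiler produces: the table readings(station, temp), the query, and the function names are all hypothetical, and everything runs in one Python process.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical query, for illustration only:
#   SELECT station, MAX(temp) FROM readings GROUP BY station;

Row = Tuple[str, float]  # (station, temp)


def map_row(row: Row) -> Tuple[str, float]:
    """Map: the GROUP BY column becomes the key, the aggregated column the value."""
    station, temp = row
    return (station, temp)


def reduce_group(station: str, temps: List[float]) -> Tuple[str, float]:
    """Reduce: apply the aggregate (here MAX) to all values that share a key."""
    return (station, max(temps))


def run_query(rows: List[Row]) -> Dict[str, float]:
    """Group mapped rows by key, then reduce each group to one output row."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for row in rows:
        key, value = map_row(row)
        groups[key].append(value)
    return dict(reduce_group(key, values) for key, values in groups.items())


readings = [("SEA", 12.5), ("PDX", 15.0), ("SEA", 14.0)]
result = run_query(readings)
print(result)
```

More complex queries, such as joins followed by aggregates, would need several such jobs chained together, which is why the compiled form is a series of MapReduce jobs rather than one.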

Parallel relational databases, although they predate MapReduce and related tools, have seen their market grow in the last few years. UW researchers created HaLoop, an extension to Hadoop to support iterative algorithms found in data mining, machine learning, and graph processing. Cloudera, Inc. provides user interfaces for managing Hadoop jobs, as well as best practices and documentation. Amazon Web Services provides the Elastic MapReduce service, an implementation of Hadoop native to the cloud, with no need to install software.

The UW eScience Institute has significant research and engineering experience using these tools for processing large and complex scientific data. Contact us for more information.

Get Started

To explore how cloud computing can be used in your research, browse the guides and profiles to find your application scenario, refer to the how-to pages for specific instructions, or contact the eScience Institute for specific advice. Ask us about free cloud access for testing and evaluation.

More Information

Getting started with Amazon Web Services

Decide which platform to use

Course Materials on Data-Intensive Computing in the Cloud

Life Sciences Applications on AWS

AWS-related Life Sciences Publications