For an overview of the Incubator Program click here.

Repeating Earthquake Detection Final Report

"Helicorder" plot of six hours of raw seismic data during the lead-up to a volcanic eruption. Each line is 15 minutes worth of data, and each blip is a potential repeating earthquake we seek to automatically identify.

Project Lead: Alicia Hotovec-Ellis, Graduate Researcher, Earth and Space Sciences
Advisor: John Vidale, Professor, Earth and Space Sciences
eScience Liaison: Jake Vanderplas, Director of Research – Physical Sciences, UW eScience Institute

In this project, we aimed to provide an open-source tool for seismologists to cluster repeating earthquakes in continuous data. The primary focus was to do this in near real-time as part of network operations (e.g., for the Pacific Northwest Seismic Network (PNSN)), while retaining the flexibility to work with archived data. Most processing of repeating earthquakes requires a priori knowledge of what the earthquakes look like, which is not available in real time.

In REDPy, we automatically detect and associate each new potential repeating earthquake. This is possible through an online clustering algorithm (IncOPTICS: Incremental Ordering Points To Identify the Clustering Structure, Kriegel et al.), which reduces the number of calculations required as the catalog grows. OPTICS also allows flexible definitions of a cluster and places minimal restrictions on how separated in time two repeats may be. We also use a database-like structure built on PyTables to efficiently store and recall data.
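REDPy's incremental implementation is not shown here, but the underlying idea, clustering events by waveform similarity, can be sketched with scikit-learn's batch OPTICS. Everything below (the synthetic templates, the zero-lag correlation measure, the parameter values) is illustrative and not REDPy's actual code:

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)

# Two synthetic "repeating earthquake" templates; real REDPy input would be
# windowed waveforms cut from continuous seismic data.
t = np.linspace(0.0, 1.0, 200)
template_a = np.sin(2 * np.pi * 8 * t) * np.exp(-4 * t)
template_b = np.sin(2 * np.pi * 3 * t) * np.exp(-2 * t)

# Ten noisy repeats of each template.
waveforms = np.array(
    [template_a + 0.02 * rng.standard_normal(t.size) for _ in range(10)]
    + [template_b + 0.02 * rng.standard_normal(t.size) for _ in range(10)]
)

def ncc(x, y):
    """Zero-lag normalized cross-correlation of two waveforms."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    return float(np.dot(x, y) / x.size)

# Dissimilarity matrix: 1 - correlation (0 for identical waveforms).
n = len(waveforms)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = 1.0 - ncc(waveforms[i], waveforms[j])

# Batch OPTICS over the precomputed distances; events whose reachability
# stays below eps are grouped into the same repeating-earthquake cluster.
clusters = OPTICS(min_samples=5, metric="precomputed",
                  cluster_method="dbscan", eps=0.5).fit_predict(dist)
print(clusters)
```

Because OPTICS works from an ordering of points rather than a fixed partition, cluster membership can be re-extracted with different thresholds without recomputing all pairwise correlations, which is what makes an online variant attractive as the catalog grows.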

The code is still in development; however, we have already made strides in facilitating new research at Mount St. Helens. Once the code is complete and running at the PNSN, it will increase automated monitoring of the active volcanoes in the Cascades. We also plan to release the code as an open-source package to the rest of the seismological community, in the hope that it will offer a more standardized way of identifying repeating earthquakes in large datasets.

Figure 1 (below) shows an example of a small test dataset from the beginning of the 2004 eruption of Mount St. Helens. We know from previous research that many, but not all, of the earthquakes during this time period are nearly identical to each other. Figure 2 is one of the outputs of REDPy, visualizing the identified repeating earthquakes in an ordering that makes clusters more visible. For example, the top ~50 rows comprise a cluster of highly similar earthquakes, with a few smaller clusters below.

Click here to read the project’s full summary.


Using Social Media Data to Identify Geographic Clustering of Anti-Vaccination Sentiments


Project lead: Benjamin Brooks, UW Institute for Health Metrics and Evaluation
Advisor: Abie Flaxman, UW Institute for Health Metrics and Evaluation
eScience Liaison: Andrew Whitaker, UW eScience Institute

Considerable attention has been given to the potential for search engine and social media data to provide real-time information about public health threats; the idea is best known in the context of influenza. Public opinion concerning vaccination has been of interest since the publication of a now-discredited 1998 study linking the measles, mumps, and rubella (MMR) vaccine to autism; in its wake, parental fear of vaccination has risen, vaccination rates have decreased, and outbreaks of vaccine-preventable diseases have increased. Relative to other applications of social media data in public health, the study of anti-vaccination sentiments is particularly appropriate given that individuals are often opinionated on the topic and might be expected to share such opinions publicly.

We are interested in using Twitter data as a means of monitoring general anti-vaccination sentiment. In particular, we hypothesize that opinions shared on Twitter regarding vaccination provide insights into where geographic clusters of anti-vaccination sentiment exist, and, consequently, where children are not immunized and outbreaks might be expected. A study published in 2011 used a series of keywords to identify and collect vaccination-related Twitter data over a six-month period after the H1N1 (“swine flu”) vaccine became available to the public. The researchers built a classifier from a training dataset in which students tagged roughly 10% of the tweets as expressing positive, negative, or neutral sentiment toward the vaccine; the classifier was then used to assign the remaining tweets to one of the three bins.
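The 2011 study's exact features and model are not described here; purely as an illustration of the train-on-a-tagged-subset workflow, a bag-of-words sentiment classifier in scikit-learn might look like the following. The example tweets and labels are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented hand-labeled tweets standing in for the student-tagged ~10%.
train_texts = [
    "just got my flu shot, feeling good about it",
    "vaccines save lives, get vaccinated",
    "so glad the clinic had the h1n1 vaccine in stock",
    "proud to vaccinate my kids on schedule",
    "not letting my kids get that vaccine, too risky",
    "the vaccine causes more harm than the flu itself",
    "refusing the shot, i don't trust it",
    "another vaccine scare, keeping my family away",
]
train_labels = ["positive"] * 4 + ["negative"] * 4

# Bag-of-words (tf-idf) features feeding a logistic regression classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)

# The fitted model then labels the remaining ~90% of collected tweets.
unlabeled = ["get your flu shot and stay vaccinated",
             "i don't trust that risky vaccine"]
print(clf.predict(unlabeled))
```

In practice a third "neutral" class and far more training data would be needed, but the pipeline shape, vectorize then classify then apply to the untagged remainder, is the same.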

While this study showed that users with anti-vaccination opinions tended to cluster within the social network, it used only a crude measure to validate whether those opinions manifested themselves in measurable public health outcomes. The researchers used the geographic information associated with individual Twitter accounts to compare the average “sentiment ratings” of different regions of the US with H1N1 vaccination rates, and found a reasonably strong positive correlation (i.e., more positive sentiment, higher vaccination coverage). Our goal is to extend this work by examining whether these clusters can be linked to particular geographic areas at the state or, preferably, sub-state level, and whether those areas have experienced outbreaks of vaccine-preventable disease since the original link between autism and the MMR vaccine was published.

We tested this hypothesis by combining vaccination-related Twitter data with data published through the National Notifiable Diseases Surveillance System, which provides weekly counts of newly diagnosed cases of key infectious diseases (including those preventable by vaccine) for each state. In the process, we tested several different sentiment classification methods, collected a new body of vaccination-related Twitter data from 2014, and examined whether the average sentiment expressed on Twitter in 2009, during the H1N1 pandemic, was similar to the average sentiment in the same geographic areas in 2014.

Click here to read the project’s full summary.


Analysis of Kenya's Routine Health Information System Data


Project lead: Gregoire Lurton, UW Institute for Health Metrics and Evaluation
Advisors: Abie Flaxman and Emmanuela Gakidou, UW Institute for Health Metrics and Evaluation
eScience Liaison: Daniel Halperin, Director of Research – Scalable Analytics, UW eScience Institute

Every year, millions of dollars are spent on collecting data on health services in developing countries. This data then typically sits unused because of access, reliability, and management issues. During this project, we worked with a set of over 5,000 monthly reports collected from 2008 to 2012 by the Kenya Health Ministry. These reports are part of the Kenyan Health Management Information System (HMIS), through which hospitals regularly report the main pathologies they treated and the activities they carried out. The dataset was compiled manually into a diverse collection of Excel files, which makes it difficult to process and analyze. As a result, this type of routine data is seldom used for policy making or health system management.

Our aim was to make this data easily usable for analysis. We developed a series of methods to 1) programmatically extract the data from Excel, automating access to thousands of spreadsheets while handling the quirks of manually entered data from a variety of report templates, 2) test the reliability of the data using a variety of spreadsheet and data features, and 3) import the data into SQLShare to provide SQL querying over the spreadsheet data, using Excel file metadata to cluster and classify the reports.
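As a sketch of step 1: once a library such as xlrd or openpyxl has read a spreadsheet into rows of cells, the extraction logic still has to locate the header row and tolerate hand-entered quirks (title rows, blank padding, inconsistent header spellings). The header aliases, field names, and sample sheet below are hypothetical, not the actual HMIS templates:

```python
# Map spelling variants seen across report templates to canonical field names
# (hypothetical aliases for illustration).
HEADER_ALIASES = {
    "facility name": "facility",
    "facility": "facility",
    "malaria cases": "malaria",
    "malaria": "malaria",
}

def extract_records(rows):
    """Yield {canonical_field: value} dicts from a raw sheet (list of rows)."""
    header_idx, header = None, None
    for i, row in enumerate(rows):
        cells = [str(c).strip().lower() for c in row if c not in (None, "")]
        # The first row containing a known column name is taken as the header.
        if any(c in HEADER_ALIASES for c in cells):
            header_idx = i
            header = [HEADER_ALIASES.get(str(c).strip().lower()) for c in row]
            break
    if header is None:
        return  # no recognizable header: skip this sheet
    for row in rows[header_idx + 1:]:
        record = {name: cell for name, cell in zip(header, row)
                  if name and cell not in (None, "")}
        if record:
            yield record

# A sheet as it might come out of one report template: a title row and a
# blank row before the real header.
sheet = [
    ["Kenya HMIS monthly report", None],
    [None, None],
    ["Facility Name", "Malaria Cases"],
    ["Kisumu District Hospital", 120],
    ["Nakuru Clinic", 47],
]
print(list(extract_records(sheet)))
```

The same header-alias table doubles as metadata for step 3: which aliases a file uses helps cluster and classify the report templates.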

Click here to read the project’s full summary.


Efficient Computation on Large Spatiotemporal Network Data


Project Lead: Ian Kelley, Ph.D., Research Consultant, Information School
eScience Liaison: Andrew Whitaker, Ph.D., Research Scientist, eScience Institute

The pervasive and rich data available in today’s networked computing environment provide major opportunities for innovative data-intensive applications. Particularly challenging are data analysis projects that rely on input from millions of sparse, high-dimensional, and dirty data files that can be difficult and time-consuming to analyze.

The goal of this project was to develop methods and infrastructure for analyzing large-scale call detail record (CDR) data. The first goal of this investigation was to identify the computational and logistical challenges of collecting, storing, and analyzing this type of data. The next stage focused on evaluating the tools, environments, and middleware that could support the required data workflows. Given the size and heterogeneity of the datasets, the project focused on current state-of-the-art “big data” systems such as MapReduce, Hive, Shark, and Spark.

Call Detail Records consist of metadata about mobile phone network calls that is passively collected in log files. These records can provide rich information that is useful for explorations ranging from mobility analysis and location inference to calculating probabilities of new product adoption.
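As a small sketch of the kind of computation these systems run at scale, the map-shuffle-reduce pattern can be illustrated in plain Python on a toy CDR table. The field layout and values are invented; counting each caller's calls per cell tower gives the crude location inference mentioned above:

```python
from collections import defaultdict

# Hypothetical CDR rows: (caller, callee, cell_tower, duration_seconds).
cdrs = [
    ("A", "B", "tower_1", 60),
    ("A", "C", "tower_1", 120),
    ("B", "A", "tower_2", 30),
    ("A", "D", "tower_3", 45),
]

# Map phase: emit one ((caller, tower), 1) pair per call record.
mapped = [((caller, tower), 1) for caller, _callee, tower, _dur in cdrs]

# Shuffle phase: group values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: sum the counts per (caller, tower).
calls_per_tower = {key: sum(values) for key, values in groups.items()}

# The tower each caller uses most often is a crude "home location" estimate.
home_tower = {}
for (caller, tower), n in calls_per_tower.items():
    best = home_tower.get(caller)
    if best is None or n > calls_per_tower[(caller, best)]:
        home_tower[caller] = tower
print(home_tower)  # {'A': 'tower_1', 'B': 'tower_2'}
```

Systems like Hive or Spark express the same three phases declaratively (a GROUP BY, or a `reduceByKey`) and distribute them across a cluster, which is what makes them suitable for millions of log files.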

Click here to read the project’s full summary.


Scalable Manifold Learning for Large Astronomical Survey Data


Project lead: Marina Meila, UW Department of Statistics
eScience Liaison: Jake VanderPlas, Director of Research – Physical Sciences, UW eScience Institute

Manifold Learning (ML), also known as non-linear dimension reduction, finds a non-linear representation of high-dimensional data using a small number of parameters. ML is data intensive: it has been shown statistically that the estimation accuracy depends asymptotically on the sample size N like N^(1/(αd + β)), and hence requires large amounts of data when the intrinsic dimension d is larger than a few. On the other hand, manifold learning fully realizes its potential in scientific discovery from very large multi-dimensional data sets representing partially known physical systems (e.g., spectra of galaxies), where there is reason to believe that the data can be modeled by a small set of parameters.

Therefore, we implemented a software suite that enables scientists and methodologists alike to scale a broad class of manifold learning methods to very large data sets. In particular, the software can be used to analyze spectroscopic data from the SDSS, as well as other data from astronomical surveys. The software is written in Python, building upon the existing scikit-learn library for scientific computing and machine learning. Our project demonstrates, against commonly held belief, that with careful implementation ML can be made tractable on large data.
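As a small illustration of the regime described above, data living in a higher-dimensional ambient space but governed by few parameters, scikit-learn's own manifold module (which the suite builds upon) can unroll a synthetic 2-D surface sampled in 3-D. The dataset and method choice here are illustrative, not the project's SDSS pipeline:

```python
import numpy as np
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap

# 1000 points in 3-D that lie on a 2-D "S"-shaped surface: the ambient
# dimension is 3 but the intrinsic dimension d is 2.
X, t = make_s_curve(n_samples=1000, random_state=0)

# Isomap recovers a 2-parameter representation of the surface.
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(X.shape, embedding.shape)  # (1000, 3) (1000, 2)
```

For galaxy spectra the ambient dimension is in the thousands (one per wavelength bin) rather than 3, which is exactly why the naive O(N²) neighbor and eigendecomposition steps inside such methods must be made scalable.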

Click here to read the project’s full summary.


ASPASIA: Adult Service Providers and Some Incidental Addenda

Number of unique ads on display on an escort advertising page on each day of December 2013. The series for each region is divided by the region's mean daily ad count for the month.

Project Lead: Sam Henly, PhD Student, UW Department of Economics
eScience Liaison: Andrew Whitaker, Data Scientist, eScience Institute

Most prostitution in the United States is organized through Internet media. This presents an opportunity for research into a market that, historically, has proved impenetrable to systematic investigation. ASPASIA is an effort to collect all of the data generated by market participants’ use of web platforms—advertising sites, review sites, forums, and so on—and use them to create a rich and real-time map of prostitution activity. Once complete, this data set will permit us to describe with great granularity the labor side of the market for sex in the United States and Canada, and the effects of policy interventions on that market.

The figure above illustrates the daily intensity of advertising by sex workers in each of 24 metropolitan areas in the United States. These series may be used to evaluate the effect of external events on markets for sex in cities. For example, officials frequently claim that the Super Bowl produces a boom in sex work; these series may confirm or reject such claims. More importantly, we will use these series to evaluate whether anti-prostitution stings are effective in suppressing markets for sex.
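The normalization used in the figure, dividing each region's daily ad counts by that region's monthly mean, is straightforward to reproduce. The counts and metro areas below are invented for illustration:

```python
import pandas as pd

# Invented daily unique-ad counts for three hypothetical metro areas.
counts = pd.DataFrame(
    {"seattle": [200, 210, 190], "denver": [50, 55, 45], "miami": [400, 380, 420]},
    index=pd.date_range("2013-12-01", periods=3),
)

# Divide each region's series by that region's mean, so that markets of very
# different sizes plot on a common scale (mean = 1.0 for every region).
normalized = counts / counts.mean()
print(normalized)
```

Putting every region on a mean-of-1.0 scale is what lets a small market and a large one share one set of axes, so that a common shock (such as a holiday or a sting operation) shows up as a synchronized deviation across series.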

Click here to read the project’s full summary.