For an overview of the Incubator Program click here.

Kernel-Based Moving Object Detection

KBMOD query timing plot: the time per Trajectory to intersect the Trajectory and Image tables, plotted as a function of the week of the eScience incubator project. The size of each point is proportional to the number of entries in the Trajectory table during the query.

Project Lead: Andrew Becker, UW Astronomy
eScience Liaison: Daniel Halperin
With assistance from: Andrew Whitaker, Bill Howe

Kernel-Based Moving Object Detection (KBMOD) describes a new technique to discover faint moving objects in time-series imaging data. The essence of the technique is to filter each image with its own point-spread function (PSF) and normalize by the image noise, yielding a likelihood image in which the value of each pixel represents the likelihood that there is an underlying point source. We wish to search for objects that have low S/N in a single image (e.g. a pixel value between 1 and 3), but that, when the signal is aggregated across the multiple images in which they appear, have a cumulative S/N significant enough to claim a detection (e.g. greater than 10). The core functionality of KBMOD is to run a detection kernel along putative moving-object trajectories, summing the likelihood values wherever a trajectory intersects a science image.
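A minimal sketch of this idea in Python (assuming NumPy; the function names are illustrative, not KBMOD's actual API): correlating an image with its PSF and dividing by the noise gives a per-pixel S/N, and the S/N of n independent measurements along a trajectory combines as their sum over sqrt(n).

```python
import numpy as np

def likelihood_image(image, psf, noise_sigma):
    """Correlate an image with its PSF and normalize by the per-pixel
    noise, yielding a likelihood (matched-filter S/N) image."""
    ph, pw = psf.shape
    pad_y, pad_x = ph // 2, pw // 2
    padded = np.pad(image, ((pad_y, pad_y), (pad_x, pad_x)), mode="constant")
    out = np.empty(image.shape, dtype=float)
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            # Correlation of the PSF with the local patch.
            out[y, x] = np.sum(padded[y:y + ph, x:x + pw] * psf)
    # Normalize so each pixel reads as a signal-to-noise ratio.
    return out / (noise_sigma * np.sqrt(np.sum(psf ** 2)))

def trajectory_snr(likelihoods):
    """Combine per-image S/N values sampled along one putative trajectory:
    n independent Gaussian measurements combine as sum / sqrt(n)."""
    vals = np.asarray(likelihoods, dtype=float)
    return vals.sum() / np.sqrt(len(vals))
```

For example, a source at S/N of about 2 in each of 25 images accumulates to a trajectory S/N of 10, crossing the detection threshold mentioned above.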

The first step in this process, implemented during the Fall 2014 eScience Data Incubator project, involved examining a database-based solution for the data access and query implementation. PostgreSQL was chosen as the database implementation, primarily because of the PostGIS spatial extension, which allows for native spherical geometry objects and queries. Since the package was originally designed to represent Earth-based geographic information, one minor detail is to make sure that geometric objects are represented on an ideal sphere instead of Earth's ellipsoid.
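On an ideal sphere, angular separations reduce to great-circle distances. A small illustrative Python function (not part of the project's code) shows the haversine form that such spherical distance queries rely on:

```python
import math

def sphere_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions
    (RA, Dec in degrees) on an ideal sphere -- the geometry assumed
    when a spherical, rather than ellipsoidal, reference is used."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    dra, ddec = ra2 - ra1, dec2 - dec1
    # Haversine formula: numerically stable for small separations.
    a = (math.sin(ddec / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin(dra / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a)))
```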

We started the project running a PostgreSQL database on an Amazon Relational Database Service (RDS) instance, but ran into the limitations that one cannot log in to the machine (e.g. via ssh) to copy data locally for ingest, or install C-language User Defined Functions (UDFs). The latter requirement was due to the desire to replicate, in the database, the WCSLIB mapping of sky coordinates to image pixels, which comes from metadata contained in the image headers. This necessitated installing the database on an Elastic Compute Cloud (EC2) instance where we had complete sysadmin control of the system.

The bulk of the work during this incubator was in designing database tables, and then queries on those tables, for the purpose of intersecting space-time trajectories of moving objects with our imaging dataset. In short, we wanted to find out which image a moving object intersected, at which sky coordinate inside the image (in the 2-D sky plane defined by the Right Ascension and Declination coordinate system), and finally which x,y pixel this corresponds to. Three table versions were implemented, which can be reduced to a maximal and a minimal table design.
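The intersection logic can be sketched in Python as follows (hypothetical field names, and a simple linear approximation of the sky-to-pixel mapping rather than the full WCSLIB solution):

```python
def trajectory_position(ra0, dec0, vra, vdec, t):
    """Predicted sky position of a linear trajectory at time t (days),
    with velocities in degrees/day. Illustrative, not the project's schema."""
    return ra0 + vra * t, dec0 + vdec * t

def intersect(traj, image):
    """Return the (x, y) pixel if the trajectory falls inside the image
    footprint at the image's epoch, else None. Uses a simple linear
    (tangent-plane) stand-in for the WCS mapping."""
    ra, dec = trajectory_position(traj["ra0"], traj["dec0"],
                                  traj["vra"], traj["vdec"], image["t"])
    if not (image["ra_min"] <= ra <= image["ra_max"]
            and image["dec_min"] <= dec <= image["dec_max"]):
        return None
    # Linear sky-to-pixel conversion with a single plate scale (deg/pixel).
    x = (ra - image["ra_min"]) / image["scale"]
    y = (dec - image["dec_min"]) / image["scale"]
    return x, y
```

The database version of this check pushes the same containment test into a spatial join between the Trajectory and Image tables.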

Click here to read the project’s full summary.


Students' Sleep and Academic Performance


Project Lead: Ângela M. Katsuyama, UW Biology
Advisor: Horacio O. de la Iglesia, UW Biology
eScience Liaisons: Bill Howe, Daniel Halperin

This project investigates the impact of sleep on academic performance in college. We hypothesize that poor academic performance in college students correlates with poor sleep behaviors. To address this hypothesis, we collected data from 72 senior students enrolled in the Spring 2014 Biological Clocks and Rhythms course. Sleep parameters considered for analysis included: chronotype (preference for being a morning vs. an evening type, as mismatches between chronotype and work schedule can lead to poor performance), social jetlag (the difference in sleep timing between school days and weekends, due to the usual compensation of sleep debt on weekends), and the variability of sleep onset, offset, duration, etc. Sleep parameters were recorded via sleep diaries, as well as with wrist data loggers (see below); performance was measured through grades.

The datasets consisted of: (1) activity and light exposure to three wavelengths, as well as white light, collected over 6 days (including one weekend) using a wrist actimeter (ActiWatch®); (2) a sleep diary containing information about bed time, wake time, rise time, sleep duration, and the number (and kind) of disturbances throughout the night, all self-reported throughout the days the student was wearing the watch; (3) a chronotype score based on validated questionnaires from the literature; and (4) grades (midterms, quizzes, and final grade).

Our main goal was to determine potential correlations between sleep parameters and grades. We specifically assessed how light exposure is associated with sleep patterns, and whether day-to-day variability or weekday vs. weekend variability has an impact on academic performance. The challenge was to automate the data analysis across the entire population of students so that this project could be scaled up.
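As an illustration of the kind of automated analysis involved, a short NumPy sketch with entirely made-up numbers computes the Pearson correlation between one sleep parameter and grades (the real study uses the diary, ActiWatch, and grade data described above):

```python
import numpy as np

# Hypothetical per-student values: social jetlag (hours) and final grade.
# These numbers are invented for illustration only.
social_jetlag = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
final_grade = np.array([3.8, 3.6, 3.5, 3.1, 3.0, 2.7])

# Pearson correlation between a sleep parameter and academic performance;
# a negative r would be consistent with the project's hypothesis.
r = np.corrcoef(social_jetlag, final_grade)[0, 1]
```

Running the same computation over every sleep parameter and every student is what makes a scripted pipeline preferable to by-hand analysis.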

Click here to read the project’s full summary.


Simulating Competition in the U.S. Airline Industry


Project Lead: Charlie Manzanares (Economics)
eScience Liaisons: Andrew Whitaker, Daniel Halperin

Since 2005, the U.S. airline industry has experienced the most dramatic merger activity in its history, which has reduced the number of major carriers in the U.S. from eight to four. My project seeks to provide novel estimates of changes in consumer and producer welfare in the U.S. due to these mergers. To do so, I seek to estimate a dynamic model of route competition using the entire DB1B dataset, which is a 10% sample of all airline tickets in the U.S. from 1993 on, provided by the U.S. Department of Transportation. This dataset is large, consisting of roughly 5 million observations per quarter. Further, in order to estimate parameters of the dynamic game, I use a simulation and estimation approach, which requires increasing the size of the DB1B dataset to accommodate routes offered by carriers that do not exist in the dataset but that might have existed had these mergers been prevented. This data augmentation step increases the number of observations to 11 million per quarter. With this dataset, running my simulation using the R programming language is computationally infeasible on my laptop. The eScience Fall 2014 incubator project consists of creating software that will allow my simulation to run in parallel on an Amazon EC2 instance, drastically speeding up the computations and allowing me to complete multiple iterations of my simulation. The tasks consist of (1) a data augmentation step (DA), (2) a value function simulation and estimation step (VFE), and (3) a counterfactual simulation step (CS).
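Because independent Monte Carlo draws in a value-function simulation parallelize naturally across cores, the move to a multi-core EC2 instance follows a simple fan-out pattern. A minimal Python sketch (with a placeholder profit process, not the estimated model) shows the shape:

```python
import random
from concurrent.futures import ProcessPoolExecutor

def simulate_route_value(args):
    """One Monte Carlo draw of a route's discounted profit stream.
    The profit process and parameters are placeholders for illustration."""
    seed, periods, beta = args
    rng = random.Random(seed)
    value, discount = 0.0, 1.0
    for _ in range(periods):
        value += discount * rng.gauss(1.0, 0.5)  # per-period profit shock
        discount *= beta
    return value

def simulate_parallel(n_draws, periods=40, beta=0.95, workers=4):
    """Fan independent draws out across processes and average them --
    the same pattern whether on a laptop or a large EC2 instance."""
    tasks = [(seed, periods, beta) for seed in range(n_draws)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        draws = list(pool.map(simulate_route_value, tasks))
    return sum(draws) / n_draws
```

Seeding each draw independently keeps the parallel run reproducible regardless of how the work is scheduled across processes.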

Click here to read the project’s full summary.


Analysis of .Gov Web Archive Data


Project Lead: Emily Gade (Political Science)
eScience Liaison: Andrew Whitaker

Data are revolutionizing all fields of science, including political science. Managing unstructured data (particularly text) is a non-trivial challenge for social scientists, especially at a large scale. An example is the .gov dataset curated by the Internet Archive (IA). The IA curates web crawls from 1996 to the present, and has carved out a database of all .gov pages. These pages have been parsed so that it is possible to query (for example) just the .html text. The resulting 82 TB database (WARC format) is currently hosted pro bono by a private company (Altiscale), distributed across a dozen or so servers. Running a query via Hadoop takes about 2 days. Investigating research questions using Altiscale is a very time-consuming process (and beyond the technical ability of nearly all political scientists). We also hope to identify and circumvent key challenges arising from the non-scientific research designs used for the web crawls, and from the changing nature of the content now posted on the web.
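The Hadoop jobs in question follow the classic map/reduce pattern. A toy Python sketch (with illustrative query terms, not an actual query from the project) shows the shape of a keyword count over parsed page text:

```python
from collections import Counter

def map_phase(doc):
    """Emit (term, 1) pairs for each keyword hit -- the 'map' half of
    the kind of job run against the parsed .gov page text."""
    keywords = {"climate", "security", "budget"}  # illustrative terms
    return [(w, 1) for w in doc.lower().split() if w in keywords]

def reduce_phase(pairs):
    """Sum the counts per term -- the 'reduce' half."""
    totals = Counter()
    for term, count in pairs:
        totals[term] += count
    return dict(totals)

# Two stand-in "pages"; the real corpus is 82 TB spread across servers.
docs = ["Climate policy and budget hearings", "budget security briefing"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(pairs)
```

At scale, the map phase runs on each server's shard of the archive and only the small (term, count) pairs move across the network, which is why the pattern suits a corpus too large to centralize.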

Click here to read the project’s full summary.


Innovation: Evidence from Patents


Project Lead: Matthew Denes (Finance and Business Economics)
eScience Liaison: Andrew Whitaker

One of the key drivers of long-term economic growth studied in economics and finance is technological innovation. A common proxy of innovative activity is patents. Patents provide researchers with a clear and well-recorded measure of innovation, where the number of patents and patent citations are argued to quantify the scale and novelty of a company's innovation, respectively (Kogan et al. (2012)). Two main datasets on patents are utilized by researchers. First, the National Bureau of Economic Research (NBER) data file provided the first link between public U.S. firms and patent count and citation data (Hall et al. (2001)), covering 1976-2006, which Kogan et al. (2012) extended and updated to cover 1926-2010. Second, Harvard's Patent Network Dataverse applies the Torvik-Smalheiser disambiguation algorithm to identify patent inventors (Lai et al. (2012)), since the spelling of an inventor's name may differ for the same person (for example, missing a middle initial).
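As a rough illustration of why disambiguation is needed, a toy Python sketch groups inventor names by a crude blocking key so that middle-initial variants collide; the actual Torvik-Smalheiser algorithm is far more sophisticated, weighing many features beyond the name string itself:

```python
def name_key(name):
    """Crude blocking key: (last name, first name), ignoring middle
    initials, so 'John A. Smith' and 'John Smith' collide. A stand-in
    for real disambiguation, for illustration only."""
    parts = name.replace(".", "").lower().split()
    return (parts[-1], parts[0])

records = ["John A. Smith", "John Smith", "Jane Smith"]
groups = {}
for r in records:
    groups.setdefault(name_key(r), []).append(r)
```

Here the two "John Smith" variants fall into one group while "Jane Smith" stays separate; a key this naive would, of course, also merge distinct people who share a name, which is exactly the failure mode the published algorithm addresses.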

Click here to read the project’s full summary.


Analysis of Large-Scale Patterns in Phytoplankton Diversity

Map of cytometric diversity (N0) calculated using Myria and produced using the basemap package in Python.


Project Lead: Sophie Clayton (Oceanography)
eScience Liaison: Daniel Halperin

Microscopic algae (called phytoplankton) form the base of the oceanic food chain, and are key players in the biogeochemical cycles of many climatically active elements. Ecological theory predicts that diverse ecosystems are more stable, i.e. more resistant to stressors, than less diverse ecosystems. However, data on the diversity of oceanic phytoplankton communities are very sparse, as they typically depend on labor-intensive methods (e.g. microscope identification, molecular sequencing). In order to understand how phytoplankton diversity may be affected by climate change, it is essential to have a baseline understanding of current patterns in diversity and how they relate to environmental conditions.

In this study, we will calculate indices of phytoplankton diversity using data collected with SeaFlow, a continuously sampling underway flow cytometer. This will produce diversity estimates at high resolution over large spatial scales, and across different seasons. We will adapt Li's (1997) cytometric diversity to better reflect the taxonomic diversity of phytoplankton observed with SeaFlow, and develop methods for integrating data from different instruments and cruises in such a way that they are comparable. Using data from the Pacific and Atlantic Oceans collected during 18 oceanographic cruises, we will conduct a meta-analysis of the patterns in cytometric diversity, and how these relate to other biotic and abiotic variables (e.g. temperature, salinity, density gradients, biomass).
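Cytometric diversity indices of this kind are often expressed as Hill numbers. A short Python sketch (an illustration, not the project's actual implementation of Li's index) computes N0 (richness), N1 (exponential of Shannon entropy), and N2 (inverse Simpson concentration) from cluster abundances:

```python
import math

def hill_numbers(abundances):
    """Hill diversity numbers from cluster abundances (e.g. cell counts
    per cytometric population): N0 = richness, N1 = exp(Shannon entropy),
    N2 = inverse Simpson. Illustrative sketch only."""
    total = sum(abundances)
    p = [a / total for a in abundances if a > 0]
    n0 = len(p)
    n1 = math.exp(-sum(pi * math.log(pi) for pi in p))
    n2 = 1.0 / sum(pi * pi for pi in p)
    return n0, n1, n2
```

When all clusters are equally abundant the three numbers coincide at the cluster count; as the community becomes dominated by a few clusters, N1 and N2 fall below N0, which is what makes the family useful for comparing communities across cruises.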

Click here to read the project’s full summary.