For an overview of the Incubator Program click here; find a video of the project presentation results here.
Atmospheric particulate matter source identification using excitation emission fluorescence spectroscopy
Project Lead: Jay Rutherford, UW Department of Chemical Engineering PhD Candidate
eScience Liaison: Bernease Herman
Air pollution is estimated to cause 4.9 million premature deaths and 149 million disability-adjusted life years annually.(1) 91% of the world’s population lives with air pollution levels above the World Health Organization guidelines.(2) These facts make air pollution the world’s largest environmental health risk. Air pollution consists of gases, liquids and solids. Tiny droplets of liquid and microscopic solids suspended in the atmosphere are referred to as aerosols or particulate matter (PM). PM comes from natural sources including sea spray, forest fires, and dust from soil, as well as anthropogenic sources like combustion engines, road dust, industry, residential heating and agricultural burning. There is extensive research showing PM2.5 (particulate matter smaller than 2.5 microns in diameter) causes a variety of health problems that lead to premature death and reduced quality of life. Some studies show certain sources of PM2.5 pollution, traffic for example, are worse for health than others; however, there is not yet sufficient evidence from source-specific studies to show this conclusively. Recently there has been a proliferation of low-cost instruments to measure PM2.5, but there is no accompanying low-cost method to determine the sources of PM, which is needed to enable the study of source-specific health effects.
To enable low-cost source apportionment, we are developing a method to analyze PM samples using fluorescence excitation-emission matrix spectroscopy (EEM). PM samples contain fluorescent compounds such as polycyclic aromatic hydrocarbons generated during combustion that can be extracted into a solvent for analysis by EEM spectroscopy. We have collected PM in the laboratory and analyzed extracts using EEM spectroscopy. Using these data, we trained a convolutional neural network (CNN) to distinguish the sources of air pollution present in the laboratory samples.(3)
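The intuition for why a CNN suits this task can be sketched without any deep learning framework: the core operation of a convolutional layer is 2D cross-correlation, and a learned filter that matches a source's fluorescence fingerprint responds strongly wherever that fingerprint appears in a mixture spectrum. The toy example below is purely illustrative; the "fingerprint", "mixture", and array sizes are invented, not real EEM data.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Valid-mode 2D cross-correlation: the core operation of a CNN layer."""
    kh, kw = kernel.shape
    out = np.empty((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

rng = np.random.default_rng(0)

# Hypothetical "fingerprint" of one source: a small bright blob.
fingerprint = np.zeros((5, 5))
fingerprint[1:4, 1:4] = 0.5
fingerprint[2, 2] += 1.0

# A toy mixture "spectrum": the fingerprint embedded at (8, 18) plus noise.
mixture = 0.05 * rng.random((32, 32))
mixture[8:13, 18:23] += fingerprint

# A filter matching the fingerprint fires most strongly where it occurs.
response = correlate2d_valid(mixture, fingerprint)
peak = np.unravel_index(response.argmax(), response.shape)
```

In a trained CNN the filters are learned from labeled spectra rather than specified by hand, but the mechanism is the same.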
EEM spectra from laboratory sources (cigarette smoke, diesel exhaust, and woodsmoke) show unique fingerprints (top row). An EEM spectrum from a mixture of these three PM sources (far left) and saliency masks for each source are shown in the second row. The areas highlighted by the saliency maps correspond with areas of unique fluorescence in the spectra, giving us confidence the CNN is learning relevant features to identify the PM sources. Image credit: Jay Rutherford
In order to apply EEM to source apportionment of real-world samples, we needed to better understand why the CNN was working for the laboratory samples. To provide insight into what features of the spectra the CNN was using to identify the various laboratory sources of pollution, we evaluated saliency maps from the trained network. This method calculates the sensitivity of the CNN output with respect to each area of an input spectrum. Saliency maps are typically applied to image classification: in an image containing a dog and a soccer ball that is classified as a dog, one expects the dog to be highlighted in the saliency map. Spectra from pure laboratory PM sources are shown in the top row of the figure; each source has a unique fingerprint. The spectrum shown at the far left is a mixture of the three sources. The panels directly below the pure spectra are saliency masks corresponding to each of the laboratory sources. These maps show the CNN is looking at areas of the mixture spectrum where the pure spectra show their unique fingerprints, giving us confidence the CNN architecture we chose is working properly.
We computed saliency maps using SmoothGrad(4) based on our CNN that was trained using Keras. We are currently working to generalize the methods we used for processing EEM spectra, training a CNN and computing saliency maps into an open source Python package. This project can be found at https://github.com/jayruth/pyeem.
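SmoothGrad itself is simple to sketch: average gradient-based saliency over several noise-perturbed copies of the input. The snippet below is a framework-agnostic illustration using numpy with an analytic gradient for a toy score function; the project's actual implementation differentiates a Keras CNN, and the function names here are hypothetical.

```python
import numpy as np

def smoothgrad(score_grad, x, n_samples=25, noise_frac=0.1, seed=0):
    """SmoothGrad: mean gradient over noisy copies of the input x.

    score_grad : callable returning d(score)/dx for an input array.
    noise_frac : noise standard deviation as a fraction of x's value range.
    """
    rng = np.random.default_rng(seed)
    sigma = noise_frac * (x.max() - x.min())
    grads = [score_grad(x + rng.normal(0.0, sigma, x.shape))
             for _ in range(n_samples)]
    return np.mean(grads, axis=0)

# Toy example: score(x) = sum(x**2), so the exact gradient is 2*x.
x = np.linspace(-1.0, 1.0, 8)
saliency = smoothgrad(lambda v: 2.0 * v, x, n_samples=200)
```

Averaging over noisy copies smooths out the high-frequency fluctuations that make raw gradient saliency maps hard to read.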
(1) Stanaway, J. D.; Afshin, A.; Gakidou, E.; Lim, S. S.; Abate, D.; Abate, K. H.; Abbafati, C.; Abbasi, N.; Abbastabar, H.; Abd-Allah, F.; et al. Global, Regional, and National Comparative Risk Assessment of 84 Behavioural, Environmental and Occupational, and Metabolic Risks or Clusters of Risks for 195 Countries and Territories, 1990–2017: A Systematic Analysis for the Global Burden of Disease Study 2017. The Lancet 2018, 392 (10159), 1923–1994. https://doi.org/10.1016/S0140-6736(18)32225-6.
(2) WHO | Air pollution http://www.who.int/airpollution/en/ (accessed Apr 4, 2019).
(3) Rutherford, J. W.; Dawson-Elli, N.; Manicone, A. M.; Korshin, G. V.; Novosselov, I. V.; Seto, E.; Posner, J. D. Excitation Emission Matrix Fluorescence Spectroscopy for Aerosol Source Identification. (In Review).
(4) Smilkov, D.; Thorat, N.; Kim, B.; Viégas, F.; Wattenberg, M. SmoothGrad: Removing Noise by Adding Noise. arXiv:1706.03825 [cs, stat], 2017.
Beneficial competition under rationing: evidence from food delivery service
Project Lead: Kwong-Yu Wong, UW Department of Economics
eScience Liaison: Jose Hernandez
Rationing becomes necessary whenever external constraints cause the quantity of goods supplied to fall short of what is required (e.g. essential supplies in wartime, needed surgeries, meals during peak hours). In the economics literature, rationing is commonly regarded as welfare reducing because it easily induces wasteful competition, such as wasting time standing in line. However, whether rationing induces only wasteful competition is still an open question.
This project studies beneficial competition under rationing in the food delivery industry and helps quantify the welfare improvement resulting from that competition. In the food delivery industry, rationing happens daily during peak hours. Such rationing induces customers to compete by calling in earlier for food delivery, so restaurants (and the delivery company) receive information earlier and have greater flexibility in meeting demand on time. This is a clear example of beneficial competition induced by rationing.
To quantify the overall welfare impact of such beneficial competition, we need to build a counterfactual in which the beneficial competition is removed and compare it with the reality that includes such competition. One major technical challenge is predicting which delivery person an order will be assigned to when the call-in time is altered. Since customers compete on call-in time, artificially altering the call-in time removes the competition effect in the counterfactual.
Once the call-in time of an order is adjusted in the counterfactual, how that order will then be assigned is unknown, as this does not happen in reality. While we could naively guess the assignment by simply assigning the order to the closest delivery person, the actual assignment is much more complicated, reflecting factors such as the orders each delivery person already has on hand and the corresponding finishing times. This project aims to predict the assignment with the help of statistical machine learning tools. With the assignment properly predicted, we can simulate how deliveries would be completed in the counterfactual scenario and measure the outcomes (e.g. delay time across all delivery orders) to compare with reality. The beneficial competition effect is then quantified by the difference between the two scenarios.
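As a rough illustration of the prediction task (not the project's actual model or data), a classifier can learn an unobserved dispatch rule from pairwise courier features such as distance to the restaurant and orders on hand. Everything below, including the dispatch rule itself, is simulated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Hypothetical features for two candidate couriers per order.
dist = rng.uniform(0, 5, size=(n, 2))       # km to the restaurant
queue = rng.integers(0, 6, size=(n, 2))     # orders already on hand

# An unobserved dispatch rule: the cheaper courier gets the order.
cost = dist + 1.5 * queue
y = (cost[:, 0] < cost[:, 1]).astype(int)   # 1 if courier A is assigned

# Pairwise features: courier A minus courier B.
X = np.column_stack([dist[:, 0] - dist[:, 1], queue[:, 0] - queue[:, 1]])
clf = LogisticRegression().fit(X[:1500], y[:1500])
acc = clf.score(X[1500:], y[1500:])         # held-out assignment accuracy
```

In the real problem the candidate set and feature list are richer, but the structure (rank couriers by learned features, predict the winner) is the same.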
Utilizing a Baidu API function built during the program, 300k+ food delivery orders with Chinese addresses only can now be visualized on a latitude-longitude plot (red for restaurants and black for customers), which essentially depicts the city of Shanghai, China. Image credit: Kwong-Yu Wong
Although the data on hand are rich, some key features for meaningful analysis are missing. The data contain only addresses for customers and restaurants, in Chinese; latitude-longitude coordinates and travel distances are missing. Without such information, regression analysis of delay time is counter-intuitive: for example, the regression would suggest that a higher delivery fee induces a larger delivery delay. To complicate things further, Google Maps is less accurate in China than the leading Chinese map service, Baidu Maps, so a typical API request through Google is not ideal. During the incubator program, we built a handy function to request geocoding information from the Baidu API. Even though the Baidu API has a daily limit, the function splits the task across weeks to obtain the information. Enabled by these additional features, we have better control variables for delay-time analysis: a delivery fee increase, after controlling for distance and traffic complications, reduces delay time, as one would expect.
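A minimal sketch of such a quota-aware geocoding helper is below. The endpoint and response fields follow Baidu's public geocoding v3 documentation as we understand it and should be checked against the current docs before use; the daily limit shown is hypothetical.

```python
import json
import urllib.parse
import urllib.request

BAIDU_GEOCODE_URL = "https://api.map.baidu.com/geocoding/v3/"

def geocode(address, ak):
    """Request lat/lng for a Chinese address from the Baidu geocoding API.

    `ak` is the developer key; the response layout follows Baidu's v3
    documentation (verify the field names against the current docs).
    """
    params = urllib.parse.urlencode(
        {"address": address, "output": "json", "ak": ak})
    with urllib.request.urlopen(f"{BAIDU_GEOCODE_URL}?{params}") as resp:
        body = json.load(resp)
    loc = body["result"]["location"]
    return loc["lat"], loc["lng"]

def daily_batches(addresses, daily_limit):
    """Split the address list into per-day chunks under the API quota."""
    return [addresses[i:i + daily_limit]
            for i in range(0, len(addresses), daily_limit)]

# 300k+ orders at a hypothetical quota of 6,000 requests/day -> ~50 days.
batches = daily_batches(list(range(300_000)), 6_000)
```

Each day's batch is geocoded and cached, so the full address list is covered over several weeks without exceeding the quota.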
Another concern is the relationship between order time and eventual delay. While a straightforward regression gives a quick grasp of the data, one might worry that the apparent delay effect arises because the treated group (e.g. customers who place orders 20 minutes earlier) and the control group (e.g. customers who place orders later) are different by nature. We therefore supplemented the analysis with multivariate matching and propensity score matching. The former ensures similarity between treated and control groups by choosing a corresponding counterpart in the control group for each observation in the treated group. Instead of choosing one-to-one counterparts, the latter estimates a propensity score used to weight all observations in the control group so as to make them comparable to the treated group. Both methods confirm the delay effect is significant in a comparable treated-control comparison.
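The propensity-score step can be sketched on simulated data (all variables below are invented for illustration): a logistic regression estimates each observation's propensity to be treated, and reweighting controls by the odds of treatment recovers a known treatment effect that a naive group comparison overstates.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 20_000
x = rng.normal(size=n)                         # confounder (hypothetical)
treat = rng.random(n) < 1 / (1 + np.exp(-x))   # high-x customers order earlier
y = 2.0 * treat + x + rng.normal(scale=0.5, size=n)  # true effect = 2.0

# Naive comparison is biased because treated units have higher x.
naive = y[treat].mean() - y[~treat].mean()

# Propensity scores from a logistic regression on the confounder.
ps = LogisticRegression().fit(x[:, None], treat).predict_proba(x[:, None])[:, 1]

# Reweight controls by the odds of treatment to match the treated group.
w = ps[~treat] / (1 - ps[~treat])
ipw = y[treat].mean() - np.average(y[~treat], weights=w)
```

The weighted estimate lands near the true effect of 2.0 while the naive difference does not.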
Going forward, we shall conduct a cost-benefit analysis to discuss how early information should be sent out to improve welfare. An instrumental variable is needed for the causality discussion: since ordering time is endogenously optimized, its associated coefficient does not directly inform what would happen had ordering time been changed. Rainfall can serve this role and disentangle the cost and benefit sides, because it affects only the delivery, not the ordering. With that in place, this project will be able to inform how early information should be communicated to make the best of rationing.
A network analysis of tree competition: Which tree species make the best neighbors?
Project Lead: Stuart Ian Graham, UW Biology Department
eScience Liaison: Ariel Rokem
Schematic diagram of one quadrant of a forest plot. Locations and sizes of trees are indicated by the circles and their diameters respectively. This diagram shows how this dataset enables a quantitative description of each tree’s competitive neighborhood. Image credit: Janneke Hille Ris Lambers
A quantitative understanding of how co-occurring tree species influence one another’s growth is required to predict how forest ecosystems will respond to climate change. Although competition with neighboring trees undoubtedly limits tree growth, the species identity of neighbors may have an important role in moderating this interaction.
For example, the seedlings of most conifer tree species have higher growth rates when close to an adult tree of the same species, whereas the opposite is true for many tropical tree species. However, it is currently unclear how these feedbacks influence the growth of adult trees. To fully comprehend the role of these feedbacks in structuring forest communities, we need to build an understanding of how trees respond to the species identity of their neighbors over their entire life cycle.
This project aims to create a statistical model to describe how the growth of adult trees is influenced by the size, species identity, and proximity of neighboring trees. It will use 40 years of growth data collected from 15 forest plots at Mount Rainier National Park, WA.
Within these 100 x 100 m plots, the location of each tree is mapped, such that the neighbors of any tree can be identified. In total, the dataset includes over 8000 trees from 10 species. The code used for this analysis will be released in a user-friendly format such that it can be used by the many forest management agencies that maintain similar datasets to the one used in this project.
Figure 2. Heat map showing the effects of neighboring trees of various species (x-axis) on the annual growth rates of focal trees (y-axis). Tree species are written as four-letter codes, and the diagonal from bottom-left to top-right shows the effects of neighboring trees on focal trees of the same species. Image credit: Stuart Ian Graham
Predicting how forest ecosystems will respond to contemporary climate change requires a quantitative understanding of how tree growth is influenced by neighboring trees. To build this understanding we must use large tree growth datasets in order to capture adequate variation in the many parameters that describe a tree’s competitive environment (i.e., what tree species are nearby, how large they are, and how far they are from the focal tree). Although previous research has developed a modeling approach to estimate the effects of these variables on tree growth, fitting the model requires a supercomputer to optimize over many parameters, and it returns coefficients that are not easily interpretable or comparable. This is unfortunate because the majority of datasets suitable for these analyses are built and maintained by forest management agencies, which do not have the time or resources to implement the current modeling approach. This project aimed to create a new method to model tree growth as a function of neighboring trees that is quick to fit and returns easily interpreted coefficients.
We used data on tree growth gathered from 15 forest plots at Mt. Rainier National Park. All trees in each plot are mapped and identified to species, enabling us to accurately describe the competitive environment of each tree (Fig. 1). We then created a directional graph model of these data, and modeled the annual growth rate of each tree as a linear function of its species, local tree density, and the species, size and proximity of each neighboring tree. The linear nature of this model means that it can run in < 10 seconds on a laptop and also returns coefficients that are in units of annual growth rates. From our model we developed a matrix that shows how the growth rate of each tree species is influenced by the species identity of a neighboring tree (Fig. 2). We are also developing an R package containing all the functions used to explore our data through visualizations and create our model, which will make this approach readily available to forest managers and other researchers.
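The linearity is what makes the model fast and interpretable: growth is regressed on per-species crowding terms, so each coefficient is annual growth lost per unit of crowding from that neighbor species. A minimal sketch on simulated data follows; the crowding-index definition and all values are invented for illustration, not the project's actual model specification.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical competition indices, e.g. neighbor size / distance summed
# over all neighbors of a given species.
crowd_a = rng.uniform(0, 4, n)   # crowding from species-A neighbors
crowd_b = rng.uniform(0, 4, n)   # crowding from species-B neighbors

# Simulated annual growth: species A neighbors are the stronger competitors.
growth = 5.0 - 0.8 * crowd_a - 0.2 * crowd_b + rng.normal(scale=0.3, size=n)

# Ordinary least squares recovers the intercept and competition coefficients.
X = np.column_stack([np.ones(n), crowd_a, crowd_b])
coef, *_ = np.linalg.lstsq(X, growth, rcond=None)
# coef ~ [5.0, -0.8, -0.2]: growth lost per unit crowding, by neighbor species
```

Because the coefficients are in units of annual growth, entries of the species-by-species matrix in Fig. 2 can be read directly as "how much growth a neighbor of species X costs a focal tree of species Y."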
The next step of this project is to obtain code from the authors of the previous modeling approach so that we can compare the performance of our new model and this older model. We will then finish documenting our R package and make it publicly available before writing up our findings in a manuscript for a peer-reviewed journal.
Predicting human-mediated vectors for invasive species from mobile technology
Project Lead: Julian Olden with Rachael Fricke, both UW School of Aquatic and Fishery Sciences
eScience Liaison: Spencer Wood
Invasive species pose a significant threat to ecosystem health and economies of nations across the globe. Freshwater recreational fishing is the largest, and a growing, vector for invader introductions: angler activities entangle invasive organisms on fishing gear, boat hulls, and outboard engines, and anglers sometimes release non-native species after using them as live bait. Understanding angler movement and behavior can provide critical insight into the most effective implementation of prevention strategies (e.g. watercraft inspection stations, educational signage), thus reducing the introduction, spread and impact of invaders.
To date, angler behavior has been inferred from sparsely conducted in-person interviews, creel surveys, diaries and mail-in surveys, which tend to produce retrospective data that is limited in time and space and often reveals intentions or attitudes rather than actual behaviors. Moreover, these traditional approaches disproportionately target older anglers and thus fail to engage younger generations whose participation in fishing is rapidly increasing.
Mobile technologies offer a novel opportunity to efficiently collect information on angler behavior at fine spatial and temporal resolutions over broad scales. Yet social media and smartphone fishing applications remain underutilized tools. Anglers who are highly active on social media often geotag photographs of fish, and fishing applications provide waterbody-specific locations of anglers. These sources can reveal angler behavior that is continuous in both space and time, and can provide inexpensive, high-resolution data on the potential dispersal pathways of aquatic invasive species.
The primary objective of this incubator is to leverage data from social media and mobile fishing applications to quantify angler activity and movement across the continental United States and assess species invasion risks associated with recreational fishing. Results from this incubator will directly inform interagency management interventions at both local and landscape scales by quantifying angler movement networks and determining how they change through time. Heavily-used locations of fish activity (i.e., highly-connected network nodes) are prime targets for collaborative regulatory approaches that focus on the implementation of prevention strategies, such as watercraft inspection and cleaning stations to remove invasive species, and educational signage to discourage anglers from releasing live bait into waterbodies.
Figure 1. Distribution of 69,448 angling records represented by 20,754 anglers in 24,767 waterbodies from 2017–2018. Angler activity over time presented in lower-right corner. Image credit: Julian Olden
We mapped the geography of angler activity across the United States according to events recorded by the NetFish mobile app and geolocated photographs that are shared publicly on Flickr. NetFish interfaces with devices called iBobbers – the smallest and lightest personal sonar depth finder on the market – that anglers cast into the water, primarily to measure depth and detect fish. NetFish currently archives all user-generated data, which includes geographic locations and dates of fishing events (Fig. 1).
We filtered all iBobber records and Flickr images to include only those taken on waterbodies, resulting in a total of over 45,000 visits among ca. 7,000 waterbodies in the Pacific Northwest over the past 10 years. Distances traveled between any two waterbodies were estimated using the GraphHopper routing API. Network theory was then used to create mathematical graphs linking waterbodies separated by less than the median distance travelled by anglers.
This resulted in a network of 353 waterbodies with 50 components (i.e., groups of interconnected waterbodies).
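Counting components of such a network needs nothing heavier than a union-find over the edges whose travel distance falls below the median. A small illustrative sketch follows; the distances are invented, not the project's GraphHopper output.

```python
def count_components(n_nodes, edges):
    """Count connected components with a simple union-find."""
    parent = list(range(n_nodes))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
    return len({find(i) for i in range(n_nodes)})

# Hypothetical example: 6 waterbodies with pairwise travel distances (km).
dist = {(0, 1): 5, (1, 2): 8, (3, 4): 4, (0, 5): 30, (2, 4): 40, (4, 5): 25}
cutoff = sorted(dist.values())[len(dist) // 2]          # median distance
edges = [pair for pair, d in dist.items() if d < cutoff]
n_components = count_components(6, edges)
```

Each resulting component is a group of waterbodies anglers plausibly move between, which is exactly the unit of interest for targeting inspection stations.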
Affective state analysis of ultrasonic vocalizations in animal models of mTBI/PTSD and neuropathic pain
Project Lead: Abigail G. Schindler, acting assistant professor, Psychiatry and Behavioral Services
eScience Liaison: Valentina Staneva
Image credit: Abigail Schindler
Chronic health conditions (e.g. mental health, pain) are increasing in the US and contribute substantially to decreased quality of life, loss of productivity, and increased financial burden. Indeed, the CDC estimates that over 90% of annual health care expenditures are for people with one or multiple chronic health conditions. Translational research efforts using rodent models can provide much needed insight into underlying mechanisms of chronic health conditions and are needed in order to facilitate the search for therapeutic approaches that can reduce and/or prevent adverse/maladaptive outcomes.
Critically, accurate quantification of affective state (e.g. positive, negative, pain, fear) has historically been a challenge in rodent models, with current available methods suffering from high subjectivity, lack of throughput, and invasive methods, leading to lack of reproducibility across research labs and/or inability to translate to humans. One promising area of research in rodent affective state is ultrasonic vocalizations (USVs). USVs are a form of rodent communication thought to represent an unbiased metric of affective state (there are thought to be potentially different “call signatures” for pleasure, pain, fear, etc.), but are historically difficult to analyze and interpret.
Currently, there is no open-source software available for USV detection and/or analysis (although Matlab-based options exist, e.g. DeepSqueak), limiting the applicability of USV research. With a focus on these USVs and open-source products, the current project seeks to develop a Python-based, high-throughput approach for 1) isolating USV calls and 2) assessing affective state. We have USV recordings from a variety of mouse groups (e.g. control, TBI, fear-induction, neuropathic pain), and our goal is to establish specific USV call repertoires/signatures related to specific affective states/experimental conditions/behavioral tasks/therapeutic treatments.
Figure 1: tSNE plot and spectrogram visualization of USV calls (yellow dots, clustered to the bottom left (<20 kHz USV call type) and bottom right (broadband click call type)) and noise (purple dots, clustered to the upper portion). Image credit: Abigail Schindler
During the incubator we created a series of Python Jupyter Notebooks for processing audio files to isolate USVs using either supervised classification algorithms or transfer learning. Notebooks for visualization and clustering analysis were also created. We utilized Google Colaboratory’s free cloud service with GPU support.
The general data acquisition and analysis pipeline is as follows: 1) acquire audio recording of rodent USVs using Avisoft SASLab Lite (free; saves audio as .wav file), 2) annotate USVs in each file (to use for training classifier) using Raven Lite (free; saves annotations as a ‘selections table’), 3) use annotated selections to train a classification algorithm (USV vs noise), 4) use trained model to process un-annotated audio files.
Audio files were split into 25 ms slices, converted into spectrograms, and saved in a labelled array format for easy access afterwards. Two feature sets were generated and used for model testing and evaluation of shallow learning algorithms: a) set of 8 spectral features (power, purity, centroid, spread, skewness, kurtosis, slope, and roll off), b) power spectrum distribution used as features (e.g. 257 frequencies contained in spectrogram, find power at each frequency for each slice). Full spectrogram images were used for deep learning. The final chosen model was saved and then used subsequently to isolate USVs from un-annotated audio files.
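The slice-level spectral features can be computed directly from a power spectrum. The sketch below shows three of the eight features (power, centroid, spread) on a synthetic tone; the sampling rate and slice length are illustrative, not the project's recording settings.

```python
import numpy as np

def slice_features(slice_samples, fs):
    """Power, spectral centroid, and spectral spread for one audio slice."""
    windowed = slice_samples * np.hanning(len(slice_samples))
    spec = np.abs(np.fft.rfft(windowed)) ** 2            # power spectrum
    freqs = np.fft.rfftfreq(len(slice_samples), d=1.0 / fs)
    power = spec.sum()
    centroid = (freqs * spec).sum() / power              # power-weighted mean
    spread = np.sqrt(((freqs - centroid) ** 2 * spec).sum() / power)
    return power, centroid, spread

# Synthetic 25 ms slice: a 60 kHz tone sampled at 250 kHz (hypothetical rates).
fs = 250_000
t = np.arange(int(0.025 * fs)) / fs
power, centroid, spread = slice_features(np.sin(2 * np.pi * 60_000 * t), fs)
```

Skewness, kurtosis, slope, and roll-off follow the same pattern: each is a statistic of the power-vs-frequency distribution of the slice.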
An initial challenge of insufficient computer RAM for processing the audio files was overcome by using the Python package Xarray and saving the processed data (e.g. 25 ms spectrogram slices) as netCDF files. A second challenge was the highly imbalanced nature of the datasets (e.g. the ~25,000 slices from a 10-minute audio recording will contain only ~5-50 USVs). We first approached the problem by balancing the training dataset through upsampling, downsampling, and stratification procedures, which, although it achieved good performance on the training set, resulted in many false detections on new, unobserved data. We concluded that downsampling restricted the diversity of the noise seen during training, and we achieved better performance without balancing by appropriately weighting the objective. Cross-validation with rare observations was also a challenge, since performance scores become sensitive to how the training data are organized; this was especially visible in deep learning training, where data are split into small batches while the dimensionality of the features is large.
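Weighting the objective rather than resampling can be sketched with scikit-learn's class_weight option; the data below are a synthetic stand-in for the real slices, with invented feature values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)

# Imbalanced toy data: 2,000 noise slices vs 40 USV slices (2 fake features).
X_noise = rng.normal(0.0, 1.0, size=(2000, 2))
X_usv = rng.normal(2.0, 1.0, size=(40, 2))
X = np.vstack([X_noise, X_usv])
y = np.array([0] * 2000 + [1] * 40)

# Unweighted loss lets the majority class dominate the decision boundary.
plain = LogisticRegression().fit(X, y)

# 'balanced' reweights each class inversely to its frequency in the loss,
# so every missed USV costs as much as ~50 misclassified noise slices.
weighted = LogisticRegression(class_weight="balanced").fit(X, y)

recall_plain = plain.predict(X[y == 1]).mean()       # USVs recovered
recall_weighted = weighted.predict(X[y == 1]).mean()
```

Unlike downsampling, this keeps every noise slice in training, preserving the diversity of the noise the classifier must reject.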
Interactions of tropical precipitation with atmospheric circulation and energy transport
Project Lead: Lauren Kuntz, Department of Oceanography
eScience Liaison: Rob Fatland, with Purshottam Shivraj
The canonical view of precipitation driven atmospheric energy flux fails to explain regional structure. (Top) The simplified theory of Hadley Circulation suggests energy convergence along the surface of the equator with poleward divergence in the upper layers. (Bottom) The symmetry in this theory breaks down when exploring the regional patterns of precipitation, which vary zonally and meridionally. Image annotation: Adler et al, 2003; image credit: Lauren Kuntz
Our canonical view of how precipitation impacts broad-scale atmospheric circulation and energy transport relies on simplified models of the zonal mean; it fails to explain the vertical and meridional variability in precipitation events, as well as their impact on energy fluxes and circulation patterns.
Developing physical theories that capture this variability is immediately relevant to our understanding of regional climate patterns and projecting future changes in response to greenhouse gas forcing.
To address this, we plan to use over 15 years of satellite precipitation data to constrain the different modes of precipitation and their impact on vertical energy convergence and divergence.
Using clustering methods, we will determine the dominant modes of precipitation in terms of their energy footprint, allowing us to explore the statistical patterns of these modes spatially and temporally.
With a better sense of where precipitation is driving energy convergence and divergence regionally and vertically, we can explore how that fits into broad-scale circulation patterns and atmospheric energy budgets.
We will directly compare climatological means of latent heating from precipitation modes to observations of atmospheric energy flux, developing insight as to how the two are related.
Through this lens, we will also look at the variability of precipitation energy modes across timescales, with the goal of exposing links with circulation and energy transport variability in the atmosphere.
Earth’s weather and climate are deeply impacted by the transport of energy through atmospheric circulation. Precipitation is deeply coupled to this system: on one hand, the atmospheric circulation dictates where regions of instability lead to convection, while on the other hand latent heating from precipitation redistributes energy and leads to anomalous circulations.
The vertical latent heat profiles of the 5 precipitation modes are shown, along with maps showing where they occur spatially. Clusters 1, 3, and 4 all show frequent occurrence within the Intertropical Convergence Zone (ITCZ), although cluster 1 also appears to be associated with storm tracks in the mid-latitudes of the South Pacific. Image credit: Lauren Kuntz
During the incubator, we set out to understand the different modes of precipitation through the lens of their impact on the convergence and divergence of energy, using 16 years of satellite data from the Tropical Rainfall Measuring Mission (TRMM). To start, we focused exclusively on the central Pacific region, allowing us to establish a methodology over a subset of the data that can easily be expanded to other regions. We first grouped observations of precipitation into rain events based on their geospatial structure using the DBSCAN algorithm. We then performed k-means clustering on the mean latent heat profiles of the events to define the various precipitation modes. Finally, we explored how these modes vary in space and time, looking both at the seasonal cycle and at variability in relation to other climate forcings, such as ENSO. Given the volume of data, these methods required us to work with cloud computing resources on AWS.
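The two clustering stages can be sketched with scikit-learn on synthetic data (the coordinates and profiles below are invented): DBSCAN groups rain pixels into spatially contiguous events, then k-means clusters the events' mean latent-heat profiles into modes.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

rng = np.random.default_rng(3)

# Hypothetical rain pixels from two well-separated events (lon, lat degrees).
event_a = rng.normal([150.0, 0.0], 0.2, size=(100, 2))
event_b = rng.normal([200.0, 5.0], 0.2, size=(100, 2))
pixels = np.vstack([event_a, event_b])

# Stage 1: group pixels into contiguous rain events by spatial density.
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(pixels)
n_events = len(set(labels) - {-1})          # -1 marks noise pixels

# Stage 2: cluster per-event mean latent-heat profiles into modes.
profiles = rng.normal(size=(50, 20))        # placeholder 20-level profiles
profiles[:25] += np.linspace(0, 3, 20)      # two distinct heating shapes
modes = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(profiles)
```

DBSCAN needs no preset event count, which suits irregular rain systems; k-means then imposes a fixed number of latent-heating modes to compare across space and time.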
This work easily lends itself to further extensions. Now that the methodology has been refined, it can be expanded to the full TRMM dataset. Explorations of monsoon precipitation structure, as well as of land-ocean differences, will be immediate follow-on analyses. In addition, incorporating reanalysis data and idealized climate model simulations will allow us to better understand how the latent heating structure links back to circulation anomalies. This understanding will also develop further insight into how precipitation and circulation respond to energy forcings, helping answer long-standing questions in the climate community, such as what caused regional rainfall to shift so dramatically in response to the precession of Earth’s orbit. A series of academic publications is expected as follow-ons.
Project Lead: John Osborne, Joint Institute for the Study of the Atmosphere and Ocean
eScience Liaison: Bryna Hazelton
Clockwise from top left: Picture of a coastal Washington buoy in Hood Canal. Current locations of the CO2 buoys. Plot showing relationship between CO2 and dissolved oxygen. Time series of dissolved oxygen showing data errors that need to be caught. Images compiled by John Osborne.
The ocean plays a major role in controlling Earth’s climate by absorbing one quarter to one third of anthropogenic carbon dioxide released into the atmosphere through fossil fuel burning and land-use changes.
Between 2004 and 2018, NOAA and UW scientists established 40 sites with time series of surface ocean pCO2 (partial pressure of CO2), of which 17 also include autonomous pH measurements.
These time series characterize a wide range of surface ocean conditions in different open-ocean (17 sites), coastal (13 sites), and coral reef (10 sites) regimes.
Our objective is to develop quality control procedures and methodologies using recovered data and existing quality-controlled pCO2 data. In particular, we need quality control procedures for dissolved oxygen (O2), chlorophyll (Chl), and turbidity (NTU). These methodologies should be applicable to our real-time data streams.
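As an illustration of the kind of automated checks meant here, the sketch below flags gross range violations and spikes in a sensor time series. The spike formula loosely follows common oceanographic QC tests, and the thresholds and data are hypothetical, not the project's final procedures.

```python
import numpy as np

def qc_flags(series, valid_range, spike_thresh):
    """Flag gross range violations and spikes in a sensor time series.

    The spike test compares each point to the mean of its two neighbors,
    discounting the local gradient so that sharp real transitions pass.
    """
    x = np.asarray(series, dtype=float)
    lo, hi = valid_range
    out_of_range = (x < lo) | (x > hi)

    spikes = np.zeros_like(x, dtype=bool)
    mid = 0.5 * (x[:-2] + x[2:])            # average of the two neighbors
    grad = 0.5 * np.abs(x[2:] - x[:-2])     # half the neighbor-to-neighbor step
    spikes[1:-1] = np.abs(x[1:-1] - mid) - grad > spike_thresh
    return out_of_range | spikes

# Hypothetical dissolved-oxygen series (ml/L) with one spike and one bad value.
o2 = [5.1, 5.0, 5.2, 5.1, 9.8, 5.0, 5.1, -1.0, 5.2, 5.1]
flags = qc_flags(o2, valid_range=(0.0, 9.0), spike_thresh=0.5)
```

The same structure (a gross-range test plus a spike test, with sensor-specific thresholds) applies to each of the O2, Chl, and NTU streams.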
Image credit: John Osborne
Due to unforeseen circumstances, results for this project have been delayed. Work is ongoing, and the outcomes will be provided at a later date.