Deer Fear: Using Accelerometers and Video Camera Collars to Understand if Wolves Change Deer Behavior
Project Lead: Apryle Craig, UW Department of Environmental & Forest Sciences PhD Candidate
eScience Liaison: Valentina Staneva
Animal behavior can provide insight into underlying processes that drive population and ecosystem dynamics. Accelerometers are small, inexpensive biologgers that can be used to identify animal behaviors remotely. Tri-axial accelerometers measure an animal’s acceleration in each of three dimensions, frequently recording 10-100 measurements per second. These fine-scale data provide an opportunity to study nuanced behaviors, but have historically posed challenges for storage and analysis. Animal behavior researchers have also been slow to adopt accelerometers, perhaps owing to the rigorous calibration required to infer behavior from acceleration data. Calibration involves time-synchronizing behavioral observations with their associated accelerometer readings, which often necessitates the use of captive animals, surrogate species, or field observations of instrumented individuals. Alternatively, animal-borne video cameras may be used to directly calibrate or validate accelerometers. My goal is to use video from animal-borne cameras to assess the capacity of collar-mounted tri-axial accelerometers and machine learning to accurately classify foraging, vigilance, resting, and traveling behavioral states in free-ranging deer. Deer were collared in areas of Washington that have been recolonized by wolves and in areas without wolves. I hope to use the resulting behavioral classifications to determine whether wolf recolonization is changing deer behavior.
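As a rough illustration of how accelerometer-based behavior classification of this kind typically works (not the project's actual pipeline), the sketch below windows a synthetic tri-axial stream into simple summary features and trains a random-forest classifier on video-derived labels; the sampling rate, window length, features, and two-state labels are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def window_features(accel, fs=32, win_s=2):
    """Summarize tri-axial acceleration (n_samples x 3) into
    fixed-length windows of simple statistical features."""
    win = fs * win_s
    n = accel.shape[0] // win
    feats = []
    for i in range(n):
        w = accel[i * win:(i + 1) * win]             # one window, shape (win, 3)
        feats.append(np.concatenate([
            w.mean(axis=0),                           # static (postural) component
            w.std(axis=0),                            # dynamic body acceleration
            np.abs(np.diff(w, axis=0)).mean(axis=0),  # jerkiness
        ]))
    return np.array(feats)

# Toy stand-in for video-labelled calibration data: "resting" windows are
# low-variance, "traveling" windows are high-variance.
rest = rng.normal(0.0, 0.05, size=(32 * 60, 3))
travel = rng.normal(0.0, 0.6, size=(32 * 60, 3))
X = np.vstack([window_features(rest), window_features(travel)])
y = np.array([0] * 30 + [1] * 30)   # 0 = resting, 1 = traveling

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.score(X, y))
```

In a real calibration, the labels would come from time-synchronized camera footage, and accuracy would be assessed on held-out animals rather than the training windows.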
Systems level analysis of metabolic pathways across a marine oxygen deficient zone
Project Lead: Gabrielle Rocap, UW School of Oceanography Professor
eScience Liaison: Bryna Hazelton
Marine Oxygen Deficient Zones (ODZs) are naturally occurring mid-layer oxygen-poor regions of the ocean, sandwiched between oxygenated surface and deep layers. In the absence of oxygen, microorganisms in ODZs use a variety of other elements as terminal electron acceptors, most notably oxidized forms of nitrogen, reducing the amount of bio-available nitrogen in the global marine system through the production of N2O and N2 gas. These elemental transformations mean that marine ODZs have an outsized contribution to global biogeochemical cycling relative to the volume of ocean they occupy. Because ODZs are expanding as the ocean warms, understanding the metabolic potential of the microbial communities within them is key to predicting global elemental cycles. The goal of this project is to use existing metagenomic data from ODZ microbial communities to quantify the metabolic pathways utilized by microorganisms in differently oxygenated water layers. We are using a set of 14 metagenomic libraries from different depths within the ODZ water column representing different oxygen levels (oxic, hypoxic, anoxic, etc.) that have been assembled both individually and together. We will use the frequency of genes in microbial populations in each water sample to identify genetic signatures of different water regimes, with a particular focus on genes encoding enzymes mapped in the Kyoto Encyclopedia of Genes and Genomes (KEGG).
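The gene-frequency comparison might be sketched as follows with pandas. The KO identifiers are real KEGG entries (nitrate reductases and an aerobic terminal oxidase subunit), but the counts and library labels are invented for illustration.

```python
import pandas as pd

# Hypothetical per-library counts of reads mapping to KEGG Orthology (KO)
# groups; rows are metagenomic libraries labelled by oxygen regime.
counts = pd.DataFrame(
    {"K00370": [5, 40, 120],    # narG: membrane-bound nitrate reductase
     "K02567": [2, 35, 90],     # napA: periplasmic nitrate reductase
     "K02256": [300, 80, 10]},  # cox1: aerobic terminal oxidase subunit
    index=["oxic_50m", "hypoxic_120m", "anoxic_300m"],
)

# Normalize by library size so gene frequencies are comparable across depths.
freq = counts.div(counts.sum(axis=1), axis=0)

# Genes enriched in the anoxic layer relative to the oxic layer flag
# candidate genetic signatures of the ODZ core.
enrichment = freq.loc["anoxic_300m"] / freq.loc["oxic_50m"]
print(enrichment.sort_values(ascending=False))
```

Real analyses would normalize against single-copy marker genes and span all 14 libraries, but the frequency-then-contrast logic is the same.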
Predicting a drought with a flood of data: Evaluating the utility of data-driven approaches to seasonal hydrologic forecasts
Project Lead: Oriana Chegwidden, UW Civil & Environmental Engineering Department PhD Candidate and Staff Scientist
eScience Liaison: Nicoleta Cristea
Climate change is likely to exacerbate droughts in the future, compromising water availability around the world. Those changes in water availability may not be uniform across the land surface, with shifts in precipitation and snowpack and increased losses due to evapotranspiration. The resulting combined changes to surface water availability are an active area of research. These potential changes are of global significance, particularly in transboundary river basins: earth systems and rivers are agnostic of political boundaries, so changes in water availability in a basin that straddles such a boundary can affect multiple nations at once. In this project we evaluate an ensemble of newly released global climate model (GCM) simulations from the Coupled Model Intercomparison Project Phase 6 (CMIP6), investigating the global impact of climate change on surface water availability. We evaluate these projected changes across river basins, asking whether basins respond uniformly or whether transboundary river basins will experience greater inequity in water availability. We perform the analysis on the Pangeo platform, using CMIP6 data housed on Google Cloud. We validate the results against ERA5, a global reanalysis product that serves as a gridded observational dataset at resolutions and spatial extents appropriate for comparison with GCM outputs. For example, the mean annual runoff from this dataset for the period 1985-2014 is shown in the figure at right. Ultimately, we provide an analysis of changes in water availability in transboundary river basins, yielding a global study of projected climate change impacts on international water security.
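A climatology of the kind described above is a one-line reduction in xarray, the core library of the Pangeo stack. The sketch below builds a small synthetic monthly runoff field standing in for a CMIP6 `mrro` variable (in practice the data would be opened from the Google Cloud CMIP6 catalog) and computes the 1985-2014 mean annual runoff; all sizes and values are made up.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for a CMIP6 monthly runoff field ("mrro"); a real
# workflow would open this from the cloud-hosted CMIP6 zarr stores.
time = pd.date_range("1980-01", "2020-12", freq="MS")
lat = np.linspace(-60, 60, 5)
lon = np.linspace(0, 350, 6)
rng = np.random.default_rng(1)
mrro = xr.DataArray(
    rng.gamma(2.0, 1.0, size=(time.size, lat.size, lon.size)),
    dims=("time", "lat", "lon"),
    coords={"time": time, "lat": lat, "lon": lon},
    name="mrro",
)

# Mean runoff over the 1985-2014 reference period, the same kind of
# climatology used for comparison against ERA5.
clim = mrro.sel(time=slice("1985", "2014")).mean("time")
print(clim.shape)
```

The same `.sel(...).mean(...)` pattern scales unchanged to full-resolution, multi-model ensembles when the arrays are backed by dask.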
British Justifications for Internment without Trial: NLP Approaches to Analyzing Government Archives
Project Lead: Sarah Dreier, UW Department of Political Science and Paul G. Allen School of Computer Science & Engineering Postdoctoral Fellow
eScience Liaison: Jose Hernandez
How do liberal democracies justify policies that violate the rights of targeted citizens? When facing real or perceived national security threats, democratic states routinely frame certain citizens as “enemies of the state” and subsequently undermine those citizens’ freedoms and liberties. This Incubator project uses natural language processing (NLP) techniques on digitized archive documents to identify and model how United Kingdom government officials internally justified their decisions to intern un-convicted Irish Catholics without trial during the “Troubles” in Northern Ireland. This project uses three NLP approaches—dictionary methods, word vectors, and adaptations of pre-trained models—to examine if/how government justifications can be identified in text. Each approach is based on, validated by, and/or trained on hand-coded annotation and classification of all justifications in the corpus (the “ground truth”), which was completed prior to the start of this project. In doing so, this project seeks to advance knowledge about government human rights violations and to explore the use of NLP on rich, nuanced, and “messy” archive text. More broadly, this project models the promise of combining archive text, qualitative coding, and computational techniques in social science. This project is funded by NSF Award #1823547; Principal Investigators: Emily Gade, Noah Smith, and Michael McCann.
This project yielded four products: cleaned text corpora, binary and multi-class machine learning text classifiers, word embeddings based on digitized archive text, and a shallow neural network model for predicting text classification.
First, we prepared the qualitatively coded material into datasets for descriptive visualization and NLP analysis, including: a complete archive corpus of all digitized text from more than 7,000 archive pages, a corpus of all ground-truth incidents of government justifications for internment without trial, and graphic representations of justification categories and frequencies over time.
The words most similar in vector space to three substantively important words demonstrate that word embeddings trained on our archive corpus are meaningful. For example, “Faulkner” (i.e., Northern Ireland Prime Minister Brian Faulkner) is most similar to other politicians involved in this case (e.g., Irish Prime Minister Jack Lynch).
Second, we explored training a machine-learning model, using binary and multi-class text classification, to classify a specific justification entry into its appropriate category. We used a “bag of words” approach, which trains a classifier based on the presence and frequency of words in a given entry. A simple binary model classified justification entries relatively well, achieving 75-90% accuracy among the most prominent categories. The unigram and bigram terms most associated with each category’s binary classification also contributed to our substantive knowledge about our classification categories. Next, we assessed and tuned a more sophisticated multi-class classifier to distinguish among six justification categories. The best-performing machine learning classifier—a logistic regression model based on stemmed unigrams (excluding English stopwords and terms that occurred fewer than 10 times in the corpus)—classified justification entries into six pre-determined categories with approximately 43% accuracy, well above the roughly 17% expected from random guessing among six categories. These classifiers suggest that our justification corpus contains signals for training machine learning tasks, despite the imperfections associated with digitized archive text.
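A bag-of-words pipeline of this shape is compact in scikit-learn. The sketch below uses a tiny invented corpus and two invented categories; the real model used hand-coded entries, six categories, stemmed unigrams (stemming, e.g. via NLTK, is omitted here), and `min_df=10`, which would be far too aggressive for a four-document toy.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus with made-up justification entries and labels.
entries = [
    "detention is necessary for public security",
    "internment protects the public from violence",
    "the minister must answer to parliament",
    "parliament debated the minister's decision",
]
labels = ["security", "security", "accountability", "accountability"]

# Bag-of-words features (English stopwords removed) feeding a
# logistic regression classifier, mirroring the approach in the text.
model = make_pipeline(
    CountVectorizer(stop_words="english", ngram_range=(1, 1), min_df=1),
    LogisticRegression(max_iter=1000),
)
model.fit(entries, labels)
print(model.predict(["security forces detained suspects"]))
```

On real data, accuracy would be reported on a held-out split, and the per-class coefficients of the fitted `LogisticRegression` give exactly the "most associated terms" interpretation described above.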
Finally, we developed a deep-learning approach to predicting a justification entry’s classification (Jurafsky and Martin 2019). This allowed us to leverage a given word’s semantic and syntactic meaning (using pre-trained word embeddings) to aid our classification task. Because we expected our text data to contain nuances and context-specific idiosyncrasies, we developed word embeddings based on our complete archive-based corpus. These embeddings proved to be meaningful and informative, despite our imperfect data—which is relatively limited in size and contains considerable errors, omissions, and duplication (See Figure 2). Using these archive-based word embeddings, we built a shallow Convolutional Neural Network (CNN) to predict a sentence-based justification entry’s classification (Kim 2014). Our preliminary CNN—which, at the time of this writing, is over-fitted to the training data and only achieves around 30% accuracy when classifying testing data—serves as the basis for further fine-tuning.
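The Kim (2014) architecture can be sketched as a forward pass in plain NumPy: embed the tokens, convolve filters over word windows, max-pool over time, and apply softmax. All sizes here (vocabulary, embedding dimension, filter count and width) are illustrative, the six classes simply mirror the six justification categories, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, not the project's actual ones.
vocab, d, n_filters, width, n_classes = 100, 16, 8, 3, 6
E = rng.normal(size=(vocab, d))             # (pre-trained) word embeddings
W = rng.normal(size=(n_filters, width, d))  # convolution filters
U = rng.normal(size=(n_filters, n_classes)) # output projection

def predict(token_ids):
    """Forward pass of a one-layer CNN sentence classifier."""
    x = E[token_ids]                                   # (T, d)
    T = len(token_ids)
    conv = np.array([
        [np.tanh(np.sum(W[f] * x[t:t + width])) for t in range(T - width + 1)]
        for f in range(n_filters)
    ])                                                 # (n_filters, T-width+1)
    pooled = conv.max(axis=1)                          # max-over-time pooling
    logits = pooled @ U
    e = np.exp(logits - logits.max())
    return e / e.sum()                                 # class probabilities

probs = predict([4, 17, 23, 8, 56])
print(probs.shape, probs.sum())
```

Training (with dropout and early stopping) is what the over-fitting noted above would be addressed through; a framework such as PyTorch would supply the backward pass this sketch omits.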
Together, these products lay the groundwork for analyzing government justifications for internment, continuing to develop machine-learning approaches to identifying government justifications for human rights violations, and modeling how NLP techniques can aid the analysis of real-world political or government-related material (and for archived texts more generally).
Jurafsky, Daniel and James H. Martin. 2019. “Neural Networks and Neural Language Models.” In Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Draft of October 2, 2019. Available at: http://web.stanford.edu/~jurafsky/slp3/ed3book.pdf.
Kim, Yoon. 2014. “Convolutional Neural Networks for Sentence Classification.” arXiv:1408.5882v2 [cs.CL] 3 Sep 2014.
Automated monitoring and analysis of slow earthquake activity
Project Lead: Ariane Ducellier, UW Department of Earth & Space Sciences PhD Candidate
eScience Liaison: Scott Henderson
Number and location of low-frequency earthquakes recorded on April 13th 2008 in northern California.
Low-frequency earthquakes (LFEs) are small-magnitude earthquakes, with typical magnitudes less than 2 and reduced amplitudes at frequencies greater than 10 Hz relative to ordinary small earthquakes. Their occurrence is often associated with tectonic tremor and slow slip events along the plate boundary in subduction zones and, occasionally, transform fault zones. They are usually grouped into families of events, with all the earthquakes of a given family originating from the same small patch on the plate interface and recurring more or less episodically in a bursty manner. Currently, many research papers analyze seismic data for a finite period of time and produce a catalog of low-frequency earthquakes for that period. However, there is little continuous monitoring of these phenomena.
We are currently using data from seismic stations in northern California to detect low-frequency earthquakes and produce a catalog covering the period 2007-2019. However, the seismic stations we are using remain installed and record new data every day. We therefore want to develop an application that will carry out the analysis we have so far been conducting offline, automatically and continuously, on data recorded during 2020 and beyond. An increase in low-frequency earthquake activity can then be detected and reported as soon as it starts.
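At its core, this kind of detection is a matched-filter computation: cross-correlate a known family template against the continuous record and trigger on high normalized correlation. The sketch below uses a synthetic wavelet and noise rather than real seismograms; the wavelet, sampling, and 0.8 threshold are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic continuous record with two buried copies of an LFE "template";
# real monitoring would stack correlations of family templates across a
# station network against each new day of data.
template = np.sin(2 * np.pi * 3 * np.linspace(0, 1, 100))  # toy wavelet
trace = rng.normal(0, 0.3, 5000)
for onset in (1200, 3700):
    trace[onset:onset + 100] += template

def norm_xcorr(trace, template):
    """Normalized cross-correlation of a template against a longer trace."""
    nt = len(template)
    t0 = (template - template.mean()) / template.std()
    out = np.empty(len(trace) - nt + 1)
    for i in range(len(out)):
        w = trace[i:i + nt]
        out[i] = np.dot(t0, (w - w.mean()) / w.std()) / nt
    return out

cc = norm_xcorr(trace, template)
detections = np.flatnonzero(cc > 0.8)   # samples exceeding the threshold
print(detections)
```

A production version would vectorize the correlation (e.g. FFT-based), run per station, and sum correlations across the network before thresholding, but the trigger logic is the same.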
Developing a relational database for acoustic detections and locations of baleen whales in the Northeast Pacific Ocean
Project Lead: Rose Hilmo, UW School of Oceanography PhD Candidate
eScience Liaison: Joseph Hellerstein
The health and recovery of whale populations are a major concern in ocean ecosystems. This project uses data science to improve the monitoring of whale populations, an ongoing area of research in ocean ecology.
Lower) Spectrogram showing 20 minutes of repeating blue whale B-calls stereotyped by a 10 second downsweep centered on 15 Hz. Upper) Plot showing output of our B-call spectrogram cross-correlation detector (blue) and peak detections (orange x’s) of calls.
Our focus is acoustic monitoring, a very effective tool for monitoring the presence and behavior of whales in a region over extended time periods. Ocean bottom seismometers (OBSs) that are used to record earthquakes on the seafloor can also be used to detect blue and fin whale calls. We take advantage of a large 4-year OBS deployment spanning the coast of the Pacific northwest to investigate spatial and temporal trends in fin and blue whale calling, data that provide an unprecedented scale for whale monitoring. Our main research question is: How does whale call activity vary in time (e.g., seasonally and annually) and space in the Northwest Pacific? Additionally, how does call variability relate to other parameters such as environmental conditions and anthropogenic noise such as ship noise and seismic surveys? This information will provide considerable insight into whale populations and ultimately into ocean ecology.
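A spectrogram cross-correlation detector of this general kind can be sketched with SciPy: build a spectrogram of the record, correlate the in-band energy of a call template along time, and pick peaks. Everything below is synthetic (sample rate, call shape, amplitudes, thresholds); it is an illustration of the idea, not the lab's actual MATLAB implementation.

```python
import numpy as np
from scipy import signal

fs = 100.0                                   # Hz; hypothetical sample rate
t_call = np.arange(0, 10, 1 / fs)
# Stand-in for a blue whale B-call: a 10 s downsweep centered near 15 Hz.
call = signal.chirp(t_call, f0=16, t1=10, f1=14)

# Ten minutes of synthetic noise with three calls inserted.
rng = np.random.default_rng(3)
trace = rng.normal(0, 1, int(600 * fs))
for onset_s in (60, 250, 480):
    i = int(onset_s * fs)
    trace[i:i + call.size] += call

# Spectrograms of the record and the template, then cross-correlate the
# band-limited energy time series and pick peaks.
f, tt, S = signal.spectrogram(trace, fs=fs, nperseg=256, noverlap=192)
_, _, K = signal.spectrogram(call, fs=fs, nperseg=256, noverlap=192)
band = (f >= 12) & (f <= 18)                 # restrict to the call band
score = signal.correlate(S[band].mean(axis=0), K[band].mean(axis=0), mode="same")
score -= np.median(score)                    # remove the noise baseline
score /= score.max()
peaks, _ = signal.find_peaks(score, height=0.5, distance=31)  # >=~20 s apart
print(np.round(tt[peaks]))
```

The `score` trace and `peaks` here play the roles of the blue detector output and orange peak detections in the figure above.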
Over the past decade, our lab group has implemented many methods of blue and fin whale acoustic detection and location. This has generated large volumes of data on temporal and spatial calling patterns of these species in the Northeast Pacific. Our main goal for the data science incubator is to build and publish a SQL relational database of our compiled whale data. This will not only improve our own ability to work with our current data and easily integrate new data, but will also allow others in our community to utilize our framework and incorporate their own data. Additionally, we will re-implement our whale detection codes (currently in MATLAB) in Python. These codes will be open source (on GitHub), make use of the relational database, and incorporate software-engineering best practices. It is our hope that other researchers will apply our methods to study fin and blue whales using large OBS deployments in other key ecological regions such as Alaska, Hawaii, and Bransfield Strait (Antarctica).
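As a sketch of what such a relational schema might look like, the snippet below uses SQLite for portability; the table and column names, station ID, and sample values are illustrative guesses, not the project's final design.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE station (
    station_id TEXT PRIMARY KEY,
    latitude   REAL NOT NULL,
    longitude  REAL NOT NULL,
    depth_m    REAL
);
CREATE TABLE detection (
    detection_id INTEGER PRIMARY KEY,
    station_id   TEXT NOT NULL REFERENCES station(station_id),
    species      TEXT NOT NULL,          -- 'blue' or 'fin'
    call_type    TEXT,                   -- e.g. 'B-call'
    time_utc     TEXT NOT NULL,          -- ISO 8601
    peak_xcorr   REAL                    -- detector score at the peak
);
""")
con.execute("INSERT INTO station VALUES ('FN07A', 46.9, -124.9, 1250.0)")
con.executemany(
    "INSERT INTO detection (station_id, species, call_type, time_utc, peak_xcorr) "
    "VALUES (?, ?, ?, ?, ?)",
    [("FN07A", "blue", "B-call", "2013-09-14T03:12:00", 0.81),
     ("FN07A", "fin", "20Hz", "2013-09-14T03:15:30", 0.77)],
)
# Example seasonal query: calls per species per month.
rows = con.execute("""
    SELECT species, strftime('%Y-%m', time_utc) AS month, COUNT(*)
    FROM detection GROUP BY species, month ORDER BY species
""").fetchall()
print(rows)
```

Keeping detections keyed to stations makes the seasonal and spatial questions above simple `GROUP BY` queries, and a join to an environmental-conditions table would support the noise and survey comparisons.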
Data analytics for demixing and decoding patterns of population neural activity underlying addiction behavior
Project Lead: Charles Zhou, Anesthesiology & Pain Medicine Staff Scientist
eScience Liaison: Ariel Rokem
In 2017, 1.7 million people in the United States reported addiction to opioid pain relievers (Center for Behavioral Health Statistics and Quality, 2017), while 47,000 individuals died from opioid overdose (CDC, 2018). Understanding the mechanisms of substance use disorders and developing targeted treatments are monumental challenges because the responsible brain regions are situated deep within the brain and possess highly diverse neuron populations and circuitry. To tackle this challenge, laboratories at UW’s NAPE (Neurobiology of Addiction, Pain, and Emotion) center utilize 2-photon calcium imaging to record from hundreds of neurons in animal deep brain structures simultaneously during drug-seeking behaviors. Briefly, this method combines high temporal and spatial resolution microscopy with cell-type-specific fluorescent neural activity readout to produce videos of brain activity in which single neurons can be resolved. As a result, for a given animal subject one can track over a thousand neurons over the course of several days of behavior and drug administration assays; however, sophisticated data analysis techniques to dissect how activity patterns across hundreds of thousands of neurons relate to behavior and addiction remain underdeveloped. The aim of this project is to apply novel statistical and machine learning analysis techniques to large-scale 2-photon calcium imaging data with respect to addiction-related behaviors and assays. The project plan is to first perform dimensionality reduction on the mouse calcium imaging videos using tensor component analysis (Williams AH et al., 2018, Neuron) and then to use those data to predict behavioral conditions using a convolutional neural network. Once the neural network is able to discriminate behavioral conditions, I can examine the spatial maps learned by the neural network nodes.
The overall significance of this project is to gain insight into spatially distributed neural patterns that underlie addiction behaviors, allowing for targeted development of drug addiction therapies.
Calcium imaging data with experimental condition labels will be used to train a convolutional neural network. Latent cell activation patterns will be identified from the model. Panel on the right represents a cartoon sample cell pattern identified by feature extraction.
We wrote and performed all analyses using Python Jupyter Notebooks and modularized Python scripts edited in PyCharm. We utilized the following Python packages: xarray for organizing the data, scikit-learn for dimensionality reduction, and matplotlib for data visualization.
The input data was a calcium imaging video (a 3D dataset with dimensions x pixels, y pixels, and frames/time) that had already undergone motion correction. Importantly, this recording was made in a mouse during a classical conditioning behavioral task. The task consisted of trials in which a tone was presented with a sucrose reward (CS+, rewarded) and trials with a different tone presented by itself (CS-). Further preprocessing involved extracting snippets of the video for each trial, sorting these trials by behavioral condition, and flattening the space dimensions (x and y).
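The snippet-extraction and flattening step might look like the following NumPy sketch; the movie size, trial onsets, and window length are made up, and the real movies are far larger.

```python
import numpy as np

# Toy motion-corrected movie: 200 frames of 8x8 pixels.
movie = np.random.default_rng(4).normal(size=(200, 8, 8))
trial_starts = [10, 60, 110, 160]       # frame index of each tone onset
trial_len = 30                          # frames kept per trial

# Extract per-trial snippets and flatten the two spatial dimensions,
# giving a (trials, frames, pixels) array ready for sorting by condition.
trials = np.stack([movie[s:s + trial_len] for s in trial_starts])
trials = trials.reshape(len(trial_starts), trial_len, -1)
print(trials.shape)
```

Sorting by condition is then just fancy indexing on the first axis, and averaging over it gives the trial-averaged frames-by-pixels matrices used in the PCA below.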
A) Eigenvectors were reshaped to the shape of the x-y coordinate space to show pixel weightings for each PC. Note the resemblance to neuron shapes. B) Trial-averaged activity traces transformed and plotted into a 3D space consisting of the top 3 principal components. Note the divergence of traces with respect to PC0. C) Similar to B, but trials in each condition were split into 5 groups to show evolution of activity across the course of the session.
Our primary analysis involved performing principal component analysis (PCA) to reduce dimensionality in the pixel dimension. The resulting principal components represent groups of pixels that share common temporal dynamics. To set up the PCA space, we fit a model using the trial- and condition-averaged data (dimensions were frames across the trial epoch by pixels). Upon inspection of the explained variance and the eigenvectors’ pixel weightings, we found the top three components explained about 30% of the variance and had spatial distributions matching biological neurons (Fig 2A). To compare how activity during the two conditions evolved across these top three principal components, we then transformed trial-averaged data for each condition using the aforementioned fitted model and projected the activity traces into the 3D space consisting of the top three principal components (Fig 2B). We found that the two trial conditions diverged substantially later in the trial (when the animal drank the reward in the CS+ condition) with respect to the first principal component. Finally, to examine finer temporal structure across the session, we split and binned the trials into 5 groups, performed the PCA transformation, and plotted the results in 3D space (Fig 2C). We observed a potential evolution of increased activity over the course of binned trials for the CS+ condition with respect to the first principal component.
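The fit-then-project pattern described above can be sketched with scikit-learn on a toy movie; the planted "neuron" (a block of co-active pixels with a shared ramp), the 8x8 grid, and the trial length are all invented for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)

# Toy trial- and condition-averaged data: 30 frames x 64 pixels, with a
# block of co-active pixels ramping up mid-trial (a cartoon neuron).
avg = rng.normal(0, 0.1, size=(30, 64))
ramp = np.linspace(0, 1, 30)
avg[:, 10:14] += ramp[:, None]          # shared temporal dynamics

pca = PCA(n_components=3).fit(avg)      # fit in the pixel dimension
traj = pca.transform(avg)               # (frames, 3): trajectory in PC space

# Eigenvectors reshaped to the 8x8 pixel grid recover the spatial footprint
# (the analogue of Fig 2A); the planted pixels carry the largest weights.
footprint = pca.components_[0].reshape(8, 8)
print(traj.shape, pca.explained_variance_ratio_[0])
```

Projecting each condition's trial-averaged data through the same fitted `pca.transform` gives the comparable 3D trajectories, since both conditions then share one coordinate system.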
While we were pleasantly distracted by the PCA method during the incubator, many more analyses can be performed as follow-up. Notably, TCA was mentioned in the project description; we started on this analysis, but initial results did not quite line up with the PCA results (not shown). Also, because differences between conditions could be visualized in the PCA space, the data may lend themselves nicely to machine learning classification. Overall, these results highlight the potential of dimensionality reduction techniques to provide insight into population spatio-temporal activity patterns in addiction-related paradigms.
Williams AH, Kim TH, Wang F, Vyas S, Ryu SI, Shenoy KV, Schnitzer M, Kolda TG, Ganguli S. Unsupervised Discovery of Demixed, Low-Dimensional Neural Dynamics across Multiple Timescales through Tensor Component Analysis. Neuron. 2018 Jun 27;98(6):1099-1115.e8. doi: 10.1016/j.neuron.2018.05.015. Epub 2018 Jun 7.