Deer Fear: Using Accelerometers and Video Camera Collars to Understand if Wolves Change Deer Behavior
Project Lead: Apryle Craig, UW Department of Environmental & Forest Sciences PhD Candidate
eScience Liaison: Valentina Staneva
Animal behavior can provide insight into underlying processes that drive population and ecosystem dynamics. Accelerometers are small, inexpensive biologgers that can be used to identify animal behaviors remotely. Tri-axial accelerometers measure an animal’s acceleration in each of three dimensions, frequently recording 10-100 measurements per second. These fine-scale data provide an opportunity to study nuanced behaviors, but have historically posed challenges for storage and analysis. Perhaps owing to those challenges, and to the rigorous calibration required to infer behavior from acceleration data, animal behavior researchers have been slow to adopt accelerometers. Calibration involves time-synchronizing behavioral observations with their associated accelerometer readings, which often necessitates the use of captive animals, surrogate species, or field observations on instrumented individuals. Alternatively, animal-borne video cameras may be used to directly calibrate or validate accelerometers. My goal is to use video from animal-borne cameras to assess the capacity of collar-mounted tri-axial accelerometers and machine learning to accurately classify foraging, vigilance, resting, and traveling behavioral states in free-ranging deer. Deer were collared in areas of Washington that were recolonized by wolves and areas without wolves. I hope to use the resulting behavioral classifications to determine whether wolf recolonization is changing deer behavior.
Historically, biologists have watched a representative individual of a species move, either in a laboratory or in the field, and identified movements that could be associated with behaviors of interest. They then use these movements to calculate features from acceleration records that they believe will align with those behaviors. However, this process requires a priori assumptions about species-specific movement patterns. We chose an approach designed to minimize the assumptions made about animal movements.
We started out by attempting to identify the simplest case: windows where the deer was engaging in just one behavior for the full 10 seconds of video. So, we removed all data where the deer was engaging in multiple behaviors. Next, we converted the acceleration data from the time domain to the frequency domain using a Fourier transform. By doing so, our algorithm could categorize signals based on frequency and amplitude, while ignoring when the signal occurred in time.
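The time-to-frequency conversion can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's actual code: the 25 Hz sampling rate, window length, and function names are assumptions.

```python
import numpy as np

def to_frequency_domain(accel, fs=25.0):
    """Convert a tri-axial acceleration window (n_samples, 3) to
    frequency-domain amplitudes. Taking the magnitude of the rFFT
    keeps amplitude per frequency while discarding phase, i.e. when
    each component occurred within the window. fs is the sampling
    rate in Hz (an assumed value, not the study's)."""
    spectrum = np.abs(np.fft.rfft(accel, axis=0))
    freqs = np.fft.rfftfreq(accel.shape[0], d=1.0 / fs)
    return freqs, spectrum

# Example: a 10-second window sampled at 25 Hz (250 samples, 3 axes)
window = np.random.default_rng(0).normal(size=(250, 3))
freqs, spectrum = to_frequency_domain(window)
print(freqs.shape, spectrum.shape)  # (126,) (126, 3)
```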
Figure 1: PCA of deer acceleration in the frequency domain, colored by behavior. As expected, traveling behavior (RunOrWalk, blue) is very distinct from bedded (pink). However, bedded and vigilance (purple) show a lot of overlap.
We split our labeled data into training, validation, and test datasets. We used principal component analysis (PCA) on our training data to find the principal components of the transformed acceleration. We included the first four PCs in a logistic regression model to predict behaviors from the signal. We used the validation data to determine how well our model could classify behaviors based on new acceleration data from the deer it was trained on.
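The PCA-plus-logistic-regression pipeline can be sketched with scikit-learn. The data here are synthetic stand-ins for the frequency-domain acceleration features, and all sizes and variable names are illustrative, not the study's:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: rows are 10-second windows, columns are
# spectral amplitudes; 4 behavior classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 60))
y = rng.integers(0, 4, size=300)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Fit PCA on the training data only, keep the first four components,
# then fit a logistic regression on those component scores.
pca = PCA(n_components=4).fit(X_train)
clf = LogisticRegression(max_iter=1000).fit(pca.transform(X_train), y_train)

# Evaluate on held-out validation windows.
val_acc = clf.score(pca.transform(X_val), y_val)
print(round(val_acc, 2))
```

Fitting the PCA only on training data avoids leaking validation information into the components.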
We created a confusion matrix comparing the model-predicted behaviors with the true behavioral states observed in the videos. The model correctly classified 3003 bedded behaviors and incorrectly classified 130; in 127 of those misclassifications, the model predicted foraging when the deer was bedded. The model correctly classified 2071 foraging behaviors; most of the miscategorized foraging behaviors (92) were labeled as bedded. The model correctly classified 375 traveling behaviors and mislabeled 287 of the traveling videos as foraging. The model incorrectly labeled most of the vigilance behaviors as bedded. This was somewhat expected, since visual inspection of the principal component analysis showed a lot of overlap between bedded and vigilance.
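A confusion matrix of this kind is a few lines with scikit-learn. The labels below are hypothetical examples, not the study's data; rows are true behaviors, columns are predictions, so off-diagonal cells show which behaviors the model confuses (e.g., vigilance predicted as bedded):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["bedded", "foraging", "travel", "vigilance"]
y_true = np.array(["bedded", "bedded", "foraging", "travel", "vigilance", "vigilance"])
y_pred = np.array(["bedded", "foraging", "foraging", "foraging", "bedded", "bedded"])

# Rows follow y_true, columns follow y_pred, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```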
Systems level analysis of metabolic pathways across a marine oxygen deficient zone
Project Lead: Gabrielle Rocap, UW School of Oceanography Professor
eScience Liaison: Bryna Hazelton
Marine Oxygen Deficient Zones (ODZs) are naturally occurring mid-layer oxygen-poor regions of the ocean, sandwiched between oxygenated surface and deep layers. In the absence of oxygen, microorganisms in ODZs use a variety of other elements as terminal electron acceptors, most notably oxidized forms of nitrogen, reducing the amount of bio-available nitrogen in the global marine system through the production of N2O and N2 gas. These elemental transformations mean that marine ODZs have an outsized contribution to global biogeochemical cycling relative to the volume of ocean they occupy. As ODZs are expanding as the ocean warms, understanding the metabolic potential of the microbial communities within them is key to predicting global elemental cycles. The goal of this project is to use existing metagenomic data from ODZ microbial communities to quantify the metabolic pathways utilized by microorganisms in differently oxygenated water layers. We are using a set of 14 metagenomic libraries from different depths within the ODZ water column representing different oxygen levels (oxic, hypoxic, anoxic, etc.) that have been assembled both individually and together. We will use the frequency of genes in microbial populations in each water sample to identify genetic signatures of different water regimes, with a particular focus on genes encoding enzymes mapped in the Kyoto Encyclopedia of Genes and Genomes (KEGG).
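A gene-frequency comparison of this kind can be sketched with pandas. The KEGG ortholog (KO) identifiers below are real nitrogen-cycle genes, but the counts and library labels are toy illustrations, not the project's data:

```python
import pandas as pd

# Toy table: KO gene counts per metagenomic library.
counts = pd.DataFrame(
    {"K00370": [5, 40, 90],    # narG, nitrate reductase
     "K02567": [2, 30, 70],    # napA, periplasmic nitrate reductase
     "K10944": [80, 20, 1]},   # amoA, ammonia monooxygenase
    index=["oxic", "hypoxic", "anoxic"],
)

# Normalize each library to relative frequencies so libraries of
# different sequencing depth can be compared across oxygen regimes.
freq = counts.div(counts.sum(axis=1), axis=0)
print(freq.round(2))
```

In a real analysis, normalization would more likely use single-copy marker genes or library size rather than the row sum of a handful of KOs.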
Predicting a drought with a flood of data: Evaluating the utility of data-driven approaches to seasonal hydrologic forecasts
Project Lead: Oriana Chegwidden, UW Civil & Environmental Engineering Department PhD Candidate and Staff Scientist
eScience Liaison: Nicoleta Cristea
Climate change is likely to exacerbate droughts in the future, compromising water availability around the world. Those changes in water availability may not be uniform across the land surface, with changes in precipitation, snowpack, and increased losses due to evapotranspiration. The resulting combined changes to surface water availability are an active area of research. These potential changes are of global significance, particularly in transboundary river basins: because earth systems and river basins are agnostic of political boundaries, changes in water availability in a basin that straddles a political boundary can be especially consequential. In this project we evaluate an ensemble of newly released global climate model (GCM) simulations from the Coupled Model Intercomparison Project Phase 6 (CMIP6), investigating the global impact of climate change on surface water availability. We evaluate these projected changes across river basins, assessing the extent to which river basins respond uniformly, or whether transboundary river basins will experience greater inequity in water availability. We perform the analysis on the Pangeo platform, using CMIP6 data housed on Google Cloud. We validate the results against ERA5, a global reanalysis product that serves as a gridded observational dataset available at resolutions and spatial extents appropriate for comparison with GCM outputs. For example, the mean annual runoff from this dataset for the period 1985-2014 is shown in the figure at right. Ultimately, we provide an analysis of changes in water availability in transboundary river basins, a global study of projected climate change impacts on international water security.
British Justifications for Internment without Trial: NLP Approaches to Analyzing Government Archives
Project Lead: Sarah Dreier, UW Department of Political Science and Paul G. Allen School of Computer Science & Engineering Postdoctoral Fellow
eScience Liaison: Jose Hernandez
How do liberal democracies justify policies that violate the rights of targeted citizens? When facing real or perceived national security threats, democratic states routinely frame certain citizens as “enemies of the state” and subsequently undermine those citizens’ freedoms and liberties. This Incubator project uses natural language processing (NLP) techniques on digitized archive documents to identify and model how United Kingdom government officials internally justified their decisions to intern un-convicted Irish Catholics without trial during the Northern Ireland “Troubles.” This project uses three NLP approaches—dictionary methods, word vectors, and adaptations of pre-trained models—to examine if/how government justifications can be identified in text. Each approach is based on, validated by, and/or trained on hand-coded annotation and classification of all justifications in the corpus (the “ground truth”), which was executed prior to the start of this project. In doing so, this project seeks to advance knowledge about government human rights violations and to explore the use of NLP on rich, nuanced, and “messy” archive text. More broadly, this project models the promise of combining archive text, qualitative coding, and computational techniques in social science. This project is funded by NSF Award #1823547; Principal Investigators: Emily Gade, Noah Smith, and Michael McCann.
This project yielded four products: cleaned text corpora, binary and multi-class machine learning text classifiers, word embeddings based on digitized archive text, and a shallow neural network model for predicting text classification.
First, we prepared qualitatively coded material into datasets for descriptive visualization and NLP analysis, including: a complete archive corpus of all digitized text from more than 7,000 archive pages, a corpus of all ground-truth incidents of government justifications for internment without trial, and graphic representations of justification categories and frequencies over time.
Words most similar in vector space to three substantively important words demonstrate that word embeddings trained on our archive corpus are meaningful. For example, “Faulkner” (i.e., Northern Ireland Prime Minister Brian Faulkner) is most similar to other politicians involved in this case (e.g., Irish Prime Minister Jack Lynch).
Second, we explored training a machine-learning model, using binary and multi-class text classification, to classify a specific justification entry into its appropriate category. We used a “bag of words” approach, which trains a classifier based on the presence and frequency of words in a given entry. A simple binary model classified justification entries relatively well, achieving 75-90% accuracy among the most prominent categories. The unigram and bigram terms most associated with each category’s binary classification also contributed to our substantive knowledge about our classification categories. Next, we assessed and tuned a more sophisticated multi-class classifier to distinguish among six justification categories. The best-performing machine learning classifier—a logistic regression model based on stemmed unigrams (excluding English stopwords and terms that occurred fewer than 10 times in the corpus)—classified justification entries into six pre-determined categories with approximately 43% accuracy, an improvement over chance. These classifiers suggest that our justification corpus contains signals suitable for training machine learning tasks, despite the imperfections associated with digitized archive text.
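A bag-of-words classifier of this kind can be sketched with scikit-learn. The entries and labels below are toy stand-ins, not the project's corpus; the real model also stemmed tokens and used min_df=10, whereas min_df=1 is used here only because the toy corpus is tiny:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical justification entries with hypothetical category labels.
entries = [
    "internment is necessary for public security",
    "security forces require emergency powers",
    "the legal basis for detention is reviewed",
    "courts examine the legal basis of arrests",
]
labels = ["security", "security", "legal", "legal"]

# English stopwords are removed and rare terms dropped via min_df,
# mirroring the reported classifier's preprocessing.
model = make_pipeline(
    CountVectorizer(stop_words="english", min_df=1),
    LogisticRegression(max_iter=1000),
)
model.fit(entries, labels)
print(model.predict(["emergency security measures"]))
```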
Finally, we developed a deep-learning approach to predicting a justification entry’s classification (Jurafsky and Martin 2019). This allowed us to leverage a given word’s semantic and syntactic meaning (using pre-trained word embeddings) to aid our classification task. Because we expected our text data to contain nuances and context-specific idiosyncrasies, we developed word embeddings based on our complete archive-based corpus. These embeddings proved to be meaningful and informative, despite our imperfect data—which is relatively limited in size and contains considerable errors, omissions, and duplication (See Figure 2). Using these archive-based word embeddings, we built a shallow Convolutional Neural Network (CNN) to predict a sentence-based justification entry’s classification (Kim 2014). Our preliminary CNN—which, at the time of this writing, is over-fitted to the training data and only achieves around 30% accuracy when classifying testing data—serves as the basis for further fine-tuning.
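The core of a Kim (2014)-style CNN — convolution over windows of word embeddings followed by max-over-time pooling — can be sketched in plain NumPy. All dimensions below are illustrative and the weights are untrained random values, so this shows only the forward pass, not the project's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 7-word sentence with 50-d "pre-trained" embeddings, and 10
# convolutional filters of width 3 (one feature map each).
sentence = rng.normal(size=(7, 50))        # (words, embedding_dim)
filters = rng.normal(size=(10, 3, 50))     # (n_filters, width, dim)

# Convolve each filter over every window of 3 consecutive words,
# then max-pool over time to get one feature per filter.
windows = np.stack([sentence[i:i + 3] for i in range(7 - 3 + 1)])  # (5, 3, 50)
feature_maps = np.einsum("wij,fij->wf", windows, filters)          # (5, 10)
pooled = feature_maps.max(axis=0)                                  # (10,)

# A final softmax layer over the pooled features scores the six
# justification categories.
W = rng.normal(size=(10, 6))
logits = pooled @ W
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape)  # (6,)
```

Max-over-time pooling makes the feature vector independent of sentence length, which is why this architecture handles variable-length justification entries.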
Together, these products lay the groundwork for analyzing government justifications for internment, continuing to develop machine-learning approaches to identifying government justifications for human rights violations, and modeling how NLP techniques can aid the analysis of real-world political or government-related material (and for archived texts more generally).
Jurafsky, Daniel and James H. Martin. 2019. “Neural Networks and Neural Language Models.” In Speech & Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Draft of October 2, 2019. Available at: http://web.stanford.edu/~jurafsky/slp3/ed3book.pdf.
Kim, Yoon. 2014. “Convolutional Neural Networks for Sentence Classification.” arXiv:1408.5882v2 [cs.CL] 3 Sep 2014.
Automated monitoring and analysis of slow earthquake activity
Project Lead: Ariane Ducellier, UW Department of Earth & Space Sciences PhD Candidate
eScience Liaison: Scott Henderson
Number and location of low-frequency earthquakes recorded on April 13th 2008 in northern California.
Low-frequency earthquakes (LFEs) are small-magnitude earthquakes, with typical magnitudes less than 2 and reduced amplitudes at frequencies greater than 10 Hz relative to ordinary small earthquakes. Their occurrence is often associated with tectonic tremor and slow slip events along the plate boundary in subduction zones, and occasionally transform fault zones. They are usually grouped into families of events, with all the earthquakes of a given family originating from the same small patch on the plate interface and recurring more or less episodically in a bursty manner. Currently, many research papers analyze seismic data for a finite period of time and produce a catalog of low-frequency earthquakes for that period. However, there is little continuous monitoring of these phenomena.
We are currently using data from seismic stations in northern California to detect low-frequency earthquakes and produce a catalog for the period 2007-2019. However, the seismic stations we are using are still installed and record new data every day. We therefore want to develop an application that automatically and continuously carries out, on data recorded during 2020 and beyond, the same analysis we have so far been conducting offline. An increase in low-frequency earthquake activity can then be detected and reported as soon as it starts.
LFEs detected in the last two months with the new application for an LFE family located in northern California.
We have created a Python package with the Python tool poetry and made it publicly available on GitHub. There, we have set up a workflow that runs every day, downloading the seismic data from three days prior, analyzing it, and finding the low-frequency earthquakes. The resulting catalog for that day is stored in a CSV file, which is then uploaded to Google Drive. The last step, which we are currently developing, is to download all the CSV files stored on Google Drive and use them to plot a figure of the low-frequency earthquake catalog.
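A daily workflow of this kind can be sketched as a scheduled GitHub Actions job. This is an illustrative configuration only: the workflow name, script names, and paths are assumptions, not the project's actual files.

```yaml
# .github/workflows/lfe-catalog.yml (hypothetical)
name: daily-lfe-catalog
on:
  schedule:
    - cron: "0 6 * * *"        # run once per day
jobs:
  detect:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install poetry && poetry install
      # Download seismic data from three days ago, detect LFEs,
      # write that day's catalog as a CSV, and upload it to Drive.
      - run: poetry run python detect_lfes.py --lag-days 3
      - run: poetry run python upload_to_drive.py catalog.csv
```

The three-day lag gives data centers time to make complete waveform archives available before the analysis runs.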
Developing a relational database for acoustic detections and locations of baleen whales in the Northeast Pacific Ocean
Project Lead: Rose Hilmo, UW School of Oceanography PhD Candidate
eScience Liaison: Joseph Hellerstein
The health and recovery of whale populations are a major concern in ocean ecosystems. This project uses data science to improve the monitoring of whale populations, an ongoing area of research in ocean ecology.
Lower) Spectrogram showing 20 minutes of repeating blue whale B-calls, each a stereotyped 10-second downsweep centered on 15 Hz. Upper) Plot showing the output of our B-call spectrogram cross-correlation detector (blue) and peak detections (orange x’s) of calls.
Our focus is acoustic monitoring, a very effective tool for monitoring the presence and behavior of whales in a region over extended time periods. Ocean bottom seismometers (OBSs) that are used to record earthquakes on the seafloor can also be used to detect blue and fin whale calls. We take advantage of a large 4-year OBS deployment spanning the coast of the Pacific northwest to investigate spatial and temporal trends in fin and blue whale calling, data that provide an unprecedented scale for whale monitoring. Our main research question is: How does whale call activity vary in time (e.g., seasonally and annually) and space in the Northwest Pacific? Additionally, how does call variability relate to other parameters such as environmental conditions and anthropogenic noise such as ship noise and seismic surveys? This information will provide considerable insight into whale populations and ultimately into ocean ecology.
Over the past decade, our lab group has implemented many methods of blue and fin whale acoustic detection and location. This has generated large volumes of data on temporal and spatial calling patterns of these species in the Northeast Pacific. Our main goal of the data science incubator is to build and publish a SQL relational database of our compiled whale data. This will not only improve our own ability to work with our current data and easily integrate new data, but will also allow others in our community to utilize our framework and incorporate their own data. Additionally, we will re-implement our whale detection codes (currently in MATLAB) in Python. These codes will be open source (on GitHub), make use of the relational database, and incorporate software engineering best practices. It is our hope that other researchers will apply our methods to study fin and blue whales using large OBS deployments in other key ecological regions such as Alaska, Hawaii, and the Bransfield Strait (Antarctica).
This project yielded two main deliverables: well-documented Python code for detection of whale calls in an accessible GitHub repository, and the framework of a SQL relational database for storing whale call and location data.
The Python package we developed during the incubator detects blue and fin whale calls recorded on ocean bottom seismometers. However, the code is flexible and can also be used to detect calls on other instruments such as pressure sensors and hydrophones. We use a spectrogram cross-correlation method, in which a kernel image matching the spectral dimensions of a call is constructed and then cross-correlated with a spectrogram of time-series data from an instrument. Areas where the kernel and spectrogram match produce peaks in the detection score, which are then recorded as calls (Figure 1). Call metrics of interest to whale ecologists, such as signal-to-noise ratio, call duration, and times, are stored in a pandas dataframe and then written to our database.
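The spectrogram cross-correlation idea can be sketched with NumPy and SciPy on synthetic data. Everything here is illustrative, not the package's actual implementation: the sampling rate, window parameters, call amplitude, and the simplified B-call (a 10 s, 16 to 14 Hz downsweep injected into noise) are all assumptions.

```python
import numpy as np
from scipy.signal import chirp, find_peaks, spectrogram

rng = np.random.default_rng(0)
fs = 100.0  # Hz; illustrative sampling rate

# Synthetic 20-minute trace with one loud, B-call-like 10 s downsweep
# (16 -> 14 Hz) injected into noise at t = 600 s.
t = np.arange(0, 1200, 1 / fs)
trace = rng.normal(size=t.size)
tt = np.arange(0, 10, 1 / fs)
call = chirp(tt, f0=16.0, t1=10.0, f1=14.0)
trace[60000:61000] += 5 * call

f, times, Sxx = spectrogram(trace, fs=fs, nperseg=256, noverlap=192)

# Kernel: the spectrogram of one clean call, matching its spectral shape.
_, _, kernel = spectrogram(call, fs=fs, nperseg=256, noverlap=192)

# Slide the kernel along the time axis, computing a normalized
# correlation at each offset; matching regions produce score peaks.
n = kernel.shape[1]
score = np.array([
    np.corrcoef(Sxx[:, i:i + n].ravel(), kernel.ravel())[0, 1]
    for i in range(Sxx.shape[1] - n)
])
peaks, _ = find_peaks(score, height=0.5)
print(times[np.argmax(score)])  # detection time near the injected call
```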
A central part of this project is the relational database. The database is structured using an information model that relates stations, channels, detections, and calls, and we developed a Python implementation of it. This structure was essential for two reasons. First, it streamlines data storage and use: referencing and filtering associated information from different instruments, calls, and whale locations for analysis is simple using the relational database tables. Second, the open-source nature of all tools used to build and access the database increases accessibility for others who want to use these data in their own research. As of the end of the incubator, we have filled the database only with test detections and locations from small portions of data. It will be filled more completely with 4 years of detection and location data from arrays of ocean bottom seismometers off the coast of the Pacific Northwest as we apply our methods at scale (Figure 2b).
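The information model can be sketched as SQL tables using Python's built-in sqlite3. The column names and sample values below are illustrative assumptions, not the project's exact schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE stations (
    station_id TEXT PRIMARY KEY,
    latitude REAL, longitude REAL, depth_m REAL
);
CREATE TABLE channels (
    channel_id INTEGER PRIMARY KEY,
    station_id TEXT REFERENCES stations(station_id),
    name TEXT, sample_rate REAL
);
CREATE TABLE detections (
    detection_id INTEGER PRIMARY KEY,
    channel_id INTEGER REFERENCES channels(channel_id),
    start_time TEXT, score REAL, snr REAL, duration_s REAL
);
CREATE TABLE calls (
    call_id INTEGER PRIMARY KEY,
    detection_id INTEGER REFERENCES detections(detection_id),
    species TEXT, call_type TEXT
);
""")

# Illustrative rows (coordinates and values are made up).
con.execute("INSERT INTO stations VALUES ('FN14A', 44.6, -124.6, 100.0)")
con.execute("INSERT INTO channels VALUES (1, 'FN14A', 'HHZ', 100.0)")
con.execute("INSERT INTO detections VALUES (1, 1, '2011-11-02T03:14:00', 0.8, 12.0, 10.0)")
con.execute("INSERT INTO calls VALUES (1, 1, 'blue', 'B')")

# Joins recover every call together with its station metadata.
row = con.execute("""
    SELECT s.station_id, c.species, c.call_type
    FROM calls c
    JOIN detections d ON c.detection_id = d.detection_id
    JOIN channels ch ON d.channel_id = ch.channel_id
    JOIN stations s ON ch.station_id = s.station_id
""").fetchone()
print(row)  # ('FN14A', 'blue', 'B')
```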
Figure 2: a) Histogram of monthly blue whale B-call detections on a subset of ocean bottom seismometers for 2011-2012 calling season. b) Map showing ocean bottom seismometers deployed off the Pacific Northwest between 2011-2015 with subset stations highlighted.
So far, we have only run our blue whale detector on one year of ocean bottom seismometer data from the large Cascadia Initiative array as a proof of concept. We did this to test the quality of our detector and to consult with whale experts about any additional useful call metrics we should add to our database. We will improve our detector and expand the database to include additional metrics, such as frequency measurements and background noise levels, before running the code on the full set of data.
Figure 2a shows a monthly histogram of total blue calls from our test dataset detected on a subset of 5 stations of interest. Blue whale call presence on these stations shows a strong seasonality, present only from late fall through early spring. Call counts vary by location. Calls on stations in shallow water near the coast (FN14A and M08A) peak in November, earlier in the season than the other stations in deep water which peak in December-January. Much deeper analysis of spatial and temporal trends in blue whale calling will be possible once our method is run on the full set of data.
Data analytics for demixing and decoding patterns of population neural activity underlying addiction behavior
Project Lead: Charles Zhou, Anesthesiology & Pain Medicine Staff Scientist
eScience Liaison: Ariel Rokem
In 2017, 1.7 million people in the United States reported addiction to opioid pain relievers (Center for Behavioral Health Statistics and Quality, 2017), while 47,000 individuals died from opioid overdose (CDC, 2018). Understanding the mechanisms of substance use disorders and developing targeted treatments are monumental challenges because the responsible brain regions are situated deep within the brain and possess highly diverse neuron populations and circuitry. To tackle this challenge, laboratories at UW’s NAPE (Neurobiology of Addiction, Pain, and Emotion) center utilize 2-photon calcium imaging to record from hundreds of neurons simultaneously in animal deep brain structures during drug-seeking behaviors. Briefly, this method combines high temporal and spatial resolution microscopy with cell-type-specific fluorescent neural activity readout to produce videos of brain activity in which single neurons can be resolved. As a result, for a given animal subject one can track over a thousand neurons over the course of several days of behavior and drug administration assays; however, sophisticated data analysis techniques to dissect how activity patterns across hundreds of thousands of neurons relate to behavior and addiction remain underdeveloped. The aim of this project is to apply novel statistical and machine learning analysis techniques to large-scale 2-photon calcium imaging data with respect to addiction-related behaviors and assays. The project plan is to first perform dimensionality reduction on the mouse calcium imaging videos using tensor component analysis (Williams et al., 2018, Neuron) and then to use those data to predict behavioral conditions using a convolutional neural network. Once the neural network is able to discriminate behavioral conditions, I can examine the spatial maps learned by the neural network nodes.
The overall significance of this project is to gain insight into spatially distributed neural patterns that underlie addiction behaviors, allowing for targeted development of drug addiction therapies.
Calcium imaging data with experimental condition labels will be used to train a convolutional neural network. Latent cell activation patterns will be identified from the model. Panel on the right represents a cartoon sample cell pattern identified by feature extraction.
We wrote and performed all analyses using Python Jupyter Notebooks and modularized Python scripts edited in Pycharm. We utilized the following Python packages: xarray for organizing the data, scikit-learn for dimensionality reduction, and matplotlib for data visualization.
The input data was a calcium imaging video (a 3D dataset with dimensions: x pixels, y pixels, and frames/time) that had already undergone motion correction. Importantly, this recording was made in a mouse during a classical conditioning behavioral task. The task consisted of trials in which a tone was presented with a sucrose reward (CS+, rewarded) and trials with a different tone by itself (CS-). Further preprocessing involved extracting snippets of the video for each trial, sorting these trials by behavioral condition, and flattening the space dimensions (x and y).
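The trial extraction and flattening step can be sketched with NumPy. The video size, trial start frames, and trial length below are toy stand-ins, not the actual recording's values:

```python
import numpy as np

# Toy motion-corrected video: 64x64 pixels, 1000 frames.
rng = np.random.default_rng(0)
video = rng.normal(size=(64, 64, 1000))   # (x, y, frames)
trial_starts = [100, 300, 500, 700]       # hypothetical trial-onset frames
trial_len = 80                            # frames per trial epoch

# Extract a snippet per trial and flatten the two space dimensions so
# each trial becomes (frames, pixels), ready for pixel-wise decomposition.
trials = np.stack([
    video[:, :, s:s + trial_len].reshape(64 * 64, trial_len).T
    for s in trial_starts
])
print(trials.shape)  # (4, 80, 4096)
```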
A) Eigenvectors were reshaped to the shape of the x-y coordinate space to show pixel weightings for each PC. Note the resemblance to neuron shapes. B) Trial-averaged activity traces transformed and plotted into a 3D space consisting of the top 3 principal components. Note the divergence of traces with respect to PC0. C) Similar to B, but trials in each condition were split into 5 groups to show evolution of activity across the course of the session.
Our primary analysis involved performing principal component analysis (PCA) to reduce dimensionality in the pixel dimension. The resulting principal components represent groups of pixels that share common temporal dynamics. To set up the PCA space, we fit a model using the trial- and condition-averaged data (dimensions were frames across the trial epoch by pixels). Upon inspecting the explained variance and the eigenvectors’ pixel weightings, we found the top three components explained about 30% of the variance and had spatial distributions matching biological neurons (Fig 2A). To compare how activity during the two conditions evolved across these top three principal components, we then transformed trial-averaged data for each condition using the fitted model and projected the activity traces into the 3D space of the top three principal components (Fig 2B). We found that the two trial conditions diverged substantially later in the trial (when the animal drank the reward in the CS+ condition) with respect to the first principal component. Finally, to examine finer temporal structure across the session, we split and binned the trials into 5 groups, performed the PCA transformation, and plotted them into 3D space (Fig 2C). We observed a potential evolution of increased activity over the course of binned trials for the CS+ condition with respect to the first principal component.
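These PCA steps can be sketched with scikit-learn on synthetic data. The dimensions and variable names are illustrative assumptions, and random data stands in for the real frames-by-pixels matrices:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_frames, n_pixels = 80, 4096  # illustrative trial epoch, 64x64 image

# Trial- and condition-averaged data: frames x pixels.
avg = rng.normal(size=(n_frames, n_pixels))
pca = PCA(n_components=3).fit(avg)

# Eigenvectors reshaped back to image space show each PC's pixel
# weightings (the neuron-like maps in Fig 2A).
pc_maps = pca.components_.reshape(3, 64, 64)

# Projecting a per-condition average into the top-3 PC space gives a
# 3D activity trajectory of the kind compared across CS+ and CS-.
cs_plus = rng.normal(size=(n_frames, n_pixels))
trajectory = pca.transform(cs_plus)
print(pc_maps.shape, trajectory.shape)  # (3, 64, 64) (80, 3)
```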
While we were pleasantly distracted by the PCA method during the incubator, many more analyses can be performed as follow-up. Notably, TCA was mentioned in the project description; we started on this analysis, but initial results did not quite line up with the PCA results (not shown). Also, because differences between conditions could be visualized with PCA, the data may lend themselves nicely to machine learning classification. Overall, these results highlight the potential of dimensionality reduction techniques to gain insight into population spatio-temporal activity patterns related to addiction-related paradigms.
Williams AH, Kim TH, Wang F, Vyas S, Ryu SI, Shenoy KV, Schnitzer M, Kolda TG, Ganguli S. Unsupervised Discovery of Demixed, Low-Dimensional Neural Dynamics across Multiple Timescales through Tensor Component Analysis. Neuron. 2018 Jun 27;98(6):1099-1115.e8. doi: 10.1016/j.neuron.2018.05.015. Epub 2018 Jun 7.