Meet the 2018 project fellows here.

Automatic Damage Annotation on Post-Hurricane Satellite Imagery

A graphic depicting satellite imagery of flooding after Hurricane Harvey

The red bounding box at the center contains a flooded building after Hurricane Harvey. Image credit: DigitalGlobe.

Project lead: Youngjun Choe, Ph.D., Assistant Professor, University of Washington, Industrial & Systems Engineering

Data scientist lead: Valentina Staneva

DSSG fellows: Sean Andrew Chen, Andrew Escay, Christopher Haberland, Tessa Schneider, An Yan

Project Summary: When a hurricane makes landfall, situational awareness is one of the most critical needs emergency managers face before they can respond to the event. To assess the situation and damage, current practice largely relies on emergency response crews and volunteers driving around the impacted area (also known as a windshield survey). Recently, drone-based aerial images and satellite images have helped improve situational awareness, but the process still relies on human visual inspection. These current approaches are generally time-consuming and/or unreliable during an evolving disaster.

The governing research question of the project is: Can a machine learning algorithm automatically annotate damage on post-hurricane satellite images? To answer the question, the project uses satellite imagery data on the Greater Houston area before and after Hurricane Harvey in 2017, and damage labels created by crowdsourcing. If the project results in a successful algorithm (which is trained to quickly detect ‘Flooded / Damaged Building’, ‘Flooded / Blocked Road’, and ‘Blocked Bridge’ on a satellite image for a new event), it will be an exciting technological innovation to improve situational awareness during the first response to hurricane-induced disasters.

Project Outcomes: The team successfully developed an algorithm that automatically creates bounding boxes around flooded/damaged buildings with a mean average precision of 0.47 on a testing dataset of the satellite images. This achievement was possible because the team found and processed auxiliary datasets (e.g., building footprints created by the Oak Ridge National Laboratory and Microsoft) to create appropriate training datasets for object detection models. Furthermore, the team found aerial imagery of the Greater Houston area taken after Hurricane Harvey by the National Oceanic and Atmospheric Administration, and extensive damage labels collected by the Federal Emergency Management Agency. These additional datasets enabled the team to develop another algorithm to detect flooded/damaged buildings with a mean average precision of 0.32 on a testing dataset of the aerial images. Building upon the project outcomes, the Disaster Data Science Lab will continue the research.
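For readers unfamiliar with the metric, the following is a minimal sketch of how average precision for bounding-box detection can be computed for a single class. It is not the team's evaluation code: the function names, the 0.5 overlap threshold, the greedy matching rule, and the non-interpolated averaging are all assumptions made for illustration. Mean average precision is this quantity averaged over classes.

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(predictions, ground_truth, iou_threshold=0.5):
    """Non-interpolated AP for one class.

    predictions: list of (confidence, box); ground_truth: list of boxes.
    The threshold and the greedy matching below are illustrative assumptions.
    """
    if not ground_truth:
        return 0.0
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    matched = set()
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, box) in enumerate(ranked, start=1):
        # Find the best still-unmatched ground-truth box for this prediction.
        best_iou, best_idx = 0.0, None
        for idx, gt_box in enumerate(ground_truth):
            if idx in matched:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            true_positives += 1
            # Precision is sampled at each rank where a true positive occurs.
            precision_sum += true_positives / rank
    return precision_sum / len(ground_truth)


# Made-up boxes: two damaged buildings, three detections (one spurious).
truth = [(0, 0, 2, 2), (10, 10, 12, 12)]
detections = [(0.9, (0, 0, 2, 2)), (0.8, (5, 5, 6, 6)), (0.7, (10, 10, 12, 12))]
ap = average_precision(detections, truth)
```

In this toy case the spurious second detection lowers the precision sampled at the third rank, so the score falls below 1 even though both buildings are eventually found.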

View the final presentation slide deck (PDF). Watch the video on YouTube. Related links: GitHub repo. View the project poster presented by Christopher Haberland at the Cascadia Innovation Corridor Conference in Vancouver, B.C. in October 2018.

Blog posts:

A quicker recipe by Chris Haberland

Reflections on reproducibility of data science projects by An Yan

Getting ground truth data through geospatial analysis by Sean Chen

Is garbage in always garbage out? by Andrew Escay

Machine learning can improve disaster response after hurricanes, but not alone by Tessa Schneider

Seattle Mobility Index Project

Graphic depicting the Seattle Mobility Index Project

The Seattle Mobility Index Project

Project leads: Stephen Barham, Data Scientist, and Alex Hagenah, Data Librarian, Seattle Department of Transportation

Data scientist leads: Joseph Hellerstein and Ryan Maas

DSSG fellows: Rebeca de Buen Kalman, Darius Irani, Hyeon Jeong Kim, Amandalynne Paullada, Woosub Shin

Project Summary: The Seattle Mobility Index Project measures transportation mode choice, affordability, and reliability at 450 Census Block Groups in Seattle and predicts mode share (the percentage of travelers using each transportation option) based on their mobility indices. The project represents a low-cost, granular approach to measuring and communicating mobility that can be replicated anywhere, similar to Redfin’s “Walk Score” and “Transit Score”, which measure walkability and transit options in proximity to any location. The Seattle Mobility Indices, however, are based on the ability to reach a “market basket” of destinations, or common travel points, derived from actual travel patterns, not solely based on locations nearby. Our indices vary with time of day and are sensitive to near- and long-term changes in the transportation system.

Using the Google Distance Matrix API, we will consume millions of distance and travel time estimates for driving, transit, walking, and bike travel. We will also access aggregated travel pattern information from the Puget Sound Regional Council Household Travel survey (see links below) to validate and tune our approach. We expect to complete the project in three distinct steps:

  1. Market Basket of Destinations. We will refine an algorithm that identifies a “market basket” of destinations relevant to people who travel in Seattle. The basket may include collections of trips to nearby points of interest and activity centers that are specific to each origin, and a collection of trips to citywide destinations that are the same for all starting points. The basket algorithm is a low-cost approach to creating a transportation origin-destination model.
  2. Mobility Indices. We will analyze travel from each Census Block Group to the Block Group’s basket of destinations and develop scalable algorithms that return the following indices:
    • Mode Choice: the number of modes available to reach the basket of travel destinations, within designed parameters.
    • Affordability: the relative cost to reach the basket of travel destinations, based on the costs of the least expensive modes and the costs of the fastest modes.
    • Reliability: measurements of actual travel times versus optimal times and the amount of travel that exceeds percentile thresholds. Travel time reliability algorithms will be applied to data that has been collected over a period of time.
  3. Mode Share Predictions. We will attempt to model and predict the probability that a traveler will use a single occupancy vehicle and other modes given the Mode Choice, Affordability, and Reliability scores for their location.
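The three steps above all start from travel estimates returned by the Google Distance Matrix API. As a rough sketch of what one request per travel mode could look like, the snippet below only builds the request URLs and does not send them. The helper name `build_request_url`, the sample coordinates, and the placeholder key are assumptions for illustration and are not taken from the project's code.

```python
from urllib.parse import urlencode

# Public JSON endpoint of the Distance Matrix API.
DISTANCE_MATRIX_URL = "https://maps.googleapis.com/maps/api/distancematrix/json"
# The API's names for the four modes the project mentions (bike is "bicycling").
TRAVEL_MODES = ("driving", "transit", "walking", "bicycling")


def build_request_url(origins, destinations, mode, api_key):
    """Build one request URL; origins and destinations are (lat, lon) pairs.

    Hypothetical helper: it only assembles the query string.
    """
    if mode not in TRAVEL_MODES:
        raise ValueError(f"unsupported mode: {mode}")
    params = {
        # Multiple locations are separated by a pipe character.
        "origins": "|".join(f"{lat},{lon}" for lat, lon in origins),
        "destinations": "|".join(f"{lat},{lon}" for lat, lon in destinations),
        "mode": mode,
        "key": api_key,
    }
    return f"{DISTANCE_MATRIX_URL}?{urlencode(params)}"


# Illustrative origin and destination; "YOUR_API_KEY" is a placeholder.
urls = {
    mode: build_request_url(
        [(47.6062, -122.3321)], [(47.6205, -122.3493)], mode, "YOUR_API_KEY"
    )
    for mode in TRAVEL_MODES
}
```

Issuing one request per mode for each origin and basket of destinations is what makes the number of estimates grow into the millions, so a real pipeline would also need batching, quota handling, and caching, none of which is shown here.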

Seattle is entering an expanded era of intense public and private construction projects that transportation planners have called the “Period of Maximum Constraint.” For the next 5 to 10 years, measuring the ability to drive, walk, bike, and use transit will be critical to mitigating the impacts. This research is particularly important to the City’s race and social justice equity programs because it will enable us to identify where geographic and time-of-day disparities in mobility exist and quantify how they are impacted by changes in the transportation system.

The mobility indices are a key component of the Seattle Department of Transportation’s Strategic Data Initiative and performance metrics that enable the City to drive outcomes, make decisions, and move our work from being project driven to outcome driven. The indicators will be baselined, tracked, and used to communicate the status and health of the transportation system.

Project Outcomes: The project delivered a software package that processes transportation indices for mode choice, affordability, and reliability. These mobility indices differ from current solutions because they are based on where and when people travel, not just what is located in close proximity. The indices will be used by the Seattle DOT and the community to understand the transportation system, support collaboration, and identify mobility equity challenges. The resolution supported by this project is such that the City can create an analytical baseline, analyze performance at a granular level, and understand how small and large changes to the transportation system impact mobility.
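To make the idea of the indices more concrete, here is a minimal sketch of two of them. It is not the delivered software package: the function names, the travel-time budget, the nearest-rank percentile, and the exact formulas are assumptions chosen only to illustrate "modes available within designed parameters" and "actual travel times versus optimal times".

```python
import math


def mode_choice_index(minutes_by_mode, budget_minutes):
    """Average number of modes per destination that fit within a time budget.

    minutes_by_mode maps a mode name to travel times (minutes) to each
    destination in the basket, in the same order for every mode.
    The budget is a hypothetical "designed parameter".
    """
    n_destinations = len(next(iter(minutes_by_mode.values())))
    total = 0
    for d in range(n_destinations):
        total += sum(
            1 for times in minutes_by_mode.values() if times[d] <= budget_minutes
        )
    return total / n_destinations


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def reliability_index(observed_minutes, pct=95):
    """Ratio of a high-percentile observed travel time to the fastest one.

    A value of 1.0 means travel time never varies; larger values mean
    travelers must budget more time than the optimal trip requires.
    """
    optimal = min(observed_minutes)
    return percentile(observed_minutes, pct) / optimal


# Made-up travel times for a basket of two destinations.
basket_times = {"driving": [10, 20], "transit": [25, 50], "walking": [40, 90]}
choice = mode_choice_index(basket_times, budget_minutes=30)

# Made-up repeated observations of one trip collected over time.
observed = [20, 22, 21, 30, 25, 24, 23, 40, 26, 28]
reliability = reliability_index(observed, pct=95)
```

Because both functions take plain travel times as input, they can be recomputed for any time of day or after any change to the transportation system, which is the property the outcomes above emphasize.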

The project used machine learning techniques to develop traveler personas that shed light on the needs, experiences, and travel patterns of different groups of people. The personas methodology is used to reflect household characteristics in the mobility measurements, and also supports broader transportation planning efforts. Additionally, the project modeled drive-alone rates using only the new indices as machine learning features. This simple predictive model scored comparably to a similar approach that incorporates dozens of travel and household attributes.

To support collaboration and equity analysis, the project developed a stand-alone universal geocoding tool that can batch process geography information such as Block Group, neighborhood, Council District, and zip code from point coordinates. The Python package can encode 100,000 locations in approximately one minute.
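The core operation behind this kind of tool is a point-in-polygon lookup: given a coordinate, find the boundary that contains it. The sketch below shows the idea with a plain ray-casting test. It is an illustration under simplifying assumptions, not the project's package: the function names and the square "regions" are hypothetical, and a tool processing 100,000 locations per minute would likely rely on a geospatial library with spatial indexing rather than a linear scan.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def batch_geocode(points, regions):
    """Return the name of the containing region for each point, or None.

    regions maps a region name (e.g. a Block Group ID) to its polygon.
    """
    results = []
    for lon, lat in points:
        match = next(
            (
                name
                for name, polygon in regions.items()
                if point_in_polygon(lon, lat, polygon)
            ),
            None,
        )
        results.append(match)
    return results


# Two hypothetical square regions side by side, and three sample points.
regions = {
    "A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "B": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
labels = batch_geocode([(1, 1), (3, 1), (5, 5)], regions)
```

Running the same lookup against several boundary layers (Block Group, neighborhood, Council District, zip code) yields one label per layer for each point.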

The methods developed are reproducible, scalable, and can be conducted at a low cost to the City or other entities seeking similar results. Coding standards and design values of simplicity and modularity are built into the project so that the City’s Data Science Team can integrate it with their internal workflow, modify parameters, and add features.

View the final presentation slide deck (PDF). Watch the video on YouTube. View the poster that fellows Darius Irani and Woosub Shin presented at the West Big Data Innovation Hub’s All Hands Meeting in Boise, Idaho in September 2018.

Blog posts:

Parallel worlds of pangolin conservation and Data Science for Social Good by Hyeon Jeong Kim

Putting perspective into practice by Amandalynne Paullada

Data science for decision-making, data science through decision-making by Woosub Shin

Why data scientists should care about the social good by Darius Irani

Learning to code and coding to learn by Rebeca de Buen Kalman

Access to Out-of-School Opportunities and Student Outcomes

Project lead: Sivan Tuchman, Research Analyst, University of Washington, Bothell, Center on Reinventing Public Education

Data scientist leads: Jose Hernandez and Karen Lavi

DSSG fellows: Joe Abbate, Sreekanth Krishnaiah, Kellie MacPhee, Andrew Taylor, Haowen Zheng

Graphic depicting the Blueprint 4 Summer program for students

Blueprint 4 Summer offers families a way to find summer programs for students.

Project Summary: For students living in disadvantaged communities, accessing organizations or institutions that provide enrichment programs for the arts, sports, and tutoring, or social services such as counseling, meals, or medical care can be challenging. And while we know that experiences outside of the school day can be highly enriching to student academic and non-academic learning, they remain elusive to the students who need them the most. Financial, time, accessibility, and safety constraints can all limit the feasibility of a student going from school or home to an enrichment program or service provider. There are potential policy solutions that may be able to increase access for disadvantaged students to engage in these out-of-school opportunities, but we need to better understand what the highest impact lever might be.

The Center on Reinventing Public Education is currently working with ReSchool Colorado, a local organization that is working to reimagine education so that it is curated around individual student needs. To do this, ReSchool helps families design a multi-faceted education that enriches and supports individual students, which includes wraparound and community-based services. They utilize learner advocates, who help families navigate educational options, transportation, and other resources they may need or want. Our goal is to engage in an iterative process with ReSchool and our DSSG team to inform their work around summer opportunities through the “Blueprint 4 Summer” initiative, as well as their year-round support services to families, so they can curate personalized education for their students.

To begin this work, we would like to explore the following questions:

  1. What is the relationship between access to out-of-school opportunities and student outcomes (academic, behavioral, other)?
  2. How does crime moderate this relationship?
  3. What is the variation in these relationships by student subgroups?

Our data from Denver Public Schools includes enrollment data, grade, gender, race/ethnicity, disability and English learner status for every K-12 student in the years 2011-’12 to 2017-’18. Outcomes of standardized tests (including end-of-course exams), along with data on discipline (in-school suspension, out-of-school suspension, and expulsion), graduation, and attendance are available for 2011-’12 through 2016-’17. These data will make it possible to do various subgroup analyses. Data on crime and out-of-school opportunities from Denver’s Open Data Catalog, as well as ReSchool’s Blueprint 4 Summer catalog, will give our DSSG team significant data to work with so they can inform the work that ReSchool and others in the City of Denver are doing to improve the educational opportunities available to students.

Project Outcomes: The primary goal of the Out-of-School Resources project was to provide ReSchool Colorado with a resource and analysis to understand the supply and demand of out-of-school programs in Denver, Colorado, as well as how student demographics relate to access to these programs.

The first outcome of the project is a Shiny app that enables ReSchool to view the data from their Blueprint4Summer online platform.  Using Google Analytics, the DSSG team was able to map various datasets, including Census, Open Denver, and Denver Public Schools, so that ReSchool can see the spatial relationship between programs that are offered and neighborhood characteristics.  Along with the mapping tool for viewing data, the Shiny app also allows ReSchool to download all the data at the neighborhood level. In addition, based on the data selected in the mapping feature, the Shiny app provides graphs and charts that can be easily downloaded and used for reports.

An additional outcome for this project was data analysis to determine the correlation between access to out-of-school programs and student characteristics. The first step required the team to develop an “Access Index” that could measure each Census block group’s concentration of programs as well as the diversity of those programs. This access index was then used to examine correlations with demographic characteristics. All of the analysis conducted was then compiled in a report for ReSchool, which will also be turned into an academic publication.

Finally, knowing how useful this work would be, the team ensured reproducibility through mark-ups and documentation of the programs on GitHub. The compilation of work done by the team is completely publicly accessible so that anyone, especially CRPE, can reproduce, update, and replicate it in other cities.
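The Access Index described above combines two ideas: how many programs a block group has, and how varied they are. The following is a minimal sketch of one way such components could be computed. The team's actual formula is not given here, so the per-1,000-residents rate, the normalized Shannon entropy, the category names, and the function names are all assumptions for illustration.

```python
import math
from collections import Counter


def concentration(program_categories, population):
    """Programs per 1,000 residents (hypothetical normalization)."""
    if population <= 0:
        return 0.0
    return 1000 * len(program_categories) / population


def diversity(program_categories, all_categories):
    """Shannon entropy of program categories, scaled to the range 0..1.

    0 means every program is of one type; 1 means programs are spread
    evenly across every category in all_categories.
    """
    total = len(program_categories)
    if total == 0 or len(all_categories) < 2:
        return 0.0
    counts = Counter(program_categories)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(all_categories))


def access_components(program_categories, population, all_categories):
    """Return both components for one block group without merging them."""
    return {
        "concentration": concentration(program_categories, population),
        "diversity": diversity(program_categories, all_categories),
    }


# Made-up block group: four programs, 2,000 residents, four possible categories.
categories = ["arts", "sports", "tutoring", "stem"]
block_group = access_components(
    ["arts", "arts", "sports", "tutoring"], 2000, categories
)
```

The two components are returned separately because how to weight them against each other is a design decision that this sketch does not attempt to settle.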

View the final presentation slide deck (PDF). Watch the video on YouTube. View the poster presented by Kellie MacPhee at the Cascadia Innovation Corridor Conference in Vancouver, B.C. in October 2018. View the poster presented by data scientists Karen Lavi and Jose Hernandez at the West Big Data Innovation Hub’s All Hands Meeting in Boise, Idaho in September 2018, and the poster presented by fellows Sreekanth Krishnaiah and Haowen Zheng at the Association for Education Finance and Policy’s 44th Annual Conference in Kansas City, Missouri in March 2019.

Blog posts:

“The enrichment gap: the educational inequity that nobody talks about” by Sivan Tuchman and Travis Pillow, found on the Center on Reinventing Public Education

From the classroom to the real-world: using data science to approach inequality by Haowen Zheng

Looking towards data science for better educational outcomes by Sreekanth Krishnaiah

Is R or Python programming important for policy analysts? by Andrew Taylor

Being right by Kellie MacPhee

How to learn when things are obvious by Joe Abbate