Mining Online Data for Early Identification of Unsafe Food Products


Project Lead: Elaine Nsoesie, Institute for Health Metrics and Evaluation, Department of Global Health, UW

Data Scientist Leads: Valentina Staneva (primary) and Joe Hellerstein (secondary)

DSSG Fellows: Michael Munsell, Kiren Verma, Cynthia Vint & Kara Woo

Problem: The Centers for Disease Control and Prevention estimates that each year in the United States 48 million people experience foodborne illness, 128,000 are hospitalized, and 3,000 die. The estimated economic cost of foodborne illness is more than $15.5 billion annually.

One possible solution: Early identification of unsafe food products would limit the occurrence of large foodborne disease outbreaks, thereby preventing illness and deaths, and limiting the health and economic impact on households, businesses and the food industry.

Goal: In this study, we aim to investigate whether text mining of food product reviews can aid in the identification and ranking of food safety issues. Specifically, we focus on assessing whether text mining of the millions of consumer reviews posted online can be useful for early identification of unsafe food products that have the potential to cause foodborne disease outbreaks. The two aims of this project are: (1) mine and integrate a large corpus of data posted online to understand trends and features in unsafe food product reports, and (2) develop a machine-learning/informatics approach for early identification of unsafe food products. The data sources considered for this project include recalls of food products from the FDA and USDA, and online product reviews.
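As a rough illustration of what a first text-mining pass over reviews could look like, the sketch below counts, per product, the reviews that mention an illness-related term. It is not the project's method: the term list, the `flag_products` function, and the `(product_id, text)` review format are all hypothetical, chosen only to make the idea concrete.

```python
import re
from collections import Counter

# Hypothetical keyword list; a real study would derive terms from the data.
ILLNESS_TERMS = {"sick", "vomiting", "nausea", "diarrhea", "moldy", "spoiled"}


def flag_products(reviews, min_reports=2):
    """Rank products by the number of reviews mentioning an illness term.

    `reviews` is an iterable of (product_id, review_text) pairs (assumed format).
    Only products with at least `min_reports` matching reviews are returned.
    """
    reports = Counter()
    for product_id, text in reviews:
        # Lowercase and split into alphabetic tokens before matching.
        words = set(re.findall(r"[a-z]+", text.lower()))
        if words & ILLNESS_TERMS:
            reports[product_id] += 1
    return [(p, n) for p, n in reports.most_common() if n >= min_reports]


reviews = [
    ("p1", "Made me sick after one bite"),
    ("p1", "Smelled spoiled, got nausea"),
    ("p2", "Tasty and fresh"),
    ("p3", "Arrived moldy"),
]
flagged = flag_products(reviews)
```

Requiring more than one matching review per product is a simple guard against a single complaint driving the ranking. A machine-learning approach, as described in aim (2), would replace the keyword match with a trained classifier.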

Stakeholders: Stakeholders include the general public, local and national public health entities, the Food and Drug Administration (FDA), and the United States Department of Agriculture (USDA).

Project webpage:

Use of ORCA data for improved transit system planning and operation

Map of average Seattle ORCA transfers per weekday

Project Leads: Mark Hallenbeck & Anat Caspi – CEE (Civil & Environmental Engineering), Taskar Center

Data Scientist Leads: Bernease Herman (primary) and Anthony Arendt (secondary)

DSSG Fellows: Carolina Johnson, Victoria Sass, Yiqin Shen & Sean Wang

Project Summary: Seven regional transportation agencies use a common electronic fare payment system, called ORCA – One Regional Card for All. When ORCA was initially conceived and adopted (it has been in use since June 2009), the regional expectation was that one advantage of moving from a simple visual card (a visual, paper monthly pass) to electronic media was that the resulting data would provide travel behavior information that could be used to improve regional transportation system planning and decision making. To date, that secondary purpose for ORCA data has not been routinely realized.

The UW has been granted access to nine weeks of ORCA data. Those nine weeks of data correspond to ~21,000,000 transit boardings, or roughly 15,500,000 transit trips – with ~5,500,000 transfers. These ORCA transaction records have already been linked to vehicle location data (AVL) to determine where those boardings took place. In addition, for about half of those trips we have estimated where the traveler exited the bus and, if they transferred, how long that transfer took. We have requested a second nine weeks of data. The new data will describe travel after the new rail stations opened, and with the low income fare card in full operation. The second nine weeks of data have yet to be received or processed.

We – and the transit and planning agencies of the region – are interested in a variety of computer science activities, social science analyses, and transportation analyses. For the analyses within each of these fields we have to be extremely conscious of the privacy of individuals who use ORCA cards, as well as the rights of the employers that often subsidize those cards.

For CS analyses, we are interested in better ways to process, store, and handle the very large data sets involved in these analyses. For example, to estimate boarding and alighting locations we have to search multi-gigabyte AVL files to find specific bus locations at specific times and dates, often without being able to process those lookups in a time-sequenced fashion, and often switching between the AVL files of different transit agencies for one trip made by one individual.
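One way to make out-of-order lookups like these cheap is to index the AVL records by vehicle and keep each vehicle's pings sorted by time, so that any single lookup is a binary search. The sketch below assumes a simplified record layout, `(vehicle_id, timestamp, lat, lon)` with timestamps in seconds, and a hypothetical `max_gap` tolerance. The real AVL files and the project's actual matching logic are not described in the source.

```python
from bisect import bisect_right
from collections import defaultdict


def build_avl_index(avl_records):
    """Group (vehicle_id, timestamp, lat, lon) records by vehicle, sorted by time."""
    index = defaultdict(list)
    for vehicle_id, timestamp, lat, lon in avl_records:
        index[vehicle_id].append((timestamp, lat, lon))
    for pings in index.values():
        pings.sort()
    return index


def locate_boarding(index, vehicle_id, tap_time, max_gap=120):
    """Return (lat, lon) of the latest ping at or before tap_time.

    Returns None if the vehicle is unknown or the nearest earlier ping is
    more than max_gap seconds old (hypothetical tolerance).
    """
    pings = index.get(vehicle_id, [])
    # The infinite sentinels place the probe after every ping sharing tap_time.
    pos = bisect_right(pings, (tap_time, float("inf"), float("inf")))
    if pos == 0:
        return None
    timestamp, lat, lon = pings[pos - 1]
    if tap_time - timestamp > max_gap:
        return None
    return (lat, lon)


records = [
    ("bus1", 100, 47.60, -122.33),
    ("bus1", 160, 47.61, -122.33),
    ("bus2", 100, 47.65, -122.30),
]
avl_index = build_avl_index(records)
location = locate_boarding(avl_index, "bus1", 150)
```

Because each lookup is independent, fare transactions can be processed in any order and across agencies, provided vehicle identifiers are unique across the combined index. For files too large for memory, the same sorted-by-vehicle-and-time layout could be kept in a database index instead.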

On the transportation side, we are interested in turning these data sets into decision support tools. We need better ways to store, visualize, summarize and query these data. For example, we would like to show graphically where large numbers of transfers take place, and then allow users to drill down into those locations to determine which routes transfer to which routes, how long it takes riders to transfer, and how far they walk to perform those transfers. Similarly, we have ~7,000,000 origin/destination pairs (where people board and alight from transit vehicles). We need ways to summarize and display where and when these trips take place. We are interested in describing how ridership patterns changed when the two new light rail stations opened, as well as demonstrating how the available data can be used for service and transit operations planning.
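The drill-down described above can be sketched as a grouped summary: for each stop and route pair, count the transfers and take the median wait. The field names (`stop`, `from_route`, `to_route`, `wait_min`) and the sample records are hypothetical, since the source does not describe the ORCA record schema.

```python
from collections import defaultdict
from statistics import median


def summarize_transfers(transfers):
    """Map (stop, from_route, to_route) -> (transfer count, median wait in minutes)."""
    waits = defaultdict(list)
    for t in transfers:
        key = (t["stop"], t["from_route"], t["to_route"])
        waits[key].append(t["wait_min"])
    return {key: (len(w), median(w)) for key, w in waits.items()}


# Made-up records for illustration only.
transfers = [
    {"stop": "S1", "from_route": "A", "to_route": "B", "wait_min": 4},
    {"stop": "S1", "from_route": "A", "to_route": "B", "wait_min": 6},
    {"stop": "S1", "from_route": "A", "to_route": "B", "wait_min": 11},
    {"stop": "S2", "from_route": "B", "to_route": "C", "wait_min": 3},
]
summary = summarize_transfers(transfers)
```

The median is used rather than the mean so that a few very long waits do not dominate the summary. The same grouping pattern would apply to origin/destination pairs by keying on boarding and alighting locations instead.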

On the social science side, we are interested in understanding when, where, and how often low income users take transit, and how those travel patterns differ from other users. We are interested in expanding our initial work that examines how employer transit subsidies affect transit use, and how a wide range of built environment variables affect transit use. The transit agencies are also interested in gaining a better understanding of the geographic connections being found in the data. Which portions of the region are interacting the most, and how efficient are the movements between those geographic zones? Does the transit system effectively serve all geographic areas? Are lower income groups being effectively served by transit?

We are also open to analyses you might suggest.

Global Open Sidewalks: Creating a shared open data layer and an OpenStreetMap data standard for sidewalks

Project Leads: Anat Caspi & Nick Bolten – CSE, CEE, Taskar Center

Data Scientist Leads: Vaughn Iverson (primary) and Bryna Hazelton (secondary)

DSSG Fellows: Thomas Disley, Meg Drouhard, Jessica Hamilton & Kaicheng Tan

Project Summary: This project aims to fill a gap we’ve identified in the availability of open data related to pedestrian walkways and the physical environment. While OpenStreetMap (OSM) is one of the most successful crowdsourced data projects, all data related to pedestrian walkways are represented as attributes of automobile roadways. This has significant implications for the ability to use the data for pedestrian routing (wayfinding), as well as for the utility of the data in rural and low-resource environments and in other environments where automobile routes are dissociated from pedestrian ways (for instance, Venice, Italy).

The purpose of this project is (a) to advance the application of automated wayfinding to enhance accessible pedestrian travel for people with disabilities and (b) to improve opportunities for crowdsourcing information about the built environment (e.g., pedestrian and foot paths, bus stops, street furniture, delightful trees, etc.) that would feed into a travel chain that meets the diverse needs of travelers with mobility, vision, hearing and cognitive disabilities and gives them the ability to plan and execute an on-demand trip at any time of day and from any location.

To do this, we will develop a separate data layer contributed to OpenStreetMap and deliver a proposal to the OSM organization covering recommended procedures, schema requirements, and a prototype demonstrating how populating the data layer would improve wayfinding applications for target groups and stakeholders. Our goal is to deliver the proposal at the upcoming ‘State of the Map U.S.’ meeting, which will take place in July. One of the outcomes of this work will be the identification of a mobile app solution that could enhance and improve data collection and data validation for footpaths and pedestrian ways, including data that is transient in nature. Our prototype and proposal will emphasize principles of open and inclusive information exchange.

Project webpage:

CrowdSensing Census: A heterogenous-based tool for estimating poverty

Project Lead: Afra Mashhadi, Bell Labs, Nokia

Data Scientist Leads: Ariel Rokem (primary) and Jake VanderPlas (secondary)

DSSG Fellows: Rachael Dottle, Myeong Lee, Imam Subkhan & Carlos Espino

Project Summary: Household surveys and censuses, periodically conducted by National Statistical Institutes and the like, collect information describing the social and economic well-being of a nation, as well as the relative prosperity of its different regions. Such data is then used by agencies and governments to identify the areas most in need of intervention, for example, in the form of policies and programs that aim to improve the plight of their citizens. Interventions can take many forms, from national or regional policy to local regeneration projects. To provide the most value, socio-economic data needs to be up to date, and it ought to be possible to disaggregate the data at each of these levels of granularity, and in between. However, due to the high cost of data collection, many developing countries conduct such surveys very infrequently and include only a rather small sample of the population, thus failing to accurately capture the current socio-economic status of the country’s population.

Within the remit of ‘Data for Development’ there have been a number of promising recent works. One stream of research has focused on using mobile phone Call Detail Records (CDRs) to estimate the spatial distribution of poverty or socio-economic status [1,2,3]. In another stream, researchers have relied on readily available data such as volunteered geographic information (VGI) to build models that predict poverty levels based on the offering advantages of cities [4]. While both research streams have proposed models that successfully estimate deprivation levels, their results have so far been presented only in isolation rather than in comparison. Furthermore, each faces its own shortcomings. For example, CDRs can provide information from a bigger sample, but because of privacy laws and their commercial value they are often hard to obtain. The latter stream does not suffer from this problem, as it relies on Open Data sources; however, this information could be biased by a lack of participation. That is, one cannot distinguish whether the lack of amenities in an area of a city is due to poverty and a lack of offering advantages or rather to incompleteness of the Open Data.

In this project, we seek to address this shortcoming by proposing an extensive comparison between the two approaches, so as to understand where each falls short. In particular, we aim to address the following research question: “What source of data is the best predictor of deprivation level, and in which situations?” To this end, we will leverage a rich dataset [5] which includes CDR data for two Italian cities. We will merge this dataset with city map data from OpenStreetMap, providing a basis for constructing models based on each dataset as well as the final heterogenous model. To measure the accuracy of the proposed models, we will evaluate the deprivation predictions against census data available from the Italian National Institute for Statistics. The project includes a data modelling component, basic GIS knowledge and large-scale analytics.
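To make the evaluation step concrete, one possible accuracy measure is the rank correlation between each model's predicted deprivation scores and the census values for the same areas. The sketch below uses Spearman's rank correlation as an assumed metric, since the source does not name one. The per-area scores are made-up numbers, and the variable names are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(predicted, observed):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(predicted), average_ranks(observed))


# Made-up per-area scores for illustration only.
census = [0.9, 0.4, 0.7, 0.1, 0.5]
cdr_model = [0.8, 0.3, 0.6, 0.2, 0.5]
osm_model = [0.5, 0.6, 0.9, 0.1, 0.2]
scores = {"cdr": spearman(cdr_model, census), "osm": spearman(osm_model, census)}
```

A rank-based measure only asks whether a model orders areas from least to most deprived in the same way the census does, which suits data sources measured on different scales. Computing the same score separately for subsets of areas would be one way to explore which source predicts better in which situations.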

Project webpage:


[1] Smith, Christopher, Afra Mashhadi, and Licia Capra. “Ubiquitous sensing for mapping poverty in developing countries.” Paper submitted to the Orange D4D Challenge (2013).

[2] Smith-Clarke, Christopher, Afra Mashhadi, and Licia Capra. “Poverty on the cheap: estimating poverty maps using aggregated mobile communication networks.” Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2014.

[3] Frias-Martinez, V., and J. Virseda. “On the relationship between socio-economic factors and cell phone usage.” Proceedings of the Fifth International Conference on Information and Communication Technologies and Development (ICTD ’12). ACM, 2012.

[4] Venerandi, Alessandro, et al. “Measuring urban deprivation from user generated content.” Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing. ACM, 2015.

[5] Barlacchi, G., M. De Nadai, R. Larcher, A. Casella, C. Chitic, G. Torrisi, F. Antonelli, A. Vespignani, A. Pentland, and B. Lepri. “A multi-source dataset of urban life in the city of Milan and the Province of Trentino.” Scientific Data, 2015.