*Watch the DSSG 2016 Final Presentations*

Mining Online Data for Early Identification of Unsafe Food Products


Project Lead: Elaine Nsoesie, Institute for Health Metrics and Evaluation, Department of Global Health, UW

Data Scientist Leads: Valentina Staneva (primary) and Joe Hellerstein (secondary)

DSSG Fellows: Michael Munsell, Kiren Verma, Cynthia Vint & Kara Woo

Project Goals: The Centers for Disease Control and Prevention estimates that each year in the United States 48 million people experience foodborne illness, 128,000 are hospitalized, and 3,000 die. The estimated economic cost of foodborne illness is more than $15.5 billion annually. Early identification of unsafe food products would limit the occurrence of large foodborne disease outbreaks, thereby preventing illness and deaths and limiting the health and economic impact on households, businesses, and the food industry.

In this study, we aimed to investigate whether text mining of food product reviews can aid in the identification and ranking of food safety issues. Specifically, we focused on assessing whether text mining of the millions of consumer reviews posted online can be useful for early identification of unsafe food products that have the potential to cause foodborne disease outbreaks. The two aims of this project were: (1) mine and integrate a large corpus of data posted online to understand trends and features in unsafe food product reports, and (2) develop a machine-learning/informatics approach for early identification of unsafe food products. The data sources considered for this project include food product recalls from the FDA and USDA, and online product reviews.

Project outcomes: We created an exploratory tool for viewing reviews of recalled products. We used Amazon reviews of Grocery and Gourmet Food products and enforcement reports from the Food and Drug Administration. The reviews in this tool provide some support for the idea that product reviews can be a fruitful data source for identifying unsafe foods.

There is still a wide margin for improvement, and we need custom-designed algorithms to extract the right features. However, initial exploration of the text showed that features indicating the need for a recall do exist; it is a matter of selecting features that give weight to the most important aspects of the text.
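As a rough sketch of what giving weight to the most important aspects of review text might look like, here is a minimal keyword-weighting scorer. The lexicon and weights are purely illustrative assumptions, not the project's actual feature set or algorithm:

```python
import re
from collections import Counter

# Hypothetical hazard lexicon: illustrative terms and weights only,
# not the features actually selected by the project.
HAZARD_TERMS = {"sick": 3.0, "vomit": 3.0, "recall": 3.0,
                "moldy": 2.0, "rancid": 2.0, "expired": 1.5, "smell": 1.0}

def hazard_score(review):
    """Sum lexicon weights over the tokens of a review."""
    counts = Counter(re.findall(r"[a-z]+", review.lower()))
    return sum(w * counts[t] for t, w in HAZARD_TERMS.items())

reviews = ["Great taste, arrived quickly.",
           "The cheese was moldy and rancid; my kids got sick."]
print([hazard_score(r) for r in reviews])  # [0.0, 7.0]
```

A real system would learn such weights from labeled recall data rather than hand-code them, but the principle, upweighting hazard-related text over ordinary complaints, is the same.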

We performed exploratory analysis of other aspects of the data in hopes of incorporating them into a better classification model. We investigated ways to use the product categories as a feature in order to account for product-specific noise. We also examined the corresponding FDA data and derived useful topics from the "Reason for Recall" text field. We have yet to determine whether these are worthwhile features to include. Stay tuned!

Stakeholders: Stakeholders include the general public, local and national public health entities, the Food and Drug Administration (FDA), and the United States Department of Agriculture (USDA).

Project webpage: https://uwescience.github.io/DSSG2016-UnsafeFoods/

Use of ORCA data for improved transit system planning and operation

Map of average Seattle ORCA transfers per weekday


Project Leads: Mark Hallenbeck & Anat Caspi – CEE (Civil & Environmental Engineering), Taskar Center

Data Scientist Leads: Bernease Herman (primary) and Anthony Arendt (secondary)

DSSG Fellows: Carolina Johnson, Victoria Sass, Yiqin Shen & Sean Wang

Project Goals: Seven regional transportation agencies use a common electronic fare payment system, called ORCA – One Regional Card for All. When ORCA was initially conceived and adopted (it has been in use since June 2009), the regional expectation was that one advantage of moving from a simple visual card (a paper monthly pass) to electronic media was that the resulting data would provide travel behavior information that could be used to improve regional transportation system planning and decision making. To date, that secondary purpose for ORCA data has not been routinely realized.

The UW has been granted access to nine weeks of ORCA data. Those nine weeks of data correspond to ~21,000,000 transit boardings, or roughly 15,500,000 transit trips, with ~5,500,000 transfers. These ORCA transaction records have already been linked to automatic vehicle location (AVL) data to determine where those boardings took place. In addition, for about half of those trips we have estimated where the traveler exited the bus and, if they transferred, how long that transfer took.

We – and the transit and planning agencies of the region – are interested in a variety of computer science activities, social science analyses, and transportation analyses. For the analyses within each of these fields we have to be extremely conscious of the privacy of individuals who use ORCA cards, as well as the rights of the employers that often subsidize those cards.

For CS analyses, we were interested in better ways to process, store, and handle the very large data sets involved. For example, to estimate boarding and alighting locations we have to search multi-gigabyte AVL files for specific bus locations at specific times and dates, often without being able to process those lookups in a time-sequenced fashion, and often switching between the AVL files of different transit agencies for a single trip made by one individual.
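One common way to avoid rescanning a large AVL file for every ORCA tap is to keep records sorted by timestamp and binary-search them. A minimal sketch, with a tiny hypothetical in-memory record layout standing in for the real multi-gigabyte, per-agency files:

```python
import bisect
from datetime import datetime

# Hypothetical AVL records: (timestamp, vehicle_id, lat, lon).
# In practice these come from multi-gigabyte per-agency files.
avl = sorted([
    (datetime(2016, 7, 1, 8, 0), "bus_41", 47.61, -122.33),
    (datetime(2016, 7, 1, 8, 5), "bus_41", 47.62, -122.32),
    (datetime(2016, 7, 1, 8, 10), "bus_41", 47.63, -122.31),
])
times = [rec[0] for rec in avl]  # sorted key column for bisect

def location_at(tap_time):
    """Return the most recent AVL fix at or before an ORCA tap time."""
    i = bisect.bisect_right(times, tap_time) - 1
    return avl[i] if i >= 0 else None

rec = location_at(datetime(2016, 7, 1, 8, 7))
print(rec[2], rec[3])  # 47.62 -122.32
```

The real pipeline must additionally filter by vehicle and agency before the time search, but an index over the time column is what turns a full-file scan into a logarithmic lookup.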

On the transportation side, we were interested in turning these data sets into decision support tools. We need better ways to store, visualize, summarize, and query these data. For example, we would like to show graphically where large numbers of transfers take place, and then allow users to drill down into those locations to determine which routes transfer to which routes, how long it takes riders to transfer, and how far they walk to make those transfers. Similarly, we have ~7,000,000 origin/destination pairs (where people board and alight from transit vehicles). We needed ways to summarize and display where and when these trips take place.

On the social science side, we were interested in understanding when, where, and how often low-income users take transit, and how those travel patterns differ from other users. We were interested in expanding our initial work that examines how employer transit subsidies affect transit use. The transit agencies are also interested in gaining a better understanding of the geographic connections being found in the data. Which portions of the region are interacting the most, and how efficient are the movements between those geographic zones? Does the transit system effectively serve all geographic areas? Are lower income groups being effectively served by transit?

Project outcomes: We started by identifying and characterizing biases and problems with the ORCA dataset. Since ORCA taps are not geo-located, we had to use other sensor information to locate where people board, and to infer departure locations from subsequent travel. Also, not all bus riders use ORCA. We determined the geographic bias for cash users in our dataset, which we can then associate with socio-demographic characteristics.

With a more complete understanding of the ORCA data, we conducted substantive analyses. This process was extremely challenging and involved complex database joins and unique journey id constructions that took the better part of the summer.
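The journey-id construction can be illustrated with a simplified stand-in: chain successive boardings of the same card into one journey whenever they fall within a transfer window. The 30-minute window, record layout, and route names below are assumptions for illustration, not the project's actual parameters:

```python
from datetime import datetime, timedelta

# Hypothetical boarding records for one card: (card_id, board_time, route).
TRANSFER_WINDOW = timedelta(minutes=30)

def assign_journeys(boardings):
    """Chain boardings within the transfer window into shared journey ids."""
    boardings = sorted(boardings, key=lambda b: b[1])
    journeys, journey_id, last_time = [], 0, None
    for card, t, route in boardings:
        if last_time is not None and t - last_time > TRANSFER_WINDOW:
            journey_id += 1  # gap too long: start a new journey
        journeys.append((card, t, route, journey_id))
        last_time = t
    return journeys

taps = [("card1", datetime(2016, 7, 1, 8, 0), "Route 44"),
        ("card1", datetime(2016, 7, 1, 8, 20), "Route 70"),   # transfer
        ("card1", datetime(2016, 7, 1, 17, 30), "Route 70")]  # evening trip
print([j[3] for j in assign_journeys(taps)])  # [0, 0, 1]
```

In the actual analysis this logic runs as database joins over millions of cards, and per-card grouping plus a window comparison is what makes the construction expensive.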

From these explorations one of the major projects we completed was the creation of a suite of applications in one integrated dashboard to shed light on the richness of these data and their potential for discovery. While each application by itself offers a deeper understanding of the data and powerful visualizations of its utility, we are really only scratching the surface of what can be done with this information. In essence, this project reaches beyond transportation or data science alone; at a larger scale, we are laying the foundation of what has been termed a “smart city” approach to transit planning in the Puget Sound region.

Global Open Sidewalks: Creating a shared open data layer and an OpenStreetMap data standard for sidewalks

Project Leads: Anat Caspi & Nick Bolten – CSE, CEE, Taskar Center

Data Scientist Leads: Vaughn Iverson (primary) and Bryna Hazelton (secondary)

DSSG Fellows: Thomas Disley, Meg Drouhard, Jessica Hamilton & Kaicheng Tan

Project Goals: This project aimed to fill a gap we’ve identified in the availability of open data related to pedestrian walkways and the physical environment. While OpenStreetMap (OSM) is one of the most successful crowdsourced data projects out there, all data related to pedestrian walkways are represented as attributes of automobile roadways. This has significant implications for the ability to use the data for pedestrian routing (wayfinding), as well as for the utility of the data in rural and low-resource environments, and in other environments where automobile routes are dissociated from pedestrian ways (for instance, Venice, Italy).

The purpose of this project is (a) to advance the application of automated wayfinding to enhance accessible pedestrian travel for people with disabilities, and (b) to improve opportunities for crowdsourcing information about the built environment (e.g., pedestrian and foot paths, bus stops, street furniture, delightful trees, etc.) that would feed into a travel chain that meets the diverse needs of travelers with mobility, vision, hearing, and cognitive disabilities and provides them the ability to plan and execute an on-demand trip at any time of day and from any location.

Project Outcomes: We engaged in the development of a separate data layer contributed to OpenStreetMap (OSM) and delivered a proposal to the OSM organization covering recommended procedures, schema requirements, and a prototype demonstrating how populating the data layer would improve wayfinding applications for target groups and stakeholders. We delivered the proposal at the State of the Map US conference in July 2016. One outcome of this work was the identification of a mobile app solution that could enhance and improve data collection and validation for footpaths and pedestrian ways, including data that is transient in nature. Our prototype and proposal emphasize principles of open and inclusive information exchange.

Project webpage: https://opensidewalks.com/

CrowdSensing Census: A heterogeneous-based tool for estimating poverty

Project Lead: Afra Mashhadi, Bell Labs, Nokia

Data Scientist Leads: Ariel Rokem (primary) and Jake VanderPlas (secondary)

DSSG Fellows: Rachael Dottle, Myeong Lee, Imam Subkhan & Carlos Espino

Project Goals: Household surveys and censuses, periodically conducted by National Statistical Institutes and the like, collect information describing the social and economic well-being of a nation, as well as the relative prosperity of its different regions. Such data are then used by agencies and governments to identify the areas most in need of intervention, for example in the form of policies and programs that aim to improve the plight of their citizens. Interventions can take many forms, from national or regional policy to local regeneration projects. To provide the most value, socio-economic data need to be up to date, and it ought to be possible to disaggregate the data at each of these levels of granularity, and in between. However, due to the high cost of the data collection process, many developing countries conduct such surveys very infrequently and include only a rather small sample of the population, thus failing to accurately capture the current socio-economic status of the country’s population.

Within the remit of ‘Data for Development’ there have been a number of promising recent works. One stream of research has focused on investigating the use of mobile phone Call Detail Records (CDRs) to estimate the spatial distribution of poverty or socio-economic status [1,2,3]. In another stream, researchers have relied on readily available data, such as volunteered geographic information (VGI), to build models that can predict poverty levels based on the offering advantages of cities [4]. While both research streams have successfully proposed models to estimate deprivation levels, their results have thus far only been presented in isolation rather than in comparison. Furthermore, each faces shortcomings. The former source can provide information about a bigger sample; however, due to privacy laws and their commercial value, CDRs are often hard to get hold of. The latter stream does not suffer from this problem, as it relies on Open Data sources, but this information can be biased by uneven participation: one cannot distinguish whether the lack of amenities recorded in an area of a city reflects poverty and a lack of offering advantages, or simply incompleteness of the Open Data.

Project outcomes: In this project, we aimed to address the question: which source of data is the best predictor of deprivation level, and in which situations? To this end, we leveraged a rich dataset [5] which included CDR data for two Italian cities. We merged this dataset with city map data from OpenStreetMap, providing a basis for constructing models based on each dataset as well as a final heterogeneous model. To measure the accuracy of the proposed models, we evaluated the deprivation predictions against census data available from the Italian National Institute of Statistics.
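One simple way to score such models against census ground truth is a zone-by-zone correlation between predicted and observed deprivation. The sketch below uses Pearson correlation with made-up numbers; the values and zone counts are illustrative, not the project's results:

```python
# Compare a model's predicted deprivation against census values, zone by zone,
# using Pearson correlation (a stand-in for the project's actual evaluation).
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

census = [0.2, 0.5, 0.9, 0.4]       # hypothetical census deprivation per zone
predicted = [0.25, 0.45, 0.8, 0.5]  # hypothetical model output per zone
print(round(pearson(census, predicted), 3))
```

The same scoring can be run separately for the CDR-only, OSM-only, and heterogeneous models, which is what makes a head-to-head comparison of data sources possible.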

For both Milano and Mexico City, urban point of interest (POI) features were extracted from OSM. These features were aggregated into categories. The results indicate that OSM amenities may prove to be useful predictors of deprivation in Milano. Additionally, CDR data can complement OSM data in Milano. For Mexico City, OSM data predict poverty well, especially street centrality measures.
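The POI aggregation step can be sketched as a mapping from raw OSM amenity tags to coarser category counts per zone. The tags, categories, and zone ids below are hypothetical examples, not the project's actual taxonomy:

```python
from collections import defaultdict

# Hypothetical mapping from OSM amenity tags to coarser feature categories.
CATEGORY = {"restaurant": "food", "cafe": "food", "school": "education",
            "university": "education", "bank": "finance"}

# Hypothetical POI extract: (zone_id, amenity_tag) pairs.
pois = [(1, "restaurant"), (1, "cafe"), (1, "bank"), (2, "school")]

def category_counts(pois):
    """Aggregate raw POI tags into per-zone category counts."""
    counts = defaultdict(lambda: defaultdict(int))
    for zone, tag in pois:
        counts[zone][CATEGORY.get(tag, "other")] += 1
    return {z: dict(c) for z, c in counts.items()}

print(category_counts(pois))
# {1: {'food': 2, 'finance': 1}, 2: {'education': 1}}
```

Each zone's category-count vector can then serve as a feature row in the deprivation model, alongside measures such as street network centrality.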

To allow our stakeholders access to our model and data exploration, we built a dashboard that visualizes our extracted features and allows users to explore our data layers. As of now, the dashboard is limited to visualizing our two case-study cities, Milano and Mexico City. A future direction of our project would focus on improving the dashboard so that users could upload data for any given city and extract predicted deprivation values, as well as other features such as street network centrality. The results of this project may be applied to further research into means of estimating the spatial distribution of poverty from big data, such as social media data, infrastructure-sensed data, and crowd-sourced data.

Project webpage: https://github.com/uwescience/DSSG2016-SensingTheCensus


[1] Smith, Christopher, Afra Mashhadi, and Licia Capra. “Ubiquitous sensing for mapping poverty in developing countries.” Paper submitted to the Orange D4D Challenge (2013).

[2] Smith-Clarke, Christopher, Afra Mashhadi, and Licia Capra. “Poverty on the cheap: estimating poverty maps using aggregated mobile communication networks.” Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2014.

[3] Frias-Martinez, V., and J. Virseda. “On the relationship between socio-economic factors and cell phone usage.” Proceedings of the Fifth International Conference on Information and Communication Technologies and Development (ICTD ’12). ACM, 2012.

[4] Venerandi, Alessandro, et al. “Measuring urban deprivation from user generated content.” Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing. ACM, 2015.

[5] Barlacchi, G., M. De Nadai, R. Larcher, A. Casella, C. Chitic, G. Torrisi, F. Antonelli, A. Vespignani, A. Pentland, and B. Lepri. “A multi-source dataset of urban life in the city of Milan and the Province of Trentino.” Scientific Data, 2015.